Most interview prep tools still feel like chatbots. They are useful for generating questions, but they do not recreate the pressure, pacing, and interruptions of a real interview, or give feedback on delivery, which is what makes real interviews difficult.
That was the idea behind Interview Companion Live: build a voice-first interview agent that can hear you, see you, speak back in real time, and coach you on both what you say and how you say it.
I built it for the Gemini Live Agent Challenge, but the interesting part was not the challenge itself. The interesting part was figuring out how to make a live multimodal system feel less like a demo and more like an actual interview room.
The product goal
I wanted the experience to have four properties that text-based interview tools usually miss:
- voice-first interaction so answers feel spoken, not composed
- interruptible conversation so the exchange feels natural rather than turn-based
- job-specific grounding so the interviewer asks about the actual role, not generic trivia
- coaching during and after the session so the product is useful beyond novelty
That combination pushed the architecture in a very different direction from a normal chatbot.
Why a live agent was the right fit
For interview practice, latency and timing matter almost as much as content.
If the system pauses too long, the conversation feels fake. If it cannot handle interruptions, it feels scripted. If it only evaluates the transcript after the fact, it misses the delivery signals that people actually struggle with: pacing, hesitation, filler words, long pauses, or whether an answer had a clear STAR structure.
The Gemini Live API made this possible because it supports a genuinely multimodal loop instead of a stitched-together pipeline of speech-to-text, text prompting, and text-to-speech.
In practice that meant the app could:
- listen through the microphone
- observe periodic webcam frames
- respond with streamed audio
- keep the conversation grounded in a job description and optional resume
System architecture
The final system ended up as a fairly simple three-layer design:
```mermaid
flowchart TB
  subgraph Browser["Browser Client"]
    UI[Setup wizard + interview UI]
    MIC[Microphone capture]
    CAM[Webcam capture]
    PLAYER[Streamed AI audio playback]
    COACH[Live signals + coach panel]
  end
  subgraph Backend["Bun + TypeScript Backend"]
    API[/REST session APIs/]
    WS[/WebSocket media bridge/]
    SESSION[Session lifecycle + reconnect]
    REPORT[Report generation]
  end
  subgraph Vertex["Google Cloud"]
    LIVE[Gemini Live API\ngemini-live-2.5-flash-native-audio]
    TEXT[Report model\ngemini-3-flash]
    RUN[Cloud Run]
  end
  UI --> API
  MIC --> WS
  CAM --> WS
  WS --> LIVE
  LIVE --> WS
  WS --> PLAYER
  API --> SESSION
  SESSION --> REPORT
  REPORT --> TEXT
  Backend --> RUN
  COACH --> UI
```

The browser handles capture and playback. The backend does the orchestration work: create sessions, validate inputs, bridge media over WebSocket, recover dropped sessions, and generate the final report. Cloud Run keeps the deployment simple and scales the service up only when someone is actively using it.
The real-time interaction loop
The media path is where most of the engineering complexity lives.
In the browser, microphone input is captured through AudioContext, converted to PCM16, base64-encoded, and streamed over WebSocket. Webcam input is sampled into JPEG frames and sent on the same live session. On the way back, AI audio is streamed in chunks and queued on the client so playback stays smooth even when the network is slightly uneven.
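As a sketch of that capture-side conversion (the helper names `floatToPcm16` and `pcm16ToBase64` are my own; the real app wires this into the `AudioContext` capture path):

```typescript
/** Convert Float32 samples in [-1, 1] to PCM16, clamping so clipping does not wrap. */
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative range is one step wider in two's complement, hence the asymmetry.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

/** Base64-encode a PCM16 buffer for transport in a WebSocket message. */
function pcm16ToBase64(pcm: Int16Array): string {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  // `btoa` exists in the browser and in Bun; Buffer covers plain Node.
  return typeof btoa === "function"
    ? btoa(binary)
    : Buffer.from(bytes).toString("base64");
}
```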
```mermaid
sequenceDiagram
  autonumber
  participant U as Candidate
  participant B as Browser
  participant S as Bun Server
  participant G as Gemini Live API
  U->>B: Speak answer + webcam video
  B->>S: PCM16 audio + sampled JPEG frames
  S->>G: Forward multimodal stream
  G-->>S: Streamed audio response + transcript events
  S-->>B: Audio chunks + session status
  B-->>U: Low-latency voice playback
  Note over B,S: WebSocket stays open for the live session
  Note over B,G: Browser keeps transcript, live metrics, and UI state in sync
```

This is the main reason I liked the Live API approach. The architecture stays focused on a single live conversational loop instead of constantly translating between separate subsystems.
In my testing, the audio path was usually in the ~200-400 ms range end to end, which is the difference between something that feels usable and something that feels like talking through a wall.
Grounding the interviewer
Raw real-time conversation is not enough on its own. A generic interviewer is interesting for five minutes and then becomes repetitive.
So the app starts with structured grounding:
- required job description
- optional resume/background
- difficulty mode: `warmup`, `standard`, or `grilling`
- focus areas for the interview
- language selection
- interviewer persona selection
That context gets folded into the system prompt before the live session starts, so the interviewer can ask questions that are actually about the target role rather than drifting into generic interview filler.
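A minimal sketch of that folding step, assuming illustrative names (`SessionSetup`, `buildInterviewerPrompt`) rather than the app's real API:

```typescript
type Difficulty = "warmup" | "standard" | "grilling";

interface SessionSetup {
  jobDescription: string; // required
  resume?: string;        // optional background
  difficulty: Difficulty;
  focusAreas: string[];
  language: string;
  persona: string;
}

/** Fold the structured setup into the system prompt for the live session. */
function buildInterviewerPrompt(s: SessionSetup): string {
  const parts = [
    `You are a ${s.persona} conducting a ${s.difficulty} interview in ${s.language}.`,
    `Ground every question in this job description:\n${s.jobDescription}`,
  ];
  if (s.resume) parts.push(`Candidate background:\n${s.resume}`);
  if (s.focusAreas.length > 0) {
    parts.push(`Prioritize these focus areas: ${s.focusAreas.join(", ")}.`);
  }
  return parts.join("\n\n");
}
```

The useful property is that the grounding is assembled once, before the session opens, so every question the live model asks is already scoped to the role.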
I also added a job-post URL import flow and some role templates to reduce setup friction. That turned out to matter a lot. If the setup is annoying, users never get to the live part.
The feature I wanted: Super Practice Live Coach
One of the best features came from asking a simple question: what if the product did not only simulate an interview, but also helped you answer the current question and learn while you were still in the session?
That became Super Practice Live Coach.
The coach panel takes the latest interviewer question and turns it into a compact guidance layer with:
- the real intent behind the question
- a suggested answer approach
- key points to hit
- common pitfalls to avoid
- an example of a stronger answer
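As a sketch, the payload behind that panel could look like the following; the field names are my guesses at a reasonable shape, not the app's actual schema:

```typescript
interface CoachGuidance {
  questionIntent: string; // what the interviewer is really probing
  approach: string;       // suggested answer strategy
  keyPoints: string[];    // points a strong answer should hit
  pitfalls: string[];     // common ways the answer goes wrong
  exampleAnswer: string;  // a stronger sample answer
}

/** Flatten guidance into the compact text layout the panel displays. */
function renderGuidance(g: CoachGuidance): string {
  return [
    `Intent: ${g.questionIntent}`,
    `Approach: ${g.approach}`,
    `Key points: ${g.keyPoints.join("; ")}`,
    `Pitfalls: ${g.pitfalls.join("; ")}`,
    `Example: ${g.exampleAnswer}`,
  ].join("\n");
}
```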
The following panel is a good example of the direction I wanted: structured coaching that is useful in the exact moment the user is thinking through an answer.
A simplified view of the Super Practice Live Coach layout: each interviewer question fans out into a focused coaching surface instead of a generic wall of feedback.
```mermaid
flowchart TB
  Q[Latest interviewer question] --> F[Question focus]
  Q --> A[Quick approach]
  Q --> K[Key points to hit]
  Q --> P[Common pitfalls]
  Q --> E[Example strong answer]
  F --> C[Candidate answers with better structure]
  A --> C
  K --> C
  P --> C
  E --> C
```

That feature also changed the product from a passive evaluator into something closer to a real practice environment. You are not only judged after the interview. You are actively coached through it.
Live coaching plus deterministic metrics
I did not want to rely only on the model's output for feedback, so I combined the live model with deterministic client-side metrics.
During the interview, the app tracks signals like:
- speaking pace
- response latency after a question
- filler words
- hedging language
- talk ratio between candidate and interviewer
- STAR completeness
- long pauses
- body language
- topic coverage against the job description
- an audio-confidence score derived from vocal steadiness
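Two of the cheaper metrics can be sketched as pure functions over the transcript; the function names and filler list here are illustrative simplifications, not the app's exact implementation:

```typescript
const FILLERS = new Set(["um", "uh", "like", "basically", "actually"]);

/** Fraction of words in an answer that are filler words. */
function fillerRatio(answer: string): number {
  const words = answer.toLowerCase().match(/[a-z']+/g) ?? [];
  if (words.length === 0) return 0;
  const fillers = words.filter((w) => FILLERS.has(w)).length;
  return fillers / words.length;
}

/** Candidate talk ratio from per-speaker word counts. */
function talkRatio(candidateWords: number, interviewerWords: number): number {
  const total = candidateWords + interviewerWords;
  return total === 0 ? 0 : candidateWords / total;
}
```

Because these are deterministic, they update on every transcript event with no extra model calls, which is what makes them viable as live signals.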
That split worked well architecturally too. The model handles interpretation and conversational intelligence. The client handles simpler metrics that are cheap, fast, and predictable.
What was hard
The hard part was not prompting. The hard part was orchestration.
1. Smooth audio is harder than getting audio at all
Streaming audio back from the model is easy in principle. Making it sound smooth is different.
The browser receives AI audio as chunks, but the chunks do not always arrive at perfectly even intervals. I had to add sequencing plus a small client-side jitter buffer so playback would not stutter or underrun every time the network jittered.
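A simplified version of that buffer, assuming each chunk carries a sequence number (my own framing, not the Live API's wire format): chunks are held until the next expected sequence is available, so playback always pops them in order even when the network delivers them unevenly.

```typescript
interface AudioChunk {
  seq: number;     // monotonically increasing sequence number
  pcm: Int16Array; // decoded PCM16 payload
}

class JitterBuffer {
  private pending = new Map<number, AudioChunk>();
  private nextSeq = 0;

  push(chunk: AudioChunk): void {
    // Chunks older than nextSeq arrived too late and are dropped.
    if (chunk.seq >= this.nextSeq) this.pending.set(chunk.seq, chunk);
  }

  /** Pop every chunk that is now in order; the playback loop drains this each tick. */
  popReady(): AudioChunk[] {
    const ready: AudioChunk[] = [];
    while (this.pending.has(this.nextSeq)) {
      ready.push(this.pending.get(this.nextSeq)!);
      this.pending.delete(this.nextSeq);
      this.nextSeq++;
    }
    return ready;
  }
}
```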
That kind of work is invisible when it goes well, which usually means it is the real product work.
2. Interruption handling needs state discipline
For a live interview, you want users to speak naturally, including over the AI when needed. That sounds simple, but it means the client, backend, and model session all need a consistent understanding of who is currently speaking and which turn is active.
I ended up spending a lot more time on turn state, buffering, and transcript synchronization.
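The core of that state discipline can be sketched as a tiny state machine shared by client and server; the state and event names are my own simplification of the real turn logic:

```typescript
type Turn = "idle" | "ai_speaking" | "user_speaking";
type TurnEvent =
  | "ai_audio_start"
  | "ai_audio_end"
  | "user_voice_start"
  | "user_voice_end";

function nextTurn(current: Turn, event: TurnEvent): Turn {
  switch (event) {
    case "ai_audio_start":
      // The AI only takes the floor when the user is not already talking.
      return current === "user_speaking" ? current : "ai_speaking";
    case "ai_audio_end":
      return current === "ai_speaking" ? "idle" : current;
    case "user_voice_start":
      // The user can barge in at any time; AI playback gets cut off.
      return "user_speaking";
    case "user_voice_end":
      return current === "user_speaking" ? "idle" : current;
  }
}
```

Keeping the transition function pure makes it easy to run the same logic on both ends of the WebSocket and compare results when they disagree.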
3. Reconnect and resume matter more than you think
Live apps break in live ways: laptops sleep, Wi-Fi drops, tabs reload, permissions fail.
So I added server-issued session tickets, reconnect/resume logic, abandoned-session cleanup, and client-side transcript snapshots. Without that, even a short connection hiccup makes the whole experience feel fragile.
4. Safety still matters in a hackathon project
The backend validates payload sizes, rate-limits expensive endpoints, and limits session scope. The report UI renders generated content through safe DOM creation.
I also added report-model fallback behavior so if the preferred report model is unavailable, the app degrades into a transcript-based fallback instead of simply failing at the end of the session.
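The fallback shape is simple; in this sketch, `preferred` stands in for the real report-model call, and the transcript-based fallback is a minimal placeholder for the app's actual degraded report:

```typescript
type ReportGenerator = (transcript: string) => Promise<string>;

async function generateReport(
  transcript: string,
  preferred: ReportGenerator,
): Promise<{ report: string; degraded: boolean }> {
  try {
    return { report: await preferred(transcript), degraded: false };
  } catch {
    // Degrade to a transcript-based report so the user still gets
    // something useful at the end of the session instead of an error.
    return {
      report: `Model report unavailable. Session transcript:\n${transcript}`,
      degraded: true,
    };
  }
}
```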
Deployment choices
I didn’t want to overengineer this, so I kept the frontend intentionally simple: plain HTML, vanilla JavaScript, and a zero-build setup. The backend is Bun + TypeScript and deploys cleanly to Cloud Run with a small Docker image.
That gave me a nice tradeoff:
- fast local iteration
- simple Cloud Run deployment
- automatic scaling from zero when idle
- no unnecessary frontend toolchain complexity for a real-time prototype
```mermaid
flowchart LR
  DEV[Local Bun dev server] --> TEST[Realtime browser testing]
  TEST --> IMAGE[Container build]
  IMAGE --> CR[Cloud Run deployment]
  CR --> LIVEAPP[Public live agent app]
```

I like this setup for agent prototypes. The infrastructure stays simple, which lets the complexity live where it actually belongs: the interaction loop.
What I learned
The biggest lesson from this build is that multimodal UX quality is mostly a systems problem.
Prompting matters, but once you move into real-time audio and video, the product quality is dominated by:
- latency
- buffering
- session state
- grounding quality
- recovery from dropped connections
- how feedback is surfaced in the UI
I also came away more convinced that the best Live Agent products are not just voice wrappers around a model. They need a clear environment, a specific job to do, and a feedback loop that improves the user while the interaction is happening.
Interview prep happens to be a very good fit for that. But the same pattern would work in other high-stakes coaching workflows too. Working on this challenge also surfaced many other use cases for live agents that I want to explore in the near future. Language learning, for example, could become a real-time interactive experience with immediate feedback.
Closing
Interview Companion Live started as a challenge project for the Gemini Live Agent Challenge, but it turned into a much better exploration of what a practical live agent can look like.
For me, the interesting result is not just that the AI can see, hear, and speak. It is that once the system is grounded, interruption-aware, and paired with useful coaching, it starts to feel like a real product instead of a flashy demo. I would definitely use it for my own interview prep, and I plan to develop it further into a coaching and interviewing assistant that helps more people prepare for interviews.
To extend this further, I would probably push in a few directions:
- stronger replay and review tools for previous sessions
- deeper coaching around role-specific expectations
- better visual summaries of progress across multiple practice interviews
- scraping online job postings and interview sites to provide real-world context
This post was written as part of the Gemini Live Agent Challenge.