
AuTuber - AI VTuber Automation

Let your avatar react, emote, and manage the stream while you focus on creating.

About This Project

Inspiration

VTubing is expressive, but running a VTuber stream is still extremely manual. A creator has to talk, perform, manage OBS, trigger avatar expressions, monitor camera/audio, and stay in character all at the same time. The best reactions are spontaneous, but if the creator has to stop and remember which hotkey makes their avatar shocked, excited, sad, or away, the moment is already gone. We wanted to build an AI stagehand for VTubers: not a chatbot, and not a replacement for the creator, but a local assistant that watches the live moment and helps the stream react naturally. That became AuTuber.

What It Does

AuTuber is a local Electron desktop agent that observes live streaming context, asks a model what cue is happening, validates the model result, and safely executes approved actions through VTube Studio and OBS. A creator launches AuTuber alongside OBS and VTube Studio. AuTuber connects to local services, reads the current avatar hotkeys, monitors camera/audio/stream context, detects cues, maps those cues to safe local actions, and logs what happened. Current features include:

  • Dashboard for OBS, VTube Studio, capture, service activation, and model monitor status.
  • VTube Studio connection, authentication, hotkey loading, and manual hotkey testing.
  • VTS Catalog manager for cue labels, hotkey classification, manual overrides, and deactivation behavior.
  • Live cue detection from camera/audio/stream context.
  • Local cue-label resolver that maps model intent to the current safe VTube Studio hotkey.
  • OBS status awareness and configurable AFK overlay handling.
  • Action validation with cooldowns, allowlists, blocked actions, and safety checks.
  • Provider routing for LM Studio, OpenAI-compatible endpoints, OpenRouter, self-hosted providers, and mock development paths.
  • Experimental secondary model path for deeper video/audio analysis and transcription.

The core loop is:

Watch live context -> detect cue -> validate locally -> trigger safe stream action
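
Expressed as a minimal TypeScript sketch (all names here are invented for illustration; the real modules are described below):

```ts
// Minimal sketch of the core loop; every identifier is hypothetical.
type Cue = { label: string; confidence: number };

async function tick(
  observe: () => Promise<string>,            // latest frame + stream context
  detectCue: (ctx: string) => Promise<Cue>,  // model call: intent only
  validate: (cue: Cue) => boolean,           // local safety policy
  execute: (cue: Cue) => Promise<void>,      // VTube Studio / OBS action
): Promise<void> {
  const context = await observe();
  const cue = await detectCue(context);
  if (validate(cue)) {
    await execute(cue); // only validated, approved actions reach here
  }
}
```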

How We Built It

We built AuTuber as a TypeScript Electron app with a strict main-process and renderer-process boundary.

The Electron main process owns privileged work: OBS and VTube Studio clients, model provider calls, settings, secrets, capture orchestration, validation, execution, cooldowns, service activation, and structured logs.

The React renderer owns the UI: Dashboard, VTS Catalog, manual controls, status panels, and operator feedback. It talks to the main process only through a typed preload IPC bridge.

Shared contracts are enforced with TypeScript and Zod schemas. The automation pipeline is:

ObservationBuilder -> PromptBuilder -> ModelRouter -> ActionPlanParser -> ActionValidator -> ActionExecutor

The most important design decision is that the model never directly controls OBS or VTube Studio. The model detects intent. AuTuber owns local state, safety policy, cue mapping, cooldowns, and execution.
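
As a rough illustration of that boundary, a Zod contract for the parser stage might look like the sketch below. The schema shape and cue labels are invented for the example, not AuTuber's actual schema:

```ts
import { z } from "zod";

// Hypothetical action-plan contract: the model may only name a cue
// label with a confidence, never a raw hotkey ID or OBS command.
const ActionPlanSchema = z.object({
  cueLabel: z.enum(["shocked", "excited", "sad", "vacant", "none"]),
  confidence: z.number().min(0).max(1),
});

type ActionPlan = z.infer<typeof ActionPlanSchema>;

// ActionPlanParser stage: reject anything that does not match the
// contract before it can reach validation or execution.
function parseActionPlan(raw: unknown): ActionPlan | null {
  const result = ActionPlanSchema.safeParse(raw);
  return result.success ? result.data : null;
}
```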

For VTube Studio, we learned that raw hotkey IDs are too fragile because every avatar model can have different hotkeys. AuTuber loads the current model’s hotkeys, builds a cue catalog, lets the operator override mappings, and only allows safe catalog entries during live automation.
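
The resolver idea, sketched with invented names:

```ts
// Cue catalog: rebuilt whenever the avatar model changes, so cue
// labels stay stable while the underlying hotkey IDs do not.
interface CatalogEntry {
  hotkeyId: string; // current avatar model's VTube Studio hotkey ID
  safe: boolean;    // operator-approved for live automation
}

// Resolve a model-detected cue label to at most one safe hotkey.
function resolveCue(
  catalog: Map<string, CatalogEntry>,
  cueLabel: string,
): string | null {
  const entry = catalog.get(cueLabel);
  if (!entry || !entry.safe) return null; // unknown or unsafe: do nothing
  return entry.hotkeyId;
}
```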

For OBS, we kept automation conservative. AFK/vacant handling is a deterministic local workflow: the model can emit a vacant cue, but AuTuber owns the configured scene/source, debounce timing, validation, and diagnostics.
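
The debounce portion of that workflow might look like this (class name and timing values are invented):

```ts
// The model can emit "vacant" repeatedly, but the AFK overlay only
// toggles after the cue has been stable for a configured window.
class VacantDebouncer {
  private since: number | null = null;

  constructor(private readonly holdMs: number) {}

  // Returns true once the vacant cue has persisted for holdMs.
  update(isVacant: boolean, now: number = Date.now()): boolean {
    if (!isVacant) {
      this.since = null;
      return false;
    }
    this.since ??= now;
    return now - this.since >= this.holdMs;
  }
}
```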

Challenges We Ran Into

The hardest part was that AuTuber is not just an AI app. It is a real-time desktop automation system involving Electron, local capture, OBS, VTube Studio, model providers, IPC, validation, and live UI state.

Media capture was tricky. Camera, screen, and microphone capture behave differently across platforms and permission systems. We ran into screen-capture backend issues, flat microphone levels, unreliable MP4 exports, and UI lag from heavy preview polling.

We also hit a real creator-workflow problem: multiple apps cannot always open the same physical webcam at once. Our setup needed camera input for OBS, VTube Studio, and AuTuber. The solution was to make OBS own the physical webcam and use OBS Virtual Camera as the shared feed. OBS becomes the camera fan-out layer, while VTube Studio and AuTuber consume the virtual camera without fighting over hardware.

Model provider support was another challenge. Some OpenAI-compatible endpoints exposed the model but rejected certain media shapes like video_url, input_audio, or non-image payloads. That forced us to separate “is the media valid?” from “does this provider support this media shape?” Our practical direction became fast frame/image input for live reactions, with separate audio/transcript support where provider capability allows it.
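
In code, that separation reduces to a per-provider capability check before the request is built; the table below is invented for illustration:

```ts
// Distinguish "valid media" from "media this provider accepts".
type MediaShape = "image_url" | "video_url" | "input_audio";

// Hypothetical capability table; real entries would come from
// configuration and runtime probing.
const providerCapabilities: Record<string, Set<MediaShape>> = {
  "lm-studio-local": new Set<MediaShape>(["image_url"]),
  "openrouter-vision": new Set<MediaShape>(["image_url", "input_audio"]),
};

function canSend(provider: string, shape: MediaShape): boolean {
  // Valid media the provider cannot accept is dropped or downgraded
  // (for example, video -> latest frame) before the request is sent.
  return providerCapabilities[provider]?.has(shape) ?? false;
}
```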

Latency was a major tradeoff. Short video clips gave richer context, but reactions could take around 4 to 4.5 seconds, which is too slow for live avatar emotes. We optimized the primary reaction path to use the latest buffered frame, making emotes feel much more immediate.
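
The fast path amounts to a one-slot frame buffer (a sketch; names invented):

```ts
// Capture keeps overwriting the newest frame; the reaction path reads
// whatever is freshest instead of waiting to encode a video clip.
class LatestFrameBuffer {
  private frame: Uint8Array | null = null;

  push(frame: Uint8Array): void {
    this.frame = frame; // overwrite: older frames are never queued
  }

  take(): Uint8Array | null {
    return this.frame;
  }
}
```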

Faster reactions exposed accuracy problems. The model could overreact, such as reading a normal smile as heart eyes or a neutral face as shock. Prompting alone was not enough, so we added confidence thresholds, cooldowns, action validation, and deterministic cue-specific evidence gates.
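
A minimal sketch of the threshold and cooldown gates (values invented for illustration):

```ts
const MIN_CONFIDENCE = 0.75; // below this, the cue is ignored
const COOLDOWN_MS = 8_000;   // per-cue minimum gap between triggers

const lastFired = new Map<string, number>(); // cue label -> last trigger

function passesGates(cueLabel: string, confidence: number): boolean {
  if (confidence < MIN_CONFIDENCE) return false;
  const last = lastFired.get(cueLabel) ?? 0;
  const now = Date.now();
  if (now - last < COOLDOWN_MS) return false; // still cooling down
  lastFired.set(cueLabel, now);
  return true;
}
```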

The biggest architecture breakthrough was cue labels. At first, the model could see raw VTS hotkey IDs and catalog details. That was fragile. We redesigned the system so the model only detects cue labels, while AuTuber locally resolves those labels into exactly one current safe action.

One of our favorite discoveries was accidental: the model recognized a peace sign as “peace out” without us explicitly programming that gesture. It treated the gesture as a leaving cue and activated AFK mode. At first it looked like a bug, but it revealed a feature direction: natural gesture-based stream controls.

Accomplishments We Are Proud Of

We are proud that AuTuber is more than a chatbot. It observes real live context and performs real local actions.

Our biggest accomplishment is the full observe -> model -> validate -> execute loop. AuTuber can connect to VTube Studio, load avatar hotkeys, detect a cue, validate the action, trigger an emote, and log the result.

We are also proud of the safety-first architecture. Since OBS and VTube Studio can affect a live broadcast, the model cannot directly execute arbitrary actions. Every action goes through local validation, cooldowns, and policy checks.

The VTS Catalog is another major win because it makes the system adaptable to different avatar models. Creators can inspect hotkeys, manage cue labels, override classifications, and manually test actions.

We are especially excited by the peace-sign discovery because it makes AuTuber feel like a real stagehand: something that can understand the performer’s intent from natural visual cues, not just rigid hotkeys.

What We Learned

We learned that the most important part of an AI automation tool is the boundary around the AI.

For live creator tools, the model call is only one piece. The real product is built from structured observations, constrained outputs, runtime validation, safe local execution, logs, status panels, and operator control.

We also learned that provider capability detection matters. A model being available through an endpoint does not mean the endpoint supports every audio, image, or video format.

Most importantly, we learned that an AI stagehand is a better framing than an AI chatbot. The value is not conversation. The value is helping the creator keep the live moment moving without losing control of the stream.

What Is Next

Next, we want to turn AuTuber into a more polished creator-ready assistant.

Near-term goals include:

  • Guided first-run setup for OBS, VTube Studio, camera, model provider, and automation profiles.
  • Better provider capability detection for image, audio, video, JSON mode, tool calling, and transcription.
  • More visible action timelines showing cue detected, validation result, resolver result, and executed action.
  • Stronger OBS confirmation workflows for scene/source changes.
  • More natural gesture controls, such as waving, peace signs, thumbs up, covering the camera, or stepping away.
  • More automation profiles for gaming, chatting, teaching, podcasting, and live events.
  • More local acceleration so fast reactions and deeper video/audio analysis can run closer to the creator.

Long term, AuTuber can become a general AI stagehand for live digital performance. VTubers are the perfect starting point, but the same architecture could support streamers, esports broadcasts, classrooms, podcasts, online events, and anyone performing live while managing complex software.

Built With

Electron, React, Vite, TypeScript, Zod, OBS WebSocket, VTube Studio WebSocket API, LM Studio, self-hosted Nemotron 3 Nano Omni, pnpm workspaces, and Turbo.

NVIDIA — Best use of Nemotron
Team: AuTuber Development Team
GitHub
Team Members
  • Anthony Kung
  • brian phan
  • Marcus T.
  • Jacob Berger