Requirements:
It can listen 24/7 without you needing to manually start a new session or refresh it
Its recognition accuracy is at least as good as ChatGPT Voice Mode
Its latency is human level. For example if you say "run foo", it should execute the command without taking 5 seconds; it should be as fast as a typist who was listening for what you would say next.
It should be able to be interrupted while speaking. Voice Mode already does this.
It can call arbitrarily defined external tools, like remote MCP servers.
The model is at least as smart as Sonnet 4.5, which I think is already satisfied by ChatGPT Voice Mode
After making another market I decided that was too specific, so this question tracks what the real goal is.
Update 2025-12-11 (PST) (AI summary of creator comment): Regarding the latency requirement: The "essentially no delay" between saying "run foo" and execution should be measured relative to how long a human would take to perform the same action (e.g., a typist listening and executing the command).