Hi HN,
We've been building agentic VLMs that operate over visual data (e.g., images, PDFs, videos), and were surprised at how underdeveloped the current infrastructure is for multimodal tool-calling. MCP is all the rage these days, but it sidesteps a fundamental issue that no one seems to talk about - especially in multimodal contexts.
Some of the pain points we ran into when building our MCP server:
- LLMs call tools by-value. That's fine for text and JSON arguments, but it completely breaks down for visual inputs (see the by-reference sketch after this list).
- You can't pass images or videos as base64 - it blows through context limits, adds latency, and makes for a poor developer experience.
- Most "multimodal" MCP servers out there are single-turn demos. They assume local files and don't support remote or persistent objects, making it impossible to build real workflows that operate on intermediate visual state - which is the core of most computer vision tasks.
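For a sense of what by-reference tool-calling looks like, here's a minimal server-side sketch using the official MCP Python SDK (FastMCP). The tool names, the in-memory object store, and the face-blur stub are hypothetical placeholders - this is not our actual API:

    # Sketch: visual tools that pass intermediate state by reference.
    # Tool names and the object store are illustrative placeholders.
    import urllib.request
    import uuid

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("visual-tools")

    # Intermediate visual state lives server-side; the agent only passes IDs.
    OBJECTS: dict[str, bytes] = {}

    def _store(data: bytes) -> str:
        object_id = str(uuid.uuid4())
        OBJECTS[object_id] = data
        return object_id

    def _blur(data: bytes) -> bytes:
        return data  # stand-in for a real face-detection + blur model

    @mcp.tool()
    def upload_image(url: str) -> str:
        """Fetch an image by URL; return an opaque object ID, not pixels."""
        return _store(urllib.request.urlopen(url).read())

    @mcp.tool()
    def blur_faces(object_id: str) -> str:
        """Blur faces in a stored image; return the ID of the new image."""
        return _store(_blur(OBJECTS[object_id]))

    if __name__ == "__main__":
        mcp.run()  # stdio by default; a remote server would use an HTTP transport

The key property: every tool accepts and returns small string handles, so a multi-step chain (detect, then redact, then caption) never round-trips base64 through the model's context window.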
So we built a remotely hosted MCP server (https://docs.vlm.run/mcp/) that makes it trivial for agents to see, understand, and act on visual content using a suite of computer vision tools. We expose these tools (face detection, redaction, captioning, tracking, etc.) through a clean MCP-compatible API. Any agent that can hook into remote MCP servers - Claude, OpenAI, Cursor - can use it out of the box.
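Connecting looks roughly like this with the MCP Python SDK over streamable HTTP (the server URL below is a placeholder, not our real endpoint):

    # Sketch: discovering tools on a remote MCP server with the official
    # Python SDK. The URL is a placeholder.
    import asyncio

    from mcp import ClientSession
    from mcp.client.streamable_http import streamablehttp_client

    async def main() -> None:
        # streamablehttp_client yields (read_stream, write_stream, get_session_id)
        async with streamablehttp_client("https://example.com/mcp") as (read, write, _):
            async with ClientSession(read, write) as session:
                await session.initialize()
                tools = await session.list_tools()
                print([t.name for t in tools.tools])  # e.g. the CV toolbox above

    asyncio.run(main())

Agents like Claude do this handshake for you; the snippet is just to show there's nothing exotic underneath.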
Here are a few end-to-end examples (orchestrated by Claude, using our tools):
[1] Document Redaction: https://docs.vlm.run/mcp/examples/document-redaction
[2] Face Detection + Blurring: https://docs.vlm.run/mcp/examples/face-redaction
[3] Template Matching + Visual Search: https://docs.vlm.run/mcp/examples/template-search
[4] Video Editing: https://docs.vlm.run/mcp/examples/video-captioning
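Under the hood, a chained flow like [3] reduces to the agent threading object IDs between tool calls. Here's a sketch of the core loop, with hypothetical tool names and assuming an initialized session from the previous snippet:

    # Sketch: a stateful find-then-track flow with hypothetical tool names.
    # `session` is an initialized ClientSession, as in the earlier snippet.
    from mcp import ClientSession

    async def find_and_track(session: ClientSession) -> None:
        # Step 1: upload the reference image; the tool returns an object ID.
        ref = await session.call_tool(
            "upload_image", {"url": "https://example.com/logo.jpg"}
        )
        template_id = ref.content[0].text  # TextContent carrying the ID string
        # Step 2: thread that ID into the next call - the server keeps the
        # intermediate visual state, so no pixels re-enter the model context.
        await session.call_tool(
            "track_template",
            {"template_id": template_id, "video_url": "https://example.com/clip.mp4"},
        )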
We’d love to hear what workflows you’re building - and what visual tools you'd want your agents to build on.
It's impressive how the MCP example in https://docs.vlm.run/mcp/examples/template-search retains visual context across multiple images and tool calls. Unlike most chat interfaces, it enables seamless multi-step reasoning - like finding a logo in one image and tracking it in another - without losing state. This makes it ideal for building stateful, iterative visual workflows.

Shocking how poorly frontier models perform on simple visual tasks. Best-in-domain tool calling will become the norm.
Very interesting. Document redaction is definitely a great use case. Gotta check this out
Noiceee!
Impressive!