Show HN: Bhumi – OSS Python Library with a Rust Core for 2.5x Faster LLM Inference
bhumi.trilok.ai

Read the full blog post at https://rach.codes/blog/Introducing-Bhumi (click on Reader to see the technical breakdown!).

AI inference should be fast, but in practice it's painfully slow. Inference bottlenecks slow down LLM-powered chatbots and AI workflows everywhere.
I built Bhumi to fix that. Bhumi is a Python library designed for developers, yet its performance-critical core is implemented in Rust (via PyO3) for near-native speed. This hybrid approach delivers up to 2.5x faster response times across providers like OpenAI, Anthropic, and Gemini—without changing the underlying model.
THE PROBLEM: SLOW AI INFERENCE
Most LLM clients suffer from three main issues:

1. Batch Processing Overhead – Clients wait for the full response instead of streaming data as it's ready (see the toy sketch below).
2. Inefficient Buffers – Default buffer sizes aren’t tuned for AI-generated text.
3. Validation Bottlenecks – Tools like Pydantic slow down structured response handling.
Bhumi tackles these challenges with a smarter architecture that blends Python’s ease of use with Rust’s raw speed.
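To make the first problem concrete, here is a toy Python sketch (my illustration, not Bhumi code; every name in it is made up) that measures time-to-first-token for a streamed response versus waiting for the full thing:

    import asyncio
    import time

    async def fake_model(n_tokens: int = 50, delay: float = 0.02):
        # Stand-in for an LLM that emits one token every `delay` seconds.
        for i in range(n_tokens):
            await asyncio.sleep(delay)
            yield f"tok{i} "

    async def main():
        start = time.perf_counter()
        first_token_at = None
        async for _ in fake_model():
            if first_token_at is None:
                first_token_at = time.perf_counter() - start
        total = time.perf_counter() - start
        # A streaming client can show output at ~0.02s; a batch client
        # keeps the user waiting the full ~1s before showing anything.
        print(f"first token: {first_token_at:.2f}s, full response: {total:.2f}s")

    asyncio.run(main())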
HOW BHUMI MAKES AI FASTER

1. Rust-Based Streaming: Python's async is useful, but integrating Rust through PyO3 brings near-native performance. Streaming inference starts instantly, cutting response times by up to 2.5x.
2. Smarter Buffer Management: Quality-Diversity algorithms (like MAP-Elites) dynamically discover optimal buffer sizes, boosting throughput by roughly 40% (a toy version of this search is sketched after this list).
3. Replacing Pydantic with Satya: Pydantic was a performance sink, so I built Satya, a Rust-backed validation library that accelerates structured outputs dramatically.
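Here is a toy MAP-Elites-style loop for point 2 (my illustration, not Bhumi's actual tuner; the throughput function is invented). The quality-diversity idea is to keep the best-scoring "elite" per niche of buffer size, then mutate elites rather than searching blindly:

    import random

    def throughput(buf_size: int) -> float:
        # Hypothetical objective: peaks near 8 KiB, with measurement noise.
        return -abs(buf_size - 8192) / 8192 + random.uniform(0, 0.05)

    # Niches: crude behavioral descriptor = order of magnitude (bit length).
    elites: dict[int, tuple[int, float]] = {}  # niche -> (buf_size, score)

    def consider(buf_size: int) -> None:
        niche = buf_size.bit_length()
        score = throughput(buf_size)
        if niche not in elites or score > elites[niche][1]:
            elites[niche] = (buf_size, score)

    for _ in range(200):
        if elites and random.random() < 0.7:
            parent, _ = random.choice(list(elites.values()))
            consider(max(256, int(parent * random.uniform(0.5, 2.0))))  # mutate an elite
        else:
            consider(random.randint(256, 65536))  # random exploration

    best_size, best_score = max(elites.values(), key=lambda e: e[1])
    print(f"best buffer size found: {best_size} bytes")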
PERFORMANCE BENCHMARKS
• OpenAI: 2.5x faster response times
• Anthropic: 1.8x faster
• Gemini: 1.6x faster
• Minimal extra memory overhead
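If you want to sanity-check numbers like these on your own workload, a minimal timing harness is enough. The sketch below assumes a client exposing the completion API shown later in this post:

    import asyncio
    import time

    async def avg_latency(completion_fn, prompt: str, runs: int = 5) -> float:
        # Average wall-clock seconds per completion over `runs` calls.
        t0 = time.perf_counter()
        for _ in range(runs):
            await completion_fn([{"role": "user", "content": prompt}])
        return (time.perf_counter() - t0) / runs

    # Usage, once a client is configured as in the example below:
    #   print(asyncio.run(avg_latency(client.completion, "Hello!")))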
Bhumi is provider-agnostic, allowing you to switch between OpenAI, Anthropic, Groq, and more with a simple config change.
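With the LLMConfig pattern from the example below, a provider switch is one changed model string (the Anthropic model id here is illustrative):

    from bhumi.base_client import BaseLLMClient, LLMConfig

    openai_client = BaseLLMClient(LLMConfig(api_key="sk-...", model="openai/gpt-4o-mini"))
    claude_client = BaseLLMClient(LLMConfig(api_key="sk-ant-...", model="anthropic/claude-3-5-sonnet"))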
USING BHUMI (WITH TOOL USE & STRUCTURED OUTPUTS)
Bhumi makes tool integration effortless. For example, here’s how you can register a weather tool in Python:
    import asyncio
    from bhumi.base_client import BaseLLMClient, LLMConfig

    async def get_weather(location: str) -> str:
        return f"The weather in {location} is 75°F"

    async def main():
        config = LLMConfig(api_key="sk-...", model="openai/gpt-4o-mini")
        client = BaseLLMClient(config)
        client.register_tool(name="get_weather", func=get_weather)
        response = await client.completion([
            {"role": "user", "content": "What's the weather in SF?"}
        ])
        print(response["text"])

    asyncio.run(main())
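The heading also promises structured outputs. Rather than reproduce Satya's exact API here, this plain-Python sketch shows the kind of per-field validation work that sits on the hot path in pure Python and that a Rust-backed validator can move out of the interpreter:

    from dataclasses import dataclass

    @dataclass
    class Weather:
        location: str
        temperature_f: float

    def validate_weather(raw: dict) -> Weather:
        # In pure Python, every isinstance check and dict lookup costs time;
        # run this on thousands of streamed responses and it adds up.
        if not isinstance(raw.get("location"), str):
            raise TypeError("location must be a string")
        if not isinstance(raw.get("temperature_f"), (int, float)):
            raise TypeError("temperature_f must be a number")
        return Weather(raw["location"], float(raw["temperature_f"]))

    print(validate_weather({"location": "SF", "temperature_f": 75}))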
WHAT'S NEXT?

I'm actively working on:
• Supporting More Providers & Models
• Adaptive Streaming Optimizations
• Advanced Structured Outputs & Tooling
Bhumi is a Python-first library powered by a Rust core under the hood for performance.
Check out Bhumi on GitHub at https://github.com/justrach/bhumi or reach out at me@rachit.ai.