The best way to use text embeddings portably is with Parquet and Polars

247 points by minimaxir 5 months ago

intalentive 5 months ago

The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates. Although I have had good results with DuckDB and Parquet files in object storage. Fast load times.

If you host your own embedding model, then you can transmit numpy float32 compressed arrays as bytes, then decode back into numpy arrays.

Personally I prefer using SQLite with usearch extension. Binary vectors then rerank top 100 with float32. It’s about 2 ms for ~20k items, which beats LanceDB in my tests. Maybe Lance wins on bigger collections. But for my use case it works great, as each user has their own dedicated SQLite file.

For portability there’s Litestream.

dijksterhuis 5 months ago

> The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates.
parquet is columnar storage, so it’s use case is lots of heavy filtering/aggregation within analytical workloads (OLAP).
consistent writes / updates, i.e. basically transactional (OLTP), use cases are never going to have great performance in columnar storage. its the wrong format to use for that.
for faster writes/updates you’d want row-based, i.e. CSV or an actual database. which i’m glad to see is where you kind of ended up anyway.
- yorwba 5 months ago
  
  There's no reason why an update query that doesn't change the file layout and only twiddles some values in place couldn't be made fast with columnar storage.
  When you run a read query, there's one phase that determines the offsets where values are stored and another that reads the value at a given offset. For an update query that doesn't change the offsets, you can change the direction from reading the value at an offset to writing a new value to that location instead, and it should be plenty fast.
  Parquet libraries just don't seem to consider that use case worth supporting for some reason and expect people to generate an entire new file with mostly the same content instead. Which definitely doesn't have great performance!
  - rbetts 5 months ago
    
    Columnar storage systems rarely store the raw value at fixed position. They store values as run length encoded, dictionary encoded, delta encoded, etc... and then store metadata about chunk of values for pruning at query time. So rarely can you seek to an offset and update a value. The compression achieved means less data to read from disk when doing large scans and lower storage costs for very-large-datasets that are largely immutable - some of the important benefits of columnar storage.
    Also, many applications that require updates also update conditionally (update a where b = c). This requires re-synthesizing (at least some of) the row to make a comparison, another relatively expensive operation for a column store.
    
    lmeyerov 5 months ago
    
    Also typically stored with binary compression (snappy, lib) after the snappy compression. In-memory might only be semantic, eg, arrow.
    But it's... Fine? Batch writes and rewrite dirty parts. Most of our cases are either appending events, or enriching with new columns, which can be modeled columnarly. It is a bit more painful in GPU land bc we like big chunks (250MB-1GB) for saturating reads, but CPU land is generally fine for us.
    We have been eyeing iceberg and friends as a way to automate that, so I've been curious how much of the optimization, if any, they take for us
csunbird 5 months ago

Parquet files being immutable is not a bug, it is a feature. That is how you accomplish good compression and keep the columnar data organized.
Yes, it is not useful for continuous writes and updates, but it is not what it is designed for. Use a database (e.g. SQLite just like you suggested) if you want to ingest real time/streaming data.
pantsforbirds 5 months ago

I've had great luck using either Athena or DuckDB with parquet files in s3 using a few partitions. You can query across the partitions pretty efficiently and if date/time is one of your partitions, then it's very efficient to add new data.
jt_b 5 months ago

> The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates. Although I have had good results with DuckDB and Parquet files in object storage. Fast load times.
You can use glob patterns in DuckDB to query remote parquets though to get around this? Maybe break things up using a hive partitioning scheme or similar.
- memhole 5 months ago
  
  I like the pattern described too. Only snag is deletes and updates. Ime, you have to delete the underlying file or create and maintain a view that handles the data you want visible.

banku_brougham 5 months ago

Really cool article, I've enjoyed your work for a long time. You might add a note for those jumping into a sqlite implementation, that duckdb reads parquet and launched a few vector similarity functions which cover this use-case perfectly:

https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...

jt_b 5 months ago

I have tinkered with using DuckDB as a poor man's vector database for a POC and had great results.
One thing I'd love to see is being able to do some sort of row group level metadata statistics for embeddings within a parquet file - something that would allow various readers to push predicates down to an HTTP request metadata level and completely avoid loading in non-relevant rows to the database from a remote file - particularly one stored on S3 compatible storage that supports byte-range requests. I'm not sure what the implementation would look like to define sorting the algorithm to organize the "close" rows together, how the metadata would be calculated, or what the reader implementation would look like, but I'd love to be able to implement some of the same patterns with vector search as with geoparquet.
- jt_b 5 months ago
  
  I thought about this some more and did some research - and found an indexing approach using HNSW, serialized to parquet, and queried from the browser here:
  https://github.com/jasonjmcghee/portable-hnsw
  Opens up efficient query patterns for larger datasets for RAG projects where you may not have the resources to run an expensive vector database
  - jasonjmcghee 5 months ago
    
    Hey that's my little research project- lmk if you're interested in chatting about this stuff.
    As others have mentioned in other threads, parquet isn't a great tool for the job here, but you could theoretically build a different file format that lends itself better to the problem of static file(s) representing a vector database.

mhh__ 5 months ago

I still don't like dataframes but oh my God Polars is so much better than pandas.

I was doing some time series calculations, simple equity price adjustments basically, in Polars and my two thoughts were:

- WTF, I can actually read the code and test it.

- it's running so fast it seems like it's broken.

eskaytwo 5 months ago

There’s some nice plugins too, some are finance related: https://github.com/ddotta/awesome-polars
- mhh__ 5 months ago
  
  The one thing I really want is for someone to make it so I can use it in F#. Presumably it's possible given how the python bit is implemented under the hood?
  - whyever 5 months ago
    
    It uses pyo3 to generate the bindings, so you would have to find a similar crate for F#/.NET and port the polars Python FFI to it. If such a crate does not exist, it will be even more work.
LaurensBER 5 months ago

Yeah, the readability difference is immense. I worked for years with Pandas and I still cannot "scan" it as quickly as with a "normal" programming language or SQL. Then there's the whole issue with (multi)-indexes, serialisation, etc.
Polars makes programming fun again instead of a chore.

stephantul 5 months ago

Check out Unum’s usearch. It beats anything, and is super easy to use. It just does exactly what you need.

https://github.com/unum-cloud/usearch

esafak 5 months ago

Have you tested it against Lance? Does it do predicate pushdown for filtering?
- ashvardanian 5 months ago
  
  USearch author here :)
  The engine supports arbitrary predicates for C, C++, and Rust users. In higher level languages it’s hard to combine callbacks and concurrent state management.
  In terms of scalability and efficiency, the only tool I’ve seen coming close is Nvidia’s cuVS if you have GPUs available. FAISS HNSW implementation can easily be 10x slower and most commercial & venture-backed alternatives are even slower: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...
  In this use-case, I believe SimSIMD raw kernels may be a better choice. Just replace NumPy and enjoy speedups. It provides hundreds of hand-written SIMD kernels for all kinds of vector-vector operations for AVX, AVX-512, NEON, and SVE across F64, F32, BF16, F16, I8, and binary vectors, mostly operating in mixed precision to avoid overflow and instability: https://github.com/ashvardanian/SimSIMD
- stephantul 5 months ago
  
  Usearch is a vector store afaik, not a vector db. At least that’s how I use it.
  I haven’t compared it to lancedb, I reached for it here because the author mentioned Faiss being difficult to use and install. usearch is a great alternative to Faiss.
  But thanks for the suggestion, I’ll check it out

dwagnerkc 5 months ago

If you want to try it out. Can lazily load from HF and apply filtering this way.

  df = (
    pl.scan_parquet('hf://datasets/minimaxir/mtg-embeddings/mtg_embeddings.parquet')
    .filter(
        pl.col("type").str.contains("Sorcery"),
        pl.col("manaCost").str.contains("B"),
    )
    .collect()

)

Polars is awesome to use, would highly recommend. Single node it is excellent at saturating CPUs, if you need to distribute the work put it in a Ray Actor with some POLARS_MAX_THREADS applied depending on how much it saturates a single node.

thomasfromcdnjs 5 months ago

Lots of great findings

---

I'm curious if anyone knows whether it is better to pass structured data or unstructured data to embedding api's? If I ask ChatGPT, it says it is better to send unstructured data. (looking at the authors github, it looks like he generated embeddings from json strings)

My use case is for jsonresume, I am creating embeddings by sending full json versions as strings, but I've been experimenting with using models to translate resume.json's into full text versions first before creating embeddings. The results seem to be better but I haven't seen any concrete opinions on this.

My understanding is that unstructured data is better because it contains textual/semantic meaning because of natural lanaguage aka

  skills: ['Javascript', 'Python']

is worse than;

  Thomas excels at Javascript and Python

Another question: What if the search was also a json embedding? JSON <> JSON embeddings could also be great?

minimaxir 5 months ago

In general I like to send structured data (see the input format here: https://github.com/minimaxir/mtg-embeddings), but the ModernBERT base for the embedding model used here specifically has better benefits implicitly for structured data compared to previous models. That's worth another blog post explaining why.
- notpublic 5 months ago
  
  please do explain why
  - minimaxir 5 months ago
    
    tl;dr the base ModernBERT was trained with code in mind unlike most encoder-only models (therefore assuming it was also trained on JSON/YAML objects) and also includes a custom tokenizer to support that, which is why I mention that indentation is important since different levels of indentation have different single tokens.
    This is mostly theoetical and does require a deeper dive to confirm.
vunderba 5 months ago

I'd say the more important consideration is "consistency" between incoming query input and stored vectors.
I have a huge vector database that gets updated/regenerated from a personal knowledge store (markdown library). Since the user is most likely to input a comparison query in the form of a question "Where does X factor into the Y system?" - I use a small 7b parameter LLM to pregenerate a list of a dozen possible theoretical questions a user might pose to a given embedding chunk. These are saved as 1536 dimension sized embeddings into the vector database (Qdrant) and linked to the chunks.
The real question you need to ask is - what's the input query that you'll be comparing to the embeddings? If it's incoming as structured, then store structured, etc.
I've also seen (anecdotally) similarity degradation for smaller chunks as well - so keep that in mind as well.

k2so 5 months ago

A neat trick in Vespa (vectors DB among other things) documentation is to use hex representation of vectors after converting them to binary.

This trick can be used to reduce your payload sizes. In Vespa, they support this format which is particularly useful when the same vectors are referenced multiple times in a document. For ColBERT or ColPaLi like cases (where you have many embedding vectors), this can reduce the size of the vectors stored on disk massively.

https://docs.vespa.ai/en/reference/document-json-format.html...

Not sure why this is not more commonly adopted though

jtrueb 5 months ago

Polars + Parquet is awesome for portability and performance. This post focused on python portability, but Polars has an easy-to-use Rust API for embedding the engine all over the place.

blooalien 5 months ago

Gotta love stuff that has multiple language bindings. Always really enjoyed finding powerful libraries in Python and then seeing they also have matching bindings for Go and Rust. Nice to have easy portability and cross-language compatibility.

rcarmo 5 months ago

I'm a huge fan of polars, but I hadn't considered using it to store embeddings in this way (I've been fiddling with sqlite-vec). Seems like an interesting idea indeed.

kernelsanderz 5 months ago

For another library that has great performance and features like full text indexing and the ability to version changes I’d recommend lancedb https://lancedb.github.io/lancedb/

Yes, it’s a vector database and has more complexity. But you can use it without creating indexes and it has excellent polars and pandas zero copy arrow support also.

daveguy 5 months ago

Since a lot of ML data is stored as parquet, I found this to be a useful tidbit from lancedb's documentation:
> Data storage is columnar and is interoperable with other columnar formats (such as Parquet) via Arrow
https://lancedb.github.io/lancedb/concepts/data_management/
Edit: That said, I am personally a fan of parquet, arrow, and ibis. So many data wrangling options out there it's easy to get analysis paralysis.
esafak 5 months ago

Lance is made for this stuff; parquet is not.
3abiton 5 months ago

How well does it scale?

robschmidt90 5 months ago

Nice read. I agree that for a lot of hobby use cases you can just load the embeddings from parquet and compute the similarities in-memory.

To find similarity between my blogposts [1] I wanted to experiment with a local vector database and found ChromaDB fairly easy to use (similar to SQLite just a file on your machine).

[1] https://staticnotes.org/posts/how-recommendations-work/

PaulHoule 5 months ago

In 2017 I was working on a model trainer for text classification and sequence labeling [1] that had limited success because the models weren't good enough.

I have a minilm + pooling + svm classifier which works pretty well for some things (topics, "will I like this article?") but doesn't work so well for sentiment, emotional tone and other things where the order of the words matter. I'm planning to upgrade my current classifier's front end to use ModernBert and add an LSTM-based back end that I think will equal or beat fine-tuned BERT and, more importantly, can be trained reliably with early stopping. I'd like to open source the thing, focused on reliability, because I'm an application programmer at heart.

I want it to provide an interface which is text-in and labels-out and hide the embeddings from most users but I'm definitely thinking about how to handle them, and there's the worse problem here that the LSTM needs a vector for each token, not each document, so text gets puffed up by a factor of 1000 or so which is not insurmountable (1 MB of training text puffs up to 1 GB of vectors)

Since it's expensive to compute the embeddings and expensive to store them I'm thinking about whether and how to cache them, considering that I expect to present the same samples to the trainer multiple times and to do a lot of model selection in the process of model development (e.g. what exact shape LSTM to to use) and in the case of end-user training (it will probably try a few models, not least do a shootout between the expensive model and a cheap model)_

[1] think of a "magic magic marker" which learns to mark up text the same way you do; this could mark "needless words" you could delete from a title, parts of speech, named entities, etc.

thelastbender12 5 months ago

This is pretty neat.

IMO a hindrance to this was lack of built-in fixed-size list array support in the Arrow format, until recently. Some implementations/clients supported it, while others didn't. Else, it could have been used as the default storage format for numpy arrays, torch tensors, too.

(You could always store arrays as variable length list arrays with fixed strides and handle the conversion).

kipukun 5 months ago

To the second footnote: you could utilize Polar's lazyframe API to do that cosine similarity in a streaming fashion for large files.

minimaxir 5 months ago

That would get around memory limitations but I still think that would be slow.
- kipukun 5 months ago
  
  You'd be surprised. As long as your query is using Polars natives and not a UDF (which drops it down to Python), you may get good results.
  - jononor 5 months ago
    
    A (simple) benchmark would be great to figure out where the practical limits of such an approach are. Runtime is expected to grow with O(n*2) which will get painful at some point.

jononor 5 months ago

At 33k items in memory is quite fast, 10 ms is very responsive. With 10x/330k items given same hardware the expected time is 1 second. That might be too slow for some applications (but not all). Especially if one just does retrieval of a rather small amount of matches, an index will help a lot for 100k++ datasets.

noahbp 5 months ago

Wow! How much did this cost you in GPU credits? And did you consider using your MacBook?

minimaxir 5 months ago

It took 1:17 to encode all ~32k cards using a preemptible L4 GPU on Google Cloud Platform (g2-standard-4) at ~$0.28/hour, costing < $0.01 overall: https://github.com/minimaxir/mtg-embeddings/blob/main/mtg_em...
The base ModernBERT uses CUDA tricks not available in MPS, so I suspect it would take much longer.
For the 2D UMAP, it took 3:33 because I wanted to do 1 million epochs to be thorough: https://github.com/minimaxir/mtg-embeddings/blob/main/mtg_em...

octernion 5 months ago

or you could just use postgres + pgvector? which many apps already have installed by default.

jononor 5 months ago

Many ways to skin a cat. At least of this size (33k items). And at the size given, string up a database would have no advantages. Which I believe is the main point of the post! If you have a simple problem, use a simple solution.
If one had instead 1M items, the situation would be completely different.

th24o3j4324234 5 months ago

The trouble with Parquet (and columnar storage) in ML is,

1. You don't really care too-much about accessing subsets of columns

2. You can't easily append stuff to closed Parquet files.

3. Batched-row access is presumably slower due to lower cache-hits.

It's okay for map-reduce style stuff where this doesn't matter, but in ML these limitations are an annoyance.

HDF5 (or Zarr, less portably) solves some/many of these issues but it's not quite a settled affair.

ismailmaj 5 months ago

Parquet is only a mess if you try to mutate it, usually you consider them as immutable and have the data stored across many files.
Also batched-row access is negligible given the compression benefits you get with the columnar format, which is probably why it's still king in ML; I think given what I'm seeing in the industry and recent trends (e.g. Velox).
jononor 5 months ago

Re 2. Parquet can easily be used with chunked/partitioned files. Then appending is just adding another file/chunk.
The case of 1. really depends on the workload. For embeddings etc selecting column subsets is rare. In order cases, where one has a a bunch of separate features, doing column subsetting might be rather common. But yes, it is far from every case.

banku_brougham 5 months ago

Is your example of a float32 number correct, holding 24 ascii char representation? I had thought single-precision gonna be 7 digits and the exponent, sign and exp sign. Something like 7+2+1+1 or 10 char ascii representation? Rather than the 24 you mentioned?

PaulHoule 5 months ago

One of the things I remember from my PhD work is that you can do a stupendous number of FLOPs on floating point numbers in the time it takes to serialize/deserialize them to ASCII.
minimaxir 5 months ago

It depends on the default print format. The example string I mentioned is pulled from what np.savetxt() does (fmt='%.18e') and there isn't any precision loss in that number. But I admit I'm not a sprintf() guru.
In practice numbers with that much precision is overkill and verbose so tools don't print float32s to that level of precision.

whinvik 5 months ago

Since we are talking about an embedded solution shouldn't the benchmark be something like sqlite with a vector extension or lancedb?

0cf8612b2e1e 5 months ago

My natural point of comparison without actually be DuckDB plus their vector search extension.
minimaxir 5 months ago

I mention sqlite + sqlite-vec at the end, noting it requires technical overhead and it's not as easy as read_parquet() and write_parquet().
I just became aware of lancedb and am looking into that, although from glancing at the README it has similar issues to faiss with regards to usability for casual use, although much better than faiss in that it can work with colocated metadata.

WatchDog 5 months ago

Parquet is fine and all, but I love the simplicity and simple interoperability of CSV.

You can save a huge amount of overhead just by base64 encoding the vectors, they aren't exactly human readable anyway.

I imagine the resulting file would only be approximately 33% larger than the pickle version.

llm_trw 5 months ago

>The second incorrect method to save a matrix of embeddings to disk is to save it as a Python pickle object [...] But it comes with two major caveats: pickled files are a massive security risk as they can execute arbitrary code, and the pickled file may not be guaranteed to be able to be opened on other machines or Python versions. It’s 2025, just stop pickling if you can.

Security: absolutely.

Portability: who cares? Frameworks move so quickly that unless you carry your whole dependency graph between machines you will not get bit compatible results with even minor version changes. It's a dirty secret that no one seems to want to fix or care about.

In short: everything is so fucked that pickle + conda is more than good enough for whatever project you want to serve to >10,000 users.