@as-cle-bert on Hugging Face: "Ever dreamt of ingesting into a vector DB that pile of CSVs, Word documents…"

as-cle-bert

posted an update 2 days ago

Ever dreamt of ingesting into a vector DB that pile of CSVs, Word documents and presentations lying in some remote folder on your PC?🗂️
What if I told you that you can do it within three to six lines of code?🤯
Well, with my latest open-source project, 𝐢𝐧𝐠𝐞𝐬𝐭-𝐚𝐧𝐲𝐭𝐡𝐢𝐧𝐠 (https://github.com/AstraBert/ingest-anything), you can take all your non-PDF files, convert them to PDF, extract their text, chunk, embed and load them into a vector database, all in one go!🚀
How? It's pretty simple!
📁 The input files are converted into PDF by PdfItDown (https://github.com/AstraBert/PdfItDown)
📑 The PDF text is extracted using LlamaIndex readers
🦛 The text is chunked with Chonkie
🧮 The chunks are embedded thanks to Sentence Transformers models
🗄️ The embeddings are loaded into a Qdrant vector database

And you're done!✅
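
To make the chunk, embed and load steps above concrete, here is a toy, standard-library-only sketch of the same flow. It is not how ingest-anything is implemented: the word-window chunker stands in for Chonkie, the hashing-based `toy_embed` stands in for a Sentence Transformers model, and `InMemoryStore` stands in for Qdrant. Every name in it is hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import hashlib
import math
from dataclasses import dataclass


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of words (stand-in for a real chunker)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, stop, step)]


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hash each word into a bucket and L2-normalize (stand-in for a real embedder)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Point:
    id: int
    vector: list[float]
    payload: dict


class InMemoryStore:
    """Minimal list-backed store (stand-in for a vector database)."""

    def __init__(self) -> None:
        self.points: list[Point] = []

    def upsert(self, points: list[Point]) -> None:
        self.points.extend(points)

    def search(self, query_vector: list[float], limit: int = 3) -> list[tuple[float, Point]]:
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(query_vector, p.vector)), p)
            for p in self.points
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:limit]


def ingest(texts: dict[str, str], store: InMemoryStore,
           chunk_size: int = 40, overlap: int = 10) -> InMemoryStore:
    """Chunk every text, embed each chunk, and load the result into the store."""
    next_id = len(store.points)
    for source, text in texts.items():
        for chunk in chunk_text(text, chunk_size, overlap):
            payload = {"source": source, "text": chunk}
            store.upsert([Point(next_id, toy_embed(chunk), payload)])
            next_id += 1
    return store
```

Calling `ingest({"notes.txt": some_text}, InMemoryStore())` and then `store.search(toy_embed(query))` returns the stored chunks ranked by cosine similarity. With a real embedding model and a real vector database the overall shape stays the same, only the three stand-ins change.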
Curious to try it? Install it by running:

pip install ingest-anything

And you can start using it in your Python scripts!🐍
Don't forget to star it on GitHub and let me know if you have any feedback! ➡️ https://github.com/AstraBert/ingest-anything

JLouisBiz

1 day ago

That sounds like a much-needed tool. How can I use my own embedder?

as-cle-bert

1 day ago

So, there are two possibilities:

If you mean picking a different embedder among the ones available in Sentence Transformers, that is easy: just change the embedding_model parameter when calling the ingest method.
If you mean that you have your own embedding model (for example, saved on your PC), that is a tad more difficult. I think Sentence Transformers might allow loading a model from your PC as long as it is compatible with the package. I think that this guide might be useful in that regard.

For now the package only supports Sentence Transformers models. In the future it will probably extend its support to other embedding models as well :)
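
For the second case, here is a minimal sketch of loading a model from a local folder with Sentence Transformers itself. It assumes the folder holds a model in a format the library understands (for example one written earlier with the model's `save` method). The helper name `load_local_embedder` is hypothetical, and whether ingest-anything accepts such a local path through embedding_model is an assumption I have not verified.

```python
from pathlib import Path


def load_local_embedder(model_dir: str):
    """Load a Sentence Transformers model from a local directory (hypothetical helper)."""
    path = Path(model_dir)
    # Fail early with a clear message instead of letting the library
    # try to resolve the string as a model name on the Hugging Face Hub.
    if not path.is_dir():
        raise FileNotFoundError(f"No model directory found at {path}")
    # Imported lazily so the path check works even before the package is installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(str(path))
```

The returned object exposes the usual `encode` method, so `load_local_embedder("path/to/my-model").encode(["some text"])` should give you the embeddings, provided the folder really contains a compatible model.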
