Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
as-cle-bertย 
posted an update 2 days ago
Post
2595
Ever dreamt of ingesting into a vector DB that pile of CSVs, Word documents and presentations laying in some remote folders on your PC?๐Ÿ—‚๏ธ
What if I told you that you can do it within three to six lines of code?๐Ÿคฏ
Well, with my latest open-source project, ๐ข๐ง๐ ๐ž๐ฌ๐ญ-๐š๐ง๐ฒ๐ญ๐ก๐ข๐ง๐  (https://github.com/AstraBert/ingest-anything), you can take all your non-PDF files, convert them to PDF, extract their text, chunk, embed and load them into a vector database, all in one go!๐Ÿš€
How? It's pretty simple!
๐Ÿ“ The input files are converted into PDF by PdfItDown (https://github.com/AstraBert/PdfItDown)
๐Ÿ“‘ The PDF text is extracted using LlamaIndex readers
๐Ÿฆ› The text is chunked exploiting Chonkie
๐Ÿงฎ The chunks are embedded thanks to Sentence Transformers models
๐Ÿ—„๏ธ The embeddings are loaded into a Qdrant vector database

And you're done!โœ…
Curious of trying it? Install it by running:

๐˜ฑ๐˜ช๐˜ฑ ๐˜ช๐˜ฏ๐˜ด๐˜ต๐˜ข๐˜ญ๐˜ญ ๐˜ช๐˜ฏ๐˜จ๐˜ฆ๐˜ด๐˜ต-๐˜ข๐˜ฏ๐˜บ๐˜ต๐˜ฉ๐˜ช๐˜ฏ๐˜จ

And you can start using it in your python scripts!๐Ÿ
Don't forget to star it on GitHub and let me know if you have any feedback! โžก๏ธ https://github.com/AstraBert/ingest-anything

That sounds like the very needed thing. How can I use my own embedder?

ยท

So, there are two possibilities:

  • If you mean customizing the embedder among the ones available within Sentence Transformers, it is very possible, you just have to change the embedding_model parameter when calling the ingest method
  • If you mean that you have your own embedding model (like saved on your PC), that is a tad more difficult. I think Sentence Transformer might allow loading the model from your PC as long as it is compatible with the package. I think that this guide might be useful in that regard

For now the package only supports Sentence Transformers models, in the future it will probably extend its support to other embedding models as well :)