One of the biggest challenges I've faced since I started developing [PdfItDown](https://github.com/AstraBert/PdfItDown) has been correctly handling the conversion of files like Excel sheets and CSVs: table conversion was bad and messy, almost unusable for downstream tasks.
That's why today I'm excited to introduce readers, the new feature of PdfItDown v1.4.0!
With readers, you can choose among three (for now) flavors of text extraction and conversion to PDF:
- Docling, which does a fantastic job with presentations, spreadsheets and Word documents
- LlamaParse by LlamaIndex, suitable for more complex and articulated documents that mix text, images and tables
- MarkItDown by Microsoft, not the best at handling highly structured documents, but extremely flexible in terms of input file format (it can even convert XML, JSON and ZIP files!)
You can use this new feature in your Python scripts (check the attached code snippet!) and in the command line interface as well!
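To make the trade-offs between the three readers concrete, here is a minimal sketch of a helper that suggests a reader from a file's extension. It is not part of PdfItDown: `pick_reader`, the `complex_layout` flag, the extension sets and the lowercase reader labels are all assumptions made for illustration, based only on the descriptions above.

```python
from pathlib import Path

# Hypothetical groupings, derived only from the descriptions in the list above.
# The string labels are illustrative and may not match PdfItDown's option names.
DOCLING_EXTS = {".pptx", ".xlsx", ".docx"}   # presentations, spreadsheets, Word documents
MARKITDOWN_EXTS = {".xml", ".json", ".zip"}  # formats called out for MarkItDown


def pick_reader(path: str, complex_layout: bool = False) -> str:
    """Suggest a reader label for a file (hypothetical helper, not a PdfItDown API)."""
    ext = Path(path).suffix.lower()
    if ext in MARKITDOWN_EXTS:
        return "markitdown"
    if complex_layout:
        # Mixed text, images and tables: the case described for LlamaParse.
        return "llamaparse"
    if ext in DOCLING_EXTS:
        return "docling"
    # Fall back to the reader described as most flexible on input formats.
    return "markitdown"


if __name__ == "__main__":
    for name in ["slides.pptx", "config.json", "notes.txt"]:
        print(name, "->", pick_reader(name))
```

The fallback to MarkItDown is a design choice of this sketch, not a recommendation from the project.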
I am fascinated by models that learn from prompts and rewards - no example answers needed, unlike in Supervised Fine-Tuning.
After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...
I wanted a different challenge, like teaching a model to create a schedule from a list of events and priorities.
Choosing an original problem forced me to:
- Think about the problem setting
- Generate data
- Choose the right base model
- Design reward functions (and experience reward hacking)
- Run multiple rounds of training, hoping that my model would learn something
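To give an idea of what a reward function for this kind of task could look like, here is a minimal sketch. It is an illustration under stated assumptions, not the reward used in the project: the data layout (events as name to start/end times, priorities as a set of names), the double weight for priority events and the normalization by the total weighted duration are all hypothetical choices.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def schedule_reward(schedule, events, priorities) -> float:
    """Score a proposed schedule in [0, 1] (hypothetical reward, for illustration).

    schedule:   list of (name, start, end) tuples proposed by the model
    events:     dict mapping name -> (start, end) from the prompt
    priorities: set of event names that count double
    """
    if not schedule:
        return 0.0
    # Hard check 1: every scheduled event must exist in the prompt, unchanged.
    for name, start, end in schedule:
        if events.get(name) != (start, end):
            return 0.0
    # Hard check 2: no two scheduled events may overlap.
    ordered = sorted(schedule, key=lambda item: to_minutes(item[1]))
    for (_, _, prev_end), (_, next_start, _) in zip(ordered, ordered[1:]):
        if to_minutes(next_start) < to_minutes(prev_end):
            return 0.0

    def weighted(name, start, end):
        weight = 2 if name in priorities else 1
        return weight * (to_minutes(end) - to_minutes(start))

    achieved = sum(weighted(*item) for item in schedule)
    # Upper bound: every event scheduled, which may be infeasible if events clash.
    upper = sum(weighted(name, s, e) for name, (s, e) in events.items())
    return achieved / upper if upper else 0.0
```

The hard checks are there because a reward based on duration alone is easy to hack: a model could invent long events or stack overlapping ones to inflate its score.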
Ever dreamt of ingesting into a vector DB that pile of CSVs, Word documents and presentations lying in some remote folders on your PC? What if I told you that you can do it within three to six lines of code? Well, with my latest open-source project, ingest-anything (https://github.com/AstraBert/ingest-anything), you can take all your non-PDF files, convert them to PDF, extract their text, chunk, embed and load them into a vector database, all in one go! How? It's pretty simple!
- The input files are converted into PDF by PdfItDown (https://github.com/AstraBert/PdfItDown)
- The PDF text is extracted using LlamaIndex readers
- The text is chunked with Chonkie
- The chunks are embedded with Sentence Transformers models
- The embeddings are loaded into a Qdrant vector database
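The last three steps (chunk, embed, load) can be sketched with the standard library alone. This is a toy illustration of the data flow, not the ingest-anything API: `chunk_text` stands in for Chonkie, `toy_embed` for a Sentence Transformers model (its vectors carry no semantic meaning) and the returned list for a Qdrant collection. All names and parameters here are hypothetical, and the conversion and text-extraction steps are skipped by starting from already extracted text.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks with overlap (stand-in for Chonkie)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def toy_embed(chunk: str, dim: int = 8) -> list[float]:
    """Deterministic unit-length vector from a hash (stand-in for a real embedding model)."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    raw = [byte - 127.5 for byte in digest[:dim]]  # never zero, so the norm is positive
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def ingest(texts: dict[str, str]) -> list[dict]:
    """Chunk and embed each text, collecting records in a list (stand-in for a vector DB)."""
    store: list[dict] = []
    for source, text in texts.items():
        for chunk in chunk_text(text):
            store.append({
                "id": len(store),
                "vector": toy_embed(chunk),
                "payload": {"source": source, "text": chunk},
            })
    return store


if __name__ == "__main__":
    records = ingest({"notes.docx": "Some text extracted from a converted PDF. " * 3})
    print(len(records), "chunks stored")
```

Keeping the source file name in each record's payload is what lets a later search point back to the original document.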
And you're done! Curious to try it? Install it by running:
And you can start using it in your Python scripts! Don't forget to star it on GitHub and let me know if you have any feedback! https://github.com/AstraBert/ingest-anything