metadata
title: Markit
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.14.0
app_file: app.py
build_script: build.sh
startup_script: setup.sh
pinned: false
Markit: Document to Markdown Converter
Author: Anse Min | GitHub | LinkedIn
Project Links
- GitHub Repository: github.com/ansemin/Markit_HF
- Hugging Face Space: huggingface.co/spaces/Ansemin101/Markit
Overview
Markit converts documents in various formats (PDF, DOCX, images, and others) to Markdown. It uses several parsing engines and OCR methods to extract text and produce clean, readable Markdown.
Key Features
- Multiple Document Formats: Convert PDFs, Word documents, images, and other document formats
- Versatile Output Formats: Export to Markdown, JSON, plain text, or document tags format
- Advanced Parsing Engines:
  - PyPdfium: Fast PDF parsing using the PDFium engine
  - Docling: Advanced document structure analysis
  - Marker: Specialized for markup and formatting
  - Gemini Flash: AI-powered conversion using Google's Gemini API
- OCR Integration: Extract text from images and scanned documents using Tesseract OCR
- Interactive UI: User-friendly Gradio interface with page navigation for large documents
- AI-Powered Chat: Interact with your documents using AI to ask questions about content
System Architecture
The application is built with a modular architecture:
- Core Engine: Handles document conversion and processing workflows
- Parser Registry: Central registry for all document parsers
- UI Layer: Gradio-based web interface
- Service Layer: Handles AI chat functionality and external services integration
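The relationship between the core engine and the parser registry can be sketched roughly as follows. This is a minimal illustration, not Markit's actual code: the `PARSERS` dictionary, the `register` decorator, the `convert` function, and the toy `plain_text` parser are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a parser name to a function that turns a
# file path into Markdown text. Markit's real registry may differ.
PARSERS: Dict[str, Callable[[str], str]] = {}


def register(name: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Decorator that adds a parser function to the registry under `name`."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        PARSERS[name] = func
        return func
    return decorator


@register("plain_text")
def parse_plain_text(path: str) -> str:
    """Toy parser: reads a UTF-8 text file and returns it unchanged."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def convert(path: str, parser_name: str) -> str:
    """Core-engine step: look up the requested parser and run it."""
    try:
        parser = PARSERS[parser_name]
    except KeyError:
        available = ", ".join(sorted(PARSERS)) or "none"
        raise ValueError(
            f"Unknown parser {parser_name!r}; available: {available}"
        ) from None
    return parser(path)
```

Keeping the lookup in one place means the UI layer only needs the parser names, and a new engine can be added without touching the conversion workflow.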
Installation
For Local Development
Clone the repository
Install dependencies:
pip install -r requirements.txt
Install Tesseract OCR (required for OCR functionality):
- Windows: Download and install from GitHub
- Linux:
sudo apt-get install tesseract-ocr libtesseract-dev
- macOS:
brew install tesseract
Run the application:
python app.py
API Keys Setup
Gemini Flash Parser
To use the Gemini Flash parser, you need to:
- Install the Google Generative AI client:
pip install google-genai
- Set the API key environment variable:
# On Windows
set GOOGLE_API_KEY=your_api_key_here
# On Linux/Mac
export GOOGLE_API_KEY=your_api_key_here
- Alternatively, create a .env file in the project root with:
GOOGLE_API_KEY=your_api_key_here
- Get your Gemini API key from Google AI Studio
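As a rough sketch of how the two options above can be combined, the helper below looks for `GOOGLE_API_KEY` in the process environment first and falls back to a `.env` file. It uses only the standard library; whether Markit itself reads the file this way (or through a package such as python-dotenv) is not stated here, and the function names `read_env_file` and `get_google_api_key` are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_google_api_key(env_path: Path = Path(".env")) -> Optional[str]:
    """Return GOOGLE_API_KEY from the environment, else from the .env file."""
    key = os.environ.get("GOOGLE_API_KEY")
    if key:
        return key
    return read_env_file(env_path).get("GOOGLE_API_KEY")
```

Calling `get_google_api_key()` from the project root returns `None` when the key is set in neither place, which gives the application a chance to show a clear message instead of failing inside the API client.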
Deploying to Hugging Face Spaces
Environment Configuration
- Go to your Space settings:
https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME/settings
- Add the following repository secrets:
- Name: GOOGLE_API_KEY
- Value: Your Gemini API key
- Name:
Space Configuration
Ensure your Hugging Face Space configuration includes:
build:
  dockerfile: Dockerfile
  python_version: "3.10"
  system_packages:
    - "tesseract-ocr"
    - "libtesseract-dev"
How to Use
Document Conversion
- Upload your document using the file uploader
- Select a parser provider:
  - PyPdfium: Best for standard PDFs with selectable text
  - Docling: Best for complex document layouts
  - Marker: Best for preserving document formatting
  - Gemini Flash: Best for AI-powered conversions (requires API key)
- Choose an OCR option based on your selected parser:
  - None: No OCR processing (for documents with selectable text)
  - Tesseract: Basic OCR using Tesseract
  - Advanced: Enhanced OCR with layout preservation (available with specific parsers)
- Select your desired output format:
  - Markdown: Clean, readable markdown format
  - JSON: Structured data representation
  - Text: Plain text extraction
  - Document Tags: XML-like structure tags
- Click "Convert" to process your document
- Navigate through pages using the navigation buttons for multi-page documents
- Download the converted content in your selected format
Document Chat
- After converting a document, switch to the "Chat with Document" tab
- Type your questions about the document content
- The AI will analyze the document and provide context-aware responses
- Use the conversation history to track your Q&A session
- Click "Clear" to start a new conversation
Troubleshooting
OCR Issues
- Ensure Tesseract is properly installed and in your system PATH
- Check that the TESSDATA_PREFIX environment variable is set correctly
- Verify language files are available in the tessdata directory
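The three checks above can be automated with a small diagnostic script. The sketch below uses only the standard library; the function name `check_tesseract` and the default language code `eng` are assumptions for illustration. Because the expected meaning of `TESSDATA_PREFIX` (the `tessdata` directory itself or its parent) depends on the Tesseract version, the sketch looks in both places.

```python
import os
import shutil
from pathlib import Path
from typing import List


def check_tesseract(language: str = "eng") -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems: List[str] = []

    # 1. Is the tesseract executable reachable through PATH?
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")

    # 2. Is TESSDATA_PREFIX set at all?
    prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
        return problems

    # 3. Is the language file present? Check the directory itself and a
    #    nested tessdata/ folder, since the convention varies by version.
    base = Path(prefix)
    candidates = [
        base / f"{language}.traineddata",
        base / "tessdata" / f"{language}.traineddata",
    ]
    if not any(path.is_file() for path in candidates):
        problems.append(f"{language}.traineddata not found under {base}")
    return problems
```

Printing the returned list before starting the app makes it easier to tell a missing installation apart from a missing language file.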
Gemini Flash Parser Issues
- Confirm your API key is set correctly as an environment variable
- Check for API usage limits or restrictions
- Verify the document format is supported by the Gemini API
General Issues
- Check the console logs for error messages
- Ensure all dependencies are installed correctly
- For large documents, try processing fewer pages at a time
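For the last point, one way to process fewer pages at a time is to split the page count into fixed-size ranges and convert each range separately. The helper below is only a sketch of that arithmetic; `page_batches` is a hypothetical name, and how each range is actually extracted from the document is left to whichever PDF tool you use.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        # The final batch may be shorter than batch_size.
        yield start, min(start + batch_size - 1, total_pages)


# Example: a 10-page document in batches of 4 pages.
batches = list(page_batches(10, 4))  # [(1, 4), (5, 8), (9, 10)]
```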
Development Guide
Project Structure
markit/
├── app.py                      # Main application entry point
├── setup.sh                    # Setup script
├── build.sh                    # Build script
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
├── .env                        # Environment variables
├── .gitignore                  # Git ignore file
├── .gitattributes              # Git attributes file
├── src/                        # Source code
│   ├── __init__.py             # Package initialization
│   ├── main.py                 # Main module
│   ├── core/                   # Core functionality
│   │   ├── __init__.py         # Package initialization
│   │   ├── converter.py        # Document conversion logic
│   │   └── parser_factory.py   # Parser factory
│   ├── parsers/                # Parser implementations
│   │   ├── __init__.py         # Package initialization
│   │   ├── parser_interface.py # Parser interface
│   │   ├── parser_registry.py  # Parser registry
│   │   ├── docling_parser.py   # Docling parser
│   │   ├── marker_parser.py    # Marker parser
│   │   └── pypdfium_parser.py  # PyPDFium parser
│   ├── ui/                     # User interface
│   │   ├── __init__.py         # Package initialization
│   │   └── ui.py               # Gradio UI implementation
│   └── services/               # External services
│       ├── __init__.py         # Package initialization
│       └── docling_chat.py     # Chat service
└── tests/                      # Tests
    └── __init__.py             # Package initialization
Adding a New Parser
- Create a new parser class implementing the DocumentParser interface
- Register the parser with the ParserRegistry
- Implement the required methods: get_name(), get_supported_ocr_methods(), and parse()
- Add your parser to the imports in src/parsers/__init__.py
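The steps above might look roughly like the sketch below. Only the names `DocumentParser`, `ParserRegistry`, `get_name()`, `get_supported_ocr_methods()`, and `parse()` come from this README; the method signatures, the `register` and `get` methods, and the `PlainTextParser` example are assumptions made for illustration, so check `parser_interface.py` and `parser_registry.py` for the real definitions.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List, Type


class DocumentParser(ABC):
    """Assumed shape of the interface; the real signatures may differ."""

    @abstractmethod
    def get_name(self) -> str:
        """Display name shown in the parser selector."""

    @abstractmethod
    def get_supported_ocr_methods(self) -> List[str]:
        """OCR options this parser accepts."""

    @abstractmethod
    def parse(self, file_path: str, ocr_method: str = "none") -> str:
        """Convert the file and return the result as Markdown text."""


class ParserRegistry:
    """Hypothetical stand-in for Markit's registry."""

    _parsers: Dict[str, Type[DocumentParser]] = {}

    @classmethod
    def register(cls, parser_class: Type[DocumentParser]) -> None:
        # Instantiate once to read the name the parser reports for itself.
        cls._parsers[parser_class().get_name()] = parser_class

    @classmethod
    def get(cls, name: str) -> DocumentParser:
        return cls._parsers[name]()


class PlainTextParser(DocumentParser):
    """Example parser: returns the content of a UTF-8 text file as-is."""

    def get_name(self) -> str:
        return "Plain Text"

    def get_supported_ocr_methods(self) -> List[str]:
        return ["none"]

    def parse(self, file_path: str, ocr_method: str = "none") -> str:
        if ocr_method not in self.get_supported_ocr_methods():
            raise ValueError(f"Unsupported OCR method: {ocr_method!r}")
        return Path(file_path).read_text(encoding="utf-8")


ParserRegistry.register(PlainTextParser)
```

In this sketch registration happens at import time, which is why the last step (adding the module to the package imports) matters: a parser module that is never imported never registers itself.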
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is open source and available under the MIT License.