Commit 97c779b
1 Parent(s): 3985ce1
rollback

Files changed:
- .cursorrules +1 -17
- .gitignore +0 -43
- .vercel/README.txt +11 -0
- .vercel/project.json +1 -0
- README.md +1 -155
- api.py +0 -230
- api/index.py +0 -201
- requirements-prod.txt +0 -11
- requirements.txt +13 -4
- vercel.json +0 -25
.cursorrules
CHANGED
@@ -98,21 +98,5 @@ If needed, you can further use the `web_scraper.py` file to scrape the web page
 - Add debug information to stderr while keeping the main output clean in stdout for better pipeline integration
 - When using seaborn styles in matplotlib, use 'seaborn-v0_8' instead of 'seaborn' as the style name due to recent seaborn version changes
 - Use 'gpt-4o' as the model name for OpenAI's GPT-4 with vision capabilities
-- For Vercel deployments with FastAPI and Mangum, use older stable versions (FastAPI 0.88.0, Mangum 0.15.0, Pydantic 1.10.2) to avoid compatibility issues
-- Keep Vercel configuration simple and avoid unnecessary configuration options that might cause conflicts

-# Scratchpad
-
-Current Task: Fix Vercel deployment issues with FastAPI and Mangum
-
-Progress:
-[X] Identified issue with newer versions of FastAPI and Mangum
-[X] Updated dependencies to use older, stable versions
-[X] Simplified FastAPI configuration
-[X] Simplified Vercel configuration
-[X] Successfully deployed to production
-
-Next Steps:
-[ ] Test all API endpoints
-[ ] Add more functionality if needed
-[ ] Consider adding monitoring and logging
+# Scratchpad
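The rolled-back lesson above refers to the pattern used in the (also deleted) api.py further down: the FastAPI app is wrapped in a Mangum adapter so Vercel's Python runtime can call it as a serverless handler. A minimal sketch of that pattern, not part of this commit; the version pins (FastAPI 0.88.0, Mangum 0.15.0, Pydantic 1.10.2) are taken from the rolled-back note and are an assumption, not something verified here:

```python
# Minimal sketch of the FastAPI + Mangum pattern the deleted lesson describes.
# Not part of this commit; the pinned versions come from the rolled-back
# .cursorrules note (an assumption), not from fresh testing.
from fastapi import FastAPI
from mangum import Mangum

app = FastAPI(title="Document Parser API")

@app.get("/health")
async def health_check():
    # Same liveness probe as the deleted api.py below
    return {"status": "healthy"}

# lifespan="off" skips ASGI startup/shutdown events, which do not map cleanly
# onto a per-request serverless invocation
handler = Mangum(app, lifespan="off")
```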
.gitignore
DELETED
@@ -1,43 +0,0 @@
-# Python
-__pycache__/
-*.py[cod]
-*$py.class
-*.so
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-
-# Virtual Environment
-venv/
-env/
-ENV/
-
-# IDE
-.idea/
-.vscode/
-*.swp
-*.swo
-
-# Vercel
-.vercel/
-.env
-.env.local
-
-# Temporary files
-*.tmp
-tmp/
-temp/
-.vercel
.vercel/README.txt
ADDED
@@ -0,0 +1,11 @@
+> Why do I have a folder named ".vercel" in my project?
+The ".vercel" folder is created when you link a directory to a Vercel project.
+
+> What does the "project.json" file contain?
+The "project.json" file contains:
+- The ID of the Vercel project that you linked ("projectId")
+- The ID of the user or team your Vercel project is owned by ("orgId")
+
+> Should I commit the ".vercel" folder?
+No, you should not share the ".vercel" folder with anyone.
+Upon creation, it will be automatically added to your ".gitignore" file.
.vercel/project.json
ADDED
@@ -0,0 +1 @@
+{"projectId":"prj_kf78qZ09e6hVKbBJmTbRQxdvMb23","orgId":"team_FMqHvtRPYOsRJleJqSK4Itlg"}
README.md
CHANGED
@@ -54,158 +54,4 @@ Built with:
 
 ## 📝 License
 
-MIT License
-
-# Document Parser API
-
-A scalable FastAPI service for parsing various document formats (PDF, DOCX, TXT, HTML, Markdown) with automatic information extraction.
-
-## Features
-
-- 📄 Multi-format support (PDF, DOCX, TXT, HTML, Markdown)
-- 🔄 Asynchronous processing with background tasks
-- 🌐 Support for both file uploads and URL inputs
-- 📊 Structured information extraction
-- 🔗 Webhook support for processing notifications
-- 🚀 Highly scalable architecture
-- 🛡️ Comprehensive error handling
-- 📝 Detailed logging
-
-## Quick Start
-
-### Prerequisites
-
-- Python 3.8+
-- pip (Python package manager)
-
-### Installation
-
-1. Clone the repository:
-```bash
-git clone https://github.com/yourusername/document-parser-api.git
-cd document-parser-api
-```
-
-2. Install dependencies:
-```bash
-pip install -r requirements.txt
-```
-
-3. Run the API server:
-```bash
-python api.py
-```
-
-The API will be available at `http://localhost:8000`
-
-## API Documentation
-
-### Endpoints
-
-#### 1. Parse Document from File Upload
-```http
-POST /parse/file
-```
-- Upload a document file for parsing
-- Optional callback URL for processing notification
-- Returns a job ID for status tracking
-
-#### 2. Parse Document from URL
-```http
-POST /parse/url
-```
-- Submit a document URL for parsing
-- Optional callback URL for processing notification
-- Returns a job ID for status tracking
-
-#### 3. Check Processing Status
-```http
-GET /status/{job_id}
-```
-- Get the current status of a parsing job
-- Returns processing status and results if completed
-
-#### 4. Health Check
-```http
-GET /health
-```
-- Check if the API is running and healthy
-
-### Example Usage
-
-#### Parse File
-```python
-import requests
-
-url = "http://localhost:8000/parse/file"
-files = {"file": open("document.pdf", "rb")}
-response = requests.post(url, files=files)
-print(response.json())
-```
-
-#### Parse URL
-```python
-import requests
-
-url = "http://localhost:8000/parse/url"
-data = {
-    "url": "https://example.com/document.pdf",
-    "callback_url": "https://your-callback-url.com/webhook"
-}
-response = requests.post(url, json=data)
-print(response.json())
-```
-
-## Error Handling
-
-The API implements comprehensive error handling:
-
-- Invalid file formats
-- Failed URL downloads
-- Processing errors
-- Invalid requests
-- Server errors
-
-All errors return appropriate HTTP status codes and detailed error messages.
-
-## Scaling Considerations
-
-For production deployment, consider:
-
-1. **Job Storage**: Replace in-memory storage with Redis or a database
-2. **Load Balancing**: Deploy behind a load balancer
-3. **Worker Processes**: Adjust number of workers based on load
-4. **Rate Limiting**: Implement rate limiting for API endpoints
-5. **Monitoring**: Add metrics collection and monitoring
-
-## Development
-
-### Running Tests
-```bash
-pytest tests/
-```
-
-### Local Development
-```bash
-uvicorn api:app --reload --port 8000
-```
-
-### API Documentation
-- Swagger UI: `http://localhost:8000/docs`
-- ReDoc: `http://localhost:8000/redoc`
-
-## Contributing
-
-1. Fork the repository
-2. Create a feature branch
-3. Commit your changes
-4. Push to the branch
-5. Create a Pull Request
-
-## License
-
-This project is licensed under the MIT License - see the LICENSE file for details.
-
-## Support
-
-For support, please open an issue in the GitHub repository or contact the maintainers.
+MIT License
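The deleted README documents a submit-then-poll flow: POST /parse/file or /parse/url returns a job_id, and GET /status/{job_id} reports the job state ("queued", "processing", "completed", or "failed", matching the deleted api.py). A small client sketch of that flow; the base URL and polling interval are illustrative assumptions:

```python
# Sketch of the submit-then-poll flow described in the deleted README.
# The base URL and polling interval are illustrative assumptions.
import time
import requests

BASE_URL = "http://localhost:8000"

# Submit a document URL for parsing; the response carries a job_id.
job = requests.post(f"{BASE_URL}/parse/url",
                    json={"url": "https://example.com/document.pdf"}).json()
job_id = job["job_id"]

# Poll /status/{job_id} until the job leaves the queued/processing states.
while True:
    status = requests.get(f"{BASE_URL}/status/{job_id}").json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(1)

print(status["status"], status.get("result") or status.get("error"))
```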
api.py
DELETED
@@ -1,230 +0,0 @@
-import os
-from fastapi import FastAPI, HTTPException, UploadFile, File, BackgroundTasks
-from fastapi.middleware.cors import CORSMiddleware
-from pydantic import BaseModel, HttpUrl
-import tempfile
-import requests
-from typing import Optional, List, Dict, Any
-from dockling_parser import DocumentParser
-from dockling_parser.exceptions import ParserError, UnsupportedFormatError
-from dockling_parser.types import ParsedDocument
-import logging
-import aiofiles
-import asyncio
-from urllib.parse import urlparse
-from mangum import Mangum
-import httpx
-
-# Configure logging
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-# Initialize FastAPI app
-app = FastAPI(
-    title="Document Parser API",
-    description="A scalable API for parsing various document formats",
-    version="1.0.0"
-)
-
-# Add CORS middleware
-app.add_middleware(
-    CORSMiddleware,
-    allow_origins=["*"],
-    allow_credentials=True,
-    allow_methods=["*"],
-    allow_headers=["*"],
-)
-
-# Initialize document parser
-parser = DocumentParser()
-
-class URLInput(BaseModel):
-    url: HttpUrl
-    callback_url: Optional[HttpUrl] = None
-
-class ErrorResponse(BaseModel):
-    error: str
-    detail: Optional[str] = None
-    code: str
-
-class ParseResponse(BaseModel):
-    job_id: str
-    status: str
-    result: Optional[ParsedDocument] = None
-    error: Optional[str] = None
-
-# In-memory job storage (replace with Redis/DB in production)
-jobs = {}
-
-async def process_document_async(job_id: str, file_path: str, callback_url: Optional[str] = None):
-    """Process document asynchronously"""
-    try:
-        # Update job status
-        jobs[job_id] = {"status": "processing"}
-
-        # Parse document
-        result = parser.parse(file_path)
-
-        # Update job with result
-        jobs[job_id] = {
-            "status": "completed",
-            "result": result
-        }
-
-        # Call callback URL if provided
-        if callback_url:
-            try:
-                await notify_callback(callback_url, job_id, result)
-            except Exception as e:
-                logger.error(f"Failed to notify callback URL: {str(e)}")
-
-    except Exception as e:
-        logger.error(f"Error processing document: {str(e)}")
-        jobs[job_id] = {
-            "status": "failed",
-            "error": str(e)
-        }
-    finally:
-        # Cleanup temporary file
-        try:
-            if os.path.exists(file_path):
-                os.unlink(file_path)
-        except Exception as e:
-            logger.error(f"Error cleaning up file: {str(e)}")
-
-async def notify_callback(callback_url: str, job_id: str, result: ParsedDocument):
-    """Notify callback URL with results"""
-    try:
-        async with httpx.AsyncClient() as client:
-            await client.post(
-                callback_url,
-                json={
-                    "job_id": job_id,
-                    "result": result.dict()
-                }
-            )
-    except Exception as e:
-        logger.error(f"Failed to send callback: {str(e)}")
-
-@app.post("/parse/file", response_model=ParseResponse)
-async def parse_file(
-    background_tasks: BackgroundTasks,
-    file: UploadFile = File(...),
-    callback_url: Optional[HttpUrl] = None
-):
-    """
-    Parse a document from file upload
-    """
-    try:
-        # Create temporary file in /tmp for Vercel
-        suffix = os.path.splitext(file.filename)[1]
-        tmp_dir = "/tmp" if os.path.exists("/tmp") else tempfile.gettempdir()
-        tmp_path = os.path.join(tmp_dir, f"upload_{os.urandom(8).hex()}{suffix}")
-
-        content = await file.read()
-        with open(tmp_path, "wb") as f:
-            f.write(content)
-
-        # Generate job ID
-        job_id = f"job_{len(jobs) + 1}"
-
-        # Start background processing
-        background_tasks.add_task(
-            process_document_async,
-            job_id,
-            tmp_path,
-            str(callback_url) if callback_url else None
-        )
-
-        return ParseResponse(
-            job_id=job_id,
-            status="queued"
-        )
-
-    except Exception as e:
-        logger.error(f"Error handling file upload: {str(e)}")
-        raise HTTPException(
-            status_code=500,
-            detail=str(e)
-        )
-
-@app.post("/parse/url", response_model=ParseResponse)
-async def parse_url(input_data: URLInput, background_tasks: BackgroundTasks):
-    """
-    Parse a document from URL
-    """
-    try:
-        # Download file
-        async with httpx.AsyncClient() as client:
-            response = await client.get(str(input_data.url), follow_redirects=True)
-            response.raise_for_status()
-
-        # Get filename from URL or use default
-        filename = os.path.basename(urlparse(str(input_data.url)).path)
-        if not filename:
-            filename = "document.pdf"
-
-        # Save to temporary file in /tmp for Vercel
-        tmp_dir = "/tmp" if os.path.exists("/tmp") else tempfile.gettempdir()
-        tmp_path = os.path.join(tmp_dir, f"download_{os.urandom(8).hex()}{os.path.splitext(filename)[1]}")
-
-        with open(tmp_path, "wb") as f:
-            f.write(response.content)
-
-        # Generate job ID
-        job_id = f"job_{len(jobs) + 1}"
-
-        # Start background processing
-        background_tasks.add_task(
-            process_document_async,
-            job_id,
-            tmp_path,
-            str(input_data.callback_url) if input_data.callback_url else None
-        )
-
-        return ParseResponse(
-            job_id=job_id,
-            status="queued"
-        )
-
-    except httpx.RequestError as e:
-        logger.error(f"Error downloading file: {str(e)}")
-        raise HTTPException(
-            status_code=400,
-            detail=f"Error downloading file: {str(e)}"
-        )
-    except Exception as e:
-        logger.error(f"Error processing URL: {str(e)}")
-        raise HTTPException(
-            status_code=500,
-            detail=str(e)
-        )
-
-@app.get("/status/{job_id}", response_model=ParseResponse)
-async def get_status(job_id: str):
-    """
-    Get the status of a parsing job
-    """
-    if job_id not in jobs:
-        raise HTTPException(
-            status_code=404,
-            detail="Job not found"
-        )
-
-    job = jobs[job_id]
-    return ParseResponse(
-        job_id=job_id,
-        status=job["status"],
-        result=job.get("result"),
-        error=job.get("error")
-    )
-
-@app.get("/health")
-async def health_check():
-    """
-    Health check endpoint
-    """
-    return {"status": "healthy"}
-
-# Handler for Vercel
-handler = Mangum(app, lifespan="off")
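When a callback_url is supplied, the deleted notify_callback above POSTs a JSON body of the form {"job_id": ..., "result": ...} to that URL once parsing finishes. A minimal sketch of a receiver for that webhook; the route path, port, and module name are illustrative assumptions:

```python
# Minimal sketch of a webhook receiver for the callback POSTed by the deleted
# notify_callback(): a JSON body carrying "job_id" and the parsed "result".
# The route path and port are illustrative assumptions.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhook")
async def receive_parse_result(request: Request):
    payload = await request.json()
    # payload["job_id"] identifies the parse job; payload["result"] holds the
    # serialized ParsedDocument produced by the parser.
    print("job", payload["job_id"], "finished")
    return {"ok": True}

# Run with uvicorn, e.g.: uvicorn <this_module>:app --port 9000
```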
api/index.py
DELETED
@@ -1,201 +0,0 @@
-import json
-import os
-import tempfile
-import magic
-import requests
-import datetime
-
-def is_valid_file(file_data):
-    """Check if file type is allowed using python-magic"""
-    try:
-        mime = magic.from_buffer(file_data, mime=True)
-        allowed_mimes = [
-            'application/pdf',
-            'text/plain',
-            'text/html',
-            'text/markdown',
-            'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
-        ]
-        return mime in allowed_mimes
-    except Exception:
-        return False
-
-def download_file(url):
-    """Download file from URL and save to temp file"""
-    try:
-        response = requests.get(url, stream=True, timeout=10)
-        response.raise_for_status()
-
-        # Get content type
-        content_type = response.headers.get('content-type', '').split(';')[0]
-        if content_type not in [
-            'application/pdf',
-            'text/plain',
-            'text/html',
-            'text/markdown',
-            'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
-        ]:
-            raise ValueError(f"Unsupported content type: {content_type}")
-
-        # Create temp file with proper extension
-        ext = {
-            'application/pdf': '.pdf',
-            'text/plain': '.txt',
-            'text/html': '.html',
-            'text/markdown': '.md',
-            'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx'
-        }.get(content_type, '')
-
-        fd, temp_path = tempfile.mkstemp(suffix=ext)
-        os.close(fd)
-
-        # Download file
-        with open(temp_path, 'wb') as f:
-            for chunk in response.iter_content(chunk_size=8192):
-                f.write(chunk)
-
-        return temp_path, content_type
-    except Exception as e:
-        raise ValueError(f"Failed to download file: {str(e)}")
-
-def handle_root():
-    return {
-        "status": "ok",
-        "message": "Document Processing API",
-        "version": "1.0.0"
-    }
-
-def handle_health():
-    return {
-        "status": "healthy",
-        "timestamp": str(datetime.datetime.now(datetime.UTC))
-    }
-
-def handle_parse_file(file_data):
-    if not file_data:
-        raise ValueError("No file provided")
-
-    if not is_valid_file(file_data):
-        raise ValueError("Invalid file type")
-
-    fd, temp_path = tempfile.mkstemp()
-    os.close(fd)
-
-    try:
-        with open(temp_path, 'wb') as f:
-            f.write(file_data)
-
-        return {
-            "status": "success",
-            "message": "File processed successfully",
-            "metadata": {
-                "size": os.path.getsize(temp_path),
-                "mime_type": magic.from_file(temp_path, mime=True)
-            }
-        }
-    finally:
-        try:
-            os.unlink(temp_path)
-        except:
-            pass
-
-def handle_parse_url(url):
-    if not url:
-        raise ValueError("No URL provided")
-
-    if not url.startswith(('http://', 'https://')):
-        raise ValueError("Invalid URL")
-
-    temp_path, content_type = download_file(url)
-    try:
-        return {
-            "status": "success",
-            "message": "URL processed successfully",
-            "metadata": {
-                "url": url,
-                "content_type": content_type,
-                "size": os.path.getsize(temp_path)
-            }
-        }
-    finally:
-        try:
-            os.unlink(temp_path)
-        except:
-            pass
-
-def handler(request, context):
-    """Vercel serverless handler"""
-
-    # Add CORS headers to all responses
-    cors_headers = {
-        "Access-Control-Allow-Origin": "*",
-        "Access-Control-Allow-Headers": "Content-Type, Authorization",
-        "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
-        "Content-Type": "application/json"
-    }
-
-    # Handle OPTIONS request for CORS
-    if request.get('httpMethod') == 'OPTIONS':
-        return {
-            "statusCode": 204,
-            "headers": cors_headers,
-            "body": ""
-        }
-
-    try:
-        # Get path and method
-        path = request.get('path', '/').rstrip('/')
-        method = request.get('httpMethod', 'GET')
-
-        # Route request
-        response_data = None
-        if path == '' or path == '/':
-            response_data = handle_root()
-        elif path == '/health':
-            response_data = handle_health()
-        elif path == '/parse/file' and method == 'POST':
-            file_data = request.get('body', '')
-            if request.get('isBase64Encoded', False):
-                import base64
-                file_data = base64.b64decode(file_data)
-            response_data = handle_parse_file(file_data)
-        elif path == '/parse/url' and method == 'POST':
-            try:
-                body = json.loads(request.get('body', '{}'))
-            except:
-                raise ValueError("Invalid JSON")
-            response_data = handle_parse_url(body.get('url'))
-        else:
-            return {
-                "statusCode": 404,
-                "headers": cors_headers,
-                "body": json.dumps({
-                    "error": "Not Found",
-                    "details": f"Path {path} not found"
-                })
-            }
-
-        return {
-            "statusCode": 200,
-            "headers": cors_headers,
-            "body": json.dumps(response_data)
-        }
-
-    except ValueError as e:
-        return {
-            "statusCode": 400,
-            "headers": cors_headers,
-            "body": json.dumps({
-                "error": "Bad Request",
-                "details": str(e)
-            })
-        }
-    except Exception as e:
-        return {
-            "statusCode": 500,
-            "headers": cors_headers,
-            "body": json.dumps({
-                "error": "Internal Server Error",
-                "details": str(e)
-            })
-        }
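The deleted handler speaks a plain dict protocol: it receives a request dict with path, httpMethod, body, and isBase64Encoded keys and returns a dict with statusCode, headers, and body. That makes it easy to exercise locally without Vercel; a small sketch, assuming the file were still importable as api.index:

```python
# Sketch of invoking the deleted Vercel handler locally through its dict
# protocol. Assumes the module is importable as api.index (an assumption
# that no longer holds after this rollback).
import json
from api.index import handler

# A GET to /health, expressed as the event dict the handler expects.
event = {"path": "/health", "httpMethod": "GET"}
response = handler(event, None)  # the context argument is unused by the handler

print(response["statusCode"])        # 200 on success
print(json.loads(response["body"]))  # {"status": "healthy", "timestamp": ...}
```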
requirements-prod.txt
DELETED
@@ -1,11 +0,0 @@
-docling>=0.2.0
-pydantic>=2.0.0
-python-magic>=0.4.27
-PyPDF2>=3.0.0
-beautifulsoup4>=4.12.0
-lxml>=4.9.0
-requests>=2.31.0
-fastapi>=0.104.0
-python-multipart>=0.0.6
-httpx>=0.25.0
-mangum>=0.17.0
requirements.txt
CHANGED
@@ -1,4 +1,13 @@
-
-
-python-magic
-
+docling>=0.2.0
+pydantic>=2.0.0
+python-magic>=0.4.27
+python-docx>=0.8.11
+PyPDF2>=3.0.0
+beautifulsoup4>=4.12.0
+lxml>=4.9.0
+gradio>=4.44.1
+pandas>=1.5.0
+huggingface-hub>=0.19.0
+python-magic-bin>=0.4.14; platform_system == "Windows"
+libmagic; platform_system == "Linux"
+requests>=2.31.0
vercel.json
DELETED
@@ -1,25 +0,0 @@
-{
-  "version": 2,
-  "builds": [
-    {
-      "src": "api/index.py",
-      "use": "@vercel/python",
-      "config": {
-        "maxLambdaSize": "10mb",
-        "runtime": "python3.9"
-      }
-    }
-  ],
-  "routes": [
-    {
-      "src": "/(.*)",
-      "dest": "api/index.py",
-      "headers": {
-        "Access-Control-Allow-Origin": "*",
-        "Access-Control-Allow-Headers": "Content-Type, Authorization",
-        "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
-        "Access-Control-Max-Age": "86400"
-      }
-    }
-  ]
-}