Spaces commit 4e0ee33 · "Uploading from VS"

Files changed:
- .gradio/certificate.pem +31 -0
- README.md +82 -0
- __pycache__/utils.cpython-312.pyc +0 -0
- app.py +54 -0
- requirements.txt +13 -0
- utils.py +141 -0
.gradio/certificate.pem
ADDED
@@ -0,0 +1,31 @@
+-----BEGIN CERTIFICATE-----
+MIIFazCCA1OgAwIBAgIRAIIQz7DSQONZRGPgu2OCiwAwDQYJKoZIhvcNAQELBQAw
+TzELMAkGA1UEBhMCVVMxKTAnBgNVBAoTIEludGVybmV0IFNlY3VyaXR5IFJlc2Vh
+cmNoIEdyb3VwMRUwEwYDVQQDEwxJU1JHIFJvb3QgWDEwHhcNMTUwNjA0MTEwNDM4
+WhcNMzUwNjA0MTEwNDM4WjBPMQswCQYDVQQGEwJVUzEpMCcGA1UEChMgSW50ZXJu
+ZXQgU2VjdXJpdHkgUmVzZWFyY2ggR3JvdXAxFTATBgNVBAMTDElTUkcgUm9vdCBY
+MTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAK3oJHP0FDfzm54rVygc
+h77ct984kIxuPOZXoHj3dcKi/vVqbvYATyjb3miGbESTtrFj/RQSa78f0uoxmyF+
+0TM8ukj13Xnfs7j/EvEhmkvBioZxaUpmZmyPfjxwv60pIgbz5MDmgK7iS4+3mX6U
+A5/TR5d8mUgjU+g4rk8Kb4Mu0UlXjIB0ttov0DiNewNwIRt18jA8+o+u3dpjq+sW
+T8KOEUt+zwvo/7V3LvSye0rgTBIlDHCNAymg4VMk7BPZ7hm/ELNKjD+Jo2FR3qyH
+B5T0Y3HsLuJvW5iB4YlcNHlsdu87kGJ55tukmi8mxdAQ4Q7e2RCOFvu396j3x+UC
+B5iPNgiV5+I3lg02dZ77DnKxHZu8A/lJBdiB3QW0KtZB6awBdpUKD9jf1b0SHzUv
+KBds0pjBqAlkd25HN7rOrFleaJ1/ctaJxQZBKT5ZPt0m9STJEadao0xAH0ahmbWn
+OlFuhjuefXKnEgV4We0+UXgVCwOPjdAvBbI+e0ocS3MFEvzG6uBQE3xDk3SzynTn
+jh8BCNAw1FtxNrQHusEwMFxIt4I7mKZ9YIqioymCzLq9gwQbooMDQaHWBfEbwrbw
+qHyGO0aoSCqI3Haadr8faqU9GY/rOPNk3sgrDQoo//fb4hVC1CLQJ13hef4Y53CI
+rU7m2Ys6xt0nUW7/vGT1M0NPAgMBAAGjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNV
+HRMBAf8EBTADAQH/MB0GA1UdDgQWBBR5tFnme7bl5AFzgAiIyBpY9umbbjANBgkq
+hkiG9w0BAQsFAAOCAgEAVR9YqbyyqFDQDLHYGmkgJykIrGF1XIpu+ILlaS/V9lZL
+ubhzEFnTIZd+50xx+7LSYK05qAvqFyFWhfFQDlnrzuBZ6brJFe+GnY+EgPbk6ZGQ
+3BebYhtF8GaV0nxvwuo77x/Py9auJ/GpsMiu/X1+mvoiBOv/2X/qkSsisRcOj/KK
+NFtY2PwByVS5uCbMiogziUwthDyC3+6WVwW6LLv3xLfHTjuCvjHIInNzktHCgKQ5
+ORAzI4JMPJ+GslWYHb4phowim57iaztXOoJwTdwJx4nLCgdNbOhdjsnvzqvHu7Ur
+TkXWStAmzOVyyghqpZXjFaH3pO3JLF+l+/+sKAIuvtd7u+Nxe5AW0wdeRlN8NwdC
+jNPElpzVmbUq4JUagEiuTDkHzsxHpFKVK7q4+63SM1N95R1NbdWhscdCb+ZAJzVc
+oyi3B43njTOQ5yOf+1CceWxG1bQVs5ZufpsMljq4Ui0/1lvh+wjChP4kqKOJ2qxq
+4RgqsahDYVvTH9w7jXbyLeiNdd8XM2w9U/t7y0Ff/9yi0GE44Za4rF2LN9d11TPA
+mRGunUHBcnWEvgJBQl9nJEiU0Zsnvgc/ubhPgXRR4Xq37Z0j4r7g1SgEEzwxA57d
+emyPxgcYxn/eR44/KJ4EBs+lVDR3veyJm+kXQ99b21/+jh5Xos1AnX5iItreGCc=
+-----END CERTIFICATE-----
README.md
ADDED
@@ -0,0 +1,82 @@
+# ChatDocxAI
+
+ChatDocxAI is a document analysis and question-answering system that leverages Google's Gemini AI to provide intelligent insights from your documents. Upload files in a variety of formats and ask natural-language questions about their contents. Whether you're analyzing research papers, business reports, or any other document type, ChatDocxAI helps you extract meaningful information through simple questions.
+
+The system uses natural language processing techniques to:
+- Break down complex documents into digestible chunks
+- Maintain context awareness across the entire document
+- Provide accurate, contextually relevant answers
+- Handle multiple document formats seamlessly
+
+It is well suited to researchers, business analysts, students, and anyone who needs to extract information from documents quickly without reading them end to end.
+
+## Features
+
+- Support for multiple document formats:
+  - PDF (.pdf)
+  - Text (.txt)
+  - Word Documents (.docx)
+  - Excel Spreadsheets (.xlsx)
+  - PowerPoint Presentations (.pptx)
+  - XML files (.xml)
+  - CSV files (.csv)
+  - JSON files (.json)
+- Interactive web interface using Gradio
+- Intelligent document processing and chunking
+- Advanced context retrieval for accurate answers
+- Query multiple documents at once
+
+## Installation
+
+1. Clone this repository:
+```bash
+git clone https://github.com/yourusername/ChatDocxAI.git
+cd ChatDocxAI
+```
+
+2. Install the required dependencies:
+```bash
+pip install -r requirements.txt
+```
+
+## Usage
+
+1. Run the application:
+```bash
+python app.py
+```
+
+2. The application starts and prints two URLs:
+   - A local URL (http://127.0.0.1:7860) for local access
+   - A public URL (https://xxx.gradio.live) that anyone can access
+
+   You can share the public URL with others to let them interact with your document Q&A system.
+
+3. Upload your documents using the file upload interface.
+
+4. Click "Process Document" to analyze your uploaded files.
+
+5. Ask questions about your documents in the question box.
+
+## How It Works
+
+1. **Document Processing**: Uploaded documents are parsed and split into manageable chunks.
+2. **Vector Storage**: The chunks are embedded as vectors and stored in a FAISS index for efficient retrieval.
+3. **Question Handling**: When you ask a question, the system:
+   - Retrieves the most relevant chunks from your documents
+   - Builds a context-rich prompt
+   - Uses Gemini AI to generate an answer grounded in that context
+
+## Dependencies
+
+- gradio: For the web interface
+- google-genai: For Gemini AI integration
+- langchain: For document processing and chain operations
+- faiss-cpu: For vector storage and retrieval
+- sentence-transformers: For text embeddings
+- unstructured: For document parsing
+
+## Contributing
+
+Contributions are welcome! Please feel free to submit a Pull Request.
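The pipeline described under "How It Works" can also be driven without the web UI, using the functions this commit adds in utils.py. A minimal sketch, assuming `GOOGLE_API_KEY` is set; the file name `notes.txt` is hypothetical, and any supported loader would do:

```python
# Minimal end-to-end sketch of the README pipeline, bypassing Gradio.
from langchain_community.document_loaders import TextLoader

from utils import (
    authenticate,
    split_documents,
    build_vectorstore,
    retrieve_context,
    build_prompt,
    ask_gemini,
)

client = authenticate()                    # reads GOOGLE_API_KEY or prompts for it
docs = TextLoader("notes.txt").load()      # hypothetical sample document
chunks = split_documents(docs)             # 500-char chunks, 100-char overlap
store = build_vectorstore(chunks)          # FAISS index + chunk texts

query = "What is the main topic?"
context = retrieve_context(query, store)   # top-k most similar chunks
print(ask_gemini(build_prompt(context, query), client))
```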
__pycache__/utils.cpython-312.pyc
ADDED
Binary file (7.99 kB)
app.py
ADDED
@@ -0,0 +1,54 @@
+import gradio as gr
+from utils import (
+    authenticate,
+    split_documents,
+    build_vectorstore,
+    retrieve_context,
+    retrieve_context_approx,
+    build_prompt,
+    ask_gemini,
+    load_documents_gradio,
+)
+
+client = authenticate()
+store = {"value": None}  # shared slot for the vectorstore across callbacks
+
+
+def upload_and_process(files):
+    if files is None:
+        return "Please upload a file!"
+
+    raw_docs = load_documents_gradio(files)
+    chunks = split_documents(raw_docs)
+    store["value"] = build_vectorstore(chunks)
+    return "Documents processed successfully! You can now ask questions."
+
+
+def handle_question(query):
+    if store["value"] is None:
+        return "Please upload and process a document first."
+
+    # Exact search for small indexes; approximate (IVF) search for larger ones.
+    if store["value"]["chunks"] <= 50:
+        top_chunks = retrieve_context(query, store["value"])
+    else:
+        top_chunks = retrieve_context_approx(query, store["value"])
+
+    prompt = build_prompt(top_chunks, query)
+    answer = ask_gemini(prompt, client)
+    return f"### My Insights:\n\n{answer.strip()}"
+
+
+with gr.Blocks() as demo:
+    gr.Markdown("## Ask Questions from Your Uploaded Documents")
+    file_input = gr.File(
+        label="Upload Your File",
+        file_types=[".pdf", ".txt", ".docx", ".csv", ".json", ".pptx", ".xml", ".xlsx"],
+        file_count="multiple",
+    )
+
+    process_btn = gr.Button("Process Document")
+    status = gr.Textbox(label="Processing Status")
+
+    question = gr.Textbox(label="Ask a Question")
+    answer = gr.Markdown()
+
+    process_btn.click(upload_and_process, inputs=file_input, outputs=status)
+    question.submit(handle_question, inputs=question, outputs=answer)
+
+demo.launch(share=True)
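A note on the final line: `share=True` asks Gradio to open a temporary public `*.gradio.live` tunnel alongside the local server. If only local access is needed the call can be trimmed; a hedged sketch of common variants, where the host and port values are illustrative and the `gr.Markdown` body merely stands in for the real UI above:

```python
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("placeholder")  # stand-in for the UI defined in app.py

# Local-only, fixed port (no public tunnel):
demo.launch(server_name="127.0.0.1", server_port=7860)

# Reachable from machines on the same network instead:
# demo.launch(server_name="0.0.0.0", server_port=7860)
```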
requirements.txt
ADDED
@@ -0,0 +1,13 @@
+gradio
+google-genai
+langchain
+langchain-community
+langchain-google-genai
+google-generativeai
+faiss-cpu
+sentence-transformers
+unstructured[pdf]
+unstructured[docx]
+unstructured[ppt]
+unstructured[excel]
+unstructured[xml]
utils.py
ADDED
@@ -0,0 +1,141 @@
+import os
+import getpass
+import logging
+import warnings
+
+import faiss
+import numpy as np
+
+# Suppress noisy pdfminer/unstructured warnings
+logging.getLogger("pdfminer").setLevel(logging.ERROR)
+warnings.filterwarnings("ignore")
+
+from google import genai
+from google.genai import types
+from sentence_transformers import SentenceTransformer
+from langchain_community.document_loaders import (
+    UnstructuredPDFLoader,
+    TextLoader,
+    CSVLoader,
+    JSONLoader,
+    UnstructuredPowerPointLoader,
+    UnstructuredExcelLoader,
+    UnstructuredXMLLoader,
+    UnstructuredWordDocumentLoader,
+)
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+
+def authenticate():
+    """Authenticates with the Google Generative AI API using an API key."""
+    api_key = os.environ.get("GOOGLE_API_KEY")
+    if not api_key:
+        api_key = getpass.getpass("Enter your API Key: ")
+
+    client = genai.Client(api_key=api_key)
+    return client
+
+
+def load_documents_gradio(uploaded_files):
+    """Loads each uploaded file with the loader matching its extension."""
+    docs = []
+    for file in uploaded_files:
+        file_path = file.name
+        if file_path.lower().endswith(".pdf"):
+            docs.extend(UnstructuredPDFLoader(file_path).load())
+        elif file_path.lower().endswith(".txt"):
+            docs.extend(TextLoader(file_path).load())
+        elif file_path.lower().endswith(".csv"):
+            docs.extend(CSVLoader(file_path).load())
+        elif file_path.lower().endswith(".json"):
+            # JSONLoader requires a jq schema (and the `jq` package, not pinned
+            # in requirements.txt); "." loads the whole document as one record.
+            docs.extend(JSONLoader(file_path, jq_schema=".", text_content=False).load())
+        elif file_path.lower().endswith(".pptx"):
+            docs.extend(UnstructuredPowerPointLoader(file_path).load())
+        elif file_path.lower().endswith(".xlsx"):
+            docs.extend(UnstructuredExcelLoader(file_path).load())
+        elif file_path.lower().endswith(".xml"):
+            docs.extend(UnstructuredXMLLoader(file_path).load())
+        elif file_path.lower().endswith(".docx"):
+            docs.extend(UnstructuredWordDocumentLoader(file_path).load())
+        else:
+            print(f"Unsupported File Type: {file_path}")
+    return docs
+
+
+def split_documents(docs, chunk_size=500, chunk_overlap=100):
+    """Splits documents into smaller chunks using RecursiveCharacterTextSplitter."""
+    splitter = RecursiveCharacterTextSplitter(
+        chunk_size=chunk_size, chunk_overlap=chunk_overlap
+    )
+    return splitter.split_documents(docs)
+
+
+def build_vectorstore(docs, embedding_model_name="all-MiniLM-L6-v2"):
+    """Builds a FAISS vector store from the document chunks."""
+    texts = [doc.page_content.strip() for doc in docs if doc.page_content.strip()]
+    if not texts:
+        raise ValueError("No valid text found in the documents.")
+
+    print(f"No. of Chunks: {len(texts)}")
+
+    model = SentenceTransformer(embedding_model_name)
+    embeddings = model.encode(texts)
+    print(embeddings.shape)
+
+    index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 search
+    index.add(np.array(embeddings).astype("float32"))
+
+    return {
+        "index": index,
+        "texts": texts,
+        "embedding_model": model,
+        "embeddings": embeddings,
+        "chunks": len(texts),
+    }
+
+
+def retrieve_context(query, store, k=6):
+    """Retrieves the top-k context chunks most similar to the query."""
+    query_vec = store["embedding_model"].encode([query])
+    k = min(k, len(store["texts"]))
+    distances, indices = store["index"].search(np.array(query_vec).astype("float32"), k)
+    return [store["texts"][i] for i in indices[0]]
+
+
+def retrieve_context_approx(query, store, k=6):
+    """Retrieves context chunks using approximate nearest-neighbour (IVF) search."""
+    ncells = 50  # number of Voronoi cells (clusters) in the IVF index
+    D = store["index"].d
+    quantizer = faiss.IndexFlatL2(D)
+    nindex = faiss.IndexIVFFlat(quantizer, D, ncells)
+    nindex.nprobe = 10  # cells probed per query
+
+    # Note: the IVF index is trained and refilled on every call; training
+    # needs at least `ncells` vectors to form the clusters.
+    if not nindex.is_trained:
+        nindex.train(np.array(store["embeddings"]).astype("float32"))
+
+    nindex.add(np.array(store["embeddings"]).astype("float32"))
+    query_vec = store["embedding_model"].encode([query])
+    k = min(k, len(store["texts"]))
+    _, indices = nindex.search(np.array(query_vec).astype("float32"), k)
+    return [store["texts"][i] for i in indices[0]]
+
+
+def build_prompt(context_chunks, query):
+    """Builds the prompt for the Gemini API using context and query."""
+    context = "\n".join(context_chunks)
+    return f"""You are a highly knowledgeable and helpful assistant. Use the following context to generate a **detailed and step-by-step** answer to the user's question. Include explanations, examples, and reasoning wherever helpful.
+
+Context:
+{context}
+
+Question: {query}
+Answer:"""
+
+
+def ask_gemini(prompt, client):
+    """Calls the Gemini API with the given prompt and returns the response."""
+    response = client.models.generate_content(
+        model="gemini-2.0-flash",
+        contents=[prompt],
+        config=types.GenerateContentConfig(max_output_tokens=2048, temperature=0.5, seed=42),
+    )
+    return response.text
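One design note on `retrieve_context_approx`: it trains and repopulates the IVF index on every question, which for small corpora costs more than the approximate search saves. A sketch of one possible refinement, training the index once at build time so queries only pay for the search; this is an assumption about a future change, not part of this commit:

```python
import faiss
import numpy as np

def build_ivf_index(embeddings, ncells=50, nprobe=10):
    """Hypothetical helper: train-once IVF index over an embedding matrix."""
    d = embeddings.shape[1]
    quantizer = faiss.IndexFlatL2(d)       # coarse quantizer defining the cells
    index = faiss.IndexIVFFlat(quantizer, d, ncells)
    xb = np.asarray(embeddings, dtype="float32")
    index.train(xb)                        # k-means clustering; needs >= ncells vectors
    index.add(xb)
    index.nprobe = nprobe                  # cells searched per query
    return index

# Example with random vectors standing in for sentence embeddings
# (384 dims matches all-MiniLM-L6-v2):
xb = np.random.rand(1000, 384).astype("float32")
idx = build_ivf_index(xb)
_, ids = idx.search(xb[:1], 6)             # indices of the 6 nearest chunks
```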