Codegeass321 committed
Commit 4e0ee33 · 0 Parent(s)

Uploading from VS
Files changed (6)
  1. .gradio/certificate.pem +31 -0
  2. README.md +82 -0
  3. __pycache__/utils.cpython-312.pyc +0 -0
  4. app.py +54 -0
  5. requirements.txt +13 -0
  6. utils.py +141 -0
.gradio/certificate.pem ADDED
@@ -0,0 +1,31 @@
+ -----BEGIN CERTIFICATE-----
+ MIIFazCCA1OgAwIBAgIRAIIQz7DSQONZRGPgu2OCiwAwDQYJKoZIhvcNAQELBQAw
+ TzELMAkGA1UEBhMCVVMxKTAnBgNVBAoTIEludGVybmV0IFNlY3VyaXR5IFJlc2Vh
+ cmNoIEdyb3VwMRUwEwYDVQQDEwxJU1JHIFJvb3QgWDEwHhcNMTUwNjA0MTEwNDM4
+ WhcNMzUwNjA0MTEwNDM4WjBPMQswCQYDVQQGEwJVUzEpMCcGA1UEChMgSW50ZXJu
+ ZXQgU2VjdXJpdHkgUmVzZWFyY2ggR3JvdXAxFTATBgNVBAMTDElTUkcgUm9vdCBY
+ MTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAK3oJHP0FDfzm54rVygc
+ h77ct984kIxuPOZXoHj3dcKi/vVqbvYATyjb3miGbESTtrFj/RQSa78f0uoxmyF+
+ 0TM8ukj13Xnfs7j/EvEhmkvBioZxaUpmZmyPfjxwv60pIgbz5MDmgK7iS4+3mX6U
+ A5/TR5d8mUgjU+g4rk8Kb4Mu0UlXjIB0ttov0DiNewNwIRt18jA8+o+u3dpjq+sW
+ T8KOEUt+zwvo/7V3LvSye0rgTBIlDHCNAymg4VMk7BPZ7hm/ELNKjD+Jo2FR3qyH
+ B5T0Y3HsLuJvW5iB4YlcNHlsdu87kGJ55tukmi8mxdAQ4Q7e2RCOFvu396j3x+UC
+ B5iPNgiV5+I3lg02dZ77DnKxHZu8A/lJBdiB3QW0KtZB6awBdpUKD9jf1b0SHzUv
+ KBds0pjBqAlkd25HN7rOrFleaJ1/ctaJxQZBKT5ZPt0m9STJEadao0xAH0ahmbWn
+ OlFuhjuefXKnEgV4We0+UXgVCwOPjdAvBbI+e0ocS3MFEvzG6uBQE3xDk3SzynTn
+ jh8BCNAw1FtxNrQHusEwMFxIt4I7mKZ9YIqioymCzLq9gwQbooMDQaHWBfEbwrbw
+ qHyGO0aoSCqI3Haadr8faqU9GY/rOPNk3sgrDQoo//fb4hVC1CLQJ13hef4Y53CI
+ rU7m2Ys6xt0nUW7/vGT1M0NPAgMBAAGjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNV
+ HRMBAf8EBTADAQH/MB0GA1UdDgQWBBR5tFnme7bl5AFzgAiIyBpY9umbbjANBgkq
+ hkiG9w0BAQsFAAOCAgEAVR9YqbyyqFDQDLHYGmkgJykIrGF1XIpu+ILlaS/V9lZL
+ ubhzEFnTIZd+50xx+7LSYK05qAvqFyFWhfFQDlnrzuBZ6brJFe+GnY+EgPbk6ZGQ
+ 3BebYhtF8GaV0nxvwuo77x/Py9auJ/GpsMiu/X1+mvoiBOv/2X/qkSsisRcOj/KK
+ NFtY2PwByVS5uCbMiogziUwthDyC3+6WVwW6LLv3xLfHTjuCvjHIInNzktHCgKQ5
+ ORAzI4JMPJ+GslWYHb4phowim57iaztXOoJwTdwJx4nLCgdNbOhdjsnvzqvHu7Ur
+ TkXWStAmzOVyyghqpZXjFaH3pO3JLF+l+/+sKAIuvtd7u+Nxe5AW0wdeRlN8NwdC
+ jNPElpzVmbUq4JUagEiuTDkHzsxHpFKVK7q4+63SM1N95R1NbdWhscdCb+ZAJzVc
+ oyi3B43njTOQ5yOf+1CceWxG1bQVs5ZufpsMljq4Ui0/1lvh+wjChP4kqKOJ2qxq
+ 4RgqsahDYVvTH9w7jXbyLeiNdd8XM2w9U/t7y0Ff/9yi0GE44Za4rF2LN9d11TPA
+ mRGunUHBcnWEvgJBQl9nJEiU0Zsnvgc/ubhPgXRR4Xq37Z0j4r7g1SgEEzwxA57d
+ emyPxgcYxn/eR44/KJ4EBs+lVDR3veyJm+kXQ99b21/+jh5Xos1AnX5iItreGCc=
+ -----END CERTIFICATE-----
README.md ADDED
@@ -0,0 +1,82 @@
+ # ChatDocxAI
+
+ ChatDocxAI is a document analysis and question-answering system that uses Google's Gemini AI to answer questions about your documents. Upload files in a variety of formats and ask questions about their contents in natural language. Whether you're analyzing research papers, business reports, or any other document type, ChatDocxAI extracts the information you need without requiring you to read everything yourself.
+
+ The system processes documents in several steps:
+ - Breaks complex documents into digestible chunks
+ - Maintains context awareness across the entire document
+ - Provides accurate, contextually relevant answers
+ - Handles multiple document formats seamlessly
+
+ Useful for researchers, business analysts, students, and anyone who needs to extract information from documents quickly.
+
+ ## Features
+
+ - Support for multiple document formats:
+   - PDF (.pdf)
+   - Text (.txt)
+   - Word documents (.docx)
+   - Excel spreadsheets (.xlsx)
+   - PowerPoint presentations (.pptx)
+   - XML files (.xml)
+   - CSV files (.csv)
+   - JSON files (.json)
+ - Interactive web interface built with Gradio
+ - Intelligent document processing and chunking
+ - Context retrieval for accurate answers
+ - Process multiple documents at once
+
+ ## Installation
+
+ 1. Clone this repository:
+    ```bash
+    git clone https://github.com/yourusername/ChatDocxAI.git
+    cd ChatDocxAI
+    ```
+
+ 2. Install the required dependencies:
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ ## Usage
+
+ 1. Run the application:
+    ```bash
+    python app.py
+    ```
+
+ 2. The application starts and prints two URLs:
+    - A local URL (http://127.0.0.1:7860) for local access
+    - A public URL (https://xxx.gradio.live) that anyone can access
+
+    You can share the public URL with others so they can use your document Q&A system.
+
+ 3. Upload your documents using the file upload interface.
+
+ 4. Click "Process Document" to analyze the uploaded files.
+
+ 5. Ask questions about your documents in the question box.
+
+ ## How It Works
+
+ 1. **Document Processing**: Uploaded documents are parsed and split into manageable chunks.
+ 2. **Vector Storage**: The chunks are embedded as vectors and stored in a FAISS index for efficient retrieval.
+ 3. **Question Handling**: When you ask a question, the system:
+    - Retrieves the most relevant chunks from your documents
+    - Builds a prompt from that context
+    - Uses Gemini AI to generate an answer
+
+ ## Dependencies
+
+ - gradio: web interface
+ - google-genai: Gemini AI integration
+ - langchain: document loading and splitting
+ - faiss-cpu: vector storage and retrieval
+ - sentence-transformers: text embeddings
+ - unstructured: document parsing
+
+ ## Contributing
+
+ Contributions are welcome! Please feel free to submit a pull request.
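The chunking step described under "How It Works" can be illustrated without any dependencies. The repo itself uses LangChain's `RecursiveCharacterTextSplitter`; `chunk_text` below is a simplified, illustrative stand-in that shows what `chunk_size=500, chunk_overlap=100` means in practice:

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Fixed-size overlapping windows: a simplified stand-in for
    RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # A trailing window no longer than the overlap is already contained
    # in the previous chunk, so drop it.
    if len(chunks) > 1 and len(chunks[-1]) <= chunk_overlap:
        chunks.pop()
    return chunks

text = "".join(chr(65 + i % 26) for i in range(1200))
chunks = chunk_text(text)
print([len(c) for c in chunks])             # [500, 500, 400]
print(chunks[0][-100:] == chunks[1][:100])  # True: consecutive chunks share 100 chars
```

The overlap ensures that a sentence straddling a chunk boundary still appears whole in at least one chunk, which is why retrieval rarely loses boundary context.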
__pycache__/utils.cpython-312.pyc ADDED
Binary file (7.99 kB)
 
app.py ADDED
@@ -0,0 +1,54 @@
+ import gradio as gr
+ from utils import (
+     authenticate,
+     split_documents,
+     build_vectorstore,
+     retrieve_context,
+     retrieve_context_approx,
+     build_prompt,
+     ask_gemini,
+     load_documents_gradio,
+ )
+
+ client = authenticate()
+ store = {"value": None}  # holds the vector store once documents are processed
+
+
+ def upload_and_process(files):
+     if not files:
+         return "Please upload a file!"
+
+     raw_docs = load_documents_gradio(files)
+     chunks = split_documents(raw_docs)
+     store["value"] = build_vectorstore(chunks)
+     return "Document processed successfully! You can now ask questions."
+
+
+ def handle_question(query):
+     if store["value"] is None:
+         return "Please upload and process a document first."
+
+     # Exact search for small corpora; approximate (IVF) search for larger ones.
+     if store["value"]["chunks"] <= 50:
+         top_chunks = retrieve_context(query, store["value"])
+     else:
+         top_chunks = retrieve_context_approx(query, store["value"])
+
+     prompt = build_prompt(top_chunks, query)
+     answer = ask_gemini(prompt, client)
+     return f"### My Insights:\n\n{answer.strip()}"
+
+
+ with gr.Blocks() as demo:
+     gr.Markdown("## Ask Questions from Your Uploaded Documents")
+     file_input = gr.File(
+         label="Upload Your File",
+         file_types=[".pdf", ".txt", ".docx", ".csv", ".json", ".pptx", ".xml", ".xlsx"],
+         file_count="multiple",
+     )
+
+     process_btn = gr.Button("Process Document")
+     status = gr.Textbox(label="Processing Status")
+
+     question = gr.Textbox(label="Ask a Question")
+     answer = gr.Markdown()
+
+     process_btn.click(upload_and_process, inputs=file_input, outputs=status)
+     question.submit(handle_question, inputs=question, outputs=answer)
+
+ demo.launch(share=True)  # share=True creates a temporary public *.gradio.live link
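The exact-search branch that `handle_question` takes for small corpora boils down to a brute-force L2 scan. Here is that idea in plain NumPy (`retrieve_top_k` is an illustrative name, not a function from this repo; the real code delegates to `faiss.IndexFlatL2`):

```python
import numpy as np

def retrieve_top_k(query_vec, embeddings, texts, k=6):
    """Brute-force L2 nearest neighbours, the same idea as faiss.IndexFlatL2.
    Clamping k avoids asking for more neighbours than there are chunks."""
    k = min(k, len(texts))
    dists = np.linalg.norm(embeddings - query_vec, axis=1)  # L2 distance to each chunk
    order = np.argsort(dists)[:k]
    return [texts[i] for i in order]

texts = ["alpha", "beta", "gamma"]
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(retrieve_top_k(np.array([0.9, 0.1]), embeddings, texts))  # ['beta', 'alpha', 'gamma']
```

Brute force is exact and fast enough below a few thousand chunks, which is why the 50-chunk threshold in `handle_question` is a reasonable switch-over point rather than a hard requirement.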
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ gradio
+ google-genai
+ langchain
+ langchain-community
+ langchain-google-genai
+ google-generativeai
+ faiss-cpu
+ sentence-transformers
+ unstructured[pdf]
+ unstructured[docx]
+ unstructured[ppt]
+ unstructured[excel]
+ unstructured[xml]
utils.py ADDED
@@ -0,0 +1,141 @@
+ import os
+ import getpass
+ import faiss
+ import numpy as np
+ import warnings
+ import logging
+
+ # Suppress noisy parser warnings
+ logging.getLogger("pdfminer").setLevel(logging.ERROR)
+ warnings.filterwarnings("ignore")
+
+ from google import genai
+ from google.genai import types
+ from sentence_transformers import SentenceTransformer
+ from langchain_community.document_loaders import (
+     UnstructuredPDFLoader,
+     TextLoader,
+     CSVLoader,
+     JSONLoader,
+     UnstructuredPowerPointLoader,
+     UnstructuredExcelLoader,
+     UnstructuredXMLLoader,
+     UnstructuredWordDocumentLoader,
+ )
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+
+ def authenticate():
+     """Authenticates with the Google Generative AI API using an API key."""
+     api_key = os.environ.get("GOOGLE_API_KEY")
+     if not api_key:
+         api_key = getpass.getpass("Enter your API Key: ")
+
+     client = genai.Client(api_key=api_key)
+     return client
+
+
+ def load_documents_gradio(uploaded_files):
+     """Loads each uploaded file with a loader chosen by its extension."""
+     docs = []
+     for file in uploaded_files:
+         file_path = file.name
+         if file_path.lower().endswith(".pdf"):
+             docs.extend(UnstructuredPDFLoader(file_path).load())
+         elif file_path.lower().endswith(".txt"):
+             docs.extend(TextLoader(file_path).load())
+         elif file_path.lower().endswith(".csv"):
+             docs.extend(CSVLoader(file_path).load())
+         elif file_path.lower().endswith(".json"):
+             # JSONLoader requires a jq schema; "." loads the whole document.
+             docs.extend(JSONLoader(file_path, jq_schema=".", text_content=False).load())
+         elif file_path.lower().endswith(".pptx"):
+             docs.extend(UnstructuredPowerPointLoader(file_path).load())
+         elif file_path.lower().endswith(".xlsx"):
+             docs.extend(UnstructuredExcelLoader(file_path).load())
+         elif file_path.lower().endswith(".xml"):
+             docs.extend(UnstructuredXMLLoader(file_path).load())
+         elif file_path.lower().endswith(".docx"):
+             docs.extend(UnstructuredWordDocumentLoader(file_path).load())
+         else:
+             print(f"Unsupported file type: {file_path}")
+     return docs
+
+
+ def split_documents(docs, chunk_size=500, chunk_overlap=100):
+     """Splits documents into smaller chunks using RecursiveCharacterTextSplitter."""
+     splitter = RecursiveCharacterTextSplitter(
+         chunk_size=chunk_size, chunk_overlap=chunk_overlap
+     )
+     return splitter.split_documents(docs)
+
+
+ def build_vectorstore(docs, embedding_model_name="all-MiniLM-L6-v2"):
+     """Builds a FAISS vector store from the document chunks."""
+     texts = [doc.page_content.strip() for doc in docs if doc.page_content.strip()]
+     if not texts:
+         raise ValueError("No valid text found in the documents.")
+
+     print(f"No. of chunks: {len(texts)}")
+
+     model = SentenceTransformer(embedding_model_name)
+     embeddings = model.encode(texts)
+     print(f"Embeddings shape: {embeddings.shape}")
+
+     index = faiss.IndexFlatL2(embeddings.shape[1])
+     index.add(np.array(embeddings).astype("float32"))
+
+     return {
+         "index": index,
+         "texts": texts,
+         "embedding_model": model,
+         "embeddings": embeddings,
+         "chunks": len(texts),
+     }
+
+
+ def retrieve_context(query, store, k=6):
+     """Retrieves the top-k chunks most similar to the query (exact search)."""
+     query_vec = store["embedding_model"].encode([query])
+     k = min(k, len(store["texts"]))
+     distances, indices = store["index"].search(np.array(query_vec).astype("float32"), k)
+     return [store["texts"][i] for i in indices[0]]
+
+
+ def retrieve_context_approx(query, store, k=6):
+     """Retrieves context chunks using approximate nearest-neighbour (IVF) search."""
+     ncells = 50  # number of Voronoi cells; training wants at least this many vectors
+     d = store["index"].d
+     quantizer = faiss.IndexFlatL2(d)
+     ivf_index = faiss.IndexIVFFlat(quantizer, d, ncells)
+     ivf_index.nprobe = 10  # number of cells scanned per query
+
+     if not ivf_index.is_trained:
+         ivf_index.train(np.array(store["embeddings"]).astype("float32"))
+
+     ivf_index.add(np.array(store["embeddings"]).astype("float32"))
+     query_vec = store["embedding_model"].encode([query])
+     k = min(k, len(store["texts"]))
+     _, indices = ivf_index.search(np.array(query_vec).astype("float32"), k)
+     return [store["texts"][i] for i in indices[0]]
+
+
+ def build_prompt(context_chunks, query):
+     """Builds the prompt for the Gemini API from retrieved context and the query."""
+     context = "\n".join(context_chunks)
+     return f"""You are a highly knowledgeable and helpful assistant. Use the following context to generate a **detailed and step-by-step** answer to the user's question. Include explanations, examples, and reasoning wherever helpful.
+
+ Context:
+ {context}
+
+ Question: {query}
+ Answer:"""
+
+
+ def ask_gemini(prompt, client):
+     """Calls the Gemini API with the given prompt and returns the response text."""
+     response = client.models.generate_content(
+         model="gemini-2.0-flash",  # or your preferred model
+         contents=[prompt],
+         config=types.GenerateContentConfig(max_output_tokens=2048, temperature=0.5, seed=42),
+     )
+     return response.text
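`retrieve_context_approx` relies on `faiss.IndexIVFFlat`, which partitions vectors into cells and scans only the cells nearest the query. The idea can be sketched without FAISS; `ivf_search` and the hand-made cluster `assignments` below are illustrative only (the real index learns assignments via k-means during `train()`):

```python
import numpy as np

def ivf_search(query, embeddings, centroids, assignments, k=3, nprobe=1):
    """Two-stage search in the spirit of faiss.IndexIVFFlat: pick the nprobe
    nearest centroids, then scan only vectors assigned to those cells."""
    cell_dists = np.linalg.norm(centroids - query, axis=1)
    probed = np.argsort(cell_dists)[:nprobe]              # cells to scan
    candidates = np.where(np.isin(assignments, probed))[0]
    dists = np.linalg.norm(embeddings[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
# Two well-separated clusters; assignments are hand-made for this sketch.
embeddings = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(5.0, 0.1, (10, 2))])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
assignments = np.array([0] * 10 + [1] * 10)
hits = ivf_search(np.array([5.0, 5.0]), embeddings, centroids, assignments, k=3)
print(all(i >= 10 for i in hits))  # True: only the second cluster was scanned
```

Scanning `nprobe` cells instead of the whole corpus is what makes IVF sub-linear; the trade-off is that a true neighbour sitting in an unprobed cell can be missed, which is why the approximate path is reserved for corpora above the 50-chunk threshold.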