Upload 10 files
- .gitattributes +1 -0
- .gitignore +32 -7
- LICENSE +201 -0
- docs/README.md +206 -0
- docs/architecture-doc.md +554 -0
- docs/assets/app_screenshot.png +3 -0
- docs/changelog-doc.md +53 -0
- docs/configuration-doc.md +597 -0
- docs/data-handling-doc.md +687 -0
- scripts/preprocessing.ipynb +644 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+docs/assets/app_screenshot.png filter=lfs diff=lfs merge=lfs -text
.gitignore
CHANGED
@@ -1,24 +1,49 @@
-#
+# Secrets and Credentials
 .streamlit/secrets.toml
+*.env
 temp_credentials.json
+secrets.json

-#
+# Data and Model Files
 metadata.jsonl
 faiss_index.faiss
 text_chunks.txt
 all_embeddings.npy
+*.npy
+*.pt
+*.pth
+*.bin

-# Python
+# Python-specific
 __pycache__/
 *.pyc
 *.pyo
 *.pyd
 *.ipynb_checkpoints/

-# Virtual
+# Virtual Environments
 venv/
-.
+.venv/
+env/
+.env/

-# Logs
+# Logs and Temporary Files
 logs/
-*.log
+*.log
+temp/
+.tmp/
+
+# OS-specific Files
+.DS_Store
+Thumbs.db
+
+# IDE Files
+.vscode/
+.idea/
+*.swp
+*.swo
+
+# Deployment and Build
+*.egg-info/
+dist/
+build/
LICENSE
ADDED
@@ -0,0 +1,201 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [2025] [Ankan Ghosh]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
docs/README.md
ADDED
@@ -0,0 +1,206 @@
# Anveshak: Spirituality Q&A

[](https://huggingface.co/spaces/ankanghosh/anveshak)
[](https://opensource.org/licenses/Apache-2.0)

A Retrieval-Augmented Generation (RAG) application that provides concise answers to spiritual questions by referencing a curated collection of Indian spiritual texts, philosophical treatises, and teachings from revered Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters of all genders, backgrounds, traditions, and walks of life.

<p align="center">
  <img src="assets/app_screenshot.png" alt="Application Screenshot" width="800"/>
</p>

## Overview

Anveshak (meaning "seeker" in Sanskrit) serves as a bridge between ancient Indian spiritual wisdom and modern technology, allowing users to ask questions and receive answers grounded in traditional spiritual texts. The system combines the power of modern AI with the timeless wisdom found in these texts, making spiritual knowledge more accessible to seekers.

Our goal is to make a small contribution to the journey of beings toward self-discovery by making this knowledge available and accessible within ethical, moral, and resource-based constraints. **We have no commercial or for-profit interests; this application is purely for educational purposes.**

As stated in the application: "The path and journey to the SELF is designed to be undertaken alone. The all-encompassing knowledge is internal and not external."

### Key Features

- **Question-answering:** Ask spiritual questions and receive concise answers grounded in traditional texts
- **Source citations:** All answers include references to the original texts
- **Configurable retrieval:** Adjust the number of sources and word limit for answers
- **Responsive interface:** Built with Streamlit for a clean, accessible experience
- **Privacy-focused:** No user data or queries are saved
- **Inclusive recognition:** Acknowledges spiritual teachers from all backgrounds, genders, and traditions

## How It Works

Anveshak follows a classic RAG architecture:

1. **Data processing pipeline:** Collects, cleans, and processes ~133 spiritual texts
2. **Text embedding:** Uses the E5-large-v2 model to create vector representations of text chunks
3. **Vector storage:** Stores embeddings in a FAISS index for fast similarity search
4. **Retrieval system:** Finds relevant passages from the text collection based on user queries
5. **Generation system:** Synthesizes concise answers from retrieved passages using a large language model
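
At query time, steps 2-4 reduce to an embed-search-read loop. Below is a minimal sketch of that loop, assuming the preprocessed artifacts named in this repository (`faiss_index.faiss`, `text_chunks.txt`) are available locally and, purely for brevity, that each line of `text_chunks.txt` holds one chunk (the real file stores tab-separated records with IDs and source metadata):

```python
# Minimal sketch of the retrieval loop (steps 2-4), not the production code.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")
index = faiss.read_index("faiss_index.faiss")  # built offline by the data pipeline
with open("text_chunks.txt", encoding="utf-8") as f:
    chunks = f.read().splitlines()

def retrieve(query, top_k=5):
    # E5 models embed queries with a "query: " prefix; normalized vectors
    # make the index's inner-product score equal cosine similarity.
    emb = model.encode([f"query: {query}"], normalize_embeddings=True)
    _, idx = index.search(np.asarray(emb, dtype="float32"), top_k)
    return [chunks[i] for i in idx[0]]

print(retrieve("What is the nature of the self?"))
```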
## Getting Started

### Prerequisites

- Python 3.8 or higher
- [Google Cloud Storage](https://cloud.google.com/storage) account for data storage
- [OpenAI API](https://openai.com/api/) key for generation

### Installation

1. Clone the repository
```bash
git clone https://github.com/YourUsername/anveshak.git
cd anveshak
```

2. Install dependencies
```bash
pip install -r requirements.txt
```

3. Configure authentication
   - Create a `.streamlit/secrets.toml` file with the following structure:
```toml
# GCP Configuration
BUCKET_NAME_GCS = "your-bucket-name"
METADATA_PATH_GCS = "metadata/metadata.jsonl"
EMBEDDINGS_PATH_GCS = "processed/embeddings/all_embeddings.npy"
INDICES_PATH_GCS = "processed/indices/faiss_index.faiss"
CHUNKS_PATH_GCS = "processed/chunks/text_chunks.txt"
EMBEDDING_MODEL = "intfloat/e5-large-v2"
LLM_MODEL = "gpt-3.5-turbo"

# OpenAI API Configuration
openai_api_key = "your-openai-api-key"

# GCP Service Account Credentials (JSON format)
[gcp_credentials]
type = "service_account"
project_id = "your-project-id"
private_key_id = "your-private-key-id"
private_key = "your-private-key"
client_email = "your-client-email"
client_id = "your-client-id"
auth_uri = "https://accounts.google.com/o/oauth2/auth"
token_uri = "https://oauth2.googleapis.com/token"
auth_provider_x509_cert_url = "https://www.googleapis.com/oauth2/v1/certs"
client_x509_cert_url = "your-client-cert-url"
```

### Running the Application Locally

**Important Note**: Running Anveshak locally requires more than 16 GB of RAM because of the embedding model; most standard laptops will crash during model loading. Deployment on Hugging Face Spaces is strongly recommended.

```bash
streamlit run app.py
```

The application will be available at http://localhost:8501.

### Deploying to Hugging Face Spaces

This application is designed for deployment on [Hugging Face Spaces](https://huggingface.co/spaces):

1. Fork this repository to your GitHub account
2. Create a new Space on Hugging Face:
   - Go to [huggingface.co/spaces](https://huggingface.co/spaces)
   - Click "Create new Space"
   - Select "Streamlit" as the SDK
   - Connect your GitHub repository
3. Configure secrets in the Hugging Face UI:
   - Go to your Space settings
   - Under "Repository secrets"
   - Add each of the required secrets from your `.streamlit/secrets.toml` file

## Project Structure

```
anveshak/
├── .gitignore               # Specifies intentionally untracked files to ignore
├── .gitattributes           # Defines attributes for pathnames in the repository
├── app.py                   # Main Streamlit application
├── requirements.txt         # Python dependencies
├── rag_engine.py            # Core RAG functionality
├── utils.py                 # Utility functions for authentication
├── pages/                   # Streamlit pages
│   ├── 1_Sources.py         # Sources information page
│   ├── 2_Publishers.py      # Publisher acknowledgments page
│   └── 3_Contact_us.py      # Contact information page
├── docs/                    # Documentation
│   ├── architecture-doc.md  # Architecture details
│   ├── data-handling-doc.md # Data handling explanation
│   ├── configuration-doc.md # Configuration guide
│   ├── changelog-doc.md     # Project change log
│   └── README.md            # Project overview and instructions
└── scripts/                 # Data processing scripts
    └── preprocessing.ipynb  # Text preprocessing notebook
```
## Data Privacy & Ethics

- Anveshak: Spirituality Q&A **does not** save any user data or queries
- All texts are sourced from freely available resources with proper attribution
- Publisher acknowledgments are included within the application
- Word limits are implemented to prevent excessive content reproduction and respect copyright
- User queries are processed using OpenAI's services but not stored by Anveshak
- The application presents information with appropriate reverence for spiritual traditions
- Responses are generated by AI based on the retrieved texts and may not perfectly represent the original teachings, intended meaning, or context
- The inclusion of any spiritual teacher, text, or tradition does not imply their endorsement of Anveshak

## Data Flow

```
┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
│                   │     │                   │     │                   │
│   Data Pipeline   │────▶│ Retrieval System  │────▶│ Generation System │
│                   │     │                   │     │                   │
└───────────────────┘     └───────────────────┘     └───────────────────┘
          ▲                         ▲                         │
          │                         │                         │
  ┌───────────────┐         ┌───────────────┐        ┌────────▼────────┐
  │               │         │               │        │                 │
  │   Spiritual   │         │  User Query   │        │  Final Answer   │
  │  Text Corpus  │         │               │        │ with Citations  │
  │               │         │               │        │                 │
  └───────────────┘         └───────────────┘        └─────────────────┘
```
## Notes

- Anveshak: Spirituality Q&A is designed to provide concise answers rather than lengthy explanations or lists
- The application is not a general chatbot or conversational AI. It is specifically designed to answer spiritual questions with short, concise answers based on referenced texts.
- You may receive slightly different answers when asking the same question multiple times. This variation is intentional and reflects the nuanced nature of spiritual teachings across different traditions.
- Currently, Anveshak is only available in English
- The application acknowledges and honors spiritual teachers from all backgrounds, genders, traditions, and walks of life
- **Anveshak is a tool, not a substitute for direct spiritual guidance, personal practice, or studying original texts in their complete form.**

## Acknowledgments

Anveshak: Spirituality Q&A is made possible by the wisdom contained in numerous spiritual texts and the teachings of revered Saints, Sages, and Spiritual Masters from India and beyond. We extend our sincere gratitude to:

- **The Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters** of all genders, backgrounds, traditions, and walks of life whose timeless wisdom illuminates this application
- **The Sacred Texts** that have preserved the eternal truths across millennia
- **The Publishers** who have diligently preserved and disseminated these precious teachings
- **The Authors** who have dedicated their lives to interpreting and explaining complex spiritual concepts

See the "Publishers" and "Sources" pages within the application for complete acknowledgments.

## Future Roadmap

- **Multi-language support** (Sanskrit, Hindi, Bengali, Tamil, and more)
- **Enhanced retrieval** with hybrid retrieval methods
- **Self-hosted open-source LLM integration**
- **User feedback collection** for answer quality
- **Personalized learning paths** based on user interests (implemented with privacy-preserving approaches like client-side storage, session-based preferences, or explicit opt-in)

For a complete roadmap, see the [changelog](changelog-doc.md).

## Blog and Additional Resources

Read our detailed blog post about the project: [Anveshak: Spirituality Q&A - Bridging Faith and Intelligence](https://researchguy.in/anveshak-spirituality-qa-bridging-faith-and-intelligence/)

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](../LICENSE) file for details.

## Contact

For questions, feedback, or suggestions, please contact us at [email protected].
docs/architecture-doc.md
ADDED
@@ -0,0 +1,554 @@
# Architecture Document

This document provides a detailed overview of the architecture, component interactions, and technical design decisions encompassing Anveshak: Spirituality Q&A.

## System Architecture Overview

Anveshak: Spirituality Q&A follows a Retrieval-Augmented Generation (RAG) architecture pattern, combining information retrieval with language generation to produce factual, grounded answers to spiritual questions.

### High-Level Architecture Diagram

```
┌──────────────────────────────────────────────────────────────────────┐
│                           FRONT-END LAYER                            │
│                                                                      │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐   │
│  │  Main App Page  │  │  Sources Page   │  │   Publishers Page   │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────────┘   │
│                                                                      │
└────────────────────────────┬─────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            BACKEND LAYER                             │
│                                                                      │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐   │
│  │ Query Processor │  │ Retrieval Engine│  │  Generation Engine  │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────────┘   │
│                                                                      │
└────────────────────────────┬─────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────┐
│                             DATA LAYER                               │
│                                                                      │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐   │
│  │   FAISS Index   │  │   Text Chunks   │  │      Metadata       │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────────┘   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
## Component Details

### 1. Front-end Layer

The front-end layer is built with Streamlit and consists of multiple pages:

#### Main App Page (`app.py`)
- Provides the question input interface
- Displays answers and citations
- Offers configurable parameters (number of sources, word limit)
- Shows pre-selected common spiritual questions
- Contains information about the application and disclaimers
- Contains acknowledgment sections

#### Sources Page (`1_Sources.py`)
- Lists all spiritual texts and traditions used in Anveshak: Spirituality Q&A
- Provides information about the Saints and Spiritual Masters
- Organizes sources by tradition and category

#### Publishers Page (`2_Publishers.py`)
- Acknowledges all publishers whose works are referenced
- Explains copyright considerations and fair use

#### Contacts Page (`3_Contacts.py`)
- Provides contact information for feedback and questions
- Explains the purpose and limitations of Anveshak: Spirituality Q&A

### 2. Backend Layer

The backend layer handles the core functionality of processing queries, retrieving relevant passages, and generating answers.

#### Query Processor
- Takes user queries from the front-end
- Manages the end-to-end processing flow
- Caches results to improve performance
- Formats and returns answers with citations

```python
@st.cache_data(ttl=3600, show_spinner=False)
def cached_process_query(query, top_k=5, word_limit=100):
    """
    Process a user query with caching to avoid redundant computation.

    This function is cached with a Time-To-Live (TTL) of 1 hour, meaning identical
    queries within this time period will return cached results rather than
    reprocessing, improving responsiveness.

    Args:
        query (str): The user's spiritual question
        top_k (int): Number of sources to retrieve and use for answer generation
        word_limit (int): Maximum word count for the generated answer

    Returns:
        dict: Dictionary containing the query, answer, and citations
    """
    print(f"\nProcessing query (cached): {query}")
    # Load all necessary data resources (with caching)
    faiss_index, text_chunks, metadata_dict = cached_load_data_files()
    # Handle missing data gracefully
    if faiss_index is None or text_chunks is None or metadata_dict is None:
        return {
            "query": query,
            "answer_with_rag": "⚠️ System error: Data files not loaded properly.",
            "citations": "No citations available."
        }
    # Step 1: Retrieve relevant passages using similarity search
    retrieved_context, retrieved_sources = retrieve_passages(
        query,
        faiss_index,
        text_chunks,
        metadata_dict,
        top_k=top_k
    )
    # Step 2: Format citations for display
    sources = format_citations(retrieved_sources) if retrieved_sources else "No citation available."
    # Step 3: Generate the answer if relevant context was found
    if retrieved_context:
        context_with_sources = list(zip(retrieved_sources, retrieved_context))
        llm_answer_with_rag = answer_with_llm(query, context_with_sources, word_limit=word_limit)
    else:
        llm_answer_with_rag = "⚠️ No relevant context found."
    # Return the complete response package
    return {"query": query, "answer_with_rag": llm_answer_with_rag, "citations": sources}

def process_query(query, top_k=5, word_limit=100):
    """
    Process a query through the RAG pipeline with proper formatting.

    This is the main entry point for query processing, wrapping the cached
    query processing function.

    Args:
        query (str): The user's spiritual question
        top_k (int): Number of sources to retrieve and use for answer generation
        word_limit (int): Maximum word count for the generated answer

    Returns:
        dict: Dictionary containing the query, answer, and citations
    """
    return cached_process_query(query, top_k, word_limit)
```
#### Retrieval Engine
- Generates embeddings for user queries
- Performs similarity search in the FAISS index
- Retrieves the most relevant text chunks
- Adds metadata to the retrieved passages

```python
def retrieve_passages(query, faiss_index, text_chunks, metadata_dict, top_k=5, similarity_threshold=0.5):
    """
    Retrieve the most relevant passages for a given spiritual query.

    This function:
    1. Embeds the user query using the same model used for text chunks
    2. Finds similar passages using the FAISS index with cosine similarity
    3. Filters results based on similarity threshold to ensure relevance
    4. Enriches results with metadata (title, author, publisher)
    5. Ensures passage diversity by including only one passage per source title

    Args:
        query (str): The user's spiritual question
        faiss_index: FAISS index containing passage embeddings
        text_chunks (dict): Dictionary mapping IDs to text chunks and metadata
        metadata_dict (dict): Dictionary containing publication information
        top_k (int): Maximum number of passages to retrieve
        similarity_threshold (float): Minimum similarity score (0.0-1.0) for retrieved passages

    Returns:
        tuple: (retrieved_passages, retrieved_sources) containing the text and source information
    """
    try:
        print(f"\nRetrieving passages for query: {query}")
        query_embedding = get_embedding(query)
        distances, indices = faiss_index.search(query_embedding, top_k * 2)
        print(f"Found {len(distances[0])} potential matches")
        retrieved_passages = []
        retrieved_sources = []
        cited_titles = set()
        for dist, idx in zip(distances[0], indices[0]):
            print(f"Distance: {dist:.4f}, Index: {idx}")
            if idx in text_chunks and dist >= similarity_threshold:
                title_with_txt, author, text = text_chunks[idx]
                clean_title = title_with_txt.replace(".txt", "") if title_with_txt.endswith(".txt") else title_with_txt
                clean_title = unicodedata.normalize("NFC", clean_title)
                if clean_title in cited_titles:
                    continue
                metadata_entry = metadata_dict.get(clean_title, {})
                author = metadata_entry.get("Author", "Unknown")
                publisher = metadata_entry.get("Publisher", "Unknown")
                cited_titles.add(clean_title)
                retrieved_passages.append(text)
                retrieved_sources.append((clean_title, author, publisher))
                if len(retrieved_passages) == top_k:
                    break
        print(f"Retrieved {len(retrieved_passages)} passages")
        return retrieved_passages, retrieved_sources
    except Exception as e:
        print(f"❌ Error in retrieve_passages: {str(e)}")
        return [], []
```

#### Generation Engine
- Takes retrieved passages as context
- Uses OpenAI's GPT model to generate answers
- Ensures answers respect the word limit
- Formats the output with proper citations
```python
def answer_with_llm(query, context=None, word_limit=100):
    """
    Generate an answer using the OpenAI GPT model with formatted citations.

    This function:
    1. Formats retrieved passages with source information
    2. Creates a prompt with system and user messages
    3. Calls the OpenAI API to generate an answer
    4. Trims the response to the specified word limit

    The system prompt ensures answers maintain appropriate respect for spiritual traditions,
    synthesize rather than quote directly, and acknowledge gaps when relevant information
    isn't available.

    Args:
        query (str): The user's spiritual question
        context (list, optional): List of (source_info, text) tuples for context
        word_limit (int): Maximum word count for the generated answer

    Returns:
        str: The generated answer or an error message
    """
    try:
        if context:
            formatted_contexts = []
            total_chars = 0
            max_context_chars = 4000  # Limit context size to avoid exceeding token limits
            for (title, author, publisher), text in context:
                remaining_space = max(0, max_context_chars - total_chars)
                excerpt_len = min(150, remaining_space)
                if excerpt_len > 50:
                    excerpt = text[:excerpt_len].strip() + "..." if len(text) > excerpt_len else text
                    formatted_context = f"[{title} by {author}, Published by {publisher}] {excerpt}"
                    formatted_contexts.append(formatted_context)
                    total_chars += len(formatted_context)
                if total_chars >= max_context_chars:
                    break
            formatted_context = "\n".join(formatted_contexts)
        else:
            formatted_context = "No relevant information available."

        system_message = (
            "You are an AI specialized in spirituality, primarily based on Indian spiritual texts and teachings. "
            "While your knowledge is predominantly from Indian spiritual traditions, you also have limited familiarity with spiritual concepts from other global traditions. "
            "Answer based on context, summarizing ideas rather than quoting verbatim. "
            "If no relevant information is found in the provided context, politely inform the user that this specific query may not be covered in the available spiritual texts. Suggest they rephrase their query or try a related question. "
            "Avoid repetition and irrelevant details. "
            "Ensure proper citation and do not include direct excerpts. "
            "Maintain appropriate, respectful language at all times. "
            "Do not use profanity, expletives, obscenities, slurs, hate speech, sexually explicit content, or language promoting violence. "
            "As a spiritual guidance system, ensure all responses reflect dignity, peace, love, and compassion consistent with spiritual traditions. "
            "Provide concise, focused answers without lists or lengthy explanations."
        )

        user_message = f"""
Context:
{formatted_context}
Question:
{query}
"""

        try:
            llm_model = st.secrets["LLM_MODEL"]
        except KeyError:
            print("❌ Error: LLM model not found in secrets")
            return "I apologize, but I am unable to answer at the moment."

        response = openai.chat.completions.create(
            model=llm_model,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            max_tokens=200,
            temperature=0.7
        )

        # Extract the answer and apply word limit
        answer = response.choices[0].message.content.strip()
        words = answer.split()
        if len(words) > word_limit:
            answer = " ".join(words[:word_limit])
            if not answer.endswith((".", "!", "?")):
                answer += "."
        return answer
    except Exception as e:
        print(f"❌ LLM API error: {str(e)}")
        return "I apologize, but I am unable to answer at the moment."
```

### 3. Data Layer

The data layer stores and manages the embedded text chunks, metadata, and vector indices:

#### FAISS Index
- Stores vector embeddings of all text chunks
- Enables efficient similarity search with cosine similarity
- Provides fast retrieval for Anveshak

```python
# Building the FAISS index (during preprocessing)
dimension = all_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product (cosine similarity for normalized vectors)
index.add(all_embeddings)
```

#### Text Chunks
- Contains the actual text content split into manageable chunks
- Stores text with unique identifiers that map to the FAISS index
- Formatted as tab-separated values with IDs, titles, authors, and content

```
# Format of text_chunks.txt
ID   Title           Author      Text_Content
0    Bhagavad Gita   Vyasa       The supreme Lord said: I have taught this imperishable yoga to Vivasvan...
1    Yoga Sutras     Patanjali   Yogas chitta vritti nirodhah - Yoga is the stilling of the fluctuations...
...
```

#### Metadata
- Stores additional information about each source text
- Includes author information, publisher details, copyright information, and more
- Used to provide accurate citations for answers

```json
// Example metadata.jsonl entry
{"Title": "Text_Name", "Author": "Vyasa", "Publisher": "Publisher_Name", "URL": "URL", "Uploaded": true}
```
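
To make the ID-to-row correspondence concrete, a loading sketch for both files follows. The actual loaders in `rag_engine.py` are not shown in this document, so the parsing details below are assumptions based on the formats above:

```python
# Sketch: load text_chunks.txt and metadata.jsonl into the dictionaries
# consumed by retrieve_passages(). Parsing details are assumptions.
import json

def load_text_chunks(path="text_chunks.txt"):
    chunks = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Tab-separated: ID, Title, Author, Text_Content
            chunk_id, title, author, text = line.rstrip("\n").split("\t", 3)
            chunks[int(chunk_id)] = (title, author, text)  # ID matches the FAISS row
    return chunks

def load_metadata(path="metadata.jsonl"):
    metadata = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            metadata[entry["Title"]] = entry  # keyed by title for citation lookup
    return metadata
```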

## Data Flow and Processing

### 1. Preprocessing Pipeline

The preprocessing pipeline runs offline to prepare the text corpus:

```
Raw Texts → Cleaning → Chunking → Embedding → Indexing → GCS Storage
```

Each step is handled by specific functions in the `preprocessing.ipynb` notebook:

1. **Text Collection**: Texts are collected from various sources and uploaded to Google Cloud Storage
2. **Text Cleaning**: HTML and formatting artifacts are removed using `rigorous_clean_text()`
3. **Text Chunking**: Long texts are split into manageable chunks with `chunk_text()`
4. **Embedding Generation**: Text chunks are converted to vector embeddings using `create_embeddings()`
5. **Index Building**: Embeddings are added to a FAISS index for efficient retrieval
6. **Storage**: All processed data is stored in Google Cloud Storage for Anveshak to access
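
Condensed into code, steps 3-5 might look like the sketch below. `chunk_text()` and `create_embeddings()` are the function names cited above, but their bodies here are illustrative guesses (the 500-word chunk size comes from the Limitations section, and `cleaned_text.txt` is a hypothetical input file):

```python
# Illustrative offline pipeline: chunk -> embed -> index (cleaning omitted).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(words, max_words=500):
    # Split a cleaned document into chunks of at most 500 words.
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def create_embeddings(chunks):
    model = SentenceTransformer("intfloat/e5-large-v2")
    # E5 convention: corpus text is embedded with a "passage: " prefix.
    return model.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)

with open("cleaned_text.txt", encoding="utf-8") as f:  # hypothetical cleaned input
    chunks = chunk_text(f.read().split())
all_embeddings = np.asarray(create_embeddings(chunks), dtype="float32")
index = faiss.IndexFlatIP(all_embeddings.shape[1])  # as in the FAISS snippet above
index.add(all_embeddings)
faiss.write_index(index, "faiss_index.faiss")
```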
### 2. Query Processing Flow

When a user submits a question, the system follows this flow:

1. **Query Embedding**: The user's question is embedded using the same model as the text corpus
2. **Similarity Search**: The query embedding is compared against the FAISS index to find similar text chunks
3. **Context Assembly**: Retrieved chunks are combined with their metadata to form the context
4. **Answer Generation**: The context and query are sent to the Large Language Model (LLM) to generate an answer
5. **Citation Formatting**: Sources are formatted as citations to accompany the answer
6. **Result Presentation**: The answer and citations are displayed to the user
## Caching Strategy

Anveshak implements a multi-level caching strategy to optimize performance:

### Resource Caching
- Model and data files are cached using `@st.cache_resource`
- Ensures the embedding model and FAISS index are loaded only once during the session

```python
@st.cache_resource(show_spinner=False)
def cached_load_model():
    # Load embedding model once and cache it
    ...

@st.cache_resource(show_spinner=False)
def cached_load_data_files():
    # Load FAISS index, text chunks, and metadata once and cache them
    ...
```

### Data Caching
- Query results are cached using `@st.cache_data` with a Time-To-Live (TTL) of 1 hour
- Prevents redundant processing of identical queries

```python
@st.cache_data(ttl=3600, show_spinner=False)
def cached_process_query(query, top_k=5, word_limit=100):
    # Cache query results for an hour
    ...
```

### Session State Management
- Streamlit session state is used to manage UI state and user interactions
- Prevents unnecessary recomputation during re-renders

```python
...
if 'initialized' not in st.session_state:
    st.session_state.initialized = False
...
if 'last_query' not in st.session_state:
    st.session_state.last_query = ""
# ... and more session state variables
```
## Authentication and Security

Anveshak: Spirituality Q&A uses two authentication systems:

### Google Cloud Storage Authentication
- Authenticates with GCS to access stored data
- Uses service account credentials stored exclusively in Hugging Face Spaces secrets for production deployment
- Supports alternative authentication methods (environment variables, Streamlit secrets) for development environments

```python
def setup_gcp_auth():
    """Setup Google Cloud Platform (GCP) authentication using various methods.

    This function tries multiple authentication methods in order of preference:
    1. HF Spaces environment variable (GCP_CREDENTIALS) - primary production method
    2. Local environment variable pointing to credentials file (GOOGLE_APPLICATION_CREDENTIALS)
    3. Streamlit secrets (gcp_credentials)

    Note: In production, credentials are stored exclusively in HF Spaces secrets.
    """
    # Try multiple authentication methods and return credentials.
    ...
```

### OpenAI API Authentication
- Authenticates with OpenAI to use their LLM API
- Uses API key stored securely

```python
def setup_openai_auth():
    """Setup OpenAI API authentication using various methods.

    This function tries multiple authentication methods in order of preference:
    1. Standard environment variable (OPENAI_API_KEY)
    2. HF Spaces environment variable (OPENAI_KEY) - primary production method
    3. Streamlit secrets (openai_api_key)

    Note: In production, the API key is stored exclusively in HF Spaces secrets.
    """
    # Try multiple authentication methods to set up the API key
    ...
```
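
A sketch of what the GCP fallback chain described in the docstring could look like — illustrative, not the actual `utils.py` implementation; it assumes `GCP_CREDENTIALS` holds the full service-account JSON as a string:

```python
# Sketch of the credential fallback chain (assumed implementation).
import json
import os
import streamlit as st
from google.oauth2 import service_account

def setup_gcp_auth():
    # 1. HF Spaces secret: full service-account JSON in one env var
    if "GCP_CREDENTIALS" in os.environ:
        info = json.loads(os.environ["GCP_CREDENTIALS"])
        return service_account.Credentials.from_service_account_info(info)
    # 2. Local credentials file referenced by the standard GCP env var
    if "GOOGLE_APPLICATION_CREDENTIALS" in os.environ:
        return service_account.Credentials.from_service_account_file(
            os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
    # 3. Streamlit secrets (development fallback)
    return service_account.Credentials.from_service_account_info(
        dict(st.secrets["gcp_credentials"]))
```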

## Privacy Considerations

Anveshak: Spirituality Q&A is designed with privacy in mind:

1. **No Data Collection**: The application does not save user data or queries
2. **Stateless Operation**: Each query is processed independently
3. **No User Tracking**: No analytics or tracking mechanisms are implemented
4. **Local Processing**: Embedding generation happens locally when possible

## Deployment Architecture

Anveshak: Spirituality Q&A is deployed on Hugging Face Spaces, which provides:

- Containerized environment
- Git-based deployment
- Secret management for API keys and credentials
- Persistent storage for cached files
- Continuous availability

The deployment process involves:
1. Pushing code to GitHub
2. Connecting the GitHub repository to Hugging Face Spaces
3. Configuring environment variables and secrets in the Hugging Face UI
4. Automatic deployment when changes are pushed to the repository

## Technical Design Decisions

### Choice of Embedding Model
- **Selected Model**: E5-large-v2
- **Justification**:
  - Strong performance on information retrieval tasks
  - Good balance between accuracy and computational efficiency
  - Supports semantic understanding of spiritual concepts
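
`retrieve_passages()` above calls `get_embedding()`, which this document never shows; a plausible sketch given the E5-large-v2 choice follows. The `query: ` prefix is the E5 family's documented usage convention, and normalizing the vector makes the FAISS inner-product score equal cosine similarity — treat the body as an assumption, not the actual `rag_engine.py` code:

```python
# Assumed shape of get_embedding(); illustrative only.
import numpy as np

def get_embedding(query):
    # cached_load_model() is the cached SentenceTransformer mentioned under
    # "Resource Caching"; E5 models embed queries with a "query: " prefix.
    model = cached_load_model()
    emb = model.encode([f"query: {query}"], normalize_embeddings=True)
    # FAISS expects float32; shape is (1, 1024) for E5-large-v2.
    return np.asarray(emb, dtype="float32")
```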
### Vector Search Implementation
- **Selected Technology**: FAISS with IndexFlatIP
- **Justification**:
  - Optimized for inner product (cosine similarity) search
  - Exact search rather than approximate for maximum accuracy
  - Small enough index to fit in memory for this application

### LLM Selection
- **Selected Model**: OpenAI GPT-3.5 Turbo
- **Justification**:
  - Powerful context understanding
  - Strong ability to synthesize information from multiple sources
  - Good balance between accuracy and cost

### Front-end Framework
- **Selected Technology**: Streamlit
- **Justification**:
  - Rapid development of data-focused applications
  - Built-in caching mechanisms
  - Easy deployment on Hugging Face Spaces
  - Simple, intuitive UI for non-technical users

### Response Format
- **Design Choice**: Concise, direct answers
- **Justification**:
  - Spiritual wisdom often benefits from simplicity and directness
  - Avoids overwhelming users with excessive information
  - Maintains focus on the core of the question

## Limitations and Constraints

1. **Context Window Limitations**: The LLM has a maximum context window, limiting the amount of text that can be included in each query.
   - Mitigation: Text chunks are limited to 500 words, and only a subset of the most relevant chunks are included in the context.

2. **Embedding Model Accuracy**: No embedding model perfectly captures the semantics of spiritual texts.
   - Mitigation: Use of a high-quality embedding model (E5-large-v2) and a similarity threshold to filter out less relevant results.

3. **Resource Constraints**: Hugging Face Spaces has limited computational resources.
   - Mitigation: Forcing CPU usage for the embedding model, implementing aggressive caching, and optimizing memory usage.

4. **Copyright Considerations**: Anveshak: Spirituality Q&A respects copyright while providing valuable information.
   - Implementation: Word limits on responses, proper citations for all sources, and encouragement for users to purchase original texts.

5. **Language Limitations**: Currently, Anveshak is only available in English.
   - Mitigation: Future plans include support for multiple Indian languages.

## Future Architecture Extensions

1. **Multi-language Support**: Add capability to process and answer questions in Sanskrit, Hindi, Bengali, Tamil, and other Indian languages.

2. **Hybrid Retrieval**: Implement a combination of dense and sparse retrieval to improve passage selection.

3. **Local LLM Integration**: Use a self-hosted open-source alternative for the LLM.

4. **User Feedback Loop**: Add a mechanism for users to rate answers and use this feedback to improve retrieval.

5. **Advanced Caching**: Implement a distributed caching system for better performance at scale.

## Conclusion

The architecture of Anveshak balances technical sophistication with simplicity and accessibility. By combining modern NLP techniques with traditional spiritual texts, it creates a bridge between ancient wisdom and contemporary technology, making spiritual knowledge more accessible to seekers around the world.

Anveshak: Spirituality Q&A acknowledges and honors Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters from all backgrounds, genders, traditions, and walks of life, understanding that wisdom transcends all such distinctions. Its focused approach on providing concise, direct answers maintains the essence of spiritual teaching while embracing modern technological capabilities.
docs/assets/app_screenshot.png ADDED
[Binary image file (application screenshot), tracked via Git LFS — no text preview available.]

docs/changelog-doc.md ADDED
@@ -0,0 +1,53 @@
# Changelog

All notable changes to Anveshak: Spirituality Q&A will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.0] - 2025-04-01

### Added
- Initial release of Anveshak: Spirituality Q&A
- Core RAG functionality with E5-large-v2 embedding model
- FAISS index for efficient text retrieval
- Integration with OpenAI API for answer generation
- Streamlit-based user interface
- Caching mechanisms for improved performance
- Support for customizable number of sources and word limits
- Pre-selected common spiritual questions
- Comprehensive acknowledgment of sources and publishers
- Detailed documentation

### Technical Features
- Google Cloud Storage integration for data storage
- Authentication handling for GCP and OpenAI
- Memory optimization for resource-constrained environments
- Multi-page Streamlit application structure
- Custom CSS styling for enhanced user experience
- Privacy protection with no user data storage
- Concise answer generation system
- Recognition of Saints and Spiritual Masters of all backgrounds and traditions

## Future Roadmap

### Planned for v1.1.0
- Multi-language support (Sanskrit, Hindi, Bengali, Tamil, and more)
- User feedback collection for answer quality
- Enhanced answer relevance with hybrid retrieval methods
- Additional spiritual texts from diverse traditions
- Improved citation formatting with page numbers where available

### Planned for v1.2.0
- Self-hosted open-source LLM integration
- Advanced visualization of concept relationships
- Search functionality for specific texts or authors
- Audio output for visually impaired users
- Mobile-optimized interface

### Planned for v2.0.0
- Meditation timer and guide integration
- Personalized learning paths based on user interests (implemented with privacy-preserving approaches like client-side storage, session-based preferences, or explicit opt-in)
- Interactive glossary of spiritual terms
- Spiritual practice guide with scheduler and tracker
- Community features for discussion and shared learning
docs/configuration-doc.md ADDED
@@ -0,0 +1,597 @@
# Configuration Guide

This document provides detailed instructions for configuring and deploying Anveshak: Spirituality Q&A, covering environment setup, authentication, customization options, and deployment strategies.

## Environment Configuration

### Configuration Parameters

Anveshak: Spirituality Q&A uses the following configuration parameters, which can be set through environment variables or Hugging Face Spaces secrets:

| Parameter | Description | Example Value |
|-----------|-------------|---------------|
| `BUCKET_NAME_GCS` | GCS bucket name for data storage | `"your-bucket-name"` |
| `METADATA_PATH_GCS` | Path to metadata file in GCS | `"metadata/metadata.jsonl"` |
| `EMBEDDINGS_PATH_GCS` | Path to embeddings file in GCS | `"processed/embeddings/all_embeddings.npy"` |
| `INDICES_PATH_GCS` | Path to FAISS index in GCS | `"processed/indices/faiss_index.faiss"` |
| `CHUNKS_PATH_GCS` | Path to text chunks file in GCS | `"processed/chunks/text_chunks.txt"` |
| `RAW_TEXTS_UPLOADED_PATH_GCS` | Path to uploaded raw texts in GCS | `"raw-texts/uploaded"` |
| `RAW_TEXTS_DOWNLOADED_PATH_GCS` | Path to downloaded raw texts in GCS | `"raw-texts/downloaded/"` |
| `CLEANED_TEXTS_PATH_GCS` | Path to cleaned texts in GCS | `"cleaned-texts/"` |
| `EMBEDDING_MODEL` | Hugging Face model ID for embeddings | `"intfloat/e5-large-v2"` |
| `LLM_MODEL` | OpenAI model for answer generation | `"gpt-3.5-turbo"` |
| `OPENAI_API_KEY` | OpenAI API key | `"sk-..."` |
| `GCP_CREDENTIALS` | GCP service account credentials (JSON) | `{"type":"service_account",...}` |

### Streamlit Secrets Configuration (Optional)

If developing locally with Streamlit, you can create a `.streamlit/secrets.toml` file with the following structure:

```toml
# GCS Configuration
BUCKET_NAME_GCS = "your-bucket-name"
METADATA_PATH_GCS = "metadata/metadata.jsonl"
EMBEDDINGS_PATH_GCS = "processed/embeddings/all_embeddings.npy"
INDICES_PATH_GCS = "processed/indices/faiss_index.faiss"
CHUNKS_PATH_GCS = "processed/chunks/text_chunks.txt"
RAW_TEXTS_UPLOADED_PATH_GCS = "raw-texts/uploaded"
RAW_TEXTS_DOWNLOADED_PATH_GCS = "raw-texts/downloaded/"
CLEANED_TEXTS_PATH_GCS = "cleaned-texts/"
EMBEDDING_MODEL = "intfloat/e5-large-v2"
LLM_MODEL = "gpt-3.5-turbo"

# OpenAI API Configuration
openai_api_key = "your-openai-api-key"

# GCP Service Account Credentials (JSON format)
[gcp_credentials]
type = "service_account"
project_id = "your-project-id"
private_key_id = "your-private-key-id"
private_key = "your-private-key"
client_email = "your-client-email"
client_id = "your-client-id"
auth_uri = "https://accounts.google.com/o/oauth2/auth"
token_uri = "https://oauth2.googleapis.com/token"
auth_provider_x509_cert_url = "https://www.googleapis.com/oauth2/v1/certs"
client_x509_cert_url = "your-client-cert-url"
```

### Environment Variables for Alternative Deployments

For deployments that support environment variables (like Heroku or Docker), you can use the following environment variables:

```bash
# GCS Configuration
export BUCKET_NAME_GCS="your-bucket-name"
export METADATA_PATH_GCS="metadata/metadata.jsonl"
export EMBEDDINGS_PATH_GCS="processed/embeddings/all_embeddings.npy"
export INDICES_PATH_GCS="processed/indices/faiss_index.faiss"
export CHUNKS_PATH_GCS="processed/chunks/text_chunks.txt"
export RAW_TEXTS_UPLOADED_PATH_GCS="raw-texts/uploaded"
export RAW_TEXTS_DOWNLOADED_PATH_GCS="raw-texts/downloaded/"
export CLEANED_TEXTS_PATH_GCS="cleaned-texts/"
export EMBEDDING_MODEL="intfloat/e5-large-v2"
export LLM_MODEL="gpt-3.5-turbo"

# OpenAI API Configuration
export OPENAI_API_KEY="your-openai-api-key"

# GCP Service Account (as a JSON string)
export GCP_CREDENTIALS='{"type":"service_account","project_id":"your-project-id",...}'
```
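
For other local setups, a small helper can bridge these two configuration sources. The following is a minimal sketch, not part of the codebase — `get_config`, its fallback order, and the broad exception handling are illustrative assumptions (Streamlit's missing-secrets exception type varies across versions):

```python
import os
import streamlit as st

def get_config(key, default=None):
    """Hypothetical helper: prefer environment variables,
    then fall back to Streamlit secrets, then to a default."""
    if key in os.environ:
        return os.environ[key]
    try:
        return st.secrets[key]
    except Exception:  # missing secrets file or missing key
        return default

# Example usage
bucket_name = get_config("BUCKET_NAME_GCS", "your-bucket-name")
```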

## Authentication Setup

### Google Cloud Storage (GCS) Authentication

Anveshak: Spirituality Q&A supports multiple methods for authenticating with GCS:

#### Setting Up a GCP Service Account (Required)

Before configuring authentication methods, you'll need to create a Google Cloud Platform (GCP) service account:

1. **Create a GCP project** (if you don't already have one):
   - Go to the [Google Cloud Console](https://console.cloud.google.com/)
   - Click on "Select a project" at the top right and then "New Project"
   - Enter a project name and click "Create"

2. **Enable the Cloud Storage API**:
   - Go to "APIs & Services" > "Library" in the left sidebar
   - Search for "Cloud Storage"
   - Click on "Cloud Storage API" and then "Enable"

3. **Create a service account**:
   - Go to "IAM & Admin" > "Service Accounts" in the left sidebar
   - Click "Create Service Account"
   - Enter a service account name and description
   - Click "Create and Continue"

4. **Assign roles to the service account**:
   - Add the "Storage Object Admin" role for access to GCS objects
   - Add the "Viewer" role for basic read permissions
   - Click "Continue" and then "Done"

5. **Create and download a service account key**:
   - Find your new service account in the list and click on it
   - Go to the "Keys" tab
   - Click "Add Key" > "Create new key"
   - Choose "JSON" as the key type
   - Click "Create" to download the key file (this is your GCP credentials JSON file)

6. **Create a GCS bucket**:
   - Go to "Cloud Storage" > "Buckets" in the left sidebar
   - Click "Create"
   - Enter a globally unique bucket name
   - Choose your settings for location, class, and access control
   - Click "Create"

Once you have created your service account and GCS bucket, you can use any of the following authentication methods:

#### Option 1: HF Spaces Environment Variable (Recommended Production Method)

For Hugging Face Spaces, set the `GCP_CREDENTIALS` environment variable in the Spaces UI:

1. Go to your Space settings
2. Open "Repository secrets"
3. Add a new secret with the name `GCP_CREDENTIALS` and a value containing your JSON credentials

#### Option 2: Local Development with Application Default Credentials

For local development, you can use Application Default Credentials:

```bash
# Export the path to your service account key file
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account-file.json"
```

#### Option 3: Streamlit Secrets

Add your service account credentials to the `.streamlit/secrets.toml` file as shown in the example above.

The authentication logic is handled by the `setup_gcp_auth()` function in `utils.py`:

```python
def setup_gcp_auth():
    """
    Setup Google Cloud Platform (GCP) authentication using various methods.

    This function tries multiple authentication methods in order of preference:
    1. HF Spaces environment variable (GCP_CREDENTIALS) - primary production method
    2. Local environment variable pointing to credentials file (GOOGLE_APPLICATION_CREDENTIALS)
    3. Streamlit secrets (gcp_credentials)

    Note: In production, credentials are stored exclusively in HF Spaces secrets.
    """
    try:
        # Option 1: HF Spaces environment variable
        if "GCP_CREDENTIALS" in os.environ:
            gcp_credentials = json.loads(os.getenv("GCP_CREDENTIALS"))
            print("✅ Using GCP credentials from HF Spaces environment variable")
            credentials = service_account.Credentials.from_service_account_info(gcp_credentials)
            return credentials

        # Option 2: Local environment variable pointing to file
        elif "GOOGLE_APPLICATION_CREDENTIALS" in os.environ:
            credentials_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
            print(f"✅ Using GCP credentials from file at {credentials_path}")
            credentials = service_account.Credentials.from_service_account_file(credentials_path)
            return credentials

        # Option 3: Streamlit secrets
        elif "gcp_credentials" in st.secrets:
            gcp_credentials = st.secrets["gcp_credentials"]

            # Handle different secret formats
            if isinstance(gcp_credentials, dict) or hasattr(gcp_credentials, 'to_dict'):
                # Convert AttrDict to dict if needed
                if hasattr(gcp_credentials, 'to_dict'):
                    gcp_credentials = gcp_credentials.to_dict()

                print("✅ Using GCP credentials from Streamlit secrets (dict format)")
                credentials = service_account.Credentials.from_service_account_info(gcp_credentials)
                return credentials
            else:
                # Assume it's a JSON string
                try:
                    gcp_credentials_dict = json.loads(gcp_credentials)
                    print("✅ Using GCP credentials from Streamlit secrets (JSON string)")
                    credentials = service_account.Credentials.from_service_account_info(gcp_credentials_dict)
                    return credentials
                except json.JSONDecodeError:
                    print("⚠️ GCP credentials in Streamlit secrets is not valid JSON, trying as file path")
                    if os.path.exists(gcp_credentials):
                        credentials = service_account.Credentials.from_service_account_file(gcp_credentials)
                        return credentials
                    else:
                        raise ValueError("GCP credentials format not recognized")

        else:
            raise ValueError("No GCP credentials found in environment or Streamlit secrets")

    except Exception as e:
        error_msg = f"❌ Authentication error: {str(e)}"
        print(error_msg)
        st.error(error_msg)
        raise
```
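
A minimal sketch of how the returned credentials might be wired into a storage client (the bucket name is a placeholder; the actual wiring lives in `setup_gcp_client()` in rag_engine.py and may differ):

```python
from google.cloud import storage

# Build a GCS client from the credentials returned by setup_gcp_auth()
credentials = setup_gcp_auth()
client = storage.Client(project=credentials.project_id, credentials=credentials)
bucket = client.bucket("your-bucket-name")  # BUCKET_NAME_GCS in practice
```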

### OpenAI API Authentication

Similarly, OpenAI API authentication can be configured in multiple ways:

#### Option 1: HF Spaces Environment Variable (Recommended Production Method)

Set the `OPENAI_API_KEY` environment variable in the Hugging Face Spaces UI.

#### Option 2: Environment Variables

Set the `OPENAI_API_KEY` environment variable:

```bash
export OPENAI_API_KEY="your-openai-api-key"
```

#### Option 3: Streamlit Secrets

Add your OpenAI API key to the `.streamlit/secrets.toml` file:

```toml
openai_api_key = "your-openai-api-key"
```

The authentication logic is handled by the `setup_openai_auth()` function in `utils.py`:

```python
def setup_openai_auth():
    """
    Setup OpenAI API authentication using various methods.

    This function tries multiple authentication methods in order of preference:
    1. Standard environment variable (OPENAI_API_KEY)
    2. HF Spaces environment variable (OPENAI_KEY) - primary production method
    3. Streamlit secrets (openai_api_key)

    Note: In production, the API key is stored exclusively in HF Spaces secrets.
    """
    try:
        # Option 1: Standard environment variable
        if "OPENAI_API_KEY" in os.environ:
            openai.api_key = os.getenv("OPENAI_API_KEY")
            print("✅ Using OpenAI API key from environment variable")
            return

        # Option 2: HF Spaces environment variable with different name
        elif "OPENAI_KEY" in os.environ:
            openai.api_key = os.getenv("OPENAI_KEY")
            print("✅ Using OpenAI API key from HF Spaces environment variable")
            return

        # Option 3: Streamlit secrets
        elif "openai_api_key" in st.secrets:
            openai.api_key = st.secrets["openai_api_key"]
            print("✅ Using OpenAI API key from Streamlit secrets")
            return

        else:
            raise ValueError("No OpenAI API key found in environment or Streamlit secrets")

    except Exception as e:
        error_msg = f"❌ OpenAI authentication error: {str(e)}"
        print(error_msg)
        st.error(error_msg)
        raise
```

## Application Customization

### UI Customization

Anveshak's UI can be customized through the CSS in the `app.py` file:

```python
# Custom CSS
st.markdown("""
<style>
.main-title {
    font-size: 2.5rem;
    color: #c0392b;
    text-align: center;
    margin-bottom: 1rem;
}
.subtitle {
    font-size: 1.2rem;
    color: #555;
    text-align: center;
    margin-bottom: 1.5rem;
    font-style: italic;
}
/* More CSS rules... */
</style>
<div class="main-title">Anveshak</div>
<div class="subtitle">Spirituality Q&A</div>
""", unsafe_allow_html=True)
```

To change the appearance:

1. Modify the CSS variables in the `<style>` tag
2. Update color schemes, fonts, or layouts as needed
3. Add new CSS classes for additional UI elements

### Common Questions Configuration

The list of pre-selected common questions can be modified in the `app.py` file:

```python
# Common spiritual questions for users to select from
common_questions = [
    "What is the Atman or the soul?",
    "Are there rebirths?",
    "What is Karma?",
    # Add or modify questions here
]
```

### Retrieval Parameters

Two key retrieval parameters can be adjusted by users through the UI:

1. **Number of sources** (`top_k`): Controls how many distinct sources are used for generating answers
   - Default: 5
   - Range: 3-10
   - UI Component: Slider in the main interface

2. **Word limit** (`word_limit`): Controls the maximum length of generated answers
   - Default: 200
   - Range: 50-500
   - UI Component: Slider in the main interface

These parameters are implemented in the Streamlit UI:

```python
# Sliders for customization
col1, col2 = st.columns(2)
with col1:
    top_k = st.slider("Number of sources:", 3, 10, 5)
with col2:
    word_limit = st.slider("Word limit:", 50, 500, 200)
```

## Deployment Options

### Recommended: Hugging Face Spaces Deployment

The recommended and tested deployment method for Anveshak: Spirituality Q&A is Hugging Face Spaces, which provides the necessary resources for running the application efficiently.

To deploy on Hugging Face Spaces:

1. Fork the repository to your GitHub account

2. Create a new Space on Hugging Face:
   - Go to [huggingface.co/spaces](https://huggingface.co/spaces)
   - Click "Create new Space"
   - Select "Streamlit" as the SDK
   - Connect your GitHub repository

3. Configure secrets in the Hugging Face UI:
   - Go to your Space settings
   - Open "Repository secrets"
   - Add each of the following secrets:
     - `OPENAI_API_KEY`
     - `GCP_CREDENTIALS` (the entire JSON as a string)
     - `BUCKET_NAME_GCS`
     - `LLM_MODEL`
     - `METADATA_PATH_GCS`
     - `RAW_TEXTS_UPLOADED_PATH_GCS`
     - `RAW_TEXTS_DOWNLOADED_PATH_GCS`
     - `CLEANED_TEXTS_PATH_GCS`
     - `EMBEDDINGS_PATH_GCS`
     - `INDICES_PATH_GCS`
     - `CHUNKS_PATH_GCS`
     - `EMBEDDING_MODEL`

4. The app should automatically deploy. If needed, manually trigger a rebuild from the Spaces UI.

### Local Development (Not Recommended)

**Important Note**: Running Anveshak: Spirituality Q&A locally requires more than 16 GB of RAM because of the embedding model. Most standard laptops will crash during model loading, so Hugging Face Spaces deployment is strongly recommended.

If you still want to run it locally for development purposes:

1. Clone the repository
   ```bash
   git clone https://github.com/YourUsername/anveshak.git
   cd anveshak
   ```

2. Install dependencies
   ```bash
   pip install -r requirements.txt
   ```

3. Create the `.streamlit/secrets.toml` file as described above

4. Run the application
   ```bash
   streamlit run app.py
   ```

### Alternative: Docker Deployment

For containerized deployment (not tested in production):

1. Create a `Dockerfile`:

```dockerfile
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

EXPOSE 8501

CMD ["streamlit", "run", "app.py"]
```

2. Build the Docker image:
```bash
docker build -t anveshak .
```

3. Run the container:
```bash
docker run -p 8501:8501 \
  -e BUCKET_NAME_GCS=your-bucket-name \
  -e METADATA_PATH_GCS=metadata/metadata.jsonl \
  -e EMBEDDINGS_PATH_GCS=processed/embeddings/all_embeddings.npy \
  -e INDICES_PATH_GCS=processed/indices/faiss_index.faiss \
  -e CHUNKS_PATH_GCS=processed/chunks/text_chunks.txt \
  -e RAW_TEXTS_UPLOADED_PATH_GCS=raw-texts/uploaded \
  -e RAW_TEXTS_DOWNLOADED_PATH_GCS=raw-texts/downloaded/ \
  -e CLEANED_TEXTS_PATH_GCS=cleaned-texts/ \
  -e EMBEDDING_MODEL=intfloat/e5-large-v2 \
  -e LLM_MODEL=gpt-3.5-turbo \
  -e OPENAI_API_KEY=your-openai-api-key \
  -e GCP_CREDENTIALS='{"type":"service_account",...}' \
  anveshak
```

## Performance Tuning

### Caching Configuration

Anveshak: Spirituality Q&A uses Streamlit's caching mechanisms to optimize performance:

#### Resource Caching
Used for loading models and data files that remain constant:

```python
@st.cache_resource(show_spinner=False)
def cached_load_model():
    # Load embedding model once and cache it
```

This cache persists for the lifetime of the application.

#### Data Caching
Used for caching query results with a time-to-live (TTL):

```python
@st.cache_data(ttl=3600, show_spinner=False)
def cached_process_query(query, top_k=5, word_limit=100):
    # Cache query results for an hour
```

The TTL (3600 seconds = 1 hour) can be adjusted based on your needs.

### Memory Optimization

For deployments with limited memory:

1. **Force CPU Usage**: Anveshak already forces CPU usage for the embedding model to avoid GPU memory issues:
   ```python
   os.environ["CUDA_VISIBLE_DEVICES"] = ""
   ```

2. **Adjust Batch Size**: If you're recreating the embeddings, consider reducing the batch size:
   ```python
   def create_embeddings(text_chunks, batch_size=16):  # Reduced from 32
   ```

3. **Garbage Collection**: Anveshak performs explicit garbage collection after operations:
   ```python
   del outputs, inputs
   gc.collect()
   ```

## Troubleshooting

### Common Issues

#### Authentication Errors

**Symptom**: Error message about invalid credentials or permission denied.

**Solution**:
1. Verify that your service account has the correct permissions (Storage Object Admin)
2. Check that your API keys are correctly formatted and not expired
3. Ensure that your GCP credentials JSON is valid and properly formatted

#### Missing Files

**Symptom**: Error about missing files or "File not found" when accessing GCS.

**Solution**:
1. Verify the correct bucket name and file paths in your configuration
2. Check that all required files exist in your GCS bucket
3. Ensure your service account has access to the specified bucket

#### Memory Issues

**Symptom**: Application crashes with out-of-memory errors.

**Solution**:
1. Increase the memory allocation for your deployment (if possible)
2. Ensure that `os.environ["CUDA_VISIBLE_DEVICES"] = ""` is set to force CPU usage
3. Implement additional garbage collection calls in high-memory operations

#### OpenAI API Rate Limits

**Symptom**: Errors about rate limits or exceeding quotas with OpenAI.

**Solution**:
1. Implement retry logic with exponential backoff (see the sketch below)
2. Consider using a paid tier OpenAI account with higher rate limits
3. Add caching to reduce the number of API calls
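
A minimal sketch of retry logic with exponential backoff, assuming the pre-1.0 `openai` Python client used elsewhere in this codebase (the wrapper name and parameters are illustrative, not part of the application):

```python
import time
import openai

def chat_completion_with_backoff(messages, model="gpt-3.5-turbo",
                                 max_retries=5, base_delay=1.0):
    """Hypothetical wrapper: retry on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(model=model, messages=messages)
        except openai.error.RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, 8s, ...
```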

### Logs and Debugging

Anveshak includes comprehensive logging:

```python
print(f"✅ Model loaded successfully (cached)")
print(f"❌ Error loading model: {str(e)}")
```

To enable more detailed logging, you can use Streamlit's built-in logging configuration:

```python
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Then use logger instead of print
logger.info("Model loaded successfully")
logger.error(f"Error loading model: {str(e)}")
```

## Special Considerations

### Privacy

Anveshak: Spirituality Q&A is designed not to save or store any user queries or data. This is important for spiritual questions, which may be of a personal nature. No additional configuration is needed for this - the application simply does not implement any data storage functionality.

### Language Support

Currently, Anveshak is only available in English. This is a known limitation of the current implementation. Future versions may include support for Sanskrit, Hindi, Bengali, Tamil, and other Indian languages.

### Concise Answers

Anveshak generates concise answers rather than lengthy explanations. This is by design, to respect both copyright constraints and the nature of spiritual wisdom, which often benefits from clarity and simplicity.

## Conclusion

This configuration guide provides all the necessary information to set up, customize, and deploy Anveshak: Spirituality Q&A. By following these instructions, you should be able to:

1. Configure the necessary authentication for GCS and OpenAI
2. Customize Anveshak's appearance and behavior
3. Deploy the application on Hugging Face Spaces (recommended) or other platforms
4. Optimize performance for your specific use case
5. Troubleshoot common issues

The flexibility of the configuration options allows you to adapt the application to different deployment environments while maintaining the core functionality of providing spiritually informed answers based on traditional texts from diverse traditions and teachers of all backgrounds.
docs/data-handling-doc.md ADDED
@@ -0,0 +1,687 @@
# Data Handling Explanation

This document explains how data is processed, stored, and handled in Anveshak: Spirituality Q&A, with special attention to ethical considerations and copyright respect.

## Data Sources

### Text Corpus Overview

Anveshak: Spirituality Q&A uses approximately 133 digitized spiritual texts sourced from freely available resources. These texts include:

- Ancient sacred literature (Vedas, Upanishads, Puranas, Sutras, Dharmaśāstras, and Agamas)
- Classical Indian texts (The Bhagavad Gita, The Śrīmad Bhāgavatam, and others)
- Indian historical texts (The Mahabharata and The Ramayana)
- Teachings of revered Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters of all genders, backgrounds, traditions, and walks of life

As stated in app.py:

> "Anveshak draws from a rich tapestry of spiritual wisdom found in classical Indian texts, philosophical treatises, and the teachings of revered Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters across centuries. The knowledge presented here spans multiple traditions, schools of thought, and spiritual lineages that have flourished in the Indian subcontinent and beyond."

### Ethical Sourcing

All texts included in Anveshak meet the following criteria:

1. **Public availability**: All texts were freely available from sources like archive.org
2. **Educational use**: Texts are used solely for educational purposes
3. **Proper attribution**: All sources are credited with author and publisher information
4. **Respect for copyright**: Implementation of word limits and other copyright-respecting measures

As mentioned in app.py:

> "Note that the sources consist of about 133 digitized texts, all of which were freely available over the internet (on sites like archive.org). Many of the texts are English translations of original (and in some cases, ancient) sacred and spiritual texts. All of the copyrights belong to the respective authors and publishers and we bow down in gratitude to their selfless work. Anveshak merely re-presents the ocean of spiritual knowledge and wisdom contained in the original works with relevant citations in a limited number of words."

## Data Processing Pipeline

### 1. Data Collection

The data collection process involves two methods as implemented in preprocessing.py:

#### Manual Upload
Texts are manually uploaded to Google Cloud Storage (GCS) through a preprocessing script:

```python
def upload_files_to_colab():
    """Upload raw text files and metadata from local machine to Colab."""
    # First, upload text files
    print("Step 1: Please upload your text files...")
    uploaded_text_files = files.upload()  # This will prompt the user to upload files

    # Create directory structure if it doesn't exist
    os.makedirs(LOCAL_RAW_TEXTS_FOLDER, exist_ok=True)

    # Move uploaded text files to the raw-texts folder
    for filename, content in uploaded_text_files.items():
        if filename.endswith(".txt"):
            with open(os.path.join(LOCAL_RAW_TEXTS_FOLDER, filename), "wb") as f:
                f.write(content)
            print(f"✅ Saved {filename} to {LOCAL_RAW_TEXTS_FOLDER}")
```

#### Web Downloading
Some texts are automatically downloaded from URLs listed in the metadata file:

```python
def download_text_files():
    """Fetch metadata, filter unuploaded files, and download text files."""
    metadata = fetch_metadata_from_gcs()
    # Filter entries where Uploaded is False
    files_to_download = [item for item in metadata if item["Uploaded"] == False]

    # Process only necessary files
    for item in files_to_download:
        name, author, url = item["Title"], item["Author"], item["URL"]
        if url.lower() == "not available":
            print(f"❌ Skipping {name} - No URL available.")
            continue

        try:
            response = requests.get(url)
            if response.status_code == 200:
                raw_text = response.text
                filename = "{}.txt".format(name.replace(" ", "_"))
                # Save to local first
                local_path = f"/tmp/{filename}"
                with open(local_path, "w", encoding="utf-8") as file:
                    file.write(raw_text)
                # Upload to GCS
                gcs_path = f"{RAW_TEXTS_DOWNLOADED_PATH_GCS}{filename}"
                upload_to_gcs(local_path, gcs_path)
                print(f"✅ Downloaded & uploaded: {filename} ({len(raw_text.split())} words)")
            else:
                print(f"❌ Failed to download {name}: {url} (Status {response.status_code})")
        except Exception as e:
            print(f"❌ Error processing {name}: {e}")
```

### 2. Text Cleaning

Raw texts often contain HTML tags, OCR errors, and formatting issues. The cleaning process removes these artifacts using the exact implementation from preprocessing.py:

```python
def rigorous_clean_text(text):
    """
    Clean text by removing metadata, junk text, and formatting issues.

    This function:
    1. Removes HTML tags using BeautifulSoup
    2. Removes URLs and standalone numbers
    3. Removes all-caps OCR noise words
    4. Deduplicates adjacent identical lines
    5. Normalizes Unicode characters
    6. Standardizes whitespace and newlines

    Args:
        text (str): The raw text to clean

    Returns:
        str: The cleaned text
    """
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r"https?:\/\/\S+", "", text)  # Remove links
    text = re.sub(r"\b\d+\b", "", text)  # Remove standalone numbers
    text = re.sub(r"\b[A-Z]{5,}\b", "", text)  # Remove all-caps OCR noise words
    lines = text.split("\n")
    cleaned_lines = []
    last_line = None

    for line in lines:
        line = line.strip()
        if line and line != last_line:
            cleaned_lines.append(line)
            last_line = line

    text = "\n".join(cleaned_lines)
    text = unicodedata.normalize("NFKD", text)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\n{2,}", "\n", text)
    return text
```

The cleaning process:
- Removes HTML tags using BeautifulSoup
- Eliminates URLs and standalone numbers
- Removes all-caps OCR noise words (common in digitized texts)
- Deduplicates adjacent identical lines
- Normalizes Unicode characters
- Standardizes whitespace and newlines

### 3. Text Chunking

Clean texts are split into smaller, manageable chunks for processing using the exact implementation from preprocessing.py:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """
    Split text into smaller, overlapping chunks for better retrieval.

    Args:
        text (str): The text to chunk
        chunk_size (int): Maximum number of words per chunk
        overlap (int): Number of words to overlap between chunks

    Returns:
        list: List of text chunks
    """
    words = text.split()
    chunks = []
    i = 0

    while i < len(words):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap

    return chunks
```

Chunking characteristics:
- **Chunk size**: 500 words per chunk, balancing context and retrieval precision
- **Overlap**: 50-word overlap between chunks to maintain context across chunk boundaries
- **Context preservation**: Ensures that passages aren't arbitrarily cut in the middle of important concepts
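
To make the overlap arithmetic concrete, here is a small usage example (illustrative only; it assumes the `chunk_text` function defined above):

```python
# A 1,000-word text with chunk_size=500 and overlap=50 yields chunks
# starting at word offsets 0, 450, and 900.
text = " ".join(["word"] * 1000)
chunks = chunk_text(text, chunk_size=500, overlap=50)

print(len(chunks))              # 3
print(len(chunks[0].split()))   # 500
print(len(chunks[-1].split()))  # 100 (the final tail chunk is shorter)
```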

### 4. Text Embedding

Chunks are converted to vector embeddings using the E5-large-v2 model with the actual implementation from preprocessing.py:

```python
def create_embeddings(text_chunks, batch_size=32):
    """
    Generate embeddings for the given chunks of text using the specified embedding model.

    This function:
    1. Uses SentenceTransformer to load the embedding model
    2. Prefixes each chunk with "passage:" as required by the E5 model
    3. Processes chunks in batches to manage memory usage
    4. Normalizes embeddings for cosine similarity search

    Args:
        text_chunks (list): List of text chunks to embed
        batch_size (int): Number of chunks to process at once

    Returns:
        numpy.ndarray: Matrix of embeddings, one per text chunk
    """
    # Load the model with GPU optimization
    model = SentenceTransformer(EMBEDDING_MODEL)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    print(f"🚀 Using device for embeddings: {device}")

    prefixed_chunks = [f"passage: {text}" for text in text_chunks]
    all_embeddings = []

    for i in range(0, len(prefixed_chunks), batch_size):
        batch = prefixed_chunks[i:i+batch_size]
        # Move batch to GPU (if available) for faster processing
        with torch.no_grad():
            batch_embeddings = model.encode(batch, convert_to_numpy=True, normalize_embeddings=True)
        all_embeddings.append(batch_embeddings)

        if (i + batch_size) % 100 == 0 or (i + batch_size) >= len(prefixed_chunks):
            print(f"📊 Processed {i + min(batch_size, len(prefixed_chunks) - i)}/{len(prefixed_chunks)} documents")

    return np.vstack(all_embeddings).astype("float32")
```

Embedding process details:
- **Model**: E5-large-v2, a state-of-the-art embedding model for retrieval tasks
- **Prefix**: "passage:" prefix is added to each chunk for optimal embedding
- **Batching**: Processing in batches of 32 for memory efficiency
- **Normalization**: Embeddings are normalized for cosine similarity search
- **Output**: Each text chunk becomes a 1024-dimensional vector

### 5. FAISS Index Creation

Embeddings are stored in a Facebook AI Similarity Search (FAISS) index for efficient similarity search:

```python
# Build FAISS index
dimension = all_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(all_embeddings)
```

FAISS index characteristics:
- **Index type**: IndexFlatIP (Inner Product) for cosine similarity search
- **Exact search**: Uses exact search rather than approximate for maximum accuracy
- **Dimension**: 1024-dimensional vectors from the E5-large-v2 model
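
A minimal sketch of querying the index built above; the random vector is a stand-in for a real, normalized 1024-dimensional query embedding:

```python
import faiss
import numpy as np

# Stand-in for a real query embedding from E5-large-v2
query_embedding = np.random.rand(1, 1024).astype("float32")
faiss.normalize_L2(query_embedding)  # with normalized vectors, inner product == cosine similarity

scores, ids = index.search(query_embedding, 5)  # top 5 matches
print(scores[0])  # cosine similarities, highest first
print(ids[0])     # positions of the matching chunks
```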

### 6. Metadata Management

The system maintains metadata for each text to provide proper citations, using the implementation from rag_engine.py:

```python
def fetch_metadata_from_gcs():
    """
    Fetch metadata.jsonl from GCS and return as a list of dictionaries.

    Each dictionary represents a text entry with metadata like title, author, etc.

    Returns:
        list: List of dictionaries containing metadata for each text
    """
    blob = bucket.blob(METADATA_PATH_GCS)
    # Download metadata file
    metadata_jsonl = blob.download_as_text()
    # Parse JSONL
    metadata = [json.loads(line) for line in metadata_jsonl.splitlines()]
    return metadata
```

Metadata structure (JSONL format):
```json
{"Title": "Bhagavad Gita", "Author": "Vyasa", "Publisher": "Gita Press, Gorakhpur, India", "URL": "https://archive.org/details/bhagavad-gita", "Uploaded": true}
{"Title": "Yoga Sutras", "Author": "Patanjali", "Publisher": "DIVINE LIFE SOCIETY", "URL": "https://archive.org/details/yoga-sutras", "Uploaded": true}
```

## Data Storage Architecture

### Google Cloud Storage Structure

Anveshak: Spirituality Q&A uses Google Cloud Storage (GCS) as its primary data store, organized as follows:

```
bucket_name/
├── metadata/
│   └── metadata.jsonl            # Metadata for all texts
├── raw-texts/
│   ├── uploaded/                 # Manually uploaded texts
│   └── downloaded/               # Automatically downloaded texts
├── cleaned-texts/                # Cleaned versions of all texts
└── processed/
    ├── embeddings/
    │   └── all_embeddings.npy    # Numpy array of embeddings
    ├── indices/
    │   └── faiss_index.faiss     # FAISS index file
    └── chunks/
        └── text_chunks.txt       # Text chunks with metadata
```

### Local Caching

For deployment on Hugging Face Spaces, essential files are downloaded to local storage using the implementation from rag_engine.py:

```python
# Local Paths
local_embeddings_file = "all_embeddings.npy"
local_faiss_index_file = "faiss_index.faiss"
local_text_chunks_file = "text_chunks.txt"
local_metadata_file = "metadata.jsonl"
```

These files are loaded with caching to improve performance, using the actual implementation from rag_engine.py:

```python
@st.cache_resource(show_spinner=False)
def cached_load_data_files():
    """
    Cached version of load_data_files() for FAISS index, text chunks, and metadata.

    This function loads:
    - FAISS index for vector similarity search
    - Text chunks containing the original spiritual text passages
    - Metadata dictionary with publication and author information

    All files are downloaded from Google Cloud Storage if not already present locally.

    Returns:
        tuple: (faiss_index, text_chunks, metadata_dict) or (None, None, None) if loading fails
    """
    # Initialize GCP and OpenAI clients
    bucket = setup_gcp_client()
    openai_initialized = setup_openai_client()

    if not bucket or not openai_initialized:
        print("Failed to initialize required services")
        return None, None, None

    # Get GCS paths from secrets - required
    try:
        metadata_file_gcs = st.secrets["METADATA_PATH_GCS"]
        embeddings_file_gcs = st.secrets["EMBEDDINGS_PATH_GCS"]
        faiss_index_file_gcs = st.secrets["INDICES_PATH_GCS"]
        text_chunks_file_gcs = st.secrets["CHUNKS_PATH_GCS"]
    except KeyError as e:
        print(f"❌ Error: Required GCS path not found in secrets: {e}")
        return None, None, None

    # Download necessary files if not already present locally
    success = True
    success &= download_file_from_gcs(bucket, faiss_index_file_gcs, local_faiss_index_file)
    success &= download_file_from_gcs(bucket, text_chunks_file_gcs, local_text_chunks_file)
    success &= download_file_from_gcs(bucket, metadata_file_gcs, local_metadata_file)

    if not success:
        print("Failed to download required files")
        return None, None, None

    # Load FAISS index
    try:
        faiss_index = faiss.read_index(local_faiss_index_file)
    except Exception as e:
        print(f"❌ Error loading FAISS index: {str(e)}")
        return None, None, None

    # Load text chunks
    try:
        text_chunks = {}  # Mapping: ID -> (Title, Author, Text)
        with open(local_text_chunks_file, "r", encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split("\t")
                if len(parts) == 4:
                    text_chunks[int(parts[0])] = (parts[1], parts[2], parts[3])
    except Exception as e:
        print(f"❌ Error loading text chunks: {str(e)}")
        return None, None, None

    # Load metadata
    try:
        metadata_dict = {}
        with open(local_metadata_file, "r", encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)
                metadata_dict[item["Title"]] = item
    except Exception as e:
        print(f"❌ Error loading metadata: {str(e)}")
        return None, None, None

    print(f"✅ Data loaded successfully (cached): {len(text_chunks)} passages available")
    return faiss_index, text_chunks, metadata_dict
```
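
The helper `download_file_from_gcs` used above is not shown in this document. A minimal sketch of what it might look like, assuming the google-cloud-storage client (the actual implementation in rag_engine.py may differ):

```python
import os

def download_file_from_gcs(bucket, gcs_path, local_path):
    """Download a blob from GCS unless it already exists locally.
    Returns True on success, False on failure."""
    try:
        if os.path.exists(local_path):
            print(f"✅ {local_path} already exists, skipping download")
            return True
        blob = bucket.blob(gcs_path)
        blob.download_to_filename(local_path)
        print(f"✅ Downloaded {gcs_path} -> {local_path}")
        return True
    except Exception as e:
        print(f"❌ Error downloading {gcs_path}: {str(e)}")
        return False
```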

## Data Access During Query Processing

### Query Embedding

User queries are embedded using the same model as the text corpus, with the actual implementation from rag_engine.py:

```python
def get_embedding(text):
    """
    Generate embeddings for a text query using the cached model.

    Uses an in-memory cache to avoid redundant embedding generation for repeated queries.
    Properly prefixes inputs with "query:" or "passage:" as required by the E5 model.

    Args:
        text (str): The query text to embed

    Returns:
        numpy.ndarray: The embedding vector or a zero vector if embedding fails
    """
    if text in query_embedding_cache:
        return query_embedding_cache[text]

    try:
        tokenizer, model = cached_load_model()
        if model is None:
            print("Model is None, returning zero embedding")
            # Fallback dimension matches E5-large-v2's 1024-dimensional output
            return np.zeros((1, 1024), dtype=np.float32)

        # Format input based on text length
        # For E5 models, "query:" prefix is for questions, "passage:" for documents
        input_text = f"query: {text}" if len(text) < 512 else f"passage: {text}"
        inputs = tokenizer(
            input_text,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=512,
            return_attention_mask=True
        )
        with torch.no_grad():
            outputs = model(**inputs)
            embeddings = average_pool(outputs.last_hidden_state, inputs['attention_mask'])
            embeddings = nn.functional.normalize(embeddings, p=2, dim=1)
            embeddings = embeddings.detach().cpu().numpy()
        del outputs, inputs
        gc.collect()
        query_embedding_cache[text] = embeddings
        return embeddings
    except Exception as e:
        print(f"❌ Embedding error: {str(e)}")
        return np.zeros((1, 1024), dtype=np.float32)
```

Note the use of:
- **Query prefix**: "query:" is added to distinguish query embeddings from passage embeddings
- **Truncation**: Queries are truncated to 512 tokens if necessary
- **Memory management**: Tensors are detached and moved to CPU after computation
- **Caching**: Query embeddings are cached to avoid redundant computation
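
The `average_pool` helper referenced in `get_embedding()` is not shown above. The standard pooling recommended for E5 models looks like the following sketch (mean-pooling over non-padding tokens; the project's own helper may differ):

```python
import torch
from torch import Tensor

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """Mean-pool token embeddings while ignoring padded positions."""
    # Zero out hidden states wherever the attention mask is 0 (padding)
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    # Sum over the sequence dimension and divide by the count of real tokens
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
```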
|
451 |
+
|
452 |
+
### Passage Retrieval
|
453 |
+
|
454 |
+
The system retrieves relevant passages based on query embedding similarity using the implementation from rag_engine.py:
|
455 |
+
|
456 |
+
```python
|
457 |
+
def retrieve_passages(query, faiss_index, text_chunks, metadata_dict, top_k=5, similarity_threshold=0.5):
|
458 |
+
"""
|
459 |
+
Retrieve the most relevant passages for a given spiritual query.
|
460 |
+
|
461 |
+
This function:
|
462 |
+
1. Embeds the user query using the same model used for text chunks
|
463 |
+
2. Finds similar passages using the FAISS index with cosine similarity
|
464 |
+
3. Filters results based on similarity threshold to ensure relevance
|
465 |
+
4. Enriches results with metadata (title, author, publisher)
|
466 |
+
5. Ensures passage diversity by including only one passage per source title
|
467 |
+
|
468 |
+
Args:
|
469 |
+
query (str): The user's spiritual question
|
470 |
+
faiss_index: FAISS index containing passage embeddings
|
471 |
+
text_chunks (dict): Dictionary mapping IDs to text chunks and metadata
|
472 |
+
metadata_dict (dict): Dictionary containing publication information
|
473 |
+
top_k (int): Maximum number of passages to retrieve
|
474 |
+
similarity_threshold (float): Minimum similarity score (0.0-1.0) for retrieved passages
|
475 |
+
|
476 |
+
Returns:
|
477 |
+
tuple: (retrieved_passages, retrieved_sources) containing the text and source information
|
478 |
+
"""
|
479 |
+
try:
|
480 |
+
print(f"\nπ Retrieving passages for query: {query}")
|
481 |
+
query_embedding = get_embedding(query)
|
482 |
+
distances, indices = faiss_index.search(query_embedding, top_k * 2)
|
483 |
+
print(f"Found {len(distances[0])} potential matches")
|
484 |
+
retrieved_passages = []
|
485 |
+
retrieved_sources = []
|
486 |
+
cited_titles = set()
|
487 |
+
for dist, idx in zip(distances[0], indices[0]):
|
488 |
+
print(f"Distance: {dist:.4f}, Index: {idx}")
|
489 |
+
if idx in text_chunks and dist >= similarity_threshold:
|
490 |
+
title_with_txt, author, text = text_chunks[idx]
|
491 |
+
clean_title = title_with_txt.replace(".txt", "") if title_with_txt.endswith(".txt") else title_with_txt
|
492 |
+
clean_title = unicodedata.normalize("NFC", clean_title)
|
493 |
+
if clean_title in cited_titles:
|
494 |
+
continue
|
495 |
+
metadata_entry = metadata_dict.get(clean_title, {})
|
496 |
+
author = metadata_entry.get("Author", "Unknown")
|
497 |
+
publisher = metadata_entry.get("Publisher", "Unknown")
|
498 |
+
cited_titles.add(clean_title)
|
499 |
+
retrieved_passages.append(text)
|
500 |
+
retrieved_sources.append((clean_title, author, publisher))
|
501 |
+
if len(retrieved_passages) == top_k:
|
502 |
+
break
|
503 |
+
print(f"Retrieved {len(retrieved_passages)} passages")
|
504 |
+
return retrieved_passages, retrieved_sources
|
505 |
+
except Exception as e:
|
506 |
+
print(f"β Error in retrieve_passages: {str(e)}")
|
507 |
+
return [], []
|
508 |
+
```
|
509 |
+
|
510 |
+
Important aspects:
|
511 |
+
- **Similarity threshold**: Passages must have a similarity score >= 0.5 to be included
|
512 |
+
- **Diversity**: Only one passage per source title is included in the results
|
513 |
+
- **Metadata enrichment**: Publisher information is added from the metadata
|
514 |
+
- **Configurable retrieval**: The `top_k` parameter allows users to adjust how many sources to use
|
515 |
+
|
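A hypothetical call, assuming the FAISS index, text chunks, and metadata have already been loaded (all variable names here are illustrative):

```python
# Illustrative only - assumes faiss_index, text_chunks, and metadata_dict are loaded.
passages, sources = retrieve_passages(
    query="What is the nature of the Self?",
    faiss_index=faiss_index,
    text_chunks=text_chunks,
    metadata_dict=metadata_dict,
    top_k=5,
    similarity_threshold=0.5,
)
for passage, (title, author, publisher) in zip(passages, sources):
    print(f"{title} ({author}): {passage[:80]}...")
```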
516 |
+
## User Data Privacy
|
517 |
+
|
518 |
+
### No Data Collection
|
519 |
+
|
520 |
+
Anveshak is designed to respect user privacy by not collecting or storing any user data:
|
521 |
+
|
522 |
+
1. **No Query Storage**: User questions are processed in memory and not saved
|
523 |
+
2. **No User Identification**: No user accounts or identification is required
|
524 |
+
3. **No Analytics**: No usage tracking or analytics are implemented
|
525 |
+
4. **No Cookies**: No browser cookies are used to track users
|
526 |
+
|
527 |
+
As stated in app.py:
|
528 |
+
|
529 |
+
> "We do not save any user data or queries. However, user questions are processed using OpenAI's LLM service to generate responses. While we do not store this information, please be aware that interactions are processed through OpenAI's platform and are subject to their privacy policies and data handling practices."
|
530 |
+
|
531 |
+
This privacy-first approach ensures that users can freely explore spiritual questions without concerns about their queries being stored or analyzed.
|
532 |
+
|
533 |
+
## Copyright and Ethical Considerations
|
534 |
+
|
535 |
+
### Word Limit Implementation
|
536 |
+
|
537 |
+
To respect copyright and ensure fair use, answers are limited to a configurable word count, as this excerpt from rag_engine.py shows:
|
538 |
+
|
539 |
+
```python
|
540 |
+
def answer_with_llm(query, context=None, word_limit=100):
|
541 |
+
# ... LLM processing ...
|
542 |
+
|
543 |
+
# Extract and format the answer
|
544 |
+
answer = response.choices[0].message.content.strip()
|
545 |
+
words = answer.split()
|
546 |
+
if len(words) > word_limit:
|
547 |
+
answer = " ".join(words[:word_limit])
|
548 |
+
if not answer.endswith((".", "!", "?")):
|
549 |
+
answer += "."
|
550 |
+
|
551 |
+
return answer
|
552 |
+
```
|
553 |
+
|
554 |
+
Users can adjust the word limit from 50 to 500 words, ensuring that responses are:
|
555 |
+
- Short enough to respect copyright (a worked example of the truncation rule follows this list)
|
556 |
+
- Long enough to provide meaningful information
|
557 |
+
- Always properly cited to the original source
|
558 |
+
|
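To see the truncation rule in isolation, here is the same logic applied to a toy answer with a deliberately small limit:

```python
# Worked example of the word-limit rule with a toy limit of 6 words.
answer = "The Self is beyond name and form and is known in stillness"
word_limit = 6
words = answer.split()
if len(words) > word_limit:
    answer = " ".join(words[:word_limit])
    if not answer.endswith((".", "!", "?")):
        answer += "."
print(answer)  # -> "The Self is beyond name and."
```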
559 |
+
### Citation Format
|
560 |
+
|
561 |
+
Every answer includes citations to the original sources using the implementation from rag_engine.py:
|
562 |
+
|
563 |
+
```python
|
564 |
+
def format_citations(sources):
|
565 |
+
"""
|
566 |
+
Format citations for display to the user.
|
567 |
+
|
568 |
+
Creates properly formatted citations for each source used in generating the answer.
|
569 |
+
Each citation appears on a new line with consistent formatting.
|
570 |
+
|
571 |
+
Args:
|
572 |
+
sources (list): List of (title, author, publisher) tuples
|
573 |
+
|
574 |
+
Returns:
|
575 |
+
str: Formatted citations as a string with each citation on a new line
|
576 |
+
"""
|
577 |
+
formatted_citations = []
|
578 |
+
for title, author, publisher in sources:
|
579 |
+
if publisher.endswith(('.', '!', '?')):
|
580 |
+
formatted_citations.append(f"π {title} by {author}, Published by {publisher}")
|
581 |
+
else:
|
582 |
+
formatted_citations.append(f"π {title} by {author}, Published by {publisher}.")
|
583 |
+
return "\n".join(formatted_citations)
|
584 |
+
```
|
585 |
+
|
586 |
+
Citations include:
|
587 |
+
- Book/text title
|
588 |
+
- Author name
|
589 |
+
- Publisher information (an example call is shown after this list)
|
590 |
+
|
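For example, a single-source call might look like this (the source tuple is illustrative, not necessarily a text in the corpus):

```python
# Illustrative call with one (title, author, publisher) tuple.
sources = [("Autobiography of a Yogi", "Paramahansa Yogananda", "Self-Realization Fellowship")]
print(format_citations(sources))
# -> Autobiography of a Yogi by Paramahansa Yogananda, Published by Self-Realization Fellowship.
```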
591 |
+
### Acknowledgment of Sources
|
592 |
+
|
593 |
+
Anveshak: Spirituality Q&A includes dedicated pages for acknowledging:
|
594 |
+
- Publishers of the original texts
|
595 |
+
- Saints, Sages, and Spiritual Masters whose teachings are referenced
|
596 |
+
- The origins and traditions of the spiritual texts
|
597 |
+
|
598 |
+
A thank-you note is also prominently featured on the main page, as shown in app.py:
|
599 |
+
|
600 |
+
```python
|
601 |
+
st.markdown('<div class="acknowledgment-header">A Heartfelt Thank You</div>', unsafe_allow_html=True)
|
602 |
+
st.markdown("""
|
603 |
+
It is believed that one cannot be in a spiritual path without the will of the Lord. One need not be a believer or a non-believer, merely proceeding to thoughtlessness and observation is enough to evolve and shape perspectives. But that happens through grace. It is believed that without the will of the Lord, one cannot be blessed by real Saints, and without the will of the Saints, one cannot get close to them or God.
|
604 |
+
|
605 |
+
Therefore, with deepest reverence, we express our gratitude to:
|
606 |
+
|
607 |
+
**The Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters** of all genders, backgrounds, traditions, and walks of life whose timeless wisdom illuminates Anveshak. From ancient Sages to modern Masters, their selfless dedication to uplift humanity through selfless love and spiritual knowledge continues to guide seekers on the path.
|
608 |
+
# ...
|
609 |
+
""")
|
610 |
+
```
|
611 |
+
|
612 |
+
### Inclusive Recognition
|
613 |
+
|
614 |
+
Anveshak explicitly acknowledges and honors spiritual teachers from all backgrounds:
|
615 |
+
|
616 |
+
- All references to spiritual figures capitalize the first letter (Saints, Sages, etc.)
|
617 |
+
- The application includes language acknowledging Masters of "all genders, backgrounds, traditions, and walks of life"
|
618 |
+
- The selection of texts aims to represent diverse spiritual traditions
|
619 |
+
|
620 |
+
From the Sources.py file:
|
621 |
+
|
622 |
+
> "Additionally, there are and there have been many other great Saints, enlightened beings, Sadhus, Sages, and Gurus who have worked tirelessly to uplift humanity and guide beings to their true SELF and path, of whom little is known and documented. We thank them and acknowledge their contribution to the world."
|
623 |
+
|
624 |
+
## Data Replication and Backup
|
625 |
+
|
626 |
+
### GCS as Primary Storage
|
627 |
+
|
628 |
+
Google Cloud Storage serves as both the primary storage and backup system:
|
629 |
+
|
630 |
+
- All preprocessed data is stored in GCS buckets
|
631 |
+
- GCS provides built-in redundancy and backup capabilities
|
632 |
+
- Data is loaded from GCS at application startup
|
633 |
+
|
634 |
+
### Local Caching
|
635 |
+
|
636 |
+
For performance, Anveshak caches data locally using the implementation from rag_engine.py:
|
637 |
+
|
638 |
+
```python
|
639 |
+
def download_file_from_gcs(bucket, gcs_path, local_path):
|
640 |
+
"""
|
641 |
+
Download a file from GCS to local storage if not already present.
|
642 |
+
|
643 |
+
Only downloads if the file isn't already present locally, avoiding redundant downloads.
|
644 |
+
|
645 |
+
Args:
|
646 |
+
bucket: GCS bucket object
|
647 |
+
gcs_path (str): Path to the file in GCS
|
648 |
+
local_path (str): Local path where the file should be saved
|
649 |
+
|
650 |
+
Returns:
|
651 |
+
bool: True if download was successful or file already exists, False otherwise
|
652 |
+
"""
|
653 |
+
try:
|
654 |
+
if os.path.exists(local_path):
|
655 |
+
print(f"File already exists locally: {local_path}")
|
656 |
+
return True
|
657 |
+
|
658 |
+
blob = bucket.blob(gcs_path)
|
659 |
+
blob.download_to_filename(local_path)
|
660 |
+
print(f"β
Downloaded {gcs_path} β {local_path}")
|
661 |
+
return True
|
662 |
+
except Exception as e:
|
663 |
+
print(f"β Error downloading {gcs_path}: {str(e)}")
|
664 |
+
return False
|
665 |
+
```
|
666 |
+
|
667 |
+
This approach:
|
668 |
+
- Avoids redundant downloads
|
669 |
+
- Preserves data across application restarts
|
670 |
+
- Reduces API calls to GCS (a startup sketch follows this list)
|
671 |
+
|
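At startup, the helper can be called once per artifact. This sketch assumes a bucket handle and reuses the GCS paths from the preprocessing configuration (the bucket name is a placeholder):

```python
# Hypothetical startup sequence using download_file_from_gcs.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")  # placeholder

for gcs_path, local_path in [
    ("processed/indices/faiss_index.faiss", "/tmp/faiss_index.faiss"),
    ("processed/chunks/text_chunks.txt", "/tmp/text_chunks.txt"),
    ("processed/embeddings/all_embeddings.npy", "/tmp/all_embeddings.npy"),
]:
    download_file_from_gcs(bucket, gcs_path, local_path)
```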
672 |
+
## Conclusion
|
673 |
+
|
674 |
+
Anveshak: Spirituality Q&A implements a comprehensive data handling strategy that:
|
675 |
+
|
676 |
+
1. **Respects Copyright**: Through word limits, citations, and acknowledgments
|
677 |
+
2. **Preserves Source Integrity**: By maintaining accurate metadata and citations
|
678 |
+
3. **Optimizes Performance**: Through efficient storage, retrieval, and caching
|
679 |
+
4. **Ensures Ethical Use**: By focusing on educational purposes and proper attribution
|
680 |
+
5. **Protects Privacy**: By not collecting or storing user data
|
681 |
+
6. **Honors Diversity**: By acknowledging spiritual teachers of all backgrounds and traditions
|
682 |
+
|
683 |
+
This balance between technical efficiency and ethical responsibility allows Anveshak to serve as a bridge to spiritual knowledge while respecting the original sources, traditions, and user privacy. The system is designed not to replace personal spiritual inquiry but to supplement it by making traditional wisdom more accessible.
|
684 |
+
|
685 |
+
As stated in the conclusion of the blog post:
|
686 |
+
|
687 |
+
> "The core philosophy guiding this project is that while technology can facilitate access to spiritual knowledge, the journey to self-discovery remains deeply personal. As Anveshak states: 'The path and journey to the SELF is designed to be undertaken alone. The all-encompassing knowledge is internal and not external.'"
|
scripts/preprocessing.ipynb
ADDED
@@ -0,0 +1,644 @@
1 |
+
{
|
2 |
+
"nbformat": 4,
|
3 |
+
"nbformat_minor": 0,
|
4 |
+
"metadata": {
|
5 |
+
"colab": {
|
6 |
+
"provenance": [],
|
7 |
+
"gpuType": "L4"
|
8 |
+
},
|
9 |
+
"kernelspec": {
|
10 |
+
"name": "python3",
|
11 |
+
"display_name": "Python 3"
|
12 |
+
},
|
13 |
+
"language_info": {
|
14 |
+
"name": "python"
|
15 |
+
},
|
16 |
+
"accelerator": "GPU"
|
17 |
+
},
|
18 |
+
"cells": [
|
19 |
+
{
|
20 |
+
"cell_type": "code",
|
21 |
+
"source": [
|
22 |
+
"\"\"\"\n",
|
23 |
+
"Anveshak: Spirituality Q&A - Data Preprocessing Pipeline\n",
|
24 |
+
"\n",
|
25 |
+
"This script processes the spiritual text corpus for the Anveshak application:\n",
|
26 |
+
"1. Uploads and downloads text files from various sources\n",
|
27 |
+
"2. Cleans and processes the texts to remove artifacts and noise\n",
|
28 |
+
"3. Chunks texts into smaller, manageable pieces\n",
|
29 |
+
"4. Generates embeddings using the E5-large-v2 model\n",
|
30 |
+
"5. Creates a FAISS index for efficient similarity search\n",
|
31 |
+
"6. Uploads all processed data to Google Cloud Storage\n",
|
32 |
+
"\n",
|
33 |
+
"Usage:\n",
|
34 |
+
"- Run in Google Colab with GPU runtime for faster embedding generation\n",
|
35 |
+
"- Ensure GCP authentication is set up before running\n",
|
36 |
+
"- Configure the constants below with your actual settings\n",
|
37 |
+
"\"\"\""
|
38 |
+
],
|
39 |
+
"metadata": {
|
40 |
+
"id": "Cyjr-eDz9GmH"
|
41 |
+
},
|
42 |
+
"execution_count": null,
|
43 |
+
"outputs": []
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"cell_type": "code",
|
47 |
+
"source": [
|
48 |
+
"# =============================================================================\n",
|
49 |
+
"# CONFIGURATION SETTINGS\n",
|
50 |
+
"# =============================================================================\n",
|
51 |
+
"# Update these values with your actual settings\n",
|
52 |
+
"# Before open-sourcing, clear these values or replace with placeholders\n",
|
53 |
+
"BUCKET_NAME_GCS = \"your-bucket-name\" # e.g., \"spiritual-texts-bucket\"\n",
|
54 |
+
"EMBEDDING_MODEL = \"your-embedding-model\" # e.g., \"intfloat/e5-large-v2\"\n",
|
55 |
+
"# LLM_MODEL = \"your-llm-model\" # e.g., \"gpt-3.5-turbo\"\n",
|
56 |
+
"\n",
|
57 |
+
"# GCS Paths - update these with your folder structure\n",
|
58 |
+
"METADATA_PATH_GCS = \"metadata/metadata.jsonl\"\n",
|
59 |
+
"RAW_TEXTS_UPLOADED_PATH_GCS = \"raw-texts/uploaded\"\n",
|
60 |
+
"RAW_TEXTS_DOWNLOADED_PATH_GCS = \"raw-texts/downloaded/\"\n",
|
61 |
+
"CLEANED_TEXTS_PATH_GCS = \"cleaned-texts/\"\n",
|
62 |
+
"EMBEDDINGS_PATH_GCS = \"processed/embeddings/all_embeddings.npy\"\n",
|
63 |
+
"INDICES_PATH_GCS = \"processed/indices/faiss_index.faiss\"\n",
|
64 |
+
"CHUNKS_PATH_GCS = \"processed/chunks/text_chunks.txt\"\n",
|
65 |
+
"\n",
|
66 |
+
"# Local file paths in Colab environment - update these with your folder structure\n",
|
67 |
+
"LOCAL_METADATA_FILE = \"/content/metadata.jsonl\"\n",
|
68 |
+
"LOCAL_RAW_TEXTS_FOLDER = \"/content/raw-texts/uploaded\"\n",
|
69 |
+
"LOCAL_EMBEDDINGS_FILE = \"/tmp/all_embeddings.npy\"\n",
|
70 |
+
"LOCAL_FAISS_INDEX_FILE = \"/tmp/faiss_index.faiss\"\n",
|
71 |
+
"LOCAL_TEXT_CHUNKS_FILE = \"/tmp/text_chunks.txt\""
|
72 |
+
],
|
73 |
+
"metadata": {
|
74 |
+
"id": "YEDyIvmoXsPB"
|
75 |
+
},
|
76 |
+
"execution_count": null,
|
77 |
+
"outputs": []
|
78 |
+
},
|
79 |
+
{
|
80 |
+
"cell_type": "code",
|
81 |
+
"execution_count": null,
|
82 |
+
"metadata": {
|
83 |
+
"id": "H1tEbKhur8xf"
|
84 |
+
},
|
85 |
+
"outputs": [],
|
86 |
+
"source": [
|
87 |
+
"# Install required packages\n",
|
88 |
+
"!pip install faiss-cpu"
|
89 |
+
]
|
90 |
+
},
|
91 |
+
{
|
92 |
+
"cell_type": "code",
|
93 |
+
"source": [
|
94 |
+
"# Import necessary libraries\n",
|
95 |
+
"from google.colab import files\n",
|
96 |
+
"from google.colab import auth\n",
|
97 |
+
"from google.cloud import storage\n",
|
98 |
+
"import os\n",
|
99 |
+
"import json\n",
|
100 |
+
"import requests\n",
|
101 |
+
"import re\n",
|
102 |
+
"import unicodedata\n",
|
103 |
+
"from bs4 import BeautifulSoup\n",
|
104 |
+
"import numpy as np\n",
|
105 |
+
"import faiss\n",
|
106 |
+
"import torch\n",
|
107 |
+
"from sentence_transformers import SentenceTransformer"
|
108 |
+
],
|
109 |
+
"metadata": {
|
110 |
+
"id": "xCDTvZJRse4-"
|
111 |
+
},
|
112 |
+
"execution_count": null,
|
113 |
+
"outputs": []
|
114 |
+
},
|
115 |
+
{
|
116 |
+
"cell_type": "code",
|
117 |
+
"source": [
|
118 |
+
"# =============================================================================\n",
|
119 |
+
"# AUTHENTICATION & INITIALIZATION\n",
|
120 |
+
"# =============================================================================\n",
|
121 |
+
"\n",
|
122 |
+
"# Authenticate with Google Cloud (only needed in Colab)\n",
|
123 |
+
"auth.authenticate_user()\n",
|
124 |
+
"\n",
|
125 |
+
"# Initialize GCS client (single initialization)\n",
|
126 |
+
"storage_client = storage.Client()\n",
|
127 |
+
"bucket = storage_client.bucket(BUCKET_NAME_GCS)"
|
128 |
+
],
|
129 |
+
"metadata": {
|
130 |
+
"id": "hSYQ0ZSasjLd"
|
131 |
+
},
|
132 |
+
"execution_count": null,
|
133 |
+
"outputs": []
|
134 |
+
},
|
135 |
+
{
|
136 |
+
"cell_type": "code",
|
137 |
+
"source": [
|
138 |
+
"# =============================================================================\n",
|
139 |
+
"# PART 1: UPLOAD RAW TEXTS AND METADATA\n",
|
140 |
+
"# =============================================================================\n",
|
141 |
+
"\n",
|
142 |
+
"def upload_files_to_colab():\n",
|
143 |
+
" \"\"\"\n",
|
144 |
+
" Upload raw text files and metadata from local machine to Colab.\n",
|
145 |
+
"\n",
|
146 |
+
" This function:\n",
|
147 |
+
" 1. Prompts the user to upload text files\n",
|
148 |
+
" 2. Saves the uploaded files to a local directory\n",
|
149 |
+
" 3. Prompts the user to upload the metadata.jsonl file\n",
|
150 |
+
" 4. Saves the metadata file to the specified location\n",
|
151 |
+
"\n",
|
152 |
+
" Returns:\n",
|
153 |
+
" bool: True if upload was successful, False otherwise\n",
|
154 |
+
" \"\"\"\n",
|
155 |
+
" # First, upload text files\n",
|
156 |
+
" print(\"Step 1: Please upload your text files...\")\n",
|
157 |
+
" uploaded_text_files = files.upload() # This will prompt the user to upload files\n",
|
158 |
+
"\n",
|
159 |
+
" # Create directory structure if it doesn't exist\n",
|
160 |
+
" os.makedirs(LOCAL_RAW_TEXTS_FOLDER, exist_ok=True)\n",
|
161 |
+
"\n",
|
162 |
+
" # Move uploaded text files to the raw-texts folder\n",
|
163 |
+
" for filename, content in uploaded_text_files.items():\n",
|
164 |
+
" if filename.endswith(\".txt\"):\n",
|
165 |
+
" with open(os.path.join(LOCAL_RAW_TEXTS_FOLDER, filename), \"wb\") as f:\n",
|
166 |
+
" f.write(content)\n",
|
167 |
+
" print(f\"β
Saved {filename} to {LOCAL_RAW_TEXTS_FOLDER}\")\n",
|
168 |
+
"\n",
|
169 |
+
" print(\"Text files upload complete!\")\n",
|
170 |
+
"\n",
|
171 |
+
" # Next, upload metadata file\n",
|
172 |
+
" print(\"\\nStep 2: Please upload your metadata.jsonl file...\")\n",
|
173 |
+
" uploaded_metadata = files.upload() # This will prompt the user to upload files\n",
|
174 |
+
"\n",
|
175 |
+
" # Save metadata file\n",
|
176 |
+
" metadata_uploaded = False\n",
|
177 |
+
" for filename, content in uploaded_metadata.items():\n",
|
178 |
+
" if filename == \"metadata.jsonl\":\n",
|
179 |
+
" # Ensure the directory for metadata file exists\n",
|
180 |
+
" os.makedirs(os.path.dirname(LOCAL_METADATA_FILE), exist_ok=True)\n",
|
181 |
+
" with open(LOCAL_METADATA_FILE, \"wb\") as f:\n",
|
182 |
+
" f.write(content)\n",
|
183 |
+
" print(f\"β
Saved metadata.jsonl to {LOCAL_METADATA_FILE}\")\n",
|
184 |
+
" metadata_uploaded = True\n",
|
185 |
+
"\n",
|
186 |
+
" if not metadata_uploaded:\n",
|
187 |
+
" print(\"β οΈ Warning: metadata.jsonl was not uploaded. Please upload it to continue.\")\n",
|
188 |
+
" return False\n",
|
189 |
+
"\n",
|
190 |
+
" print(\"Upload to Colab complete!\")\n",
|
191 |
+
" return True\n",
|
192 |
+
"\n",
|
193 |
+
"def upload_files_to_gcs():\n",
|
194 |
+
" \"\"\"\n",
|
195 |
+
" Upload raw text files and metadata from Colab to Google Cloud Storage.\n",
|
196 |
+
"\n",
|
197 |
+
" This function:\n",
|
198 |
+
" 1. Uploads each text file from the local directory to GCS\n",
|
199 |
+
" 2. Uploads the metadata.jsonl file to GCS\n",
|
200 |
+
"\n",
|
201 |
+
" All files are uploaded to the paths specified in the configuration constants.\n",
|
202 |
+
" \"\"\"\n",
|
203 |
+
" # Upload each file from the local raw-texts folder to GCS\n",
|
204 |
+
" for filename in os.listdir(LOCAL_RAW_TEXTS_FOLDER):\n",
|
205 |
+
" local_path = os.path.join(LOCAL_RAW_TEXTS_FOLDER, filename)\n",
|
206 |
+
" blob_path = f\"{RAW_TEXTS_UPLOADED_PATH_GCS}/{filename}\" # GCS path\n",
|
207 |
+
" blob = bucket.blob(blob_path)\n",
|
208 |
+
" try:\n",
|
209 |
+
" blob.upload_from_filename(local_path)\n",
|
210 |
+
" print(f\"β
Uploaded: {filename} -> gs://{BUCKET_NAME_GCS}/{blob_path}\")\n",
|
211 |
+
" except Exception as e:\n",
|
212 |
+
" print(f\"β Failed to upload {filename}: {e}\")\n",
|
213 |
+
"\n",
|
214 |
+
" # Upload metadata file\n",
|
215 |
+
" blob = bucket.blob(METADATA_PATH_GCS)\n",
|
216 |
+
" try:\n",
|
217 |
+
" blob.upload_from_filename(LOCAL_METADATA_FILE)\n",
|
218 |
+
" print(f\"β
Uploaded metadata.jsonl -> gs://{BUCKET_NAME_GCS}/{METADATA_PATH_GCS}\")\n",
|
219 |
+
" except Exception as e:\n",
|
220 |
+
" print(f\"β Failed to upload metadata: {e}\")"
|
221 |
+
],
|
222 |
+
"metadata": {
|
223 |
+
"id": "cShc029islmO"
|
224 |
+
},
|
225 |
+
"execution_count": null,
|
226 |
+
"outputs": []
|
227 |
+
},
|
228 |
+
{
|
229 |
+
"cell_type": "code",
|
230 |
+
"source": [
|
231 |
+
"# =============================================================================\n",
|
232 |
+
"# PART 2: DOWNLOAD AND CLEAN TEXTS\n",
|
233 |
+
"# =============================================================================\n",
|
234 |
+
"\n",
|
235 |
+
"def fetch_metadata_from_gcs():\n",
|
236 |
+
" \"\"\"\n",
|
237 |
+
" Fetch metadata.jsonl from GCS and return as a list of dictionaries.\n",
|
238 |
+
"\n",
|
239 |
+
" Each dictionary represents a text entry with metadata like title, author, etc.\n",
|
240 |
+
"\n",
|
241 |
+
" Returns:\n",
|
242 |
+
" list: List of dictionaries containing metadata for each text\n",
|
243 |
+
" \"\"\"\n",
|
244 |
+
" blob = bucket.blob(METADATA_PATH_GCS)\n",
|
245 |
+
" # Download metadata file\n",
|
246 |
+
" metadata_jsonl = blob.download_as_text()\n",
|
247 |
+
" # Parse JSONL\n",
|
248 |
+
" metadata = [json.loads(line) for line in metadata_jsonl.splitlines()]\n",
|
249 |
+
" return metadata\n",
|
250 |
+
"\n",
|
251 |
+
"def upload_to_gcs(source_file, destination_path):\n",
|
252 |
+
" \"\"\"\n",
|
253 |
+
" Upload a local file to Google Cloud Storage.\n",
|
254 |
+
"\n",
|
255 |
+
" Args:\n",
|
256 |
+
" source_file (str): Path to the local file\n",
|
257 |
+
" destination_path (str): Path in GCS where the file should be uploaded\n",
|
258 |
+
" \"\"\"\n",
|
259 |
+
" blob = bucket.blob(destination_path)\n",
|
260 |
+
" blob.upload_from_filename(source_file)\n",
|
261 |
+
" print(f\"π€ Uploaded to GCS: {destination_path}\")\n",
|
262 |
+
"\n",
|
263 |
+
"def download_text_files():\n",
|
264 |
+
" \"\"\"\n",
|
265 |
+
" Download text files from URLs specified in the metadata.\n",
|
266 |
+
"\n",
|
267 |
+
" This function:\n",
|
268 |
+
" 1. Fetches metadata from GCS\n",
|
269 |
+
" 2. Filters entries where Uploaded=False (texts to be downloaded)\n",
|
270 |
+
" 3. Downloads each text from its URL\n",
|
271 |
+
" 4. Uploads the downloaded text to GCS\n",
|
272 |
+
"\n",
|
273 |
+
" This allows automated collection of texts that weren't manually uploaded.\n",
|
274 |
+
" \"\"\"\n",
|
275 |
+
" metadata = fetch_metadata_from_gcs()\n",
|
276 |
+
" # Filter entries where Uploaded is False\n",
|
277 |
+
" files_to_download = [item for item in metadata if item[\"Uploaded\"] == False]\n",
|
278 |
+
" print(f\"π Found {len(files_to_download)} files to download\")\n",
|
279 |
+
"\n",
|
280 |
+
" # Process only necessary files\n",
|
281 |
+
" for item in files_to_download:\n",
|
282 |
+
" name, author, url = item[\"Title\"], item[\"Author\"], item[\"URL\"]\n",
|
283 |
+
" if url.lower() == \"not available\":\n",
|
284 |
+
" print(f\"β Skipping {name} - No URL available.\")\n",
|
285 |
+
" continue\n",
|
286 |
+
"\n",
|
287 |
+
" try:\n",
|
288 |
+
" response = requests.get(url)\n",
|
289 |
+
" if response.status_code == 200:\n",
|
290 |
+
" raw_text = response.text\n",
|
291 |
+
" filename = \"{}.txt\".format(name.replace(\" \", \"_\"))\n",
|
292 |
+
" # Save to local first\n",
|
293 |
+
" local_path = f\"/tmp/{filename}\"\n",
|
294 |
+
" with open(local_path, \"w\", encoding=\"utf-8\") as file:\n",
|
295 |
+
" file.write(raw_text)\n",
|
296 |
+
" # Upload to GCS\n",
|
297 |
+
" gcs_path = f\"{RAW_TEXTS_DOWNLOADED_PATH_GCS}{filename}\"\n",
|
298 |
+
" upload_to_gcs(local_path, gcs_path)\n",
|
299 |
+
" print(f\"β
Downloaded & uploaded: {filename} ({len(raw_text.split())} words)\")\n",
|
300 |
+
" # Clean up temp file\n",
|
301 |
+
" os.remove(local_path)\n",
|
302 |
+
" else:\n",
|
303 |
+
" print(f\"β Failed to download {name}: {url} (Status {response.status_code})\")\n",
|
304 |
+
" except Exception as e:\n",
|
305 |
+
" print(f\"β Error processing {name}: {e}\")\n",
|
306 |
+
"\n",
|
307 |
+
"def rigorous_clean_text(text):\n",
|
308 |
+
" \"\"\"\n",
|
309 |
+
" Clean text by removing metadata, junk text, and formatting issues.\n",
|
310 |
+
"\n",
|
311 |
+
" This function:\n",
|
312 |
+
" 1. Removes HTML tags using BeautifulSoup\n",
|
313 |
+
" 2. Removes URLs and standalone numbers\n",
|
314 |
+
" 3. Removes all-caps OCR noise words\n",
|
315 |
+
" 4. Deduplicates adjacent identical lines\n",
|
316 |
+
" 5. Normalizes Unicode characters\n",
|
317 |
+
" 6. Standardizes whitespace and newlines\n",
|
318 |
+
"\n",
|
319 |
+
" Args:\n",
|
320 |
+
" text (str): The raw text to clean\n",
|
321 |
+
"\n",
|
322 |
+
" Returns:\n",
|
323 |
+
" str: The cleaned text\n",
|
324 |
+
" \"\"\"\n",
|
325 |
+
" text = BeautifulSoup(text, \"html.parser\").get_text()\n",
|
326 |
+
" text = re.sub(r\"https?:\\/\\/\\S+\", \"\", text) # Remove links\n",
|
327 |
+
" text = re.sub(r\"\\b\\d+\\b\", \"\", text) # Remove standalone numbers\n",
|
328 |
+
" text = re.sub(r\"\\b[A-Z]{5,}\\b\", \"\", text) # Remove all-caps OCR noise words\n",
|
329 |
+
" lines = text.split(\"\\n\")\n",
|
330 |
+
" cleaned_lines = []\n",
|
331 |
+
" last_line = None\n",
|
332 |
+
"\n",
|
333 |
+
" for line in lines:\n",
|
334 |
+
" line = line.strip()\n",
|
335 |
+
" if line and line != last_line:\n",
|
336 |
+
" cleaned_lines.append(line)\n",
|
337 |
+
" last_line = line\n",
|
338 |
+
"\n",
|
339 |
+
" text = \"\\n\".join(cleaned_lines)\n",
|
340 |
+
" text = unicodedata.normalize(\"NFKD\", text)\n",
|
341 |
+
" text = re.sub(r\"\\s+\", \" \", text).strip()\n",
|
342 |
+
" text = re.sub(r\"\\n{2,}\", \"\\n\", text)\n",
|
343 |
+
" return text\n",
|
344 |
+
"\n",
|
345 |
+
"def clean_and_upload_texts():\n",
|
346 |
+
" \"\"\"\n",
|
347 |
+
" Download raw texts from GCS, clean them, and upload cleaned versions back to GCS.\n",
|
348 |
+
"\n",
|
349 |
+
" This function processes all texts in both the uploaded and downloaded folders:\n",
|
350 |
+
" 1. For each text file, downloads it from GCS\n",
|
351 |
+
" 2. Cleans the text using rigorous_clean_text()\n",
|
352 |
+
" 3. Uploads the cleaned version back to GCS in the cleaned-texts folder\n",
|
353 |
+
"\n",
|
354 |
+
" This step ensures that all texts are properly formatted before embedding generation.\n",
|
355 |
+
" \"\"\"\n",
|
356 |
+
" raw_texts_folders = [RAW_TEXTS_DOWNLOADED_PATH_GCS, RAW_TEXTS_UPLOADED_PATH_GCS] # Process both folders\n",
|
357 |
+
" total_files = 0 # Counter to track number of processed files\n",
|
358 |
+
"\n",
|
359 |
+
" for raw_texts_folder in raw_texts_folders:\n",
|
360 |
+
" # List all files in the current raw-texts folder\n",
|
361 |
+
" blobs = list(bucket.list_blobs(prefix=raw_texts_folder))\n",
|
362 |
+
" print(f\"π Found {len(blobs)} files in {raw_texts_folder}\")\n",
|
363 |
+
"\n",
|
364 |
+
" for blob in blobs:\n",
|
365 |
+
" if not blob.name.endswith(\".txt\"): # Skip non-text files\n",
|
366 |
+
" continue\n",
|
367 |
+
"\n",
|
368 |
+
" try:\n",
|
369 |
+
" # Download file\n",
|
370 |
+
" raw_text = blob.download_as_text().strip()\n",
|
371 |
+
" if not raw_text: # Skip empty files\n",
|
372 |
+
" print(f\"β οΈ Skipping empty file: {blob.name}\")\n",
|
373 |
+
" continue\n",
|
374 |
+
"\n",
|
375 |
+
" # Clean text\n",
|
376 |
+
" cleaned_text = rigorous_clean_text(raw_text)\n",
|
377 |
+
"\n",
|
378 |
+
" # Save cleaned text back to GCS\n",
|
379 |
+
" cleaned_blob_name = blob.name.replace(raw_texts_folder, CLEANED_TEXTS_PATH_GCS)\n",
|
380 |
+
" cleaned_blob = bucket.blob(cleaned_blob_name)\n",
|
381 |
+
" cleaned_blob.upload_from_string(cleaned_text, content_type=\"text/plain\")\n",
|
382 |
+
" print(f\"β
Cleaned & uploaded: {cleaned_blob_name} ({len(cleaned_text.split())} words, {len(cleaned_text)} characters)\")\n",
|
383 |
+
" total_files += 1\n",
|
384 |
+
" except Exception as e:\n",
|
385 |
+
" print(f\"β Error processing {blob.name}: {e}\")\n",
|
386 |
+
"\n",
|
387 |
+
" print(f\"π Cleaning process completed! Total cleaned & uploaded files: {total_files}\")"
|
388 |
+
],
|
389 |
+
"metadata": {
|
390 |
+
"id": "Vskwg984s25K"
|
391 |
+
},
|
392 |
+
"execution_count": null,
|
393 |
+
"outputs": []
|
394 |
+
},
|
395 |
+
{
|
396 |
+
"cell_type": "code",
|
397 |
+
"source": [
|
398 |
+
"# =============================================================================\n",
|
399 |
+
"# PART 3: GENERATE EMBEDDINGS AND INDEX\n",
|
400 |
+
"# =============================================================================\n",
|
401 |
+
"\n",
|
402 |
+
"def fetch_metadata_dict_from_gcs():\n",
|
403 |
+
" \"\"\"\n",
|
404 |
+
" Fetch metadata.jsonl from GCS and return as a dictionary.\n",
|
405 |
+
"\n",
|
406 |
+
" The dictionary is keyed by title for easy lookup during text processing.\n",
|
407 |
+
"\n",
|
408 |
+
" Returns:\n",
|
409 |
+
" dict: Dictionary mapping text titles to their metadata\n",
|
410 |
+
" \"\"\"\n",
|
411 |
+
" metadata_blob = bucket.blob(METADATA_PATH_GCS)\n",
|
412 |
+
" metadata_dict = {}\n",
|
413 |
+
"\n",
|
414 |
+
" if metadata_blob.exists():\n",
|
415 |
+
" metadata_content = metadata_blob.download_as_text()\n",
|
416 |
+
" for line in metadata_content.splitlines():\n",
|
417 |
+
" item = json.loads(line)\n",
|
418 |
+
" metadata_dict[item[\"Title\"]] = item # Keep space-based lookup\n",
|
419 |
+
" else:\n",
|
420 |
+
" print(\"β Metadata file not found in GCS\")\n",
|
421 |
+
"\n",
|
422 |
+
" return metadata_dict\n",
|
423 |
+
"\n",
|
424 |
+
"def chunk_text(text, chunk_size=500, overlap=50):\n",
|
425 |
+
" \"\"\"\n",
|
426 |
+
" Split text into smaller, overlapping chunks for better retrieval.\n",
|
427 |
+
"\n",
|
428 |
+
" Args:\n",
|
429 |
+
" text (str): The text to chunk\n",
|
430 |
+
" chunk_size (int): Maximum number of words per chunk\n",
|
431 |
+
" overlap (int): Number of words to overlap between chunks\n",
|
432 |
+
"\n",
|
433 |
+
" Returns:\n",
|
434 |
+
" list: List of text chunks\n",
|
435 |
+
" \"\"\"\n",
|
436 |
+
" words = text.split()\n",
|
437 |
+
" chunks = []\n",
|
438 |
+
" i = 0\n",
|
439 |
+
"\n",
|
440 |
+
" while i < len(words):\n",
|
441 |
+
" chunk = \" \".join(words[i:i + chunk_size])\n",
|
442 |
+
" chunks.append(chunk)\n",
|
443 |
+
" i += chunk_size - overlap\n",
|
444 |
+
"\n",
|
445 |
+
" return chunks\n",
|
446 |
+
"\n",
|
447 |
+
"def create_embeddings(text_chunks, batch_size=32):\n",
|
448 |
+
" \"\"\"\n",
|
449 |
+
" Generate embeddings for the given chunks of text using the specified embedding model.\n",
|
450 |
+
"\n",
|
451 |
+
" This function:\n",
|
452 |
+
" 1. Uses SentenceTransformer to load the embedding model\n",
|
453 |
+
" 2. Prefixes each chunk with \"passage:\" as required by the E5 model\n",
|
454 |
+
" 3. Processes chunks in batches to manage memory usage\n",
|
455 |
+
" 4. Normalizes embeddings for cosine similarity search\n",
|
456 |
+
"\n",
|
457 |
+
" Args:\n",
|
458 |
+
" text_chunks (list): List of text chunks to embed\n",
|
459 |
+
" batch_size (int): Number of chunks to process at once\n",
|
460 |
+
"\n",
|
461 |
+
" Returns:\n",
|
462 |
+
" numpy.ndarray: Matrix of embeddings, one per text chunk\n",
|
463 |
+
" \"\"\"\n",
|
464 |
+
" # Load the model with GPU optimization\n",
|
465 |
+
" model = SentenceTransformer(EMBEDDING_MODEL)\n",
|
466 |
+
" device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
467 |
+
" model = model.to(device)\n",
|
468 |
+
" print(f\"π Using device for embeddings: {device}\")\n",
|
469 |
+
"\n",
|
470 |
+
" prefixed_chunks = [f\"passage: {text}\" for text in text_chunks]\n",
|
471 |
+
" all_embeddings = []\n",
|
472 |
+
"\n",
|
473 |
+
" for i in range(0, len(prefixed_chunks), batch_size):\n",
|
474 |
+
" batch = prefixed_chunks[i:i+batch_size]\n",
|
475 |
+
"\n",
|
476 |
+
" # Move batch to GPU (if available) for faster processing\n",
|
477 |
+
" with torch.no_grad():\n",
|
478 |
+
" batch_embeddings = model.encode(batch, convert_to_numpy=True, normalize_embeddings=True)\n",
|
479 |
+
"\n",
|
480 |
+
" all_embeddings.append(batch_embeddings)\n",
|
481 |
+
"\n",
|
482 |
+
" if (i + batch_size) % 100 == 0 or (i + batch_size) >= len(prefixed_chunks):\n",
|
483 |
+
" print(f\"π Processed {i + min(batch_size, len(prefixed_chunks) - i)}/{len(prefixed_chunks)} documents\")\n",
|
484 |
+
"\n",
|
485 |
+
" return np.vstack(all_embeddings).astype(\"float32\")\n",
|
486 |
+
"\n",
|
487 |
+
"def process_cleaned_texts():\n",
|
488 |
+
" \"\"\"\n",
|
489 |
+
" Process cleaned texts to create embeddings, FAISS index, and text chunks with metadata.\n",
|
490 |
+
"\n",
|
491 |
+
" This function:\n",
|
492 |
+
" 1. Downloads all cleaned texts from GCS\n",
|
493 |
+
" 2. Chunks each text into smaller pieces\n",
|
494 |
+
" 3. Generates embeddings for each chunk\n",
|
495 |
+
" 4. Creates a FAISS index for similarity search\n",
|
496 |
+
" 5. Saves and uploads all processed data back to GCS\n",
|
497 |
+
"\n",
|
498 |
+
" This is the core processing step that prepares data for the RAG system.\n",
|
499 |
+
" \"\"\"\n",
|
500 |
+
" all_chunks = []\n",
|
501 |
+
" all_metadata = []\n",
|
502 |
+
" chunk_counter = 0\n",
|
503 |
+
"\n",
|
504 |
+
" metadata_dict = fetch_metadata_dict_from_gcs() # Load metadata\n",
|
505 |
+
"\n",
|
506 |
+
" # Optimized listing of blobs in cleaned-texts folder\n",
|
507 |
+
" blobs = list(storage_client.list_blobs(BUCKET_NAME_GCS, prefix=CLEANED_TEXTS_PATH_GCS))\n",
|
508 |
+
" print(f\"π Found {len(blobs)} files in {CLEANED_TEXTS_PATH_GCS}\")\n",
|
509 |
+
"\n",
|
510 |
+
" if not blobs:\n",
|
511 |
+
" print(f\"β No files found in {CLEANED_TEXTS_PATH_GCS}. Exiting.\")\n",
|
512 |
+
" return\n",
|
513 |
+
"\n",
|
514 |
+
" for blob in blobs:\n",
|
515 |
+
" file_name = blob.name.split(\"/\")[-1]\n",
|
516 |
+
" if not file_name or file_name.startswith(\".\"):\n",
|
517 |
+
" continue # Skip empty or hidden files\n",
|
518 |
+
"\n",
|
519 |
+
" # Convert filename back to space-based title for metadata lookup\n",
|
520 |
+
" book_name = file_name.replace(\"_\", \" \")\n",
|
521 |
+
" metadata = metadata_dict.get(book_name, {\"Author\": \"Unknown\", \"Publisher\": \"Unknown\"})\n",
|
522 |
+
" author = metadata.get(\"Author\", \"Unknown\")\n",
|
523 |
+
"\n",
|
524 |
+
" try:\n",
|
525 |
+
" # Download and read text\n",
|
526 |
+
" raw_text = blob.download_as_text().strip()\n",
|
527 |
+
"\n",
|
528 |
+
" # Skip empty or corrupt files\n",
|
529 |
+
" if not raw_text:\n",
|
530 |
+
" print(f\"β Skipping empty file: {file_name}\")\n",
|
531 |
+
" continue\n",
|
532 |
+
"\n",
|
533 |
+
" chunks = chunk_text(raw_text)\n",
|
534 |
+
" print(f\"β
Processed {book_name}: {len(chunks)} chunks\")\n",
|
535 |
+
"\n",
|
536 |
+
" for chunk in chunks:\n",
|
537 |
+
" all_chunks.append(chunk)\n",
|
538 |
+
" all_metadata.append((chunk_counter, book_name, author))\n",
|
539 |
+
" chunk_counter += 1\n",
|
540 |
+
" except Exception as e:\n",
|
541 |
+
" print(f\"β Error processing {file_name}: {e}\")\n",
|
542 |
+
"\n",
|
543 |
+
" # Ensure there are chunks before embedding generation\n",
|
544 |
+
" if not all_chunks:\n",
|
545 |
+
" print(\"β No chunks found. Skipping embedding generation.\")\n",
|
546 |
+
" return\n",
|
547 |
+
"\n",
|
548 |
+
" # Create embeddings with GPU acceleration\n",
|
549 |
+
" print(f\"π Creating embeddings for {len(all_chunks)} total chunks...\")\n",
|
550 |
+
" all_embeddings = create_embeddings(all_chunks)\n",
|
551 |
+
"\n",
|
552 |
+
" # Build FAISS index\n",
|
553 |
+
" dimension = all_embeddings.shape[1]\n",
|
554 |
+
" index = faiss.IndexFlatIP(dimension)\n",
|
555 |
+
" index.add(all_embeddings)\n",
|
556 |
+
" print(f\"β
FAISS index built with {index.ntotal} vectors\")\n",
|
557 |
+
"\n",
|
558 |
+
" # Save & upload embeddings\n",
|
559 |
+
" np.save(LOCAL_EMBEDDINGS_FILE, all_embeddings) # Save locally first\n",
|
560 |
+
" embeddings_blob = bucket.blob(EMBEDDINGS_PATH_GCS)\n",
|
561 |
+
" embeddings_blob.upload_from_filename(LOCAL_EMBEDDINGS_FILE)\n",
|
562 |
+
" print(f\"β
Uploaded embeddings to GCS: {EMBEDDINGS_PATH_GCS}\")\n",
|
563 |
+
"\n",
|
564 |
+
" # Save & upload FAISS index\n",
|
565 |
+
" faiss.write_index(index, LOCAL_FAISS_INDEX_FILE)\n",
|
566 |
+
" index_blob = bucket.blob(INDICES_PATH_GCS)\n",
|
567 |
+
" index_blob.upload_from_filename(LOCAL_FAISS_INDEX_FILE)\n",
|
568 |
+
" print(f\"β
Uploaded FAISS index to GCS: {INDICES_PATH_GCS}\")\n",
|
569 |
+
"\n",
|
570 |
+
" # Save and upload text chunks with metadata\n",
|
571 |
+
" with open(LOCAL_TEXT_CHUNKS_FILE, \"w\", encoding=\"utf-8\") as f:\n",
|
572 |
+
" for i, (chunk_id, book_name, author) in enumerate(all_metadata):\n",
|
573 |
+
" f.write(f\"{i}\\t{book_name}\\t{author}\\t{all_chunks[i]}\\n\")\n",
|
574 |
+
"\n",
|
575 |
+
" chunks_blob = bucket.blob(CHUNKS_PATH_GCS)\n",
|
576 |
+
" chunks_blob.upload_from_filename(LOCAL_TEXT_CHUNKS_FILE)\n",
|
577 |
+
" print(f\"β
Uploaded text chunks to GCS: {CHUNKS_PATH_GCS}\")\n",
|
578 |
+
"\n",
|
579 |
+
" # Clean up temp files\n",
|
580 |
+
" os.remove(LOCAL_EMBEDDINGS_FILE)\n",
|
581 |
+
" os.remove(LOCAL_FAISS_INDEX_FILE)\n",
|
582 |
+
" os.remove(LOCAL_TEXT_CHUNKS_FILE)"
|
583 |
+
],
|
584 |
+
"metadata": {
|
585 |
+
"id": "1Yul8p9JsN1e"
|
586 |
+
},
|
587 |
+
"execution_count": null,
|
588 |
+
"outputs": []
|
589 |
+
},
|
590 |
+
{
|
591 |
+
"cell_type": "code",
|
592 |
+
"source": [
|
593 |
+
"# =============================================================================\n",
|
594 |
+
"# PART 4: MAIN EXECUTION\n",
|
595 |
+
"# =============================================================================\n",
|
596 |
+
"\n",
|
597 |
+
"def run_pipeline():\n",
|
598 |
+
" \"\"\"\n",
|
599 |
+
" Run the complete end-to-end preprocessing pipeline.\n",
|
600 |
+
"\n",
|
601 |
+
" This function executes all steps in sequence:\n",
|
602 |
+
" 1. Upload files from local to Colab\n",
|
603 |
+
" 2. Upload raw texts and metadata to GCS\n",
|
604 |
+
" 3. Download texts from URLs specified in metadata\n",
|
605 |
+
" 4. Clean and process all texts\n",
|
606 |
+
" 5. Generate embeddings and build the FAISS index\n",
|
607 |
+
"\n",
|
608 |
+
" This is the main entry point for the preprocessing script.\n",
|
609 |
+
" \"\"\"\n",
|
610 |
+
" print(\"π Starting pipeline execution...\")\n",
|
611 |
+
"\n",
|
612 |
+
" print(\"\\n==== STEP 1: Uploading files from local to Colab ====\")\n",
|
613 |
+
" upload_successful = upload_files_to_colab()\n",
|
614 |
+
"\n",
|
615 |
+
" if not upload_successful:\n",
|
616 |
+
" print(\"β Pipeline halted due to missing metadata file.\")\n",
|
617 |
+
" return\n",
|
618 |
+
"\n",
|
619 |
+
" print(\"\\n==== STEP 2: Uploading raw texts and metadata to GCS ====\")\n",
|
620 |
+
" upload_files_to_gcs()\n",
|
621 |
+
"\n",
|
622 |
+
" print(\"\\n==== STEP 3: Downloading texts from URLs ====\")\n",
|
623 |
+
" download_text_files()\n",
|
624 |
+
"\n",
|
625 |
+
" print(\"\\n==== STEP 4: Cleaning and processing texts ====\")\n",
|
626 |
+
" clean_and_upload_texts()\n",
|
627 |
+
"\n",
|
628 |
+
" print(\"\\n==== STEP 5: Generating embeddings and building index ====\")\n",
|
629 |
+
" process_cleaned_texts()\n",
|
630 |
+
"\n",
|
631 |
+
" print(\"\\nβ
Pipeline execution completed successfully!\")\n",
|
632 |
+
"\n",
|
633 |
+
"# Execute the complete pipeline\n",
|
634 |
+
"if __name__ == \"__main__\":\n",
|
635 |
+
" run_pipeline()"
|
636 |
+
],
|
637 |
+
"metadata": {
|
638 |
+
"id": "XXB_eYvj-I0i"
|
639 |
+
},
|
640 |
+
"execution_count": null,
|
641 |
+
"outputs": []
|
642 |
+
}
|
643 |
+
]
|
644 |
+
}
|