ankanghosh committed on
Commit
e5a3f40
·
verified ·
1 Parent(s): bd05b31

Upload 10 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ docs/assets/app_screenshot.png filter=lfs diff=lfs merge=lfs -text
.gitignore CHANGED
@@ -1,24 +1,49 @@
- # Ignore Google Cloud credentials and API keys
+ # Secrets and Credentials
  .streamlit/secrets.toml
+ *.env
  temp_credentials.json
+ secrets.json

- # Ignore downloaded files (FAISS, metadata, embeddings)
+ # Data and Model Files
  metadata.jsonl
  faiss_index.faiss
  text_chunks.txt
  all_embeddings.npy
+ *.npy
+ *.pt
+ *.pth
+ *.bin

- # Python cache files
+ # Python-specific
  __pycache__/
  *.pyc
  *.pyo
  *.pyd
  *.ipynb_checkpoints/

- # Virtual environment (if using one locally)
+ # Virtual Environments
  venv/
- .env
+ .venv/
+ env/
+ .env/

- # Logs & temp files
+ # Logs and Temporary Files
  logs/
- *.log
+ *.log
+ temp/
+ .tmp/
+
+ # OS-specific Files
+ .DS_Store
+ Thumbs.db
+
+ # IDE Files
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # Deployment and Build
+ *.egg-info/
+ dist/
+ build/
LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [2025] [Ankan Ghosh]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
docs/README.md ADDED
@@ -0,0 +1,206 @@
1
+ # Anveshak: Spirituality Q&A
2
+
3
+ [![Open in Spaces](https://img.shields.io/badge/πŸ€—-Open%20in%20Spaces-blue.svg)](https://huggingface.co/spaces/ankanghosh/anveshak)
4
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
5
+
6
+ A Retrieval-Augmented Generation (RAG) application that provides concise answers to spiritual questions by referencing a curated collection of Indian spiritual texts, philosophical treatises, and teachings from revered Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters of all genders, backgrounds, traditions, and walks of life.
7
+
8
+ <p align="center">
9
+ <img src="assets/app_screenshot.png" alt="Application Screenshot" width="800"/>
10
+ </p>
11
+
12
+ ## Overview
13
+
14
+ Anveshak (meaning "seeker" in Sanskrit) serves as a bridge between ancient Indian spiritual wisdom and modern technology, allowing users to ask questions and receive answers grounded in traditional spiritual texts. The system combines the power of modern AI with the timeless wisdom found in these texts, making spiritual knowledge more accessible to seekers.
15
+
16
+ Our goal is to make a small contribution to the journey of beings toward self-discovery by making this knowledge available and accessible within ethical, moral, and resource-based constraints. **We have no commercial or for-profit interests; this application is purely for educational purposes.**
17
+
18
+ As stated in the application: "The path and journey to the SELF is designed to be undertaken alone. The all-encompassing knowledge is internal and not external."
19
+
20
+ ### Key Features
21
+
22
+ - **Question-answering:** Ask spiritual questions and receive concise answers grounded in traditional texts
23
+ - **Source citations:** All answers include references to the original texts
24
+ - **Configurable retrieval:** Adjust the number of sources and word limit for answers
25
+ - **Responsive interface:** Built with Streamlit for a clean, accessible experience
26
+ - **Privacy-focused:** No user data or queries are saved
27
+ - **Inclusive recognition:** Acknowledges spiritual teachers from all backgrounds, genders, and traditions
28
+
29
+ ## 🧠 How It Works
30
+
31
+ Anveshak follows a classic RAG architecture:
32
+
33
+ 1. **Data processing pipeline:** Collects, cleans, and processes ~133 spiritual texts
34
+ 2. **Text embedding:** Uses the E5-large-v2 model to create vector representations of text chunks
35
+ 3. **Vector storage:** Stores embeddings in a FAISS index for fast similarity search
36
+ 4. **Retrieval system:** Finds relevant passages from the text collection based on user queries
37
+ 5. **Generation system:** Synthesizes concise answers from retrieved passages using a large language model
38
+
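+ As a rough illustration of steps 2–3, retrieval boils down to embedding the question and searching the FAISS index. The snippet below is a simplified sketch using the `sentence-transformers` package, not the application's exact code; the model and file names follow the configuration shown later in this README.
+
+ ```python
+ import faiss
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("intfloat/e5-large-v2")
+ index = faiss.read_index("faiss_index.faiss")
+
+ # E5 models expect a "query: " prefix; normalized vectors make inner product equal cosine similarity.
+ query_vec = model.encode(["query: What is the nature of the Self?"], normalize_embeddings=True)
+ scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 5)  # top-5 matching text chunks
+ ```
+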
39
+ ## 🚀 Getting Started
40
+
41
+ ### Prerequisites
42
+
43
+ - Python 3.8 or higher
44
+ - [Google Cloud Storage](https://cloud.google.com/storage) account for data storage
45
+ - [OpenAI API](https://openai.com/api/) key for generation
46
+
47
+ ### Installation
48
+
49
+ 1. Clone the repository
50
+ ```bash
51
+ git clone https://github.com/YourUsername/anveshak.git
52
+ cd anveshak
53
+ ```
54
+
55
+ 2. Install dependencies
56
+ ```bash
57
+ pip install -r requirements.txt
58
+ ```
59
+
60
+ 3. Configure authentication
61
+ - Create a `.streamlit/secrets.toml` file with the following structure:
62
+ ```toml
63
+ # GCP Configuration
64
+ BUCKET_NAME_GCS = "your-bucket-name"
65
+ METADATA_PATH_GCS = "metadata/metadata.jsonl"
66
+ EMBEDDINGS_PATH_GCS = "processed/embeddings/all_embeddings.npy"
67
+ INDICES_PATH_GCS = "processed/indices/faiss_index.faiss"
68
+ CHUNKS_PATH_GCS = "processed/chunks/text_chunks.txt"
69
+ EMBEDDING_MODEL = "intfloat/e5-large-v2"
70
+ LLM_MODEL = "gpt-3.5-turbo"
71
+
72
+ # OpenAI API Configuration
73
+ openai_api_key = "your-openai-api-key"
74
+
75
+ # GCP Service Account Credentials (JSON format)
76
+ [gcp_credentials]
77
+ type = "service_account"
78
+ project_id = "your-project-id"
79
+ private_key_id = "your-private-key-id"
80
+ private_key = "your-private-key"
81
+ client_email = "your-client-email"
82
+ client_id = "your-client-id"
83
+ auth_uri = "https://accounts.google.com/o/oauth2/auth"
84
+ token_uri = "https://oauth2.googleapis.com/token"
85
+ auth_provider_x509_cert_url = "https://www.googleapis.com/oauth2/v1/certs"
86
+ client_x509_cert_url = "your-client-cert-url"
87
+ ```
88
+
89
+ ### Running the Application Locally
90
+
91
+ **Important Note**: Running Anveshak locally requires more than 16 GB of RAM because of the embedding model; most standard laptops will crash while the model loads. Deploying on Hugging Face Spaces is strongly recommended instead.
92
+
93
+ ```bash
94
+ streamlit run app.py
95
+ ```
96
+
97
+ The application will be available at http://localhost:8501.
98
+
99
+ ### Deploying to Hugging Face Spaces
100
+
101
+ This application is designed for deployment on [Hugging Face Spaces](https://huggingface.co/spaces):
102
+
103
+ 1. Fork this repository to your GitHub account
104
+ 2. Create a new Space on Hugging Face:
105
+ - Go to [huggingface.co/spaces](https://huggingface.co/spaces)
106
+ - Click "Create new Space"
107
+ - Select "Streamlit" as the SDK
108
+ - Connect your GitHub repository
109
+ 3. Configure secrets in the Hugging Face UI:
110
+ - Go to your Space settings
111
+ - Under "Repository secrets"
112
+ - Add each of the required secrets from your `.streamlit/secrets.toml` file
113
+
114
+ ## 📚 Project Structure
115
+
116
+ ```
117
+ anveshak/
118
+ ├── .gitignore                # Specifies intentionally untracked files to ignore
+ ├── .gitattributes            # Defines attributes for pathnames in the repository
+ ├── app.py                    # Main Streamlit application
+ ├── requirements.txt          # Python dependencies
+ ├── rag_engine.py             # Core RAG functionality
+ ├── utils.py                  # Utility functions for authentication
+ ├── pages/                    # Streamlit pages
+ │   ├── 1_Sources.py          # Sources information page
+ │   ├── 2_Publishers.py       # Publisher acknowledgments page
+ │   └── 3_Contact_us.py       # Contact information page
+ ├── docs/                     # Documentation
+ │   ├── architecture-doc.md   # Architecture details
+ │   ├── data-handling-doc.md  # Data handling explanation
+ │   ├── configuration-doc.md  # Configuration guide
+ │   ├── changelog-doc.md      # Project change log
+ │   └── README.md             # Project overview and instructions
+ └── scripts/                  # Data processing scripts
+     └── preprocessing.ipynb   # Text preprocessing notebook
136
+ ```
137
+
138
+ ## 🔒 Data Privacy & Ethics
139
+
140
+ - Anveshak: Spirituality Q&A **does not** save any user data or queries
141
+ - All texts are sourced from freely available resources with proper attribution
142
+ - Publisher acknowledgments are included within the application
143
+ - Word limits are implemented to prevent excessive content reproduction and respect copyright
144
+ - User queries are processed using OpenAI's services but not stored by Anveshak
145
+ - The application presents information with appropriate reverence for spiritual traditions
146
+ - Responses are generated by AI based on the retrieved texts and may not perfectly represent the original teachings, intended meaning, or context
147
+ - The inclusion of any spiritual teacher, text, or tradition does not imply their endorsement of Anveshak
148
+
149
+ ## 🔄 Data Flow
150
+
151
+ ```
152
+ ┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
+ │                   │     │                   │     │                   │
+ │   Data Pipeline   │────▶│ Retrieval System  │────▶│ Generation System │
+ │                   │     │                   │     │                   │
+ └───────────────────┘     └───────────────────┘     └───────────────────┘
+           ▲                         ▲                         │
+           │                         │                         │
+ ┌───────────────┐        ┌───────────────┐           ┌────────▼────────┐
+ │               │        │               │           │                 │
+ │   Spiritual   │        │  User Query   │           │  Final Answer   │
+ │  Text Corpus  │        │               │           │ with Citations  │
+ │               │        │               │           │                 │
+ └───────────────┘        └───────────────┘           └─────────────────┘
165
+ ```
166
+
167
+ ## 📝 Notes
168
+
169
+ - Anveshak: Spirituality Q&A is designed to provide concise answers rather than lengthy explanations or lists
170
+ - The application is not a general chatbot or conversational AI. It is specifically designed to answer spiritual questions with short, concise answers based on referenced texts.
171
+ - You may receive slightly different answers when asking the same question multiple times. This variation is intentional and reflects the nuanced nature of spiritual teachings across different traditions.
172
+ - Currently, Anveshak is only available in English
173
+ - The application acknowledges and honors spiritual teachers from all backgrounds, genders, traditions, and walks of life
174
+ - **Anveshak is a tool, not a substitute for direct spiritual guidance, personal practice, or studying original texts in their complete form.**
175
+
176
+ ## 🙏 Acknowledgments
177
+
178
+ Anveshak: Spirituality Q&A is made possible by the wisdom contained in numerous spiritual texts and the teachings of revered Saints, Sages, and Spiritual Masters from India and beyond. We extend our sincere gratitude to:
179
+
180
+ - **The Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters** of all genders, backgrounds, traditions, and walks of life whose timeless wisdom illuminates this application
181
+ - **The Sacred Texts** that have preserved the eternal truths across millennia
182
+ - **The Publishers** who have diligently preserved and disseminated these precious teachings
183
+ - **The Authors** who have dedicated their lives to interpreting and explaining complex spiritual concepts
184
+
185
+ See the "Publishers" and "Sources" pages within the application for complete acknowledgments.
186
+
187
+ ## Future Roadmap
188
+
189
+ - **Multi-language support** (Sanskrit, Hindi, Bengali, Tamil, and more)
190
+ - **Enhanced retrieval** with hybrid retrieval methods
191
+ - **Self-hosted open-source LLM integration**
192
+ - **User feedback collection** for answer quality
193
+ - **Personalized learning paths** based on user interests (implemented with privacy-preserving approaches like client-side storage, session-based preferences, or explicit opt-in)
194
+
195
+ For a complete roadmap, see the [changelog](changelog-doc.md).
196
+
197
+ ## Blog and Additional Resources
198
+ Read our detailed blog post about the project: [Anveshak: Spirituality Q&A - Bridging Faith and Intelligence](https://researchguy.in/anveshak-spirituality-qa-bridging-faith-and-intelligence/)
199
+
200
+ ## 📜 License
201
+
202
+ This project is licensed under the Apache License 2.0 - see the [LICENSE](../LICENSE) file for details.
203
+
204
+ ## 📞 Contact
205
+
206
+ For questions, feedback, or suggestions, please contact us at [email protected].
docs/architecture-doc.md ADDED
@@ -0,0 +1,554 @@
1
+ # Architecture Document
2
+
3
+ This document provides a detailed overview of the architecture, component interactions, and technical design decisions behind Anveshak: Spirituality Q&A.
4
+
5
+ ## System Architecture Overview
6
+
7
+ Anveshak: Spirituality Q&A follows a Retrieval-Augmented Generation (RAG) architecture pattern, combining information retrieval with language generation to produce factual, grounded answers to spiritual questions.
8
+
9
+ ### High-Level Architecture Diagram
10
+
11
+ ```
12
+ ┌───────────────────────────────────────────────────────────────────────┐
+ │                            FRONT-END LAYER                            │
+ │                                                                       │
+ │  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────────┐  │
+ │  │  Main App Page  │   │  Sources Page   │   │   Publishers Page   │  │
+ │  └─────────────────┘   └─────────────────┘   └─────────────────────┘  │
+ │                                                                       │
+ └───────────────────────────────────┬───────────────────────────────────┘
+                                     │
+                                     ▼
+ ┌───────────────────────────────────────────────────────────────────────┐
+ │                             BACKEND LAYER                             │
+ │                                                                       │
+ │  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────────┐  │
+ │  │ Query Processor │   │ Retrieval Engine│   │  Generation Engine  │  │
+ │  └─────────────────┘   └─────────────────┘   └─────────────────────┘  │
+ │                                                                       │
+ └───────────────────────────────────┬───────────────────────────────────┘
+                                     │
+                                     ▼
+ ┌───────────────────────────────────────────────────────────────────────┐
+ │                              DATA LAYER                               │
+ │                                                                       │
+ │  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────────┐  │
+ │  │   FAISS Index   │   │   Text Chunks   │   │      Metadata       │  │
+ │  └─────────────────┘   └─────────────────┘   └─────────────────────┘  │
+ │                                                                       │
+ └───────────────────────────────────────────────────────────────────────┘
40
+ ```
41
+
42
+ ## Component Details
43
+
44
+ ### 1. Front-end Layer
45
+
46
+ The front-end layer is built with Streamlit and consists of multiple pages:
47
+
48
+ #### Main App Page (`app.py`)
49
+ - Provides the question input interface
50
+ - Displays answers and citations
51
+ - Offers configurable parameters (number of sources, word limit)
52
+ - Shows pre-selected common spiritual questions
53
+ - Contains information about the application and disclaimers
54
+ - Contains acknowledgment sections
55
+
56
+ #### Sources Page (`1_Sources.py`)
57
+ - Lists all spiritual texts and traditions used in Anveshak: Spirituality Q&A
58
+ - Provides information about the Saints and Spiritual Masters
59
+ - Organizes sources by tradition and category
60
+
61
+ #### Publishers Page (`2_Publishers.py`)
62
+ - Acknowledges all publishers whose works are referenced
63
+ - Explains copyright considerations and fair use
64
+
65
+ #### Contacts Page (`3_Contacts.py`)
66
+ - Provides contact information for feedback and questions
67
+ - Explains the purpose and limitations of Anveshak: Spirituality Q&A
68
+
69
+ ### 2. Backend Layer
70
+
71
+ The backend layer handles the core functionality of processing queries, retrieving relevant passages, and generating answers.
72
+
73
+ #### Query Processor
74
+ - Takes user queries from the front-end
75
+ - Manages the end-to-end processing flow
76
+ - Caches results to improve performance
77
+ - Formats and returns answers with citations
78
+
79
+ ```python
80
+ @st.cache_data(ttl=3600, show_spinner=False)
81
+ def cached_process_query(query, top_k=5, word_limit=100):
82
+ """
83
+ Process a user query with caching to avoid redundant computation.
84
+
85
+ This function is cached with a Time-To-Live (TTL) of 1 hour, meaning identical
86
+ queries within this time period will return cached results rather than
87
+ reprocessing, improving responsiveness.
88
+
89
+ Args:
90
+ query (str): The user's spiritual question
91
+ top_k (int): Number of sources to retrieve and use for answer generation
92
+ word_limit (int): Maximum word count for the generated answer
93
+
94
+ Returns:
95
+ dict: Dictionary containing the query, answer, and citations
96
+ """
97
+ print(f"\n🔍 Processing query (cached): {query}")
98
+ # Load all necessary data resources (with caching)
99
+ faiss_index, text_chunks, metadata_dict = cached_load_data_files()
100
+ # Handle missing data gracefully
101
+ if faiss_index is None or text_chunks is None or metadata_dict is None:
102
+ return {
103
+ "query": query,
104
+ "answer_with_rag": "⚠️ System error: Data files not loaded properly.",
105
+ "citations": "No citations available."
106
+ }
107
+ # Step 1: Retrieve relevant passages using similarity search
108
+ retrieved_context, retrieved_sources = retrieve_passages(
109
+ query,
110
+ faiss_index,
111
+ text_chunks,
112
+ metadata_dict,
113
+ top_k=top_k
114
+ )
115
+ # Step 2: Format citations for display
116
+ sources = format_citations(retrieved_sources) if retrieved_sources else "No citation available."
117
+ # Step 3: Generate the answer if relevant context was found
118
+ if retrieved_context:
119
+ context_with_sources = list(zip(retrieved_sources, retrieved_context))
120
+ llm_answer_with_rag = answer_with_llm(query, context_with_sources, word_limit=word_limit)
121
+ else:
122
+ llm_answer_with_rag = "⚠️ No relevant context found."
123
+ # Return the complete response package
124
+ return {"query": query, "answer_with_rag": llm_answer_with_rag, "citations": sources}
125
+
126
+ def process_query(query, top_k=5, word_limit=100):
127
+ """
128
+ Process a query through the RAG pipeline with proper formatting.
129
+
130
+ This is the main entry point for query processing, wrapping the cached
131
+ query processing function.
132
+
133
+ Args:
134
+ query (str): The user's spiritual question
135
+ top_k (int): Number of sources to retrieve and use for answer generation
136
+ word_limit (int): Maximum word count for the generated answer
137
+
138
+ Returns:
139
+ dict: Dictionary containing the query, answer, and citations
140
+ """
141
+ return cached_process_query(query, top_k, word_limit)
142
+ ```
143
+
144
+ #### Retrieval Engine
145
+ - Generates embeddings for user queries
146
+ - Performs similarity search in the FAISS index
147
+ - Retrieves the most relevant text chunks
148
+ - Adds metadata to the retrieved passages
149
+
150
+ ```python
151
+ def retrieve_passages(query, faiss_index, text_chunks, metadata_dict, top_k=5, similarity_threshold=0.5):
152
+ """
153
+ Retrieve the most relevant passages for a given spiritual query.
154
+
155
+ This function:
156
+ 1. Embeds the user query using the same model used for text chunks
157
+ 2. Finds similar passages using the FAISS index with cosine similarity
158
+ 3. Filters results based on similarity threshold to ensure relevance
159
+ 4. Enriches results with metadata (title, author, publisher)
160
+ 5. Ensures passage diversity by including only one passage per source title
161
+
162
+ Args:
163
+ query (str): The user's spiritual question
164
+ faiss_index: FAISS index containing passage embeddings
165
+ text_chunks (dict): Dictionary mapping IDs to text chunks and metadata
166
+ metadata_dict (dict): Dictionary containing publication information
167
+ top_k (int): Maximum number of passages to retrieve
168
+ similarity_threshold (float): Minimum similarity score (0.0-1.0) for retrieved passages
169
+
170
+ Returns:
171
+ tuple: (retrieved_passages, retrieved_sources) containing the text and source information
172
+ """
173
+ try:
174
+ print(f"\n🔍 Retrieving passages for query: {query}")
175
+ query_embedding = get_embedding(query)
176
+ distances, indices = faiss_index.search(query_embedding, top_k * 2)
177
+ print(f"Found {len(distances[0])} potential matches")
178
+ retrieved_passages = []
179
+ retrieved_sources = []
180
+ cited_titles = set()
181
+ for dist, idx in zip(distances[0], indices[0]):
182
+ print(f"Distance: {dist:.4f}, Index: {idx}")
183
+ if idx in text_chunks and dist >= similarity_threshold:
184
+ title_with_txt, author, text = text_chunks[idx]
185
+ clean_title = title_with_txt.replace(".txt", "") if title_with_txt.endswith(".txt") else title_with_txt
186
+ clean_title = unicodedata.normalize("NFC", clean_title)
187
+ if clean_title in cited_titles:
188
+ continue
189
+ metadata_entry = metadata_dict.get(clean_title, {})
190
+ author = metadata_entry.get("Author", "Unknown")
191
+ publisher = metadata_entry.get("Publisher", "Unknown")
192
+ cited_titles.add(clean_title)
193
+ retrieved_passages.append(text)
194
+ retrieved_sources.append((clean_title, author, publisher))
195
+ if len(retrieved_passages) == top_k:
196
+ break
197
+ print(f"Retrieved {len(retrieved_passages)} passages")
198
+ return retrieved_passages, retrieved_sources
199
+ except Exception as e:
200
+ print(f"❌ Error in retrieve_passages: {str(e)}")
201
+ return [], []
202
+ ```
203
+
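+ The `get_embedding()` helper used above is not reproduced in this document. A minimal sketch of what it does — assuming the E5-large-v2 model is loaded via `sentence-transformers` and that vectors are L2-normalized, which is what `IndexFlatIP` needs for cosine similarity — might look like this:
+
+ ```python
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+ _model = SentenceTransformer("intfloat/e5-large-v2")  # loaded once via cached_load_model() in the app
+
+ def get_embedding(text):
+     """Embed a query with the E5 'query: ' prefix and L2-normalize it for cosine similarity."""
+     vec = _model.encode([f"query: {text}"], normalize_embeddings=True)
+     return np.asarray(vec, dtype="float32")  # shape (1, dim), ready for faiss_index.search()
+ ```
+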
204
+ #### Generation Engine
205
+ - Takes retrieved passages as context
206
+ - Uses OpenAI's GPT model to generate answers
207
+ - Ensures answers respect the word limit
208
+ - Formats the output with proper citations
209
+
210
+ ```python
211
+ def answer_with_llm(query, context=None, word_limit=100):
212
+ """
213
+ Generate an answer using the OpenAI GPT model with formatted citations.
214
+
215
+ This function:
216
+ 1. Formats retrieved passages with source information
217
+ 2. Creates a prompt with system and user messages
218
+ 3. Calls the OpenAI API to generate an answer
219
+ 4. Trims the response to the specified word limit
220
+
221
+ The system prompt ensures answers maintain appropriate respect for spiritual traditions,
222
+ synthesize rather than quote directly, and acknowledge gaps when relevant information
223
+ isn't available.
224
+
225
+ Args:
226
+ query (str): The user's spiritual question
227
+ context (list, optional): List of (source_info, text) tuples for context
228
+ word_limit (int): Maximum word count for the generated answer
229
+
230
+ Returns:
231
+ str: The generated answer or an error message
232
+ """
233
+ try:
234
+ if context:
235
+ formatted_contexts = []
236
+ total_chars = 0
237
+ max_context_chars = 4000 # Limit context size to avoid exceeding token limits
238
+ for (title, author, publisher), text in context:
239
+ remaining_space = max(0, max_context_chars - total_chars)
240
+ excerpt_len = min(150, remaining_space)
241
+ if excerpt_len > 50:
242
+ excerpt = text[:excerpt_len].strip() + "..." if len(text) > excerpt_len else text
243
+ formatted_context = f"[{title} by {author}, Published by {publisher}] {excerpt}"
244
+ formatted_contexts.append(formatted_context)
245
+ total_chars += len(formatted_context)
246
+ if total_chars >= max_context_chars:
247
+ break
248
+ formatted_context = "\n".join(formatted_contexts)
249
+ else:
250
+ formatted_context = "No relevant information available."
251
+
252
+ system_message = (
253
+ "You are an AI specialized in spirituality, primarily based on Indian spiritual texts and teachings. "
+ "While your knowledge is predominantly from Indian spiritual traditions, you also have limited familiarity with spiritual concepts from other global traditions. "
+ "Answer based on context, summarizing ideas rather than quoting verbatim. "
+ "If no relevant information is found in the provided context, politely inform the user that this specific query may not be covered in the available spiritual texts, and suggest they rephrase the question or try a related one. "
+ "Avoid repetition and irrelevant details. "
+ "Ensure proper citation and do not include direct excerpts. "
+ "Maintain appropriate, respectful language at all times. "
+ "Do not use profanity, expletives, obscenities, slurs, hate speech, sexually explicit content, or language promoting violence. "
+ "As a spiritual guidance system, ensure all responses reflect dignity, peace, love, and compassion consistent with spiritual traditions. "
+ "Provide concise, focused answers without lists or lengthy explanations."
263
+ )
264
+
265
+ user_message = f"""
266
+ Context:
267
+ {formatted_context}
268
+ Question:
269
+ {query}
270
+ """
271
+
272
+ try:
273
+ llm_model = st.secrets["LLM_MODEL"]
274
+ except KeyError:
275
+ print("❌ Error: LLM model not found in secrets")
276
+ return "I apologize, but I am unable to answer at the moment."
277
+
278
+ response = openai.chat.completions.create(
279
+ model=llm_model,
280
+ messages=[
281
+ {"role": "system", "content": system_message},
282
+ {"role": "user", "content": user_message}
283
+ ],
284
+ max_tokens=200,
285
+ temperature=0.7
286
+ )
287
+
288
+ # Extract the answer and apply word limit
289
+ answer = response.choices[0].message.content.strip()
290
+ words = answer.split()
291
+ if len(words) > word_limit:
292
+ answer = " ".join(words[:word_limit])
293
+ if not answer.endswith((".", "!", "?")):
294
+ answer += "."
295
+ return answer
296
+ except Exception as e:
297
+ print(f"❌ LLM API error: {str(e)}")
298
+ return "I apologize, but I am unable to answer at the moment."
299
+ ```
+
300
+ ### 3. Data Layer
301
+
302
+ The data layer stores and manages the embedded text chunks, metadata, and vector indices:
303
+
304
+ #### FAISS Index
305
+ - Stores vector embeddings of all text chunks
306
+ - Enables efficient similarity search with cosine similarity
307
+ - Provides fast retrieval for Anveshak
308
+
309
+ ```python
310
+ # Building the FAISS index (during preprocessing)
311
+ dimension = all_embeddings.shape[1]
312
+ index = faiss.IndexFlatIP(dimension) # Inner product (cosine similarity for normalized vectors)
313
+ index.add(all_embeddings)
314
+ ```
315
+
316
+ #### Text Chunks
317
+ - Contains the actual text content split into manageable chunks
318
+ - Stores text with unique identifiers that map to the FAISS index
319
+ - Formatted as tab-separated values with IDs, titles, authors, and content
320
+
321
+ ```
322
+ # Format of text_chunks.txt
323
+ ID Title Author Text_Content
324
+ 0 Bhagavad Gita Vyasa The supreme Lord said: I have taught this imperishable yoga to Vivasvan...
325
+ 1 Yoga Sutras Patanjali Yogas chitta vritti nirodhah - Yoga is the stilling of the fluctuations...
326
+ ...
327
+ ```
328
+
329
+ #### Metadata
330
+ - Stores additional information about each source text
331
+ - Includes author information, publisher details, copyright information, and more
332
+ - Used to provide accurate citations for answers
333
+
334
+ ```json
335
+ // Example metadata.jsonl entry
336
+ {"Title": "Text_Name", "Author": "Vyasa", "Publisher": "Publisher_Name", "URL": "URL", "Uploaded": true}
337
+ ```
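+ The chunk and metadata files above are parsed into in-memory dictionaries by `cached_load_data_files()`, which is not shown in full here. A simplified sketch of that parsing, assuming the tab-separated chunk format and JSONL metadata illustrated above:
+
+ ```python
+ import json
+
+ def load_chunks_and_metadata(chunks_path, metadata_path):
+     """Build {chunk_id: (title, author, text)} and {title: metadata_entry} lookups."""
+     text_chunks = {}
+     with open(chunks_path, encoding="utf-8") as f:
+         for line in f:
+             chunk_id, title, author, text = line.rstrip("\n").split("\t", 3)
+             text_chunks[int(chunk_id)] = (title, author, text)
+     metadata_dict = {}
+     with open(metadata_path, encoding="utf-8") as f:
+         for line in f:
+             entry = json.loads(line)
+             metadata_dict[entry["Title"]] = entry
+     return text_chunks, metadata_dict
+ ```
+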
338
+
339
+ ## Data Flow and Processing
340
+
341
+ ### 1. Preprocessing Pipeline
342
+
343
+ The preprocessing pipeline runs offline to prepare the text corpus:
344
+
345
+ ```
346
+ Raw Texts β†’ Cleaning β†’ Chunking β†’ Embedding β†’ Indexing β†’ GCS Storage
347
+ ```
348
+
349
+ Each step is handled by specific functions in the `preprocessing.py` script:
350
+
351
+ 1. **Text Collection**: Texts are collected from various sources and uploaded to Google Cloud Storage
352
+ 2. **Text Cleaning**: HTML and formatting artifacts are removed using `rigorous_clean_text()`
353
+ 3. **Text Chunking**: Long texts are split into manageable chunks with `chunk_text()`
354
+ 4. **Embedding Generation**: Text chunks are converted to vector embeddings using `create_embeddings()`
355
+ 5. **Index Building**: Embeddings are added to a FAISS index for efficient retrieval
356
+ 6. **Storage**: All processed data is stored in Google Cloud Storage for Anveshak to access
357
+
358
+ ### 2. Query Processing Flow
359
+
360
+ When a user submits a question, the system follows this flow:
361
+
362
+ 1. **Query Embedding**: The user's question is embedded using the same model as the text corpus
363
+ 2. **Similarity Search**: The query embedding is compared against the FAISS index to find similar text chunks
364
+ 3. **Context Assembly**: Retrieved chunks are combined with their metadata to form the context
365
+ 4. **Answer Generation**: The context and query are sent to the Large Language Model (LLM) to generate an answer
366
+ 5. **Citation Formatting**: Sources are formatted as citations to accompany the answer
367
+ 6. **Result Presentation**: The answer and citations are displayed to the user
368
+
369
+ ## Caching Strategy
370
+
371
+ Anveshak implements a multi-level caching strategy to optimize performance:
372
+
373
+ ### Resource Caching
374
+ - Model and data files are cached using `@st.cache_resource`
375
+ - Ensures the embedding model and FAISS index are loaded only once during the session
376
+
377
+ ```python
378
+ @st.cache_resource(show_spinner=False)
379
+ def cached_load_model():
380
+     # Load embedding model once and cache it
+     ...
381
+
382
+ @st.cache_resource(show_spinner=False)
383
+ def cached_load_data_files():
384
+     # Load FAISS index, text chunks, and metadata once and cache them
+     ...
385
+ ```
386
+
387
+ ### Data Caching
388
+ - Query results are cached using `@st.cache_data` with a Time-To-Live (TTL) of 1 hour
389
+ - Prevents redundant processing of identical queries
390
+
391
+ ```python
392
+ @st.cache_data(ttl=3600, show_spinner=False)
393
+ def cached_process_query(query, top_k=5, word_limit=100):
394
+     # Cache query results for an hour
+     ...
395
+ ```
396
+
397
+ ### Session State Management
398
+ - Streamlit session state is used to manage UI state and user interactions
399
+ - Prevents unnecessary recomputation during re-renders
400
+
401
+ ```python
402
+ ...
403
+ if 'initialized' not in st.session_state:
404
+ st.session_state.initialized = False
405
+ ...
406
+ if 'last_query' not in st.session_state:
407
+ st.session_state.last_query = ""
408
+ # ... and more session state variables
409
+ ```
410
+
411
+ ## Authentication and Security
412
+
413
+ Anveshak uses two authentication systems:
414
+
415
+ ### Google Cloud Storage Authentication
416
+ - Authenticates with GCS to access stored data
+ - Uses service account credentials stored exclusively in Hugging Face Spaces secrets for production deployment
+ - Supports alternative authentication methods (environment variables, Streamlit secrets) for development environments
+
427
+ ```python
428
+ def setup_gcp_auth():
429
+ """Setup Google Cloud Platform (GCP) authentication using various methods.
430
+
431
+ This function tries multiple authentication methods in order of preference:
432
+ 1. HF Spaces environment variable (GCP_CREDENTIALS) - primary production method
433
+ 2. Local environment variable pointing to credentials file (GOOGLE_APPLICATION_CREDENTIALS)
434
+ 3. Streamlit secrets (gcp_credentials)
435
+
436
+ Note: In production, credentials are stored exclusively in HF Spaces secrets.
437
+ """
438
+ # Try multiple authentication methods and return credentials.
+ ```
439
+
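+ The body of `setup_gcp_auth()` is elided above. A simplified sketch of the fallback chain described in its docstring (illustrative only, not the exact production code):
+
+ ```python
+ import json
+ import os
+ import streamlit as st
+ from google.oauth2 import service_account
+
+ def setup_gcp_auth():
+     """Return GCP credentials from HF Spaces secrets, a local key file, or Streamlit secrets."""
+     if "GCP_CREDENTIALS" in os.environ:  # 1. HF Spaces environment variable (JSON string)
+         info = json.loads(os.environ["GCP_CREDENTIALS"])
+         return service_account.Credentials.from_service_account_info(info)
+     if "GOOGLE_APPLICATION_CREDENTIALS" in os.environ:  # 2. Path to a local credentials file
+         return service_account.Credentials.from_service_account_file(
+             os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
+         )
+     if "gcp_credentials" in st.secrets:  # 3. Streamlit secrets
+         return service_account.Credentials.from_service_account_info(dict(st.secrets["gcp_credentials"]))
+     raise RuntimeError("No GCP credentials found")
+ ```
+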
440
+ ### OpenAI API Authentication
441
+ - Authenticates with OpenAI to use their LLM API
442
+ - Uses API key stored securely
443
+
444
+ ```python
445
+ def setup_openai_auth():
446
+ """Setup OpenAI API authentication using various methods.
447
+
448
+ This function tries multiple authentication methods in order of preference:
449
+ 1. Standard environment variable (OPENAI_API_KEY)
450
+ 2. HF Spaces environment variable (OPENAI_KEY) - primary production method
451
+ 3. Streamlit secrets (openai_api_key)
452
+
453
+ Note: In production, the API key is stored exclusively in HF Spaces secrets.
454
+ """
455
+ # Try multiple authentication methods to set up the API key
456
+ ```
457
+
458
+ ## Privacy Considerations
459
+
460
+ Anveshak: Spirituality Q&A is designed with privacy in mind:
461
+
462
+ 1. **No Data Collection**: The application does not save user data or queries
463
+ 2. **Stateless Operation**: Each query is processed independently
464
+ 3. **No User Tracking**: No analytics or tracking mechanisms are implemented
465
+ 4. **Local Processing**: Embedding generation happens locally when possible
466
+
467
+ ## Deployment Architecture
468
+
469
+ Anveshak: Spirituality Q&A is deployed on Hugging Face Spaces, which provides:
470
+
471
+ - Containerized environment
472
+ - Git-based deployment
473
+ - Secret management for API keys and credentials
474
+ - Persistent storage for cached files
475
+ - Continuous availability
476
+
477
+ The deployment process involves:
478
+ 1. Pushing code to GitHub
479
+ 2. Connecting the GitHub repository to Hugging Face Spaces
480
+ 3. Configuring environment variables and secrets in the Hugging Face UI
481
+ 4. Automatic deployment when changes are pushed to the repository
482
+
483
+ ## Technical Design Decisions
484
+
485
+ ### Choice of Embedding Model
486
+ - **Selected Model**: E5-large-v2
487
+ - **Justification**:
488
+ - Strong performance on information retrieval tasks
489
+ - Good balance between accuracy and computational efficiency
490
+ - Supports semantic understanding of spiritual concepts
491
+
492
+ ### Vector Search Implementation
493
+ - **Selected Technology**: FAISS with IndexFlatIP
494
+ - **Justification**:
495
+ - Optimized for inner product (cosine similarity) search
496
+ - Exact search rather than approximate for maximum accuracy
497
+ - Small enough index to fit in memory for this application
498
+
499
+ ### LLM Selection
500
+ - **Selected Model**: OpenAI GPT-3.5 Turbo
501
+ - **Justification**:
502
+ - Powerful context understanding
503
+ - Strong ability to synthesize information from multiple sources
504
+ - Good balance between accuracy and cost
505
+
506
+ ### Front-end Framework
507
+ - **Selected Technology**: Streamlit
508
+ - **Justification**:
509
+ - Rapid development of data-focused applications
510
+ - Built-in caching mechanisms
511
+ - Easy deployment on Hugging Face Spaces
512
+ - Simple, intuitive UI for non-technical users
513
+
514
+ ### Response Format
515
+ - **Design Choice**: Concise, direct answers
516
+ - **Justification**:
517
+ - Spiritual wisdom often benefits from simplicity and directness
518
+ - Avoids overwhelming users with excessive information
519
+ - Maintains focus on the core of the question
520
+
521
+ ## Limitations and Constraints
522
+
523
+ 1. **Context Window Limitations**: The LLM has a maximum context window, limiting the amount of text that can be included in each query.
524
+ - Mitigation: Text chunks are limited to 500 words, and only a subset of the most relevant chunks are included in the context.
525
+
526
+ 2. **Embedding Model Accuracy**: No embedding model perfectly captures the semantics of spiritual texts.
527
+ - Mitigation: Use of a high-quality embedding model (E5-large-v2) and a similarity threshold to filter out less relevant results.
528
+
529
+ 3. **Resource Constraints**: Hugging Face Spaces has limited computational resources.
530
+ - Mitigation: Forcing CPU usage for the embedding model, implementing aggressive caching, and optimizing memory usage.
531
+
532
+ 4. **Copyright Considerations**: Anveshak: Spirituality Q&A respects copyright while providing valuable information.
533
+ - Implementation: Word limits on responses, proper citations for all sources, and encouragement for users to purchase original texts.
534
+
535
+ 5. **Language Limitations**: Currently, Anveshak is only available in English.
536
+ - Mitigation: Future plans include support for multiple Indian languages.
537
+
538
+ ## Future Architecture Extensions
539
+
540
+ 1. **Multi-language Support**: Add capability to process and answer questions in Sanskrit, Hindi, Bengali, Tamil, and other Indian languages.
541
+
542
+ 2. **Hybrid Retrieval**: Implement a combination of dense and sparse retrieval to improve passage selection.
543
+
544
+ 3. **Local LLM Integration**: Use a self-hosted open-source alternative for the LLM.
545
+
546
+ 4. **User Feedback Loop**: Add a mechanism for users to rate answers and use this feedback to improve retrieval.
547
+
548
+ 5. **Advanced Caching**: Implement a distributed caching system for better performance at scale.
549
+
550
+ ## Conclusion
551
+
552
+ The architecture of Anveshak balances technical sophistication with simplicity and accessibility. By combining modern NLP techniques with traditional spiritual texts, it creates a bridge between ancient wisdom and contemporary technology, making spiritual knowledge more accessible to seekers around the world.
553
+
554
+ Anveshak: Spirituality Q&A acknowledges and honors Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters from all backgrounds, genders, traditions, and walks of life, understanding that wisdom transcends all such distinctions. Its focused approach on providing concise, direct answers maintains the essence of spiritual teaching while embracing modern technological capabilities.
docs/assets/app_screenshot.png ADDED

Git LFS Details

  • SHA256: f9094c61fc20e022f66ee7f3a17ecf32ef154788ef6ed65298cef19400fc9208
  • Pointer size: 131 Bytes
  • Size of remote file: 291 kB
docs/changelog-doc.md ADDED
@@ -0,0 +1,53 @@
1
+ # Changelog
2
+
3
+ All notable changes to Anveshak: Spirituality Q&A will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [1.0.0] - 2025-04-01
9
+
10
+ ### Added
11
+ - Initial release of Anveshak: Spirituality Q&A
12
+ - Core RAG functionality with E5-large-v2 embedding model
13
+ - FAISS index for efficient text retrieval
14
+ - Integration with OpenAI API for answer generation
15
+ - Streamlit-based user interface
16
+ - Caching mechanisms for improved performance
17
+ - Support for customizable number of sources and word limits
18
+ - Pre-selected common spiritual questions
19
+ - Comprehensive acknowledgment of sources and publishers
20
+ - Detailed documentation
21
+
22
+ ### Technical Features
23
+ - Google Cloud Storage integration for data storage
24
+ - Authentication handling for GCP and OpenAI
25
+ - Memory optimization for resource-constrained environments
26
+ - Multi-page Streamlit application structure
27
+ - Custom CSS styling for enhanced user experience
28
+ - Privacy protection with no user data storage
29
+ - Concise answer generation system
30
+ - Recognition of Saints and Spiritual Masters of all backgrounds and traditions
31
+
32
+ ## Future Roadmap
33
+
34
+ ### Planned for v1.1.0
35
+ - Multi-language support (Sanskrit, Hindi, Bengali, Tamil, and more)
36
+ - User feedback collection for answer quality
37
+ - Enhanced answer relevance with hybrid retrieval methods
38
+ - Additional spiritual texts from diverse traditions
39
+ - Improved citation formatting with page numbers where available
40
+
41
+ ### Planned for v1.2.0
42
+ - Self-hosted open-source LLM integration
43
+ - Advanced visualization of concept relationships
44
+ - Search functionality for specific texts or authors
45
+ - Audio output for visually impaired users
46
+ - Mobile-optimized interface
47
+
48
+ ### Planned for v2.0.0
49
+ - Meditation timer and guide integration
50
+ - Personalized learning paths based on user interests (implemented with privacy-preserving approaches like client-side storage, session-based preferences, or explicit opt-in)
51
+ - Interactive glossary of spiritual terms
52
+ - Spiritual practice guide with scheduler and tracker
53
+ - Community features for discussion and shared learning
docs/configuration-doc.md ADDED
@@ -0,0 +1,597 @@
1
+ # Configuration Guide
2
+
3
+ This document provides detailed instructions for configuring and deploying Anveshak: Spirituality Q&A, covering environment setup, authentication, customization options, and deployment strategies.
4
+
5
+ ## Environment Configuration
6
+
7
+ ### Configuration Parameters
8
+
9
+ Anveshak: Spirituality Q&A uses the following configuration parameters, which can be set through environment variables or Hugging Face Spaces secrets:
10
+
11
+ | Parameter | Description | Example Value |
12
+ |-----------|-------------|---------------|
13
+ | `BUCKET_NAME_GCS` | GCS bucket name for data storage | `"your-bucket-name"` |
14
+ | `METADATA_PATH_GCS` | Path to metadata file in GCS | `"metadata/metadata.jsonl"` |
15
+ | `EMBEDDINGS_PATH_GCS` | Path to embeddings file in GCS | `"processed/embeddings/all_embeddings.npy"` |
16
+ | `INDICES_PATH_GCS` | Path to FAISS index in GCS | `"processed/indices/faiss_index.faiss"` |
17
+ | `CHUNKS_PATH_GCS` | Path to text chunks file in GCS | `"processed/chunks/text_chunks.txt"` |
18
+ | `RAW_TEXTS_UPLOADED_PATH_GCS` | Path to uploaded raw texts in GCS | `"raw-texts/uploaded"` |
19
+ | `RAW_TEXTS_DOWNLOADED_PATH_GCS` | Path to downloaded raw texts in GCS | `"raw-texts/downloaded/"` |
20
+ | `CLEANED_TEXTS_PATH_GCS` | Path to cleaned texts in GCS | `"cleaned-texts/"` |
21
+ | `EMBEDDING_MODEL` | Hugging Face model ID for embeddings | `"intfloat/e5-large-v2"` |
22
+ | `LLM_MODEL` | OpenAI model for answer generation | `"gpt-3.5-turbo"` |
23
+ | `OPENAI_API_KEY` | OpenAI API key | `"sk-..."` |
24
+ | `GCP_CREDENTIALS` | GCP service account credentials (JSON) | `{"type":"service_account",...}` |
25
+
26
+ ### Streamlit Secrets Configuration (Optional)
27
+
28
+ If developing locally with Streamlit, you can create a `.streamlit/secrets.toml` file with the following structure:
29
+
30
+ ```toml
31
+ # GCS Configuration
32
+ BUCKET_NAME_GCS = "your-bucket-name"
33
+ METADATA_PATH_GCS = "metadata/metadata.jsonl"
34
+ EMBEDDINGS_PATH_GCS = "processed/embeddings/all_embeddings.npy"
35
+ INDICES_PATH_GCS = "processed/indices/faiss_index.faiss"
36
+ CHUNKS_PATH_GCS = "processed/chunks/text_chunks.txt"
37
+ RAW_TEXTS_UPLOADED_PATH_GCS = "raw-texts/uploaded"
38
+ RAW_TEXTS_DOWNLOADED_PATH_GCS = "raw-texts/downloaded/"
39
+ CLEANED_TEXTS_PATH_GCS = "cleaned-texts/"
40
+ EMBEDDING_MODEL = "intfloat/e5-large-v2"
41
+ LLM_MODEL = "gpt-3.5-turbo"
42
+
43
+ # OpenAI API Configuration
44
+ openai_api_key = "your-openai-api-key"
45
+
46
+ # GCP Service Account Credentials (JSON format)
47
+ [gcp_credentials]
48
+ type = "service_account"
49
+ project_id = "your-project-id"
50
+ private_key_id = "your-private-key-id"
51
+ private_key = "your-private-key"
52
+ client_email = "your-client-email"
53
+ client_id = "your-client-id"
54
+ auth_uri = "https://accounts.google.com/o/oauth2/auth"
55
+ token_uri = "https://oauth2.googleapis.com/token"
56
+ auth_provider_x509_cert_url = "https://www.googleapis.com/oauth2/v1/certs"
57
+ client_x509_cert_url = "your-client-cert-url"
58
+ ```
59
+
60
+ ### Environment Variables for Alternative Deployments
61
+
62
+ For deployments that support environment variables (like Heroku or Docker), you can use the following environment variables:
63
+
64
+ ```bash
65
+ # GCS Configuration
66
+ export BUCKET_NAME_GCS="your-bucket-name"
67
+ export METADATA_PATH_GCS="metadata/metadata.jsonl"
68
+ export EMBEDDINGS_PATH_GCS="processed/embeddings/all_embeddings.npy"
69
+ export INDICES_PATH_GCS="processed/indices/faiss_index.faiss"
70
+ export CHUNKS_PATH_GCS="processed/chunks/text_chunks.txt"
71
+ export RAW_TEXTS_UPLOADED_PATH_GCS="raw-texts/uploaded"
72
+ export RAW_TEXTS_DOWNLOADED_PATH_GCS="raw-texts/downloaded/"
73
+ export CLEANED_TEXTS_PATH_GCS="cleaned-texts/"
74
+ export EMBEDDING_MODEL="intfloat/e5-large-v2"
75
+ export LLM_MODEL="gpt-3.5-turbo"
76
+
77
+ # OpenAI API Configuration
78
+ export OPENAI_API_KEY="your-openai-api-key"
79
+
80
+ # GCP Service Account (as a JSON string)
81
+ export GCP_CREDENTIALS='{"type":"service_account","project_id":"your-project-id",...}'
82
+ ```
83
+
84
+ ## Authentication Setup
85
+
86
+ ### Google Cloud Storage (GCS) Authentication
87
+
88
+ Anveshak: Spirituality Q&A supports multiple methods for authenticating with GCS:
89
+
90
+ #### Setting Up a GCP Service Account (Required)
91
+
92
+ Before configuring authentication methods, you'll need to create a Google Cloud Platform (GCP) service account:
93
+
94
+ 1. **Create a GCP project** (if you don't already have one):
95
+ - Go to the [Google Cloud Console](https://console.cloud.google.com/)
96
+ - Click on "Select a project" at the top right and then "New Project"
97
+ - Enter a project name and click "Create"
98
+
99
+ 2. **Enable the Cloud Storage API**:
100
+ - Go to "APIs & Services" > "Library" in the left sidebar
101
+ - Search for "Cloud Storage"
102
+ - Click on "Cloud Storage API" and then "Enable"
103
+
104
+ 3. **Create a service account**:
105
+ - Go to "IAM & Admin" > "Service Accounts" in the left sidebar
106
+ - Click "Create Service Account"
107
+ - Enter a service account name and description
108
+ - Click "Create and Continue"
109
+
110
+ 4. **Assign roles to the service account**:
111
+ - Add the "Storage Object Admin" role for access to GCS objects
112
+ - Add the "Viewer" role for basic read permissions
113
+ - Click "Continue" and then "Done"
114
+
115
+ 5. **Create and download service account key**:
116
+ - Find your new service account in the list and click on it
117
+ - Go to the "Keys" tab
118
+ - Click "Add Key" > "Create new key"
119
+ - Choose "JSON" as the key type
120
+ - Click "Create" to download the key file (This is your GCP credentials JSON file)
121
+
122
+ 6. **Create a GCS bucket**:
123
+ - Go to "Cloud Storage" > "Buckets" in the left sidebar
124
+ - Click "Create"
125
+ - Enter a globally unique bucket name
126
+ - Choose your settings for location, class, and access control
127
+ - Click "Create"
128
+
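+ After completing these steps, it can help to confirm that the downloaded key and the new bucket work together before wiring them into Anveshak. The snippet below is a minimal sanity-check sketch, not part of Anveshak's codebase; the key filename and bucket name are placeholders to replace with your own values:
+
+ ```python
+ # Hypothetical sanity check - not part of Anveshak's codebase
+ from google.cloud import storage
+ from google.oauth2 import service_account
+
+ # Placeholders: point these at your downloaded key file and bucket
+ credentials = service_account.Credentials.from_service_account_file("key.json")
+ client = storage.Client(credentials=credentials, project=credentials.project_id)
+
+ bucket = client.bucket("your-bucket-name")
+ print("Bucket reachable:", bucket.exists())
+
+ # Listing a few objects confirms the key has object-level permissions
+ for blob in client.list_blobs("your-bucket-name", max_results=5):
+     print(blob.name)
+ ```
+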
129
+ Once you have created your service account and GCS bucket, you can use any of the following authentication methods:
130
+
131
+ #### Option 1: HF Spaces Environment Variable (Recommended Production Method)
132
+
133
+ For Hugging Face Spaces, set the `GCP_CREDENTIALS` environment variable in the Spaces UI:
134
+
135
+ 1. Go to your Space settings
136
+ 2. Open the "Repository secrets" section
137
+ 3. Add a new secret with name `GCP_CREDENTIALS` and value containing your JSON credentials
138
+
139
+ #### Option 2: Local Development with Application Default Credentials
140
+
141
+ For local development, you can use Application Default Credentials:
142
+
143
+ ```bash
144
+ # Export path to your service account key file
145
+ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account-file.json"
146
+ ```
147
+
148
+ #### Option 3: Streamlit Secrets
149
+
150
+ Add your service account credentials to the `.streamlit/secrets.toml` file as shown in the example above.
151
+
152
+ The authentication logic is handled by the `setup_gcp_auth()` function in `utils.py`:
153
+
154
+ ```python
155
+ def setup_gcp_auth():
156
+ """
157
+ Setup Google Cloud Platform (GCP) authentication using various methods.
158
+
159
+ This function tries multiple authentication methods in order of preference:
160
+ 1. HF Spaces environment variable (GCP_CREDENTIALS) - primary production method
161
+ 2. Local environment variable pointing to credentials file (GOOGLE_APPLICATION_CREDENTIALS)
162
+ 3. Streamlit secrets (gcp_credentials)
163
+
164
+ Note: In production, credentials are stored exclusively in HF Spaces secrets.
165
+ """
166
+ try:
167
+ # Option 1: HF Spaces environment variable
168
+ if "GCP_CREDENTIALS" in os.environ:
169
+ gcp_credentials = json.loads(os.getenv("GCP_CREDENTIALS"))
170
+ print("βœ… Using GCP credentials from HF Spaces environment variable")
171
+ credentials = service_account.Credentials.from_service_account_info(gcp_credentials)
172
+ return credentials
173
+
174
+ # Option 2: Local environment variable pointing to file
175
+ elif "GOOGLE_APPLICATION_CREDENTIALS" in os.environ:
176
+ credentials_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
177
+ print(f"βœ… Using GCP credentials from file at {credentials_path}")
178
+ credentials = service_account.Credentials.from_service_account_file(credentials_path)
179
+ return credentials
180
+
181
+ # Option 3: Streamlit secrets
182
+ elif "gcp_credentials" in st.secrets:
183
+ gcp_credentials = st.secrets["gcp_credentials"]
184
+
185
+ # Handle different secret formats
186
+ if isinstance(gcp_credentials, dict) or hasattr(gcp_credentials, 'to_dict'):
187
+ # Convert AttrDict to dict if needed
188
+ if hasattr(gcp_credentials, 'to_dict'):
189
+ gcp_credentials = gcp_credentials.to_dict()
190
+
191
+ print("βœ… Using GCP credentials from Streamlit secrets (dict format)")
192
+ credentials = service_account.Credentials.from_service_account_info(gcp_credentials)
193
+ return credentials
194
+ else:
195
+ # Assume it's a JSON string
196
+ try:
197
+ gcp_credentials_dict = json.loads(gcp_credentials)
198
+ print("βœ… Using GCP credentials from Streamlit secrets (JSON string)")
199
+ credentials = service_account.Credentials.from_service_account_info(gcp_credentials_dict)
200
+ return credentials
201
+ except json.JSONDecodeError:
202
+ print("⚠️ GCP credentials in Streamlit secrets is not valid JSON, trying as file path")
203
+ if os.path.exists(gcp_credentials):
204
+ credentials = service_account.Credentials.from_service_account_file(gcp_credentials)
205
+ return credentials
206
+ else:
207
+ raise ValueError("GCP credentials format not recognized")
208
+
209
+ else:
210
+ raise ValueError("No GCP credentials found in environment or Streamlit secrets")
211
+
212
+ except Exception as e:
213
+ error_msg = f"❌ Authentication error: {str(e)}"
214
+ print(error_msg)
215
+ st.error(error_msg)
216
+ raise
217
+ ```
218
+
219
+ ### OpenAI API Authentication
220
+
221
+ Similarly, OpenAI API authentication can be configured in multiple ways:
222
+
223
+ #### Option 1: HF Spaces Environment Variable (Recommended Production Method)
224
+
225
+ Set the `OPENAI_API_KEY` environment variable in the Hugging Face Spaces UI.
226
+
227
+ #### Option 2: Environment Variables
228
+
229
+ Set the `OPENAI_API_KEY` environment variable:
230
+
231
+ ```bash
232
+ export OPENAI_API_KEY="your-openai-api-key"
233
+ ```
234
+
235
+ #### Option 3: Streamlit Secrets
236
+
237
+ Add your OpenAI API key to the `.streamlit/secrets.toml` file:
238
+
239
+ ```toml
240
+ openai_api_key = "your-openai-api-key"
241
+ ```
242
+
243
+ The authentication logic is handled by the `setup_openai_auth()` function in `utils.py`:
244
+
245
+ ```python
246
+ def setup_openai_auth():
247
+ """
248
+ Setup OpenAI API authentication using various methods.
249
+
250
+ This function tries multiple authentication methods in order of preference:
251
+ 1. Standard environment variable (OPENAI_API_KEY)
252
+ 2. HF Spaces environment variable (OPENAI_KEY) - primary production method
253
+ 3. Streamlit secrets (openai_api_key)
254
+
255
+ Note: In production, the API key is stored exclusively in HF Spaces secrets.
256
+ """
257
+ try:
258
+ # Option 1: Standard environment variable
259
+ if "OPENAI_API_KEY" in os.environ:
260
+ openai.api_key = os.getenv("OPENAI_API_KEY")
261
+ print("βœ… Using OpenAI API key from environment variable")
262
+ return
263
+
264
+ # Option 2: HF Spaces environment variable with different name
265
+ elif "OPENAI_KEY" in os.environ:
266
+ openai.api_key = os.getenv("OPENAI_KEY")
267
+ print("βœ… Using OpenAI API key from HF Spaces environment variable")
268
+ return
269
+
270
+ # Option 3: Streamlit secrets
271
+ elif "openai_api_key" in st.secrets:
272
+ openai.api_key = st.secrets["openai_api_key"]
273
+ print("βœ… Using OpenAI API key from Streamlit secrets")
274
+ return
275
+
276
+ else:
277
+ raise ValueError("No OpenAI API key found in environment or Streamlit secrets")
278
+
279
+ except Exception as e:
280
+ error_msg = f"❌ OpenAI authentication error: {str(e)}"
281
+ print(error_msg)
282
+ st.error(error_msg)
283
+ raise
284
+ ```
285
+
286
+ ## Application Customization
287
+
288
+ ### UI Customization
289
+
290
+ Anveshak's UI can be customized through the CSS in the `app.py` file:
291
+
292
+ ```python
293
+ # Custom CSS
294
+ st.markdown("""
295
+ <style>
296
+ .main-title {
297
+ font-size: 2.5rem;
298
+ color: #c0392b;
299
+ text-align: center;
300
+ margin-bottom: 1rem;
301
+ }
302
+ .subtitle {
303
+ font-size: 1.2rem;
304
+ color: #555;
305
+ text-align: center;
306
+ margin-bottom: 1.5rem;
307
+ font-style: italic;
308
+ }
309
+ /* More CSS rules... */
310
+ </style>
311
+ <div class="main-title">Anveshak</div>
312
+ <div class="subtitle">Spirituality Q&A</div>
313
+ """, unsafe_allow_html=True)
314
+ ```
315
+
316
+ To change the appearance:
317
+
318
+ 1. Modify the CSS variables in the `<style>` tag
319
+ 2. Update color schemes, fonts, or layouts as needed
320
+ 3. Add new CSS classes for additional UI elements
321
+
322
+ ### Common Questions Configuration
323
+
324
+ The list of pre-selected common questions can be modified in the `app.py` file:
325
+
326
+ ```python
327
+ # Common spiritual questions for users to select from
328
+ common_questions = [
329
+ "What is the Atman or the soul?",
330
+ "Are there rebirths?",
331
+ "What is Karma?",
332
+ # Add or modify questions here
333
+ ]
334
+ ```
335
+
336
+ ### Retrieval Parameters
337
+
338
+ Two key retrieval parameters can be adjusted by users through the UI:
339
+
340
+ 1. **Number of sources** (`top_k`): Controls how many distinct sources are used for generating answers
341
+ - Default: 5
342
+ - Range: 3-10
343
+ - UI Component: Slider in the main interface
344
+
345
+ 2. **Word limit** (`word_limit`): Controls the maximum length of generated answers
346
+ - Default: 200
347
+ - Range: 50-500
348
+ - UI Component: Slider in the main interface
349
+
350
+ These parameters are implemented in the Streamlit UI:
351
+
352
+ ```python
353
+ # Sliders for customization
354
+ col1, col2 = st.columns(2)
355
+ with col1:
356
+ top_k = st.slider("Number of sources:", 3, 10, 5)
357
+ with col2:
358
+ word_limit = st.slider("Word limit:", 50, 500, 200)
359
+ ```
360
+
361
+ ## Deployment Options
362
+
363
+ ### Recommended: Hugging Face Spaces Deployment
364
+
365
+ The recommended and tested deployment method for Anveshak: Spirituality Q&A is Hugging Face Spaces, which provides the necessary resources for running the application efficiently.
366
+
367
+ To deploy on Hugging Face Spaces:
368
+
369
+ 1. Fork the repository to your GitHub account
370
+
371
+ 2. Create a new Space on Hugging Face:
372
+ - Go to [huggingface.co/spaces](https://huggingface.co/spaces)
373
+ - Click "Create new Space"
374
+ - Select "Streamlit" as the SDK
375
+ - Connect your GitHub repository
376
+
377
+ 3. Configure secrets in the Hugging Face UI:
378
+ - Go to your Space settings
379
+ - Open the "Repository secrets" section
380
+ - Add each of the following secrets:
381
+ - `OPENAI_API_KEY`
382
+ - `GCP_CREDENTIALS` (the entire JSON as a string)
383
+ - `BUCKET_NAME_GCS`
384
+ - `LLM_MODEL`
385
+ - `METADATA_PATH_GCS`
386
+ - `RAW_TEXTS_UPLOADED_PATH_GCS`
387
+ - `RAW_TEXTS_DOWNLOADED_PATH_GCS`
388
+ - `CLEANED_TEXTS_PATH_GCS`
389
+ - `EMBEDDINGS_PATH_GCS`
390
+ - `INDICES_PATH_GCS`
391
+ - `CHUNKS_PATH_GCS`
392
+ - `EMBEDDING_MODEL`
393
+
394
+ 4. The app should automatically deploy. If needed, manually trigger a rebuild from the Spaces UI.
395
+
396
+ ### Local Development (Not Recommended)
397
+
398
+ **Important Note**: Running Anveshak: Spirituality Q&A locally requires more than 16 GB of RAM due to the embedding model. Most standard laptops will crash during model loading. Hugging Face Spaces deployment is strongly recommended.
399
+
400
+ If you still want to run it locally for development purposes:
401
+
402
+ 1. Clone the repository
403
+ ```bash
404
+ git clone https://github.com/YourUsername/anveshak.git
405
+ cd anveshak
406
+ ```
407
+
408
+ 2. Install dependencies
409
+ ```bash
410
+ pip install -r requirements.txt
411
+ ```
412
+
413
+ 3. Create the `.streamlit/secrets.toml` file as described above
414
+
415
+ 4. Run the application
416
+ ```bash
417
+ streamlit run app.py
418
+ ```
419
+
420
+ ### Alternative: Docker Deployment
421
+
422
+ For containerized deployment (not tested in production):
423
+
424
+ 1. Create a `Dockerfile`:
425
+
426
+ ```dockerfile
427
+ FROM python:3.9-slim
428
+
429
+ WORKDIR /app
430
+
431
+ COPY requirements.txt .
432
+ RUN pip install -r requirements.txt
433
+
434
+ COPY . .
435
+
436
+ EXPOSE 8501
437
+
438
+ CMD ["streamlit", "run", "app.py"]
439
+ ```
440
+
441
+ 2. Build the Docker image:
442
+ ```bash
443
+ docker build -t anveshak .
444
+ ```
445
+
446
+ 3. Run the container:
447
+ ```bash
448
+ docker run -p 8501:8501 \
449
+ -e BUCKET_NAME_GCS=your-bucket-name \
450
+ -e METADATA_PATH_GCS=metadata/metadata.jsonl \
451
+ -e EMBEDDINGS_PATH_GCS=processed/embeddings/all_embeddings.npy \
452
+ -e INDICES_PATH_GCS=processed/indices/faiss_index.faiss \
453
+ -e CHUNKS_PATH_GCS=processed/chunks/text_chunks.txt \
454
+ -e RAW_TEXTS_UPLOADED_PATH_GCS=raw-texts/uploaded \
455
+ -e RAW_TEXTS_DOWNLOADED_PATH_GCS=raw-texts/downloaded/ \
456
+ -e CLEANED_TEXTS_PATH_GCS=cleaned-texts/ \
457
+ -e EMBEDDING_MODEL=intfloat/e5-large-v2 \
458
+ -e LLM_MODEL=gpt-3.5-turbo \
459
+ -e OPENAI_API_KEY=your-openai-api-key \
460
+ -e GCP_CREDENTIALS='{"type":"service_account",...}' \
461
+ anveshak
462
+ ```
463
+
464
+ ## Performance Tuning
465
+
466
+ ### Caching Configuration
467
+
468
+ Anveshak: Spirituality Q&A uses Streamlit's caching mechanisms to optimize performance:
469
+
470
+ #### Resource Caching
471
+ Used for loading models and data files that remain constant:
472
+
473
+ ```python
474
+ @st.cache_resource(show_spinner=False)
475
+ def cached_load_model():
476
+ # Load embedding model once and cache it
477
+ ```
478
+
479
+ This cache persists for the lifetime of the application.
480
+
481
+ #### Data Caching
482
+ Used for caching query results with a time-to-live (TTL):
483
+
484
+ ```python
485
+ @st.cache_data(ttl=3600, show_spinner=False)
486
+ def cached_process_query(query, top_k=5, word_limit=100):
487
+ # Cache query results for an hour
488
+ ```
489
+
490
+ The TTL (3600 seconds = 1 hour) can be adjusted based on your needs.
491
+
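+ For example, to keep cached answers for two hours instead of one, only the decorator argument changes (a sketch; the function body stays the same as in the application):
+
+ ```python
+ import streamlit as st
+
+ @st.cache_data(ttl=7200, show_spinner=False)  # 7200 seconds = 2 hours
+ def cached_process_query(query, top_k=5, word_limit=100):
+     ...  # same implementation as shown above
+ ```
+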
492
+ ### Memory Optimization
493
+
494
+ For deployments with limited memory:
495
+
496
+ 1. **Force CPU Usage**: Anveshak already forces CPU usage for the embedding model to avoid GPU memory issues:
497
+ ```python
498
+ os.environ["CUDA_VISIBLE_DEVICES"] = ""
499
+ ```
500
+
501
+ 2. **Adjust Batch Size**: If you're recreating the embeddings, consider reducing the batch size:
502
+ ```python
503
+ def create_embeddings(text_chunks, batch_size=16): # Reduced from 32
504
+ ```
505
+
506
+ 3. **Garbage Collection**: Anveshak performs explicit garbage collection after operations:
507
+ ```python
508
+ del outputs, inputs
509
+ gc.collect()
510
+ ```
511
+
512
+ ## Troubleshooting
513
+
514
+ ### Common Issues
515
+
516
+ #### Authentication Errors
517
+
518
+ **Symptom**: Error message about invalid credentials or permission denied.
519
+
520
+ **Solution**:
521
+ 1. Verify that your service account has the correct permissions (Storage Object Admin)
522
+ 2. Check that your API keys are correctly formatted and not expired
523
+ 3. Ensure that your GCP credentials JSON is valid and properly formatted
524
+
525
+ #### Missing Files
526
+
527
+ **Symptom**: Error about missing files or "File not found" when accessing GCS.
528
+
529
+ **Solution**:
530
+ 1. Verify the correct bucket name and file paths in your configuration
531
+ 2. Check that all required files exist in your GCS bucket
532
+ 3. Ensure your service account has access to the specified bucket
533
+
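+ To narrow down which file is missing, a small script like the sketch below can check each configured GCS path directly. It is not part of the application; it assumes credentials are set up as described earlier, and the bucket name and paths are the example values from this guide:
+
+ ```python
+ # Quick existence check for the processed files (sketch, not part of the app)
+ from google.cloud import storage
+
+ client = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS or other default credentials
+ bucket = client.bucket("your-bucket-name")
+
+ for path in [
+     "metadata/metadata.jsonl",
+     "processed/embeddings/all_embeddings.npy",
+     "processed/indices/faiss_index.faiss",
+     "processed/chunks/text_chunks.txt",
+ ]:
+     print(path, "->", "found" if bucket.blob(path).exists() else "MISSING")
+ ```
+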
534
+ #### Memory Issues
535
+
536
+ **Symptom**: Application crashes with out-of-memory errors.
537
+
538
+ **Solution**:
539
+ 1. Increase the memory allocation for your deployment (if possible)
540
+ 2. Ensure that `os.environ["CUDA_VISIBLE_DEVICES"] = ""` is set to force CPU usage
541
+ 3. Implement additional garbage collection calls in high-memory operations
542
+
543
+ #### OpenAI API Rate Limits
544
+
545
+ **Symptom**: Errors about rate limits or exceeding quotas with OpenAI.
546
+
547
+ **Solution**:
548
+ 1. Implement retry logic with exponential backoff (see the sketch after this list)
549
+ 2. Consider using a paid tier OpenAI account with higher rate limits
550
+ 3. Add caching to reduce the number of API calls
551
+
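+ For the first point, a minimal retry helper might look like the sketch below. It is illustrative only and not part of Anveshak's current codebase; the function name, retry count, and delays are assumptions you can tune:
+
+ ```python
+ # Illustrative retry-with-exponential-backoff helper (not in the current codebase)
+ import random
+ import time
+
+ def call_with_backoff(fn, max_retries=5, base_delay=1.0):
+     """Call fn(), retrying with exponential backoff plus jitter on failure."""
+     for attempt in range(max_retries):
+         try:
+             return fn()
+         except Exception as exc:  # in practice, catch the specific rate-limit error
+             if attempt == max_retries - 1:
+                 raise
+             delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
+             print(f"Request failed ({exc}); retrying in {delay:.1f}s")
+             time.sleep(delay)
+
+ # Usage sketch: wrap the existing answer generation call
+ # answer = call_with_backoff(lambda: answer_with_llm(query, context, word_limit))
+ ```
+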
552
+ ### Logs and Debugging
553
+
554
+ Anveshak includes comprehensive logging:
555
+
556
+ ```python
557
+ print(f"βœ… Model loaded successfully (cached)")
558
+ print(f"❌ Error loading model: {str(e)}")
559
+ ```
560
+
561
+ To enable more detailed logging, you can use Streamlit's built-in logging configuration:
562
+
563
+ ```python
564
+ import logging
565
+ logging.basicConfig(level=logging.INFO)
566
+ logger = logging.getLogger(__name__)
567
+
568
+ # Then use logger instead of print
569
+ logger.info("Model loaded successfully")
570
+ logger.error(f"Error loading model: {str(e)}")
571
+ ```
572
+
573
+ ## Special Considerations
574
+
575
+ ### Privacy
576
+
577
+ Anveshak: Spirituality Q&A is designed to not save or store any user queries or data. This is important for spiritual questions, which may be of a personal nature. No additional configuration is needed for this - the application simply does not implement any data storage functionality.
578
+
579
+ ### Language Support
580
+
581
+ Currently, Anveshak is only available in English. This is a known limitation of the current implementation. Future versions may include support for Sanskrit, Hindi, Bengali, Tamil, and other Indian languages.
582
+
583
+ ### Concise Answers
584
+
585
+ Anveshak generates concise answers rather than lengthy explanations. This is by design, to respect both copyright constraints and the nature of spiritual wisdom, which often benefits from clarity and simplicity.
586
+
587
+ ## Conclusion
588
+
589
+ This configuration guide provides all the necessary information to set up, customize, and deploy Anveshak: Spirituality Q&A. By following these instructions, you should be able to:
590
+
591
+ 1. Configure the necessary authentication for GCS and OpenAI
592
+ 2. Customize Anveshak's appearance and behavior
593
+ 3. Deploy the application on Hugging Face Spaces (recommended) or other platforms
594
+ 4. Optimize performance for your specific use case
595
+ 5. Troubleshoot common issues
596
+
597
+ The flexibility of the configuration options allows you to adapt the application to different deployment environments while maintaining the core functionality of providing spiritually informed answers based on traditional texts from diverse traditions and teachers of all backgrounds.
docs/data-handling-doc.md ADDED
@@ -0,0 +1,687 @@
1
+ # Data Handling Explanation
2
+
3
+ This document explains how data is processed, stored, and handled in Anveshak: Spirituality Q&A, with special attention to ethical considerations and respect for copyright.
4
+
5
+ ## Data Sources
6
+
7
+ ### Text Corpus Overview
8
+
9
+ Anveshak: Spirituality Q&A uses approximately 133 digitized spiritual texts sourced from freely available resources. These texts include:
10
+
11
+ - Ancient sacred literature (Vedas, Upanishads, Puranas, Sutras, Dharmaśāstras, and Agamas)
12
+ - Classical Indian texts (The Bhagavad Gita, The Śrīmad Bhāgavatam, and others)
13
+ - Indian historical texts (The Mahabharata and The Ramayana)
14
+ - Teachings of revered Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters of all genders, backgrounds, traditions, and walks of life
15
+
16
+ As stated in app.py:
17
+
18
+ > "Anveshak draws from a rich tapestry of spiritual wisdom found in classical Indian texts, philosophical treatises, and the teachings of revered Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters across centuries. The knowledge presented here spans multiple traditions, schools of thought, and spiritual lineages that have flourished in the Indian subcontinent and beyond."
19
+
20
+ ### Ethical Sourcing
21
+
22
+ All texts included in Anveshak meet the following criteria:
23
+
24
+ 1. **Public availability**: All texts were freely available from sources like archive.org
25
+ 2. **Educational use**: Texts are used solely for educational purposes
26
+ 3. **Proper attribution**: All sources are credited with author and publisher information
27
+ 4. **Respect for copyright**: Implementation of word limits and other copyright-respecting measures
28
+
29
+ As mentioned in app.py:
30
+
31
+ > "Note that the sources consist of about 133 digitized texts, all of which were freely available over the internet (on sites like archive.org). Many of the texts are English translations of original (and in some cases, ancient) sacred and spiritual texts. All of the copyrights belong to the respective authors and publishers and we bow down in gratitude to their selfless work. Anveshak merely re-presents the ocean of spiritual knowledge and wisdom contained in the original works with relevant citations in a limited number of words."
32
+
33
+ ## Data Processing Pipeline
34
+
35
+ ### 1. Data Collection
36
+
37
+ The data collection process involves two methods as implemented in preprocessing.py:
38
+
39
+ #### Manual Upload
40
+ Texts are manually uploaded to Google Cloud Storage (GCS) through a preprocessing script:
41
+
42
+ ```python
43
+ def upload_files_to_colab():
44
+ """Upload raw text files and metadata from local machine to Colab."""
45
+ # First, upload text files
46
+ print("Step 1: Please upload your text files...")
47
+ uploaded_text_files = files.upload() # This will prompt the user to upload files
48
+
49
+ # Create directory structure if it doesn't exist
50
+ os.makedirs(LOCAL_RAW_TEXTS_FOLDER, exist_ok=True)
51
+
52
+ # Move uploaded text files to the raw-texts folder
53
+ for filename, content in uploaded_text_files.items():
54
+ if filename.endswith(".txt"):
55
+ with open(os.path.join(LOCAL_RAW_TEXTS_FOLDER, filename), "wb") as f:
56
+ f.write(content)
57
+ print(f"βœ… Saved {filename} to {LOCAL_RAW_TEXTS_FOLDER}")
58
+ ```
59
+
60
+ #### Web Downloading
61
+ Some texts are automatically downloaded from URLs listed in the metadata file:
62
+
63
+ ```python
64
+ def download_text_files():
65
+ """Fetch metadata, filter unuploaded files, and download text files."""
66
+ metadata = fetch_metadata_from_gcs()
67
+ # Filter entries where Uploaded is False
68
+ files_to_download = [item for item in metadata if item["Uploaded"] == False]
69
+
70
+ # Process only necessary files
71
+ for item in files_to_download:
72
+ name, author, url = item["Title"], item["Author"], item["URL"]
73
+ if url.lower() == "not available":
74
+ print(f"❌ Skipping {name} - No URL available.")
75
+ continue
76
+
77
+ try:
78
+ response = requests.get(url)
79
+ if response.status_code == 200:
80
+ raw_text = response.text
81
+ filename = "{}.txt".format(name.replace(" ", "_"))
82
+ # Save to local first
83
+ local_path = f"/tmp/{filename}"
84
+ with open(local_path, "w", encoding="utf-8") as file:
85
+ file.write(raw_text)
86
+ # Upload to GCS
87
+ gcs_path = f"{RAW_TEXTS_DOWNLOADED_PATH_GCS}{filename}"
88
+ upload_to_gcs(local_path, gcs_path)
89
+ print(f"βœ… Downloaded & uploaded: {filename} ({len(raw_text.split())} words)")
90
+ else:
91
+ print(f"❌ Failed to download {name}: {url} (Status {response.status_code})")
92
+ except Exception as e:
93
+ print(f"❌ Error processing {name}: {e}")
94
+ ```
95
+
96
+ ### 2. Text Cleaning
97
+
98
+ Raw texts often contain HTML tags, OCR errors, and formatting issues. The cleaning process removes these artifacts using the exact implementation from preprocessing.py:
99
+
100
+ ```python
101
+ def rigorous_clean_text(text):
102
+ """
103
+ Clean text by removing metadata, junk text, and formatting issues.
104
+
105
+ This function:
106
+ 1. Removes HTML tags using BeautifulSoup
107
+ 2. Removes URLs and standalone numbers
108
+ 3. Removes all-caps OCR noise words
109
+ 4. Deduplicates adjacent identical lines
110
+ 5. Normalizes Unicode characters
111
+ 6. Standardizes whitespace and newlines
112
+
113
+ Args:
114
+ text (str): The raw text to clean
115
+
116
+ Returns:
117
+ str: The cleaned text
118
+ """
119
+ text = BeautifulSoup(text, "html.parser").get_text()
120
+ text = re.sub(r"https?:\/\/\S+", "", text) # Remove links
121
+ text = re.sub(r"\b\d+\b", "", text) # Remove standalone numbers
122
+ text = re.sub(r"\b[A-Z]{5,}\b", "", text) # Remove all-caps OCR noise words
123
+ lines = text.split("\n")
124
+ cleaned_lines = []
125
+ last_line = None
126
+
127
+ for line in lines:
128
+ line = line.strip()
129
+ if line and line != last_line:
130
+ cleaned_lines.append(line)
131
+ last_line = line
132
+
133
+ text = "\n".join(cleaned_lines)
134
+ text = unicodedata.normalize("NFKD", text)
135
+ text = re.sub(r"\s+", " ", text).strip()
136
+ text = re.sub(r"\n{2,}", "\n", text)
137
+ return text
138
+ ```
139
+
140
+ The cleaning process:
141
+ - Removes HTML tags using BeautifulSoup
142
+ - Eliminates URLs and standalone numbers
143
+ - Removes all-caps OCR noise words (common in digitized texts)
144
+ - Deduplicates adjacent identical lines
145
+ - Normalizes Unicode characters
146
+ - Standardizes whitespace and newlines
147
+
148
+ ### 3. Text Chunking
149
+
150
+ Clean texts are split into smaller, manageable chunks for processing using the exact implementation from preprocessing.py:
151
+
152
+ ```python
153
+ def chunk_text(text, chunk_size=500, overlap=50):
154
+ """
155
+ Split text into smaller, overlapping chunks for better retrieval.
156
+
157
+ Args:
158
+ text (str): The text to chunk
159
+ chunk_size (int): Maximum number of words per chunk
160
+ overlap (int): Number of words to overlap between chunks
161
+
162
+ Returns:
163
+ list: List of text chunks
164
+ """
165
+ words = text.split()
166
+ chunks = []
167
+ i = 0
168
+
169
+ while i < len(words):
170
+ chunk = " ".join(words[i:i + chunk_size])
171
+ chunks.append(chunk)
172
+ i += chunk_size - overlap
173
+
174
+ return chunks
175
+ ```
176
+
177
+ Chunking characteristics:
178
+ - **Chunk size**: 500 words per chunk, balancing context and retrieval precision
179
+ - **Overlap**: 50-word overlap between chunks to maintain context across chunk boundaries
180
+ - **Context preservation**: Ensures that passages aren't arbitrarily cut in the middle of important concepts
181
+
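+ To make the overlap arithmetic concrete, here is a small usage sketch. It assumes the `chunk_text` function shown above is in scope and uses a synthetic 1,000-word text:
+
+ ```python
+ # Toy illustration of the chunk boundaries produced by chunk_text() above
+ words = [f"w{i}" for i in range(1000)]    # synthetic 1,000-word "text"
+ text = " ".join(words)
+
+ chunks = chunk_text(text, chunk_size=500, overlap=50)
+ print(len(chunks))                        # 3 chunks
+ print([len(c.split()) for c in chunks])   # [500, 500, 100]
+ # Chunk 2 starts at word 450, so its first 50 words repeat the last 50 words of chunk 1
+ ```
+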
182
+ ### 4. Text Embedding
183
+
184
+ Chunks are converted to vector embeddings using the E5-large-v2 model with the actual implementation from preprocessing.py:
185
+
186
+ ```python
187
+ def create_embeddings(text_chunks, batch_size=32):
188
+ """
189
+ Generate embeddings for the given chunks of text using the specified embedding model.
190
+
191
+ This function:
192
+ 1. Uses SentenceTransformer to load the embedding model
193
+ 2. Prefixes each chunk with "passage:" as required by the E5 model
194
+ 3. Processes chunks in batches to manage memory usage
195
+ 4. Normalizes embeddings for cosine similarity search
196
+
197
+ Args:
198
+ text_chunks (list): List of text chunks to embed
199
+ batch_size (int): Number of chunks to process at once
200
+
201
+ Returns:
202
+ numpy.ndarray: Matrix of embeddings, one per text chunk
203
+ """
204
+ # Load the model with GPU optimization
205
+ model = SentenceTransformer(EMBEDDING_MODEL)
206
+ device = "cuda" if torch.cuda.is_available() else "cpu"
207
+ model = model.to(device)
208
+ print(f"πŸš€ Using device for embeddings: {device}")
209
+
210
+ prefixed_chunks = [f"passage: {text}" for text in text_chunks]
211
+ all_embeddings = []
212
+
213
+ for i in range(0, len(prefixed_chunks), batch_size):
214
+ batch = prefixed_chunks[i:i+batch_size]
215
+ # Move batch to GPU (if available) for faster processing
216
+ with torch.no_grad():
217
+ batch_embeddings = model.encode(batch, convert_to_numpy=True, normalize_embeddings=True)
218
+ all_embeddings.append(batch_embeddings)
219
+
220
+ if (i + batch_size) % 100 == 0 or (i + batch_size) >= len(prefixed_chunks):
221
+ print(f"πŸ“Œ Processed {i + min(batch_size, len(prefixed_chunks) - i)}/{len(prefixed_chunks)} documents")
222
+
223
+ return np.vstack(all_embeddings).astype("float32")
224
+ ```
225
+
226
+ Embedding process details:
227
+ - **Model**: E5-large-v2, a state-of-the-art embedding model for retrieval tasks
228
+ - **Prefix**: "passage:" prefix is added to each chunk for optimal embedding
229
+ - **Batching**: Processing in batches of 32 for memory efficiency
230
+ - **Normalization**: Embeddings are normalized for cosine similarity search
231
+ - **Output**: Each text chunk becomes a 1024-dimensional vector
232
+
233
+ ### 5. FAISS Index Creation
234
+
235
+ Embeddings are stored in a Facebook AI Similarity Search (FAISS) index for efficient similarity search:
236
+
237
+ ```python
238
+ # Build FAISS index
239
+ dimension = all_embeddings.shape[1]
240
+ index = faiss.IndexFlatIP(dimension)
241
+ index.add(all_embeddings)
242
+ ```
243
+
244
+ FAISS index characteristics:
245
+ - **Index type**: IndexFlatIP (Inner Product) for cosine similarity search
246
+ - **Exact search**: Uses exact search rather than approximate for maximum accuracy
247
+ - **Dimension**: 1024-dimensional vectors from the E5-large-v2 model
248
+
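+ Because the embeddings are L2-normalized before indexing, the inner-product scores returned by `IndexFlatIP` are cosine similarities. The following synthetic sketch (not taken from the codebase) illustrates the idea:
+
+ ```python
+ # Synthetic illustration: normalized vectors + IndexFlatIP => cosine similarity
+ import faiss
+ import numpy as np
+
+ dim = 1024
+ vectors = np.random.rand(10, dim).astype("float32")
+ vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize rows
+
+ index = faiss.IndexFlatIP(dim)
+ index.add(vectors)
+
+ query = vectors[:1]                 # a query identical to the first passage
+ scores, ids = index.search(query, 3)
+ print(scores[0][0])                 # ~1.0, i.e. an exact cosine match
+ ```
+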
249
+ ### 6. Metadata Management
250
+
251
+ The system maintains metadata for each text to provide proper citations, using the implementation from rag_engine.py:
252
+
253
+ ```python
254
+ def fetch_metadata_from_gcs():
255
+ """
256
+ Fetch metadata.jsonl from GCS and return as a list of dictionaries.
257
+
258
+ Each dictionary represents a text entry with metadata like title, author, etc.
259
+
260
+ Returns:
261
+ list: List of dictionaries containing metadata for each text
262
+ """
263
+ blob = bucket.blob(METADATA_PATH_GCS)
264
+ # Download metadata file
265
+ metadata_jsonl = blob.download_as_text()
266
+ # Parse JSONL
267
+ metadata = [json.loads(line) for line in metadata_jsonl.splitlines()]
268
+ return metadata
269
+ ```
270
+
271
+ Metadata structure (JSONL format):
272
+ ```json
273
+ {"Title": "Bhagavad Gita", "Author": "Vyasa", "Publisher": "Gita Press, Gorakhpur, India", "URL": "https://archive.org/details/bhagavad-gita", "Uploaded": true}
274
+ {"Title": "Yoga Sutras", "Author": "Patanjali", "Publisher": "DIVINE LIFE SOCIETY", "URL": "https://archive.org/details/yoga-sutras", "Uploaded": true}
275
+ ```
276
+
277
+ ## Data Storage Architecture
278
+
279
+ ### Google Cloud Storage Structure
280
+
281
+ Anveshak: Spirituality Q&A uses Google Cloud Storage (GCS) as its primary data store, organized as follows:
282
+
283
+ ```
284
+ bucket_name/
285
+ ├── metadata/
286
+ │   └── metadata.jsonl            # Metadata for all texts
287
+ ├── raw-texts/
288
+ │   ├── uploaded/                 # Manually uploaded texts
289
+ │   └── downloaded/               # Automatically downloaded texts
290
+ ├── cleaned-texts/                # Cleaned versions of all texts
291
+ └── processed/
292
+     ├── embeddings/
293
+     │   └── all_embeddings.npy    # Numpy array of embeddings
294
+     ├── indices/
295
+     │   └── faiss_index.faiss     # FAISS index file
296
+     └── chunks/
297
+         └── text_chunks.txt       # Text chunks with metadata
298
+ ```
299
+
300
+ ### Local Caching
301
+
302
+ For deployment on Hugging Face Spaces, essential files are downloaded to local storage using the implementation from rag_engine.py:
303
+
304
+ ```python
305
+ # Local Paths
306
+ local_embeddings_file = "all_embeddings.npy"
307
+ local_faiss_index_file = "faiss_index.faiss"
308
+ local_text_chunks_file = "text_chunks.txt"
309
+ local_metadata_file = "metadata.jsonl"
310
+ ```
311
+
312
+ These files are loaded with caching to improve performance, using the actual implementation from rag_engine.py:
313
+
314
+ ```python
315
+ @st.cache_resource(show_spinner=False)
316
+ def cached_load_data_files():
317
+ """
318
+ Cached version of load_data_files() for FAISS index, text chunks, and metadata.
319
+
320
+ This function loads:
321
+ - FAISS index for vector similarity search
322
+ - Text chunks containing the original spiritual text passages
323
+ - Metadata dictionary with publication and author information
324
+
325
+ All files are downloaded from Google Cloud Storage if not already present locally.
326
+
327
+ Returns:
328
+ tuple: (faiss_index, text_chunks, metadata_dict) or (None, None, None) if loading fails
329
+ """
330
+ # Initialize GCP and OpenAI clients
331
+ bucket = setup_gcp_client()
332
+ openai_initialized = setup_openai_client()
333
+
334
+ if not bucket or not openai_initialized:
335
+ print("Failed to initialize required services")
336
+ return None, None, None
337
+
338
+ # Get GCS paths from secrets - required
339
+ try:
340
+ metadata_file_gcs = st.secrets["METADATA_PATH_GCS"]
341
+ embeddings_file_gcs = st.secrets["EMBEDDINGS_PATH_GCS"]
342
+ faiss_index_file_gcs = st.secrets["INDICES_PATH_GCS"]
343
+ text_chunks_file_gcs = st.secrets["CHUNKS_PATH_GCS"]
344
+ except KeyError as e:
345
+ print(f"❌ Error: Required GCS path not found in secrets: {e}")
346
+ return None, None, None
347
+
348
+ # Download necessary files if not already present locally
349
+ success = True
350
+ success &= download_file_from_gcs(bucket, faiss_index_file_gcs, local_faiss_index_file)
351
+ success &= download_file_from_gcs(bucket, text_chunks_file_gcs, local_text_chunks_file)
352
+ success &= download_file_from_gcs(bucket, metadata_file_gcs, local_metadata_file)
353
+
354
+ if not success:
355
+ print("Failed to download required files")
356
+ return None, None, None
357
+
358
+ # Load FAISS index, text chunks, and metadata
359
+ try:
360
+ faiss_index = faiss.read_index(local_faiss_index_file)
361
+ except Exception as e:
362
+ print(f"❌ Error loading FAISS index: {str(e)}")
363
+ return None, None, None
364
+
365
+ # Load text chunks
366
+ try:
367
+ text_chunks = {} # Mapping: ID -> (Title, Author, Text)
368
+ with open(local_text_chunks_file, "r", encoding="utf-8") as f:
369
+ for line in f:
370
+ parts = line.strip().split("\t")
371
+ if len(parts) == 4:
372
+ text_chunks[int(parts[0])] = (parts[1], parts[2], parts[3])
373
+ except Exception as e:
374
+ print(f"❌ Error loading text chunks: {str(e)}")
375
+ return None, None, None
376
+
377
+ # Load metadata
378
+ try:
379
+ metadata_dict = {}
380
+ with open(local_metadata_file, "r", encoding="utf-8") as f:
381
+ for line in f:
382
+ item = json.loads(line)
383
+ metadata_dict[item["Title"]] = item
384
+ except Exception as e:
385
+ print(f"❌ Error loading metadata: {str(e)}")
386
+ return None, None, None
387
+
388
+ print(f"βœ… Data loaded successfully (cached): {len(text_chunks)} passages available")
389
+ return faiss_index, text_chunks, metadata_dict
390
+ ```
391
+
392
+ ## Data Access During Query Processing
393
+
394
+ ### Query Embedding
395
+
396
+ User queries are embedded using the same model as the text corpus, with the actual implementation from rag_engine.py:
397
+
398
+ ```python
399
+ def get_embedding(text):
400
+ """
401
+ Generate embeddings for a text query using the cached model.
402
+
403
+ Uses an in-memory cache to avoid redundant embedding generation for repeated queries.
404
+ Properly prefixes inputs with "query:" or "passage:" as required by the E5 model.
405
+
406
+ Args:
407
+ text (str): The query text to embed
408
+
409
+ Returns:
410
+ numpy.ndarray: The embedding vector or a zero vector if embedding fails
411
+ """
412
+ if text in query_embedding_cache:
413
+ return query_embedding_cache[text]
414
+
415
+ try:
416
+ tokenizer, model = cached_load_model()
417
+ if model is None:
418
+ print("Model is None, returning zero embedding")
419
+ return np.zeros((1, 384), dtype=np.float32)
420
+
421
+ # Format input based on text length
422
+ # For E5 models, "query:" prefix is for questions, "passage:" for documents
423
+ input_text = f"query: {text}" if len(text) < 512 else f"passage: {text}"
424
+ inputs = tokenizer(
425
+ input_text,
426
+ padding=True,
427
+ truncation=True,
428
+ return_tensors="pt",
429
+ max_length=512,
430
+ return_attention_mask=True
431
+ )
432
+ with torch.no_grad():
433
+ outputs = model(**inputs)
434
+ embeddings = average_pool(outputs.last_hidden_state, inputs['attention_mask'])
435
+ embeddings = nn.functional.normalize(embeddings, p=2, dim=1)
436
+ embeddings = embeddings.detach().cpu().numpy()
437
+ del outputs, inputs
438
+ gc.collect()
439
+ query_embedding_cache[text] = embeddings
440
+ return embeddings
441
+ except Exception as e:
442
+ print(f"❌ Embedding error: {str(e)}")
443
+ return np.zeros((1, 384), dtype=np.float32)
444
+ ```
445
+
446
+ Note the use of:
447
+ - **Query prefix**: "query:" is added to distinguish query embeddings from passage embeddings
448
+ - **Truncation**: Queries are truncated to 512 tokens if necessary
449
+ - **Memory management**: Tensors are detached and moved to CPU after computation
450
+ - **Caching**: Query embeddings are cached to avoid redundant computation
451
+
452
+ ### Passage Retrieval
453
+
454
+ The system retrieves relevant passages based on query embedding similarity using the implementation from rag_engine.py:
455
+
456
+ ```python
457
+ def retrieve_passages(query, faiss_index, text_chunks, metadata_dict, top_k=5, similarity_threshold=0.5):
458
+ """
459
+ Retrieve the most relevant passages for a given spiritual query.
460
+
461
+ This function:
462
+ 1. Embeds the user query using the same model used for text chunks
463
+ 2. Finds similar passages using the FAISS index with cosine similarity
464
+ 3. Filters results based on similarity threshold to ensure relevance
465
+ 4. Enriches results with metadata (title, author, publisher)
466
+ 5. Ensures passage diversity by including only one passage per source title
467
+
468
+ Args:
469
+ query (str): The user's spiritual question
470
+ faiss_index: FAISS index containing passage embeddings
471
+ text_chunks (dict): Dictionary mapping IDs to text chunks and metadata
472
+ metadata_dict (dict): Dictionary containing publication information
473
+ top_k (int): Maximum number of passages to retrieve
474
+ similarity_threshold (float): Minimum similarity score (0.0-1.0) for retrieved passages
475
+
476
+ Returns:
477
+ tuple: (retrieved_passages, retrieved_sources) containing the text and source information
478
+ """
479
+ try:
480
+ print(f"\nπŸ” Retrieving passages for query: {query}")
481
+ query_embedding = get_embedding(query)
482
+ distances, indices = faiss_index.search(query_embedding, top_k * 2)
483
+ print(f"Found {len(distances[0])} potential matches")
484
+ retrieved_passages = []
485
+ retrieved_sources = []
486
+ cited_titles = set()
487
+ for dist, idx in zip(distances[0], indices[0]):
488
+ print(f"Distance: {dist:.4f}, Index: {idx}")
489
+ if idx in text_chunks and dist >= similarity_threshold:
490
+ title_with_txt, author, text = text_chunks[idx]
491
+ clean_title = title_with_txt.replace(".txt", "") if title_with_txt.endswith(".txt") else title_with_txt
492
+ clean_title = unicodedata.normalize("NFC", clean_title)
493
+ if clean_title in cited_titles:
494
+ continue
495
+ metadata_entry = metadata_dict.get(clean_title, {})
496
+ author = metadata_entry.get("Author", "Unknown")
497
+ publisher = metadata_entry.get("Publisher", "Unknown")
498
+ cited_titles.add(clean_title)
499
+ retrieved_passages.append(text)
500
+ retrieved_sources.append((clean_title, author, publisher))
501
+ if len(retrieved_passages) == top_k:
502
+ break
503
+ print(f"Retrieved {len(retrieved_passages)} passages")
504
+ return retrieved_passages, retrieved_sources
505
+ except Exception as e:
506
+ print(f"❌ Error in retrieve_passages: {str(e)}")
507
+ return [], []
508
+ ```
509
+
510
+ Important aspects:
511
+ - **Similarity threshold**: Passages must have a similarity score >= 0.5 to be included
512
+ - **Diversity**: Only one passage per source title is included in the results
513
+ - **Metadata enrichment**: Publisher information is added from the metadata
514
+ - **Configurable retrieval**: The `top_k` parameter allows users to adjust how many sources to use
515
+
516
+ ## User Data Privacy
517
+
518
+ ### No Data Collection
519
+
520
+ Anveshak is designed to respect user privacy by not collecting or storing any user data:
521
+
522
+ 1. **No Query Storage**: User questions are processed in memory and not saved
523
+ 2. **No User Identification**: No user accounts or identification is required
524
+ 3. **No Analytics**: No usage tracking or analytics are implemented
525
+ 4. **No Cookies**: No browser cookies are used to track users
526
+
527
+ As stated in app.py:
528
+
529
+ > "We do not save any user data or queries. However, user questions are processed using OpenAI's LLM service to generate responses. While we do not store this information, please be aware that interactions are processed through OpenAI's platform and are subject to their privacy policies and data handling practices."
530
+
531
+ This privacy-first approach ensures that users can freely explore spiritual questions without concerns about their queries being stored or analyzed.
532
+
533
+ ## Copyright and Ethical Considerations
534
+
535
+ ### Word Limit Implementation
536
+
537
+ To respect copyright and ensure fair use, answers are limited to a configurable word count using the actual implementation from rag_engine.py:
538
+
539
+ ```python
540
+ def answer_with_llm(query, context=None, word_limit=100):
541
+ # ... LLM processing ...
542
+
543
+ # Extract and format the answer
544
+ answer = response.choices[0].message.content.strip()
545
+ words = answer.split()
546
+ if len(words) > word_limit:
547
+ answer = " ".join(words[:word_limit])
548
+ if not answer.endswith((".", "!", "?")):
549
+ answer += "."
550
+
551
+ return answer
552
+ ```
553
+
554
+ Users can adjust the word limit from 50 to 500 words, ensuring that responses are:
555
+ - Short enough to respect copyright
556
+ - Long enough to provide meaningful information
557
+ - Always properly cited to the original source
558
+
559
+ ### Citation Format
560
+
561
+ Every answer includes citations to the original sources using the implementation from rag_engine.py:
562
+
563
+ ```python
564
+ def format_citations(sources):
565
+ """
566
+ Format citations for display to the user.
567
+
568
+ Creates properly formatted citations for each source used in generating the answer.
569
+ Each citation appears on a new line with consistent formatting.
570
+
571
+ Args:
572
+ sources (list): List of (title, author, publisher) tuples
573
+
574
+ Returns:
575
+ str: Formatted citations as a string with each citation on a new line
576
+ """
577
+ formatted_citations = []
578
+ for title, author, publisher in sources:
579
+ if publisher.endswith(('.', '!', '?')):
580
+ formatted_citations.append(f"📚 {title} by {author}, Published by {publisher}")
580
+ else:
581
+ formatted_citations.append(f"📚 {title} by {author}, Published by {publisher}.")
583
+ return "\n".join(formatted_citations)
584
+ ```
585
+
586
+ Citations include:
587
+ - Book/text title
588
+ - Author name
589
+ - Publisher information
590
+
591
+ ### Acknowledgment of Sources
592
+
593
+ Anveshak: Spirituality Q&A includes dedicated pages for acknowledging:
594
+ - Publishers of the original texts
595
+ - Saints, Sages, and Spiritual Masters whose teachings are referenced
596
+ - The origins and traditions of the spiritual texts
597
+
598
+ A thank-you note is also prominently featured on the main page, as shown in app.py:
599
+
600
+ ```python
601
+ st.markdown('<div class="acknowledgment-header">A Heartfelt Thank You</div>', unsafe_allow_html=True)
602
+ st.markdown("""
603
+ It is believed that one cannot be in a spiritual path without the will of the Lord. One need not be a believer or a non-believer, merely proceeding to thoughtlessness and observation is enough to evolve and shape perspectives. But that happens through grace. It is believed that without the will of the Lord, one cannot be blessed by real Saints, and without the will of the Saints, one cannot get close to them or God.
604
+
605
+ Therefore, with deepest reverence, we express our gratitude to:
606
+
607
+ **The Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters** of all genders, backgrounds, traditions, and walks of life whose timeless wisdom illuminates Anveshak. From ancient Sages to modern Masters, their selfless dedication to uplift humanity through selfless love and spiritual knowledge continues to guide seekers on the path.
608
+ # ...
609
+ """)
610
+ ```
611
+
612
+ ### Inclusive Recognition
613
+
614
+ Anveshak explicitly acknowledges and honors spiritual teachers from all backgrounds:
615
+
616
+ - All references to spiritual figures capitalize the first letter (Saints, Sages, etc.)
617
+ - The application includes language acknowledging Masters of "all genders, backgrounds, traditions, and walks of life"
618
+ - The selection of texts aims to represent diverse spiritual traditions
619
+
620
+ From the Sources.py file:
621
+
622
+ > "Additionally, there are and there have been many other great Saints, enlightened beings, Sadhus, Sages, and Gurus who have worked tirelessly to uplift humanity and guide beings to their true SELF and path, of whom little is known and documented. We thank them and acknowledge their contribution to the world."
623
+
624
+ ## Data Replication and Backup
625
+
626
+ ### GCS as Primary Storage
627
+
628
+ Google Cloud Storage serves as both the primary storage and backup system:
629
+
630
+ - All preprocessed data is stored in GCS buckets
631
+ - GCS provides built-in redundancy and backup capabilities
632
+ - Data is loaded from GCS at application startup
633
+
634
+ ### Local Caching
635
+
636
+ For performance, Anveshak caches data locally using the implementation from rag_engine.py:
637
+
638
+ ```python
639
+ def download_file_from_gcs(bucket, gcs_path, local_path):
640
+ """
641
+ Download a file from GCS to local storage if not already present.
642
+
643
+ Only downloads if the file isn't already present locally, avoiding redundant downloads.
644
+
645
+ Args:
646
+ bucket: GCS bucket object
647
+ gcs_path (str): Path to the file in GCS
648
+ local_path (str): Local path where the file should be saved
649
+
650
+ Returns:
651
+ bool: True if download was successful or file already exists, False otherwise
652
+ """
653
+ try:
654
+ if os.path.exists(local_path):
655
+ print(f"File already exists locally: {local_path}")
656
+ return True
657
+
658
+ blob = bucket.blob(gcs_path)
659
+ blob.download_to_filename(local_path)
660
+ print(f"βœ… Downloaded {gcs_path} β†’ {local_path}")
661
+ return True
662
+ except Exception as e:
663
+ print(f"❌ Error downloading {gcs_path}: {str(e)}")
664
+ return False
665
+ ```
666
+
667
+ This approach:
668
+ - Avoids redundant downloads
669
+ - Preserves data across application restarts
670
+ - Reduces API calls to GCS
671
+
672
+ ## Conclusion
673
+
674
+ Anveshak: Spirituality Q&A implements a comprehensive data handling strategy that:
675
+
676
+ 1. **Respects Copyright**: Through word limits, citations, and acknowledgments
677
+ 2. **Preserves Source Integrity**: By maintaining accurate metadata and citations
678
+ 3. **Optimizes Performance**: Through efficient storage, retrieval, and caching
679
+ 4. **Ensures Ethical Use**: By focusing on educational purposes and proper attribution
680
+ 5. **Protects Privacy**: By not collecting or storing user data
681
+ 6. **Honors Diversity**: By acknowledging spiritual teachers of all backgrounds and traditions
682
+
683
+ This balance between technical efficiency and ethical responsibility allows Anveshak to serve as a bridge to spiritual knowledge while respecting the original sources, traditions, and user privacy. The system is designed not to replace personal spiritual inquiry but to supplement it by making traditional wisdom more accessible.
684
+
685
+ As stated in the conclusion of the blog post:
686
+
687
+ > "The core philosophy guiding this project is that while technology can facilitate access to spiritual knowledge, the journey to self-discovery remains deeply personal. As Anveshak states: 'The path and journey to the SELF is designed to be undertaken alone. The all-encompassing knowledge is internal and not external.'"
scripts/preprocessing.ipynb ADDED
@@ -0,0 +1,644 @@
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 0,
4
+ "metadata": {
5
+ "colab": {
6
+ "provenance": [],
7
+ "gpuType": "L4"
8
+ },
9
+ "kernelspec": {
10
+ "name": "python3",
11
+ "display_name": "Python 3"
12
+ },
13
+ "language_info": {
14
+ "name": "python"
15
+ },
16
+ "accelerator": "GPU"
17
+ },
18
+ "cells": [
19
+ {
20
+ "cell_type": "code",
21
+ "source": [
22
+ "\"\"\"\n",
23
+ "Anveshak: Spirituality Q&A - Data Preprocessing Pipeline\n",
24
+ "\n",
25
+ "This script processes the spiritual text corpus for the Anveshak application:\n",
26
+ "1. Uploads and downloads text files from various sources\n",
27
+ "2. Cleans and processes the texts to remove artifacts and noise\n",
28
+ "3. Chunks texts into smaller, manageable pieces\n",
29
+ "4. Generates embeddings using the E5-large-v2 model\n",
30
+ "5. Creates a FAISS index for efficient similarity search\n",
31
+ "6. Uploads all processed data to Google Cloud Storage\n",
32
+ "\n",
33
+ "Usage:\n",
34
+ "- Run in Google Colab with GPU runtime for faster embedding generation\n",
35
+ "- Ensure GCP authentication is set up before running\n",
36
+ "- Configure the constants below with your actual settings\n",
37
+ "\"\"\""
38
+ ],
39
+ "metadata": {
40
+ "id": "Cyjr-eDz9GmH"
41
+ },
42
+ "execution_count": null,
43
+ "outputs": []
44
+ },
45
+ {
46
+ "cell_type": "code",
47
+ "source": [
48
+ "# =============================================================================\n",
49
+ "# CONFIGURATION SETTINGS\n",
50
+ "# =============================================================================\n",
51
+ "# Update these values with your actual settings\n",
52
+ "# Before open-sourcing, clear these values or replace with placeholders\n",
53
+ "BUCKET_NAME_GCS = \"your-bucket-name\" # e.g., \"spiritual-texts-bucket\"\n",
54
+ "EMBEDDING_MODEL = \"your-embedding-model\" # e.g., \"intfloat/e5-large-v2\"\n",
55
+ "# LLM_MODEL = \"your-llm-model\" # e.g., \"gpt-3.5-turbo\"\n",
56
+ "\n",
57
+ "# GCS Paths - update these with your folder structure\n",
58
+ "METADATA_PATH_GCS = \"metadata/metadata.jsonl\"\n",
59
+ "RAW_TEXTS_UPLOADED_PATH_GCS = \"raw-texts/uploaded\"\n",
60
+ "RAW_TEXTS_DOWNLOADED_PATH_GCS = \"raw-texts/downloaded/\"\n",
61
+ "CLEANED_TEXTS_PATH_GCS = \"cleaned-texts/\"\n",
62
+ "EMBEDDINGS_PATH_GCS = \"processed/embeddings/all_embeddings.npy\"\n",
63
+ "INDICES_PATH_GCS = \"processed/indices/faiss_index.faiss\"\n",
64
+ "CHUNKS_PATH_GCS = \"processed/chunks/text_chunks.txt\"\n",
65
+ "\n",
66
+ "# Local file paths in Colab environment - update these with your folder structure\n",
67
+ "LOCAL_METADATA_FILE = \"/content/metadata.jsonl\"\n",
68
+ "LOCAL_RAW_TEXTS_FOLDER = \"/content/raw-texts/uploaded\"\n",
69
+ "LOCAL_EMBEDDINGS_FILE = \"/tmp/all_embeddings.npy\"\n",
70
+ "LOCAL_FAISS_INDEX_FILE = \"/tmp/faiss_index.faiss\"\n",
71
+ "LOCAL_TEXT_CHUNKS_FILE = \"/tmp/text_chunks.txt\""
72
+ ],
73
+ "metadata": {
74
+ "id": "YEDyIvmoXsPB"
75
+ },
76
+ "execution_count": null,
77
+ "outputs": []
78
+ },
79
+ {
80
+ "cell_type": "code",
81
+ "execution_count": null,
82
+ "metadata": {
83
+ "id": "H1tEbKhur8xf"
84
+ },
85
+ "outputs": [],
86
+ "source": [
87
+ "# Install required packages\n",
88
+ "!pip install faiss-cpu"
89
+ ]
90
+ },
91
+ {
92
+ "cell_type": "code",
93
+ "source": [
94
+ "# Import necessary libraries\n",
95
+ "from google.colab import files\n",
96
+ "from google.colab import auth\n",
97
+ "from google.cloud import storage\n",
98
+ "import os\n",
99
+ "import json\n",
100
+ "import requests\n",
101
+ "import re\n",
102
+ "import unicodedata\n",
103
+ "from bs4 import BeautifulSoup\n",
104
+ "import numpy as np\n",
105
+ "import faiss\n",
106
+ "import torch\n",
107
+ "from sentence_transformers import SentenceTransformer"
108
+ ],
109
+ "metadata": {
110
+ "id": "xCDTvZJRse4-"
111
+ },
112
+ "execution_count": null,
113
+ "outputs": []
114
+ },
115
+ {
116
+ "cell_type": "code",
117
+ "source": [
118
+ "# =============================================================================\n",
119
+ "# AUTHENTICATION & INITIALIZATION\n",
120
+ "# =============================================================================\n",
121
+ "\n",
122
+ "# Authenticate with Google Cloud (only needed in Colab)\n",
123
+ "auth.authenticate_user()\n",
124
+ "\n",
125
+ "# Initialize GCS client (single initialization)\n",
126
+ "storage_client = storage.Client()\n",
127
+ "bucket = storage_client.bucket(BUCKET_NAME_GCS)"
128
+ ],
129
+ "metadata": {
130
+ "id": "hSYQ0ZSasjLd"
131
+ },
132
+ "execution_count": null,
133
+ "outputs": []
134
+ },
135
+ {
136
+ "cell_type": "code",
137
+ "source": [
138
+ "# =============================================================================\n",
139
+ "# PART 1: UPLOAD RAW TEXTS AND METADATA\n",
140
+ "# =============================================================================\n",
141
+ "\n",
142
+ "def upload_files_to_colab():\n",
143
+ " \"\"\"\n",
144
+ " Upload raw text files and metadata from local machine to Colab.\n",
145
+ "\n",
146
+ " This function:\n",
147
+ " 1. Prompts the user to upload text files\n",
148
+ " 2. Saves the uploaded files to a local directory\n",
149
+ " 3. Prompts the user to upload the metadata.jsonl file\n",
150
+ " 4. Saves the metadata file to the specified location\n",
151
+ "\n",
152
+ " Returns:\n",
153
+ " bool: True if upload was successful, False otherwise\n",
154
+ " \"\"\"\n",
155
+ " # First, upload text files\n",
156
+ " print(\"Step 1: Please upload your text files...\")\n",
157
+ " uploaded_text_files = files.upload() # This will prompt the user to upload files\n",
158
+ "\n",
159
+ " # Create directory structure if it doesn't exist\n",
160
+ " os.makedirs(LOCAL_RAW_TEXTS_FOLDER, exist_ok=True)\n",
161
+ "\n",
162
+ " # Move uploaded text files to the raw-texts folder\n",
163
+ " for filename, content in uploaded_text_files.items():\n",
164
+ " if filename.endswith(\".txt\"):\n",
165
+ " with open(os.path.join(LOCAL_RAW_TEXTS_FOLDER, filename), \"wb\") as f:\n",
166
+ " f.write(content)\n",
167
+ " print(f\"βœ… Saved {filename} to {LOCAL_RAW_TEXTS_FOLDER}\")\n",
168
+ "\n",
169
+ " print(\"Text files upload complete!\")\n",
170
+ "\n",
171
+ " # Next, upload metadata file\n",
172
+ " print(\"\\nStep 2: Please upload your metadata.jsonl file...\")\n",
173
+ " uploaded_metadata = files.upload() # This will prompt the user to upload files\n",
174
+ "\n",
175
+ " # Save metadata file\n",
176
+ " metadata_uploaded = False\n",
177
+ " for filename, content in uploaded_metadata.items():\n",
178
+ " if filename == \"metadata.jsonl\":\n",
179
+ " # Ensure the directory for metadata file exists\n",
180
+ " os.makedirs(os.path.dirname(LOCAL_METADATA_FILE), exist_ok=True)\n",
181
+ " with open(LOCAL_METADATA_FILE, \"wb\") as f:\n",
182
+ " f.write(content)\n",
183
+ " print(f\"βœ… Saved metadata.jsonl to {LOCAL_METADATA_FILE}\")\n",
184
+ " metadata_uploaded = True\n",
185
+ "\n",
186
+ " if not metadata_uploaded:\n",
187
+ " print(\"⚠️ Warning: metadata.jsonl was not uploaded. Please upload it to continue.\")\n",
188
+ " return False\n",
189
+ "\n",
190
+ " print(\"Upload to Colab complete!\")\n",
191
+ " return True\n",
192
+ "\n",
193
+ "def upload_files_to_gcs():\n",
194
+ " \"\"\"\n",
195
+ " Upload raw text files and metadata from Colab to Google Cloud Storage.\n",
196
+ "\n",
197
+ " This function:\n",
198
+ " 1. Uploads each text file from the local directory to GCS\n",
199
+ " 2. Uploads the metadata.jsonl file to GCS\n",
200
+ "\n",
201
+ " All files are uploaded to the paths specified in the configuration constants.\n",
202
+ " \"\"\"\n",
203
+ " # Upload each file from the local raw-texts folder to GCS\n",
204
+ " for filename in os.listdir(LOCAL_RAW_TEXTS_FOLDER):\n",
205
+ " local_path = os.path.join(LOCAL_RAW_TEXTS_FOLDER, filename)\n",
206
+ " blob_path = f\"{RAW_TEXTS_UPLOADED_PATH_GCS}/{filename}\" # GCS path\n",
207
+ " blob = bucket.blob(blob_path)\n",
208
+ " try:\n",
209
+ " blob.upload_from_filename(local_path)\n",
210
+ " print(f\"βœ… Uploaded: {filename} -> gs://{BUCKET_NAME_GCS}/{blob_path}\")\n",
211
+ " except Exception as e:\n",
212
+ " print(f\"❌ Failed to upload {filename}: {e}\")\n",
213
+ "\n",
214
+ " # Upload metadata file\n",
215
+ " blob = bucket.blob(METADATA_PATH_GCS)\n",
216
+ " try:\n",
217
+ " blob.upload_from_filename(LOCAL_METADATA_FILE)\n",
218
+ " print(f\"βœ… Uploaded metadata.jsonl -> gs://{BUCKET_NAME_GCS}/{METADATA_PATH_GCS}\")\n",
219
+ " except Exception as e:\n",
220
+ " print(f\"❌ Failed to upload metadata: {e}\")"
221
+ ],
222
+ "metadata": {
223
+ "id": "cShc029islmO"
224
+ },
225
+ "execution_count": null,
226
+ "outputs": []
227
+ },
228
+ {
229
+ "cell_type": "code",
230
+ "source": [
231
+ "# =============================================================================\n",
232
+ "# PART 2: DOWNLOAD AND CLEAN TEXTS\n",
233
+ "# =============================================================================\n",
234
+ "\n",
235
+ "def fetch_metadata_from_gcs():\n",
236
+ " \"\"\"\n",
237
+ " Fetch metadata.jsonl from GCS and return as a list of dictionaries.\n",
238
+ "\n",
239
+ " Each dictionary represents a text entry with metadata like title, author, etc.\n",
240
+ "\n",
241
+ " Returns:\n",
242
+ " list: List of dictionaries containing metadata for each text\n",
243
+ " \"\"\"\n",
244
+ " blob = bucket.blob(METADATA_PATH_GCS)\n",
245
+ " # Download metadata file\n",
246
+ " metadata_jsonl = blob.download_as_text()\n",
247
+ " # Parse JSONL\n",
248
+ " metadata = [json.loads(line) for line in metadata_jsonl.splitlines()]\n",
249
+ " return metadata\n",
250
+ "\n",
251
+ "def upload_to_gcs(source_file, destination_path):\n",
252
+ " \"\"\"\n",
253
+ " Upload a local file to Google Cloud Storage.\n",
254
+ "\n",
255
+ " Args:\n",
256
+ " source_file (str): Path to the local file\n",
257
+ " destination_path (str): Path in GCS where the file should be uploaded\n",
258
+ " \"\"\"\n",
259
+ " blob = bucket.blob(destination_path)\n",
260
+ " blob.upload_from_filename(source_file)\n",
261
+ " print(f\"πŸ“€ Uploaded to GCS: {destination_path}\")\n",
262
+ "\n",
263
+ "def download_text_files():\n",
264
+ " \"\"\"\n",
265
+ " Download text files from URLs specified in the metadata.\n",
266
+ "\n",
267
+ " This function:\n",
268
+ " 1. Fetches metadata from GCS\n",
269
+ " 2. Filters entries where Uploaded=False (texts to be downloaded)\n",
270
+ " 3. Downloads each text from its URL\n",
271
+ " 4. Uploads the downloaded text to GCS\n",
272
+ "\n",
273
+ " This allows automated collection of texts that weren't manually uploaded.\n",
274
+ " \"\"\"\n",
275
+ " metadata = fetch_metadata_from_gcs()\n",
276
+ " # Filter entries where Uploaded is False\n",
277
+ " files_to_download = [item for item in metadata if item[\"Uploaded\"] == False]\n",
278
+ " print(f\"πŸ” Found {len(files_to_download)} files to download\")\n",
279
+ "\n",
280
+ " # Process only necessary files\n",
281
+ " for item in files_to_download:\n",
282
+ " name, author, url = item[\"Title\"], item[\"Author\"], item[\"URL\"]\n",
283
+ " if url.lower() == \"not available\":\n",
284
+ " print(f\"❌ Skipping {name} - No URL available.\")\n",
285
+ " continue\n",
286
+ "\n",
287
+ " try:\n",
288
+ " response = requests.get(url)\n",
289
+ " if response.status_code == 200:\n",
290
+ " raw_text = response.text\n",
291
+ " filename = \"{}.txt\".format(name.replace(\" \", \"_\"))\n",
292
+ " # Save to local first\n",
293
+ " local_path = f\"/tmp/{filename}\"\n",
294
+ " with open(local_path, \"w\", encoding=\"utf-8\") as file:\n",
295
+ " file.write(raw_text)\n",
296
+ " # Upload to GCS\n",
297
+ " gcs_path = f\"{RAW_TEXTS_DOWNLOADED_PATH_GCS}{filename}\"\n",
298
+ " upload_to_gcs(local_path, gcs_path)\n",
299
+ " print(f\"βœ… Downloaded & uploaded: {filename} ({len(raw_text.split())} words)\")\n",
300
+ " # Clean up temp file\n",
301
+ " os.remove(local_path)\n",
302
+ " else:\n",
303
+ " print(f\"❌ Failed to download {name}: {url} (Status {response.status_code})\")\n",
304
+ " except Exception as e:\n",
305
+ " print(f\"❌ Error processing {name}: {e}\")\n",
306
+ "\n",
307
+ "def rigorous_clean_text(text):\n",
308
+ " \"\"\"\n",
309
+ " Clean text by removing metadata, junk text, and formatting issues.\n",
310
+ "\n",
311
+ " This function:\n",
312
+ " 1. Removes HTML tags using BeautifulSoup\n",
313
+ " 2. Removes URLs and standalone numbers\n",
314
+ " 3. Removes all-caps OCR noise words\n",
315
+ " 4. Deduplicates adjacent identical lines\n",
316
+ " 5. Normalizes Unicode characters\n",
317
+ " 6. Standardizes whitespace and newlines\n",
318
+ "\n",
319
+ " Args:\n",
320
+ " text (str): The raw text to clean\n",
321
+ "\n",
322
+ " Returns:\n",
323
+ " str: The cleaned text\n",
324
+ " \"\"\"\n",
325
+ " text = BeautifulSoup(text, \"html.parser\").get_text()\n",
326
+ " text = re.sub(r\"https?:\\/\\/\\S+\", \"\", text) # Remove links\n",
327
+ " text = re.sub(r\"\\b\\d+\\b\", \"\", text) # Remove standalone numbers\n",
328
+ " text = re.sub(r\"\\b[A-Z]{5,}\\b\", \"\", text) # Remove all-caps OCR noise words\n",
329
+ " lines = text.split(\"\\n\")\n",
330
+ " cleaned_lines = []\n",
331
+ " last_line = None\n",
332
+ "\n",
333
+ " for line in lines:\n",
334
+ " line = line.strip()\n",
335
+ " if line and line != last_line:\n",
336
+ " cleaned_lines.append(line)\n",
337
+ " last_line = line\n",
338
+ "\n",
339
+ " text = \"\\n\".join(cleaned_lines)\n",
340
+ " text = unicodedata.normalize(\"NFKD\", text)\n",
341
+ " text = re.sub(r\"\\s+\", \" \", text).strip()\n",
342
+ " text = re.sub(r\"\\n{2,}\", \"\\n\", text)\n",
343
+ " return text\n",
344
+ "\n",
345
+ "def clean_and_upload_texts():\n",
346
+ " \"\"\"\n",
347
+ " Download raw texts from GCS, clean them, and upload cleaned versions back to GCS.\n",
348
+ "\n",
349
+ " This function processes all texts in both the uploaded and downloaded folders:\n",
350
+ " 1. For each text file, downloads it from GCS\n",
351
+ " 2. Cleans the text using rigorous_clean_text()\n",
352
+ " 3. Uploads the cleaned version back to GCS in the cleaned-texts folder\n",
353
+ "\n",
354
+ " This step ensures that all texts are properly formatted before embedding generation.\n",
355
+ " \"\"\"\n",
356
+ " raw_texts_folders = [RAW_TEXTS_DOWNLOADED_PATH_GCS, RAW_TEXTS_UPLOADED_PATH_GCS] # Process both folders\n",
357
+ " total_files = 0 # Counter to track number of processed files\n",
358
+ "\n",
359
+ " for raw_texts_folder in raw_texts_folders:\n",
360
+ " # List all files in the current raw-texts folder\n",
361
+ " blobs = list(bucket.list_blobs(prefix=raw_texts_folder))\n",
362
+ " print(f\"πŸ” Found {len(blobs)} files in {raw_texts_folder}\")\n",
363
+ "\n",
364
+ " for blob in blobs:\n",
365
+ " if not blob.name.endswith(\".txt\"): # Skip non-text files\n",
366
+ " continue\n",
367
+ "\n",
368
+ " try:\n",
369
+ " # Download file\n",
370
+ " raw_text = blob.download_as_text().strip()\n",
371
+ " if not raw_text: # Skip empty files\n",
372
+ " print(f\"⚠️ Skipping empty file: {blob.name}\")\n",
373
+ " continue\n",
374
+ "\n",
375
+ " # Clean text\n",
376
+ " cleaned_text = rigorous_clean_text(raw_text)\n",
377
+ "\n",
378
+ " # Save cleaned text back to GCS\n",
379
+ " cleaned_blob_name = blob.name.replace(raw_texts_folder, CLEANED_TEXTS_PATH_GCS)\n",
380
+ " cleaned_blob = bucket.blob(cleaned_blob_name)\n",
381
+ " cleaned_blob.upload_from_string(cleaned_text, content_type=\"text/plain\")\n",
382
+ " print(f\"βœ… Cleaned & uploaded: {cleaned_blob_name} ({len(cleaned_text.split())} words, {len(cleaned_text)} characters)\")\n",
383
+ " total_files += 1\n",
384
+ " except Exception as e:\n",
385
+ " print(f\"❌ Error processing {blob.name}: {e}\")\n",
386
+ "\n",
387
+ " print(f\"πŸš€ Cleaning process completed! Total cleaned & uploaded files: {total_files}\")"
388
+ ],
389
+ "metadata": {
390
+ "id": "Vskwg984s25K"
391
+ },
392
+ "execution_count": null,
393
+ "outputs": []
394
+ },
395
+ {
396
+ "cell_type": "code",
397
+ "source": [
398
+ "# =============================================================================\n",
399
+ "# PART 3: GENERATE EMBEDDINGS AND INDEX\n",
400
+ "# =============================================================================\n",
401
+ "\n",
402
+ "def fetch_metadata_dict_from_gcs():\n",
403
+ " \"\"\"\n",
404
+ " Fetch metadata.jsonl from GCS and return as a dictionary.\n",
405
+ "\n",
406
+ " The dictionary is keyed by title for easy lookup during text processing.\n",
407
+ "\n",
408
+ " Returns:\n",
409
+ " dict: Dictionary mapping text titles to their metadata\n",
410
+ " \"\"\"\n",
411
+ " metadata_blob = bucket.blob(METADATA_PATH_GCS)\n",
412
+ " metadata_dict = {}\n",
413
+ "\n",
414
+ " if metadata_blob.exists():\n",
415
+ " metadata_content = metadata_blob.download_as_text()\n",
416
+ " for line in metadata_content.splitlines():\n",
417
+ " item = json.loads(line)\n",
418
+ " metadata_dict[item[\"Title\"]] = item # Keep space-based lookup\n",
419
+ " else:\n",
420
+ " print(\"❌ Metadata file not found in GCS\")\n",
421
+ "\n",
422
+ " return metadata_dict\n",
423
+ "\n",
424
+ "def chunk_text(text, chunk_size=500, overlap=50):\n",
425
+ " \"\"\"\n",
426
+ " Split text into smaller, overlapping chunks for better retrieval.\n",
427
+ "\n",
428
+ " Args:\n",
429
+ " text (str): The text to chunk\n",
430
+ " chunk_size (int): Maximum number of words per chunk\n",
431
+ " overlap (int): Number of words to overlap between chunks\n",
432
+ "\n",
433
+ " Returns:\n",
434
+ " list: List of text chunks\n",
435
+ " \"\"\"\n",
436
+ " words = text.split()\n",
437
+ " chunks = []\n",
438
+ " i = 0\n",
439
+ "\n",
440
+ " while i < len(words):\n",
441
+ " chunk = \" \".join(words[i:i + chunk_size])\n",
442
+ " chunks.append(chunk)\n",
443
+ " i += chunk_size - overlap\n",
444
+ "\n",
445
+ " return chunks\n",
446
+ "\n",
447
+ "def create_embeddings(text_chunks, batch_size=32):\n",
448
+ " \"\"\"\n",
449
+ " Generate embeddings for the given chunks of text using the specified embedding model.\n",
450
+ "\n",
451
+ " This function:\n",
452
+ " 1. Uses SentenceTransformer to load the embedding model\n",
453
+ " 2. Prefixes each chunk with \"passage:\" as required by the E5 model\n",
454
+ " 3. Processes chunks in batches to manage memory usage\n",
455
+ " 4. Normalizes embeddings for cosine similarity search\n",
456
+ "\n",
457
+ " Args:\n",
458
+ " text_chunks (list): List of text chunks to embed\n",
459
+ " batch_size (int): Number of chunks to process at once\n",
460
+ "\n",
461
+ " Returns:\n",
462
+ " numpy.ndarray: Matrix of embeddings, one per text chunk\n",
463
+ " \"\"\"\n",
464
+ " # Load the model with GPU optimization\n",
465
+ " model = SentenceTransformer(EMBEDDING_MODEL)\n",
466
+ " device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
467
+ " model = model.to(device)\n",
468
+ " print(f\"πŸš€ Using device for embeddings: {device}\")\n",
469
+ "\n",
470
+ " prefixed_chunks = [f\"passage: {text}\" for text in text_chunks]\n",
471
+ " all_embeddings = []\n",
472
+ "\n",
473
+ " for i in range(0, len(prefixed_chunks), batch_size):\n",
474
+ " batch = prefixed_chunks[i:i+batch_size]\n",
475
+ "\n",
476
+ " # Move batch to GPU (if available) for faster processing\n",
477
+ " with torch.no_grad():\n",
478
+ " batch_embeddings = model.encode(batch, convert_to_numpy=True, normalize_embeddings=True)\n",
479
+ "\n",
480
+ " all_embeddings.append(batch_embeddings)\n",
481
+ "\n",
482
+ " if (i + batch_size) % 100 == 0 or (i + batch_size) >= len(prefixed_chunks):\n",
483
+ " print(f\"πŸ“Œ Processed {i + min(batch_size, len(prefixed_chunks) - i)}/{len(prefixed_chunks)} documents\")\n",
484
+ "\n",
485
+ " return np.vstack(all_embeddings).astype(\"float32\")\n",
486
+ "\n",
487
+ "def process_cleaned_texts():\n",
488
+ " \"\"\"\n",
489
+ " Process cleaned texts to create embeddings, FAISS index, and text chunks with metadata.\n",
490
+ "\n",
491
+ " This function:\n",
492
+ " 1. Downloads all cleaned texts from GCS\n",
493
+ " 2. Chunks each text into smaller pieces\n",
494
+ " 3. Generates embeddings for each chunk\n",
495
+ " 4. Creates a FAISS index for similarity search\n",
496
+ " 5. Saves and uploads all processed data back to GCS\n",
497
+ "\n",
498
+ " This is the core processing step that prepares data for the RAG system.\n",
499
+ " \"\"\"\n",
500
+ " all_chunks = []\n",
501
+ " all_metadata = []\n",
502
+ " chunk_counter = 0\n",
503
+ "\n",
504
+ " metadata_dict = fetch_metadata_dict_from_gcs() # Load metadata\n",
505
+ "\n",
506
+ " # Optimized listing of blobs in cleaned-texts folder\n",
507
+ " blobs = list(storage_client.list_blobs(BUCKET_NAME_GCS, prefix=CLEANED_TEXTS_PATH_GCS))\n",
508
+ " print(f\"πŸ” Found {len(blobs)} files in {CLEANED_TEXTS_PATH_GCS}\")\n",
509
+ "\n",
510
+ " if not blobs:\n",
511
+ " print(f\"❌ No files found in {CLEANED_TEXTS_PATH_GCS}. Exiting.\")\n",
512
+ " return\n",
513
+ "\n",
514
+ " for blob in blobs:\n",
515
+ " file_name = blob.name.split(\"/\")[-1]\n",
516
+ " if not file_name or file_name.startswith(\".\"):\n",
517
+ " continue # Skip empty or hidden files\n",
518
+ "\n",
519
+ " # Convert filename back to space-based title for metadata lookup\n",
520
+ " book_name = file_name.replace(\"_\", \" \")\n",
521
+ " metadata = metadata_dict.get(book_name, {\"Author\": \"Unknown\", \"Publisher\": \"Unknown\"})\n",
522
+ " author = metadata.get(\"Author\", \"Unknown\")\n",
523
+ "\n",
524
+ " try:\n",
525
+ " # Download and read text\n",
526
+ " raw_text = blob.download_as_text().strip()\n",
527
+ "\n",
528
+ " # Skip empty or corrupt files\n",
529
+ " if not raw_text:\n",
530
+ " print(f\"❌ Skipping empty file: {file_name}\")\n",
531
+ " continue\n",
532
+ "\n",
533
+ " chunks = chunk_text(raw_text)\n",
534
+ " print(f\"βœ… Processed {book_name}: {len(chunks)} chunks\")\n",
535
+ "\n",
536
+ " for chunk in chunks:\n",
537
+ " all_chunks.append(chunk)\n",
538
+ " all_metadata.append((chunk_counter, book_name, author))\n",
539
+ " chunk_counter += 1\n",
540
+ " except Exception as e:\n",
541
+ " print(f\"❌ Error processing {file_name}: {e}\")\n",
542
+ "\n",
543
+ " # Ensure there are chunks before embedding generation\n",
544
+ " if not all_chunks:\n",
545
+ " print(\"❌ No chunks found. Skipping embedding generation.\")\n",
546
+ " return\n",
547
+ "\n",
548
+ " # Create embeddings with GPU acceleration\n",
549
+ " print(f\"πŸ“ Creating embeddings for {len(all_chunks)} total chunks...\")\n",
550
+ " all_embeddings = create_embeddings(all_chunks)\n",
551
+ "\n",
552
+ " # Build FAISS index\n",
553
+ " dimension = all_embeddings.shape[1]\n",
554
+ " index = faiss.IndexFlatIP(dimension)\n",
555
+ " index.add(all_embeddings)\n",
556
+ " print(f\"βœ… FAISS index built with {index.ntotal} vectors\")\n",
557
+ "\n",
558
+ " # Save & upload embeddings\n",
559
+ " np.save(LOCAL_EMBEDDINGS_FILE, all_embeddings) # Save locally first\n",
560
+ " embeddings_blob = bucket.blob(EMBEDDINGS_PATH_GCS)\n",
561
+ " embeddings_blob.upload_from_filename(LOCAL_EMBEDDINGS_FILE)\n",
562
+ " print(f\"βœ… Uploaded embeddings to GCS: {EMBEDDINGS_PATH_GCS}\")\n",
563
+ "\n",
564
+ " # Save & upload FAISS index\n",
565
+ " faiss.write_index(index, LOCAL_FAISS_INDEX_FILE)\n",
566
+ " index_blob = bucket.blob(INDICES_PATH_GCS)\n",
567
+ " index_blob.upload_from_filename(LOCAL_FAISS_INDEX_FILE)\n",
568
+ " print(f\"βœ… Uploaded FAISS index to GCS: {INDICES_PATH_GCS}\")\n",
569
+ "\n",
570
+ " # Save and upload text chunks with metadata\n",
571
+ " with open(LOCAL_TEXT_CHUNKS_FILE, \"w\", encoding=\"utf-8\") as f:\n",
572
+ " for i, (chunk_id, book_name, author) in enumerate(all_metadata):\n",
573
+ " f.write(f\"{i}\\t{book_name}\\t{author}\\t{all_chunks[i]}\\n\")\n",
574
+ "\n",
575
+ " chunks_blob = bucket.blob(CHUNKS_PATH_GCS)\n",
576
+ " chunks_blob.upload_from_filename(LOCAL_TEXT_CHUNKS_FILE)\n",
577
+ " print(f\"βœ… Uploaded text chunks to GCS: {CHUNKS_PATH_GCS}\")\n",
578
+ "\n",
579
+ " # Clean up temp files\n",
580
+ " os.remove(LOCAL_EMBEDDINGS_FILE)\n",
581
+ " os.remove(LOCAL_FAISS_INDEX_FILE)\n",
582
+ " os.remove(LOCAL_TEXT_CHUNKS_FILE)"
583
+ ],
584
+ "metadata": {
585
+ "id": "1Yul8p9JsN1e"
586
+ },
587
+ "execution_count": null,
588
+ "outputs": []
589
+ },
590
+ {
591
+ "cell_type": "code",
592
+ "source": [
593
+ "# =============================================================================\n",
594
+ "# PART 4: MAIN EXECUTION\n",
595
+ "# =============================================================================\n",
596
+ "\n",
597
+ "def run_pipeline():\n",
598
+ " \"\"\"\n",
599
+ " Run the complete end-to-end preprocessing pipeline.\n",
600
+ "\n",
601
+ " This function executes all steps in sequence:\n",
602
+ " 1. Upload files from local to Colab\n",
603
+ " 2. Upload raw texts and metadata to GCS\n",
604
+ " 3. Download texts from URLs specified in metadata\n",
605
+ " 4. Clean and process all texts\n",
606
+ " 5. Generate embeddings and build the FAISS index\n",
607
+ "\n",
608
+ " This is the main entry point for the preprocessing script.\n",
609
+ " \"\"\"\n",
610
+ " print(\"πŸš€ Starting pipeline execution...\")\n",
611
+ "\n",
612
+ " print(\"\\n==== STEP 1: Uploading files from local to Colab ====\")\n",
613
+ " upload_successful = upload_files_to_colab()\n",
614
+ "\n",
615
+ " if not upload_successful:\n",
616
+ " print(\"❌ Pipeline halted due to missing metadata file.\")\n",
617
+ " return\n",
618
+ "\n",
619
+ " print(\"\\n==== STEP 2: Uploading raw texts and metadata to GCS ====\")\n",
620
+ " upload_files_to_gcs()\n",
621
+ "\n",
622
+ " print(\"\\n==== STEP 3: Downloading texts from URLs ====\")\n",
623
+ " download_text_files()\n",
624
+ "\n",
625
+ " print(\"\\n==== STEP 4: Cleaning and processing texts ====\")\n",
626
+ " clean_and_upload_texts()\n",
627
+ "\n",
628
+ " print(\"\\n==== STEP 5: Generating embeddings and building index ====\")\n",
629
+ " process_cleaned_texts()\n",
630
+ "\n",
631
+ " print(\"\\nβœ… Pipeline execution completed successfully!\")\n",
632
+ "\n",
633
+ "# Execute the complete pipeline\n",
634
+ "if __name__ == \"__main__\":\n",
635
+ " run_pipeline()"
636
+ ],
637
+ "metadata": {
638
+ "id": "XXB_eYvj-I0i"
639
+ },
640
+ "execution_count": null,
641
+ "outputs": []
642
+ }
643
+ ]
644
+ }