Data Handling Explanation
This document explains how data is processed, stored, and handled in Anveshak: Spirituality Q&A, with special attention to ethical considerations and copyright respect.
Data Sources
Text Corpus Overview
Anveshak: Spirituality Q&A uses approximately 133 digitized spiritual texts sourced from freely available resources. These texts include:
- Ancient sacred literature (Vedas, Upanishads, Puranas, Sutras, Dharmaśāstras, and Agamas)
- Classical Indian texts (The Bhagavad Gita, The Śrīmad Bhāgavatam, and others)
- Indian historical texts (The Mahabharata and The Ramayana)
- Teachings of revered Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters of all genders, backgrounds, traditions, and walks of life
As stated in app.py:
"Anveshak draws from a rich tapestry of spiritual wisdom found in classical Indian texts, philosophical treatises, and the teachings of revered Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters across centuries. The knowledge presented here spans multiple traditions, schools of thought, and spiritual lineages that have flourished in the Indian subcontinent and beyond."
Ethical Sourcing
All texts included in Anveshak meet the following criteria:
- Public availability: All texts were freely available from sources like archive.org
- Educational use: Texts are used solely for educational purposes
- Proper attribution: All sources are credited with author and publisher information
- Respect for copyright: Word limits, citations, and other safeguards keep excerpts within fair use
As mentioned in app.py:
"Note that the sources consist of about 133 digitized texts, all of which were freely available over the internet (on sites like archive.org). Many of the texts are English translations of original (and in some cases, ancient) sacred and spiritual texts. All of the copyrights belong to the respective authors and publishers and we bow down in gratitude to their selfless work. Anveshak merely re-presents the ocean of spiritual knowledge and wisdom contained in the original works with relevant citations in a limited number of words."
Data Processing Pipeline
1. Data Collection
The data collection process involves two methods as implemented in preprocessing.py:
Manual Upload
Texts are manually uploaded to Google Cloud Storage (GCS) through a preprocessing script:
def upload_files_to_colab():
    """Upload raw text files and metadata from local machine to Colab."""
    # First, upload text files
    print("Step 1: Please upload your text files...")
    uploaded_text_files = files.upload()  # This will prompt the user to upload files

    # Create directory structure if it doesn't exist
    os.makedirs(LOCAL_RAW_TEXTS_FOLDER, exist_ok=True)

    # Move uploaded text files to the raw-texts folder
    for filename, content in uploaded_text_files.items():
        if filename.endswith(".txt"):
            with open(os.path.join(LOCAL_RAW_TEXTS_FOLDER, filename), "wb") as f:
                f.write(content)
            print(f"✅ Saved {filename} to {LOCAL_RAW_TEXTS_FOLDER}")
Web Downloading
Some texts are automatically downloaded from URLs listed in the metadata file:
def download_text_files():
    """Fetch metadata, filter unuploaded files, and download text files."""
    metadata = fetch_metadata_from_gcs()
    # Filter entries where Uploaded is False
    files_to_download = [item for item in metadata if item["Uploaded"] == False]

    # Process only necessary files
    for item in files_to_download:
        name, author, url = item["Title"], item["Author"], item["URL"]
        if url.lower() == "not available":
            print(f"❌ Skipping {name} - No URL available.")
            continue
        try:
            response = requests.get(url)
            if response.status_code == 200:
                raw_text = response.text
                filename = "{}.txt".format(name.replace(" ", "_"))
                # Save to local first
                local_path = f"/tmp/{filename}"
                with open(local_path, "w", encoding="utf-8") as file:
                    file.write(raw_text)
                # Upload to GCS
                gcs_path = f"{RAW_TEXTS_DOWNLOADED_PATH_GCS}{filename}"
                upload_to_gcs(local_path, gcs_path)
                print(f"✅ Downloaded & uploaded: {filename} ({len(raw_text.split())} words)")
            else:
                print(f"❌ Failed to download {name}: {url} (Status {response.status_code})")
        except Exception as e:
            print(f"❌ Error processing {name}: {e}")
2. Text Cleaning
Raw texts often contain HTML tags, OCR errors, and formatting issues. The cleaning process removes these artifacts using the exact implementation from preprocessing.py:
def rigorous_clean_text(text):
    """
    Clean text by removing metadata, junk text, and formatting issues.

    This function:
    1. Removes HTML tags using BeautifulSoup
    2. Removes URLs and standalone numbers
    3. Removes all-caps OCR noise words
    4. Deduplicates adjacent identical lines
    5. Normalizes Unicode characters
    6. Standardizes whitespace and newlines

    Args:
        text (str): The raw text to clean

    Returns:
        str: The cleaned text
    """
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r"https?:\/\/\S+", "", text)  # Remove links
    text = re.sub(r"\b\d+\b", "", text)  # Remove standalone numbers
    text = re.sub(r"\b[A-Z]{5,}\b", "", text)  # Remove all-caps OCR noise words
    lines = text.split("\n")
    cleaned_lines = []
    last_line = None
    for line in lines:
        line = line.strip()
        if line and line != last_line:
            cleaned_lines.append(line)
            last_line = line
    text = "\n".join(cleaned_lines)
    text = unicodedata.normalize("NFKD", text)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\n{2,}", "\n", text)
    return text
The cleaning process:
- Removes HTML tags using BeautifulSoup
- Eliminates URLs and standalone numbers
- Removes all-caps OCR noise words (common in digitized texts)
- Deduplicates adjacent identical lines
- Normalizes Unicode characters
- Standardizes whitespace and newlines
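To make these steps concrete, here is a small illustrative call to rigorous_clean_text (the sample string below is invented for demonstration; it is not from any of the source texts):

sample = (
    "<p>Chapter One</p>\n"
    "Scanned from https://archive.org/details/example\n"
    "OCRNOISE OCRNOISE\n"
    "The Self is beyond the mind.\n"
    "The Self is beyond the mind.\n"
)
print(rigorous_clean_text(sample))
# The HTML tag, the URL, the all-caps OCR noise, and the duplicated line are removed,
# and the remaining text is collapsed onto normalized whitespace.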
3. Text Chunking
Clean texts are split into smaller, manageable chunks for processing using the exact implementation from preprocessing.py:
def chunk_text(text, chunk_size=500, overlap=50):
    """
    Split text into smaller, overlapping chunks for better retrieval.

    Args:
        text (str): The text to chunk
        chunk_size (int): Maximum number of words per chunk
        overlap (int): Number of words to overlap between chunks

    Returns:
        list: List of text chunks
    """
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap
    return chunks
Chunking characteristics:
- Chunk size: 500 words per chunk, balancing context and retrieval precision
- Overlap: 50-word overlap between chunks to maintain context across chunk boundaries
- Context preservation: The 50-word overlap reduces the chance that a passage is split mid-thought at a chunk boundary (see the small worked example below)
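As a quick sanity check of those numbers, the following sketch (with an invented 1,200-word input) shows how chunk_text produces overlapping windows:

sample = " ".join(f"word{i}" for i in range(1200))  # a hypothetical 1,200-word document
chunks = chunk_text(sample, chunk_size=500, overlap=50)

print(len(chunks))           # 3 chunks: words 0-499, 450-949, 900-1199
print(chunks[1].split()[0])  # "word450" - the second chunk starts 50 words before the first one ends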
4. Text Embedding
Chunks are converted to vector embeddings using the E5-large-v2 model with the actual implementation from preprocessing.py:
def create_embeddings(text_chunks, batch_size=32):
    """
    Generate embeddings for the given chunks of text using the specified embedding model.

    This function:
    1. Uses SentenceTransformer to load the embedding model
    2. Prefixes each chunk with "passage:" as required by the E5 model
    3. Processes chunks in batches to manage memory usage
    4. Normalizes embeddings for cosine similarity search

    Args:
        text_chunks (list): List of text chunks to embed
        batch_size (int): Number of chunks to process at once

    Returns:
        numpy.ndarray: Matrix of embeddings, one per text chunk
    """
    # Load the model with GPU optimization
    model = SentenceTransformer(EMBEDDING_MODEL)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    print(f"Using device for embeddings: {device}")

    prefixed_chunks = [f"passage: {text}" for text in text_chunks]
    all_embeddings = []
    for i in range(0, len(prefixed_chunks), batch_size):
        batch = prefixed_chunks[i:i+batch_size]
        # Move batch to GPU (if available) for faster processing
        with torch.no_grad():
            batch_embeddings = model.encode(batch, convert_to_numpy=True, normalize_embeddings=True)
        all_embeddings.append(batch_embeddings)
        if (i + batch_size) % 100 == 0 or (i + batch_size) >= len(prefixed_chunks):
            print(f"Processed {i + min(batch_size, len(prefixed_chunks) - i)}/{len(prefixed_chunks)} documents")
    return np.vstack(all_embeddings).astype("float32")
Embedding process details:
- Model: E5-large-v2, a state-of-the-art embedding model for retrieval tasks
- Prefix: "passage:" prefix is added to each chunk for optimal embedding
- Batching: Processing in batches of 32 for memory efficiency
- Normalization: Embeddings are normalized for cosine similarity search
- Output: Each text chunk becomes a 1024-dimensional vector
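A minimal usage sketch (variable names are illustrative, not taken from preprocessing.py) showing how the embedding matrix can be persisted as the all_embeddings.npy artifact referenced in the storage layout later in this document:

import numpy as np

# "chunks" would be the list of text chunks produced in the chunking step above
all_embeddings = create_embeddings(chunks, batch_size=32)
print(all_embeddings.shape)  # expected: (number_of_chunks, 1024) for E5-large-v2
np.save("all_embeddings.npy", all_embeddings)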
5. FAISS Index Creation
Embeddings are stored in a Facebook AI Similarity Search (FAISS) index for efficient similarity search:
# Build FAISS index
dimension = all_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(all_embeddings)
FAISS index characteristics:
- Index type: IndexFlatIP (Inner Product) for cosine similarity search
- Exact search: Uses exact search rather than approximate for maximum accuracy
- Dimension: 1024-dimensional vectors from the E5-large-v2 model
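The index built above can then be persisted and queried. The sketch below is illustrative (the file name matches the GCS layout described later; the random vector merely stands in for a real query embedding):

import faiss
import numpy as np

# Persist the exact-search index to disk
faiss.write_index(index, "faiss_index.faiss")

# Later: reload the index and search with a normalized query vector
index = faiss.read_index("faiss_index.faiss")
query_vec = np.random.rand(1, index.d).astype("float32")  # stand-in for a real E5 query embedding
faiss.normalize_L2(query_vec)  # with normalized vectors, inner product equals cosine similarity
scores, ids = index.search(query_vec, 5)  # similarity scores and chunk IDs of the top-5 matches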
6. Metadata Management
The system maintains metadata for each text to provide proper citations, using the implementation from rag_engine.py:
def fetch_metadata_from_gcs():
    """
    Fetch metadata.jsonl from GCS and return as a list of dictionaries.

    Each dictionary represents a text entry with metadata like title, author, etc.

    Returns:
        list: List of dictionaries containing metadata for each text
    """
    blob = bucket.blob(METADATA_PATH_GCS)
    # Download metadata file
    metadata_jsonl = blob.download_as_text()
    # Parse JSONL
    metadata = [json.loads(line) for line in metadata_jsonl.splitlines()]
    return metadata
Metadata structure (JSONL format):
{"Title": "Bhagavad Gita", "Author": "Vyasa", "Publisher": "Gita Press, Gorakhpur, India", "URL": "https://archive.org/details/bhagavad-gita", "Uploaded": true}
{"Title": "Yoga Sutras", "Author": "Patanjali", "Publisher": "DIVINE LIFE SOCIETY", "URL": "https://archive.org/details/yoga-sutras", "Uploaded": true}
Data Storage Architecture
Google Cloud Storage Structure
Anveshak: Spirituality Q&A uses Google Cloud Storage (GCS) as its primary data store, organized as follows:
bucket_name/
├── metadata/
│   └── metadata.jsonl           # Metadata for all texts
├── raw-texts/
│   ├── uploaded/                # Manually uploaded texts
│   └── downloaded/              # Automatically downloaded texts
├── cleaned-texts/               # Cleaned versions of all texts
└── processed/
    ├── embeddings/
    │   └── all_embeddings.npy   # Numpy array of embeddings
    ├── indices/
    │   └── faiss_index.faiss    # FAISS index file
    └── chunks/
        └── text_chunks.txt      # Text chunks with metadata
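For reference, this layout can be inspected with the standard google-cloud-storage client. The snippet below is a hedged sketch: "your-bucket-name" is a placeholder, and the real bucket name is supplied through configuration/secrets.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")  # placeholder bucket name

# List everything under processed/ (embeddings, indices, and chunks)
for blob in bucket.list_blobs(prefix="processed/"):
    print(blob.name)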
Local Caching
For deployment on Hugging Face Spaces, essential files are downloaded to local storage using the implementation from rag_engine.py:
# Local Paths
local_embeddings_file = "all_embeddings.npy"
local_faiss_index_file = "faiss_index.faiss"
local_text_chunks_file = "text_chunks.txt"
local_metadata_file = "metadata.jsonl"
These files are loaded with caching to improve performance, using the actual implementation from rag_engine.py:
@st.cache_resource(show_spinner=False)
def cached_load_data_files():
    """
    Cached version of load_data_files() for FAISS index, text chunks, and metadata.

    This function loads:
    - FAISS index for vector similarity search
    - Text chunks containing the original spiritual text passages
    - Metadata dictionary with publication and author information

    All files are downloaded from Google Cloud Storage if not already present locally.

    Returns:
        tuple: (faiss_index, text_chunks, metadata_dict) or (None, None, None) if loading fails
    """
    # Initialize GCP and OpenAI clients
    bucket = setup_gcp_client()
    openai_initialized = setup_openai_client()
    if not bucket or not openai_initialized:
        print("Failed to initialize required services")
        return None, None, None

    # Get GCS paths from secrets - required
    try:
        metadata_file_gcs = st.secrets["METADATA_PATH_GCS"]
        embeddings_file_gcs = st.secrets["EMBEDDINGS_PATH_GCS"]
        faiss_index_file_gcs = st.secrets["INDICES_PATH_GCS"]
        text_chunks_file_gcs = st.secrets["CHUNKS_PATH_GCS"]
    except KeyError as e:
        print(f"❌ Error: Required GCS path not found in secrets: {e}")
        return None, None, None

    # Download necessary files if not already present locally
    success = True
    success &= download_file_from_gcs(bucket, faiss_index_file_gcs, local_faiss_index_file)
    success &= download_file_from_gcs(bucket, text_chunks_file_gcs, local_text_chunks_file)
    success &= download_file_from_gcs(bucket, metadata_file_gcs, local_metadata_file)
    if not success:
        print("Failed to download required files")
        return None, None, None

    # Load FAISS index, text chunks, and metadata
    try:
        faiss_index = faiss.read_index(local_faiss_index_file)
    except Exception as e:
        print(f"❌ Error loading FAISS index: {str(e)}")
        return None, None, None

    # Load text chunks
    try:
        text_chunks = {}  # Mapping: ID -> (Title, Author, Text)
        with open(local_text_chunks_file, "r", encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split("\t")
                if len(parts) == 4:
                    text_chunks[int(parts[0])] = (parts[1], parts[2], parts[3])
    except Exception as e:
        print(f"❌ Error loading text chunks: {str(e)}")
        return None, None, None

    # Load metadata
    try:
        metadata_dict = {}
        with open(local_metadata_file, "r", encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)
                metadata_dict[item["Title"]] = item
    except Exception as e:
        print(f"❌ Error loading metadata: {str(e)}")
        return None, None, None

    print(f"✅ Data loaded successfully (cached): {len(text_chunks)} passages available")
    return faiss_index, text_chunks, metadata_dict
Data Access During Query Processing
Query Embedding
User queries are embedded using the same model as the text corpus, with the actual implementation from rag_engine.py:
def get_embedding(text):
    """
    Generate embeddings for a text query using the cached model.

    Uses an in-memory cache to avoid redundant embedding generation for repeated queries.
    Properly prefixes inputs with "query:" or "passage:" as required by the E5 model.

    Args:
        text (str): The query text to embed

    Returns:
        numpy.ndarray: The embedding vector or a zero vector if embedding fails
    """
    if text in query_embedding_cache:
        return query_embedding_cache[text]
    try:
        tokenizer, model = cached_load_model()
        if model is None:
            print("Model is None, returning zero embedding")
            return np.zeros((1, 384), dtype=np.float32)
        # Format input based on text length
        # For E5 models, "query:" prefix is for questions, "passage:" for documents
        input_text = f"query: {text}" if len(text) < 512 else f"passage: {text}"
        inputs = tokenizer(
            input_text,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=512,
            return_attention_mask=True
        )
        with torch.no_grad():
            outputs = model(**inputs)
            embeddings = average_pool(outputs.last_hidden_state, inputs['attention_mask'])
            embeddings = nn.functional.normalize(embeddings, p=2, dim=1)
            embeddings = embeddings.detach().cpu().numpy()
        del outputs, inputs
        gc.collect()
        query_embedding_cache[text] = embeddings
        return embeddings
    except Exception as e:
        print(f"❌ Embedding error: {str(e)}")
        return np.zeros((1, 384), dtype=np.float32)
Note the use of:
- Query prefix: "query:" is added to distinguish query embeddings from passage embeddings
- Truncation: Queries are truncated to 512 tokens if necessary
- Memory management: Tensors are detached and moved to CPU after computation
- Caching: Query embeddings are cached to avoid redundant computation
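The average_pool helper called inside get_embedding() is not reproduced in this document. The standard E5 mean-pooling recipe looks like the sketch below; this is an assumption about what the helper does, not code copied from rag_engine.py:

import torch
from torch import Tensor

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out the hidden states of padding tokens, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]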
Passage Retrieval
The system retrieves relevant passages based on query embedding similarity using the implementation from rag_engine.py:
def retrieve_passages(query, faiss_index, text_chunks, metadata_dict, top_k=5, similarity_threshold=0.5):
    """
    Retrieve the most relevant passages for a given spiritual query.

    This function:
    1. Embeds the user query using the same model used for text chunks
    2. Finds similar passages using the FAISS index with cosine similarity
    3. Filters results based on similarity threshold to ensure relevance
    4. Enriches results with metadata (title, author, publisher)
    5. Ensures passage diversity by including only one passage per source title

    Args:
        query (str): The user's spiritual question
        faiss_index: FAISS index containing passage embeddings
        text_chunks (dict): Dictionary mapping IDs to text chunks and metadata
        metadata_dict (dict): Dictionary containing publication information
        top_k (int): Maximum number of passages to retrieve
        similarity_threshold (float): Minimum similarity score (0.0-1.0) for retrieved passages

    Returns:
        tuple: (retrieved_passages, retrieved_sources) containing the text and source information
    """
    try:
        print(f"\nRetrieving passages for query: {query}")
        query_embedding = get_embedding(query)
        distances, indices = faiss_index.search(query_embedding, top_k * 2)
        print(f"Found {len(distances[0])} potential matches")
        retrieved_passages = []
        retrieved_sources = []
        cited_titles = set()
        for dist, idx in zip(distances[0], indices[0]):
            print(f"Distance: {dist:.4f}, Index: {idx}")
            if idx in text_chunks and dist >= similarity_threshold:
                title_with_txt, author, text = text_chunks[idx]
                clean_title = title_with_txt.replace(".txt", "") if title_with_txt.endswith(".txt") else title_with_txt
                clean_title = unicodedata.normalize("NFC", clean_title)
                if clean_title in cited_titles:
                    continue
                metadata_entry = metadata_dict.get(clean_title, {})
                author = metadata_entry.get("Author", "Unknown")
                publisher = metadata_entry.get("Publisher", "Unknown")
                cited_titles.add(clean_title)
                retrieved_passages.append(text)
                retrieved_sources.append((clean_title, author, publisher))
                if len(retrieved_passages) == top_k:
                    break
        print(f"Retrieved {len(retrieved_passages)} passages")
        return retrieved_passages, retrieved_sources
    except Exception as e:
        print(f"❌ Error in retrieve_passages: {str(e)}")
        return [], []
Important aspects:
- Similarity threshold: Passages must have a similarity score >= 0.5 to be included
- Diversity: Only one passage per source title is included in the results
- Metadata enrichment: Publisher information is added from the metadata
- Configurable retrieval: The top_k parameter lets users adjust how many sources are used (see the usage sketch below)
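Putting the pieces together, a hedged end-to-end usage sketch (the question string is only an example):

# Load the cached artifacts, then retrieve passages for a single question
faiss_index, text_chunks, metadata_dict = cached_load_data_files()
passages, sources = retrieve_passages(
    "What is the nature of the Self?",
    faiss_index, text_chunks, metadata_dict,
    top_k=5, similarity_threshold=0.5,
)
for (title, author, publisher), passage in zip(sources, passages):
    print(f"{title} ({author}): {passage[:80]}...")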
User Data Privacy
No Data Collection
Anveshak is designed to respect user privacy by not collecting or storing any user data:
- No Query Storage: User questions are processed in memory and not saved
- No User Identification: No user accounts or identification is required
- No Analytics: No usage tracking or analytics are implemented
- No Cookies: No browser cookies are used to track users
As stated in app.py:
"We do not save any user data or queries. However, user questions are processed using OpenAI's LLM service to generate responses. While we do not store this information, please be aware that interactions are processed through OpenAI's platform and are subject to their privacy policies and data handling practices."
This privacy-first approach ensures that users can freely explore spiritual questions without concerns about their queries being stored or analyzed.
Copyright and Ethical Considerations
Word Limit Implementation
To respect copyright and ensure fair use, answers are limited to a configurable word count using the actual implementation from rag_engine.py:
def answer_with_llm(query, context=None, word_limit=100):
    # ... LLM processing ...

    # Extract and format the answer
    answer = response.choices[0].message.content.strip()
    words = answer.split()
    if len(words) > word_limit:
        answer = " ".join(words[:word_limit])
        if not answer.endswith((".", "!", "?")):
            answer += "."
    return answer
Users can adjust the word limit from 50 to 500 words, ensuring that responses are:
- Short enough to respect copyright
- Long enough to provide meaningful information
- Always properly cited to the original source
Citation Format
Every answer includes citations to the original sources using the implementation from rag_engine.py:
def format_citations(sources):
    """
    Format citations for display to the user.

    Creates properly formatted citations for each source used in generating the answer.
    Each citation appears on a new line with consistent formatting.

    Args:
        sources (list): List of (title, author, publisher) tuples

    Returns:
        str: Formatted citations as a string with each citation on a new line
    """
    formatted_citations = []
    for title, author, publisher in sources:
        if publisher.endswith(('.', '!', '?')):
            formatted_citations.append(f"📚 {title} by {author}, Published by {publisher}")
        else:
            formatted_citations.append(f"📚 {title} by {author}, Published by {publisher}.")
    return "\n".join(formatted_citations)
Citations include:
- Book/text title
- Author name
- Publisher information
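For illustration, calling format_citations() with the two sample metadata entries shown earlier would produce output along these lines:

print(format_citations([
    ("Bhagavad Gita", "Vyasa", "Gita Press, Gorakhpur, India"),
    ("Yoga Sutras", "Patanjali", "DIVINE LIFE SOCIETY"),
]))
# 📚 Bhagavad Gita by Vyasa, Published by Gita Press, Gorakhpur, India.
# 📚 Yoga Sutras by Patanjali, Published by DIVINE LIFE SOCIETY.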
Acknowledgment of Sources
Anveshak: Spirituality Q&A includes dedicated pages for acknowledging:
- Publishers of the original texts
- Saints, Sages, and Spiritual Masters whose teachings are referenced
- The origins and traditions of the spiritual texts
A thank-you note is also prominently featured on the main page, as shown in app.py:
st.markdown('<div class="acknowledgment-header">A Heartfelt Thank You</div>', unsafe_allow_html=True)
st.markdown("""
It is believed that one cannot be in a spiritual path without the will of the Lord. One need not be a believer or a non-believer, merely proceeding to thoughtlessness and observation is enough to evolve and shape perspectives. But that happens through grace. It is believed that without the will of the Lord, one cannot be blessed by real Saints, and without the will of the Saints, one cannot get close to them or God.
Therefore, with deepest reverence, we express our gratitude to:
**The Saints, Sages, Siddhas, Yogis, Sadhus, Rishis, Gurus, Mystics, and Spiritual Masters** of all genders, backgrounds, traditions, and walks of life whose timeless wisdom illuminates Anveshak. From ancient Sages to modern Masters, their selfless dedication to uplift humanity through selfless love and spiritual knowledge continues to guide seekers on the path.
# ...
""")
Inclusive Recognition
Anveshak explicitly acknowledges and honors spiritual teachers from all backgrounds:
- All references to spiritual figures are capitalized (Saints, Sages, etc.)
- The application includes language acknowledging Masters of "all genders, backgrounds, traditions, and walks of life"
- The selection of texts aims to represent diverse spiritual traditions
From the Sources.py file:
"Additionally, there are and there have been many other great Saints, enlightened beings, Sadhus, Sages, and Gurus who have worked tirelessly to uplift humanity and guide beings to their true SELF and path, of whom little is known and documented. We thank them and acknowledge their contribution to the world."
Data Replication and Backup
GCS as Primary Storage
Google Cloud Storage serves as both the primary storage and backup system:
- All preprocessed data is stored in GCS buckets
- GCS provides built-in redundancy and backup capabilities
- Data is loaded from GCS at application startup
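The setup_gcp_client() call used in cached_load_data_files() is not reproduced in this document. Below is a minimal sketch, assuming the service-account JSON and bucket name are stored in Streamlit secrets; the secret key names are placeholders, not confirmed from rag_engine.py:

import json
import streamlit as st
from google.cloud import storage
from google.oauth2 import service_account

def setup_gcp_client():
    try:
        # Placeholder secret names - the real keys may differ in rag_engine.py
        credentials = service_account.Credentials.from_service_account_info(
            json.loads(st.secrets["GCP_SERVICE_ACCOUNT"])
        )
        client = storage.Client(credentials=credentials)
        return client.bucket(st.secrets["BUCKET_NAME_GCS"])
    except Exception as e:
        print(f"❌ Error setting up GCS client: {e}")
        return None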
Local Caching
For performance, Anveshak caches data locally using the implementation from rag_engine.py:
def download_file_from_gcs(bucket, gcs_path, local_path):
    """
    Download a file from GCS to local storage if not already present.

    Only downloads if the file isn't already present locally, avoiding redundant downloads.

    Args:
        bucket: GCS bucket object
        gcs_path (str): Path to the file in GCS
        local_path (str): Local path where the file should be saved

    Returns:
        bool: True if download was successful or file already exists, False otherwise
    """
    try:
        if os.path.exists(local_path):
            print(f"File already exists locally: {local_path}")
            return True
        blob = bucket.blob(gcs_path)
        blob.download_to_filename(local_path)
        print(f"✅ Downloaded {gcs_path} → {local_path}")
        return True
    except Exception as e:
        print(f"❌ Error downloading {gcs_path}: {str(e)}")
        return False
This approach:
- Avoids redundant downloads
- Preserves data across application restarts
- Reduces API calls to GCS
Conclusion
Anveshak: Spirituality Q&A implements a comprehensive data handling strategy that:
- Respects Copyright: Through word limits, citations, and acknowledgments
- Preserves Source Integrity: By maintaining accurate metadata and citations
- Optimizes Performance: Through efficient storage, retrieval, and caching
- Ensures Ethical Use: By focusing on educational purposes and proper attribution
- Protects Privacy: By not collecting or storing user data
- Honors Diversity: By acknowledging spiritual teachers of all backgrounds and traditions
This balance between technical efficiency and ethical responsibility allows Anveshak to serve as a bridge to spiritual knowledge while respecting the original sources, traditions, and user privacy. The system is designed not to replace personal spiritual inquiry but to supplement it by making traditional wisdom more accessible.
As stated in the conclusion of the blog post:
"The core philosophy guiding this project is that while technology can facilitate access to spiritual knowledge, the journey to self-discovery remains deeply personal. As Anveshak states: 'The path and journey to the SELF is designed to be undertaken alone. The all-encompassing knowledge is internal and not external.'"