import streamlit as st
import requests
import json
import time
import random
import pickle
import os
import torch
import io
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI
from langchain.document_loaders import PyPDFLoader, TextLoader, OnlinePDFLoader
from langchain.text_splitter import CharacterTextSplitter
st.set_page_config(
    page_title="Luminary-AI QnA",
    page_icon="https://endlessicons.com/wp-content/uploads/2012/12/fountain-pen-icon-614x460.png",
    layout="wide",
    initial_sidebar_state="expanded",
)


def p_title(title):
    st.markdown(f'<h2 style="text-align: left; color:#F63366; font-size:28px;">{title}</h2>', unsafe_allow_html=True)

#########
# SIDEBAR
#########
st.sidebar.header('I would like to')
nav = st.sidebar.radio('', ['Go to homepage', 'QnA over Custom Docs', 'QnA over Luminary-AI Docs'])
st.sidebar.write('')
st.sidebar.write('')
st.sidebar.write('')
st.sidebar.write('')
st.sidebar.write('')

def faiss_loader():
    # Load the prebuilt FAISS index of Luminary-AI docs with HuggingFace embeddings
    hf = HuggingFaceEmbeddings()
    new_db = FAISS.load_local("faiss_index", hf)
    return new_db


if not os.path.exists("./tempfolder"):
    os.makedirs("./tempfolder")


def save_uploadedfile(uploadedfile):
    # Persist the uploaded file to ./tempfolder so the PDF loader can read it
    with open(os.path.join("tempfolder", uploadedfile.name), "wb") as f:
        f.write(uploadedfile.getbuffer())
    return st.sidebar.success("Saved File")

def custom_vector_db(file_contents):
    # Build a FAISS vector store from the uploaded PDF (local path or URL),
    # split into ~1000-character chunks and embedded with HuggingFace embeddings
    loader = OnlinePDFLoader(file_contents)
    docs = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(docs)
    embeddings = HuggingFaceEmbeddings()
    db = FAISS.from_documents(texts, embeddings)
    db.save_local("custom_db")
    return db

def get_results_from_transformer(context, question):
    # Query the hosted Alpaca-LoRA Gradio Space; the payload order is
    # (instruction, input, temperature, top_p, top_k, beams, max_tokens).
    response = requests.post(
        "https://tloen-alpaca-lora.hf.space/run/predict",
        json={
            "data": [
                f"Answer the question: {question}",
                context[:100],
                0.1,
                0.75,
                40,
                4,
                100,
            ]
        },
    ).json()
    print(response)
    # Gradio returns {"data": [generated_text, ...]}; return the generated text
    return response['data'][0]

def get_results_from_longformer(text):
    # Fallback summarizer; despite the name, it currently uses a t5-base pipeline
    from transformers import pipeline
    summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")
    summary = summarizer(text, max_length=150, min_length=40, do_sample=True, top_k=10)[0]['summary_text']
    return summary


#########
# PAGES
#########
if nav == 'Go to homepage':
    st.title("Nishant T QnA Project")
    st.write(
        """
Longformer is a transformer-based model introduced by AllenAI that is designed to handle long-range dependencies in text. It extends the standard transformer architecture by incorporating a mechanism to capture information from distant positions, making it well-suited for tasks involving long documents or sequences.
        """
    )
    st.write(
        """
Key ideas behind Longformer (an illustrative usage sketch follows below):

- **Self-attention with global attention:** In standard transformers, self-attention operates over the entire input sequence, resulting in quadratic complexity with respect to the sequence length. Longformer introduces a "global attention" mechanism that lets the model focus on relevant information while attending to only a subset of the input tokens, significantly reducing the computational cost.
- **Sliding window approach:** Longformer divides the input sequence into overlapping chunks or "windows." Each window captures information from a local context while also attending to the global context. By processing the sequence in a sliding-window manner, Longformer can capture long-range dependencies more effectively.
- **Long-range attention:** In addition to the global attention mechanism, Longformer employs a "long-range attention" mechanism that lets the model attend to tokens that are far apart, even beyond the window size, by introducing special "global attention tokens" that can attend to distant positions in the sequence.
- **Position embeddings:** Longformer combines absolute position embeddings, which encode the absolute positions of tokens in the input sequence, with relative position embeddings, which capture the relative distances between tokens and help the model generalize to longer sequences.
- **Pretraining and fine-tuning:** Longformer can be pretrained on large-scale datasets with methods such as masked language modeling (MLM) and then fine-tuned on downstream tasks such as text classification or question answering.
        """
    )
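    # A hedged, minimal sketch of calling Longformer with a global attention mask via
    # Hugging Face transformers. The checkpoint and mask choice are illustrative
    # assumptions shown for display only; this snippet is not part of the QnA pipeline.
    st.write("For illustration, a minimal sketch of running Longformer with a global attention mask (checkpoint and settings are examples):")
    st.code(
        '''
from transformers import LongformerTokenizer, LongformerModel
import torch

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_text = "A very long document ... " * 500
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)

# Sliding-window (local) attention everywhere; global attention on the first token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
''',
        language="python",
    )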
    st.write(
        """
While Longformer offers significant improvements in handling long-range dependencies, it also has certain limitations and potential pitfalls:

- **Increased memory requirements:** Longformer needs more memory than standard transformers because of the overlapping windows and global attention mechanisms, and the requirement grows with sequence length.
- **Increased training time:** The sliding-window approach and global attention add computation, so training Longformer models takes longer than training standard transformers.
- **Limited contextual understanding:** Although Longformer captures information from a larger context, the global attention mechanism has a fixed window size and is constrained by computational limits, which may restrict the model's ability to capture extremely distant relationships.
- **Loss of local context:** Each window attends to a limited local context. Global attention helps incorporate information from other windows, but important local context can still be missed, especially when information is spread across multiple windows.
- **Sensitivity to window size:** A smaller window size may lose long-range dependencies, while a larger window size increases memory requirements and computational complexity.
- **Fine-tuning challenges:** Fine-tuning can be difficult, especially with large window sizes or on tasks that require precise local information; balancing global and local context and adjusting hyperparameters may require extensive experimentation.
        """
    )
    st.write(
        """
**A potential solution - LangChain:**

LangChain is a framework for developing applications powered by language models¹². It allows users to connect a language model to other sources of data and to interact with its environment¹. Longformer is a transformer-based model that can process long sequences by using a combination of local and global attention⁴⁵. LangChain can improve over models like Longformer by providing modular abstractions and use-case-specific chains for working with language models¹, and it supports many integrations with systems and data sources such as cloud storage, web scraping, code generation, PDF manipulation, and more²³. A minimal sketch of a LangChain QnA chain is shown below.

Source: Conversation with Bing, 19/5/2023

1. 🦜️🔗 LangChain | 🦜️🔗 LangChain. https://docs.langchain.com/docs/
2. LangChain - Wikipedia. https://en.wikipedia.org/wiki/LangChain
3. [2004.05150] Longformer: The Long-Document Transformer - arXiv.org. https://arxiv.org/abs/2004.05150
4. Longformer - Hugging Face. https://huggingface.co/docs/transformers/model_doc/longformer
5. Welcome to LangChain - 🦜🔗 LangChain 0.0.173. https://python.langchain.com/en/latest/index.html
6. allenai/longformer: Longformer: The Long-Document Transformer - GitHub. https://github.com/allenai/longformer
        """
    )
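    # A hedged, minimal sketch of a LangChain question-answering chain over a FAISS
    # index, mirroring this app's embedding setup. The OpenAI LLM, the "stuff" chain
    # type, and the classic (pre-0.1) langchain API are assumptions shown for
    # reference only; they are not wired into this app.
    st.write("For illustration, a minimal LangChain QnA sketch over a FAISS index (LLM choice and chain type are examples):")
    st.code(
        '''
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# Load a previously saved index and retrieve chunks relevant to the question
db = FAISS.load_local("faiss_index", HuggingFaceEmbeddings())
question = "What does Longformer use global attention for?"
docs = db.similarity_search(question)

# Stuff the retrieved chunks into a single prompt and ask the LLM
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
answer = chain.run(input_documents=docs, question=question)
print(answer)
''',
        language="python",
    )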
if nav == 'QnA over Custom Docs':
    st.markdown("<h4 style='text-align: center; color:grey;'>Accelerate knowledge with Catalyst 🤖</h4>", unsafe_allow_html=True)
    st.text('')
    st.title("QnA over Custom Docs")
    st.text('')
    file = st.file_uploader('Upload your file here', type=['pdf'])
    if file is not None:
        with st.spinner('Converting your document to VectorDB...'):
            time.sleep(2)
            save_uploadedfile(file)
            db = custom_vector_db("tempfolder/" + file.name)
        input_su = st.text_area("Please Ask a Question", max_chars=1000, height=100)
        if input_su:
            # Retrieve the most similar chunks and pass them as context to the model
            docs = db.similarity_search(input_su)
            docs = "".join(txt.page_content for txt in docs)
            res = get_results_from_transformer(docs, input_su)
            # res = get_results_from_longformer(docs)  # summarization fallback
            st.write(res)
            time.sleep(2)
            st.markdown('___')
            st.caption("")
            st.balloons()
if nav == 'QnA over Luminary-AI Docs':
    st.markdown("<h4 style='text-align: center; color:grey;'>Search for Specific Documentation 🤖</h4>", unsafe_allow_html=True)
    st.text('')
    st.title("Search Luminary AI")
    st.text('')
    input_su = st.text_area("Hi! I am here to help you with your doc search.", max_chars=1000, height=100)
    if st.button('Search Docs'):
        with st.spinner('Searching For Relevant Docs...'):
            st.markdown('___')
            # Retrieve relevant chunks from the prebuilt index and answer with the model
            db = faiss_loader()
            docs = db.similarity_search(input_su)
            docs = "".join(txt.page_content for txt in docs)
            res = get_results_from_transformer(docs, input_su)
            # res = get_results_from_longformer(docs)  # summarization fallback
            st.write(res)
            st.caption("Hurray!")
            st.balloons()