Add subjects and add more tokens for the model to digest
- README.md +8 -13
- config.yaml +2 -2
- readme_images/example_custom_1.png +0 -0
- src/action.py +1 -1
- src/download_new_papers.py +1 -1
- src/relevancy.py +4 -3
- src/relevancy_prompt.txt +1 -1
- src/utils.py +1 -1
README.md
CHANGED
```diff
@@ -45,22 +45,16 @@ You can also send yourself an email of the digest by creating a SendGrid account
 
 #### Digest Configuration:
 - Subject/Topic: Computer Science
-- Categories: Artificial Intelligence, Computation and Language
+- Categories: Artificial Intelligence, Computation and Language, Machine Learning
 - Interest:
-
-
-
-
+1. Large language model pretraining and finetunings
+2. Multimodal machine learning
+3. RAGs, Information retrieval
+4. Optimization of LLM and GenAI
+5. Do not care about specific application, for example, information extraction, summarization, etc.
 
 #### Result:
-<p align="left"><img src="./readme_images/
-
-#### Digest Configuration:
-- Subject/Topic: Quantitative Finance
-- Interest: "making lots of money"
-
-#### Result:
-<p align="left"><img src="./readme_images/example_2.png" width=580 /></p>
+<p align="left"><img src="./readme_images/example_custom_1.png" width=580 /></p>
 
 ## 💡 Usage
 
@@ -96,6 +90,7 @@ To locally run the same UI as the Huggign Face space:
 
 - [x] Support personalized paper recommendation using LLM.
 - [x] Send emails for daily digest.
+- [x] Further read from the paper itself via its HTML format (.pdf version will be implemented in the next phase)
 - [ ] Implement a ranking factor to prioritize content from specific authors.
 - [ ] Support open-source models, e.g., LLaMA, Vicuna, MPT etc.
 - [ ] Fine-tune an open-source model to better support paper ranking and stay updated with the latest research concepts..
```
config.yaml
CHANGED
```diff
@@ -3,7 +3,7 @@ topic: "Computer Science"
 # An empty list here will include all categories in a topic
 # Use the natural language names of the topics, found here: https://arxiv.org
 # Including more categories will result in more calls to the large language model
-categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning"]
+categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning", "Information Retrieval"]
 
 # Relevance score threshold. abstracts that receive a score less than this from the large language model
 # will have their papers filtered out.
@@ -23,6 +23,6 @@ threshold: 6
 interest: |
   1. Large language model pretraining and finetunings
   2. Multimodal machine learning
-  3. RAGs
+  3. RAGs, Information retrieval
   4. Optimization of LLM and GenAI
   5. Do not care about specific application, for example, information extraction, summarization, etc.
```
readme_images/example_custom_1.png
ADDED
src/action.py
CHANGED
```diff
@@ -251,7 +251,7 @@ def generate_body(topic, categories, interest, threshold):
     )
     body = "<br><br>".join(
         [
-            f'<b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br><b>Authors:</b> {paper["authors"]}<br>'
+            f'<b>Subject: </b>{paper["subjects"]}<br><b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br><b>Authors:</b> {paper["authors"]}<br>'
             f'<b>Score:</b> {paper["Relevancy score"]}<br><b>Reason:</b> {paper["Reasons for match"]}<br>'
             f'<b>Goal:</b> {paper["Goal"]}<br><b>Data</b>: {paper["Data"]}<br><b>Methodology:</b> {paper["Methodology"]}<br>'
             f'<b>Experiments & Results</b>: {paper["Experiments & Results"]}<br><b>Git</b>: {paper["Git"]}<br>'
```
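The change above prepends each paper's arXiv subject to its block in the digest email. A minimal sketch of the join, with a hypothetical one-paper list standing in for the real digest data (the dict keys are taken from the diff):

```python
# Sketch of the updated email-body construction; the paper values here
# are made-up placeholders, only the key names come from the diff.
papers = [
    {
        "subjects": "cs.CL",
        "main_page": "https://arxiv.org/abs/0000.00000",
        "title": "Example Paper",
        "authors": "A. Author, B. Author",
    }
]

body = "<br><br>".join(
    f'<b>Subject: </b>{p["subjects"]}<br>'
    f'<b>Title:</b> <a href="{p["main_page"]}">{p["title"]}</a><br>'
    f'<b>Authors:</b> {p["authors"]}<br>'
    for p in papers
)
```

Because the adjacent f-strings are concatenated, each paper still renders as one `<br>`-separated block, now led by its subject.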
src/download_new_papers.py
CHANGED
@@ -22,7 +22,7 @@ def crawl_html_version(html_link):
|
|
22 |
|
23 |
for each in para_list:
|
24 |
main_content.append(each.text.strip())
|
25 |
-
return ' '.join(main_content)[:
|
26 |
#if len(main_content >)
|
27 |
#return ''.join(main_content) if len(main_content) < 20000 else ''.join(main_content[:20000])
|
28 |
def _download_new_papers(field_abbr):
|
|
|
22 |
|
23 |
for each in para_list:
|
24 |
main_content.append(each.text.strip())
|
25 |
+
return ' '.join(main_content)[:10000]
|
26 |
#if len(main_content >)
|
27 |
#return ''.join(main_content) if len(main_content) < 20000 else ''.join(main_content[:20000])
|
28 |
def _download_new_papers(field_abbr):
|
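The fixed return caps the crawled paper text at 10,000 characters. A small sketch of that behavior (`truncate_content` is a hypothetical helper name; the repo does the slice inline in `crawl_html_version`):

```python
def truncate_content(paragraphs, limit=10000):
    """Join paragraph texts and cap the result, mirroring
    return ' '.join(main_content)[:10000] from the diff."""
    joined = ' '.join(p.strip() for p in paragraphs)
    # Slicing never raises, so short documents pass through unchanged.
    return joined[:limit]

short = truncate_content(["Intro.", "Method."])      # 14 chars, untouched
capped = truncate_content(["x" * 6000, "y" * 6000])  # 12001 chars -> 10000
```

The hard cap keeps the per-paper prompt within the model's context budget regardless of how long the HTML version of a paper is.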
src/relevancy.py
CHANGED
```diff
@@ -39,7 +39,7 @@ def encode_prompt(query, prompt_papers):
 def is_json(myjson):
     try:
         json.loads(myjson)
-    except
+    except Exception as e:
         return False
     return True
 
@@ -97,7 +97,8 @@ def post_process_chat_gpt_response(paper_data, response, threshold_score=7):
         # if the decoding stops due to length, the last example is likely truncated so we discard it
         if scores[idx] < threshold_score:
             continue
-        output_str = "
+        output_str = "Subject: " + paper_data[idx]["subjects"] + "\n"
+        output_str += "Title: " + paper_data[idx]["title"] + "\n"
         output_str += "Authors: " + paper_data[idx]["authors"] + "\n"
         output_str += "Link: " + paper_data[idx]["main_page"] + "\n"
         for key, value in inst.items():
@@ -166,7 +167,7 @@ def generate_relevance_score(
     return ans_data, hallucination
 
 def run_all_day_paper(
-    query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence"]},
+    query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence", "Information Retrieval"]},
     date=None,
     data_dir="../data",
     model_name="gpt-3.5-turbo-16k",
```
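With the first hunk applied, any failure inside `json.loads` maps to `False`. A self-contained copy of the fixed helper for quick checking (`e` is bound but unused, as in the diff):

```python
import json

def is_json(myjson):
    # Mirrors the fixed helper in src/relevancy.py: a parse failure
    # (json.JSONDecodeError, TypeError, ...) is caught and reported
    # as "not valid JSON" instead of crashing post-processing.
    try:
        json.loads(myjson)
    except Exception as e:  # e unused, kept to match the repo
        return False
    return True
```

Catching the broad `Exception` (rather than only `json.JSONDecodeError`) also covers non-string inputs, at the cost of masking unrelated errors; here the helper is only a validity probe, so that trade-off is deliberate.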
src/relevancy_prompt.txt
CHANGED
```diff
@@ -5,4 +5,4 @@ Please keep the paper order the same as in the input list, with one json format
 
 1. {"Relevancy score": "an integer score out of 10", "Reasons for match": "1-2 sentence short reasonings", "Goal": "What kind of pain points the paper is trying to solve?", "Data": "Summary of the data source used in the paper", "Methodology": "Summary of methodologies used in the paper", "Git": "Link to the code repo (if available)", "Experiments & Results": "Summary of any experiments & its results", "Discussion & Next steps": "Further discussion and next steps of the research"}
 
-My research interests are: NLP, RAGs, LLM, Optmization in Machine learning, Data science, Generative AI, Optimization in LLM, Finance modelling ...
+My research interests are: NLP, RAGs, LLM, Information Retrieval, Optmization in Machine learning, Data science, Generative AI, Optimization in LLM, Finance modelling ...
```
src/utils.py
CHANGED
```diff
@@ -25,7 +25,7 @@ if openai_org is not None:
 @dataclasses.dataclass
 class OpenAIDecodingArguments(object):
     #max_tokens: int = 1800
-    max_tokens: int =
+    max_tokens: int = 5400
     temperature: float = 0.2
     top_p: float = 1.0
     n: int = 1
```
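Raising `max_tokens` to 5400 (the commented-out previous value was 1800) leaves room for the longer structured output the digest now requests per batch of papers, which is presumably the "more tokens for the model to digest" in the commit title. A sketch of the dataclass after the change, with the field values copied from the diff:

```python
import dataclasses

@dataclasses.dataclass
class OpenAIDecodingArguments:
    # Values as of this change; 5400 output tokens fit comfortably
    # inside the gpt-3.5-turbo-16k context used in src/relevancy.py.
    max_tokens: int = 5400
    temperature: float = 0.2
    top_p: float = 1.0
    n: int = 1

args = OpenAIDecodingArguments()
```

Keeping the decoding knobs in a dataclass means call sites construct one `args` object and pass it around, rather than threading four keyword arguments through every API call.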