linhkid91 committed
Commit fc807c3 · Parent(s): fc0e67e

Add subjects and add more tokens for the model to digest

README.md CHANGED
@@ -45,22 +45,16 @@ You can also send yourself an email of the digest by creating a SendGrid account
  
  #### Digest Configuration:
  - Subject/Topic: Computer Science
- - Categories: Artificial Intelligence, Computation and Language
+ - Categories: Artificial Intelligence, Computation and Language, Machine Learning
  - Interest:
- - Large language model pretraining and finetunings
- - Multimodal machine learning
- - Do not care about specific application, for example, information extraction, summarization, etc.
- - Not interested in paper focus on specific languages, e.g., Arabic, Chinese, etc.
+ 1. Large language model pretraining and finetunings
+ 2. Multimodal machine learning
+ 3. RAGs, Information retrieval
+ 4. Optimization of LLM and GenAI
+ 5. Do not care about specific application, for example, information extraction, summarization, etc.
  
  #### Result:
- <p align="left"><img src="./readme_images/example_1.png" width=580 /></p>
-
- #### Digest Configuration:
- - Subject/Topic: Quantitative Finance
- - Interest: "making lots of money"
-
- #### Result:
- <p align="left"><img src="./readme_images/example_2.png" width=580 /></p>
+ <p align="left"><img src="./readme_images/example_custom_1.png" width=580 /></p>
  
  ## 💡 Usage
  
@@ -96,6 +90,7 @@ To locally run the same UI as the Huggign Face space:
  
  - [x] Support personalized paper recommendation using LLM.
  - [x] Send emails for daily digest.
+ - [x] Further read from the paper itself via its HTML format (.pdf version will be implemented in the next phase)
  - [ ] Implement a ranking factor to prioritize content from specific authors.
  - [ ] Support open-source models, e.g., LLaMA, Vicuna, MPT etc.
  - [ ] Fine-tune an open-source model to better support paper ranking and stay updated with the latest research concepts..
config.yaml CHANGED
@@ -3,7 +3,7 @@ topic: "Computer Science"
  # An empty list here will include all categories in a topic
  # Use the natural language names of the topics, found here: https://arxiv.org
  # Including more categories will result in more calls to the large language model
- categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning"]
+ categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning", "Information Retrieval"]
  
  # Relevance score threshold. abstracts that receive a score less than this from the large language model
  # will have their papers filtered out.
@@ -23,6 +23,6 @@ threshold: 6
  interest: |
    1. Large language model pretraining and finetunings
    2. Multimodal machine learning
-   3. RAGs
+   3. RAGs, Information retrieval
    4. Optimization of LLM and GenAI
    5. Do not care about specific application, for example, information extraction, summarization, etc.
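The `categories` list drives which arXiv listings are crawled, and the `interest` block scalar is passed verbatim into the LLM prompt. A minimal sketch of how such a config might be parsed (assuming PyYAML is installed; the inline `CONFIG` string is a shortened stand-in for the real `config.yaml`):

```python
import yaml  # PyYAML; assumed available, as the repo reads config.yaml

# Shortened stand-in for config.yaml, mirroring the fields in the diff.
CONFIG = """
topic: "Computer Science"
categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning", "Information Retrieval"]
threshold: 6
interest: |
  1. Large language model pretraining and finetunings
  2. Multimodal machine learning
  3. RAGs, Information retrieval
"""

config = yaml.safe_load(CONFIG)
# `interest` stays one multi-line string thanks to the `|` block scalar,
# which is why it can be dropped straight into the prompt.
print(config["categories"][-1])  # -> Information Retrieval
```

Note that adding a category (here "Information Retrieval") widens the crawl and, per the comment in the file, increases the number of LLM calls.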
readme_images/example_custom_1.png ADDED
src/action.py CHANGED
@@ -251,7 +251,7 @@ def generate_body(topic, categories, interest, threshold):
      )
      body = "<br><br>".join(
          [
-             f'<b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br><b>Authors:</b> {paper["authors"]}<br>'
+             f'<b>Subject: </b>{paper["subjects"]}<br><b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br><b>Authors:</b> {paper["authors"]}<br>'
              f'<b>Score:</b> {paper["Relevancy score"]}<br><b>Reason:</b> {paper["Reasons for match"]}<br>'
              f'<b>Goal:</b> {paper["Goal"]}<br><b>Data</b>: {paper["Data"]}<br><b>Methodology:</b> {paper["Methodology"]}<br>'
              f'<b>Experiments & Results</b>: {paper["Experiments & Results"]}<br><b>Git</b>: {paper["Git"]}<br>'
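The change above prepends each paper's arXiv subject to its digest entry. A minimal sketch of the `<br><br>`-joined body-building pattern, with hypothetical sample data standing in for the dicts the relevancy step produces (only a subset of the real fields is shown):

```python
# Hypothetical sample; real entries come from the relevancy-scoring step
# and carry more fields (Goal, Data, Methodology, ...).
papers = [
    {
        "subjects": "cs.CL",
        "main_page": "https://arxiv.org/abs/0000.00000",
        "title": "Example Paper",
        "authors": "A. Author",
        "Relevancy score": 8,
        "Reasons for match": "Matches the LLM pretraining interest.",
    },
]

# One HTML snippet per paper, separated by blank lines in the email body.
body = "<br><br>".join(
    f'<b>Subject: </b>{p["subjects"]}<br>'
    f'<b>Title:</b> <a href="{p["main_page"]}">{p["title"]}</a><br>'
    f'<b>Authors:</b> {p["authors"]}<br>'
    f'<b>Score:</b> {p["Relevancy score"]}<br>'
    f'<b>Reason:</b> {p["Reasons for match"]}'
    for p in papers
)
```

Because the subject is the first fragment, every digest entry now opens with the paper's arXiv category, matching the README's new example screenshot.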
src/download_new_papers.py CHANGED
@@ -22,7 +22,7 @@ def crawl_html_version(html_link):
  
      for each in para_list:
          main_content.append(each.text.strip())
-     return ' '.join(main_content)[:8000]
+     return ' '.join(main_content)[:10000]
      #if len(main_content >)
      #return ''.join(main_content) if len(main_content) < 20000 else ''.join(main_content[:20000])
  def _download_new_papers(field_abbr):
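The `[:8000]` → `[:10000]` change simply raises the character cap on the crawled HTML full text — the "more tokens for the model to digest" from the commit message. The truncation step can be isolated as a pure function; `join_and_truncate` is a hypothetical helper name sketched here under the assumption that the crawled paragraphs arrive as a list of strings:

```python
def join_and_truncate(paragraphs, limit=10000):
    """Join stripped paragraph texts with single spaces and cap the
    result at `limit` characters, mirroring
    `return ' '.join(main_content)[:10000]` in crawl_html_version()."""
    return " ".join(p.strip() for p in paragraphs)[:limit]

# Two 6,000-character paragraphs join to 12,001 characters,
# then get clipped to the 10,000-character budget.
text = join_and_truncate(["x" * 6000, "y" * 6000])
```

Slicing past the end of a short string is a no-op in Python, so inputs under the limit pass through unchanged.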
src/relevancy.py CHANGED
@@ -39,7 +39,7 @@ def encode_prompt(query, prompt_papers):
  def is_json(myjson):
      try:
          json.loads(myjson)
-     except ValueError as e:
+     except Exception as e:
          return False
      return True
  
@@ -97,7 +97,8 @@ def post_process_chat_gpt_response(paper_data, response, threshold_score=7):
          # if the decoding stops due to length, the last example is likely truncated so we discard it
          if scores[idx] < threshold_score:
              continue
-         output_str = "Title: " + paper_data[idx]["title"] + "\n"
+         output_str = "Subject: " + paper_data[idx]["subjects"] + "\n"
+         output_str += "Title: " + paper_data[idx]["title"] + "\n"
          output_str += "Authors: " + paper_data[idx]["authors"] + "\n"
          output_str += "Link: " + paper_data[idx]["main_page"] + "\n"
          for key, value in inst.items():
@@ -166,7 +167,7 @@ def generate_relevance_score(
      return ans_data, hallucination
  
  def run_all_day_paper(
-     query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence"]},
+     query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence", "Information Retrieval"]},
      date=None,
      data_dir="../data",
      model_name="gpt-3.5-turbo-16k",
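Widening `except ValueError` to `except Exception` makes `is_json` tolerant of non-string input as well as malformed JSON: `json.loads` raises `json.JSONDecodeError` (a `ValueError` subclass) for bad syntax, but a `TypeError` for inputs like `None`. A standalone sketch of the resulting behavior:

```python
import json

def is_json(myjson):
    """Return True iff `myjson` parses as JSON; any failure counts as False."""
    try:
        json.loads(myjson)
    except Exception:
        # Catches JSONDecodeError (malformed JSON) and also TypeError
        # (non-str/bytes input), which the old ValueError handler missed.
        return False
    return True

print(is_json('{"a": 1}'))   # valid JSON object
print(is_json("not json"))   # malformed -> JSONDecodeError -> False
print(is_json(None))         # TypeError -> False (crashed before this commit)
```

Since the function is used to vet LLM output, failing closed on any parse problem is the safer default here.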
src/relevancy_prompt.txt CHANGED
@@ -5,4 +5,4 @@ Please keep the paper order the same as in the input list, with one json format
  
  1. {"Relevancy score": "an integer score out of 10", "Reasons for match": "1-2 sentence short reasonings", "Goal": "What kind of pain points the paper is trying to solve?", "Data": "Summary of the data source used in the paper", "Methodology": "Summary of methodologies used in the paper", "Git": "Link to the code repo (if available)", "Experiments & Results": "Summary of any experiments & its results", "Discussion & Next steps": "Further discussion and next steps of the research"}
  
- My research interests are: NLP, RAGs, LLM, Optmization in Machine learning, Data science, Generative AI, Optimization in LLM, Finance modelling ...
+ My research interests are: NLP, RAGs, LLM, Information Retrieval, Optmization in Machine learning, Data science, Generative AI, Optimization in LLM, Finance modelling ...
src/utils.py CHANGED
@@ -25,7 +25,7 @@ if openai_org is not None:
  @dataclasses.dataclass
  class OpenAIDecodingArguments(object):
      #max_tokens: int = 1800
-     max_tokens: int = 4800
+     max_tokens: int = 5400
      temperature: float = 0.2
      top_p: float = 1.0
      n: int = 1
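Raising `max_tokens` from 4800 to 5400 gives completions more room now that up to 10,000 characters of crawled full text enter the prompt. The dataclass can be reproduced standalone (a sketch showing only the fields in the diff; the real class defines more decoding parameters):

```python
import dataclasses

@dataclasses.dataclass
class OpenAIDecodingArguments(object):
    #max_tokens: int = 1800
    max_tokens: int = 5400
    temperature: float = 0.2
    top_p: float = 1.0
    n: int = 1

# asdict() turns the defaults into kwargs that can be splatted into an
# API call, e.g. client.chat.completions.create(..., **args).
args = dataclasses.asdict(OpenAIDecodingArguments())
```

Keeping decoding defaults in one dataclass means a single edit (like this commit's) changes every call site at once.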