OxbridgeEconomics committed
Commit dc5ddb1 · 2 Parent(s): 98d9fcd ef71343

Merge branch 'main' of https://github.com/oxbridge-econ/data-collection-china

Files changed (17)
  1. README.md +64 -1
  2. cbirc.py +14 -4
  3. chinatax.py +14 -1
  4. csrc.py +23 -2
  5. daily.py +29 -15
  6. eastmoney.py +22 -0
  7. glue.py +6 -2
  8. gov.py +20 -0
  9. manual_upload.py +23 -6
  10. mof.py +15 -1
  11. mofcom.py +6 -0
  12. ndrc.py +13 -1
  13. patterns.json +78 -12
  14. pbc.py +15 -0
  15. safe.py +23 -1
  16. stats.py +16 -2
  17. utils.py +244 -54
README.md CHANGED
@@ -1 +1,64 @@
- # security-report-collection
+ # Security Report Collection
+
+ The Security Report Collection repository contains a series of Python scripts designed to automate the collection, processing, and storage of financial and policy data from various Chinese government and financial websites. This data is vital for understanding changes in policy, financial news, and regulatory measures that could impact markets and investments.
+
+ ## Repository Structure
+
+ - **Python Scripts**: Each script is tailored to specific sources and tasks, ranging from data scraping to sentiment analysis and database operations.
+ - **GitHub Workflows**: Automated workflows that execute the Python scripts on a schedule or in response to specific trigger events; `utils.py` and `manual_upload.py` are excluded.
+ - **requirements.txt**: Lists all Python dependencies required for the scripts to run.
+
+ ## Python Scripts Overview
+
+ Each script targets different data sources or handles distinct aspects of data management:
+
+ ### Data Collection Scripts
+
+ 1. **CBIRC, Chinatax, CSRC, Daily, Eastmoney, Glue, Gov, Manual_Upload, MOF, MOFCOM, PBC, SAFE, Stats**:
+    - These scripts scrape data from their respective websites, handling tasks such as extracting article URLs, downloading articles, translating content, and calculating sentiment scores.
+    - They use utilities provided by `utils.py` to interact with databases, manage files, and perform translations and sentiment analysis.
+
+ ### Utility Scripts
+
+ - **utils.py**:
+   - A central utility script that supports database operations, file handling, content translation, and other functionality shared across the scripts.
+   - It includes custom functions for working with AWS DynamoDB, handling PDFs, fetching URLs, and more.
+
+ ### Special Scripts
+
+ - **manual_upload.py**:
+   - Allows manual data entry into the database, facilitating the addition of articles not captured through the automated scripts.
+   - Provides a command-line interface for inputting article details and saving them to DynamoDB.
+
+ ## GitHub Workflows
+
+ - Automated workflows are set up for all Python scripts except `utils.py` and `manual_upload.py`.
+ - These workflows ensure that data collection and processing tasks run periodically or in response to specific triggers, keeping the database up to date.
+
+ ## Requirements
+
+ - The `requirements.txt` file includes all necessary Python packages such as `boto3`, `lxml`, `requests`, `pandas`, `PyPDF2`, and others. Install these packages using:
+   ```pip install -r requirements.txt```
+
+ ## Setup and Configuration
+
+ 1. **AWS Configuration**:
+    - Ensure AWS credentials are correctly configured for access to services such as S3 and DynamoDB.
+    - Set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
+
+ 2. **Database Setup**:
+    - The scripts assume specific DynamoDB table configurations. Set up the required tables in AWS DynamoDB to match the scripts' needs.
+
+ 3. **Python Environment**:
+    - It's recommended to set up a virtual environment for Python to manage dependencies:
+    ```
+    python -m venv venv
+    source venv/bin/activate  # On Unix/macOS
+    venv\Scripts\activate     # On Windows
+    ```
+
+ 4. **Running Scripts**:
+    - To run a script manually, navigate to the script's directory and execute:
+    ```
+    python <script_name>.py
+    ```
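The "Database Setup" step above leaves the table layout implicit. From the `utils.py` changes later in this commit, articles are written to a DynamoDB table named `article_china` keyed by `id` and `site`. A minimal sketch of creating that table with `boto3`; the billing mode and the choice of partition vs. sort key are assumptions, not taken from the repository:

```python
import os

import boto3

dynamodb = boto3.client(
    "dynamodb",
    region_name="us-east-1",  # matches the region hard-coded in utils.py
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

dynamodb.create_table(
    TableName="article_china",
    AttributeDefinitions=[
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "site", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "id", "KeyType": "HASH"},    # assumed partition key
        {"AttributeName": "site", "KeyType": "RANGE"}, # assumed sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # assumption; provisioned capacity also works
)
```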
cbirc.py CHANGED
@@ -1,9 +1,19 @@
+ """
+ This script fetches data from the China Banking and Insurance Regulatory Commission (CBIRC) website and extracts relevant information from the fetched data.
+ The extracted information is then processed and stored in a database.
+
+ The script performs the following steps:
+ 1. Fetches data from the CBIRC website by making HTTP requests.
+ 2. Parses the fetched data and extracts relevant information.
+ 3. Translates the extracted information to English.
+ 4. Computes sentiment scores for the translated content.
+ 5. Stores the processed information in a database.
+
+ Note: The script also includes commented code for fetching data from the State Taxation Administration of China website, but it is currently disabled.
+ """
  import json
- import ssl
  import uuid
  import time
- import urllib.request
- import urllib3
  from datetime import datetime, timedelta
  from utils import translate, sentiment_computation, upsert_content, fetch_url, extract_from_pdf

@@ -34,7 +44,7 @@ while i > -1:
  article['titleCN'] = article['docSubtitle']
  article['title'] = translate(article['docSubtitle'])
  article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
- article['category']= "Policy Interpretation"
+ article['category']= "Policy Interpretation"
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
  article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
  article['attachment'] = ''
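The docstring added to `cbirc.py` describes the fetch → translate → score → store pipeline most collectors in this commit share, built on the `utils` helpers it imports (`fetch_url`, `translate`, `sentiment_computation`, `upsert_content`). A hedged sketch of how one article record is assembled, using only field names visible in this diff; the sample `item` stands in for one entry of the parsed CBIRC response, and the `site`, `author`, and `subtitle` values are placeholders:

```python
import uuid

from utils import translate, sentiment_computation, upsert_content

# Stand-in for one record parsed out of the CBIRC search response.
item = {
    "docSubtitle": "政策解读示例",
    "pdfFileUrl": "/cn/view/pages/example.pdf",
    "publishDate": "2024-01-05",
    "content": "示例正文",
}

article = {
    "site": "CBIRC",                        # assumed site label
    "titleCN": item["docSubtitle"],
    "title": translate(item["docSubtitle"]),
    "link": "https://www.cbirc.gov.cn" + str(item["pdfFileUrl"]),
    "publishDate": item["publishDate"],
    "category": "Policy Interpretation",
    "contentCN": item["content"],
    "content": translate(item["content"]),
    "author": "",                           # placeholder
    "subtitle": "",                         # placeholder
    "attachment": "",
}
# Deterministic id, so re-running the collector upserts instead of duplicating.
article["id"] = uuid.uuid5(uuid.NAMESPACE_OID, article["titleCN"] + article["publishDate"])
article["sentimentScore"], article["sentimentLabel"] = sentiment_computation(article["content"])
upsert_content(article)
```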
chinatax.py CHANGED
@@ -1,3 +1,16 @@
+ """
+ This script is used for data collection from the China Taxation website. It retrieves policy interpretation articles and processes them for further analysis.
+
+ The script performs the following steps:
+ 1. Imports necessary modules and libraries.
+ 2. Defines the base URL for retrieving policy interpretation articles.
+ 3. Iterates through the pages of the search results.
+ 4. Retrieves the content of each article.
+ 5. Processes the content by translating it to English and performing sentiment analysis.
+ 6. Stores the processed data in a database.
+
+ Note: The script also retrieves additional articles from a different URL and follows a similar process.
+ """
  import json
  import ssl
  import uuid
@@ -6,7 +19,7 @@ import time
  import urllib.request
  import urllib3
  from lxml import etree
- from utils import encode, translate, sentiment_computation, upsert_content, encode_content
+ from utils import translate, sentiment_computation, upsert_content, encode_content

  ssl._create_default_https_context = ssl._create_stdlib_context

csrc.py CHANGED
@@ -1,3 +1,24 @@
+ """
+ This script is used to crawl and collect data from the website of the China Securities Regulatory Commission (CSRC).
+ It retrieves policy interpretation articles and financial news articles from the CSRC website.
+ The collected data is then processed and stored in a database.
+
+ The script consists of two main parts:
+ 1. Crawl and process policy interpretation articles from the CSRC website.
+ 2. Crawl and process financial news articles from the CSRC website.
+
+ The script uses various libraries and functions to handle web scraping, data processing, and database operations.
+
+ Note: This script assumes the presence of the following dependencies:
+ - urllib
+ - lxml
+ - json
+ - datetime
+ - time
+ - utils (custom module)
+
+ Please make sure to install these dependencies before running the script.
+ """
  import uuid
  import json
  import time
@@ -35,7 +56,7 @@ while i > -1:
  article['category']= "Policy Interpretation"
  crawl(url, article)
  except Exception as error:
- print(error)
+ print(error)

  i = 1
  while i > -1:
@@ -70,4 +91,4 @@ while i > -1:
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
  upsert_content(article)
  except Exception as error:
- print(error)
+ print(error)
daily.py CHANGED
@@ -1,21 +1,21 @@
- import os
+ """
+ This script is responsible for collecting data from various websites related to financial and policy information in China.
+ It fetches data from different sources, extracts relevant information, translates it, and updates the content accordingly.
+ The collected data includes policy interpretations, financial news, macroeconomic research, and more.
+ """
  import json
- import uuid
+ import os
  import time
  import urllib.request
- from lxml import etree
+ import uuid
  from datetime import datetime, timedelta
  from urllib.parse import urlparse
- from utils import (encode,
- translate,
- sentiment_computation,
- fetch_url,
- extract_from_pdf,
- crawl,
- datemodifier,
- encode_content,
- update_content,
- extract_reference)
+
+ from lxml import etree
+
+ from utils import (crawl, datemodifier, encode, encode_content,
+ extract_from_pdf, extract_reference, fetch_url,
+ sentiment_computation, translate, update_content)

  with open('xpath.json', 'r', encoding='UTF-8') as f:
  xpath_dict = json.load(f)
@@ -50,7 +50,7 @@ while i > -1:
  article['titleCN'] = article['docSubtitle']
  article['title'] = translate(article['docSubtitle'])
  article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
- article['category']= "Policy Interpretation"
+ article['category']= "Policy Interpretation"
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
  article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
  article['attachment'] = ''
@@ -133,6 +133,20 @@

  print("data.eastmoney.com")
  def crawl_eastmoney(url, article):
+ """
+ Crawls the given URL and extracts information from the webpage.
+
+ Args:
+ url (str): The URL of the webpage to crawl.
+ article (dict): A dictionary to store the extracted information.
+
+ Returns:
+ None: If the length of the extracted content is less than 10 characters.
+
+ Raises:
+ None.
+
+ """
  domain = urlparse(url).netloc
  req = urllib.request.urlopen(url)
  text = req.read()
@@ -499,4 +513,4 @@ while i > -1:
  article['category']= "Data Interpretation"
  crawl(url, article)
  except Exception as error:
- print(error)
+ print(error)
eastmoney.py CHANGED
@@ -1,3 +1,9 @@
+ """
+ This script is used to crawl a webpage and extract relevant information from it. It defines a function `crawl` that takes a URL and a dictionary to store the extracted information. The function crawls the webpage, extracts the content, translates it to English, and stores it in the dictionary.
+
+ The script also includes a main loop that fetches data from a specific URL and calls the `crawl` function for each article in the fetched data.
+ """
+
  import uuid
  import json
  import urllib.request
@@ -6,10 +12,26 @@ from datetime import datetime, timedelta
  from lxml import etree
  from utils import encode, translate, datemodifier, sentiment_computation, upsert_content, fetch_url, encode_content

+ # Load XPath dictionary from a JSON file
  with open('xpath.json', 'r', encoding='UTF-8') as f:
  xpath_dict = json.load(f)

  def crawl(url, article):
+ """
+ Crawls the given URL and extracts relevant information from the webpage.
+
+ Args:
+ url (str): The URL of the webpage to crawl.
+ article (dict): A dictionary to store the extracted information.
+
+ Returns:
+ None: If the length of the extracted content is less than 10 characters.
+ str: The extracted content in English if successful.
+
+ Raises:
+ None
+
+ """
  domain = urlparse(url).netloc
  req = urllib.request.urlopen(url)
  text = req.read()
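Both `eastmoney.py` and `daily.py` resolve their extraction rules from `xpath.json`, keyed by the article's domain (`urlparse(url).netloc`). A minimal sketch of that lookup-and-extract step with `lxml`; the URL is a placeholder and the field names inside `xpath.json` (`"title"`, `"content"`) are assumptions for illustration:

```python
import json
import urllib.request
from urllib.parse import urlparse

from lxml import etree

with open('xpath.json', 'r', encoding='UTF-8') as f:
    xpath_dict = json.load(f)

url = "https://finance.eastmoney.com/a/placeholder.html"  # placeholder article URL
domain = urlparse(url).netloc

req = urllib.request.urlopen(url)
page = etree.HTML(req.read())

# Assumed layout: xpath_dict[domain] maps field names to XPath expressions.
title_nodes = page.xpath(xpath_dict[domain]["title"])
content_nodes = page.xpath(xpath_dict[domain]["content"])
```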
glue.py CHANGED
@@ -6,7 +6,11 @@ AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
  AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']

  def get_client_connection():
- """Get dynamoDB connection"""
+ """
+ Returns a client connection to the AWS Glue service.
+
+ :return: AWS Glue client connection
+ """
  return boto3.client(
  service_name='glue',
  region_name='us-east-1',
@@ -22,4 +26,4 @@ print(response)
  response = glue.start_job_run(
  JobName='Reference China'
  )
- print(response)
+ print(response)
gov.py CHANGED
@@ -1,3 +1,23 @@
+ """
+ This script is used to crawl and collect policy articles from the official website of the State Council of China (https://www.gov.cn).
+
+ The script contains two main functions:
+ 1. crawl(url, article): This function is responsible for crawling a specific policy article given its URL and extracting relevant information such as title, author, content, publish date, etc.
+ 2. main(): This function is the entry point of the script. It iterates over different pages of policy articles and calls the crawl function to collect the information.
+
+ Note: The script imports the following modules: datetime, timedelta, time, urllib.request, lxml.etree, and utils (custom module).
+ """
+
+ from datetime import datetime, timedelta
+ import time
+ import urllib.request
+ from lxml import etree
+ from utils import crawl
+
+ # Rest of the code...
+ """
+
+ """
  from datetime import datetime, timedelta
  import time
  import urllib.request
manual_upload.py CHANGED
@@ -1,7 +1,27 @@
- from decimal import Decimal
- from utils import translate, sentiment_computation, get_db_connection
- from datetime import datetime
+ """
+ This script allows the user to manually upload an article to a database. It prompts the user to enter various details about the article, such as the title, content, subtitle, publish date, link, and site. It then computes the sentiment of the article's translated content and constructs a dictionary representing the article. Finally, it inserts or updates the article in the database.
+
+ Dependencies:
+ - decimal
+ - utils (custom module)
+ - datetime
+ - uuid
+
+ Usage:
+ 1. Run the script.
+ 2. Enter the required details about the article when prompted.
+ 3. The script will compute the sentiment of the translated content and construct a dictionary representing the article.
+ 4. The article will be inserted or updated in the database.
+ 5. The article dictionary and the response from the database operation will be printed.
+
+ Note: Make sure to configure the database connection and table name before running the script.
+ """
+
  import uuid
+ from datetime import datetime
+ from decimal import Decimal
+
+ from utils import get_db_connection, sentiment_computation, translate

  # User input for the article content
  article_titleCN = input("Enter the title of the article: ")
@@ -12,7 +32,6 @@ article_publish_date = input("Enter the publish date of the article (YYYY-MM-DD)
  article_link = input("Enter the link to the article: ")
  article_siteCN = input("Enter the site of the article: ")

-
  # Compute sentiment of the translated content
  sentiment_score, sentiment_label = sentiment_computation(article_contentCN)

@@ -30,8 +49,6 @@ article= {
  'publishDate': article_publish_date,
  'link': article_link,
  'attachment': '',
- # 'authorID': str(report['authorid']),
- # 'entityList': report['entitylist'],
  'sentimentScore': Decimal(str(sentiment_score)).quantize(Decimal('0.01')),
  'sentimentLabel': sentiment_label,
  'LastModifiedDate': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
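manual_upload.py stores the sentiment score as a `Decimal` quantized to two places, since boto3 writes DynamoDB number attributes as `Decimal` rather than `float`. A quick illustration of the conversion used above:

```python
from decimal import Decimal

sentiment_score = 0.1234
# str() first avoids binary-float noise; quantize rounds to two decimal places.
print(Decimal(str(sentiment_score)).quantize(Decimal('0.01')))  # Decimal('0.12')
```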
mof.py CHANGED
@@ -1,9 +1,22 @@
+ """
+ This script is used to crawl and collect financial news and policy interpretation articles from the website of the Ministry of Finance of China (https://www.mof.gov.cn/).
+
+ The script iterates through the pages of the "Financial News" and "Policy Interpretation" categories on the website and extracts the articles' URLs. It then calls the `crawl` function from the `utils` module to crawl and collect the article data.
+
+ The script uses the `lxml` library to parse the HTML content of the website and extract the necessary information.
+
+ Note: The script assumes the existence of a `crawl` function in the `utils` module.
+ """
+
  import time
  import urllib.request
- from lxml import etree
  from datetime import datetime, timedelta
+
+ from lxml import etree
+
  from utils import crawl

+ # Crawl Financial News articles
  i = 0
  while i > -1:
  if i == 0:
@@ -38,6 +51,7 @@ while i > -1:
  except Exception as error:
  print(error)

+ # Crawl Policy Interpretation articles
  i = 0
  while i > -1:
  if i == 0:
mofcom.py CHANGED
@@ -1,3 +1,9 @@
+ """
+ This script is used to crawl and collect data from the Ministry of Commerce of the People's Republic of China (MOFCOM) website.
+ It retrieves articles from different categories and extracts relevant information such as date and URL.
+ The collected data is then passed to the 'crawl' function for further processing.
+ """
+
  import time
  import urllib.request
  from datetime import datetime, timedelta
ndrc.py CHANGED
@@ -1,5 +1,17 @@
+ """
+ This script is used to crawl and collect data from the National Development and Reform Commission (NDRC) website.
+ It retrieves articles from the website and categorizes them as either "Policy Release" or "Policy Interpretation".
+ The script starts by iterating through the pages of the website, starting from the first page.
+ For each page, it retrieves the HTML content and parses it using the lxml library.
+ It then extracts the article list from the parsed HTML and iterates through each article.
+ For each article, it extracts the publication date, converts it to a datetime object, and checks if it is within the last 183 days.
+ If the article is older than 183 days, the script stops iterating through the pages.
+ Otherwise, it extracts the URL of the article and categorizes it based on the URL pattern.
+ The script then calls the 'crawl' function from the 'utils' module to crawl the article and collect data.
+ Any exceptions that occur during the crawling process are caught and printed.
+ """
+
  from datetime import datetime, timedelta
- import uuid
  import time
  import urllib.request
  from lxml import etree
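ndrc.py, safe.py, and stats.py all rely on the same stop condition: keep paging while every article on the page was published within the last 183 days (roughly six months). A hedged sketch of that loop skeleton; the date format and sample dates are illustrative, not copied from ndrc.py:

```python
from datetime import datetime, timedelta

CUTOFF = datetime.today() - timedelta(days=183)

def within_window(date_text, fmt="%Y/%m/%d"):
    """True if the scraped publish date still falls inside the 183-day window."""
    return datetime.strptime(date_text, fmt) >= CUTOFF

i = 0
while i > -1:
    # Stand-in for the publish dates scraped from page i (listings are newest-first).
    dates = ["2025/06/01", "2024/01/01"]
    if dates and all(within_window(d) for d in dates):
        i += 1   # every article is recent enough; fetch the next page
    else:
        break    # hit an article older than the cutoff; stop crawling
```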
patterns.json CHANGED
@@ -275,7 +275,13 @@
  "keyword": "相关研究",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": "——",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "Yongxing Securities Co., Ltd.",
@@ -293,16 +299,28 @@
  "keyword": "相关研究",
  "article_regex": "《(.*?)》",
  "date_regex": "(d{4}\\s/\\d{2}/\\d{2}) ",
- "date_format": "(%Y/%m/%d) "
+ "date_format": "(%Y/%m/%d) ",
+ "split":[
+ {
+ "string": "——",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "Hualong Securities Co., Ltd.",
  "pages": [0],
- "date_range": 1,
+ "date_range": 5,
  "keyword": "相关阅读",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": "——",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "Hebei Yuanda Information Technology Co., Ltd.",
@@ -311,7 +329,13 @@
  "keyword": "相关报告:",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Huaxin Securities Co., Ltd.",
@@ -329,7 +353,13 @@
  "keyword": "1.",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": "——",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "Beijing Tengjing Big Data Application Technology Research Institute",
@@ -338,7 +368,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}-\\d{2}-\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Wanhe Securities Co., Ltd.",
@@ -347,7 +383,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}-\\d{2}-\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": "-",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Centaline Securities Co., Ltd.",
@@ -356,7 +398,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}\\s?-\\s?\\d{2}\\s?-\\s?\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Tengjing Digital Research",
@@ -365,7 +413,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}\\s?-\\s?\\d{2}\\s?-\\s?\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Guoyuan Securities",
@@ -374,7 +428,13 @@
  "keyword": "相关研究报告",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "China Galaxy Co., Ltd.",
@@ -392,7 +452,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}\\s?-\\s?\\d{2}\\s?-\\s?\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": ":",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "SDIC Anxin Futures",
pbc.py CHANGED
@@ -1,3 +1,18 @@
+ """
+ This module contains code to scrape the People's Bank of China website and collect policy interpretation articles. It iterates through the pages of the website, extracts relevant information from each article, and stores the data in a database.
+
+ The main functionality of this module includes:
+ - Scraping the website for policy interpretation articles
+ - Parsing the HTML content of each article
+ - Extracting relevant information such as title, content, publish date, and URL
+ - Translating the content from Chinese to English
+ - Computing sentiment scores for the content
+ - Storing the collected data in a database
+
+ Note: This code assumes the existence of the following helper functions: encode, translate, datemodifier, sentiment_computation, and upsert_content.
+
+ """
+
  import time
  import uuid
  from datetime import datetime, timedelta
safe.py CHANGED
@@ -1,9 +1,30 @@
+ """Module to crawl the data from the website of State Administration of Foreign Exchange (SAFE) of China.
+
+ This module contains code to crawl and collect data from the website of the State Administration of Foreign Exchange (SAFE) of China. It includes two sections: Policy Interpretation and Data Interpretation.
+
+ Policy Interpretation:
+ - The code crawls the web pages containing policy interpretations from the SAFE website.
+ - It retrieves the publication date and checks if it is within the last 183 days.
+ - If the publication date is within the last 183 days, it extracts the URL and other information of the policy interpretation article.
+ - The extracted data is stored in a dictionary and passed to the 'crawl' function for further processing.
+
+ Data Interpretation:
+ - The code crawls the web pages containing data interpretations from the SAFE website.
+ - It retrieves the publication date and checks if it is within the last 183 days.
+ - If the publication date is within the last 183 days, it extracts the URL and other information of the data interpretation article.
+ - The extracted data is stored in a dictionary and passed to the 'crawl' function for further processing.
+
+ Note: The 'crawl' function is imported from the 'utils' module.
+
+ """
+
  import time
  import urllib.request
  from datetime import datetime, timedelta
  from lxml import etree
  from utils import crawl

+ # Policy Interpretation
  i = 1
  while i > -1:
  if i == 1:
@@ -35,6 +56,7 @@ while i > -1:
  except Exception as error:
  print(error)

+ # Data Interpretation
  i = 1
  while i > -1:
  if i == 1:
@@ -64,4 +86,4 @@ while i > -1:
  article['category']= "Data Interpretation"
  crawl(url, article)
  except Exception as error:
- print(error)
+ print(error)
stats.py CHANGED
@@ -1,4 +1,18 @@
- import uuid
+ """
+ This script is used to crawl data from the website https://www.stats.gov.cn/sj/sjjd/.
+ It retrieves articles from the website and extracts relevant information from each article.
+
+ The script starts by iterating over the pages of the website, starting from the first page.
+ For each page, it retrieves the HTML content and parses it using the lxml library.
+ It then extracts the list of articles from the parsed HTML.
+ For each article, it extracts the publication date and checks if it is within the last 6 months.
+ If the article is within the last 6 months, it extracts the URL and crawls the article to extract additional information.
+
+ The extracted information is stored in a dictionary and can be further processed or saved as needed.
+
+ Note: This script requires the 'utils' module, which contains the 'encode' and 'crawl' functions.
+ """
+
  import time
  import urllib.request
  from datetime import datetime, timedelta
@@ -34,4 +48,4 @@ while i > -1:
  article['category']= "Data Interpretation"
  crawl(url, article)
  except Exception as error:
- print(error)
+ print(error)
utils.py CHANGED
@@ -1,4 +1,4 @@
- """Utilis Functions"""
+ """Module to define utility function"""
  import os
  import re
  import json
@@ -32,7 +32,11 @@ with open('patterns.json', 'r', encoding='UTF-8') as f:
  patterns = json.load(f)

  def get_client_connection():
- """Get dynamoDB connection"""
+ """
+ Returns a client connection to DynamoDB.
+
+ :return: DynamoDB client connection
+ """
  dynamodb = boto3.client(
  service_name='dynamodb',
  region_name='us-east-1',
@@ -42,6 +46,15 @@ def get_client_connection():
  return dynamodb

  def update_reference(report):
+ """
+ Updates the reference in the 'reference_china' table in DynamoDB.
+
+ Args:
+ report (dict): A dictionary containing the report details.
+
+ Returns:
+ None
+ """
  dynamodb = get_client_connection()
  response = dynamodb.update_item(
  TableName="reference_china",
@@ -59,7 +72,15 @@ def update_reference(report):
  print(response)

  def download_files_from_s3(folder):
- """Download Data Files"""
+ """
+ Downloads Parquet files from an S3 bucket and returns a concatenated DataFrame.
+
+ Args:
+ folder (str): The folder in the S3 bucket to download files from.
+
+ Returns:
+ pandas.DataFrame: A concatenated DataFrame containing the data from the downloaded Parquet files.
+ """
  if not os.path.exists(folder):
  os.makedirs(folder)
  client = boto3.client(
@@ -76,6 +97,20 @@ def download_files_from_s3(folder):
  return pd.concat([pd.read_parquet(file_path) for file_path in file_paths], ignore_index=True)

  def extract_from_pdf_by_pattern(url, pattern):
+ """
+ Extracts text from a PDF file based on a given pattern.
+
+ Args:
+ url (str): The URL of the PDF file to extract text from.
+ pattern (dict): A dictionary containing the pattern to match and the pages to extract text from.
+
+ Returns:
+ str: The extracted text from the PDF file.
+
+ Raises:
+ Exception: If there is an error while retrieving or processing the PDF file.
+
+ """
  # Send a GET request to the URL and retrieve the PDF content
  try:
  response = requests.get(url, timeout=60)
@@ -104,15 +139,44 @@ def extract_from_pdf_by_pattern(url, pattern):
  return extracted_text.replace('?\n', '?-\n').replace('!\n', '!-\n').replace('。\n', '。-\n').replace('\n',' ').replace('?-','?\n').replace('!-','!\n').replace('。-','。\n')

  def get_reference_by_regex(pattern, text):
+ """
+ Finds all occurrences of a given regex pattern in the provided text.
+
+ Args:
+ pattern (str): The regex pattern to search for.
+ text (str): The text to search within.
+
+ Returns:
+ list: A list of all matches found in the text.
+ """
  return re.findall(pattern, text)

  def isnot_substring(list_a, string_to_check):
+ """
+ Check if any string in the given list is a substring of the string_to_check.
+
+ Args:
+ list_a (list): A list of strings to check.
+ string_to_check (str): The string to check for substrings.
+
+ Returns:
+ bool: True if none of the strings in list_a are substrings of string_to_check, False otherwise.
+ """
  for s in list_a:
  if s in string_to_check:
  return False
  return True

  def extract_reference(row):
+ """
+ Extracts reference information from a given row.
+
+ Args:
+ row (dict): A dictionary representing a row of data.
+
+ Returns:
+ None
+ """
  try:
  pattern = next((elem for elem in patterns if elem['site'] == row['site']), None)
  extracted_text = extract_from_pdf_by_pattern(row['attachment'],pattern)
@@ -189,10 +253,31 @@ def extract_reference(row):
  print(error)

  def translate(text):
+ """
+ Translates the given text to English.
+
+ Args:
+ text (str): The text to be translated.
+
+ Returns:
+ str: The translated text in English.
+ """
  return translator.translate(text, dest='en').text

  def datemodifier(date_string, date_format):
- """Date Modifier Function"""
+ """Date Modifier Function
+
+ This function takes a date string and a date format as input and modifies the date string
+ according to the specified format. It returns the modified date string in the format 'YYYY-MM-DD'.
+
+ Args:
+ date_string (str): The date string to be modified.
+ date_format (str): The format of the date string.
+
+ Returns:
+ str: The modified date string in the format 'YYYY-MM-DD'.
+ False: If an error occurs during the modification process.
+ """
  try:
  to_date = time.strptime(date_string,date_format)
  return time.strftime("%Y-%m-%d",to_date)
@@ -200,20 +285,51 @@ def datemodifier(date_string, date_format):
  return False

  def fetch_url(url):
- response = requests.get(url, timeout = 60)
+ """
+ Fetches the content of a given URL.
+
+ Args:
+ url (str): The URL to fetch.
+
+ Returns:
+ str or None: The content of the URL if the request is successful (status code 200),
+ otherwise None.
+
+ Raises:
+ requests.exceptions.RequestException: If there is an error while making the request.
+
+ """
+ response = requests.get(url, timeout=60)
  if response.status_code == 200:
  return response.text
  else:
  return None

  def translist(infolist):
- """Translist Function"""
+ """
+ Filter and transform a list of strings.
+
+ Args:
+ infolist (list): The input list of strings.
+
+ Returns:
+ list: The filtered and transformed list of strings.
+ """
  out = list(filter(lambda s: s and
- (isinstance (s,str) or len(s.strip()) > 0), [i.strip() for i in infolist]))
+ (isinstance(s, str) or len(s.strip()) > 0), [i.strip() for i in infolist]))
  return out

  def encode(content):
- """Encode Function"""
+ """
+ Encodes the given content into a single string.
+
+ Args:
+ content (list): A list of elements to be encoded. Each element can be either a string or an `etree._Element` object.
+
+ Returns:
+ str: The encoded content as a single string.
+
+ """
  text = ''
  for element in content:
  if isinstance(element, etree._Element):
@@ -228,7 +344,16 @@ def encode(content):
  return text

  def encode_content(content):
- """Encode Function"""
+ """
+ Encodes the content by removing unnecessary characters and extracting a summary.
+
+ Args:
+ content (list): A list of elements representing the content.
+
+ Returns:
+ tuple: A tuple containing the encoded text and the summary.
+
+ """
  text = ''
  for element in content:
  if isinstance(element, etree._Element):
@@ -252,6 +377,18 @@ def encode_content(content):
  return text, summary

  def extract_from_pdf(url):
+ """
+ Extracts text from a PDF file given its URL.
+
+ Args:
+ url (str): The URL of the PDF file.
+
+ Returns:
+ tuple: A tuple containing the extracted text and a summary of the text.
+
+ Raises:
+ Exception: If there is an error during the extraction process.
+ """
  # Send a GET request to the URL and retrieve the PDF content
  response = requests.get(url, timeout=60)
  pdf_content = response.content
@@ -281,16 +418,30 @@ def extract_from_pdf(url):
  return extracted_text, summary

  def get_db_connection():
- """Get dynamoDB connection"""
+ """Get dynamoDB connection.
+
+ Returns:
+ boto3.resource: The DynamoDB resource object representing the connection.
+ """
  dynamodb = boto3.resource(
- service_name='dynamodb',
- region_name='us-east-1',
- aws_access_key_id=AWS_ACCESS_KEY_ID,
- aws_secret_access_key=AWS_SECRET_ACCESS_KEY
+ service_name='dynamodb',
+ region_name='us-east-1',
+ aws_access_key_id=AWS_ACCESS_KEY_ID,
+ aws_secret_access_key=AWS_SECRET_ACCESS_KEY
  )
  return dynamodb

  def sentiment_computation(content):
+ """
+ Compute the sentiment score and label for the given content.
+
+ Parameters:
+ content (str): The content for which sentiment needs to be computed.
+
+ Returns:
+ tuple: A tuple containing the sentiment score and label. The sentiment score is a float representing the overall sentiment score of the content. The sentiment label is a string representing the sentiment label ('+', '-', or '0').
+
+ """
  label_dict = {
  "positive": "+",
  "negative": "-",
@@ -314,6 +465,20 @@ def sentiment_computation(content):
  return sentiment_score, label_dict[sentiment_label]

  def crawl(url, article):
+ """
+ Crawls the given URL and extracts relevant information from the webpage.
+
+ Args:
+ url (str): The URL of the webpage to crawl.
+ article (dict): A dictionary to store the extracted information.
+
+ Returns:
+ None: If the length of the extracted content is less than 10 characters.
+
+ Raises:
+ None
+
+ """
  domain = '.'.join(urlparse(url).netloc.split('.')[1:])
  req = urllib.request.urlopen(url)
  text = req.read()
@@ -351,10 +516,18 @@ def crawl(url, article):
  update_content(article)

  def upsert_content(report):
- """Upsert the content records"""
+ """
+ Upserts the content of a report into the 'article_china' table in DynamoDB.
+
+ Args:
+ report (dict): A dictionary containing the report data.
+
+ Returns:
+ dict: The response from the DynamoDB put_item operation.
+ """
  dynamodb = get_db_connection()
  table = dynamodb.Table('article_china')
- # Define the item data
+ # Define the item data
  item = {
  'id': str(report['id']),
  'site': report['site'],
@@ -377,54 +550,71 @@ def upsert_content(report):
  response = table.put_item(Item=item)
  print(response)

- # def get_client_connection():
- # """Get dynamoDB connection"""
- # dynamodb = boto3.client(
- # service_name='dynamodb',
- # region_name='us-east-1',
- # aws_access_key_id=AWS_ACCESS_KEY_ID,
- # aws_secret_access_key=AWS_SECRET_ACCESS_KEY
- # )
- # return dynamodb
-
  def delete_records(item):
+ """
+ Deletes a record from the 'article_test' table in DynamoDB.
+
+ Args:
+ item (dict): The item to be deleted, containing 'id' and 'site' keys.
+
+ Returns:
+ None
+ """
  dynamodb_client = get_client_connection()
  dynamodb_client.delete_item(
- TableName="article_test",
- Key={
- 'id': {'S': item['id']},
- 'site': {'S': item['site']}
- }
- )
+ TableName="article_test",
+ Key={
+ 'id': {'S': item['id']},
+ 'site': {'S': item['site']}
+ }
+ )

  def update_content(report):
+ """
+ Updates the content of an article in the 'article_china' table in DynamoDB.
+
+ Args:
+ report (dict): A dictionary containing the report data.
+
+ Returns:
+ None
+ """
  dynamodb = get_client_connection()
  response = dynamodb.update_item(
- TableName="article_china",
- Key={
- 'id': {'S': str(report['id'])},
- 'site': {'S': report['site']}
- },
- UpdateExpression='SET title = :title, titleCN = :titleCN, contentCN = :contentCN, category = :category, author = :author, content = :content, subtitle = :subtitle, publishDate = :publishDate, link = :link, attachment = :attachment, sentimentScore = :sentimentScore, sentimentLabel = :sentimentLabel, LastModifiedDate = :LastModifiedDate',
- ExpressionAttributeValues={
- ':title': {'S': report['title']},
- ':titleCN': {'S': report['titleCN']},
- ':contentCN': {'S': report['contentCN']},
- ':category': {'S': report['category']},
- ':author': {'S': report['author']},
- ':content': {'S': report['content']},
- ':subtitle': {'S': report['subtitle']},
- ':publishDate': {'S': report['publishDate']},
- ':link': {'S': report['link']},
- ':attachment': {'S': report['attachment']},
- ':LastModifiedDate': {'S': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")},
- ':sentimentScore': {'N': str(Decimal(str(report['sentimentScore'])).quantize(Decimal('0.01')))},
- ':sentimentLabel': {'S': report['sentimentLabel']}
- }
- )
+ TableName="article_china",
+ Key={
+ 'id': {'S': str(report['id'])},
+ 'site': {'S': report['site']}
+ },
+ UpdateExpression='SET title = :title, titleCN = :titleCN, contentCN = :contentCN, category = :category, author = :author, content = :content, subtitle = :subtitle, publishDate = :publishDate, link = :link, attachment = :attachment, sentimentScore = :sentimentScore, sentimentLabel = :sentimentLabel, LastModifiedDate = :LastModifiedDate',
+ ExpressionAttributeValues={
+ ':title': {'S': report['title']},
+ ':titleCN': {'S': report['titleCN']},
+ ':contentCN': {'S': report['contentCN']},
+ ':category': {'S': report['category']},
+ ':author': {'S': report['author']},
+ ':content': {'S': report['content']},
+ ':subtitle': {'S': report['subtitle']},
+ ':publishDate': {'S': report['publishDate']},
+ ':link': {'S': report['link']},
+ ':attachment': {'S': report['attachment']},
+ ':LastModifiedDate': {'S': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")},
+ ':sentimentScore': {'N': str(Decimal(str(report['sentimentScore'])).quantize(Decimal('0.01')))},
+ ':sentimentLabel': {'S': report['sentimentLabel']}
+ }
+ )
  print(response)

  def update_content_sentiment(report):
+ """
+ Updates the sentiment score and label of an article in the 'article_test' DynamoDB table.
+
+ Args:
+ report (dict): A dictionary containing the report information.
+
+ Returns:
+ None
+ """
  dynamodb = get_client_connection()
  response = dynamodb.update_item(
  TableName="article_test",
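The docstrings added to utils.py above document several small, pure helpers; a quick usage sketch of two of them, with return values following those docstrings:

```python
from utils import datemodifier, isnot_substring

# Normalize a source-specific date string to YYYY-MM-DD.
print(datemodifier("2024.01.05", "%Y.%m.%d"))   # "2024-01-05"
print(datemodifier("not a date", "%Y.%m.%d"))   # False on a parse failure

# True only when none of the given strings occur in the candidate.
print(isnot_substring(["政策", "解读"], "宏观数据点评"))  # True
```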