gavinzli committed
Commit 39fe3d1 · 1 Parent(s): d860eae

chore: Add script descriptions and improve code readability

Files changed (16)
  1. README.md +65 -1
  2. cbirc.py +14 -4
  3. chinatax.py +14 -1
  4. csrc.py +23 -2
  5. daily.py +29 -15
  6. eastmoney.py +22 -0
  7. glue.py +6 -2
  8. gov.py +20 -0
  9. manual_upload.py +23 -6
  10. mof.py +15 -1
  11. mofcom.py +6 -0
  12. ndrc.py +13 -1
  13. pbc.py +15 -0
  14. safe.py +23 -1
  15. stats.py +16 -2
  16. utils.py +244 -55
README.md CHANGED
@@ -1 +1,65 @@
- # security-report-collection
+ # security-report-collection
+
+ The `main.py` file is a Python script that performs sentiment analysis on articles. Here's a breakdown of the code:
+
+ - Importing Libraries:
+   - The script starts by importing the necessary libraries:
+ ```python
+ import os
+ import glob
+ import warnings
+ from decimal import Decimal
+ import pandas as pd
+ import boto3
+ ```
+ These cover file and path handling (os), file searching (glob), warning suppression (warnings), decimal arithmetic (decimal), data processing (pandas), and AWS access (boto3).
+
+ - Defining Functions:
+   - The script defines three functions:
+     1. `get_db_connection()`: establishes a connection to Amazon DynamoDB using the AWS access key ID and secret access key.
+     2. `download_files_from_s3()`: downloads Parquet files from the "oe-data-poc" S3 bucket and concatenates them into a pandas DataFrame.
+     3. `gen_sentiment(record, table_name, label_dict)`: computes the sentiment score for the article in the given record, using the Hugging Face Transformers library to analyze the text, and updates the DynamoDB table with the sentiment score and label (see the illustrative sketch after this diff).
+
+ - Main Program:
+   - The script's main program:
+ ```python
+ if __name__ == "__main__":
+     # Define a dictionary mapping sentiment labels to symbols
+     label = {
+         "positive": "+",
+         "negative": "-",
+         "neutral": "0",
+     }
+
+     # Download files from S3 and keep the rows that still need scoring
+     df = download_files_from_s3()
+     df = df[(~df['content'].isnull()) & (df['sentimentscore'].isnull())]
+
+     # Iterate through each row in the DataFrame
+     for _, row in df.iterrows():
+         # Compute the sentiment score and update the DynamoDB table
+         gen_sentiment(row, 'article', label)
+ ```
+
+ The main program defines a dictionary mapping sentiment labels to symbols (e.g., "+" for positive sentiment), downloads the files from S3, keeps only the rows that have content but no sentiment score yet, and then iterates through each remaining row, computing the sentiment with `gen_sentiment()` and writing the score and label back to the DynamoDB table.
+
+ - Summary:
+   - In short, the script pulls articles from S3, scores each one, and stores the sentiment results in DynamoDB.
+
+ The `glue.py` file contains a Python script that triggers a Parquet snapshot Glue job.
+
+ Here's a breakdown of the code:
+
+ 1. It starts by importing the necessary modules:
+    - `os`: for interacting with the operating system
+    - `boto3`: the AWS SDK for Python, used here to talk to AWS Glue
+
+ 2. It then reads two environment variables:
+    - `AWS_ACCESS_KEY_ID`
+    - `AWS_SECRET_ACCESS_KEY`
+
+ 3. The script defines a function called `get_client_connection()` that returns a Boto3 client for the Glue service; this client is used to interact with AWS Glue.
+
+ 4. Finally, it uses this client to start a job run named 'Ner Snapshot' and prints the response from AWS Glue.
+
+ In summary, `glue.py` sets up its AWS credentials and starts a Glue job to create a Parquet snapshot.
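As promised above, here is a minimal sketch of what a `gen_sentiment()` helper like the one described for `main.py` could look like. Only the overall flow (Transformers pipeline → score → DynamoDB update) comes from the description; the model choice, key schema, and attribute names are assumptions for illustration.

```python
# Hedged sketch of a gen_sentiment()-style helper; model, key schema and
# attribute names are illustrative assumptions, not the repository's actual code.
from decimal import Decimal

import boto3
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # assumed model; the repo may pin a specific one


def gen_sentiment(record, table_name, label_dict):
    """Score one article and write the result back to DynamoDB (illustrative)."""
    result = classifier(record["content"][:512])[0]  # e.g. {'label': 'positive', 'score': 0.97}
    score = Decimal(str(result["score"])).quantize(Decimal("0.01"))
    label = label_dict.get(result["label"].lower(), "0")

    table = boto3.resource("dynamodb", region_name="us-east-1").Table(table_name)
    table.update_item(
        Key={"id": record["id"], "site": record["site"]},  # assumed key schema
        UpdateExpression="SET sentimentscore = :s, sentimentlabel = :l",
        ExpressionAttributeValues={":s": score, ":l": label},
    )
```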
cbirc.py CHANGED
@@ -1,9 +1,19 @@
1
  import json
2
- import ssl
3
  import uuid
4
  import time
5
- import urllib.request
6
- import urllib3
7
  from datetime import datetime, timedelta
8
  from utils import translate, sentiment_computation, upsert_content, fetch_url, extract_from_pdf
9
 
@@ -34,7 +44,7 @@ while i > -1:
34
  article['titleCN'] = article['docSubtitle']
35
  article['title'] = translate(article['docSubtitle'])
36
  article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
37
- article['category']= "Policy Interpretation"
38
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
39
  article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
40
  article['attachment'] = ''
 
1
+ """
2
+ This script fetches data from the China Banking and Insurance Regulatory Commission (CBIRC) website and extracts relevant information from the fetched data.
3
+ The extracted information is then processed and stored in a database.
4
+
5
+ The script performs the following steps:
6
+ 1. Fetches data from the CBIRC website by making HTTP requests.
7
+ 2. Parses the fetched data and extracts relevant information.
8
+ 3. Translates the extracted information to English.
9
+ 4. Computes sentiment scores for the translated content.
10
+ 5. Stores the processed information in a database.
11
+
12
+ Note: The script also includes commented code for fetching data from the State Taxation Administration of China website, but it is currently disabled.
13
+ """
14
  import json
 
15
  import uuid
16
  import time
 
 
17
  from datetime import datetime, timedelta
18
  from utils import translate, sentiment_computation, upsert_content, fetch_url, extract_from_pdf
19
 
 
44
  article['titleCN'] = article['docSubtitle']
45
  article['title'] = translate(article['docSubtitle'])
46
  article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
47
+ article['category']= "Policy Interpretation"
48
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
49
  article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
50
  article['attachment'] = ''
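To make the steps in the new docstring concrete, here is a hedged sketch of the fetch-and-process loop. The endpoint URL, query parameters, and JSON layout are placeholders rather than the actual CBIRC API contract; the field names mirror the hunk above and the helpers come from `utils.py`.

```python
# Illustrative only: the endpoint and JSON shape are assumptions.
import json
import uuid

from utils import fetch_url, translate, extract_from_pdf, sentiment_computation, upsert_content

page = 1
while True:
    payload = fetch_url(f"https://www.cbirc.gov.cn/.../itemList?pageIndex={page}")  # placeholder URL
    rows = json.loads(payload).get("data", {}).get("rows", []) if payload else []
    if not rows:
        break
    for row in rows:
        article = {
            "titleCN": row["docSubtitle"],
            "title": translate(row["docSubtitle"]),
            "link": "https://www.cbirc.gov.cn" + str(row["pdfFileUrl"]),
            "publishDate": row["publishDate"],
            "category": "Policy Interpretation",
            "attachment": "",
        }
        article["id"] = uuid.uuid5(uuid.NAMESPACE_OID, article["titleCN"] + article["publishDate"])
        article["content"], _ = extract_from_pdf(article["link"])
        article["sentimentScore"], article["sentimentLabel"] = sentiment_computation(article["content"])
        upsert_content(article)  # the real records carry more fields (site, author, ...)
    page += 1
```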
chinatax.py CHANGED
@@ -1,3 +1,16 @@
1
  import json
2
  import ssl
3
  import uuid
@@ -6,7 +19,7 @@ import time
6
  import urllib.request
7
  import urllib3
8
  from lxml import etree
9
- from utils import encode, translate, sentiment_computation, upsert_content, encode_content
10
 
11
  ssl._create_default_https_context = ssl._create_stdlib_context
12
 
 
1
+ """
2
+ This script is used for data collection from the China Taxation website. It retrieves policy interpretation articles and processes them for further analysis.
3
+
4
+ The script performs the following steps:
5
+ 1. Imports necessary modules and libraries.
6
+ 2. Defines the base URL for retrieving policy interpretation articles.
7
+ 3. Iterates through the pages of the search results.
8
+ 4. Retrieves the content of each article.
9
+ 5. Processes the content by translating it to English and performing sentiment analysis.
10
+ 6. Stores the processed data in a database.
11
+
12
+ Note: The script also retrieves additional articles from a different URL and follows a similar process.
13
+ """
14
  import json
15
  import ssl
16
  import uuid
 
19
  import urllib.request
20
  import urllib3
21
  from lxml import etree
22
+ from utils import translate, sentiment_computation, upsert_content, encode_content
23
 
24
  ssl._create_default_https_context = ssl._create_stdlib_context
25
 
csrc.py CHANGED
@@ -1,3 +1,24 @@
1
  import uuid
2
  import json
3
  import time
@@ -35,7 +56,7 @@ while i > -1:
35
  article['category']= "Policy Interpretation"
36
  crawl(url, article)
37
  except Exception as error:
38
- print(error)
39
 
40
  i = 1
41
  while i > -1:
@@ -70,4 +91,4 @@ while i > -1:
70
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
71
  upsert_content(article)
72
  except Exception as error:
73
- print(error)
 
1
+ """
2
+ This script is used to crawl and collect data from the website of the China Securities Regulatory Commission (CSRC).
3
+ It retrieves policy interpretation articles and financial news articles from the CSRC website.
4
+ The collected data is then processed and stored in a database.
5
+
6
+ The script consists of two main parts:
7
+ 1. Crawl and process policy interpretation articles from the CSRC website.
8
+ 2. Crawl and process financial news articles from the CSRC website.
9
+
10
+ The script uses various libraries and functions to handle web scraping, data processing, and database operations.
11
+
12
+ Note: This script assumes the presence of the following dependencies:
13
+ - urllib
14
+ - lxml
15
+ - json
16
+ - datetime
17
+ - time
18
+ - utils (custom module)
19
+
20
+ Please make sure to install these dependencies before running the script.
21
+ """
22
  import uuid
23
  import json
24
  import time
 
56
  article['category']= "Policy Interpretation"
57
  crawl(url, article)
58
  except Exception as error:
59
+ print(error)
60
 
61
  i = 1
62
  while i > -1:
 
91
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
92
  upsert_content(article)
93
  except Exception as error:
94
+ print(error)
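Both loops above derive `article['id']` with `uuid.uuid5(uuid.NAMESPACE_OID, titleCN + publishDate)`. A quick note on why: `uuid5` is deterministic, so re-crawling the same article produces the same key and the DynamoDB upsert overwrites the existing item instead of creating a duplicate. The inputs below are illustrative.

```python
import uuid

a = uuid.uuid5(uuid.NAMESPACE_OID, "政策解读示例" + "2024-01-15")
b = uuid.uuid5(uuid.NAMESPACE_OID, "政策解读示例" + "2024-01-15")
assert a == b  # same inputs, same UUID on every run
print(a)       # stable identifier stored as article['id']
```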
daily.py CHANGED
@@ -1,21 +1,21 @@
1
- import os
2
  import json
3
- import uuid
4
  import time
5
  import urllib.request
6
- from lxml import etree
7
  from datetime import datetime, timedelta
8
  from urllib.parse import urlparse
9
- from utils import (encode,
10
- translate,
11
- sentiment_computation,
12
- fetch_url,
13
- extract_from_pdf,
14
- crawl,
15
- datemodifier,
16
- encode_content,
17
- update_content,
18
- extract_reference)
19
 
20
  with open('xpath.json', 'r', encoding='UTF-8') as f:
21
  xpath_dict = json.load(f)
@@ -50,7 +50,7 @@ while i > -1:
50
  article['titleCN'] = article['docSubtitle']
51
  article['title'] = translate(article['docSubtitle'])
52
  article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
53
- article['category']= "Policy Interpretation"
54
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
55
  article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
56
  article['attachment'] = ''
@@ -133,6 +133,20 @@ while i > -1:
133
 
134
  print("data.eastmoney.com")
135
  def crawl_eastmoney(url, article):
136
  domain = urlparse(url).netloc
137
  req = urllib.request.urlopen(url)
138
  text = req.read()
@@ -499,4 +513,4 @@ while i > -1:
499
  article['category']= "Data Interpretation"
500
  crawl(url, article)
501
  except Exception as error:
502
- print(error)
 
1
+ """
2
+ This script is responsible for collecting data from various websites related to financial and policy information in China.
3
+ It fetches data from different sources, extracts relevant information, translates it, and updates the content accordingly.
4
+ The collected data includes policy interpretations, financial news, macroeconomic research, and more.
5
+ """
6
  import json
7
+ import os
8
  import time
9
  import urllib.request
10
+ import uuid
11
  from datetime import datetime, timedelta
12
  from urllib.parse import urlparse
13
+
14
+ from lxml import etree
15
+
16
+ from utils import (crawl, datemodifier, encode, encode_content,
17
+ extract_from_pdf, extract_reference, fetch_url,
18
+ sentiment_computation, translate, update_content)
19
 
20
  with open('xpath.json', 'r', encoding='UTF-8') as f:
21
  xpath_dict = json.load(f)
 
50
  article['titleCN'] = article['docSubtitle']
51
  article['title'] = translate(article['docSubtitle'])
52
  article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
53
+ article['category']= "Policy Interpretation"
54
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
55
  article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
56
  article['attachment'] = ''
 
133
 
134
  print("data.eastmoney.com")
135
  def crawl_eastmoney(url, article):
136
+ """
137
+ Crawls the given URL and extracts information from the webpage.
138
+
139
+ Args:
140
+ url (str): The URL of the webpage to crawl.
141
+ article (dict): A dictionary to store the extracted information.
142
+
143
+ Returns:
144
+ None: If the length of the extracted content is less than 10 characters.
145
+
146
+ Raises:
147
+ None.
148
+
149
+ """
150
  domain = urlparse(url).netloc
151
  req = urllib.request.urlopen(url)
152
  text = req.read()
 
513
  article['category']= "Data Interpretation"
514
  crawl(url, article)
515
  except Exception as error:
516
+ print(error)
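`daily.py` drives most of its per-site extraction from `xpath.json`, keyed (as the `crawl_eastmoney` helper above suggests) by the page's domain via `urlparse(url).netloc`. The exact structure of that file is not shown in this diff, so the sketch below assumes a simple `{domain: {field: xpath}}` layout.

```python
# Assumed xpath.json layout: {"data.eastmoney.com": {"title": "...", "publishdate": "..."}, ...}
import json
import urllib.request
from urllib.parse import urlparse

from lxml import etree

with open('xpath.json', 'r', encoding='UTF-8') as f:
    xpath_dict = json.load(f)

def extract_fields(url):
    """Illustrative per-domain extraction using the rules loaded above."""
    domain = urlparse(url).netloc
    rules = xpath_dict[domain]
    page = etree.HTML(urllib.request.urlopen(url).read())
    return {field: page.xpath(xp) for field, xp in rules.items()}
```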
eastmoney.py CHANGED
@@ -1,3 +1,9 @@
1
  import uuid
2
  import json
3
  import urllib.request
@@ -6,10 +12,26 @@ from datetime import datetime, timedelta
6
  from lxml import etree
7
  from utils import encode, translate, datemodifier, sentiment_computation, upsert_content, fetch_url, encode_content
8
 
 
9
  with open('xpath.json', 'r', encoding='UTF-8') as f:
10
  xpath_dict = json.load(f)
11
 
12
  def crawl(url, article):
13
  domain = urlparse(url).netloc
14
  req = urllib.request.urlopen(url)
15
  text = req.read()
 
1
+ """
2
+ This script is used to crawl a webpage and extract relevant information from it. It defines a function `crawl` that takes a URL and a dictionary to store the extracted information. The function crawls the webpage, extracts the content, translates it to English, and stores it in the dictionary.
3
+
4
+ The script also includes a main loop that fetches data from a specific URL and calls the `crawl` function for each article in the fetched data.
5
+ """
6
+
7
  import uuid
8
  import json
9
  import urllib.request
 
12
  from lxml import etree
13
  from utils import encode, translate, datemodifier, sentiment_computation, upsert_content, fetch_url, encode_content
14
 
15
+ # Load XPath dictionary from a JSON file
16
  with open('xpath.json', 'r', encoding='UTF-8') as f:
17
  xpath_dict = json.load(f)
18
 
19
  def crawl(url, article):
20
+ """
21
+ Crawls the given URL and extracts relevant information from the webpage.
22
+
23
+ Args:
24
+ url (str): The URL of the webpage to crawl.
25
+ article (dict): A dictionary to store the extracted information.
26
+
27
+ Returns:
28
+ None: If the length of the extracted content is less than 10 characters.
29
+ str: The extracted content in English if successful.
30
+
31
+ Raises:
32
+ None
33
+
34
+ """
35
  domain = urlparse(url).netloc
36
  req = urllib.request.urlopen(url)
37
  text = req.read()
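For readers following the `crawl()` body above, the usual next step after `req.read()` is to decode the bytes and hand them to lxml. A hedged sketch of that continuation; the URL, encoding, and XPath are placeholders, not the script's actual values.

```python
import urllib.request

from lxml import etree

req = urllib.request.urlopen("https://example.com/")        # placeholder page
text = req.read()

page = etree.HTML(text.decode("utf-8", errors="ignore"))    # encoding assumed
title_nodes = page.xpath("//title/text()")                   # illustrative XPath
print(title_nodes)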
glue.py CHANGED
@@ -6,7 +6,11 @@ AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
6
  AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']
7
 
8
  def get_client_connection():
9
- """Get dynamoDB connection"""
 
 
 
 
10
  return boto3.client(
11
  service_name='glue',
12
  region_name='us-east-1',
@@ -22,4 +26,4 @@ print(response)
22
  response = glue.start_job_run(
23
  JobName='Reference China'
24
  )
25
- print(response)
 
6
  AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']
7
 
8
  def get_client_connection():
9
+ """
10
+ Returns a client connection to the AWS Glue service.
11
+
12
+ :return: AWS Glue client connection
13
+ """
14
  return boto3.client(
15
  service_name='glue',
16
  region_name='us-east-1',
 
26
  response = glue.start_job_run(
27
  JobName='Reference China'
28
  )
29
+ print(response)
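A short usage note for the helper documented above, assuming `get_client_connection()` from `glue.py` is in scope. Starting the job is what the script already does; the status check is an optional extra using the standard boto3 Glue `get_job_run` call.

```python
glue = get_client_connection()

run = glue.start_job_run(JobName='Reference China')   # JobName taken from this diff
print(run['JobRunId'])

# Optional follow-up: inspect the run's state.
status = glue.get_job_run(JobName='Reference China', RunId=run['JobRunId'])
print(status['JobRun']['JobRunState'])                # e.g. RUNNING, SUCCEEDED, FAILED
```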
gov.py CHANGED
@@ -1,3 +1,23 @@
1
  from datetime import datetime, timedelta
2
  import time
3
  import urllib.request
 
1
+ """
2
+ This script is used to crawl and collect policy articles from the official website of the State Council of China (https://www.gov.cn).
3
+
4
+ The script contains two main functions:
5
+ 1. crawl(url, article): This function is responsible for crawling a specific policy article given its URL and extracting relevant information such as title, author, content, publish date, etc.
6
+ 2. main(): This function is the entry point of the script. It iterates over different pages of policy articles and calls the crawl function to collect the information.
7
+
8
+ Note: The script imports the following modules: datetime, timedelta, time, urllib.request, lxml.etree, and utils (custom module).
9
+ """
10
+
11
+ from datetime import datetime, timedelta
12
+ import time
13
+ import urllib.request
14
+ from lxml import etree
15
+ from utils import crawl
16
+
17
+ # Rest of the code...
18
+ """
19
+
20
+ """
21
  from datetime import datetime, timedelta
22
  import time
23
  import urllib.request
manual_upload.py CHANGED
@@ -1,7 +1,27 @@
1
- from decimal import Decimal
2
- from utils import translate, sentiment_computation, get_db_connection
3
- from datetime import datetime
4
  import uuid
5
 
6
  # User input for the article content
7
  article_titleCN = input("Enter the title of the article: ")
@@ -12,7 +32,6 @@ article_publish_date = input("Enter the publish date of the article (YYYY-MM-DD)
12
  article_link = input("Enter the link to the article: ")
13
  article_siteCN = input("Enter the site of the article: ")
14
 
15
-
16
  # Compute sentiment of the translated content
17
  sentiment_score, sentiment_label = sentiment_computation(article_contentCN)
18
 
@@ -30,8 +49,6 @@ article= {
30
  'publishDate': article_publish_date,
31
  'link': article_link,
32
  'attachment': '',
33
- # 'authorID': str(report['authorid']),
34
- # 'entityList': report['entitylist'],
35
  'sentimentScore': Decimal(str(sentiment_score)).quantize(Decimal('0.01')),
36
  'sentimentLabel': sentiment_label,
37
  'LastModifiedDate': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
 
1
+ """
2
+ This script allows the user to manually upload an article to a database. It prompts the user to enter various details about the article, such as the title, content, subtitle, publish date, link, and site. It then computes the sentiment of the article's translated content and constructs a dictionary representing the article. Finally, it inserts or updates the article in the database.
3
+
4
+ Dependencies:
5
+ - decimal
6
+ - utils (custom module)
7
+ - datetime
8
+ - uuid
9
+
10
+ Usage:
11
+ 1. Run the script.
12
+ 2. Enter the required details about the article when prompted.
13
+ 3. The script will compute the sentiment of the translated content and construct a dictionary representing the article.
14
+ 4. The article will be inserted or updated in the database.
15
+ 5. The article dictionary and the response from the database operation will be printed.
16
+
17
+ Note: Make sure to configure the database connection and table name before running the script.
18
+ """
19
+
20
  import uuid
21
+ from datetime import datetime
22
+ from decimal import Decimal
23
+
24
+ from utils import get_db_connection, sentiment_computation, translate
25
 
26
  # User input for the article content
27
  article_titleCN = input("Enter the title of the article: ")
 
32
  article_link = input("Enter the link to the article: ")
33
  article_siteCN = input("Enter the site of the article: ")
34
 
 
35
  # Compute sentiment of the translated content
36
  sentiment_score, sentiment_label = sentiment_computation(article_contentCN)
37
 
 
49
  'publishDate': article_publish_date,
50
  'link': article_link,
51
  'attachment': '',
 
 
52
  'sentimentScore': Decimal(str(sentiment_score)).quantize(Decimal('0.01')),
53
  'sentimentLabel': sentiment_label,
54
  'LastModifiedDate': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
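One detail worth calling out in the hunk above: the sentiment score is stored as `Decimal(str(sentiment_score)).quantize(Decimal('0.01'))`. boto3's DynamoDB layer does not accept Python floats, and round-tripping through `str()` avoids pulling binary floating-point noise into the Decimal before rounding. The sample value is illustrative.

```python
from decimal import Decimal

score = 0.8765432                                      # float as returned by sentiment_computation()
print(Decimal(score))                                  # long binary-float expansion, not what we want
print(Decimal(str(score)).quantize(Decimal('0.01')))   # 0.88 - two decimal places, DynamoDB-friendly
```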
mof.py CHANGED
@@ -1,9 +1,22 @@
1
  import time
2
  import urllib.request
3
- from lxml import etree
4
  from datetime import datetime, timedelta
 
 
 
5
  from utils import crawl
6
 
 
7
  i = 0
8
  while i > -1:
9
  if i == 0:
@@ -38,6 +51,7 @@ while i > -1:
38
  except Exception as error:
39
  print(error)
40
 
 
41
  i = 0
42
  while i > -1:
43
  if i == 0:
 
1
+ """
2
+ This script is used to crawl and collect financial news and policy interpretation articles from the website of the Ministry of Finance of China (https://www.mof.gov.cn/).
3
+
4
+ The script iterates through the pages of the "Financial News" and "Policy Interpretation" categories on the website and extracts the articles' URLs. It then calls the `crawl` function from the `utils` module to crawl and collect the article data.
5
+
6
+ The script uses the `lxml` library to parse the HTML content of the website and extract the necessary information.
7
+
8
+ Note: The script assumes the existence of a `crawl` function in the `utils` module.
9
+ """
10
+
11
  import time
12
  import urllib.request
 
13
  from datetime import datetime, timedelta
14
+
15
+ from lxml import etree
16
+
17
  from utils import crawl
18
 
19
+ # Crawl Financial News articles
20
  i = 0
21
  while i > -1:
22
  if i == 0:
 
51
  except Exception as error:
52
  print(error)
53
 
54
+ # Crawl Policy Interpretation articles
55
  i = 0
56
  while i > -1:
57
  if i == 0:
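The two loops above follow the same paging pattern used across these collectors: page 0 is the category's index page and later pages are numbered variants, with the loop ending once listed articles fall outside the date window. The path and the `index_{i}.htm` convention below are assumptions (the diff only shows the `if i == 0:` branch), not the real mof.gov.cn URLs.

```python
def listing_url(i):
    """Return the i-th listing page URL, assuming an index.htm / index_{i}.htm scheme."""
    base = "https://www.mof.gov.cn/.../"   # placeholder path, not the real listing URL
    return base + "index.htm" if i == 0 else base + f"index_{i}.htm"

for i in range(3):
    print(listing_url(i))
```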
mofcom.py CHANGED
@@ -1,3 +1,9 @@
1
  import time
2
  import urllib.request
3
  from datetime import datetime, timedelta
 
1
+ """
2
+ This script is used to crawl and collect data from the Ministry of Commerce of the People's Republic of China (MOFCOM) website.
3
+ It retrieves articles from different categories and extracts relevant information such as date and URL.
4
+ The collected data is then passed to the 'crawl' function for further processing.
5
+ """
6
+
7
  import time
8
  import urllib.request
9
  from datetime import datetime, timedelta
ndrc.py CHANGED
@@ -1,5 +1,17 @@
1
  from datetime import datetime, timedelta
2
- import uuid
3
  import time
4
  import urllib.request
5
  from lxml import etree
 
1
+ """
2
+ This script is used to crawl and collect data from the National Development and Reform Commission (NDRC) website.
3
+ It retrieves articles from the website and categorizes them as either "Policy Release" or "Policy Interpretation".
4
+ The script starts by iterating through the pages of the website, starting from the first page.
5
+ For each page, it retrieves the HTML content and parses it using lxml library.
6
+ It then extracts the article list from the parsed HTML and iterates through each article.
7
+ For each article, it extracts the publication date, converts it to a datetime object, and checks if it is within the last 183 days.
8
+ If the article is older than 183 days, the script stops iterating through the pages.
9
+ Otherwise, it extracts the URL of the article and categorizes it based on the URL pattern.
10
+ The script then calls the 'crawl' function from the 'utils' module to crawl the article and collect data.
11
+ Any exceptions that occur during the crawling process are caught and printed.
12
+ """
13
+
14
  from datetime import datetime, timedelta
 
15
  import time
16
  import urllib.request
17
  from lxml import etree
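The 183-day window mentioned in the docstring is a plain date comparison; a minimal sketch follows (the date string and its format are illustrative).

```python
from datetime import datetime, timedelta

cutoff = datetime.now() - timedelta(days=183)                 # roughly the last six months

publish_date = datetime.strptime("2024/01/15", "%Y/%m/%d")    # format is illustrative
if publish_date < cutoff:
    print("older than the window - stop paging")
else:
    print("within the window - crawl the article")
```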
pbc.py CHANGED
@@ -1,3 +1,18 @@
1
  import time
2
  import uuid
3
  from datetime import datetime, timedelta
 
1
+ """
2
+ This module contains code to scrape the People's Bank of China website and collect policy interpretation articles. It iterates through the pages of the website, extracts relevant information from each article, and stores the data in a database.
3
+
4
+ The main functionality of this module includes:
5
+ - Scraping the website for policy interpretation articles
6
+ - Parsing the HTML content of each article
7
+ - Extracting relevant information such as title, content, publish date, and URL
8
+ - Translating the content from Chinese to English
9
+ - Computing sentiment scores for the content
10
+ - Storing the collected data in a database
11
+
12
+ Note: This code assumes the existence of the following helper functions: encode, translate, datemodifier, sentiment_computation, and upsert_content.
13
+
14
+ """
15
+
16
  import time
17
  import uuid
18
  from datetime import datetime, timedelta
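The translation and sentiment steps listed in the docstring map onto two `utils` helpers that appear throughout this commit; a minimal usage sketch (the sample text is illustrative).

```python
from utils import translate, sentiment_computation

content_cn = "中国人民银行发布最新政策解读。"       # illustrative snippet
content_en = translate(content_cn)                  # English translation via utils.translate
score, label = sentiment_computation(content_cn)
print(content_en, score, label)                     # label is one of '+', '-', '0'
```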
safe.py CHANGED
@@ -1,9 +1,30 @@
1
  import time
2
  import urllib.request
3
  from datetime import datetime, timedelta
4
  from lxml import etree
5
  from utils import crawl
6
 
 
7
  i = 1
8
  while i > -1:
9
  if i == 1:
@@ -35,6 +56,7 @@ while i > -1:
35
  except Exception as error:
36
  print(error)
37
 
 
38
  i = 1
39
  while i > -1:
40
  if i == 1:
@@ -64,4 +86,4 @@ while i > -1:
64
  article['category']= "Data Interpretation"
65
  crawl(url, article)
66
  except Exception as error:
67
- print(error)
 
1
+ """Module to crawl the data from the website of State Administration of Foreign Exchange (SAFE) of China.
2
+
3
+ This module contains code to crawl and collect data from the website of the State Administration of Foreign Exchange (SAFE) of China. It includes two sections: Policy Interpretation and Data Interpretation.
4
+
5
+ Policy Interpretation:
6
+ - The code crawls the web pages containing policy interpretations from the SAFE website.
7
+ - It retrieves the publication date and checks if it is within the last 183 days.
8
+ - If the publication date is within the last 183 days, it extracts the URL and other information of the policy interpretation article.
9
+ - The extracted data is stored in a dictionary and passed to the 'crawl' function for further processing.
10
+
11
+ Data Interpretation:
12
+ - The code crawls the web pages containing data interpretations from the SAFE website.
13
+ - It retrieves the publication date and checks if it is within the last 183 days.
14
+ - If the publication date is within the last 183 days, it extracts the URL and other information of the data interpretation article.
15
+ - The extracted data is stored in a dictionary and passed to the 'crawl' function for further processing.
16
+
17
+ Note: The 'crawl' function is imported from the 'utils' module.
18
+
19
+ """
20
+
21
  import time
22
  import urllib.request
23
  from datetime import datetime, timedelta
24
  from lxml import etree
25
  from utils import crawl
26
 
27
+ # Policy Interpretation
28
  i = 1
29
  while i > -1:
30
  if i == 1:
 
56
  except Exception as error:
57
  print(error)
58
 
59
+ # Data Interpretation
60
  i = 1
61
  while i > -1:
62
  if i == 1:
 
86
  article['category']= "Data Interpretation"
87
  crawl(url, article)
88
  except Exception as error:
89
+ print(error)
stats.py CHANGED
@@ -1,4 +1,18 @@
1
- import uuid
2
  import time
3
  import urllib.request
4
  from datetime import datetime, timedelta
@@ -34,4 +48,4 @@ while i > -1:
34
  article['category']= "Data Interpretation"
35
  crawl(url, article)
36
  except Exception as error:
37
- print(error)
 
1
+ """
2
+ This script is used to crawl data from the website https://www.stats.gov.cn/sj/sjjd/.
3
+ It retrieves articles from the website and extracts relevant information from each article.
4
+
5
+ The script starts by iterating over the pages of the website, starting from the first page.
6
+ For each page, it retrieves the HTML content and parses it using the lxml library.
7
+ It then extracts the list of articles from the parsed HTML.
8
+ For each article, it extracts the publication date and checks if it is within the last 6 months.
9
+ If the article is within the last 6 months, it extracts the URL and crawls the article to extract additional information.
10
+
11
+ The extracted information is stored in a dictionary and can be further processed or saved as needed.
12
+
13
+ Note: This script requires the 'utils' module, which contains the 'encode' and 'crawl' functions.
14
+ """
15
+
16
  import time
17
  import urllib.request
18
  from datetime import datetime, timedelta
 
48
  article['category']= "Data Interpretation"
49
  crawl(url, article)
50
  except Exception as error:
51
+ print(error)
utils.py CHANGED
@@ -1,4 +1,4 @@
1
- """Utilis Functions"""
2
  import os
3
  import re
4
  import json
@@ -31,7 +31,11 @@ with open('patterns.json', 'r', encoding='UTF-8') as f:
31
  patterns = json.load(f)
32
 
33
  def get_client_connection():
34
- """Get dynamoDB connection"""
 
 
 
 
35
  dynamodb = boto3.client(
36
  service_name='dynamodb',
37
  region_name='us-east-1',
@@ -41,6 +45,15 @@ def get_client_connection():
41
  return dynamodb
42
 
43
  def update_reference(report):
44
  dynamodb = get_client_connection()
45
  response = dynamodb.update_item(
46
  TableName="reference_china",
@@ -58,7 +71,15 @@ def update_reference(report):
58
  print(response)
59
 
60
  def download_files_from_s3(folder):
61
- """Download Data Files"""
 
 
 
 
 
 
 
 
62
  if not os.path.exists(folder):
63
  os.makedirs(folder)
64
  client = boto3.client(
@@ -75,6 +96,20 @@ def download_files_from_s3(folder):
75
  return pd.concat([pd.read_parquet(file_path) for file_path in file_paths], ignore_index=True)
76
 
77
  def extract_from_pdf_by_pattern(url, pattern):
78
  # Send a GET request to the URL and retrieve the PDF content
79
  try:
80
  response = requests.get(url, timeout=60)
@@ -103,15 +138,44 @@ def extract_from_pdf_by_pattern(url, pattern):
103
  return extracted_text.replace('?\n', '?-\n').replace('!\n', '!-\n').replace('。\n', '。-\n').replace('\n',' ').replace('?-','?\n').replace('!-','!\n').replace('。-','。\n')
104
 
105
  def get_reference_by_regex(pattern, text):
106
  return re.findall(pattern, text)
107
 
108
  def isnot_substring(list_a, string_to_check):
109
  for s in list_a:
110
  if s in string_to_check:
111
  return False
112
  return True
113
 
114
  def extract_reference(row):
115
  try:
116
  pattern = next((elem for elem in patterns if elem['site'] == row['site']), None)
117
  extracted_text = extract_from_pdf_by_pattern(row['attachment'],pattern)
@@ -186,13 +250,33 @@ def extract_reference(row):
186
  update_reference(row)
187
  except Exception as error:
188
  print(error)
189
-
190
 
191
  def translate(text):
192
  return translator.translate(text, dest='en').text
193
 
194
  def datemodifier(date_string, date_format):
195
- """Date Modifier Function"""
 
 
 
 
 
 
 
 
 
 
 
 
196
  try:
197
  to_date = time.strptime(date_string,date_format)
198
  return time.strftime("%Y-%m-%d",to_date)
@@ -200,20 +284,51 @@ def datemodifier(date_string, date_format):
200
  return False
201
 
202
  def fetch_url(url):
203
- response = requests.get(url, timeout = 60)
204
  if response.status_code == 200:
205
  return response.text
206
  else:
207
  return None
208
 
209
  def translist(infolist):
210
- """Translist Function"""
 
 
 
 
 
 
 
 
211
  out = list(filter(lambda s: s and
212
- (isinstance (s,str) or len(s.strip()) > 0), [i.strip() for i in infolist]))
213
  return out
214
 
215
  def encode(content):
216
- """Encode Function"""
 
 
 
 
 
 
 
 
 
217
  text = ''
218
  for element in content:
219
  if isinstance(element, etree._Element):
@@ -228,7 +343,16 @@ def encode(content):
228
  return text
229
 
230
  def encode_content(content):
231
- """Encode Function"""
 
 
 
 
 
 
 
 
 
232
  text = ''
233
  for element in content:
234
  if isinstance(element, etree._Element):
@@ -252,6 +376,18 @@ def encode_content(content):
252
  return text, summary
253
 
254
  def extract_from_pdf(url):
255
  # Send a GET request to the URL and retrieve the PDF content
256
  response = requests.get(url, timeout=60)
257
  pdf_content = response.content
@@ -281,16 +417,30 @@ def extract_from_pdf(url):
281
  return extracted_text, summary
282
 
283
  def get_db_connection():
284
- """Get dynamoDB connection"""
 
 
 
 
285
  dynamodb = boto3.resource(
286
- service_name='dynamodb',
287
- region_name='us-east-1',
288
- aws_access_key_id=AWS_ACCESS_KEY_ID,
289
- aws_secret_access_key=AWS_SECRET_ACCESS_KEY
290
  )
291
  return dynamodb
292
 
293
  def sentiment_computation(content):
294
  label_dict = {
295
  "positive": "+",
296
  "negative": "-",
@@ -314,6 +464,20 @@ def sentiment_computation(content):
314
  return sentiment_score, label_dict[sentiment_label]
315
 
316
  def crawl(url, article):
317
  domain = '.'.join(urlparse(url).netloc.split('.')[1:])
318
  req = urllib.request.urlopen(url)
319
  text = req.read()
@@ -348,10 +512,18 @@ def crawl(url, article):
348
  update_content(article)
349
 
350
  def upsert_content(report):
351
- """Upsert the content records"""
 
 
 
 
 
 
 
 
352
  dynamodb = get_db_connection()
353
  table = dynamodb.Table('article_china')
354
- # Define the item data
355
  item = {
356
  'id': str(report['id']),
357
  'site': report['site'],
@@ -374,54 +546,71 @@ def upsert_content(report):
374
  response = table.put_item(Item=item)
375
  print(response)
376
 
377
- # def get_client_connection():
378
- # """Get dynamoDB connection"""
379
- # dynamodb = boto3.client(
380
- # service_name='dynamodb',
381
- # region_name='us-east-1',
382
- # aws_access_key_id=AWS_ACCESS_KEY_ID,
383
- # aws_secret_access_key=AWS_SECRET_ACCESS_KEY
384
- # )
385
- # return dynamodb
386
-
387
  def delete_records(item):
388
  dynamodb_client = get_client_connection()
389
  dynamodb_client.delete_item(
390
- TableName="article_test",
391
- Key={
392
- 'id': {'S': item['id']},
393
- 'site': {'S': item['site']}
394
- }
395
- )
396
 
397
  def update_content(report):
398
  dynamodb = get_client_connection()
399
  response = dynamodb.update_item(
400
- TableName="article_china",
401
- Key={
402
- 'id': {'S': str(report['id'])},
403
- 'site': {'S': report['site']}
404
- },
405
- UpdateExpression='SET title = :title, titleCN = :titleCN, contentCN = :contentCN, category = :category, author = :author, content = :content, subtitle = :subtitle, publishDate = :publishDate, link = :link, attachment = :attachment, sentimentScore = :sentimentScore, sentimentLabel = :sentimentLabel, LastModifiedDate = :LastModifiedDate',
406
- ExpressionAttributeValues={
407
- ':title': {'S': report['title']},
408
- ':titleCN': {'S': report['titleCN']},
409
- ':contentCN': {'S': report['contentCN']},
410
- ':category': {'S': report['category']},
411
- ':author': {'S': report['author']},
412
- ':content': {'S': report['content']},
413
- ':subtitle': {'S': report['subtitle']},
414
- ':publishDate': {'S': report['publishDate']},
415
- ':link': {'S': report['link']},
416
- ':attachment': {'S': report['attachment']},
417
- ':LastModifiedDate': {'S': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")},
418
- ':sentimentScore': {'N': str(Decimal(str(report['sentimentScore'])).quantize(Decimal('0.01')))},
419
- ':sentimentLabel': {'S': report['sentimentLabel']}
420
- }
421
- )
422
  print(response)
423
 
424
  def update_content_sentiment(report):
425
  dynamodb = get_client_connection()
426
  response = dynamodb.update_item(
427
  TableName="article_test",
 
1
+ """Module to define utility function"""
2
  import os
3
  import re
4
  import json
 
31
  patterns = json.load(f)
32
 
33
  def get_client_connection():
34
+ """
35
+ Returns a client connection to DynamoDB.
36
+
37
+ :return: DynamoDB client connection
38
+ """
39
  dynamodb = boto3.client(
40
  service_name='dynamodb',
41
  region_name='us-east-1',
 
45
  return dynamodb
46
 
47
  def update_reference(report):
48
+ """
49
+ Updates the reference in the 'reference_china' table in DynamoDB.
50
+
51
+ Args:
52
+ report (dict): A dictionary containing the report details.
53
+
54
+ Returns:
55
+ None
56
+ """
57
  dynamodb = get_client_connection()
58
  response = dynamodb.update_item(
59
  TableName="reference_china",
 
71
  print(response)
72
 
73
  def download_files_from_s3(folder):
74
+ """
75
+ Downloads Parquet files from an S3 bucket and returns a concatenated DataFrame.
76
+
77
+ Args:
78
+ folder (str): The folder in the S3 bucket to download files from.
79
+
80
+ Returns:
81
+ pandas.DataFrame: A concatenated DataFrame containing the data from the downloaded Parquet files.
82
+ """
83
  if not os.path.exists(folder):
84
  os.makedirs(folder)
85
  client = boto3.client(
 
96
  return pd.concat([pd.read_parquet(file_path) for file_path in file_paths], ignore_index=True)
97
 
98
  def extract_from_pdf_by_pattern(url, pattern):
99
+ """
100
+ Extracts text from a PDF file based on a given pattern.
101
+
102
+ Args:
103
+ url (str): The URL of the PDF file to extract text from.
104
+ pattern (dict): A dictionary containing the pattern to match and the pages to extract text from.
105
+
106
+ Returns:
107
+ str: The extracted text from the PDF file.
108
+
109
+ Raises:
110
+ Exception: If there is an error while retrieving or processing the PDF file.
111
+
112
+ """
113
  # Send a GET request to the URL and retrieve the PDF content
114
  try:
115
  response = requests.get(url, timeout=60)
 
138
  return extracted_text.replace('?\n', '?-\n').replace('!\n', '!-\n').replace('。\n', '。-\n').replace('\n',' ').replace('?-','?\n').replace('!-','!\n').replace('。-','。\n')
139
 
140
  def get_reference_by_regex(pattern, text):
141
+ """
142
+ Finds all occurrences of a given regex pattern in the provided text.
143
+
144
+ Args:
145
+ pattern (str): The regex pattern to search for.
146
+ text (str): The text to search within.
147
+
148
+ Returns:
149
+ list: A list of all matches found in the text.
150
+ """
151
  return re.findall(pattern, text)
152
 
153
  def isnot_substring(list_a, string_to_check):
154
+ """
155
+ Check if any string in the given list is a substring of the string_to_check.
156
+
157
+ Args:
158
+ list_a (list): A list of strings to check.
159
+ string_to_check (str): The string to check for substrings.
160
+
161
+ Returns:
162
+ bool: True if none of the strings in list_a are substrings of string_to_check, False otherwise.
163
+ """
164
  for s in list_a:
165
  if s in string_to_check:
166
  return False
167
  return True
168
 
169
  def extract_reference(row):
170
+ """
171
+ Extracts reference information from a given row.
172
+
173
+ Args:
174
+ row (dict): A dictionary representing a row of data.
175
+
176
+ Returns:
177
+ None
178
+ """
179
  try:
180
  pattern = next((elem for elem in patterns if elem['site'] == row['site']), None)
181
  extracted_text = extract_from_pdf_by_pattern(row['attachment'],pattern)
 
250
  update_reference(row)
251
  except Exception as error:
252
  print(error)
 
253
 
254
  def translate(text):
255
+ """
256
+ Translates the given text to English.
257
+
258
+ Args:
259
+ text (str): The text to be translated.
260
+
261
+ Returns:
262
+ str: The translated text in English.
263
+ """
264
  return translator.translate(text, dest='en').text
265
 
266
  def datemodifier(date_string, date_format):
267
+ """Date Modifier Function
268
+
269
+ This function takes a date string and a date format as input and modifies the date string
270
+ according to the specified format. It returns the modified date string in the format 'YYYY-MM-DD'.
271
+
272
+ Args:
273
+ date_string (str): The date string to be modified.
274
+ date_format (str): The format of the date string.
275
+
276
+ Returns:
277
+ str: The modified date string in the format 'YYYY-MM-DD'.
278
+ False: If an error occurs during the modification process.
279
+ """
280
  try:
281
  to_date = time.strptime(date_string,date_format)
282
  return time.strftime("%Y-%m-%d",to_date)
 
284
  return False
285
 
286
  def fetch_url(url):
287
+ """
288
+ Fetches the content of a given URL.
289
+
290
+ Args:
291
+ url (str): The URL to fetch.
292
+
293
+ Returns:
294
+ str or None: The content of the URL if the request is successful (status code 200),
295
+ otherwise None.
296
+
297
+ Raises:
298
+ requests.exceptions.RequestException: If there is an error while making the request.
299
+
300
+ """
301
+ response = requests.get(url, timeout=60)
302
  if response.status_code == 200:
303
  return response.text
304
  else:
305
  return None
306
 
307
  def translist(infolist):
308
+ """
309
+ Filter and transform a list of strings.
310
+
311
+ Args:
312
+ infolist (list): The input list of strings.
313
+
314
+ Returns:
315
+ list: The filtered and transformed list of strings.
316
+ """
317
  out = list(filter(lambda s: s and
318
+ (isinstance(s, str) or len(s.strip()) > 0), [i.strip() for i in infolist]))
319
  return out
320
 
321
  def encode(content):
322
+ """
323
+ Encodes the given content into a single string.
324
+
325
+ Args:
326
+ content (list): A list of elements to be encoded. Each element can be either a string or an `etree._Element` object.
327
+
328
+ Returns:
329
+ str: The encoded content as a single string.
330
+
331
+ """
332
  text = ''
333
  for element in content:
334
  if isinstance(element, etree._Element):
 
343
  return text
344
 
345
  def encode_content(content):
346
+ """
347
+ Encodes the content by removing unnecessary characters and extracting a summary.
348
+
349
+ Args:
350
+ content (list): A list of elements representing the content.
351
+
352
+ Returns:
353
+ tuple: A tuple containing the encoded text and the summary.
354
+
355
+ """
356
  text = ''
357
  for element in content:
358
  if isinstance(element, etree._Element):
 
376
  return text, summary
377
 
378
  def extract_from_pdf(url):
379
+ """
380
+ Extracts text from a PDF file given its URL.
381
+
382
+ Args:
383
+ url (str): The URL of the PDF file.
384
+
385
+ Returns:
386
+ tuple: A tuple containing the extracted text and a summary of the text.
387
+
388
+ Raises:
389
+ Exception: If there is an error during the extraction process.
390
+ """
391
  # Send a GET request to the URL and retrieve the PDF content
392
  response = requests.get(url, timeout=60)
393
  pdf_content = response.content
 
417
  return extracted_text, summary
418
 
419
  def get_db_connection():
420
+ """Get dynamoDB connection.
421
+
422
+ Returns:
423
+ boto3.resource: The DynamoDB resource object representing the connection.
424
+ """
425
  dynamodb = boto3.resource(
426
+ service_name='dynamodb',
427
+ region_name='us-east-1',
428
+ aws_access_key_id=AWS_ACCESS_KEY_ID,
429
+ aws_secret_access_key=AWS_SECRET_ACCESS_KEY
430
  )
431
  return dynamodb
432
 
433
  def sentiment_computation(content):
434
+ """
435
+ Compute the sentiment score and label for the given content.
436
+
437
+ Parameters:
438
+ content (str): The content for which sentiment needs to be computed.
439
+
440
+ Returns:
441
+ tuple: A tuple containing the sentiment score and label. The sentiment score is a float representing the overall sentiment score of the content. The sentiment label is a string representing the sentiment label ('+', '-', or '0').
442
+
443
+ """
444
  label_dict = {
445
  "positive": "+",
446
  "negative": "-",
 
464
  return sentiment_score, label_dict[sentiment_label]
465
 
466
  def crawl(url, article):
467
+ """
468
+ Crawls the given URL and extracts relevant information from the webpage.
469
+
470
+ Args:
471
+ url (str): The URL of the webpage to crawl.
472
+ article (dict): A dictionary to store the extracted information.
473
+
474
+ Returns:
475
+ None: If the length of the extracted content is less than 10 characters.
476
+
477
+ Raises:
478
+ None
479
+
480
+ """
481
  domain = '.'.join(urlparse(url).netloc.split('.')[1:])
482
  req = urllib.request.urlopen(url)
483
  text = req.read()
 
512
  update_content(article)
513
 
514
  def upsert_content(report):
515
+ """
516
+ Upserts the content of a report into the 'article_china' table in DynamoDB.
517
+
518
+ Args:
519
+ report (dict): A dictionary containing the report data.
520
+
521
+ Returns:
522
+ dict: The response from the DynamoDB put_item operation.
523
+ """
524
  dynamodb = get_db_connection()
525
  table = dynamodb.Table('article_china')
526
+ # Define the item data
527
  item = {
528
  'id': str(report['id']),
529
  'site': report['site'],
 
546
  response = table.put_item(Item=item)
547
  print(response)
548
 
549
  def delete_records(item):
550
+ """
551
+ Deletes a record from the 'article_test' table in DynamoDB.
552
+
553
+ Args:
554
+ item (dict): The item to be deleted, containing 'id' and 'site' keys.
555
+
556
+ Returns:
557
+ None
558
+ """
559
  dynamodb_client = get_client_connection()
560
  dynamodb_client.delete_item(
561
+ TableName="article_test",
562
+ Key={
563
+ 'id': {'S': item['id']},
564
+ 'site': {'S': item['site']}
565
+ }
566
+ )
567
 
568
  def update_content(report):
569
+ """
570
+ Updates the content of an article in the 'article_china' table in DynamoDB.
571
+
572
+ Args:
573
+ report (dict): A dictionary containing the report data.
574
+
575
+ Returns:
576
+ None
577
+ """
578
  dynamodb = get_client_connection()
579
  response = dynamodb.update_item(
580
+ TableName="article_china",
581
+ Key={
582
+ 'id': {'S': str(report['id'])},
583
+ 'site': {'S': report['site']}
584
+ },
585
+ UpdateExpression='SET title = :title, titleCN = :titleCN, contentCN = :contentCN, category = :category, author = :author, content = :content, subtitle = :subtitle, publishDate = :publishDate, link = :link, attachment = :attachment, sentimentScore = :sentimentScore, sentimentLabel = :sentimentLabel, LastModifiedDate = :LastModifiedDate',
586
+ ExpressionAttributeValues={
587
+ ':title': {'S': report['title']},
588
+ ':titleCN': {'S': report['titleCN']},
589
+ ':contentCN': {'S': report['contentCN']},
590
+ ':category': {'S': report['category']},
591
+ ':author': {'S': report['author']},
592
+ ':content': {'S': report['content']},
593
+ ':subtitle': {'S': report['subtitle']},
594
+ ':publishDate': {'S': report['publishDate']},
595
+ ':link': {'S': report['link']},
596
+ ':attachment': {'S': report['attachment']},
597
+ ':LastModifiedDate': {'S': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")},
598
+ ':sentimentScore': {'N': str(Decimal(str(report['sentimentScore'])).quantize(Decimal('0.01')))},
599
+ ':sentimentLabel': {'S': report['sentimentLabel']}
600
+ }
601
+ )
602
  print(response)
603
 
604
  def update_content_sentiment(report):
605
+ """
606
+ Updates the sentiment score and label of an article in the 'article_test' DynamoDB table.
607
+
608
+ Args:
609
+ report (dict): A dictionary containing the report information.
610
+
611
+ Returns:
612
+ None
613
+ """
614
  dynamodb = get_client_connection()
615
  response = dynamodb.update_item(
616
  TableName="article_test",