chore: Add script descriptions and improve code readability
README.md
CHANGED

# security-report-collection

The `main.py` file is a Python script that performs sentiment analysis on articles. Here's a detailed breakdown of the code:

- Importing Libraries:
  - The script starts by importing the necessary libraries:

    ```python
    import os
    import glob
    import warnings
    from decimal import Decimal
    import pandas as pd
    import boto3
    ```

    These libraries cover file manipulation (`os`), file searching (`glob`), warning suppression (`warnings`), decimal arithmetic (`decimal`), data processing (`pandas`), and AWS services (`boto3`).

- Defining Functions:
  - The script defines three functions:
    1. `get_db_connection()`: establishes a connection to an Amazon DynamoDB database using the AWS access key ID and secret access key.
    2. `download_files_from_s3()`: downloads Parquet files from the "oe-data-poc" S3 bucket and concatenates them into a pandas DataFrame.
    3. `gen_sentiment(record, table_name, label_dict)`: computes the sentiment score for each article in the input record, using the Hugging Face Transformers library to analyze the text, and updates the DynamoDB database with the sentiment score and label (a sketch of this step is shown below).
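Since the body of `gen_sentiment()` is not reproduced in this README, here is a minimal sketch of how such a function could look. It assumes a Hugging Face `transformers` sentiment pipeline and a DynamoDB table keyed on `id`; the model checkpoint, key schema, and attribute names are illustrative and not taken from `main.py`.

```python
import boto3
from decimal import Decimal
from transformers import pipeline

# Hypothetical model choice; main.py may use a different checkpoint.
sentiment_pipeline = pipeline("sentiment-analysis",
                              model="distilbert-base-uncased-finetuned-sst-2-english")

def gen_sentiment(record, table_name, label_dict):
    """Score one article and write the result back to DynamoDB (illustrative sketch)."""
    result = sentiment_pipeline(record['content'][:512])[0]   # e.g. {'label': 'POSITIVE', 'score': 0.99}
    score = Decimal(str(result['score'])).quantize(Decimal('0.01'))
    label = label_dict.get(result['label'].lower(), "0")

    # Assumes a table keyed on 'id'; the real schema may differ.
    table = boto3.resource('dynamodb', region_name='us-east-1').Table(table_name)
    table.update_item(
        Key={'id': record['id']},
        UpdateExpression="SET sentimentscore = :s, sentimentlabel = :l",
        ExpressionAttributeValues={':s': score, ':l': label},
    )
```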

- Main Program:
  - The script's main program:

    ```python
    if __name__ == "__main__":
        # Define a dictionary mapping sentiment labels to symbols
        label = {
            "positive": "+",
            "negative": "-",
            "neutral": "0",
        }

        # Download files from S3 and filter out null values
        df = download_files_from_s3()
        df = df[(~df['content'].isnull()) & (df['sentimentscore'].isnull())]

        # Iterate through each row in the DataFrame
        for _, row in df.iterrows():
            # Compute sentiment score and update DynamoDB database
            gen_sentiment(row, 'article', label)
    ```

    The main program defines a dictionary mapping sentiment labels to symbols (e.g., "+" for positive sentiment), downloads the files from S3, keeps only the rows that have content but no sentiment score yet, and then iterates through each row of the DataFrame. For each row, it computes the sentiment score with `gen_sentiment()` and updates the DynamoDB database with the score and label.

- That's It!
  - In short, the script performs sentiment analysis on the articles stored in S3 and writes the resulting scores and labels back to DynamoDB.

The `glue.py` file contains a Python script that triggers a Parquet snapshot Glue job.

Here's a breakdown of the code:

1. It starts by importing the necessary modules:
   - `os`: for interacting with the operating system
   - `boto3`: the AWS SDK for Python, used to work with AWS services such as Amazon S3, DynamoDB, and Glue
2. It then reads two environment variables:
   - `AWS_ACCESS_KEY_ID`
   - `AWS_SECRET_ACCESS_KEY`
3. The script defines a function called `get_client_connection()` that returns a Boto3 client object for the Glue service. This client is used to interact with AWS Glue.
4. Finally, it uses this client to start a job run named 'Ner Snapshot' and prints the response returned by AWS Glue.

In summary, `glue.py` sets up the AWS environment and starts a Glue job that creates a Parquet snapshot.
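Starting a Glue job run through Boto3 follows the pattern sketched below. This is a minimal illustration of the step described in point 4 above, not the literal contents of `glue.py`; the region comes from the scripts in this repository and credentials are assumed to be available in the environment.

```python
import boto3

# Minimal sketch: a Glue client like the one returned by get_client_connection()
glue = boto3.client('glue', region_name='us-east-1')

response = glue.start_job_run(JobName='Ner Snapshot')
print(response['JobRunId'])  # identifier of the newly started job run
```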
cbirc.py
CHANGED
@@ -1,9 +1,19 @@
+"""
+This script fetches data from the China Banking and Insurance Regulatory Commission (CBIRC) website and extracts relevant information from the fetched data.
+The extracted information is then processed and stored in a database.
+
+The script performs the following steps:
+1. Fetches data from the CBIRC website by making HTTP requests.
+2. Parses the fetched data and extracts relevant information.
+3. Translates the extracted information to English.
+4. Computes sentiment scores for the translated content.
+5. Stores the processed information in a database.
+
+Note: The script also includes commented code for fetching data from the State Taxation Administration of China website, but it is currently disabled.
+"""
 import json
-import ssl
 import uuid
 import time
-import urllib.request
-import urllib3
 from datetime import datetime, timedelta
 from utils import translate, sentiment_computation, upsert_content, fetch_url, extract_from_pdf

@@ -34,7 +44,7 @@ while i > -1:
         article['titleCN'] = article['docSubtitle']
         article['title'] = translate(article['docSubtitle'])
         article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
-        article['category']= "Policy Interpretation"
+        article['category']= "Policy Interpretation"
         article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
         article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
         article['attachment'] = ''
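Several of the collectors, `cbirc.py` included, pull article text out of PDF attachments through the `extract_from_pdf` helper in `utils`. The helper's body is not part of this hunk; the sketch below shows the general approach, assuming the `pypdf` package for parsing (an assumption; the repository may use a different PDF library) and `requests` for the download.

```python
from io import BytesIO

import requests
from pypdf import PdfReader  # assumed PDF library; any text-extraction library would do

def extract_text_from_pdf_url(url: str) -> str:
    """Download a PDF and return its concatenated page text (illustrative sketch)."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    reader = PdfReader(BytesIO(response.content))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```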
chinatax.py
CHANGED
@@ -1,3 +1,16 @@
+"""
+This script is used for data collection from the China Taxation website. It retrieves policy interpretation articles and processes them for further analysis.
+
+The script performs the following steps:
+1. Imports necessary modules and libraries.
+2. Defines the base URL for retrieving policy interpretation articles.
+3. Iterates through the pages of the search results.
+4. Retrieves the content of each article.
+5. Processes the content by translating it to English and performing sentiment analysis.
+6. Stores the processed data in a database.
+
+Note: The script also retrieves additional articles from a different URL and follows a similar process.
+"""
 import json
 import ssl
 import uuid
@@ -6,7 +19,7 @@ import time
 import urllib.request
 import urllib3
 from lxml import etree
-from utils import
+from utils import translate, sentiment_computation, upsert_content, encode_content

 ssl._create_default_https_context = ssl._create_stdlib_context

csrc.py
CHANGED
@@ -1,3 +1,24 @@
+"""
+This script is used to crawl and collect data from the website of the China Securities Regulatory Commission (CSRC).
+It retrieves policy interpretation articles and financial news articles from the CSRC website.
+The collected data is then processed and stored in a database.
+
+The script consists of two main parts:
+1. Crawl and process policy interpretation articles from the CSRC website.
+2. Crawl and process financial news articles from the CSRC website.
+
+The script uses various libraries and functions to handle web scraping, data processing, and database operations.
+
+Note: This script assumes the presence of the following dependencies:
+- urllib
+- lxml
+- json
+- datetime
+- time
+- utils (custom module)
+
+Please make sure to install these dependencies before running the script.
+"""
 import uuid
 import json
 import time
@@ -35,7 +56,7 @@ while i > -1:
             article['category']= "Policy Interpretation"
             crawl(url, article)
         except Exception as error:
-
+            print(error)

 i = 1
 while i > -1:
@@ -70,4 +91,4 @@ while i > -1:
             article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
             upsert_content(article)
         except Exception as error:
-            print(error)
+            print(error)
daily.py
CHANGED
@@ -1,21 +1,21 @@
-
+"""
+This script is responsible for collecting data from various websites related to financial and policy information in China.
+It fetches data from different sources, extracts relevant information, translates it, and updates the content accordingly.
+The collected data includes policy interpretations, financial news, macroeconomic research, and more.
+"""
 import json
-import
+import os
 import time
 import urllib.request
-
+import uuid
 from datetime import datetime, timedelta
 from urllib.parse import urlparse
-
-
-
-
-                    extract_from_pdf,
-
-                    datemodifier,
-                    encode_content,
-                    update_content,
-                    extract_reference)
+
+from lxml import etree
+
+from utils import (crawl, datemodifier, encode, encode_content,
+                   extract_from_pdf, extract_reference, fetch_url,
+                   sentiment_computation, translate, update_content)

 with open('xpath.json', 'r', encoding='UTF-8') as f:
     xpath_dict = json.load(f)
@@ -50,7 +50,7 @@ while i > -1:
         article['titleCN'] = article['docSubtitle']
         article['title'] = translate(article['docSubtitle'])
         article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
-        article['category']= "Policy Interpretation"
+        article['category']= "Policy Interpretation"
         article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
         article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
         article['attachment'] = ''
@@ -133,6 +133,20 @@ while i > -1:

 print("data.eastmoney.com")
 def crawl_eastmoney(url, article):
+    """
+    Crawls the given URL and extracts information from the webpage.
+
+    Args:
+        url (str): The URL of the webpage to crawl.
+        article (dict): A dictionary to store the extracted information.
+
+    Returns:
+        None: If the length of the extracted content is less than 10 characters.
+
+    Raises:
+        None.
+
+    """
     domain = urlparse(url).netloc
     req = urllib.request.urlopen(url)
     text = req.read()
@@ -499,4 +513,4 @@ while i > -1:
             article['category']= "Data Interpretation"
             crawl(url, article)
         except Exception as error:
-
+            print(error)
eastmoney.py
CHANGED
@@ -1,3 +1,9 @@
+"""
+This script is used to crawl a webpage and extract relevant information from it. It defines a function `crawl` that takes a URL and a dictionary to store the extracted information. The function crawls the webpage, extracts the content, translates it to English, and stores it in the dictionary.
+
+The script also includes a main loop that fetches data from a specific URL and calls the `crawl` function for each article in the fetched data.
+"""
+
 import uuid
 import json
 import urllib.request
@@ -6,10 +12,26 @@ from datetime import datetime, timedelta
 from lxml import etree
 from utils import encode, translate, datemodifier, sentiment_computation, upsert_content, fetch_url, encode_content

+# Load XPath dictionary from a JSON file
 with open('xpath.json', 'r', encoding='UTF-8') as f:
     xpath_dict = json.load(f)

 def crawl(url, article):
+    """
+    Crawls the given URL and extracts relevant information from the webpage.
+
+    Args:
+        url (str): The URL of the webpage to crawl.
+        article (dict): A dictionary to store the extracted information.
+
+    Returns:
+        None: If the length of the extracted content is less than 10 characters.
+        str: The extracted content in English if successful.
+
+    Raises:
+        None
+
+    """
     domain = urlparse(url).netloc
     req = urllib.request.urlopen(url)
     text = req.read()
glue.py
CHANGED
@@ -6,7 +6,11 @@ AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
 AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']

 def get_client_connection():
-    """
+    """
+    Returns a client connection to the AWS Glue service.
+
+    :return: AWS Glue client connection
+    """
     return boto3.client(
         service_name='glue',
         region_name='us-east-1',
@@ -22,4 +26,4 @@ print(response)
 response = glue.start_job_run(
     JobName='Reference China'
 )
-print(response)
+print(response)
gov.py
CHANGED
@@ -1,3 +1,23 @@
+"""
+This script is used to crawl and collect policy articles from the official website of the State Council of China (https://www.gov.cn).
+
+The script contains two main functions:
+1. crawl(url, article): This function is responsible for crawling a specific policy article given its URL and extracting relevant information such as title, author, content, publish date, etc.
+2. main(): This function is the entry point of the script. It iterates over different pages of policy articles and calls the crawl function to collect the information.
+
+Note: The script imports the following modules: datetime, timedelta, time, urllib.request, lxml.etree, and utils (custom module).
+"""
+
+from datetime import datetime, timedelta
+import time
+import urllib.request
+from lxml import etree
+from utils import crawl
+
+# Rest of the code...
+"""
+
+"""
 from datetime import datetime, timedelta
 import time
 import urllib.request
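The docstring above describes the pattern shared by most collectors in this repository: walk the paginated article listings, pull each article URL out of the page with lxml, and hand it to `utils.crawl` together with a partially filled `article` dictionary. A minimal sketch of that loop is shown below; the listing URL and the XPath expressions are placeholders, not the values used by `gov.py`.

```python
import urllib.request

from lxml import etree
from utils import crawl

# Placeholder listing URL and XPath; the real script uses the site's own values.
BASE_URL = "https://www.example.gov.cn/policies/index_{page}.html"

for page in range(1, 4):  # iterate over a few listing pages
    with urllib.request.urlopen(BASE_URL.format(page=page)) as req:
        html = etree.HTML(req.read())
    for link in html.xpath("//ul[@class='list']/li/a/@href"):
        article = {'category': "Policy Interpretation"}
        try:
            crawl(link, article)  # utils.crawl fills in title, content, sentiment, etc.
        except Exception as error:
            print(error)
```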
manual_upload.py
CHANGED
@@ -1,7 +1,27 @@
-
-
-
+"""
+This script allows the user to manually upload an article to a database. It prompts the user to enter various details about the article, such as the title, content, subtitle, publish date, link, and site. It then computes the sentiment of the article's translated content and constructs a dictionary representing the article. Finally, it inserts or updates the article in the database.
+
+Dependencies:
+- decimal
+- utils (custom module)
+- datetime
+- uuid
+
+Usage:
+1. Run the script.
+2. Enter the required details about the article when prompted.
+3. The script will compute the sentiment of the translated content and construct a dictionary representing the article.
+4. The article will be inserted or updated in the database.
+5. The article dictionary and the response from the database operation will be printed.
+
+Note: Make sure to configure the database connection and table name before running the script.
+"""
+
 import uuid
+from datetime import datetime
+from decimal import Decimal
+
+from utils import get_db_connection, sentiment_computation, translate

 # User input for the article content
 article_titleCN = input("Enter the title of the article: ")
@@ -12,7 +32,6 @@ article_publish_date = input("Enter the publish date of the article (YYYY-MM-DD)
 article_link = input("Enter the link to the article: ")
 article_siteCN = input("Enter the site of the article: ")

-
 # Compute sentiment of the translated content
 sentiment_score, sentiment_label = sentiment_computation(article_contentCN)

@@ -30,8 +49,6 @@ article= {
     'publishDate': article_publish_date,
     'link': article_link,
     'attachment': '',
-    # 'authorID': str(report['authorid']),
-    # 'entityList': report['entitylist'],
     'sentimentScore': Decimal(str(sentiment_score)).quantize(Decimal('0.01')),
     'sentimentLabel': sentiment_label,
     'LastModifiedDate': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
mof.py
CHANGED
@@ -1,9 +1,22 @@
+"""
+This script is used to crawl and collect financial news and policy interpretation articles from the website of the Ministry of Finance of China (https://www.mof.gov.cn/).
+
+The script iterates through the pages of the "Financial News" and "Policy Interpretation" categories on the website and extracts the articles' URLs. It then calls the `crawl` function from the `utils` module to crawl and collect the article data.
+
+The script uses the `lxml` library to parse the HTML content of the website and extract the necessary information.
+
+Note: The script assumes the existence of a `crawl` function in the `utils` module.
+"""
+
 import time
 import urllib.request
-from lxml import etree
 from datetime import datetime, timedelta
+
+from lxml import etree
+
 from utils import crawl

+# Crawl Financial News articles
 i = 0
 while i > -1:
     if i == 0:
@@ -38,6 +51,7 @@ while i > -1:
     except Exception as error:
         print(error)

+# Crawl Policy Interpretation articles
 i = 0
 while i > -1:
     if i == 0:
mofcom.py
CHANGED
@@ -1,3 +1,9 @@
+"""
+This script is used to crawl and collect data from the Ministry of Commerce of the People's Republic of China (MOFCOM) website.
+It retrieves articles from different categories and extracts relevant information such as date and URL.
+The collected data is then passed to the 'crawl' function for further processing.
+"""
+
 import time
 import urllib.request
 from datetime import datetime, timedelta
ndrc.py
CHANGED
@@ -1,5 +1,17 @@
+"""
+This script is used to crawl and collect data from the National Development and Reform Commission (NDRC) website.
+It retrieves articles from the website and categorizes them as either "Policy Release" or "Policy Interpretation".
+The script starts by iterating through the pages of the website, starting from the first page.
+For each page, it retrieves the HTML content and parses it using lxml library.
+It then extracts the article list from the parsed HTML and iterates through each article.
+For each article, it extracts the publication date, converts it to a datetime object, and checks if it is within the last 183 days.
+If the article is older than 183 days, the script stops iterating through the pages.
+Otherwise, it extracts the URL of the article and categorizes it based on the URL pattern.
+The script then calls the 'crawl' function from the 'utils' module to crawl the article and collect data.
+Any exceptions that occur during the crawling process are caught and printed.
+"""
+
 from datetime import datetime, timedelta
-import uuid
 import time
 import urllib.request
 from lxml import etree
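The 183-day cutoff described in this docstring, like the similar six-month checks in `safe.py` and `stats.py`, comes down to comparing the parsed publication date against `datetime.now() - timedelta(days=183)`. A minimal sketch of that check, with an illustrative date format:

```python
from datetime import datetime, timedelta

CUTOFF = datetime.now() - timedelta(days=183)

def is_recent(publish_date: str, date_format: str = "%Y-%m-%d") -> bool:
    """Return True if the article was published within the last 183 days (sketch)."""
    return datetime.strptime(publish_date, date_format) >= CUTOFF

# Example: stop paging as soon as an article falls outside the window.
if not is_recent("2023-01-15"):
    print("older than 183 days, stop iterating through the pages")
```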
pbc.py
CHANGED
@@ -1,3 +1,18 @@
+"""
+This module contains code to scrape the People's Bank of China website and collect policy interpretation articles. It iterates through the pages of the website, extracts relevant information from each article, and stores the data in a database.
+
+The main functionality of this module includes:
+- Scraping the website for policy interpretation articles
+- Parsing the HTML content of each article
+- Extracting relevant information such as title, content, publish date, and URL
+- Translating the content from Chinese to English
+- Computing sentiment scores for the content
+- Storing the collected data in a database
+
+Note: This code assumes the existence of the following helper functions: encode, translate, datemodifier, sentiment_computation, and upsert_content.
+
+"""
+
 import time
 import uuid
 from datetime import datetime, timedelta
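For a single parsed article, the helpers listed in this docstring are typically composed along the lines of the sketch below. This is a hedged illustration of the shared pattern, not the actual code of `pbc.py`; the XPath expressions, the date format, and the extra empty fields are assumptions.

```python
import uuid

from lxml import etree
from utils import encode, translate, datemodifier, sentiment_computation, upsert_content

def process_article(page_html: bytes, url: str, site: str) -> None:
    """Extract, translate, score and store one article (illustrative sketch)."""
    tree = etree.HTML(page_html)

    article = {'link': url, 'site': site, 'category': "Policy Interpretation"}
    article['titleCN'] = tree.xpath("//h1/text()")[0]                    # placeholder XPath
    article['contentCN'] = encode(tree.xpath("//div[@id='content']/p"))  # placeholder XPath
    article['title'] = translate(article['titleCN'])
    article['content'] = translate(article['contentCN'])
    article['publishDate'] = datemodifier(tree.xpath("//span[@class='date']/text()")[0], "%Y-%m-%d")
    article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN'] + article['publishDate'])
    article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
    article.update({'author': '', 'subtitle': '', 'attachment': ''})     # fields expected downstream
    upsert_content(article)
```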
safe.py
CHANGED
@@ -1,9 +1,30 @@
+"""Module to crawl the data from the website of State Administration of Foreign Exchange (SAFE) of China.
+
+This module contains code to crawl and collect data from the website of the State Administration of Foreign Exchange (SAFE) of China. It includes two sections: Policy Interpretation and Data Interpretation.
+
+Policy Interpretation:
+- The code crawls the web pages containing policy interpretations from the SAFE website.
+- It retrieves the publication date and checks if it is within the last 183 days.
+- If the publication date is within the last 183 days, it extracts the URL and other information of the policy interpretation article.
+- The extracted data is stored in a dictionary and passed to the 'crawl' function for further processing.
+
+Data Interpretation:
+- The code crawls the web pages containing data interpretations from the SAFE website.
+- It retrieves the publication date and checks if it is within the last 183 days.
+- If the publication date is within the last 183 days, it extracts the URL and other information of the data interpretation article.
+- The extracted data is stored in a dictionary and passed to the 'crawl' function for further processing.
+
+Note: The 'crawl' function is imported from the 'utils' module.
+
+"""
+
 import time
 import urllib.request
 from datetime import datetime, timedelta
 from lxml import etree
 from utils import crawl

+# Policy Interpretation
 i = 1
 while i > -1:
     if i == 1:
@@ -35,6 +56,7 @@ while i > -1:
     except Exception as error:
         print(error)

+# Data Interpretation
 i = 1
 while i > -1:
     if i == 1:
@@ -64,4 +86,4 @@ while i > -1:
             article['category']= "Data Interpretation"
             crawl(url, article)
         except Exception as error:
-            print(error)
+            print(error)
stats.py
CHANGED
@@ -1,4 +1,18 @@
-
+"""
+This script is used to crawl data from the website https://www.stats.gov.cn/sj/sjjd/.
+It retrieves articles from the website and extracts relevant information from each article.
+
+The script starts by iterating over the pages of the website, starting from the first page.
+For each page, it retrieves the HTML content and parses it using the lxml library.
+It then extracts the list of articles from the parsed HTML.
+For each article, it extracts the publication date and checks if it is within the last 6 months.
+If the article is within the last 6 months, it extracts the URL and crawls the article to extract additional information.
+
+The extracted information is stored in a dictionary and can be further processed or saved as needed.
+
+Note: This script requires the 'utils' module, which contains the 'encode' and 'crawl' functions.
+"""
+
 import time
 import urllib.request
 from datetime import datetime, timedelta
@@ -34,4 +48,4 @@ while i > -1:
             article['category']= "Data Interpretation"
             crawl(url, article)
         except Exception as error:
-
+            print(error)
utils.py
CHANGED
@@ -1,4 +1,4 @@
-"""
+"""Module to define utility function"""
 import os
 import re
 import json
@@ -31,7 +31,11 @@ with open('patterns.json', 'r', encoding='UTF-8') as f:
     patterns = json.load(f)

 def get_client_connection():
-    """
+    """
+    Returns a client connection to DynamoDB.
+
+    :return: DynamoDB client connection
+    """
     dynamodb = boto3.client(
         service_name='dynamodb',
         region_name='us-east-1',
@@ -41,6 +45,15 @@ def get_client_connection():
     return dynamodb

 def update_reference(report):
+    """
+    Updates the reference in the 'reference_china' table in DynamoDB.
+
+    Args:
+        report (dict): A dictionary containing the report details.
+
+    Returns:
+        None
+    """
     dynamodb = get_client_connection()
     response = dynamodb.update_item(
         TableName="reference_china",
@@ -58,7 +71,15 @@ def update_reference(report):
     print(response)

 def download_files_from_s3(folder):
-    """
+    """
+    Downloads Parquet files from an S3 bucket and returns a concatenated DataFrame.
+
+    Args:
+        folder (str): The folder in the S3 bucket to download files from.
+
+    Returns:
+        pandas.DataFrame: A concatenated DataFrame containing the data from the downloaded Parquet files.
+    """
     if not os.path.exists(folder):
         os.makedirs(folder)
     client = boto3.client(
@@ -75,6 +96,20 @@ def download_files_from_s3(folder):
     return pd.concat([pd.read_parquet(file_path) for file_path in file_paths], ignore_index=True)

 def extract_from_pdf_by_pattern(url, pattern):
+    """
+    Extracts text from a PDF file based on a given pattern.
+
+    Args:
+        url (str): The URL of the PDF file to extract text from.
+        pattern (dict): A dictionary containing the pattern to match and the pages to extract text from.
+
+    Returns:
+        str: The extracted text from the PDF file.
+
+    Raises:
+        Exception: If there is an error while retrieving or processing the PDF file.
+
+    """
     # Send a GET request to the URL and retrieve the PDF content
     try:
         response = requests.get(url, timeout=60)
@@ -103,15 +138,44 @@ def extract_from_pdf_by_pattern(url, pattern):
     return extracted_text.replace('?\n', '?-\n').replace('!\n', '!-\n').replace('。\n', '。-\n').replace('\n',' ').replace('?-','?\n').replace('!-','!\n').replace('。-','。\n')

 def get_reference_by_regex(pattern, text):
+    """
+    Finds all occurrences of a given regex pattern in the provided text.
+
+    Args:
+        pattern (str): The regex pattern to search for.
+        text (str): The text to search within.
+
+    Returns:
+        list: A list of all matches found in the text.
+    """
     return re.findall(pattern, text)

 def isnot_substring(list_a, string_to_check):
+    """
+    Check if any string in the given list is a substring of the string_to_check.
+
+    Args:
+        list_a (list): A list of strings to check.
+        string_to_check (str): The string to check for substrings.
+
+    Returns:
+        bool: True if none of the strings in list_a are substrings of string_to_check, False otherwise.
+    """
     for s in list_a:
         if s in string_to_check:
             return False
     return True

 def extract_reference(row):
+    """
+    Extracts reference information from a given row.
+
+    Args:
+        row (dict): A dictionary representing a row of data.
+
+    Returns:
+        None
+    """
     try:
         pattern = next((elem for elem in patterns if elem['site'] == row['site']), None)
         extracted_text = extract_from_pdf_by_pattern(row['attachment'],pattern)
@@ -186,13 +250,33 @@ def extract_reference(row):
         update_reference(row)
     except Exception as error:
         print(error)
-

 def translate(text):
+    """
+    Translates the given text to English.
+
+    Args:
+        text (str): The text to be translated.
+
+    Returns:
+        str: The translated text in English.
+    """
     return translator.translate(text, dest='en').text

 def datemodifier(date_string, date_format):
-    """Date Modifier Function
+    """Date Modifier Function
+
+    This function takes a date string and a date format as input and modifies the date string
+    according to the specified format. It returns the modified date string in the format 'YYYY-MM-DD'.
+
+    Args:
+        date_string (str): The date string to be modified.
+        date_format (str): The format of the date string.
+
+    Returns:
+        str: The modified date string in the format 'YYYY-MM-DD'.
+        False: If an error occurs during the modification process.
+    """
     try:
         to_date = time.strptime(date_string,date_format)
         return time.strftime("%Y-%m-%d",to_date)
@@ -200,20 +284,51 @@ def datemodifier(date_string, date_format):
         return False

 def fetch_url(url):
-
+    """
+    Fetches the content of a given URL.
+
+    Args:
+        url (str): The URL to fetch.
+
+    Returns:
+        str or None: The content of the URL if the request is successful (status code 200),
+        otherwise None.
+
+    Raises:
+        requests.exceptions.RequestException: If there is an error while making the request.
+
+    """
+    response = requests.get(url, timeout=60)
     if response.status_code == 200:
         return response.text
     else:
         return None

 def translist(infolist):
-    """
+    """
+    Filter and transform a list of strings.
+
+    Args:
+        infolist (list): The input list of strings.
+
+    Returns:
+        list: The filtered and transformed list of strings.
+    """
     out = list(filter(lambda s: s and
-                      (isinstance
+                      (isinstance(s, str) or len(s.strip()) > 0), [i.strip() for i in infolist]))
     return out

 def encode(content):
-    """
+    """
+    Encodes the given content into a single string.
+
+    Args:
+        content (list): A list of elements to be encoded. Each element can be either a string or an `etree._Element` object.
+
+    Returns:
+        str: The encoded content as a single string.
+
+    """
     text = ''
     for element in content:
         if isinstance(element, etree._Element):
@@ -228,7 +343,16 @@ def encode(content):
     return text

 def encode_content(content):
-    """
+    """
+    Encodes the content by removing unnecessary characters and extracting a summary.
+
+    Args:
+        content (list): A list of elements representing the content.
+
+    Returns:
+        tuple: A tuple containing the encoded text and the summary.
+
+    """
     text = ''
     for element in content:
         if isinstance(element, etree._Element):
@@ -252,6 +376,18 @@ def encode_content(content):
     return text, summary

 def extract_from_pdf(url):
+    """
+    Extracts text from a PDF file given its URL.
+
+    Args:
+        url (str): The URL of the PDF file.
+
+    Returns:
+        tuple: A tuple containing the extracted text and a summary of the text.
+
+    Raises:
+        Exception: If there is an error during the extraction process.
+    """
     # Send a GET request to the URL and retrieve the PDF content
     response = requests.get(url, timeout=60)
     pdf_content = response.content
@@ -281,16 +417,30 @@ def extract_from_pdf(url):
     return extracted_text, summary

 def get_db_connection():
-    """Get dynamoDB connection
+    """Get dynamoDB connection.
+
+    Returns:
+        boto3.resource: The DynamoDB resource object representing the connection.
+    """
     dynamodb = boto3.resource(
-
-
-
-
+        service_name='dynamodb',
+        region_name='us-east-1',
+        aws_access_key_id=AWS_ACCESS_KEY_ID,
+        aws_secret_access_key=AWS_SECRET_ACCESS_KEY
     )
     return dynamodb

 def sentiment_computation(content):
+    """
+    Compute the sentiment score and label for the given content.
+
+    Parameters:
+        content (str): The content for which sentiment needs to be computed.
+
+    Returns:
+        tuple: A tuple containing the sentiment score and label. The sentiment score is a float representing the overall sentiment score of the content. The sentiment label is a string representing the sentiment label ('+', '-', or '0').
+
+    """
     label_dict = {
         "positive": "+",
         "negative": "-",
@@ -314,6 +464,20 @@ def sentiment_computation(content):
     return sentiment_score, label_dict[sentiment_label]

 def crawl(url, article):
+    """
+    Crawls the given URL and extracts relevant information from the webpage.
+
+    Args:
+        url (str): The URL of the webpage to crawl.
+        article (dict): A dictionary to store the extracted information.
+
+    Returns:
+        None: If the length of the extracted content is less than 10 characters.
+
+    Raises:
+        None
+
+    """
     domain = '.'.join(urlparse(url).netloc.split('.')[1:])
     req = urllib.request.urlopen(url)
     text = req.read()
@@ -348,10 +512,18 @@ def crawl(url, article):
     update_content(article)

 def upsert_content(report):
-    """
+    """
+    Upserts the content of a report into the 'article_china' table in DynamoDB.
+
+    Args:
+        report (dict): A dictionary containing the report data.
+
+    Returns:
+        dict: The response from the DynamoDB put_item operation.
+    """
     dynamodb = get_db_connection()
     table = dynamodb.Table('article_china')
-
+    # Define the item data
     item = {
         'id': str(report['id']),
         'site': report['site'],
@@ -374,54 +546,71 @@ def upsert_content(report):
     response = table.put_item(Item=item)
     print(response)

-# def get_client_connection():
-#     """Get dynamoDB connection"""
-#     dynamodb = boto3.client(
-#         service_name='dynamodb',
-#         region_name='us-east-1',
-#         aws_access_key_id=AWS_ACCESS_KEY_ID,
-#         aws_secret_access_key=AWS_SECRET_ACCESS_KEY
-#     )
-#     return dynamodb
-
 def delete_records(item):
+    """
+    Deletes a record from the 'article_test' table in DynamoDB.
+
+    Args:
+        item (dict): The item to be deleted, containing 'id' and 'site' keys.
+
+    Returns:
+        None
+    """
     dynamodb_client = get_client_connection()
     dynamodb_client.delete_item(
-
-
-
-
-
-
+        TableName="article_test",
+        Key={
+            'id': {'S': item['id']},
+            'site': {'S': item['site']}
+        }
+    )

 def update_content(report):
+    """
+    Updates the content of an article in the 'article_china' table in DynamoDB.
+
+    Args:
+        report (dict): A dictionary containing the report data.
+
+    Returns:
+        None
+    """
     dynamodb = get_client_connection()
     response = dynamodb.update_item(
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+        TableName="article_china",
+        Key={
+            'id': {'S': str(report['id'])},
+            'site': {'S': report['site']}
+        },
+        UpdateExpression='SET title = :title, titleCN = :titleCN, contentCN = :contentCN, category = :category, author = :author, content = :content, subtitle = :subtitle, publishDate = :publishDate, link = :link, attachment = :attachment, sentimentScore = :sentimentScore, sentimentLabel = :sentimentLabel, LastModifiedDate = :LastModifiedDate',
+        ExpressionAttributeValues={
+            ':title': {'S': report['title']},
+            ':titleCN': {'S': report['titleCN']},
+            ':contentCN': {'S': report['contentCN']},
+            ':category': {'S': report['category']},
+            ':author': {'S': report['author']},
+            ':content': {'S': report['content']},
+            ':subtitle': {'S': report['subtitle']},
+            ':publishDate': {'S': report['publishDate']},
+            ':link': {'S': report['link']},
+            ':attachment': {'S': report['attachment']},
+            ':LastModifiedDate': {'S': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")},
+            ':sentimentScore': {'N': str(Decimal(str(report['sentimentScore'])).quantize(Decimal('0.01')))},
+            ':sentimentLabel': {'S': report['sentimentLabel']}
+        }
+    )
     print(response)

 def update_content_sentiment(report):
+    """
+    Updates the sentiment score and label of an article in the 'article_test' DynamoDB table.
+
+    Args:
+        report (dict): A dictionary containing the report information.
+
+    Returns:
+        None
+    """
     dynamodb = get_client_connection()
     response = dynamodb.update_item(
         TableName="article_test",
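The body of `download_files_from_s3` between the `boto3.client(` call and the final `pd.concat` is unchanged and therefore not shown in its hunk above. For orientation, the listing-and-download step of such a function typically looks like the sketch below; the bucket name is the "oe-data-poc" bucket mentioned in the README, while the paginator and prefix handling are assumptions rather than the repository's exact code.

```python
import os

import boto3
import pandas as pd

def download_parquet_folder(folder, bucket="oe-data-poc"):
    """Download every Parquet object under `folder/` and return one DataFrame (sketch)."""
    os.makedirs(folder, exist_ok=True)
    client = boto3.client("s3")

    file_paths = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=folder):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                local_path = os.path.join(folder, os.path.basename(obj["Key"]))
                client.download_file(bucket, obj["Key"], local_path)
                file_paths.append(local_path)

    return pd.concat([pd.read_parquet(p) for p in file_paths], ignore_index=True)
```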
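To make the behaviour of the smaller helpers documented above concrete, here is a short usage example; the sample values are invented for illustration and are not taken from the repository.

```python
from utils import datemodifier, isnot_substring, translist

# datemodifier: normalise a source-specific date string to 'YYYY-MM-DD'
print(datemodifier("2024/01/05", "%Y/%m/%d"))   # -> '2024-01-05'
print(datemodifier("not a date", "%Y/%m/%d"))   # -> False

# isnot_substring: True only if none of the candidates occur in the string
print(isnot_substring(["draft", "test"], "Policy Interpretation 2024"))  # -> True
print(isnot_substring(["Policy"], "Policy Interpretation 2024"))         # -> False

# translist: strip whitespace and drop empty entries from scraped text fragments
print(translist(["  January data  ", "", "\n", "GDP rose 5%  "]))
# -> ['January data', 'GDP rose 5%']
```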