OxbridgeEconomics committed
Commit dc5ddb1 · 2 Parent(s): 98d9fcd ef71343

Merge branch 'main' of https://github.com/oxbridge-econ/data-collection-china

Files changed (17)
  1. README.md +64 -1
  2. cbirc.py +14 -4
  3. chinatax.py +14 -1
  4. csrc.py +23 -2
  5. daily.py +29 -15
  6. eastmoney.py +22 -0
  7. glue.py +6 -2
  8. gov.py +20 -0
  9. manual_upload.py +23 -6
  10. mof.py +15 -1
  11. mofcom.py +6 -0
  12. ndrc.py +13 -1
  13. patterns.json +78 -12
  14. pbc.py +15 -0
  15. safe.py +23 -1
  16. stats.py +16 -2
  17. utils.py +244 -54
README.md CHANGED
@@ -1 +1,64 @@
- # security-report-collection
+ # Security Report Collection
+
+ The Security Report Collection repository contains a series of Python scripts designed to automate the collection, processing, and storage of financial and policy data from various Chinese government and financial websites. This data is vital for understanding changes in policy, financial news, and regulatory measures that could impact markets and investments.
+
+ ## Repository Structure
+
+ - **Python Scripts**: Each script is tailored to specific sources and tasks, ranging from data scraping to sentiment analysis and database operations.
+ - **GitHub Workflows**: Automated workflows that execute the Python scripts on a schedule or in response to specific trigger events; `utils.py` and `manual_upload.py` are excluded.
+ - **requirements.txt**: Lists all Python dependencies required for the scripts to run.
+
+ ## Python Scripts Overview
+
+ Each script targets different data sources or handles distinct aspects of data management:
+
+ ### Data Collection Scripts
+
+ 1. **CBIRC, Chinatax, CSRC, Daily, Eastmoney, Glue, Gov, Manual_Upload, MOF, MOFCOM, PBC, SAFE, Stats**:
+    - These scripts scrape data from their respective websites, handling tasks such as extracting article URLs, downloading articles, translating content, and calculating sentiment scores.
+    - They use utilities provided by `utils.py` to interact with databases, manage files, and perform translations and sentiment analysis.
+
+ ### Utility Scripts
+
+ - **utils.py**:
+   - A central utility script that supports database operations, file handling, content translation, and other functionality shared across the scripts.
+   - It includes custom functions for working with AWS DynamoDB, handling PDFs, fetching URLs, and more.
+
+ ### Special Scripts
+
+ - **manual_upload.py**:
+   - Allows manual data entry into the database, facilitating the addition of articles not captured through the automated scripts.
+   - Provides a command-line interface for inputting article details and saving them to DynamoDB.
+
+ ## GitHub Workflows
+
+ - Automated workflows are set up for all Python scripts except `utils.py` and `manual_upload.py`.
+ - These workflows ensure that data collection and processing tasks run periodically or in response to specific triggers, keeping the database up to date.
+
+ ## Requirements
+
+ - The `requirements.txt` file includes all necessary Python packages such as `boto3`, `lxml`, `requests`, `pandas`, `PyPDF2`, and others. Install these packages using:
+   ```pip install -r requirements.txt```
+
+ ## Setup and Configuration
+
+ 1. **AWS Configuration**:
+    - Ensure AWS credentials are correctly configured for access to services such as S3 and DynamoDB.
+    - Set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
+
+ 2. **Database Setup**:
+    - The scripts assume specific DynamoDB table configurations. Set up the required tables in AWS DynamoDB to match the scripts' needs.
+
+ 3. **Python Environment**:
+    - It's recommended to set up a virtual environment for Python to manage dependencies:
+    ```
+    python -m venv venv
+    source venv/bin/activate  # On Unix/macOS
+    venv\Scripts\activate     # On Windows
+    ```
+
+ 4. **Running Scripts**:
+    - To run a script manually, navigate to the script's directory and execute:
+    ```
+    python <script_name>.py
+    ```
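The "Database Setup" step above leaves the table layout implicit. From the `utils.py` changes later in this commit, articles are written to a DynamoDB table named `article_china` keyed by `id` and `site`. A minimal sketch of creating that table with `boto3`; the billing mode and the choice of partition vs. sort key are assumptions, not taken from the repository:

```python
import os

import boto3

dynamodb = boto3.client(
    "dynamodb",
    region_name="us-east-1",  # matches the region hard-coded in utils.py
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

dynamodb.create_table(
    TableName="article_china",
    AttributeDefinitions=[
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "site", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "id", "KeyType": "HASH"},    # assumed partition key
        {"AttributeName": "site", "KeyType": "RANGE"}, # assumed sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # assumption; provisioned capacity also works
)
```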
cbirc.py CHANGED
@@ -1,9 +1,19 @@
+ """
+ This script fetches data from the China Banking and Insurance Regulatory Commission (CBIRC) website and extracts relevant information from the fetched data.
+ The extracted information is then processed and stored in a database.
+
+ The script performs the following steps:
+ 1. Fetches data from the CBIRC website by making HTTP requests.
+ 2. Parses the fetched data and extracts relevant information.
+ 3. Translates the extracted information to English.
+ 4. Computes sentiment scores for the translated content.
+ 5. Stores the processed information in a database.
+
+ Note: The script also includes commented code for fetching data from the State Taxation Administration of China website, but it is currently disabled.
+ """
  import json
- import ssl
  import uuid
  import time
- import urllib.request
- import urllib3
  from datetime import datetime, timedelta
  from utils import translate, sentiment_computation, upsert_content, fetch_url, extract_from_pdf

@@ -34,7 +44,7 @@ while i > -1:
  article['titleCN'] = article['docSubtitle']
  article['title'] = translate(article['docSubtitle'])
  article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
- article['category']= "Policy Interpretation"
+ article['category']= "Policy Interpretation"
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
  article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
  article['attachment'] = ''
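The docstring added to `cbirc.py` describes the fetch → translate → score → store pipeline most collectors in this commit share, built on the `utils` helpers it imports (`fetch_url`, `translate`, `sentiment_computation`, `upsert_content`). A hedged sketch of how one article record is assembled, using only field names visible in this diff; the sample `item` stands in for one entry of the parsed CBIRC response, and the `site`, `author`, and `subtitle` values are placeholders:

```python
import uuid

from utils import translate, sentiment_computation, upsert_content

# Stand-in for one record parsed out of the CBIRC search response.
item = {
    "docSubtitle": "政策解读示例",
    "pdfFileUrl": "/cn/view/pages/example.pdf",
    "publishDate": "2024-01-05",
    "content": "示例正文",
}

article = {
    "site": "CBIRC",                        # assumed site label
    "titleCN": item["docSubtitle"],
    "title": translate(item["docSubtitle"]),
    "link": "https://www.cbirc.gov.cn" + str(item["pdfFileUrl"]),
    "publishDate": item["publishDate"],
    "category": "Policy Interpretation",
    "contentCN": item["content"],
    "content": translate(item["content"]),
    "author": "",                           # placeholder
    "subtitle": "",                         # placeholder
    "attachment": "",
}
# Deterministic id, so re-running the collector upserts instead of duplicating.
article["id"] = uuid.uuid5(uuid.NAMESPACE_OID, article["titleCN"] + article["publishDate"])
article["sentimentScore"], article["sentimentLabel"] = sentiment_computation(article["content"])
upsert_content(article)
```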
chinatax.py CHANGED
@@ -1,3 +1,16 @@
+ """
+ This script is used for data collection from the China Taxation website. It retrieves policy interpretation articles and processes them for further analysis.
+
+ The script performs the following steps:
+ 1. Imports necessary modules and libraries.
+ 2. Defines the base URL for retrieving policy interpretation articles.
+ 3. Iterates through the pages of the search results.
+ 4. Retrieves the content of each article.
+ 5. Processes the content by translating it to English and performing sentiment analysis.
+ 6. Stores the processed data in a database.
+
+ Note: The script also retrieves additional articles from a different URL and follows a similar process.
+ """
  import json
  import ssl
  import uuid
@@ -6,7 +19,7 @@ import time
  import urllib.request
  import urllib3
  from lxml import etree
- from utils import encode, translate, sentiment_computation, upsert_content, encode_content
+ from utils import translate, sentiment_computation, upsert_content, encode_content

  ssl._create_default_https_context = ssl._create_stdlib_context

csrc.py CHANGED
@@ -1,3 +1,24 @@
+ """
+ This script is used to crawl and collect data from the website of the China Securities Regulatory Commission (CSRC).
+ It retrieves policy interpretation articles and financial news articles from the CSRC website.
+ The collected data is then processed and stored in a database.
+
+ The script consists of two main parts:
+ 1. Crawl and process policy interpretation articles from the CSRC website.
+ 2. Crawl and process financial news articles from the CSRC website.
+
+ The script uses various libraries and functions to handle web scraping, data processing, and database operations.
+
+ Note: This script assumes the presence of the following dependencies:
+ - urllib
+ - lxml
+ - json
+ - datetime
+ - time
+ - utils (custom module)
+
+ Please make sure to install these dependencies before running the script.
+ """
  import uuid
  import json
  import time
@@ -35,7 +56,7 @@ while i > -1:
  article['category']= "Policy Interpretation"
  crawl(url, article)
  except Exception as error:
- print(error)
+ print(error)

  i = 1
  while i > -1:
@@ -70,4 +91,4 @@ while i > -1:
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
  upsert_content(article)
  except Exception as error:
- print(error)
+ print(error)
daily.py CHANGED
@@ -1,21 +1,21 @@
- import os
+ """
+ This script is responsible for collecting data from various websites related to financial and policy information in China.
+ It fetches data from different sources, extracts relevant information, translates it, and updates the content accordingly.
+ The collected data includes policy interpretations, financial news, macroeconomic research, and more.
+ """
  import json
- import uuid
+ import os
  import time
  import urllib.request
- from lxml import etree
+ import uuid
  from datetime import datetime, timedelta
  from urllib.parse import urlparse
- from utils import (encode,
- translate,
- sentiment_computation,
- fetch_url,
- extract_from_pdf,
- crawl,
- datemodifier,
- encode_content,
- update_content,
- extract_reference)
+
+ from lxml import etree
+
+ from utils import (crawl, datemodifier, encode, encode_content,
+ extract_from_pdf, extract_reference, fetch_url,
+ sentiment_computation, translate, update_content)

  with open('xpath.json', 'r', encoding='UTF-8') as f:
  xpath_dict = json.load(f)
@@ -50,7 +50,7 @@ while i > -1:
  article['titleCN'] = article['docSubtitle']
  article['title'] = translate(article['docSubtitle'])
  article['link'] = "https://www.cbirc.gov.cn" + str(article['pdfFileUrl'])
- article['category']= "Policy Interpretation"
+ article['category']= "Policy Interpretation"
  article['id'] = uuid.uuid5(uuid.NAMESPACE_OID, article['titleCN']+article['publishDate'])
  article['sentimentScore'], article['sentimentLabel'] = sentiment_computation(article['content'])
  article['attachment'] = ''
@@ -133,6 +133,20 @@

  print("data.eastmoney.com")
  def crawl_eastmoney(url, article):
+ """
+ Crawls the given URL and extracts information from the webpage.
+
+ Args:
+ url (str): The URL of the webpage to crawl.
+ article (dict): A dictionary to store the extracted information.
+
+ Returns:
+ None: If the length of the extracted content is less than 10 characters.
+
+ Raises:
+ None.
+
+ """
  domain = urlparse(url).netloc
  req = urllib.request.urlopen(url)
  text = req.read()
@@ -499,4 +513,4 @@ while i > -1:
  article['category']= "Data Interpretation"
  crawl(url, article)
  except Exception as error:
- print(error)
+ print(error)
eastmoney.py CHANGED
@@ -1,3 +1,9 @@
+ """
+ This script is used to crawl a webpage and extract relevant information from it. It defines a function `crawl` that takes a URL and a dictionary to store the extracted information. The function crawls the webpage, extracts the content, translates it to English, and stores it in the dictionary.
+
+ The script also includes a main loop that fetches data from a specific URL and calls the `crawl` function for each article in the fetched data.
+ """
+
  import uuid
  import json
  import urllib.request
@@ -6,10 +12,26 @@ from datetime import datetime, timedelta
  from lxml import etree
  from utils import encode, translate, datemodifier, sentiment_computation, upsert_content, fetch_url, encode_content

+ # Load XPath dictionary from a JSON file
  with open('xpath.json', 'r', encoding='UTF-8') as f:
  xpath_dict = json.load(f)

  def crawl(url, article):
+ """
+ Crawls the given URL and extracts relevant information from the webpage.
+
+ Args:
+ url (str): The URL of the webpage to crawl.
+ article (dict): A dictionary to store the extracted information.
+
+ Returns:
+ None: If the length of the extracted content is less than 10 characters.
+ str: The extracted content in English if successful.
+
+ Raises:
+ None
+
+ """
  domain = urlparse(url).netloc
  req = urllib.request.urlopen(url)
  text = req.read()
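Both `eastmoney.py` and `daily.py` resolve their extraction rules from `xpath.json`, keyed by the article's domain (`urlparse(url).netloc`). A minimal sketch of that lookup-and-extract step with `lxml`; the URL is a placeholder and the field names inside `xpath.json` (`"title"`, `"content"`) are assumptions for illustration:

```python
import json
import urllib.request
from urllib.parse import urlparse

from lxml import etree

with open('xpath.json', 'r', encoding='UTF-8') as f:
    xpath_dict = json.load(f)

url = "https://finance.eastmoney.com/a/placeholder.html"  # placeholder article URL
domain = urlparse(url).netloc

req = urllib.request.urlopen(url)
page = etree.HTML(req.read())

# Assumed layout: xpath_dict[domain] maps field names to XPath expressions.
title_nodes = page.xpath(xpath_dict[domain]["title"])
content_nodes = page.xpath(xpath_dict[domain]["content"])
```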
glue.py CHANGED
@@ -6,7 +6,11 @@ AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
  AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']

  def get_client_connection():
- """Get dynamoDB connection"""
+ """
+ Returns a client connection to the AWS Glue service.
+
+ :return: AWS Glue client connection
+ """
  return boto3.client(
  service_name='glue',
  region_name='us-east-1',
@@ -22,4 +26,4 @@ print(response)
  response = glue.start_job_run(
  JobName='Reference China'
  )
- print(response)
+ print(response)
gov.py CHANGED
@@ -1,3 +1,23 @@
+ """
+ This script is used to crawl and collect policy articles from the official website of the State Council of China (https://www.gov.cn).
+
+ The script contains two main functions:
+ 1. crawl(url, article): This function is responsible for crawling a specific policy article given its URL and extracting relevant information such as title, author, content, publish date, etc.
+ 2. main(): This function is the entry point of the script. It iterates over different pages of policy articles and calls the crawl function to collect the information.
+
+ Note: The script imports the following modules: datetime, timedelta, time, urllib.request, lxml.etree, and utils (custom module).
+ """
+
+ from datetime import datetime, timedelta
+ import time
+ import urllib.request
+ from lxml import etree
+ from utils import crawl
+
+ # Rest of the code...
+ """
+
+ """
  from datetime import datetime, timedelta
  import time
  import urllib.request
manual_upload.py CHANGED
@@ -1,7 +1,27 @@
- from decimal import Decimal
- from utils import translate, sentiment_computation, get_db_connection
- from datetime import datetime
+ """
+ This script allows the user to manually upload an article to a database. It prompts the user to enter various details about the article, such as the title, content, subtitle, publish date, link, and site. It then computes the sentiment of the article's translated content and constructs a dictionary representing the article. Finally, it inserts or updates the article in the database.
+
+ Dependencies:
+ - decimal
+ - utils (custom module)
+ - datetime
+ - uuid
+
+ Usage:
+ 1. Run the script.
+ 2. Enter the required details about the article when prompted.
+ 3. The script will compute the sentiment of the translated content and construct a dictionary representing the article.
+ 4. The article will be inserted or updated in the database.
+ 5. The article dictionary and the response from the database operation will be printed.
+
+ Note: Make sure to configure the database connection and table name before running the script.
+ """
+
  import uuid
+ from datetime import datetime
+ from decimal import Decimal
+
+ from utils import get_db_connection, sentiment_computation, translate

  # User input for the article content
  article_titleCN = input("Enter the title of the article: ")
@@ -12,7 +32,6 @@ article_publish_date = input("Enter the publish date of the article (YYYY-MM-DD)
  article_link = input("Enter the link to the article: ")
  article_siteCN = input("Enter the site of the article: ")

-
  # Compute sentiment of the translated content
  sentiment_score, sentiment_label = sentiment_computation(article_contentCN)

@@ -30,8 +49,6 @@ article= {
  'publishDate': article_publish_date,
  'link': article_link,
  'attachment': '',
- # 'authorID': str(report['authorid']),
- # 'entityList': report['entitylist'],
  'sentimentScore': Decimal(str(sentiment_score)).quantize(Decimal('0.01')),
  'sentimentLabel': sentiment_label,
  'LastModifiedDate': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
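manual_upload.py stores the sentiment score as a `Decimal` quantized to two places, since boto3 writes DynamoDB number attributes as `Decimal` rather than `float`. A quick illustration of the conversion used above:

```python
from decimal import Decimal

sentiment_score = 0.1234
# str() first avoids binary-float noise; quantize rounds to two decimal places.
print(Decimal(str(sentiment_score)).quantize(Decimal('0.01')))  # Decimal('0.12')
```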
mof.py CHANGED
@@ -1,9 +1,22 @@
+ """
+ This script is used to crawl and collect financial news and policy interpretation articles from the website of the Ministry of Finance of China (https://www.mof.gov.cn/).
+
+ The script iterates through the pages of the "Financial News" and "Policy Interpretation" categories on the website and extracts the articles' URLs. It then calls the `crawl` function from the `utils` module to crawl and collect the article data.
+
+ The script uses the `lxml` library to parse the HTML content of the website and extract the necessary information.
+
+ Note: The script assumes the existence of a `crawl` function in the `utils` module.
+ """
+
  import time
  import urllib.request
- from lxml import etree
  from datetime import datetime, timedelta
+
+ from lxml import etree
+
  from utils import crawl

+ # Crawl Financial News articles
  i = 0
  while i > -1:
  if i == 0:
@@ -38,6 +51,7 @@ while i > -1:
  except Exception as error:
  print(error)

+ # Crawl Policy Interpretation articles
  i = 0
  while i > -1:
  if i == 0:
mofcom.py CHANGED
@@ -1,3 +1,9 @@
+ """
+ This script is used to crawl and collect data from the Ministry of Commerce of the People's Republic of China (MOFCOM) website.
+ It retrieves articles from different categories and extracts relevant information such as date and URL.
+ The collected data is then passed to the 'crawl' function for further processing.
+ """
+
  import time
  import urllib.request
  from datetime import datetime, timedelta
ndrc.py CHANGED
@@ -1,5 +1,17 @@
+ """
+ This script is used to crawl and collect data from the National Development and Reform Commission (NDRC) website.
+ It retrieves articles from the website and categorizes them as either "Policy Release" or "Policy Interpretation".
+ The script starts by iterating through the pages of the website, starting from the first page.
+ For each page, it retrieves the HTML content and parses it using the lxml library.
+ It then extracts the article list from the parsed HTML and iterates through each article.
+ For each article, it extracts the publication date, converts it to a datetime object, and checks if it is within the last 183 days.
+ If the article is older than 183 days, the script stops iterating through the pages.
+ Otherwise, it extracts the URL of the article and categorizes it based on the URL pattern.
+ The script then calls the 'crawl' function from the 'utils' module to crawl the article and collect data.
+ Any exceptions that occur during the crawling process are caught and printed.
+ """
+
  from datetime import datetime, timedelta
- import uuid
  import time
  import urllib.request
  from lxml import etree
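ndrc.py, safe.py, and stats.py all rely on the same stop condition: keep paging while every article on the page was published within the last 183 days (roughly six months). A hedged sketch of that loop skeleton; the date format and sample dates are illustrative, not copied from ndrc.py:

```python
from datetime import datetime, timedelta

CUTOFF = datetime.today() - timedelta(days=183)

def within_window(date_text, fmt="%Y/%m/%d"):
    """True if the scraped publish date still falls inside the 183-day window."""
    return datetime.strptime(date_text, fmt) >= CUTOFF

i = 0
while i > -1:
    # Stand-in for the publish dates scraped from page i (listings are newest-first).
    dates = ["2025/06/01", "2024/01/01"]
    if dates and all(within_window(d) for d in dates):
        i += 1   # every article is recent enough; fetch the next page
    else:
        break    # hit an article older than the cutoff; stop crawling
```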
patterns.json CHANGED
@@ -275,7 +275,13 @@
  "keyword": "相关研究",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": "——",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "Yongxing Securities Co., Ltd.",
@@ -293,16 +299,28 @@
  "keyword": "相关研究",
  "article_regex": "《(.*?)》",
  "date_regex": "(d{4}\\s/\\d{2}/\\d{2}) ",
- "date_format": "(%Y/%m/%d) "
+ "date_format": "(%Y/%m/%d) ",
+ "split":[
+ {
+ "string": "——",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "Hualong Securities Co., Ltd.",
  "pages": [0],
- "date_range": 1,
+ "date_range": 5,
  "keyword": "相关阅读",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": "——",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "Hebei Yuanda Information Technology Co., Ltd.",
@@ -311,7 +329,13 @@
  "keyword": "相关报告:",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Huaxin Securities Co., Ltd.",
@@ -329,7 +353,13 @@
  "keyword": "1.",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": "——",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "Beijing Tengjing Big Data Application Technology Research Institute",
@@ -338,7 +368,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}-\\d{2}-\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Wanhe Securities Co., Ltd.",
@@ -347,7 +383,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}-\\d{2}-\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": "-",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Centaline Securities Co., Ltd.",
@@ -356,7 +398,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}\\s?-\\s?\\d{2}\\s?-\\s?\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Tengjing Digital Research",
@@ -365,7 +413,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}\\s?-\\s?\\d{2}\\s?-\\s?\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "Guoyuan Securities",
@@ -374,7 +428,13 @@
  "keyword": "相关研究报告",
  "article_regex": "《(.*?)》",
  "date_regex": "\\d{4}\\s?.\\s?\\d{1,2}\\s?.\\s?\\d{1,2}",
- "date_format": "%Y.%m.%d"
+ "date_format": "%Y.%m.%d",
+ "split":[
+ {
+ "string": ":",
+ "index": -1
+ }
+ ]
  },
  {
  "site": "China Galaxy Co., Ltd.",
@@ -392,7 +452,13 @@
  "keyword": "相关报告",
  "article_regex": "《(.*?)》",
  "date_regex": "(\\d{4}\\s?-\\s?\\d{2}\\s?-\\s?\\d{2})",
- "date_format": "%Y-%m-%d"
+ "date_format": "%Y-%m-%d",
+ "split":[
+ {
+ "string": ":",
+ "index": 0
+ }
+ ]
  },
  {
  "site": "SDIC Anxin Futures",
pbc.py CHANGED
@@ -1,3 +1,18 @@
+ """
+ This module contains code to scrape the People's Bank of China website and collect policy interpretation articles. It iterates through the pages of the website, extracts relevant information from each article, and stores the data in a database.
+
+ The main functionality of this module includes:
+ - Scraping the website for policy interpretation articles
+ - Parsing the HTML content of each article
+ - Extracting relevant information such as title, content, publish date, and URL
+ - Translating the content from Chinese to English
+ - Computing sentiment scores for the content
+ - Storing the collected data in a database
+
+ Note: This code assumes the existence of the following helper functions: encode, translate, datemodifier, sentiment_computation, and upsert_content.
+
+ """
+
  import time
  import uuid
  from datetime import datetime, timedelta
safe.py CHANGED
@@ -1,9 +1,30 @@
+ """Module to crawl the data from the website of State Administration of Foreign Exchange (SAFE) of China.
+
+ This module contains code to crawl and collect data from the website of the State Administration of Foreign Exchange (SAFE) of China. It includes two sections: Policy Interpretation and Data Interpretation.
+
+ Policy Interpretation:
+ - The code crawls the web pages containing policy interpretations from the SAFE website.
+ - It retrieves the publication date and checks if it is within the last 183 days.
+ - If the publication date is within the last 183 days, it extracts the URL and other information of the policy interpretation article.
+ - The extracted data is stored in a dictionary and passed to the 'crawl' function for further processing.
+
+ Data Interpretation:
+ - The code crawls the web pages containing data interpretations from the SAFE website.
+ - It retrieves the publication date and checks if it is within the last 183 days.
+ - If the publication date is within the last 183 days, it extracts the URL and other information of the data interpretation article.
+ - The extracted data is stored in a dictionary and passed to the 'crawl' function for further processing.
+
+ Note: The 'crawl' function is imported from the 'utils' module.
+
+ """
+
  import time
  import urllib.request
  from datetime import datetime, timedelta
  from lxml import etree
  from utils import crawl

+ # Policy Interpretation
  i = 1
  while i > -1:
  if i == 1:
@@ -35,6 +56,7 @@ while i > -1:
  except Exception as error:
  print(error)

+ # Data Interpretation
  i = 1
  while i > -1:
  if i == 1:
@@ -64,4 +86,4 @@ while i > -1:
  article['category']= "Data Interpretation"
  crawl(url, article)
  except Exception as error:
- print(error)
+ print(error)
stats.py CHANGED
@@ -1,4 +1,18 @@
- import uuid
+ """
+ This script is used to crawl data from the website https://www.stats.gov.cn/sj/sjjd/.
+ It retrieves articles from the website and extracts relevant information from each article.
+
+ The script starts by iterating over the pages of the website, starting from the first page.
+ For each page, it retrieves the HTML content and parses it using the lxml library.
+ It then extracts the list of articles from the parsed HTML.
+ For each article, it extracts the publication date and checks if it is within the last 6 months.
+ If the article is within the last 6 months, it extracts the URL and crawls the article to extract additional information.
+
+ The extracted information is stored in a dictionary and can be further processed or saved as needed.
+
+ Note: This script requires the 'utils' module, which contains the 'encode' and 'crawl' functions.
+ """
+
  import time
  import urllib.request
  from datetime import datetime, timedelta
@@ -34,4 +48,4 @@ while i > -1:
  article['category']= "Data Interpretation"
  crawl(url, article)
  except Exception as error:
- print(error)
+ print(error)
utils.py CHANGED
@@ -1,4 +1,4 @@
- """Utilis Functions"""
+ """Module to define utility function"""
  import os
  import re
  import json
@@ -32,7 +32,11 @@ with open('patterns.json', 'r', encoding='UTF-8') as f:
  patterns = json.load(f)

  def get_client_connection():
- """Get dynamoDB connection"""
+ """
+ Returns a client connection to DynamoDB.
+
+ :return: DynamoDB client connection
+ """
  dynamodb = boto3.client(
  service_name='dynamodb',
  region_name='us-east-1',
@@ -42,6 +46,15 @@ def get_client_connection():
  return dynamodb

  def update_reference(report):
+ """
+ Updates the reference in the 'reference_china' table in DynamoDB.
+
+ Args:
+ report (dict): A dictionary containing the report details.
+
+ Returns:
+ None
+ """
  dynamodb = get_client_connection()
  response = dynamodb.update_item(
  TableName="reference_china",
@@ -59,7 +72,15 @@ def update_reference(report):
  print(response)

  def download_files_from_s3(folder):
- """Download Data Files"""
+ """
+ Downloads Parquet files from an S3 bucket and returns a concatenated DataFrame.
+
+ Args:
+ folder (str): The folder in the S3 bucket to download files from.
+
+ Returns:
+ pandas.DataFrame: A concatenated DataFrame containing the data from the downloaded Parquet files.
+ """
  if not os.path.exists(folder):
  os.makedirs(folder)
  client = boto3.client(
@@ -76,6 +97,20 @@ def download_files_from_s3(folder):
  return pd.concat([pd.read_parquet(file_path) for file_path in file_paths], ignore_index=True)

  def extract_from_pdf_by_pattern(url, pattern):
+ """
+ Extracts text from a PDF file based on a given pattern.
+
+ Args:
+ url (str): The URL of the PDF file to extract text from.
+ pattern (dict): A dictionary containing the pattern to match and the pages to extract text from.
+
+ Returns:
+ str: The extracted text from the PDF file.
+
+ Raises:
+ Exception: If there is an error while retrieving or processing the PDF file.
+
+ """
  # Send a GET request to the URL and retrieve the PDF content
  try:
  response = requests.get(url, timeout=60)
@@ -104,15 +139,44 @@ def extract_from_pdf_by_pattern(url, pattern):
  return extracted_text.replace('?\n', '?-\n').replace('!\n', '!-\n').replace('。\n', '。-\n').replace('\n',' ').replace('?-','?\n').replace('!-','!\n').replace('。-','。\n')

  def get_reference_by_regex(pattern, text):
+ """
+ Finds all occurrences of a given regex pattern in the provided text.
+
+ Args:
+ pattern (str): The regex pattern to search for.
+ text (str): The text to search within.
+
+ Returns:
+ list: A list of all matches found in the text.
+ """
  return re.findall(pattern, text)

  def isnot_substring(list_a, string_to_check):
+ """
+ Check if any string in the given list is a substring of the string_to_check.
+
+ Args:
+ list_a (list): A list of strings to check.
+ string_to_check (str): The string to check for substrings.
+
+ Returns:
+ bool: True if none of the strings in list_a are substrings of string_to_check, False otherwise.
+ """
  for s in list_a:
  if s in string_to_check:
  return False
  return True

  def extract_reference(row):
+ """
+ Extracts reference information from a given row.
+
+ Args:
+ row (dict): A dictionary representing a row of data.
+
+ Returns:
+ None
+ """
  try:
  pattern = next((elem for elem in patterns if elem['site'] == row['site']), None)
  extracted_text = extract_from_pdf_by_pattern(row['attachment'],pattern)
@@ -189,10 +253,31 @@ def extract_reference(row):
  print(error)

  def translate(text):
+ """
+ Translates the given text to English.
+
+ Args:
+ text (str): The text to be translated.
+
+ Returns:
+ str: The translated text in English.
+ """
  return translator.translate(text, dest='en').text

  def datemodifier(date_string, date_format):
- """Date Modifier Function"""
+ """Date Modifier Function
+
+ This function takes a date string and a date format as input and modifies the date string
+ according to the specified format. It returns the modified date string in the format 'YYYY-MM-DD'.
+
+ Args:
+ date_string (str): The date string to be modified.
+ date_format (str): The format of the date string.
+
+ Returns:
+ str: The modified date string in the format 'YYYY-MM-DD'.
+ False: If an error occurs during the modification process.
+ """
  try:
  to_date = time.strptime(date_string,date_format)
  return time.strftime("%Y-%m-%d",to_date)
@@ -200,20 +285,51 @@ def datemodifier(date_string, date_format):
  return False

  def fetch_url(url):
- response = requests.get(url, timeout = 60)
+ """
+ Fetches the content of a given URL.
+
+ Args:
+ url (str): The URL to fetch.
+
+ Returns:
+ str or None: The content of the URL if the request is successful (status code 200),
+ otherwise None.
+
+ Raises:
+ requests.exceptions.RequestException: If there is an error while making the request.
+
+ """
+ response = requests.get(url, timeout=60)
  if response.status_code == 200:
  return response.text
  else:
  return None

  def translist(infolist):
- """Translist Function"""
+ """
+ Filter and transform a list of strings.
+
+ Args:
+ infolist (list): The input list of strings.
+
+ Returns:
+ list: The filtered and transformed list of strings.
+ """
  out = list(filter(lambda s: s and
- (isinstance (s,str) or len(s.strip()) > 0), [i.strip() for i in infolist]))
+ (isinstance(s, str) or len(s.strip()) > 0), [i.strip() for i in infolist]))
  return out

  def encode(content):
- """Encode Function"""
+ """
+ Encodes the given content into a single string.
+
+ Args:
+ content (list): A list of elements to be encoded. Each element can be either a string or an `etree._Element` object.
+
+ Returns:
+ str: The encoded content as a single string.
+
+ """
  text = ''
  for element in content:
  if isinstance(element, etree._Element):
@@ -228,7 +344,16 @@ def encode(content):
  return text

  def encode_content(content):
- """Encode Function"""
+ """
+ Encodes the content by removing unnecessary characters and extracting a summary.
+
+ Args:
+ content (list): A list of elements representing the content.
+
+ Returns:
+ tuple: A tuple containing the encoded text and the summary.
+
+ """
  text = ''
  for element in content:
  if isinstance(element, etree._Element):
@@ -252,6 +377,18 @@ def encode_content(content):
  return text, summary

  def extract_from_pdf(url):
+ """
+ Extracts text from a PDF file given its URL.
+
+ Args:
+ url (str): The URL of the PDF file.
+
+ Returns:
+ tuple: A tuple containing the extracted text and a summary of the text.
+
+ Raises:
+ Exception: If there is an error during the extraction process.
+ """
  # Send a GET request to the URL and retrieve the PDF content
  response = requests.get(url, timeout=60)
  pdf_content = response.content
@@ -281,16 +418,30 @@ def extract_from_pdf(url):
  return extracted_text, summary

  def get_db_connection():
- """Get dynamoDB connection"""
+ """Get dynamoDB connection.
+
+ Returns:
+ boto3.resource: The DynamoDB resource object representing the connection.
+ """
  dynamodb = boto3.resource(
- service_name='dynamodb',
- region_name='us-east-1',
- aws_access_key_id=AWS_ACCESS_KEY_ID,
- aws_secret_access_key=AWS_SECRET_ACCESS_KEY
+ service_name='dynamodb',
+ region_name='us-east-1',
+ aws_access_key_id=AWS_ACCESS_KEY_ID,
+ aws_secret_access_key=AWS_SECRET_ACCESS_KEY
  )
  return dynamodb

  def sentiment_computation(content):
+ """
+ Compute the sentiment score and label for the given content.
+
+ Parameters:
+ content (str): The content for which sentiment needs to be computed.
+
+ Returns:
+ tuple: A tuple containing the sentiment score and label. The sentiment score is a float representing the overall sentiment score of the content. The sentiment label is a string representing the sentiment label ('+', '-', or '0').
+
+ """
  label_dict = {
  "positive": "+",
  "negative": "-",
@@ -314,6 +465,20 @@ def sentiment_computation(content):
  return sentiment_score, label_dict[sentiment_label]

  def crawl(url, article):
+ """
+ Crawls the given URL and extracts relevant information from the webpage.
+
+ Args:
+ url (str): The URL of the webpage to crawl.
+ article (dict): A dictionary to store the extracted information.
+
+ Returns:
+ None: If the length of the extracted content is less than 10 characters.
+
+ Raises:
+ None
+
+ """
  domain = '.'.join(urlparse(url).netloc.split('.')[1:])
  req = urllib.request.urlopen(url)
  text = req.read()
@@ -351,10 +516,18 @@ def crawl(url, article):
  update_content(article)

  def upsert_content(report):
- """Upsert the content records"""
+ """
+ Upserts the content of a report into the 'article_china' table in DynamoDB.
+
+ Args:
+ report (dict): A dictionary containing the report data.
+
+ Returns:
+ dict: The response from the DynamoDB put_item operation.
+ """
  dynamodb = get_db_connection()
  table = dynamodb.Table('article_china')
- # Define the item data
+ # Define the item data
  item = {
  'id': str(report['id']),
  'site': report['site'],
@@ -377,54 +550,71 @@ def upsert_content(report):
  response = table.put_item(Item=item)
  print(response)

- # def get_client_connection():
- # """Get dynamoDB connection"""
- # dynamodb = boto3.client(
- # service_name='dynamodb',
- # region_name='us-east-1',
- # aws_access_key_id=AWS_ACCESS_KEY_ID,
- # aws_secret_access_key=AWS_SECRET_ACCESS_KEY
- # )
- # return dynamodb
-
  def delete_records(item):
+ """
+ Deletes a record from the 'article_test' table in DynamoDB.
+
+ Args:
+ item (dict): The item to be deleted, containing 'id' and 'site' keys.
+
+ Returns:
+ None
+ """
  dynamodb_client = get_client_connection()
  dynamodb_client.delete_item(
- TableName="article_test",
- Key={
- 'id': {'S': item['id']},
- 'site': {'S': item['site']}
- }
- )
+ TableName="article_test",
+ Key={
+ 'id': {'S': item['id']},
+ 'site': {'S': item['site']}
+ }
+ )

  def update_content(report):
+ """
+ Updates the content of an article in the 'article_china' table in DynamoDB.
+
+ Args:
+ report (dict): A dictionary containing the report data.
+
+ Returns:
+ None
+ """
  dynamodb = get_client_connection()
  response = dynamodb.update_item(
- TableName="article_china",
- Key={
- 'id': {'S': str(report['id'])},
- 'site': {'S': report['site']}
- },
- UpdateExpression='SET title = :title, titleCN = :titleCN, contentCN = :contentCN, category = :category, author = :author, content = :content, subtitle = :subtitle, publishDate = :publishDate, link = :link, attachment = :attachment, sentimentScore = :sentimentScore, sentimentLabel = :sentimentLabel, LastModifiedDate = :LastModifiedDate',
- ExpressionAttributeValues={
- ':title': {'S': report['title']},
- ':titleCN': {'S': report['titleCN']},
- ':contentCN': {'S': report['contentCN']},
- ':category': {'S': report['category']},
- ':author': {'S': report['author']},
- ':content': {'S': report['content']},
- ':subtitle': {'S': report['subtitle']},
- ':publishDate': {'S': report['publishDate']},
- ':link': {'S': report['link']},
- ':attachment': {'S': report['attachment']},
- ':LastModifiedDate': {'S': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")},
- ':sentimentScore': {'N': str(Decimal(str(report['sentimentScore'])).quantize(Decimal('0.01')))},
- ':sentimentLabel': {'S': report['sentimentLabel']}
- }
- )
+ TableName="article_china",
+ Key={
+ 'id': {'S': str(report['id'])},
+ 'site': {'S': report['site']}
+ },
+ UpdateExpression='SET title = :title, titleCN = :titleCN, contentCN = :contentCN, category = :category, author = :author, content = :content, subtitle = :subtitle, publishDate = :publishDate, link = :link, attachment = :attachment, sentimentScore = :sentimentScore, sentimentLabel = :sentimentLabel, LastModifiedDate = :LastModifiedDate',
+ ExpressionAttributeValues={
+ ':title': {'S': report['title']},
+ ':titleCN': {'S': report['titleCN']},
+ ':contentCN': {'S': report['contentCN']},
+ ':category': {'S': report['category']},
+ ':author': {'S': report['author']},
+ ':content': {'S': report['content']},
+ ':subtitle': {'S': report['subtitle']},
+ ':publishDate': {'S': report['publishDate']},
+ ':link': {'S': report['link']},
+ ':attachment': {'S': report['attachment']},
+ ':LastModifiedDate': {'S': datetime.now().strftime("%Y-%m-%dT%H:%M:%S")},
+ ':sentimentScore': {'N': str(Decimal(str(report['sentimentScore'])).quantize(Decimal('0.01')))},
+ ':sentimentLabel': {'S': report['sentimentLabel']}
+ }
+ )
  print(response)

  def update_content_sentiment(report):
+ """
+ Updates the sentiment score and label of an article in the 'article_test' DynamoDB table.
+
+ Args:
+ report (dict): A dictionary containing the report information.
+
+ Returns:
+ None
+ """
  dynamodb = get_client_connection()
  response = dynamodb.update_item(
  TableName="article_test",
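The docstrings added to utils.py above document several small, pure helpers; a quick usage sketch of two of them, with return values following those docstrings:

```python
from utils import datemodifier, isnot_substring

# Normalize a source-specific date string to YYYY-MM-DD.
print(datemodifier("2024.01.05", "%Y.%m.%d"))   # "2024-01-05"
print(datemodifier("not a date", "%Y.%m.%d"))   # False on a parse failure

# True only when none of the given strings occur in the candidate.
print(isnot_substring(["政策", "解读"], "宏观数据点评"))  # True
```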