Muhammad Abdur Rahman Saad
committed on
Update README.md
README.md
CHANGED
@@ -1,65 +1,64 @@
-In summary, `glue.py` sets up an environment and starts a Glue job to create a Parquet snapshot.
# Security Report Collection

The Security Report Collection repository contains Python scripts that automate the collection, processing, and storage of financial and policy data from Chinese government and financial websites. This data helps track changes in policy, financial news, and regulatory measures that could affect markets and investments.

## Repository Structure

- **Python Scripts**: Each script is tailored to specific sources and tasks, ranging from data scraping to sentiment analysis and database operations.
- **GitHub Workflows**: Automated workflows that run the Python scripts on a schedule or in response to specific events. Every script has a workflow except `utils.py` and `manual_upload.py`.
- **requirements.txt**: Lists all Python dependencies required for the scripts to run.

## Python Scripts Overview

Each script targets different data sources or handles distinct aspects of data management:

### Data Collection Scripts

1. **CBIRC, Chinatax, CSRCV, Daily, Eastmoney, Glue, Gov, Manual_Upload, MOF, MOFCOM, PBC, SAFE, Stats**:
   - These scripts scrape data from their respective websites, handling tasks such as extracting article URLs, downloading articles, translating content, and calculating sentiment scores.
   - They use utilities provided by `utils.py` to interact with databases, manage files, and perform translations and sentiment analysis.

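The first step named above, extracting article URLs from a listing page, can be sketched as follows. This is a minimal illustration rather than the repository's code: the names `ArticleLinkParser` and `extract_article_urls` and the `.html` suffix filter are assumptions, and it uses only the standard library so it runs on its own, whereas the real scripts likely rely on `requests` and `lxml` from `requirements.txt`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleLinkParser(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the listing page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_article_urls(listing_html, base_url, suffix=".html"):
    """Return de-duplicated article URLs in page order.

    The suffix filter is an assumption; each site needs its own rule.
    """
    parser = ArticleLinkParser(base_url)
    parser.feed(listing_html)
    seen = set()
    urls = []
    for url in parser.links:
        if url.endswith(suffix) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Downloading, translation, and sentiment scoring would then run once per returned URL, presumably through the helpers in `utils.py`.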
### Utility Scripts

- **utils.py**:
  - A central utility script that supports database operations, file handling, content translation, and other functionality shared across the scripts.
  - It includes custom functions for working with AWS DynamoDB, handling PDFs, fetching URLs, and more.

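As an illustration of the kind of DynamoDB helper described above, the sketch below builds one article record. It is not taken from `utils.py`: the function name `build_article_item` and every attribute name (`id`, `url`, `title`, `content`, `site`, `publishDate`) are hypothetical, and the real table schema may differ.

```python
import uuid
from datetime import datetime, timezone


def build_article_item(url, title, content, site, published=None):
    """Build a dict that could be passed to a DynamoDB put_item call.

    All attribute names are assumptions made for this example.
    """
    if published is None:
        # Default to today's date in UTC, formatted as YYYY-MM-DD.
        published = datetime.now(timezone.utc).date().isoformat()
    return {
        # uuid5 is deterministic: the same URL always yields the same id.
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, url)),
        "url": url,
        "title": title.strip(),
        "content": content.strip(),
        "site": site,
        "publishDate": published,
    }
```

With `boto3`, such an item could be written using `boto3.resource("dynamodb").Table(table_name).put_item(Item=item)`, where `table_name` is whatever table the scripts expect. If `id` were the table's partition key, re-running a scraper would overwrite an existing record rather than create a duplicate.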
### Special Scripts

- **manual_upload.py**:
  - Allows manual data entry into the database, facilitating the addition of articles not captured through automated scripts.
  - Provides a command-line interface for inputting article details and saving them to DynamoDB.

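One way such a command-line interface could look is sketched below with `argparse`. The flag names (`--title`, `--url`, `--site`, `--date`) and the function names are assumptions for illustration; the actual `manual_upload.py` may prompt for values interactively instead.

```python
import argparse
from datetime import datetime


def valid_date(text):
    """Accept only YYYY-MM-DD dates (format chosen for this example)."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {text!r}")
    return text


def build_parser():
    parser = argparse.ArgumentParser(description="Manually add an article record.")
    parser.add_argument("--title", required=True, help="article title")
    parser.add_argument("--url", required=True, help="link to the article")
    parser.add_argument("--site", required=True, help="source name")
    parser.add_argument("--date", required=True, type=valid_date,
                        help="publication date, YYYY-MM-DD")
    return parser


def parse_article(argv=None):
    """Parse command-line arguments into a plain dict of article details."""
    return vars(build_parser().parse_args(argv))
```

The returned dict would then be saved to DynamoDB by whatever helper the script uses for that purpose.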
## GitHub Workflows

- Automated workflows are set up for all Python scripts except `utils.py` and `manual_upload.py`.
- These workflows ensure that data collection and processing tasks are executed periodically or in response to specific triggers, maintaining an up-to-date database.

## Requirements

- The `requirements.txt` file includes all necessary Python packages such as `boto3`, `lxml`, `requests`, `pandas`, `PyPDF2`, and others. Install these packages using:
  ```pip install -r requirements.txt```

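After installation, a quick check like the one below can confirm that the packages named above are importable. It is a small sketch, not part of the repository: `REQUIRED_MODULES` covers only the five packages this README names, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Import names of the packages this README mentions; requirements.txt
# may list more.
REQUIRED_MODULES = ["boto3", "lxml", "requests", "pandas", "PyPDF2"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are installed.")
```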
## Setup and Configuration

1. **AWS Configuration**:
   - Ensure AWS credentials are correctly configured for access to services like S3 and DynamoDB.
   - Set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

2. **Database Setup**:
   - The scripts assume that specific DynamoDB tables already exist. Create the required tables in AWS DynamoDB to match what each script expects.

3. **Python Environment**:
   - It's recommended to set up a virtual environment for Python to manage dependencies:
   ```
   python -m venv venv
   source venv/bin/activate  # On Unix/macOS
   venv\Scripts\activate  # On Windows
   ```

4. **Running Scripts**:
   - To run a script manually, navigate to the script’s directory and execute:
   ```
   python <script_name>.py
   ```
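
Before running any script, the AWS settings from step 1 can be verified with a short pre-flight check. The sketch below is illustrative only, and `missing_aws_env` is a hypothetical helper. It checks just the two environment variables this README names, although `boto3` can also read credentials from other sources such as a shared credentials file.

```python
import os

REQUIRED_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("AWS credential variables are set.")
```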