Muhammad Abdur Rahman Saad committed
Commit f024a59 · 1 Parent(s): 82a33ed

Update README.md

Files changed (1)
  1. README.md +38 -64
README.md CHANGED
@@ -1,64 +1,38 @@
- # Security Report Collection
-
- The Security Report Collection repository contains a series of Python scripts designed to automate the collection, processing, and storage of financial and policy data from various Chinese government and financial websites. This data is vital for understanding changes in policy, financial news, and regulatory measures that could impact markets and investments.
-
- ## Repository Structure
-
- - **Python Scripts**: Each script is tailored to specific sources and tasks, ranging from data scraping to sentiment analysis and database operations.
- - **GitHub Workflows**: Automated workflows to execute the Python scripts on a schedule or trigger specific events, excluding `utils.py` and `manual_upload.py`.
- - **requirements.txt**: Lists all Python dependencies required for the scripts to run.
-
- ## Python Scripts Overview
-
- Each script targets different data sources or handles distinct aspects of data management:
-
- ### Data Collection Scripts
-
- 1. **CBIRC, Chinatax, CSRCV, Daily, Eastmoney, Glue, Gov, Manual_Upload, MOF, MOFCOM, PBC, SAFE, Stats**:
-    - These scripts scrape data from their respective websites, handling tasks such as extracting article URLs, downloading articles, translating content, and calculating sentiment scores.
-    - They use utilities provided by `utils.py` to interact with databases, manage files, and perform translations and sentiment analysis.
-
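To make the shared flow of these collection scripts concrete, here is a minimal sketch of the scrape, translate, and score sequence described above. Every helper in it is an illustrative stub, not the repository's actual `utils.py` API; `requests` and `lxml` are taken from the dependency list in the old `requirements.txt`.

```python
# Schematic sketch of one collection script's flow; all helpers here are
# illustrative stubs, not the repository's actual utils.py functions.
import requests
from lxml import html

def get_article_urls(index_url: str) -> list[str]:
    """Fetch a listing page and extract candidate article links."""
    page = requests.get(index_url, timeout=30)
    page.raise_for_status()
    # Real scripts would use a source-specific XPath, not //a/@href.
    return html.fromstring(page.text).xpath("//a/@href")

def translate(text: str) -> str:
    return text  # stub: the real helper calls a translation service

def sentiment_score(text: str) -> float:
    return 0.0  # stub: the real helper runs sentiment analysis

def process_source(index_url: str) -> None:
    for url in get_article_urls(index_url):
        article = requests.get(url, timeout=30).text
        translated = translate(article)
        score = sentiment_score(translated)
        print(url, score)  # stub: the real scripts persist to DynamoDB
```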
- ### Utility Scripts
-
- - **utils.py**:
-   - A central utility script that supports database operations, file handling, content translation, and other shared functionalities across various scripts.
-   - It includes custom functions for working with AWS DynamoDB, handling PDFs, fetching URLs, and more.
-
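As one concrete illustration of the kind of helpers described here, the sketch below shows URL fetching and PDF text extraction with `requests` and `PyPDF2` (both listed in `requirements.txt`). The names and signatures are assumptions, not the actual `utils.py` interface.

```python
# Sketch of two utility helpers of the kind described above; names and
# signatures are assumptions rather than the actual utils.py API.
import io

import requests
from PyPDF2 import PdfReader  # PdfReader is the PyPDF2 >= 2.x class

def fetch_url(url: str, timeout: int = 30) -> bytes:
    """Download a URL and return the raw response body."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content

def pdf_to_text(pdf_bytes: bytes) -> str:
    """Extract plain text from an in-memory PDF document."""
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```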
- ### Special Scripts
-
- - **manual_upload.py**:
-   - Allows manual data entry into the database, facilitating the addition of articles not captured through automated scripts.
-   - Provides a command-line interface for inputting article details and saving them to DynamoDB.
-
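A command-line flow like the one described could look roughly like this; the table name and item fields are assumptions for illustration, not the script's actual schema.

```python
# Illustrative manual-entry flow; the table name and item fields are
# assumed. AWS region and credentials must already be configured in the
# environment, as the setup section below describes.
import boto3

def manual_upload(table_name: str = "articles") -> None:
    """Prompt for article details and write them to DynamoDB."""
    table = boto3.resource("dynamodb").Table(table_name)
    item = {
        "url": input("Article URL: "),
        "title": input("Title: "),
        "body": input("Body text: "),
    }
    table.put_item(Item=item)
    print(f"Saved {item['url']} to {table_name}.")

if __name__ == "__main__":
    manual_upload()
```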
- ## GitHub Workflows
-
- - Automated workflows are set up for all Python scripts except `utils.py` and `manual_upload.py`.
- - These workflows ensure that data collection and processing tasks are executed periodically or in response to specific triggers, maintaining an up-to-date database.
-
- ## Requirements
-
- - The `requirements.txt` file includes all necessary Python packages such as `boto3`, `lxml`, `requests`, `pandas`, `PyPDF2`, and others. Install these packages using:
-   ```pip install -r requirements.txt```
-
- ## Setup and Configuration
-
- 1. **AWS Configuration**:
-    - Ensure AWS credentials are correctly configured for access to services like S3 and DynamoDB.
-    - Set environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
-
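Since `boto3` reads these variables from the environment automatically, a quick preflight check (a convenience sketch, not part of the repository) can catch a missing credential before any script runs:

```python
# Preflight check that the credentials boto3 expects are present.
import os

for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    if not os.environ.get(var):
        raise SystemExit(f"Missing environment variable: {var}")
print("AWS credentials found in the environment.")
```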
- 2. **Database Setup**:
-    - Scripts assume specific DynamoDB table configurations. Set up the required tables in AWS DynamoDB as per the scripts' needs.
-
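The required table layout is not spelled out here; purely as an illustration, a minimal table keyed on article URL could be created like this (the table name, key schema, and billing mode are all assumptions):

```python
# Example only: create a minimal DynamoDB table keyed on article URL.
# The table name, key schema, and billing mode are assumptions; AWS
# region and credentials must be configured in the environment.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="articles",
    KeySchema=[{"AttributeName": "url", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "url", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
```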
- 3. **Python Environment**:
-    - It's recommended to set up a virtual environment for Python to manage dependencies:
-      ```
-      python -m venv venv
-      source venv/bin/activate  # On Unix/macOS
-      venv\Scripts\activate     # On Windows
-      ```
-
- 4. **Running Scripts**:
-    - To run a script manually, navigate to the script’s directory and execute:
-      ```
-      python <script_name>.py
-      ```
 
+ # Data Collection China
+
+ This repository is dedicated to collecting and processing articles from various financial and governmental sources in China. It automates the scraping of data, summarizes content, vectorizes articles, and uploads them to a database for further analysis.
+
+ ## Project Structure
+
+ - **/source**: Contains individual Python scripts for each data source. Each script has a `crawl` function to scrape articles from its respective source.
+ - **/controllers**:
+   - `summarizer.py`: Functions to summarize the content of articles.
+   - `utils.py`: Utility functions to support data crawling and database operations.
+   - `vectorizer.py`: Functions to vectorize articles and upload them to a vector database.
+ - `main.py`: The main script that runs the `crawl` functions from all source files (see the sketch after this list).
+ - `requirements.txt`: Lists all the dependencies necessary to run the scripts.
+
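As referenced above, here is a minimal sketch of how a `main.py`-style runner might drive the per-source `crawl` functions. The module names and error handling are assumptions based on the structure described, not the repository's actual code.

```python
# Sketch of a main.py-style runner; the module names are assumed from
# the Data Sources list below and may not match the files in /source.
import importlib

SOURCES = ["csrc", "cbirc", "eastmoney", "gov", "mof",
           "mofcom", "stats", "safe", "ndrc"]

def run_all() -> None:
    for name in SOURCES:
        module = importlib.import_module(f"source.{name}")
        try:
            module.crawl()  # each source script exposes a crawl() function
        except Exception as exc:
            # Continue so one failing source does not stop the whole run.
            print(f"{name}: crawl failed: {exc}")

if __name__ == "__main__":
    run_all()
```

Dynamic import keeps such a runner unchanged when a new source file is added, which fits the one-`crawl`-function-per-source layout described above.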
+ ## Data Sources
+
+ The repository includes data collection scripts for the following sources:
+ - **CSRC** (China Securities Regulatory Commission)
+ - **CBIRC** (China Banking and Insurance Regulatory Commission)
+ - **EastMoney**
+ - **Gov**
+ - **MOF** (Ministry of Finance)
+ - **MOFCOM** (Ministry of Commerce)
+ - **Stats** (National Bureau of Statistics of China)
+ - **SAFE** (State Administration of Foreign Exchange)
+ - **NDRC** (National Development and Reform Commission)
+
+ ## Installation
+
+ 1. Clone the repository:
+    ```bash
+    git clone https://github.com/yourusername/data-collection-china.git
+    ```
+ 2. Navigate to the cloned directory:
+    ```bash
+    cd data-collection-china
+    ```
+ 3. Install all dependencies:
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 4. Run the main script:
+    ```bash
+    python main.py
+    ```