Muhammad Abdur Rahman Saad committed
Commit f024a59 · 1 Parent(s): 82a33ed
Update README.md

README.md CHANGED
@@ -1,64 +1,38 @@
-- The `requirements.txt` file includes all necessary Python packages such as `boto3`, `lxml`, `requests`, `pandas`, `PyPDF2`, and others. Install these packages using:
-```pip install -r requirements.txt```
-
-## Setup and Configuration
-
-1. **AWS Configuration**:
-   - Ensure AWS credentials are correctly configured for access to services like S3 and DynamoDB.
-   - Set environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
-
-2. **Database Setup**:
-   - Scripts assume specific DynamoDB table configurations. Set up the required tables in AWS DynamoDB as per the scripts' needs.
-
-3. **Python Environment**:
-   - It's recommended to set up a virtual environment for Python to manage dependencies:
-     ```
-     python -m venv venv
-     source venv/bin/activate  # On Unix/macOS
-     venv\Scripts\activate     # On Windows
-     ```
-
-4. **Running Scripts**:
-   - To run a script manually, navigate to the script's directory and execute:
-     ```
-     python <script_name>.py
-     ```
+# Data Collection China
+
+This repository is dedicated to collecting and processing articles from various financial and governmental sources in China. It automates the scraping of data, summarizes content, vectorizes articles, and uploads them to a database for further analysis.
+
+## Project Structure
+
+- **/source**: Contains individual Python scripts for each data source. Each script has a `crawl` function to scrape articles from its respective source (see the example sketch after this list).
+- **/controllers**:
+  - `summarizer.py`: Functions to summarize the content of articles.
+  - `utils.py`: Utility functions to support data crawling and database operations.
+  - `vectorizer.py`: Functions to vectorize articles and upload them to a vector database (a toy sketch follows this list).
+- `main.py`: The main script that runs crawl functions from all source files (see the driver sketch after this list).
+- `requirements.txt`: Lists all the dependencies necessary to run the scripts.
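The README describes each `/source` script only as exposing a `crawl` function. The following is a minimal, hypothetical sketch of what such a script might look like: the URL, XPath selectors, and article fields are placeholders invented for illustration rather than the repository's actual code, and it relies on `requests` and `lxml` from `requirements.txt`.

```python
# Hypothetical source/example_source.py -- not part of the repository.
# The URL and selectors are placeholders; each real source script would use
# its own listing page and parsing logic.
import requests
from lxml import html

EXAMPLE_URL = "https://example.gov.cn/news"  # placeholder listing page


def parse_articles(page_html: str) -> list[dict]:
    """Extract simple title/link records from a listing page."""
    tree = html.fromstring(page_html)
    articles = []
    for link in tree.xpath("//a[@href]"):
        title = link.text_content().strip()
        if title:
            articles.append({"title": title, "url": link.get("href")})
    return articles


def crawl() -> list[dict]:
    """Fetch the listing page and return parsed article records."""
    response = requests.get(EXAMPLE_URL, timeout=30)
    response.raise_for_status()
    return parse_articles(response.text)


if __name__ == "__main__":
    # Offline demo on an inline snippet, so the sketch runs without network access.
    sample = '<html><body><a href="/a1">Sample article</a></body></html>'
    print(parse_articles(sample))
```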
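Likewise, `main.py` is described only as the script that runs the crawl functions from all source files. A minimal sketch of that driver follows, assuming the source scripts live in an importable `source` package and each defines a module-level `crawl()`; the package name and the error handling are assumptions, not the project's actual code.

```python
# Hypothetical main.py driver -- assumes a "source" package whose modules each
# define a crawl() function, as described in the project structure above.
import importlib
import pkgutil


def run_all_crawlers(package_name: str = "source") -> None:
    """Import every module in the package and call its crawl() function."""
    package = importlib.import_module(package_name)
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        crawl = getattr(module, "crawl", None)
        if crawl is None:
            print(f"Skipping {info.name}: no crawl() function")
            continue
        try:
            crawl()
        except Exception as exc:  # one failing source should not stop the rest
            print(f"{info.name} failed: {exc}")


if __name__ == "__main__":
    run_all_crawlers()
```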
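The controller modules are not documented beyond one line each, so the block below is only a toy stand-in for the kind of interface `vectorizer.py` might expose. The hashing-based `embed()` is a deliberate placeholder so the example runs offline; the real project presumably uses a proper embedding model and vector-database client, neither of which is named in this README.

```python
# Toy placeholder for a vectorizer -- the real vectorizer.py presumably calls an
# embedding model and a vector database client, neither of which is shown here.
import hashlib
import math


def embed(text: str, dim: int = 64) -> list[float]:
    """Hashing-based bag-of-words vector, used only to make the sketch runnable."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def vectorize_article(article: dict) -> dict:
    """Return the article record with an attached vector, ready for upload."""
    return {**article, "vector": embed(article["text"])}


if __name__ == "__main__":
    record = vectorize_article({"title": "Demo", "text": "monetary policy briefing"})
    print(len(record["vector"]), record["vector"][:3])
```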
+
+## Data Sources
+
+The repository includes data collection scripts for the following sources:
+- **CSRC** (China Securities Regulatory Commission)
+- **CBIRC** (China Banking and Insurance Regulatory Commission)
+- **EastMoney**
+- **Gov**
+- **MOF** (Ministry of Finance)
+- **MOFCOM** (Ministry of Commerce)
+- **Stats** (National Bureau of Statistics of China)
+- **SAFE** (State Administration of Foreign Exchange)
+- **NDRC** (National Development and Reform Commission)
+
+## Installation
+
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/yourusername/data-collection-china.git
+   ```
+2. Navigate to the cloned directory:
+   ```bash
+   cd data-collection-china
+   ```
+3. Install all dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+4. Run the main file:
+   ```bash
+   python main.py
+   ```