---
title: Data Collection China
emoji: 🔥
colorFrom: blue
colorTo: pink
sdk: docker
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Data Collection China

This repository is dedicated to collecting and processing articles from various financial and governmental sources in China. It automates the scraping of data, summarizes content, vectorizes articles, and uploads them to a database for further analysis.
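
The stages described above (scrape, summarize, vectorize, upload) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `Article` class, the function names, the first-sentences summary, the hashed bag-of-words vector, and the in-memory `store` are all hypothetical stand-ins, not the repository's actual summarizer, vectorizer, or database code.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    """Hypothetical record for one scraped article."""
    title: str
    content: str
    summary: str = ""
    vector: List[float] = field(default_factory=list)


def summarize(article: Article, max_sentences: int = 2) -> Article:
    # Stand-in summarizer: keep the first few sentences.
    # Splits on "." only; real Chinese text would also need "。".
    sentences = [s.strip() for s in article.content.split(".") if s.strip()]
    kept = sentences[:max_sentences]
    article.summary = ". ".join(kept) + "." if kept else ""
    return article


def vectorize(article: Article, dim: int = 8) -> Article:
    # Stand-in vectorizer: hashed bag-of-words over the summary.
    vec = [0.0] * dim
    for token in article.summary.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    article.vector = vec
    return article


def upload(article: Article, store: List[Article]) -> None:
    # Stand-in for a database write: append to an in-memory list.
    store.append(article)


def process(articles: List[Article], store: List[Article]) -> None:
    # Run each article through the stages in order.
    for article in articles:
        upload(vectorize(summarize(article)), store)
```

Keeping each stage as a separate function mirrors the split between `summarizer.py` and `vectorizer.py` and lets any stage be replaced without touching the others.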

## Project Structure

- **/source**: Contains individual Python scripts for each data source. Each script has a `crawl` function to scrape articles from its respective source.
- **/controllers**:
  - `summarizer.py`: Functions to summarize the content of articles.
  - `utils.py`: Utility functions to support data crawling and database operations.
  - `vectorizer.py`: Functions to vectorize articles and upload them to a vector database.
- `main.py`: The main script that runs crawl functions from all source files.
- `requirements.txt`: Lists all the dependencies necessary to run the scripts.
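
As a rough sketch of how a driver like `main.py` could call the `crawl` function of every source, the example below runs a set of crawl callables and isolates failures so that one broken source does not stop the rest. The `run_all` helper, the assumption that `crawl` takes no arguments and returns a list of articles, and the stub crawlers are hypothetical; the real scripts may have different signatures.

```python
from typing import Callable, Dict, List


def run_all(crawlers: Dict[str, Callable[[], List[dict]]]) -> Dict[str, int]:
    """Run each crawl function and return the article count per source."""
    counts: Dict[str, int] = {}
    for name, crawl in crawlers.items():
        try:
            articles = crawl()
        except Exception as exc:
            # A failing source is reported and skipped, not fatal.
            print(f"[{name}] crawl failed: {exc}")
            counts[name] = 0
            continue
        counts[name] = len(articles)
    return counts


if __name__ == "__main__":
    # Stub crawlers standing in for the `crawl` functions in /source.
    def crawl_csrc() -> List[dict]:
        return [{"title": "example article"}]

    def crawl_mof() -> List[dict]:
        raise RuntimeError("site unreachable")

    print(run_all({"csrc": crawl_csrc, "mof": crawl_mof}))
```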

## Data Sources

The repository includes data collection scripts for the following sources:
- **CSRC** (China Securities Regulatory Commission)
- **CBIRC** (China Banking and Insurance Regulatory Commission)
- **EastMoney**
- **Gov**
- **MOF** (Ministry of Finance)
- **MOFCOM** (Ministry of Commerce)
- **Stats** (National Bureau of Statistics of China)
- **SAFE** (State Administration of Foreign Exchange)
- **NDRC** (National Development and Reform Commission)

## Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/yourusername/data-collection-china.git
   ```
2. Navigate to the cloned directory:
   ```bash
   cd data-collection-china
   ```
3. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```
4. Run the main script:
   ```bash
   python main.py
   ```