Muhammad Abdur Rahman Saad committed
Commit f024a59 · 1 Parent(s): 82a33ed

Update README.md

Files changed (1)
  1. README.md +38 -64
README.md CHANGED
@@ -1,64 +1,38 @@
- # Security Report Collection
-
- The Security Report Collection repository contains a series of Python scripts designed to automate the collection, processing, and storage of financial and policy data from various Chinese government and financial websites. This data is vital for understanding changes in policy, financial news, and regulatory measures that could impact markets and investments.
-
- ## Repository Structure
-
- - **Python Scripts**: Each script is tailored to specific sources and tasks, ranging from data scraping to sentiment analysis and database operations.
- - **GitHub Workflows**: Automated workflows to execute the Python scripts on a schedule or trigger specific events, excluding `utils.py` and `manual_upload.py`.
- - **requirements.txt**: Lists all Python dependencies required for the scripts to run.
-
- ## Python Scripts Overview
-
- Each script targets different data sources or handles distinct aspects of data management:
-
- ### Data Collection Scripts
-
- 1. **CBIRC, Chinatax, CSRCV, Daily, Eastmoney, Glue, Gov, Manual_Upload, MOF, MOFCOM, PBC, SAFE, Stats**:
-    - These scripts scrape data from their respective websites, handling tasks such as extracting article URLs, downloading articles, translating content, and calculating sentiment scores.
-    - They use utilities provided by `utils.py` to interact with databases, manage files, and perform translations and sentiment analysis.
-
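To make the shared flow of these collection scripts concrete, here is a minimal sketch of the scrape, translate, and score sequence described above. Every helper in it is an illustrative stub, not the repository's actual `utils.py` API; `requests` and `lxml` are taken from the dependency list in the old `requirements.txt`.

```python
# Schematic sketch of one collection script's flow; all helpers here are
# illustrative stubs, not the repository's actual utils.py functions.
import requests
from lxml import html

def get_article_urls(index_url: str) -> list[str]:
    """Fetch a listing page and extract candidate article links."""
    page = requests.get(index_url, timeout=30)
    page.raise_for_status()
    # Real scripts would use a source-specific XPath, not //a/@href.
    return html.fromstring(page.text).xpath("//a/@href")

def translate(text: str) -> str:
    return text  # stub: the real helper calls a translation service

def sentiment_score(text: str) -> float:
    return 0.0  # stub: the real helper runs sentiment analysis

def process_source(index_url: str) -> None:
    for url in get_article_urls(index_url):
        article = requests.get(url, timeout=30).text
        translated = translate(article)
        score = sentiment_score(translated)
        print(url, score)  # stub: the real scripts persist to DynamoDB
```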
- ### Utility Scripts
-
- - **utils.py**:
-   - A central utility script that supports database operations, file handling, content translation, and other shared functionalities across various scripts.
-   - It includes custom functions for working with AWS DynamoDB, handling PDFs, fetching URLs, and more.
-
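As one concrete illustration of the kind of helpers described here, the sketch below shows URL fetching and PDF text extraction with `requests` and `PyPDF2` (both listed in `requirements.txt`). The names and signatures are assumptions, not the actual `utils.py` interface.

```python
# Sketch of two utility helpers of the kind described above; names and
# signatures are assumptions rather than the actual utils.py API.
import io

import requests
from PyPDF2 import PdfReader  # PdfReader is the PyPDF2 >= 2.x class

def fetch_url(url: str, timeout: int = 30) -> bytes:
    """Download a URL and return the raw response body."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content

def pdf_to_text(pdf_bytes: bytes) -> str:
    """Extract plain text from an in-memory PDF document."""
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```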
- ### Special Scripts
-
- - **manual_upload.py**:
-   - Allows manual data entry into the database, facilitating the addition of articles not captured through automated scripts.
-   - Provides a command-line interface for inputting article details and saving them to DynamoDB.
-
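A command-line flow like the one described could look roughly like this; the table name and item fields are assumptions for illustration, not the script's actual schema.

```python
# Illustrative manual-entry flow; the table name and item fields are
# assumed. AWS region and credentials must already be configured in the
# environment, as the setup section below describes.
import boto3

def manual_upload(table_name: str = "articles") -> None:
    """Prompt for article details and write them to DynamoDB."""
    table = boto3.resource("dynamodb").Table(table_name)
    item = {
        "url": input("Article URL: "),
        "title": input("Title: "),
        "body": input("Body text: "),
    }
    table.put_item(Item=item)
    print(f"Saved {item['url']} to {table_name}.")

if __name__ == "__main__":
    manual_upload()
```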
- ## GitHub Workflows
-
- - Automated workflows are set up for all Python scripts except `utils.py` and `manual_upload.py`.
- - These workflows ensure that data collection and processing tasks are executed periodically or in response to specific triggers, maintaining an up-to-date database.
-
- ## Requirements
-
- - The `requirements.txt` file includes all necessary Python packages such as `boto3`, `lxml`, `requests`, `pandas`, `PyPDF2`, and others. Install these packages using:
-   ```pip install -r requirements.txt```
-
- ## Setup and Configuration
-
- 1. **AWS Configuration**:
-    - Ensure AWS credentials are correctly configured for access to services like S3 and DynamoDB.
-    - Set environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
-
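Since `boto3` reads these variables from the environment automatically, a quick preflight check (a convenience sketch, not part of the repository) can catch a missing credential before any script runs:

```python
# Preflight check that the credentials boto3 expects are present.
import os

for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    if not os.environ.get(var):
        raise SystemExit(f"Missing environment variable: {var}")
print("AWS credentials found in the environment.")
```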
- 2. **Database Setup**:
-    - Scripts assume specific DynamoDB table configurations. Set up the required tables in AWS DynamoDB as per the scripts' needs.
-
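The required table layout is not spelled out here; purely as an illustration, a minimal table keyed on article URL could be created like this (the table name, key schema, and billing mode are all assumptions):

```python
# Example only: create a minimal DynamoDB table keyed on article URL.
# The table name, key schema, and billing mode are assumptions; AWS
# region and credentials must be configured in the environment.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="articles",
    KeySchema=[{"AttributeName": "url", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "url", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
```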
- 3. **Python Environment**:
-    - It's recommended to set up a virtual environment for Python to manage dependencies:
-      ```
-      python -m venv venv
-      source venv/bin/activate  # On Unix/macOS
-      venv\Scripts\activate     # On Windows
-      ```
-
- 4. **Running Scripts**:
-    - To run a script manually, navigate to the script’s directory and execute:
-      ```
-      python <script_name>.py
-      ```
 
+ # Data Collection China
+
+ This repository is dedicated to collecting and processing articles from various financial and governmental sources in China. It automates the scraping of data, summarizes content, vectorizes articles, and uploads them to a database for further analysis.
+
+ ## Project Structure
+
+ - **/source**: Contains individual Python scripts for each data source. Each script has a `crawl` function to scrape articles from its respective source.
+ - **/controllers**:
+   - `summarizer.py`: Functions to summarize the content of articles.
+   - `utils.py`: Utility functions to support data crawling and database operations.
+   - `vectorizer.py`: Functions to vectorize articles and upload them to a vector database.
+ - `main.py`: The main script that runs the `crawl` functions from all source files (see the sketch after this list).
+ - `requirements.txt`: Lists all the dependencies necessary to run the scripts.
+
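As referenced above, here is a minimal sketch of how a `main.py`-style runner might drive the per-source `crawl` functions. The module names and error handling are assumptions based on the structure described, not the repository's actual code.

```python
# Sketch of a main.py-style runner; the module names are assumed from
# the Data Sources list below and may not match the files in /source.
import importlib

SOURCES = ["csrc", "cbirc", "eastmoney", "gov", "mof",
           "mofcom", "stats", "safe", "ndrc"]

def run_all() -> None:
    for name in SOURCES:
        module = importlib.import_module(f"source.{name}")
        try:
            module.crawl()  # each source script exposes a crawl() function
        except Exception as exc:
            # Continue so one failing source does not stop the whole run.
            print(f"{name}: crawl failed: {exc}")

if __name__ == "__main__":
    run_all()
```

Dynamic import keeps such a runner unchanged when a new source file is added, which fits the one-`crawl`-function-per-source layout described above.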
+ ## Data Sources
+
+ The repository includes data collection scripts for the following sources:
+ - **CSRC** (China Securities Regulatory Commission)
+ - **CBIRC** (China Banking and Insurance Regulatory Commission)
+ - **EastMoney**
+ - **Gov**
+ - **MOF** (Ministry of Finance)
+ - **MOFCOM** (Ministry of Commerce)
+ - **Stats** (National Bureau of Statistics of China)
+ - **SAFE** (State Administration of Foreign Exchange)
+ - **NDRC** (National Development and Reform Commission)
+
+ ## Installation
+
+ 1. Clone the repository:
+    ```bash
+    git clone https://github.com/yourusername/data-collection-china.git
+    ```
+ 2. Navigate to the cloned directory:
+    ```bash
+    cd data-collection-china
+    ```
+ 3. Install all dependencies:
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 4. Run the main script:
+    ```bash
+    python main.py
+    ```