Muhammad Abdur Rahman Saad committed
Commit ef71343 · unverified · 1 Parent(s): 9bc2641

Update README.md

Files changed (1):
  1. README.md +64 -65
README.md CHANGED
@@ -1,65 +1,64 @@
- # security-report-collection
-
- The `main.py` file is a Python script that performs sentiment analysis on articles. Here's a detailed breakdown of the code:
-
- - Importing Libraries:
- - The script starts by importing necessary libraries:
- ```python
- import os
- import glob
- import warnings
- from decimal import Decimal
- import pandas as pd
- import boto3
- ```
- These libraries include file manipulation (os), file searching (glob), warning suppression (warnings), decimal arithmetic (decimal), data processing (pandas), and AWS services (boto3).
-
- - Defining Functions:
- - The script defines three functions:
- 1. `get_db_connection()`: This function establishes a connection to an Amazon DynamoDB database using the AWS access key ID and secret access key.
- 2. `download_files_from_s3()`: This function downloads Parquet files from an S3 bucket named "oe-data-poc" and concatenates them into a Pandas DataFrame.
- 3. `gen_sentiment(record, table_name, label_dict)`: This function computes the sentiment score for each article in the input record. It uses the Hugging Face Transformers library to analyze the text and update the DynamoDB database with the sentiment score and label.
-
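One non-obvious detail behind `gen_sentiment()`: DynamoDB rejects Python floats, which is why `main.py` imports `Decimal`. A minimal sketch of how a classifier result might be shaped for storage — the attribute names below are assumptions, not taken from the repository:

```python
from decimal import Decimal

# Mapping from classifier labels to stored symbols, as defined in main.py.
LABELS = {"positive": "+", "negative": "-", "neutral": "0"}

def to_db_fields(result, label_dict=LABELS):
    """Convert a classifier output such as {'label': 'positive', 'score': 0.98}
    into fields suitable for DynamoDB. DynamoDB rejects floats, so the score
    goes through Decimal (via str, to avoid binary-float artifacts)."""
    return {
        "sentimentlabel": label_dict[result["label"]],
        "sentimentscore": Decimal(str(result["score"])),
    }
```

The real `gen_sentiment()` would then pass fields like these to an `update_item` call on the `article` table.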
-
- - Main Program:
- - The script's main program:
- ```python
- if __name__ == "__main__":
-     # Define a dictionary mapping sentiment labels to symbols
-     label = {
-         "positive": "+",
-         "negative": "-",
-         "neutral": "0",
-     }
-
-     # Download files from S3 and filter out null values
-     df = download_files_from_s3()
-     df = df[(~df['content'].isnull()) & (df['sentimentscore'].isnull())]
-
-     # Iterate through each row in the DataFrame
-     for _, row in df.iterrows():
-         # Compute sentiment score and update DynamoDB database
-         gen_sentiment(row, 'article', label)
- ```
-
- The main program defines a dictionary mapping sentiment labels to symbols (e.g., "+" for positive sentiment), downloads files from S3, filters out rows that lack content or already have a sentiment score, and then iterates through each remaining row, computing its sentiment score with `gen_sentiment()` and updating the DynamoDB database with the score and label.
-
- - That's It!
- - In short, the script performs sentiment analysis on articles stored in S3 and writes the results to DynamoDB.
-
- The `glue.py` file contains a Python script that triggers a Parquet snapshot Glue job.
-
- Here's a breakdown of the code:
-
- 1. It starts by importing necessary modules:
- - `os`: for interacting with the operating system
- - `boto3`: a library for working with AWS services like Amazon S3, DynamoDB, and more
-
- 2. Then, it reads two environment variables:
- - `AWS_ACCESS_KEY_ID`
- - `AWS_SECRET_ACCESS_KEY`
-
- 3. The script then defines a function called `get_client_connection()` that returns a Boto3 client object for the Glue service. This client is used to interact with AWS Glue.
-
- 4. Finally, it uses this client to start a job run named 'Ner Snapshot'. It prints out the response from AWS Glue.
-
- In summary, `glue.py` sets up an environment and starts a Glue job to create a Parquet snapshot.
 
+ # Security Report Collection
+
+ The Security Report Collection repository contains a series of Python scripts designed to automate the collection, processing, and storage of financial and policy data from various Chinese government and financial websites. This data is vital for understanding changes in policy, financial news, and regulatory measures that could impact markets and investments.
+
+ ## Repository Structure
+
+ - **Python Scripts**: Each script is tailored to specific sources and tasks, ranging from data scraping to sentiment analysis and database operations.
+ - **GitHub Workflows**: Automated workflows that execute the Python scripts on a schedule or in response to specific events; `utils.py` and `manual_upload.py` are excluded.
+ - **requirements.txt**: Lists all Python dependencies required for the scripts to run.
+
+ ## Python Scripts Overview
+
+ Each script targets different data sources or handles distinct aspects of data management:
+
+ ### Data Collection Scripts
+
+ 1. **CBIRC, Chinatax, CSRCV, Daily, Eastmoney, Glue, Gov, Manual_Upload, MOF, MOFCOM, PBC, SAFE, Stats**:
+    - These scripts scrape data from their respective websites, handling tasks such as extracting article URLs, downloading articles, translating content, and calculating sentiment scores.
+    - They use utilities provided by `utils.py` to interact with databases, manage files, and perform translations and sentiment analysis.
+
+ ### Utility Scripts
+
+ - **utils.py**:
+   - A central utility script that supports database operations, file handling, content translation, and other shared functionalities across various scripts.
+   - It includes custom functions for working with AWS DynamoDB, handling PDFs, fetching URLs, and more.
+
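Since `utils.py` centralizes URL fetching, a generic retry wrapper gives the flavor of such a helper. This is an illustrative sketch, not code from the repository:

```python
import time

def with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) up to `attempts` times, sleeping between failures.
    Scrapers hitting flaky sources typically need something like this."""
    last_error = None
    for i in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # the real helper may catch narrower errors
            last_error = exc
            if i < attempts - 1:
                time.sleep(delay)
    raise last_error
```

In the real scripts, `fetch` would be something like `requests.get`.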
+ ### Special Scripts
+
+ - **manual_upload.py**:
+   - Allows manual data entry into the database, facilitating the addition of articles not captured through automated scripts.
+   - Provides a command-line interface for inputting article details and saving them to DynamoDB.
+
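The shape of such a manual-entry flow might look like the sketch below; every field name is a guess, since the actual DynamoDB schema is defined by the scripts themselves:

```python
from datetime import date

def build_article_item(title, url, content, source, today=None):
    """Assemble a DynamoDB item from manually entered article details.
    All field names here are illustrative, not taken from the repository."""
    return {
        "title": title.strip(),
        "url": url.strip(),
        "content": content.strip(),
        "source": source.strip(),
        "uploaddate": (today or date.today()).isoformat(),
    }
```

A command-line wrapper would gather each field with `input()` and pass the result to `table.put_item(Item=...)`.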
+ ## GitHub Workflows
+
+ - Automated workflows are set up for all Python scripts except `utils.py` and `manual_upload.py`.
+ - These workflows ensure that data collection and processing tasks are executed periodically or in response to specific triggers, maintaining an up-to-date database.
+
+ ## Requirements
+
+ - The `requirements.txt` file includes all necessary Python packages such as `boto3`, `lxml`, `requests`, `pandas`, `PyPDF2`, and others. Install these packages using:
+ ```
+ pip install -r requirements.txt
+ ```
+
+ ## Setup and Configuration
+
+ 1. **AWS Configuration**:
+    - Ensure AWS credentials are correctly configured for access to services like S3 and DynamoDB.
+    - Set environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
+
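A small guard like the following (an illustrative sketch, not part of the repository) lets any of the scripts fail fast when these variables are missing:

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_aws_vars(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling this at startup and raising when the list is non-empty gives a clearer error than a failed boto3 request later.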
+ 2. **Database Setup**:
+    - Scripts assume specific DynamoDB table configurations. Set up the required tables in AWS DynamoDB as per the scripts' needs.
+
+ 3. **Python Environment**:
+    - It's recommended to set up a virtual environment for Python to manage dependencies:
+ ```
+ python -m venv venv
+ source venv/bin/activate  # On Unix/macOS
+ venv\Scripts\activate     # On Windows
+ ```
+
+ 4. **Running Scripts**:
+    - To run a script manually, navigate to the script’s directory and execute:
+ ```
+ python <script_name>.py
+ ```