Muhammad Abdur Rahman Saad committed
Commit ef71343 · unverified · 1 Parent(s): 9bc2641

Update README.md

Files changed (1):
  1. README.md +64 -65
README.md CHANGED
@@ -1,65 +1,64 @@
- # security-report-collection
-
- The `main.py` file is a Python script that performs sentiment analysis on articles. Here's a detailed breakdown of the code:
-
- - Importing Libraries:
- - The script starts by importing necessary libraries:
- ```python
- import os
- import glob
- import warnings
- from decimal import Decimal
- import pandas as pd
- import boto3
- ```
- These libraries include file manipulation (os), file searching (glob), warning suppression (warnings), decimal arithmetic (decimal), data processing (pandas), and AWS services (boto3).
-
- - Defining Functions:
- - The script defines three functions:
- 1. `get_db_connection()`: This function establishes a connection to an Amazon DynamoDB database using the AWS access key ID and secret access key.
- 2. `download_files_from_s3()`: This function downloads Parquet files from an S3 bucket named "oe-data-poc" and concatenates them into a Pandas DataFrame.
- 3. `gen_sentiment(record, table_name, label_dict)`: This function computes the sentiment score for each article in the input record. It uses the Hugging Face Transformers library to analyze the text and update the DynamoDB database with the sentiment score and label.
-
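One non-obvious detail behind `gen_sentiment()`: DynamoDB rejects Python floats, which is why `main.py` imports `Decimal`. A minimal sketch of how a classifier result might be shaped for storage — the attribute names below are assumptions, not taken from the repository:

```python
from decimal import Decimal

# Mapping from classifier labels to stored symbols, as defined in main.py.
LABELS = {"positive": "+", "negative": "-", "neutral": "0"}

def to_db_fields(result, label_dict=LABELS):
    """Convert a classifier output such as {'label': 'positive', 'score': 0.98}
    into fields suitable for DynamoDB. DynamoDB rejects floats, so the score
    goes through Decimal (via str, to avoid binary-float artifacts)."""
    return {
        "sentimentlabel": label_dict[result["label"]],
        "sentimentscore": Decimal(str(result["score"])),
    }
```

The real `gen_sentiment()` would then pass fields like these to an `update_item` call on the `article` table.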
-
- - Main Program:
- - The script's main program:
- ```python
- if __name__ == "__main__":
-     # Define a dictionary mapping sentiment labels to symbols
-     label = {
-         "positive": "+",
-         "negative": "-",
-         "neutral": "0",
-     }
-
-     # Download files from S3 and filter out null values
-     df = download_files_from_s3()
-     df = df[(~df['content'].isnull()) & (df['sentimentscore'].isnull())]
-
-     # Iterate through each row in the DataFrame
-     for _, row in df.iterrows():
-         # Compute sentiment score and update DynamoDB database
-         gen_sentiment(row, 'article', label)
- ```
-
- The main program defines a dictionary mapping sentiment labels to symbols (e.g., "+" for positive sentiment), downloads files from S3, filters out rows that lack content or already have a sentiment score, and then iterates through each remaining row, computing its sentiment score with `gen_sentiment()` and updating the DynamoDB database with the score and label.
-
- - That's It!
- - In short, the script performs sentiment analysis on articles stored in S3 and writes the results to DynamoDB.
-
- The `glue.py` file contains a Python script that triggers a Parquet snapshot Glue job.
-
- Here's a breakdown of the code:
-
- 1. It starts by importing necessary modules:
- - `os`: for interacting with the operating system
- - `boto3`: a library for working with AWS services like Amazon S3, DynamoDB, and more
-
- 2. Then, it reads two environment variables:
- - `AWS_ACCESS_KEY_ID`
- - `AWS_SECRET_ACCESS_KEY`
-
- 3. The script then defines a function called `get_client_connection()` that returns a Boto3 client object for the Glue service. This client is used to interact with AWS Glue.
-
- 4. Finally, it uses this client to start a job run named 'Ner Snapshot'. It prints out the response from AWS Glue.
-
- In summary, `glue.py` sets up an environment and starts a Glue job to create a Parquet snapshot.
 
+ # Security Report Collection
+
+ The Security Report Collection repository contains a series of Python scripts designed to automate the collection, processing, and storage of financial and policy data from various Chinese government and financial websites. This data is vital for understanding changes in policy, financial news, and regulatory measures that could impact markets and investments.
+
+ ## Repository Structure
+
+ - **Python Scripts**: Each script is tailored to specific sources and tasks, ranging from data scraping to sentiment analysis and database operations.
+ - **GitHub Workflows**: Automated workflows that execute the Python scripts on a schedule or in response to specific events; `utils.py` and `manual_upload.py` are excluded.
+ - **requirements.txt**: Lists all Python dependencies required for the scripts to run.
+
+ ## Python Scripts Overview
+
+ Each script targets different data sources or handles distinct aspects of data management:
+
+ ### Data Collection Scripts
+
+ 1. **CBIRC, Chinatax, CSRCV, Daily, Eastmoney, Glue, Gov, Manual_Upload, MOF, MOFCOM, PBC, SAFE, Stats**:
+    - These scripts scrape data from their respective websites, handling tasks such as extracting article URLs, downloading articles, translating content, and calculating sentiment scores.
+    - They use utilities provided by `utils.py` to interact with databases, manage files, and perform translations and sentiment analysis.
+
+ ### Utility Scripts
+
+ - **utils.py**:
+   - A central utility script that supports database operations, file handling, content translation, and other shared functionalities across various scripts.
+   - It includes custom functions for working with AWS DynamoDB, handling PDFs, fetching URLs, and more.
+
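Since `utils.py` centralizes URL fetching, a generic retry wrapper gives the flavor of such a helper. This is an illustrative sketch, not code from the repository:

```python
import time

def with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) up to `attempts` times, sleeping between failures.
    Scrapers hitting flaky sources typically need something like this."""
    last_error = None
    for i in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # the real helper may catch narrower errors
            last_error = exc
            if i < attempts - 1:
                time.sleep(delay)
    raise last_error
```

In the real scripts, `fetch` would be something like `requests.get`.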
+ ### Special Scripts
+
+ - **manual_upload.py**:
+   - Allows manual data entry into the database, facilitating the addition of articles not captured through automated scripts.
+   - Provides a command-line interface for inputting article details and saving them to DynamoDB.
+
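The shape of such a manual-entry flow might look like the sketch below; every field name is a guess, since the actual DynamoDB schema is defined by the scripts themselves:

```python
from datetime import date

def build_article_item(title, url, content, source, today=None):
    """Assemble a DynamoDB item from manually entered article details.
    All field names here are illustrative, not taken from the repository."""
    return {
        "title": title.strip(),
        "url": url.strip(),
        "content": content.strip(),
        "source": source.strip(),
        "uploaddate": (today or date.today()).isoformat(),
    }
```

A command-line wrapper would gather each field with `input()` and pass the result to `table.put_item(Item=...)`.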
+ ## GitHub Workflows
+
+ - Automated workflows are set up for all Python scripts except `utils.py` and `manual_upload.py`.
+ - These workflows ensure that data collection and processing tasks are executed periodically or in response to specific triggers, maintaining an up-to-date database.
+
+ ## Requirements
+
+ - The `requirements.txt` file includes all necessary Python packages such as `boto3`, `lxml`, `requests`, `pandas`, `PyPDF2`, and others. Install these packages using:
+ ```
+ pip install -r requirements.txt
+ ```
+
+ ## Setup and Configuration
+
+ 1. **AWS Configuration**:
+    - Ensure AWS credentials are correctly configured for access to services like S3 and DynamoDB.
+    - Set environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
+
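A small guard like the following (an illustrative sketch, not part of the repository) lets any of the scripts fail fast when these variables are missing:

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_aws_vars(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling this at startup and raising when the list is non-empty gives a clearer error than a failed boto3 request later.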
+ 2. **Database Setup**:
+    - Scripts assume specific DynamoDB table configurations. Set up the required tables in AWS DynamoDB as per the scripts' needs.
+
+ 3. **Python Environment**:
+    - It's recommended to set up a virtual environment for Python to manage dependencies:
+ ```
+ python -m venv venv
+ source venv/bin/activate  # On Unix/macOS
+ venv\Scripts\activate     # On Windows
+ ```
+
+ 4. **Running Scripts**:
+    - To run a script manually, navigate to the script’s directory and execute:
+ ```
+ python <script_name>.py
+ ```