Upload 5 files

- README.md +168 -10
- api.py +203 -0
- llm_utils.py +241 -0
- requirements.txt +34 -0
- utils.py +146 -0

README.md
CHANGED
@@ -1,10 +1,168 @@
```diff
- ---
- title:
- emoji:
- colorFrom:
- colorTo:
- sdk:
-
-
-
-
```
---
title: News Summarizer
emoji: 👁
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 5.22.0
app_file: app.py
pinned: false
short_description: An app for summarizing news articles on orgs.
---

# News Summarization and Text-to-Speech Application

## Overview
This project is a web-based application that extracts key details from multiple news articles related to a given company, performs sentiment analysis, conducts a comparative analysis, and generates text-to-speech (TTS) output in Hindi.

## Features
- **News Extraction**: Scrapes and displays up to 10 news articles from The New York Times and BBC.
- **Sentiment Analysis**: Categorizes articles as Positive, Negative, or Neutral.
- **Comparative Analysis**: Groups the most semantically similar articles, then compares the groups to derive insights into how a company's news coverage varies.
- **Text-to-Speech (TTS)**: Converts the summarized sentiment report into Hindi speech.
- **User Interface**: Provides a simple web-based interface using Gradio.
- **API Integration**: Implements FastAPI for backend communication.
- **Deployment**: Deployable on Hugging Face Spaces.

## Tech Stack
- **Frontend**: Gradio
- **Backend**: FastAPI
- **Scraping**: BeautifulSoup
- **NLP**: OpenAI GPT models, LangChain, Sentence Transformers
- **Sentiment Analysis**: Pre-trained Transformer model
- **Text-to-Speech**: Google TTS (gTTS)
- **Deployment**: Uvicorn, Hugging Face Spaces

---

## Installation and Setup

### 1. Clone the Repository
```bash
git clone https://github.com/Senzen18/News-Summarizer.git
cd News-Summarizer
```

### 2. Install Dependencies
Ensure you have Python 3.8+ installed. Then run:
```bash
pip install -r requirements.txt
```

### 3. Run the FastAPI Endpoints Only
Start the FastAPI backend:
```bash
uvicorn api:app --host 127.0.0.1 --port 8000 --reload
```

### 4. Run Both Gradio and FastAPI
Start the Gradio app, which also brings up the FastAPI backend:
```bash
gradio app.py
```

### 5. Access the Application
Once started, access the Gradio UI at:
```
http://127.0.0.1:7860
```

---

## API Endpoints

### 1. Fetch News
**GET** `/news/{company_name}`
- Fetches the latest articles related to a company.
- **Example:** `/news/Tesla`

### 2. Analyze News Sentiment
**GET** `/analyze-news`
- Performs sentiment analysis on the extracted articles.

### 3. Compare News Articles
**POST** `/compare-news`
- Performs comparative analysis.
- **Request Body:**
```json
{
  "api_key": "your-openai-api-key",
  "model_name": "gpt-4o-mini",
  "company_name": "Tesla"
}
```

### 4. Generate Hindi Summary
**GET** `/hindi-summary`
- Returns the summarized analysis in Hindi and stores the speech file.
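
For a quick check, the endpoints can be exercised in order with `curl` (a sketch assuming the backend is running locally on port 8000; `sk-...` stands in for a real OpenAI key):
```bash
# 1. Fetch and cache articles for a company
curl http://127.0.0.1:8000/news/Tesla

# 2. Sentiment analysis over the cached articles
curl http://127.0.0.1:8000/analyze-news

# 3. Comparative analysis (needs a valid OpenAI API key)
curl -X POST http://127.0.0.1:8000/compare-news \
  -H "Content-Type: application/json" \
  -d '{"api_key": "sk-...", "model_name": "gpt-4o-mini", "company_name": "Tesla"}'

# 4. Hindi summary (also writes output.mp3 on the server)
curl http://127.0.0.1:8000/hindi-summary
```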

---

## File Structure
```
├── api.py             # FastAPI backend for news extraction, sentiment analysis, and comparison
├── app.py             # Gradio frontend to interact with users
├── llm_utils.py       # Handles OpenAI API calls for topic extraction and comparative analysis
├── utils.py           # Utility functions for web scraping, sentiment analysis, and TTS
├── requirements.txt   # Dependencies
└── README.md          # Project documentation
```

---

## Assumptions and Limitations
- Extracts articles only from The New York Times and BBC.
- Requires a valid OpenAI API key for topic extraction and comparative analysis; sentiment analysis runs on a local transformer model.
- Hindi speech output uses gTTS, which requires an internet connection.

---

## Deployment
This project can be deployed on Hugging Face Spaces. To deploy:
1. Push your repository to GitHub.
2. Follow the [Hugging Face Spaces documentation](https://huggingface.co/docs/spaces) for deployment.

---

## Example Output
```json
{
  "Company": "Tesla",
  "Articles": [
    {
      "Title": "Tesla's New Model Breaks Sales Records",
      "Summary": "Tesla's latest EV sees record sales in Q3...",
      "Sentiment": "Positive",
      "Topics": ["Electric Vehicles", "Stock Market", "Innovation"]
    }
  ],
  "Comparative Sentiment Score": {
    "Sentiment Distribution": {"Positive": 1, "Negative": 1, "Neutral": 0},
    "Coverage Differences": [{
      "Comparison": "Article 1 highlights Tesla's strong sales, while Article 2 discusses regulatory issues.",
      "Impact": "Investors may react positively to growth news but stay cautious due to regulatory scrutiny."
    }],
    "Topic Overlap": {
      "Common Topics": ["Electric Vehicles"],
      "Unique Topics in Article 1": ["Stock Market", "Innovation"],
      "Unique Topics in Article 2": ["Regulations", "Autonomous Vehicles"]
    }
  },
  "Final Sentiment Analysis": "Tesla’s latest news coverage is mostly positive. Potential stock growth expected.",
  "Audio": "[Play Hindi Speech]"
}
```

---

## Contributing
Feel free to contribute by:
- Adding more news sources
- Improving the sentiment model
- Enhancing the UI

---

## Contact
For queries, reach out at [[email protected]].
api.py
ADDED
@@ -0,0 +1,203 @@
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from utils import bs4_extractor, SentimentAnalyzer, SemanticGrouping, save_audio
from llm_utils import ChatBot
import openai

app = FastAPI()

# Initialize the sentiment analyzer and semantic grouping models once at startup
sentiment_analyzer = SentimentAnalyzer()
semantic_grouping = SemanticGrouping()


class CompareNewsRequest(BaseModel):
    api_key: str
    model_name: str = "gpt-4o-mini"
    company_name: str


def check_api_key(api_key: str) -> bool:
    """Validates an OpenAI API key by listing the available models."""
    if api_key is None:
        return False
    client = openai.OpenAI(api_key=api_key)
    try:
        client.models.list()
    except openai.AuthenticationError:
        return False
    return True


def check_model_name(model_name: str, api_key: str) -> bool:
    """Checks that the requested model is available for this API key."""
    client = openai.OpenAI(api_key=api_key)
    model_list = [model.id for model in client.models.list()]
    return model_name in model_list


# Helper function to get articles and article sentiments
def get_articles(company_name: str):
    if not company_name:
        raise HTTPException(status_code=400, detail="The company name is required.")

    news_articles = bs4_extractor(company_name)

    if not news_articles:
        raise HTTPException(status_code=404, detail="No news found")

    articles_data = [
        {"title": article["title"], "summary": article["summary"]}
        for article in news_articles
    ]

    analyzed_articles = sentiment_analyzer.classify_sentiments(articles_data)

    return news_articles, analyzed_articles


def get_formatted_output(
    company_name,
    analyzed_articles,
    topic_extraction_results,
    topic_overlap_results,
    comparative_analysis_results,
    final_analysis,
):
    """Assembles the final response payload from the individual analysis results."""
    articles = analyzed_articles
    sentiment_distribution = {"positive": 0, "negative": 0, "neutral": 0}
    for i in range(len(articles)):
        articles[i]["topics"] = topic_extraction_results[i]

        sentiment = articles[i]["sentiment"]
        sentiment_distribution[sentiment] += 1
    comparative_sentiment_score = {
        "Sentiment Distribution": sentiment_distribution,
        "Coverage Differences": comparative_analysis_results,
        "Topic Overlap": topic_overlap_results,
    }
    final_output = {
        "Company": company_name,
        "Articles": articles,
        "Comparative Sentiment Score": comparative_sentiment_score,
        "Final Sentiment Analysis": final_analysis,
    }

    return final_output


@app.get("/news/{company_name}")
def get_news(company_name: str):
    """
    API endpoint to get news for a company.
    Fetches news articles from NYTimes and BBC.
    """
    try:
        news_articles = bs4_extractor(company_name)
        app.state.company_name = company_name
        if not news_articles:
            raise HTTPException(status_code=404, detail="No news found")

        app.state.news_articles = news_articles
        return {"company": company_name, "articles": news_articles}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/analyze-news")
def analyze_news():
    """
    API endpoint to analyze news articles.
    Performs sentiment analysis on the articles cached by /news.
    """
    if not getattr(app.state, "news_articles", None):
        raise HTTPException(
            status_code=400, detail="Fetch news for a company before the analysis."
        )
    try:
        articles_data = [
            {"title": article["title"], "summary": article["summary"]}
            for article in app.state.news_articles
        ]
        analyzed_articles = sentiment_analyzer.classify_sentiments(articles_data)
        app.state.articles_with_sentiments = analyzed_articles
        return {"analyzed_articles": analyzed_articles}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/compare-news")
async def compare_news(request_info: CompareNewsRequest):
    """
    API endpoint to perform comparative analysis.
    Uses semantic similarity to find the most related articles.
    """
    api_key = request_info.api_key
    company_name = request_info.company_name
    model_name = request_info.model_name
    if not check_api_key(api_key):
        raise HTTPException(
            status_code=401,
            detail="The entered API key does not seem to be right. Please enter a valid API key.",
        )

    if not check_model_name(model_name, api_key):
        raise HTTPException(
            status_code=400,
            detail="The model you specified does not exist.",
        )
    news_articles, analyzed_articles = get_articles(company_name)
    try:
        articles_text = [
            f"{article['title']}. {article['summary']}" for article in news_articles
        ]

        if len(articles_text) < 2:
            raise HTTPException(
                status_code=400, detail="At least two articles required for comparison."
            )
        top_similar_articles = semantic_grouping.find_top_k_similar_articles(
            articles_text, k=5
        )

        llm_chatbot = ChatBot(
            api_key,
            model_name,
            analyzed_articles,
            company_name,
        )
        llm_result = await llm_chatbot.main(top_similar_articles)

        (
            topic_extraction_results,
            topic_overlap_results,
            comparative_analysis_results,
        ) = llm_result.values()
        final_analysis_eng, final_analysis_hi = llm_chatbot.final_analysis(
            comparative_analysis_results
        )

        final_output = get_formatted_output(
            company_name,
            analyzed_articles,
            topic_extraction_results,
            topic_overlap_results,
            comparative_analysis_results,
            final_analysis_eng,
        )

        app.state.hindi_summary = final_analysis_hi
        return final_output

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/hindi-summary")
def get_hindi_summary():
    if not getattr(app.state, "hindi_summary", None):
        raise HTTPException(
            status_code=400, detail="Generate the comparative analysis first."
        )
    save_audio(app.state.hindi_summary)
    return {"hindi_summary": app.state.hindi_summary}
```
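
The endpoints are stateful: `/analyze-news` and `/hindi-summary` read results that earlier calls stored in `app.state`, so a client should hit them in order. A minimal client sketch with `requests` (hypothetical key and local host):

```python
import requests

BASE = "http://127.0.0.1:8000"

# 1. Fetch and cache articles for the company
news = requests.get(f"{BASE}/news/Tesla").json()

# 2. Sentiment analysis over the cached articles
analyzed = requests.get(f"{BASE}/analyze-news").json()

# 3. Comparative analysis (requires a valid OpenAI API key)
report = requests.post(
    f"{BASE}/compare-news",
    json={"api_key": "sk-...", "model_name": "gpt-4o-mini", "company_name": "Tesla"},
).json()

# 4. Hindi summary; also writes output.mp3 on the server side
hindi = requests.get(f"{BASE}/hindi-summary").json()
print(hindi["hindi_summary"])
```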
llm_utils.py
ADDED
@@ -0,0 +1,241 @@
```python
import asyncio
from typing import List

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class TopicExtraction(BaseModel):
    """Extracts topics from a news article."""

    topics: List[str] = Field(
        ..., description="A list of topics covered in the news article."
    )


class TopicOverlap(BaseModel):
    """Extracts common and unique topics between two news articles."""

    common_topics: List[str] = Field(
        ..., description="A list of topics covered in both articles."
    )
    unique_topics_1: List[str] = Field(
        ..., description="A list of topics unique to article 1."
    )
    unique_topics_2: List[str] = Field(
        ..., description="A list of topics unique to article 2."
    )


class ComparativeAnalyzer(BaseModel):
    """Compares a given pair of articles and extracts comparison and impact."""

    comparison: str = Field(
        ..., description="A sentence of comparative insights between articles."
    )
    impact: str = Field(
        ..., description="A sentence of potential impacts from the compared articles."
    )


class FinalAnalysis(BaseModel):
    """Summarizes the comparative analysis."""

    english: str = Field(..., description="Summary of the analysis in English.")
    hindi: str = Field(..., description="Summary of the analysis in Hindi.")


class ChatBot:
    def __init__(
        self, api_key: str, model: str, articles_dict: list, company_name: str
    ):
        self.llm = ChatOpenAI(model=model, api_key=api_key, temperature=0.1)
        # Flatten each article dict into a single text block for prompting
        articles_list = []
        for article in articles_dict:
            title = article["title"]
            summary = article["summary"]
            sentiment = article["sentiment"]
            articles_list.append(
                f"title {title} \n summary {summary} \n sentiment {sentiment} \n\n"
            )

        self.articles = articles_list
        self.company_name = company_name

    async def topic_extraction(self, article: str):
        system_message = """You are an expert in text analysis and topic extraction. Your task is to identify the main topics from a short news article.

### Instructions:
- Extract **2 to 3 key topics** that summarize the core ideas of the article.
- Use **concise, generalizable topics** (e.g., "Electric Vehicles" instead of "Tesla Model X").
- Avoid generic words like "news" or "report".
- If relevant, include categories such as **Technology, Finance, Politics, Business, or Science**.
- Return the topics in **JSON format** as a list of strings.
- Separate the topics for each article by a line break.
- Do not include just the company name {company_name}.

### Example:

#### Input Article:
"Tesla has launched a new AI-powered self-driving feature that improves vehicle autonomy and enhances road safety. The update is expected to impact the automotive industry's shift toward electric and smart vehicles."

#### Output:
["Artificial Intelligence", "Self-Driving Cars", "Automotive Industry", "Electric Vehicles", "Road Safety"]
"""

        prompt = ChatPromptTemplate.from_messages(
            [("system", system_message), ("human", "Input Article: \n {articles}")]
        )
        structured_llm = self.llm.with_structured_output(TopicExtraction)
        chain = prompt | structured_llm
        response = await chain.ainvoke(
            {"company_name": self.company_name, "articles": article}
        )
        return response.topics

    async def topic_overlap(self, id1: int, id2: int):
        article_1, article_2 = self.articles[id1], self.articles[id2]

        system_message = """You are an advanced AI specializing in text analysis and topic extraction. Your task is to compare two news articles and extract key topics.

### **Instructions:**
- Identify **common topics** present in **both articles**.
- Identify **topics unique to each article**.
- Use **generalized topics** (e.g., "Electric Vehicles" instead of "Tesla Model X").
- Ensure topics are **concise and meaningful**.
---
### **Example:**
#### **Article 1:**
"Tesla has launched a new AI-powered self-driving feature that enhances vehicle autonomy and road safety. The update is expected to impact the automotive industry."

#### **Article 2:**
"Regulators are reviewing Tesla’s self-driving technology due to safety concerns. Experts debate whether AI-based vehicle autonomy meets current legal standards."

#### **Expected Output:**
"common_topics": ["Self-Driving Cars", "Artificial Intelligence", "Safety"],
"unique_topics_1": ["Automotive Industry"],
"unique_topics_2": ["Regulations", "Legal Standards"]
"""

        user_message = """
Here are the news articles on the company.
Article 1:
{article_1}
Article 2:
{article_2}
"""

        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", system_message),
                ("human", user_message),
            ]
        )
        structured_llm = self.llm.with_structured_output(TopicOverlap)
        chain = prompt | structured_llm
        response = await chain.ainvoke({"article_1": article_1, "article_2": article_2})
        return {
            "Common Topics": response.common_topics,
            f"Unique Topics in Article {id1}": response.unique_topics_1,
            f"Unique Topics in Article {id2}": response.unique_topics_2,
        }

    async def comparative_analysis(self, id1: int, id2: int):
        article_1, article_2 = self.articles[id1], self.articles[id2]

        system_message = """
You are an AI assistant that performs comparative analysis on given articles.
Analyze the following articles and provide a comparative analysis. Highlight their key themes, sentiment, and impact.
Compare how each article portrays the companies and discuss potential implications for investors and the industry.
Structure your response with 'Comparison' and 'Impact' sections.
The length of each comparison and impact should be less than 20 words.
Mention the article ids.

### **Example:**
#### **Article 1:**
"Tesla's New Model Breaks Sales Records. Tesla's latest EV sees record sales in Q3..."

#### **Article 2:**
"Regulatory Scrutiny on Tesla's Self-Driving Tech. Regulators have raised concerns over Tesla’s self-driving software..."

#### **Expected Output:**
"Comparison": "Article 1 highlights Tesla's strong sales, while Article 2 discusses regulatory issues.",
"Impact": "The first article boosts confidence in Tesla's market growth, while the second raises concerns about future regulatory hurdles."
"""

        user_message = """
Here are the news articles on the company.
Article {id1}:
{article_1}
Article {id2}:
{article_2}
"""

        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", system_message),
                ("human", user_message),
            ]
        )
        structured_llm = self.llm.with_structured_output(ComparativeAnalyzer)
        chain = prompt | structured_llm
        response = await chain.ainvoke(
            {"article_1": article_1, "article_2": article_2, "id1": id1, "id2": id2}
        )
        return {
            f"comparison of {id1}, {id2}": response.comparison,
            "impact": response.impact,
        }

    async def main(self, similar_pairs: list):
        """Runs all OpenAI API calls in parallel."""

        topic_extraction_tasks = [
            self.topic_extraction(article) for article in self.articles
        ]

        topic_overlap_tasks = [
            self.topic_overlap(id1, id2) for id1, id2, _ in similar_pairs
        ]

        comparative_analysis_tasks = [
            self.comparative_analysis(id1, id2) for id1, id2, _ in similar_pairs
        ]

        (
            topic_extraction_results,
            topic_overlap_results,
            comparative_analysis_results,
        ) = await asyncio.gather(
            asyncio.gather(*topic_extraction_tasks),
            asyncio.gather(*topic_overlap_tasks),
            asyncio.gather(*comparative_analysis_tasks),
        )
        return {
            "topic_extraction_results": topic_extraction_results,
            "topic_overlap_results": topic_overlap_results,
            "comparative_analysis_results": comparative_analysis_results,
        }

    def final_analysis(self, comparative_analysis_articles):
        comparative_results = "Comparative Analysis: \n"
        for comparisons in comparative_analysis_articles:
            comparison, impact = comparisons.values()
            comparative_results += f"comparison: {comparison} \n impact: {impact} \n\n"

        template = """
You are an AI assistant that reads a comparative analysis of articles
and summarizes them to produce the final sentiment analysis.
Make the final sentiment analysis less than 20 words.
Comparative Analysis:
{comparative_results}
"""
        prompt = ChatPromptTemplate.from_template(template)
        structured_llm = self.llm.with_structured_output(FinalAnalysis)
        chain = prompt | structured_llm
        response = chain.invoke({"comparative_results": comparative_results})
        return response.english, response.hindi
```
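
A minimal usage sketch of `ChatBot` (hypothetical inputs; assumes a valid OpenAI API key, articles that already carry `title`, `summary`, and `sentiment` keys, and similar pairs produced by `SemanticGrouping` in `utils.py`):

```python
import asyncio
from llm_utils import ChatBot

articles = [
    {"title": "Tesla sales surge", "summary": "Record Q3 deliveries...", "sentiment": "positive"},
    {"title": "Tesla faces probe", "summary": "Regulators review Autopilot...", "sentiment": "negative"},
]

bot = ChatBot("sk-...", "gpt-4o-mini", articles, "Tesla")

# (id1, id2, similarity) tuples, normally produced by
# SemanticGrouping.find_top_k_similar_articles
pairs = [(0, 1, 0.42)]

results = asyncio.run(bot.main(pairs))  # runs the LLM calls concurrently
english, hindi = bot.final_analysis(results["comparative_analysis_results"])
print(english)
```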
requirements.txt
ADDED
@@ -0,0 +1,34 @@
```text
beautifulsoup4==4.13.3
fastapi
gradio==5.22.0
gradio_client==1.8.0
gtts
langchain==0.3.20
langchain-community==0.3.19
langchain-core==0.3.45
langchain-openai==0.3.9
langchain-text-splitters==0.3.6
multiprocess==0.70.16
numpy==1.26.4
openai==1.66.3
pandas==2.2.3
pydantic==2.10.6
pydantic_core==2.27.2
pydantic-settings==2.8.1
requests==2.32.3
scikit-learn==1.6.1
scipy==1.15.2
sentence-transformers==3.4.1
tokenizers==0.20.3
torch==2.6.0+cu124
torchaudio==2.6.0
transformers==4.46.1
uvicorn==0.34.0
```
utils.py
ADDED
@@ -0,0 +1,146 @@
```python
import requests
from bs4 import BeautifulSoup
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# from transformers import AutoModelForSequenceClassification, AutoTokenizer
import itertools
import re
import heapq
import torch
from gtts import gTTS


def filter_articles(articles_list, company_name):
    """
    Filters articles so that only those mentioning the company name remain.

    Args:
        articles_list (list): List of dictionaries with 'title' and 'summary'.
        company_name (str): The company name to filter articles by.

    Returns:
        list: A filtered list of articles that contain the company name.
    """
    articles_list_filtered = []

    for article in articles_list:
        full_text = (article["title"] + " " + article["summary"]).lower()

        # Escape the company name so regex metacharacters in it match literally
        if re.search(re.escape(company_name.lower()), full_text):
            articles_list_filtered.append(article)

    return articles_list_filtered


def bs4_extractor(company_name: str):
    """
    Extracts news articles from The New York Times and BBC for a given company.

    Args:
        company_name (str): The name of the company to search for.

    Returns:
        list: A list of dictionaries containing article titles and summaries.
    """
    articles_list = []

    # Fetch and parse NYTimes articles
    nytimes_url = f"https://www.nytimes.com/search?query={company_name}"
    nytimes_page = requests.get(nytimes_url).text
    nytimes_soup = BeautifulSoup(nytimes_page, "html.parser")

    for article in nytimes_soup.find_all("li", {"data-testid": "search-bodega-result"}):
        try:
            title = article.find("h4").text.strip()
            summary = article.find("p", {"class": "css-e5tzus"}).text.strip()

            if not title or not summary:
                continue

            articles_list.append({"title": title, "summary": summary})
        except AttributeError as e:
            print(f"NYTimes Extraction Error: {e}")
            continue

    # Fetch and parse BBC articles
    bbc_url = f"https://www.bbc.com/search?q={company_name}"
    bbc_page = requests.get(bbc_url).text
    bbc_soup = BeautifulSoup(bbc_page, "html.parser")

    for article in bbc_soup.find_all("div", {"data-testid": "newport-article"}):
        try:
            title = article.find("h2", {"data-testid": "card-headline"}).text.strip()
            summary = article.find(
                "div", {"class": "sc-4ea10043-3 kMizuB"}
            ).text.strip()

            if not title or not summary:
                continue

            articles_list.append({"title": title, "summary": summary})
        except AttributeError as e:
            print(f"BBC Extraction Error: {e}")
            continue

    # Keep at most 10 articles, then drop those that never mention the company
    articles_list = articles_list[:10]
    articles_filtered = filter_articles(articles_list, company_name)
    return articles_filtered


def save_audio(hindi_text):
    """Converts Hindi text to speech and writes it to output.mp3."""
    tts = gTTS(text=hindi_text, lang="hi", slow=False)
    tts.save("output.mp3")


class SentimentAnalyzer:

    def __init__(
        self, model_id="mrm8488/deberta-v3-ft-financial-news-sentiment-analysis"
    ):
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.pipe = pipeline(task="text-classification", model=model_id, device=device)

    def classify_sentiments(self, articles_list):
        """
        Classifies the sentiment of each article based on its title and summary.

        Args:
            articles_list (list of dict): A list of articles with 'title' and 'summary' keys.

        Returns:
            list of dict: The same list with added 'sentiment' keys.
        """
        for article in articles_list:
            sentiment = self.pipe(f"{article['title']}. {article['summary']}")
            article["sentiment"] = sentiment[0]["label"]

        return articles_list


class SemanticGrouping:

    def __init__(self, model_id="sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_id)

    def find_top_k_similar_articles(self, articles, k=5):
        """
        Finds the top-k most similar pairs of articles using cosine similarity.

        Args:
            articles (list of str): A list of article texts to compare.
            k (int, optional): The number of top similar pairs to return. Defaults to 5.

        Returns:
            list of tuples: A list of (index1, index2, similarity_score) tuples.
        """
        embeddings = self.model.encode(articles, convert_to_tensor=True)
        cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

        # Score every unordered pair of articles, then keep the k highest
        pairs = itertools.combinations(range(len(articles)), 2)
        similarity_scores = [(i, j, cosine_scores[i][j].item()) for i, j in pairs]

        top_k_pairs = heapq.nlargest(k, similarity_scores, key=lambda x: x[2])

        return top_k_pairs
```
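
An end-to-end sketch of these utilities (network access required, and the scraping selectors may break if NYT or BBC change their markup):

```python
from utils import bs4_extractor, SentimentAnalyzer, SemanticGrouping

# Scrape up to 10 NYT/BBC articles mentioning the company
articles = bs4_extractor("Tesla")

# Attach a sentiment label to each article
analyzed = SentimentAnalyzer().classify_sentiments(
    [{"title": a["title"], "summary": a["summary"]} for a in articles]
)

# Rank article pairs by embedding similarity
texts = [f"{a['title']}. {a['summary']}" for a in articles]
pairs = SemanticGrouping().find_top_k_similar_articles(texts, k=5)

for i, j, score in pairs:
    print(f"articles {i} and {j}: cosine similarity {score:.2f}")
```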