Senzen committed
Commit c576592 · verified · 1 Parent(s): 012003c

Upload 5 files

Files changed (5)
  1. README.md +168 -10
  2. api.py +203 -0
  3. llm_utils.py +241 -0
  4. requirements.txt +34 -0
  5. utils.py +146 -0
README.md CHANGED
@@ -1,10 +1,168 @@
---
title: News Summarizer
emoji: 👁
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 5.22.0
app_file: app.py
pinned: false
short_description: An app for summarizing news articles about companies.
---

# News Summarization and Text-to-Speech Application

## Overview
This project is a web-based application that extracts key details from multiple news articles related to a given company, performs sentiment analysis, conducts a comparative analysis across articles, and generates text-to-speech (TTS) output in Hindi.

## Features
- **News Extraction**: Scrapes and displays up to 10 news articles from The New York Times and BBC.
- **Sentiment Analysis**: Categorizes each article as Positive, Negative, or Neutral.
- **Comparative Analysis**: Pairs the most semantically similar articles, then compares the pairs to derive insights into how a company's news coverage varies.
- **Text-to-Speech (TTS)**: Converts the summarized sentiment report into Hindi speech.
- **User Interface**: Provides a simple web-based interface using Gradio.
- **API Integration**: Implements FastAPI for backend communication.
- **Deployment**: Deployable on Hugging Face Spaces.

## Tech Stack
- **Frontend**: Gradio
- **Backend**: FastAPI
- **Scraping**: BeautifulSoup
- **NLP**: OpenAI GPT models, LangChain, Sentence Transformers
- **Sentiment Analysis**: Pre-trained Transformer model
- **Text-to-Speech**: Google TTS (gTTS)
- **Deployment**: Uvicorn, Hugging Face Spaces

---

## Installation and Setup

### 1. Clone the Repository
```bash
git clone https://github.com/Senzen18/News-Summarizer.git
cd News-Summarizer
```

### 2. Install Dependencies
Ensure you have Python 3.8+ installed. Then, run:
```bash
pip install -r requirements.txt
```

### 3. Run the FastAPI Backend Only
Start the FastAPI backend:
```bash
uvicorn api:app --host 127.0.0.1 --port 8000 --reload
```

### 4. Run Both Gradio and FastAPI
Launch the Gradio app, which also starts the FastAPI backend:
```bash
gradio app.py
```

### 5. Access the Application
Once started, access the Gradio UI at:
```
http://127.0.0.1:7860
```
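
If you are running only the FastAPI backend (step 3), its auto-generated interactive API docs should be available at FastAPI's default docs route:
```
http://127.0.0.1:8000/docs
```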

---

## API Endpoints

### 1. Fetch News
**GET** `/news/{company_name}`
- Fetches the latest articles related to a company.
- **Example:** `/news/Tesla`

### 2. Analyze News Sentiment
**GET** `/analyze-news`
- Performs sentiment analysis on the articles previously fetched by `/news`.

### 3. Compare News Articles
**POST** `/compare-news`
- Performs comparative analysis.
- **Request Body:**
```json
{
  "api_key": "your-openai-api-key",
  "model_name": "gpt-4o-mini",
  "company_name": "Tesla"
}
```

### 4. Generate Hindi Summary
**GET** `/hindi-summary`
- Returns the summarized analysis in Hindi and stores the speech file.
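
With the backend from step 3 running locally, the endpoints can be exercised directly. Below is a minimal sketch using the `requests` library, assuming the server listens on `127.0.0.1:8000` and that `sk-...` is replaced with a real OpenAI API key. Note the ordering: `/news` must be called before `/analyze-news`, and `/compare-news` before `/hindi-summary`, since results are cached in app state between calls.

```python
import requests

BASE = "http://127.0.0.1:8000"

# Fetch and cache up to 10 articles about a company
print(requests.get(f"{BASE}/news/Tesla").json())

# Sentiment analysis on the cached articles
print(requests.get(f"{BASE}/analyze-news").json())

# Comparative analysis via the LLM (replace sk-... with a real key)
payload = {"api_key": "sk-...", "model_name": "gpt-4o-mini", "company_name": "Tesla"}
print(requests.post(f"{BASE}/compare-news", json=payload).json())

# Hindi summary; this also writes output.mp3 on the server
print(requests.get(f"{BASE}/hindi-summary").json())
```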
---

## File Structure
```
├── api.py            # FastAPI backend for news extraction, sentiment analysis, and comparison
├── app.py            # Gradio frontend to interact with users
├── llm_utils.py      # Handles OpenAI API calls for topic extraction and comparative analysis
├── utils.py          # Utility functions for web scraping, sentiment analysis, and TTS
├── requirements.txt  # Dependencies
└── README.md         # Project documentation
```
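
The scraping, sentiment, and similarity utilities can also be used directly from Python. A minimal sketch, run from the repository root (the sentiment and embedding models are downloaded on first use):

```python
from utils import bs4_extractor, SentimentAnalyzer, SemanticGrouping

# Scrape up to 10 NYTimes/BBC articles that mention the company
articles = bs4_extractor("Tesla")

# Attach a 'sentiment' label to each article dict
analyzed = SentimentAnalyzer().classify_sentiments(articles)

# Top-5 most similar article pairs as (index1, index2, score) tuples
texts = [f"{a['title']}. {a['summary']}" for a in analyzed]
pairs = SemanticGrouping().find_top_k_similar_articles(texts, k=5)
print(pairs)
```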

---

## Assumptions and Limitations
- Only extracts articles from The New York Times and BBC.
- Requires a valid OpenAI API key for topic extraction and comparative analysis; sentiment classification runs locally.
- Hindi speech output uses gTTS, which requires an internet connection.

---

## Deployment
This project can be deployed on Hugging Face Spaces. To deploy:
1. Push your repository to GitHub.
2. Follow the [Hugging Face Spaces documentation](https://huggingface.co/docs/spaces) for deployment.

---

## Example Output
```json
{
  "Company": "Tesla",
  "Articles": [
    {
      "Title": "Tesla's New Model Breaks Sales Records",
      "Summary": "Tesla's latest EV sees record sales in Q3...",
      "Sentiment": "Positive",
      "Topics": ["Electric Vehicles", "Stock Market", "Innovation"]
    }
  ],
  "Comparative Sentiment Score": {
    "Sentiment Distribution": {"Positive": 1, "Negative": 1, "Neutral": 0},
    "Coverage Differences": [{
      "Comparison": "Article 1 highlights Tesla's strong sales, while Article 2 discusses regulatory issues.",
      "Impact": "Investors may react positively to growth news but stay cautious due to regulatory scrutiny."
    }],
    "Topic Overlap": {
      "Common Topics": ["Electric Vehicles"],
      "Unique Topics in Article 1": ["Stock Market", "Innovation"],
      "Unique Topics in Article 2": ["Regulations", "Autonomous Vehicles"]
    }
  },
  "Final Sentiment Analysis": "Tesla's latest news coverage is mostly positive. Potential stock growth expected.",
  "Audio": "[Play Hindi Speech]"
}
```

---

## Contributing
Feel free to contribute by:
- Adding more news sources
- Improving the sentiment model
- Enhancing the UI

---

## Contact
For queries, reach out at [[email protected]].
api.py ADDED
@@ -0,0 +1,203 @@
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from utils import bs4_extractor, SentimentAnalyzer, SemanticGrouping, save_audio
from llm_utils import ChatBot
import openai

app = FastAPI()

# Initialize the sentiment analyzer and semantic grouping models once at startup
sentiment_analyzer = SentimentAnalyzer()
semantic_grouping = SemanticGrouping()


class CompareNewsRequest(BaseModel):
    api_key: str
    model_name: str = "gpt-4o-mini"
    company_name: str


def check_api_key(api_key: str) -> bool:
    """Validates the OpenAI API key by attempting to list the available models."""
    if api_key is None:
        return False
    client = openai.OpenAI(api_key=api_key)
    try:
        client.models.list()
    except openai.AuthenticationError:
        return False
    return True


def check_model_name(model_name: str, api_key: str) -> bool:
    """Checks that the requested model is available for this API key."""
    client = openai.OpenAI(api_key=api_key)
    model_list = [model.id for model in client.models.list()]
    return model_name in model_list


# Helper function to get articles and article sentiments
def get_articles(company_name: str):
    if not company_name:
        raise HTTPException(status_code=400, detail="The company name is required.")

    news_articles = bs4_extractor(company_name)

    if not news_articles:
        raise HTTPException(status_code=404, detail="No news found")

    articles_data = [
        {"title": article["title"], "summary": article["summary"]}
        for article in news_articles
    ]

    analyzed_articles = sentiment_analyzer.classify_sentiments(articles_data)

    return news_articles, analyzed_articles


def get_formatted_output(
    company_name,
    analyzed_articles,
    topic_extraction_results,
    topic_overlap_results,
    comparative_analysis_results,
    final_analysis,
):
    """Assembles the final response payload from the individual analysis results."""
    articles = analyzed_articles
    sentiment_distribution = {"positive": 0, "negative": 0, "neutral": 0}
    for i in range(len(articles)):
        articles[i]["topics"] = topic_extraction_results[i]

        sentiment = articles[i]["sentiment"]
        sentiment_distribution[sentiment] += 1
    comparative_sentiment_score = {
        "Sentiment Distribution": sentiment_distribution,
        "Coverage Differences": comparative_analysis_results,
        "Topic Overlap": topic_overlap_results,
    }
    final_output = {
        "Company": company_name,
        "Articles": articles,
        "Comparative Sentiment Score": comparative_sentiment_score,
        "Final Sentiment Analysis": final_analysis,
    }

    return final_output


@app.get("/news/{company_name}")
def get_news(company_name: str):
    """
    API endpoint to get news for a company.
    Fetches news articles from NYTimes and BBC.
    """
    try:
        news_articles = bs4_extractor(company_name)
        app.state.company_name = company_name
        if not news_articles:
            raise HTTPException(status_code=404, detail="No news found")

        app.state.news_articles = news_articles
        return {"company": company_name, "articles": news_articles}
    except HTTPException:
        # Re-raise HTTPExceptions as-is so the 404 above is not masked as a 500
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/analyze-news")
def analyze_news():
    """
    API endpoint to analyze news articles.
    Performs sentiment analysis on the previously fetched articles.
    """
    if not getattr(app.state, "news_articles", None):
        raise HTTPException(
            status_code=400,
            detail="Fetch news for a company before running the analysis.",
        )
    try:
        articles_data = [
            {"title": article["title"], "summary": article["summary"]}
            for article in app.state.news_articles
        ]
        analyzed_articles = sentiment_analyzer.classify_sentiments(articles_data)
        app.state.articles_with_sentiments = analyzed_articles
        return {"analyzed_articles": analyzed_articles}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/compare-news")
async def compare_news(request_info: CompareNewsRequest):
    """
    API endpoint to perform comparative analysis.
    Uses semantic similarity to find the most related articles.
    """
    api_key = request_info.api_key
    company_name = request_info.company_name
    model_name = request_info.model_name
    if not check_api_key(api_key):
        raise HTTPException(
            status_code=401,
            detail="The entered API key does not appear to be valid. Please enter a valid API key.",
        )

    if not check_model_name(model_name, api_key):
        raise HTTPException(
            status_code=400,
            detail="The model you specified does not exist.",
        )
    news_articles, analyzed_articles = get_articles(company_name)
    try:
        articles_text = [
            f"{article['title']}. {article['summary']}" for article in news_articles
        ]

        if len(articles_text) < 2:
            raise HTTPException(
                status_code=400,
                detail="At least two articles are required for comparison.",
            )
        top_similar_articles = semantic_grouping.find_top_k_similar_articles(
            articles_text, k=5
        )

        llm_chatbot = ChatBot(
            api_key,
            model_name,
            analyzed_articles,
            company_name,
        )
        llm_result = await llm_chatbot.main(top_similar_articles)

        (
            topic_extraction_results,
            topic_overlap_results,
            comparative_analysis_results,
        ) = llm_result.values()
        final_analysis_eng, final_analysis_hi = llm_chatbot.final_analysis(
            comparative_analysis_results
        )

        final_output = get_formatted_output(
            company_name,
            analyzed_articles,
            topic_extraction_results,
            topic_overlap_results,
            comparative_analysis_results,
            final_analysis_eng,
        )

        app.state.hindi_summary = final_analysis_hi
        return final_output

    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/hindi-summary")
def get_hindi_summary():
    if not getattr(app.state, "hindi_summary", None):
        raise HTTPException(
            status_code=400, detail="Generate the Comparative Analysis first."
        )
    save_audio(app.state.hindi_summary)
    return {"hindi_summary": app.state.hindi_summary}
llm_utils.py ADDED
@@ -0,0 +1,241 @@
import asyncio
from typing import List

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class TopicExtraction(BaseModel):
    """Extracts topics from a news article."""

    topics: List[str] = Field(
        ..., description="A list of topics covered in the news article."
    )


class TopicOverlap(BaseModel):
    """Extracts common and unique topics from a pair of news articles."""

    common_topics: List[str] = Field(
        ..., description="A list of topics covered in both articles."
    )
    unique_topics_1: List[str] = Field(
        ..., description="A list of topics unique to article 1."
    )
    unique_topics_2: List[str] = Field(
        ..., description="A list of topics unique to article 2."
    )


class ComparativeAnalyzer(BaseModel):
    """Compares a given pair of articles and extracts a comparison and its impact."""

    comparison: str = Field(
        ..., description="A sentence of comparative insights between the articles."
    )
    impact: str = Field(
        ..., description="A sentence of potential impacts from the compared articles."
    )


class FinalAnalysis(BaseModel):
    """Summarizes the comparative analysis."""

    english: str = Field(..., description="Summary of the analysis in English.")
    hindi: str = Field(..., description="Summary of the analysis in Hindi.")


class ChatBot:
    def __init__(
        self, api_key: str, model: str, articles_dict: list, company_name: str
    ):
        self.llm = ChatOpenAI(model=model, api_key=api_key, temperature=0.1)
        # Flatten each article dict into a single prompt-friendly string
        articles_list = []
        for article in articles_dict:
            title = article["title"]
            summary = article["summary"]
            sentiment = article["sentiment"]
            articles_list.append(
                f"title {title} \n summary {summary} \n sentiment {sentiment} \n\n"
            )

        self.articles = articles_list
        self.company_name = company_name

    async def topic_extraction(self, article: str):
        system_message = """You are an expert in text analysis and topic extraction. Your task is to identify the main topics of a short news article.

### Instructions:
- Extract **2 to 3 key topics** that summarize the core ideas of the article.
- Use **concise, generalizable topics** (e.g., "Electric Vehicles" instead of "Tesla Model X").
- Avoid generic words like "news" or "report".
- If relevant, include categories such as **Technology, Finance, Politics, Business, or Science**.
- Return the topics in **JSON format** as a list of strings.
- Separate the topics for each article with a line break.
- Do not include just the company name {company_name}.

### Example:

#### Input Article:
"Tesla has launched a new AI-powered self-driving feature that improves vehicle autonomy and enhances road safety. The update is expected to impact the automotive industry's shift toward electric and smart vehicles."

#### Output:
["Artificial Intelligence", "Self-Driving Cars", "Automotive Industry", "Electric Vehicles", "Road Safety"]
"""

        prompt = ChatPromptTemplate.from_messages(
            [("system", system_message), ("human", "Input Article: \n {articles}")]
        )
        structured_llm = self.llm.with_structured_output(TopicExtraction)
        chain = prompt | structured_llm
        response = await chain.ainvoke(
            {"company_name": self.company_name, "articles": article}
        )
        return response.topics

    async def topic_overlap(self, id1: int, id2: int):
        article_1, article_2 = self.articles[id1], self.articles[id2]

        system_message = """You are an advanced AI specializing in text analysis and topic extraction. Your task is to compare two news articles and extract key topics.

### **Instructions:**
- Identify **common topics** present in **both articles**.
- Identify **topics unique to each article**.
- Use **generalized topics** (e.g., "Electric Vehicles" instead of "Tesla Model X").
- Ensure topics are **concise and meaningful**.
---
### **Example:**
#### **Article 1:**
"Tesla has launched a new AI-powered self-driving feature that enhances vehicle autonomy and road safety. The update is expected to impact the automotive industry."

#### **Article 2:**
"Regulators are reviewing Tesla's self-driving technology due to safety concerns. Experts debate whether AI-based vehicle autonomy meets current legal standards."

#### **Expected Output:**
"common_topics": ["Self-Driving Cars", "Artificial Intelligence", "Safety"],
"unique_topics_1": ["Automotive Industry", "Road Safety"],
"unique_topics_2": ["Regulations", "Legal Standards"]
"""

        user_message = """
Here are the news articles on the company.
Article 1:
{article_1}
Article 2:
{article_2}
"""

        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", system_message),
                ("human", user_message),
            ]
        )
        structured_llm = self.llm.with_structured_output(TopicOverlap)
        chain = prompt | structured_llm
        response = await chain.ainvoke({"article_1": article_1, "article_2": article_2})
        return {
            "Common Topics": response.common_topics,
            f"Unique Topics in Article {id1}": response.unique_topics_1,
            f"Unique Topics in Article {id2}": response.unique_topics_2,
        }

    async def comparative_analysis(self, id1: int, id2: int):
        article_1, article_2 = self.articles[id1], self.articles[id2]

        system_message = """
You are an AI assistant that performs comparative analysis on the given articles.
Analyze the following articles and provide a comparative analysis. Highlight their key themes, sentiment, and impact.
Compare how each article portrays the companies and discuss potential implications for investors and the industry.
Structure your response with 'Comparison' and 'Impact' sections.
Each comparison and impact should be less than 20 words.
Mention the article ids.

### **Example:**
#### **Article 1:**
"Tesla's New Model Breaks Sales Records. Tesla's latest EV sees record sales in Q3..."

#### **Article 2:**
"Regulatory Scrutiny on Tesla's Self-Driving Tech. Regulators have raised concerns over Tesla's self-driving software..."

#### **Expected Output:**
"Comparison": "Article 1 highlights Tesla's strong sales, while Article 2 discusses regulatory issues.",
"Impact": "The first article boosts confidence in Tesla's market growth, while the second raises concerns about future regulatory hurdles."
"""

        user_message = """
Here are the news articles on the company.
Article {id1}:
{article_1}
Article {id2}:
{article_2}
"""

        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", system_message),
                ("human", user_message),
            ]
        )
        structured_llm = self.llm.with_structured_output(ComparativeAnalyzer)
        chain = prompt | structured_llm
        response = await chain.ainvoke(
            {"article_1": article_1, "article_2": article_2, "id1": id1, "id2": id2}
        )
        return {
            f"comparison of {id1}, {id2}": response.comparison,
            "impact": response.impact,
        }

    async def main(self, similar_pairs: list):
        """Runs all OpenAI API calls concurrently."""

        topic_extraction_tasks = [
            self.topic_extraction(article) for article in self.articles
        ]

        topic_overlap_tasks = [
            self.topic_overlap(id1, id2) for id1, id2, _ in similar_pairs
        ]

        comparative_analysis_tasks = [
            self.comparative_analysis(id1, id2) for id1, id2, _ in similar_pairs
        ]

        (
            topic_extraction_results,
            topic_overlap_results,
            comparative_analysis_results,
        ) = await asyncio.gather(
            asyncio.gather(*topic_extraction_tasks),
            asyncio.gather(*topic_overlap_tasks),
            asyncio.gather(*comparative_analysis_tasks),
        )
        return {
            "topic_extraction_results": topic_extraction_results,
            "topic_overlap_results": topic_overlap_results,
            "comparative_analysis_results": comparative_analysis_results,
        }

    def final_analysis(self, comparative_analysis_articles):
        comparative_results = "Comparative Analysis: \n"
        for comparisons in comparative_analysis_articles:
            comparison, impact = comparisons.values()
            comparative_results += f"comparison: {comparison} \n impact: {impact} \n\n"

        template = """
You are an AI assistant that reads a comparative analysis of articles
and summarizes it to produce the final sentiment analysis.
Make the final sentiment analysis less than 20 words.
Comparative Analysis:
{comparative_results}
"""
        prompt = ChatPromptTemplate.from_template(template)
        structured_llm = self.llm.with_structured_output(FinalAnalysis)
        chain = prompt | structured_llm
        response = chain.invoke({"comparative_results": comparative_results})
        return response.english, response.hindi
requirements.txt ADDED
@@ -0,0 +1,34 @@
beautifulsoup4==4.13.3
fastapi  # imported directly by api.py (also installed as a gradio dependency)
gradio==5.22.0
gradio_client==1.8.0
gTTS  # used by utils.py for Hindi text-to-speech
langchain==0.3.20
langchain-community==0.3.19
langchain-core==0.3.45
langchain-openai==0.3.9
langchain-text-splitters==0.3.6
multiprocess==0.70.16
numpy==1.26.4
openai==1.66.3
pandas==2.2.3
pydantic==2.10.6
pydantic_core==2.27.2
pydantic-settings==2.8.1
requests==2.32.3
scikit-learn==1.6.1
scipy==1.15.2
sentence-transformers==3.4.1
tokenizers==0.20.3
torch==2.6.0+cu124
torchaudio==2.6.0
transformers==4.46.1
uvicorn==0.34.0
utils.py ADDED
@@ -0,0 +1,146 @@
import requests
from bs4 import BeautifulSoup
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

import itertools
import re
import heapq
import torch
from gtts import gTTS


def filter_articles(articles_list, company_name):
    """
    Keeps only the articles that mention the company name.

    Args:
        articles_list (list): List of dictionaries with 'title' and 'summary'.
        company_name (str): The company name to filter articles by.

    Returns:
        list: A filtered list of articles that contain the company name.
    """
    articles_list_filtered = []

    for article in articles_list:
        full_text = (article["title"] + " " + article["summary"]).lower()

        # Escape the company name so any regex metacharacters in it match literally
        if re.search(re.escape(company_name.lower()), full_text):
            articles_list_filtered.append(article)

    return articles_list_filtered


def bs4_extractor(company_name: str):
    """
    Extracts news articles from The New York Times and BBC for a given company.

    Args:
        company_name (str): The name of the company to search for.

    Returns:
        list: A list of dictionaries containing article titles and summaries.
    """
    articles_list = []

    # Fetch and parse NYTimes search results
    nytimes_url = f"https://www.nytimes.com/search?query={company_name}"
    nytimes_page = requests.get(nytimes_url, timeout=10).text
    nytimes_soup = BeautifulSoup(nytimes_page, "html.parser")

    for article in nytimes_soup.find_all("li", {"data-testid": "search-bodega-result"}):
        try:
            title = article.find("h4").text.strip()
            # Site-specific class name; brittle if NYTimes changes its markup
            summary = article.find("p", {"class": "css-e5tzus"}).text.strip()

            if not title or not summary:
                continue

            articles_list.append({"title": title, "summary": summary})
        except AttributeError as e:
            print(f"NYTimes Extraction Error: {e}")
            continue

    # Fetch and parse BBC search results
    bbc_url = f"https://www.bbc.com/search?q={company_name}"
    bbc_page = requests.get(bbc_url, timeout=10).text
    bbc_soup = BeautifulSoup(bbc_page, "html.parser")

    for article in bbc_soup.find_all("div", {"data-testid": "newport-article"}):
        try:
            title = article.find("h2", {"data-testid": "card-headline"}).text.strip()
            # Site-specific class name; brittle if BBC changes its markup
            summary = article.find(
                "div", {"class": "sc-4ea10043-3 kMizuB"}
            ).text.strip()

            if not title or not summary:
                continue

            articles_list.append({"title": title, "summary": summary})
        except AttributeError as e:
            print(f"BBC Extraction Error: {e}")
            continue

    # Cap at 10 articles, then drop any that do not mention the company
    articles_list = articles_list[:10]
    articles_filtered = filter_articles(articles_list, company_name)
    return articles_filtered


def save_audio(hindi_text):
    """Converts the Hindi text to speech and saves it as output.mp3."""
    tts = gTTS(text=hindi_text, lang="hi", slow=False)
    tts.save("output.mp3")


class SentimentAnalyzer:

    def __init__(
        self, model_id="mrm8488/deberta-v3-ft-financial-news-sentiment-analysis"
    ):
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.pipe = pipeline(task="text-classification", model=model_id, device=device)

    def classify_sentiments(self, articles_list):
        """
        Classifies the sentiment of each article based on its title and summary.

        Args:
            articles_list (list of dict): A list of articles with 'title' and 'summary' keys.

        Returns:
            list of dict: The same list with a 'sentiment' key added to each article.
        """
        for article in articles_list:
            sentiment = self.pipe(f"{article['title']}. {article['summary']}")
            article["sentiment"] = sentiment[0]["label"]

        return articles_list


class SemanticGrouping:

    def __init__(self, model_id="sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_id)

    def find_top_k_similar_articles(self, articles, k=5):
        """
        Finds the top-k most similar pairs of articles using cosine similarity.

        Args:
            articles (list of str): A list of article texts to compare.
            k (int, optional): The number of top similar pairs to return. Defaults to 5.

        Returns:
            list of tuples: A list of (index1, index2, similarity_score) tuples.
        """
        embeddings = self.model.encode(articles, convert_to_tensor=True)
        cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

        # Score every unordered pair of articles, then keep the k highest
        pairs = itertools.combinations(range(len(articles)), 2)
        similarity_scores = [(i, j, cosine_scores[i][j].item()) for i, j in pairs]

        top_k_pairs = heapq.nlargest(k, similarity_scores, key=lambda x: x[2])

        return top_k_pairs