Upload 5 files

- README.md +168 -10
- api.py +203 -0
- llm_utils.py +241 -0
- requirements.txt +34 -0
- utils.py +146 -0

README.md
CHANGED
@@ -1,10 +1,168 @@
```diff
- ---
- title:
- emoji:
- colorFrom:
- colorTo:
- sdk:
-
-
-
-
```
---
title: News Summarizer
emoji: 👁
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 5.22.0
app_file: app.py
pinned: false
short_description: An app for summarizing news articles on orgs.
---

# News Summarization and Text-to-Speech Application

## Overview
This project is a web-based application that extracts key details from multiple news articles related to a given company, performs sentiment analysis, conducts a comparative analysis, and generates text-to-speech (TTS) output in Hindi.

## Features
- **News Extraction**: Scrapes and displays up to 10 news articles from The New York Times and BBC.
- **Sentiment Analysis**: Categorizes articles as Positive, Negative, or Neutral.
- **Comparative Analysis**: Groups the most semantically similar articles, then compares the groups to derive insights into how a company's news coverage varies.
- **Text-to-Speech (TTS)**: Converts the summarized sentiment report into Hindi speech.
- **User Interface**: Provides a simple web-based interface using Gradio.
- **API Integration**: Implements FastAPI for backend communication.
- **Deployment**: Deployable on Hugging Face Spaces.

## Tech Stack
- **Frontend**: Gradio
- **Backend**: FastAPI
- **Scraping**: BeautifulSoup
- **NLP**: OpenAI GPT models, LangChain, Sentence Transformers
- **Sentiment Analysis**: Pre-trained Transformer model
- **Text-to-Speech**: Google TTS (gTTS)
- **Deployment**: Uvicorn, Hugging Face Spaces

---

## Installation and Setup

### 1. Clone the Repository
```bash
git clone https://github.com/Senzen18/News-Summarizer.git
cd News-Summarizer
```

### 2. Install Dependencies
Ensure you have Python 3.8+ installed. Then run:
```bash
pip install -r requirements.txt
```

### 3. Run the FastAPI Endpoints Only
Start the FastAPI backend:
```bash
uvicorn api:app --host 127.0.0.1 --port 8000 --reload
```

### 4. Run Both Gradio and FastAPI
Start the Gradio app, which also brings up the FastAPI backend:
```bash
gradio app.py
```

### 5. Access the Application
Once started, access the Gradio UI at:
```
http://127.0.0.1:7860
```

---

## API Endpoints

### 1. Fetch News
**GET** `/news/{company_name}`
- Fetches the latest articles related to a company.
- **Example:** `/news/Tesla`

### 2. Analyze News Sentiment
**GET** `/analyze-news`
- Performs sentiment analysis on the extracted articles.

### 3. Compare News Articles
**POST** `/compare-news`
- Performs comparative analysis.
- **Request Body:**
```json
{
  "api_key": "your-openai-api-key",
  "model_name": "gpt-4o-mini",
  "company_name": "Tesla"
}
```

### 4. Generate Hindi Summary
**GET** `/hindi-summary`
- Returns the summarized analysis in Hindi and stores the speech file.
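
For a quick check, the endpoints can be exercised in order with `curl` (a sketch assuming the backend is running locally on port 8000; `sk-...` stands in for a real OpenAI key):
```bash
# 1. Fetch and cache articles for a company
curl http://127.0.0.1:8000/news/Tesla

# 2. Sentiment analysis over the cached articles
curl http://127.0.0.1:8000/analyze-news

# 3. Comparative analysis (needs a valid OpenAI API key)
curl -X POST http://127.0.0.1:8000/compare-news \
  -H "Content-Type: application/json" \
  -d '{"api_key": "sk-...", "model_name": "gpt-4o-mini", "company_name": "Tesla"}'

# 4. Hindi summary (also writes output.mp3 on the server)
curl http://127.0.0.1:8000/hindi-summary
```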

---

## File Structure
```
├── api.py             # FastAPI backend for news extraction, sentiment analysis, and comparison
├── app.py             # Gradio frontend to interact with users
├── llm_utils.py       # Handles OpenAI API calls for topic extraction and comparative analysis
├── utils.py           # Utility functions for web scraping, sentiment analysis, and TTS
├── requirements.txt   # Dependencies
└── README.md          # Project documentation
```

---

## Assumptions and Limitations
- Extracts articles only from The New York Times and BBC.
- Requires a valid OpenAI API key for topic extraction and comparative analysis; sentiment analysis runs on a local transformer model.
- Hindi speech output uses gTTS, which requires an internet connection.

---

## Deployment
This project can be deployed on Hugging Face Spaces. To deploy:
1. Push your repository to GitHub.
2. Follow the [Hugging Face Spaces documentation](https://huggingface.co/docs/spaces) for deployment.

---

## Example Output
```json
{
  "Company": "Tesla",
  "Articles": [
    {
      "Title": "Tesla's New Model Breaks Sales Records",
      "Summary": "Tesla's latest EV sees record sales in Q3...",
      "Sentiment": "Positive",
      "Topics": ["Electric Vehicles", "Stock Market", "Innovation"]
    }
  ],
  "Comparative Sentiment Score": {
    "Sentiment Distribution": {"Positive": 1, "Negative": 1, "Neutral": 0},
    "Coverage Differences": [{
      "Comparison": "Article 1 highlights Tesla's strong sales, while Article 2 discusses regulatory issues.",
      "Impact": "Investors may react positively to growth news but stay cautious due to regulatory scrutiny."
    }],
    "Topic Overlap": {
      "Common Topics": ["Electric Vehicles"],
      "Unique Topics in Article 1": ["Stock Market", "Innovation"],
      "Unique Topics in Article 2": ["Regulations", "Autonomous Vehicles"]
    }
  },
  "Final Sentiment Analysis": "Tesla’s latest news coverage is mostly positive. Potential stock growth expected.",
  "Audio": "[Play Hindi Speech]"
}
```

---

## Contributing
Feel free to contribute by:
- Adding more news sources
- Improving the sentiment model
- Enhancing the UI

---

## Contact
For queries, reach out at [[email protected]].
api.py
ADDED
@@ -0,0 +1,203 @@
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from utils import bs4_extractor, SentimentAnalyzer, SemanticGrouping, save_audio
from llm_utils import ChatBot
import openai

app = FastAPI()

# Initialize the sentiment analyzer and semantic grouping models once at startup
sentiment_analyzer = SentimentAnalyzer()
semantic_grouping = SemanticGrouping()


class CompareNewsRequest(BaseModel):
    api_key: str
    model_name: str = "gpt-4o-mini"
    company_name: str


def check_api_key(api_key: str) -> bool:
    """Validates an OpenAI API key by listing the available models."""
    if api_key is None:
        return False
    client = openai.OpenAI(api_key=api_key)
    try:
        client.models.list()
    except openai.AuthenticationError:
        return False
    return True


def check_model_name(model_name: str, api_key: str) -> bool:
    """Checks that the requested model is available for this API key."""
    client = openai.OpenAI(api_key=api_key)
    model_list = [model.id for model in client.models.list()]
    return model_name in model_list


# Helper function to get articles and article sentiments
def get_articles(company_name: str):
    if not company_name:
        raise HTTPException(status_code=400, detail="The company name is required.")

    news_articles = bs4_extractor(company_name)

    if not news_articles:
        raise HTTPException(status_code=404, detail="No news found")

    articles_data = [
        {"title": article["title"], "summary": article["summary"]}
        for article in news_articles
    ]

    analyzed_articles = sentiment_analyzer.classify_sentiments(articles_data)

    return news_articles, analyzed_articles


def get_formatted_output(
    company_name,
    analyzed_articles,
    topic_extraction_results,
    topic_overlap_results,
    comparative_analysis_results,
    final_analysis,
):
    """Assembles the final response payload from the individual analysis results."""
    articles = analyzed_articles
    sentiment_distribution = {"positive": 0, "negative": 0, "neutral": 0}
    for i in range(len(articles)):
        articles[i]["topics"] = topic_extraction_results[i]

        sentiment = articles[i]["sentiment"]
        sentiment_distribution[sentiment] += 1
    comparative_sentiment_score = {
        "Sentiment Distribution": sentiment_distribution,
        "Coverage Differences": comparative_analysis_results,
        "Topic Overlap": topic_overlap_results,
    }
    final_output = {
        "Company": company_name,
        "Articles": articles,
        "Comparative Sentiment Score": comparative_sentiment_score,
        "Final Sentiment Analysis": final_analysis,
    }

    return final_output


@app.get("/news/{company_name}")
def get_news(company_name: str):
    """
    API endpoint to get news for a company.
    Fetches news articles from NYTimes and BBC.
    """
    try:
        news_articles = bs4_extractor(company_name)
        app.state.company_name = company_name
        if not news_articles:
            raise HTTPException(status_code=404, detail="No news found")

        app.state.news_articles = news_articles
        return {"company": company_name, "articles": news_articles}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/analyze-news")
def analyze_news():
    """
    API endpoint to analyze news articles.
    Performs sentiment analysis on the articles cached by /news.
    """
    if not getattr(app.state, "news_articles", None):
        raise HTTPException(
            status_code=400, detail="Fetch news for a company before the analysis."
        )
    try:
        articles_data = [
            {"title": article["title"], "summary": article["summary"]}
            for article in app.state.news_articles
        ]
        analyzed_articles = sentiment_analyzer.classify_sentiments(articles_data)
        app.state.articles_with_sentiments = analyzed_articles
        return {"analyzed_articles": analyzed_articles}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/compare-news")
async def compare_news(request_info: CompareNewsRequest):
    """
    API endpoint to perform comparative analysis.
    Uses semantic similarity to find the most related articles.
    """
    api_key = request_info.api_key
    company_name = request_info.company_name
    model_name = request_info.model_name
    if not check_api_key(api_key):
        raise HTTPException(
            status_code=401,
            detail="The entered API key does not seem to be right. Please enter a valid API key.",
        )

    if not check_model_name(model_name, api_key):
        raise HTTPException(
            status_code=400,
            detail="The model you specified does not exist.",
        )
    news_articles, analyzed_articles = get_articles(company_name)
    try:
        articles_text = [
            f"{article['title']}. {article['summary']}" for article in news_articles
        ]

        if len(articles_text) < 2:
            raise HTTPException(
                status_code=400, detail="At least two articles required for comparison."
            )
        top_similar_articles = semantic_grouping.find_top_k_similar_articles(
            articles_text, k=5
        )

        llm_chatbot = ChatBot(
            api_key,
            model_name,
            analyzed_articles,
            company_name,
        )
        llm_result = await llm_chatbot.main(top_similar_articles)

        (
            topic_extraction_results,
            topic_overlap_results,
            comparative_analysis_results,
        ) = llm_result.values()
        final_analysis_eng, final_analysis_hi = llm_chatbot.final_analysis(
            comparative_analysis_results
        )

        final_output = get_formatted_output(
            company_name,
            analyzed_articles,
            topic_extraction_results,
            topic_overlap_results,
            comparative_analysis_results,
            final_analysis_eng,
        )

        app.state.hindi_summary = final_analysis_hi
        return final_output

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/hindi-summary")
def get_hindi_summary():
    if not getattr(app.state, "hindi_summary", None):
        raise HTTPException(
            status_code=400, detail="Generate the comparative analysis first."
        )
    save_audio(app.state.hindi_summary)
    return {"hindi_summary": app.state.hindi_summary}
```
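
The endpoints are stateful: `/analyze-news` and `/hindi-summary` read results that earlier calls stored in `app.state`, so a client should hit them in order. A minimal client sketch with `requests` (hypothetical key and local host):

```python
import requests

BASE = "http://127.0.0.1:8000"

# 1. Fetch and cache articles for the company
news = requests.get(f"{BASE}/news/Tesla").json()

# 2. Sentiment analysis over the cached articles
analyzed = requests.get(f"{BASE}/analyze-news").json()

# 3. Comparative analysis (requires a valid OpenAI API key)
report = requests.post(
    f"{BASE}/compare-news",
    json={"api_key": "sk-...", "model_name": "gpt-4o-mini", "company_name": "Tesla"},
).json()

# 4. Hindi summary; also writes output.mp3 on the server side
hindi = requests.get(f"{BASE}/hindi-summary").json()
print(hindi["hindi_summary"])
```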
llm_utils.py
ADDED
@@ -0,0 +1,241 @@
```python
import asyncio
from typing import List

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class TopicExtraction(BaseModel):
    """Extracts topics from a news article."""

    topics: List[str] = Field(
        ..., description="A list of topics covered in the news article."
    )


class TopicOverlap(BaseModel):
    """Extracts common and unique topics between two news articles."""

    common_topics: List[str] = Field(
        ..., description="A list of topics covered in both articles."
    )
    unique_topics_1: List[str] = Field(
        ..., description="A list of topics unique to article 1."
    )
    unique_topics_2: List[str] = Field(
        ..., description="A list of topics unique to article 2."
    )


class ComparativeAnalyzer(BaseModel):
    """Compares a given pair of articles and extracts comparison and impact."""

    comparison: str = Field(
        ..., description="A sentence of comparative insights between articles."
    )
    impact: str = Field(
        ..., description="A sentence of potential impacts from the compared articles."
    )


class FinalAnalysis(BaseModel):
    """Summarizes the comparative analysis."""

    english: str = Field(..., description="Summary of the analysis in English.")
    hindi: str = Field(..., description="Summary of the analysis in Hindi.")


class ChatBot:
    def __init__(
        self, api_key: str, model: str, articles_dict: list, company_name: str
    ):
        self.llm = ChatOpenAI(model=model, api_key=api_key, temperature=0.1)
        # Flatten each article dict into a single text block for prompting
        articles_list = []
        for article in articles_dict:
            title = article["title"]
            summary = article["summary"]
            sentiment = article["sentiment"]
            articles_list.append(
                f"title {title} \n summary {summary} \n sentiment {sentiment} \n\n"
            )

        self.articles = articles_list
        self.company_name = company_name

    async def topic_extraction(self, article: str):
        system_message = """You are an expert in text analysis and topic extraction. Your task is to identify the main topics from a short news article.

### Instructions:
- Extract **2 to 3 key topics** that summarize the core ideas of the article.
- Use **concise, generalizable topics** (e.g., "Electric Vehicles" instead of "Tesla Model X").
- Avoid generic words like "news" or "report".
- If relevant, include categories such as **Technology, Finance, Politics, Business, or Science**.
- Return the topics in **JSON format** as a list of strings.
- Separate the topics for each article by a line break.
- Do not include just the company name {company_name}.

### Example:

#### Input Article:
"Tesla has launched a new AI-powered self-driving feature that improves vehicle autonomy and enhances road safety. The update is expected to impact the automotive industry's shift toward electric and smart vehicles."

#### Output:
["Artificial Intelligence", "Self-Driving Cars", "Automotive Industry", "Electric Vehicles", "Road Safety"]
"""

        prompt = ChatPromptTemplate.from_messages(
            [("system", system_message), ("human", "Input Article: \n {articles}")]
        )
        structured_llm = self.llm.with_structured_output(TopicExtraction)
        chain = prompt | structured_llm
        response = await chain.ainvoke(
            {"company_name": self.company_name, "articles": article}
        )
        return response.topics

    async def topic_overlap(self, id1: int, id2: int):
        article_1, article_2 = self.articles[id1], self.articles[id2]

        system_message = """You are an advanced AI specializing in text analysis and topic extraction. Your task is to compare two news articles and extract key topics.

### **Instructions:**
- Identify **common topics** present in **both articles**.
- Identify **topics unique to each article**.
- Use **generalized topics** (e.g., "Electric Vehicles" instead of "Tesla Model X").
- Ensure topics are **concise and meaningful**.
---
### **Example:**
#### **Article 1:**
"Tesla has launched a new AI-powered self-driving feature that enhances vehicle autonomy and road safety. The update is expected to impact the automotive industry."

#### **Article 2:**
"Regulators are reviewing Tesla’s self-driving technology due to safety concerns. Experts debate whether AI-based vehicle autonomy meets current legal standards."

#### **Expected Output:**
"common_topics": ["Self-Driving Cars", "Artificial Intelligence", "Safety"],
"unique_topics_1": ["Automotive Industry"],
"unique_topics_2": ["Regulations", "Legal Standards"]
"""

        user_message = """
Here are the news articles on the company.
Article 1:
{article_1}
Article 2:
{article_2}
"""

        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", system_message),
                ("human", user_message),
            ]
        )
        structured_llm = self.llm.with_structured_output(TopicOverlap)
        chain = prompt | structured_llm
        response = await chain.ainvoke({"article_1": article_1, "article_2": article_2})
        return {
            "Common Topics": response.common_topics,
            f"Unique Topics in Article {id1}": response.unique_topics_1,
            f"Unique Topics in Article {id2}": response.unique_topics_2,
        }

    async def comparative_analysis(self, id1: int, id2: int):
        article_1, article_2 = self.articles[id1], self.articles[id2]

        system_message = """
You are an AI assistant that performs comparative analysis on given articles.
Analyze the following articles and provide a comparative analysis. Highlight their key themes, sentiment, and impact.
Compare how each article portrays the companies and discuss potential implications for investors and the industry.
Structure your response with 'Comparison' and 'Impact' sections.
The length of each comparison and impact should be less than 20 words.
Mention the article ids.

### **Example:**
#### **Article 1:**
"Tesla's New Model Breaks Sales Records. Tesla's latest EV sees record sales in Q3..."

#### **Article 2:**
"Regulatory Scrutiny on Tesla's Self-Driving Tech. Regulators have raised concerns over Tesla’s self-driving software..."

#### **Expected Output:**
"Comparison": "Article 1 highlights Tesla's strong sales, while Article 2 discusses regulatory issues.",
"Impact": "The first article boosts confidence in Tesla's market growth, while the second raises concerns about future regulatory hurdles."
"""

        user_message = """
Here are the news articles on the company.
Article {id1}:
{article_1}
Article {id2}:
{article_2}
"""

        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", system_message),
                ("human", user_message),
            ]
        )
        structured_llm = self.llm.with_structured_output(ComparativeAnalyzer)
        chain = prompt | structured_llm
        response = await chain.ainvoke(
            {"article_1": article_1, "article_2": article_2, "id1": id1, "id2": id2}
        )
        return {
            f"comparison of {id1}, {id2}": response.comparison,
            "impact": response.impact,
        }

    async def main(self, similar_pairs: list):
        """Runs all OpenAI API calls in parallel."""

        topic_extraction_tasks = [
            self.topic_extraction(article) for article in self.articles
        ]

        topic_overlap_tasks = [
            self.topic_overlap(id1, id2) for id1, id2, _ in similar_pairs
        ]

        comparative_analysis_tasks = [
            self.comparative_analysis(id1, id2) for id1, id2, _ in similar_pairs
        ]

        (
            topic_extraction_results,
            topic_overlap_results,
            comparative_analysis_results,
        ) = await asyncio.gather(
            asyncio.gather(*topic_extraction_tasks),
            asyncio.gather(*topic_overlap_tasks),
            asyncio.gather(*comparative_analysis_tasks),
        )
        return {
            "topic_extraction_results": topic_extraction_results,
            "topic_overlap_results": topic_overlap_results,
            "comparative_analysis_results": comparative_analysis_results,
        }

    def final_analysis(self, comparative_analysis_articles):
        comparative_results = "Comparative Analysis: \n"
        for comparisons in comparative_analysis_articles:
            comparison, impact = comparisons.values()
            comparative_results += f"comparison: {comparison} \n impact: {impact} \n\n"

        template = """
You are an AI assistant that reads a comparative analysis of articles
and summarizes them to produce the final sentiment analysis.
Make the final sentiment analysis less than 20 words.
Comparative Analysis:
{comparative_results}
"""
        prompt = ChatPromptTemplate.from_template(template)
        structured_llm = self.llm.with_structured_output(FinalAnalysis)
        chain = prompt | structured_llm
        response = chain.invoke({"comparative_results": comparative_results})
        return response.english, response.hindi
```
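
A minimal usage sketch of `ChatBot` (hypothetical inputs; assumes a valid OpenAI API key, articles that already carry `title`, `summary`, and `sentiment` keys, and similar pairs produced by `SemanticGrouping` in `utils.py`):

```python
import asyncio
from llm_utils import ChatBot

articles = [
    {"title": "Tesla sales surge", "summary": "Record Q3 deliveries...", "sentiment": "positive"},
    {"title": "Tesla faces probe", "summary": "Regulators review Autopilot...", "sentiment": "negative"},
]

bot = ChatBot("sk-...", "gpt-4o-mini", articles, "Tesla")

# (id1, id2, similarity) tuples, normally produced by
# SemanticGrouping.find_top_k_similar_articles
pairs = [(0, 1, 0.42)]

results = asyncio.run(bot.main(pairs))  # runs the LLM calls concurrently
english, hindi = bot.final_analysis(results["comparative_analysis_results"])
print(english)
```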
requirements.txt
ADDED
@@ -0,0 +1,34 @@
```text
beautifulsoup4==4.13.3
fastapi
gradio==5.22.0
gradio_client==1.8.0
gtts
langchain==0.3.20
langchain-community==0.3.19
langchain-core==0.3.45
langchain-openai==0.3.9
langchain-text-splitters==0.3.6
multiprocess==0.70.16
numpy==1.26.4
openai==1.66.3
pandas==2.2.3
pydantic==2.10.6
pydantic_core==2.27.2
pydantic-settings==2.8.1
requests==2.32.3
scikit-learn==1.6.1
scipy==1.15.2
sentence-transformers==3.4.1
tokenizers==0.20.3
torch==2.6.0+cu124
torchaudio==2.6.0
transformers==4.46.1
uvicorn==0.34.0
```
utils.py
ADDED
@@ -0,0 +1,146 @@
```python
import requests
from bs4 import BeautifulSoup
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# from transformers import AutoModelForSequenceClassification, AutoTokenizer
import itertools
import re
import heapq
import torch
from gtts import gTTS


def filter_articles(articles_list, company_name):
    """
    Filters articles so that only those mentioning the company name remain.

    Args:
        articles_list (list): List of dictionaries with 'title' and 'summary'.
        company_name (str): The company name to filter articles by.

    Returns:
        list: A filtered list of articles that contain the company name.
    """
    articles_list_filtered = []

    for article in articles_list:
        full_text = (article["title"] + " " + article["summary"]).lower()

        # Escape the company name so regex metacharacters in it match literally
        if re.search(re.escape(company_name.lower()), full_text):
            articles_list_filtered.append(article)

    return articles_list_filtered


def bs4_extractor(company_name: str):
    """
    Extracts news articles from The New York Times and BBC for a given company.

    Args:
        company_name (str): The name of the company to search for.

    Returns:
        list: A list of dictionaries containing article titles and summaries.
    """
    articles_list = []

    # Fetch and parse NYTimes articles
    nytimes_url = f"https://www.nytimes.com/search?query={company_name}"
    nytimes_page = requests.get(nytimes_url).text
    nytimes_soup = BeautifulSoup(nytimes_page, "html.parser")

    for article in nytimes_soup.find_all("li", {"data-testid": "search-bodega-result"}):
        try:
            title = article.find("h4").text.strip()
            summary = article.find("p", {"class": "css-e5tzus"}).text.strip()

            if not title or not summary:
                continue

            articles_list.append({"title": title, "summary": summary})
        except AttributeError as e:
            print(f"NYTimes Extraction Error: {e}")
            continue

    # Fetch and parse BBC articles
    bbc_url = f"https://www.bbc.com/search?q={company_name}"
    bbc_page = requests.get(bbc_url).text
    bbc_soup = BeautifulSoup(bbc_page, "html.parser")

    for article in bbc_soup.find_all("div", {"data-testid": "newport-article"}):
        try:
            title = article.find("h2", {"data-testid": "card-headline"}).text.strip()
            summary = article.find(
                "div", {"class": "sc-4ea10043-3 kMizuB"}
            ).text.strip()

            if not title or not summary:
                continue

            articles_list.append({"title": title, "summary": summary})
        except AttributeError as e:
            print(f"BBC Extraction Error: {e}")
            continue

    # Keep at most 10 articles, then drop those that never mention the company
    articles_list = articles_list[:10]
    articles_filtered = filter_articles(articles_list, company_name)
    return articles_filtered


def save_audio(hindi_text):
    """Converts Hindi text to speech and writes it to output.mp3."""
    tts = gTTS(text=hindi_text, lang="hi", slow=False)
    tts.save("output.mp3")


class SentimentAnalyzer:

    def __init__(
        self, model_id="mrm8488/deberta-v3-ft-financial-news-sentiment-analysis"
    ):
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.pipe = pipeline(task="text-classification", model=model_id, device=device)

    def classify_sentiments(self, articles_list):
        """
        Classifies the sentiment of each article based on its title and summary.

        Args:
            articles_list (list of dict): A list of articles with 'title' and 'summary' keys.

        Returns:
            list of dict: The same list with added 'sentiment' keys.
        """
        for article in articles_list:
            sentiment = self.pipe(f"{article['title']}. {article['summary']}")
            article["sentiment"] = sentiment[0]["label"]

        return articles_list


class SemanticGrouping:

    def __init__(self, model_id="sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_id)

    def find_top_k_similar_articles(self, articles, k=5):
        """
        Finds the top-k most similar pairs of articles using cosine similarity.

        Args:
            articles (list of str): A list of article texts to compare.
            k (int, optional): The number of top similar pairs to return. Defaults to 5.

        Returns:
            list of tuples: A list of (index1, index2, similarity_score) tuples.
        """
        embeddings = self.model.encode(articles, convert_to_tensor=True)
        cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

        # Score every unordered pair of articles, then keep the k highest
        pairs = itertools.combinations(range(len(articles)), 2)
        similarity_scores = [(i, j, cosine_scores[i][j].item()) for i, j in pairs]

        top_k_pairs = heapq.nlargest(k, similarity_scores, key=lambda x: x[2])

        return top_k_pairs
```
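
An end-to-end sketch of these utilities (network access required, and the scraping selectors may break if NYT or BBC change their markup):

```python
from utils import bs4_extractor, SentimentAnalyzer, SemanticGrouping

# Scrape up to 10 NYT/BBC articles mentioning the company
articles = bs4_extractor("Tesla")

# Attach a sentiment label to each article
analyzed = SentimentAnalyzer().classify_sentiments(
    [{"title": a["title"], "summary": a["summary"]} for a in articles]
)

# Rank article pairs by embedding similarity
texts = [f"{a['title']}. {a['summary']}" for a in articles]
pairs = SemanticGrouping().find_top_k_similar_articles(texts, k=5)

for i, j, score in pairs:
    print(f"articles {i} and {j}: cosine similarity {score:.2f}")
```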