
Data Handling in AskVeracity

This document explains how data flows through the AskVeracity fact-checking and misinformation detection system, from user input to final verification results.

Data Flow Overview

User Input → Claim Extraction → Category Detection → Evidence Retrieval → Evidence Analysis → Classification → Explanation → Result Display

User Input Processing

Input Sanitization and Extraction

  1. Input Acceptance: The system accepts user input as free-form text through the Streamlit interface.

  2. Claim Extraction (modules/claim_extraction.py):

    • For concise inputs (<30 words), the system preserves the input as-is
    • For longer texts, an LLM extracts the main factual claim
    • Validation ensures the extraction doesn't add information not present in the original
    • Entity preservation is verified using spaCy's NER
  3. Claim Shortening:

    • For evidence retrieval, claims are shortened to preserve key entities and context
    • Preserves entity mentions, key nouns, titles, country references, and negation contexts
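
A minimal sketch of the extraction and validation flow described in steps 2 and 3 is shown below; the function names, the prompt, and the `llm` callable are illustrative assumptions rather than the actual modules/claim_extraction.py interface:

```python
# Minimal sketch of claim extraction; names, the prompt, and the llm callable
# are illustrative, not the exact modules/claim_extraction.py implementation.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_claim(user_input: str, llm) -> str:
    """Return the main factual claim from free-form user input."""
    # Concise inputs (under 30 words) are preserved as-is.
    if len(user_input.split()) < 30:
        return user_input.strip()

    # For longer texts, ask an LLM to isolate the main factual claim.
    extracted = llm(
        "Extract the single main factual claim from the following text "
        "without adding any information that is not present:\n\n" + user_input
    ).strip()

    # Verify entity preservation: the extraction must not introduce
    # entities that are absent from the original input.
    original_ents = {ent.text.lower() for ent in nlp(user_input).ents}
    extracted_ents = {ent.text.lower() for ent in nlp(extracted).ents}
    if not extracted_ents.issubset(original_ents):
        return user_input.strip()  # fall back to the original input
    return extracted
```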

Evidence Retrieval and Processing

Multi-source Evidence Gathering

Evidence is collected from multiple sources in parallel (modules/evidence_retrieval.py):

  1. Category Detection (modules/category_detection.py):

    • Detects the claim category (ai, science, technology, politics, business, world, sports, entertainment)
    • Prioritizes sources based on category
    • No category receives preferential weighting; assignment is based purely on keyword matching
  2. Wikipedia evidence:

    • Searches the Wikipedia API for relevant articles
    • Extracts introductory paragraphs
    • Processes the top 3 search results in parallel
  3. Wikidata evidence:

    • SPARQL queries for structured data
    • Entity extraction with descriptions
  4. News API evidence:

    • Retrieval from NewsAPI.org with date filtering
    • Prioritizes recent articles
    • Extracts titles, descriptions, and content snippets
  5. RSS Feed evidence (modules/rss_feed.py):

    • Parallel retrieval from multiple RSS feeds
    • Category-specific feed selection
    • Relevance and recency scoring
  6. ClaimReview evidence:

    • Google's Fact Check Tools API integration
    • Retrieves fact-checks from fact-checking organizations
    • Includes ratings and publisher information
  7. Scholarly evidence:

    • OpenAlex API for academic sources
    • Extracts titles, abstracts, and publication dates
  8. Category Fallback mechanism:

    • For AI claims, falls back to technology RSS feeds when evidence is insufficient
    • For other categories, falls back to default RSS feeds
    • Ensures robust evidence retrieval across related domains
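
A minimal sketch of the parallel gathering step, assuming each source integration is exposed as a callable that takes the claim and returns a list of evidence items (the retriever mapping is a hypothetical interface, not the exact modules/evidence_retrieval.py code):

```python
# Sketch of parallel multi-source evidence gathering. The retrievers mapping
# (source name -> callable) is a hypothetical interface, not the actual
# modules/evidence_retrieval.py implementation.
from concurrent.futures import ThreadPoolExecutor, as_completed

def gather_evidence(claim: str, retrievers: dict) -> list:
    """retrievers maps a source name to a callable(claim) -> list of evidence items."""
    evidence = []
    with ThreadPoolExecutor(max_workers=max(len(retrievers), 1)) as pool:
        futures = {pool.submit(fn, claim): name for name, fn in retrievers.items()}
        for future in as_completed(futures):
            try:
                evidence.extend(future.result())
            except Exception:
                # A failing source degrades gracefully; the others still contribute.
                continue
    return evidence
```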

Evidence Preprocessing

Each evidence item is standardized to a consistent format:

Title: [title], Source: [source], Date: [date], URL: [url], Content: [content snippet]

Length limits are applied to reduce token usage:

  • Content snippets are limited to ~1000 characters
  • Evidence items are truncated while maintaining context
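
A minimal sketch of this standardization step, assuming each evidence item is a dictionary with the fields listed above (the field names are assumptions):

```python
# Sketch of evidence standardization; the dictionary keys are assumptions
# based on the format described above.
def format_evidence(item: dict, max_content_chars: int = 1000) -> str:
    """Render one evidence item in the standard text format, truncating content."""
    content = (item.get("content") or "")[:max_content_chars]
    return (
        f"Title: {item.get('title', '')}, "
        f"Source: {item.get('source', '')}, "
        f"Date: {item.get('date', '')}, "
        f"URL: {item.get('url', '')}, "
        f"Content: {content}"
    )
```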

Evidence Analysis and Relevance Ranking

Relevance Assessment

Evidence is analyzed and scored for relevance:

  1. Component Extraction:

    • Extract entities, verbs, and keywords from the claim
    • Use NLP processing to identify key claim components
  2. Entity and Verb Matching:

    • Match entities from claim to evidence (case-sensitive and case-insensitive)
    • Match verbs from claim to evidence
    • Score based on matches (entity matches weighted higher than verb matches)
  3. Temporal Relevance:

    • Detection of temporal indicators in claims
    • Date-based filtering for time-sensitive claims
    • Adjusts evidence retrieval window based on claim temporal context
  4. Scoring Formula (see the sketch after the evidence selection steps below):

    final_score = (entity_matches * 3.0) + (verb_matches * 2.0)
    

    If no entity or verb matches, fall back to keyword matching:

    final_score = keyword_matches * 1.0
    

Evidence Selection

The system selects the most relevant evidence:

  1. Relevance Sorting:

    • Evidence items sorted by relevance score (descending)
    • Top 10 most relevant items selected
  2. Handling No Evidence:

    • If no evidence is found, a placeholder is returned
    • Ensures graceful handling of edge cases
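
A minimal sketch of the scoring formula and top-10 selection described above; the substring matching is a simplified stand-in for the NLP-based entity and verb matching in the implementation:

```python
# Sketch of relevance scoring and selection; simplified substring matching
# stands in for the spaCy-based entity/verb matching.
def score_evidence(evidence_text: str, entities: list, verbs: list, keywords: list) -> float:
    """Score one evidence item against the extracted claim components."""
    text = evidence_text.lower()
    entity_matches = sum(1 for e in entities if e.lower() in text)
    verb_matches = sum(1 for v in verbs if v.lower() in text)
    if entity_matches or verb_matches:
        # Entity matches are weighted higher than verb matches.
        return entity_matches * 3.0 + verb_matches * 2.0
    # Fallback: plain keyword matching when no entities or verbs match.
    return sum(1 for k in keywords if k.lower() in text) * 1.0

def select_top_evidence(scored_items: list, limit: int = 10) -> list:
    """scored_items: list of (score, evidence_text) tuples."""
    if not scored_items:
        # Placeholder keeps downstream steps working when nothing is found.
        return [(0.0, "No relevant evidence found.")]
    return sorted(scored_items, key=lambda pair: pair[0], reverse=True)[:limit]
```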

Truth Classification

Evidence Classification (modules/classification.py)

Each evidence item is classified individually:

  1. LLM Classification (see the sketch after this list):

    • Each evidence item is analyzed by an LLM
    • Classification categories: support, contradict, insufficient
    • Confidence score (0-100) assigned to each classification
    • Structured output parsing with fallback mechanisms
  2. Tense Normalization:

    • Normalizes verb tenses in claims to ensure consistent classification
    • Converts present simple and perfect forms to past tense equivalents
    • Preserves semantic equivalence across tense variations
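
A minimal sketch of the per-item classification step; the prompt, the JSON contract, and the `llm` callable are illustrative assumptions, not the exact modules/classification.py implementation:

```python
# Sketch of per-item evidence classification with structured-output parsing
# and a fallback; prompt and parsing details are illustrative.
import json

VALID_LABELS = {"support", "contradict", "insufficient"}

def classify_evidence(claim: str, evidence: str, llm) -> dict:
    """llm is assumed to be a callable taking a prompt and returning text."""
    prompt = (
        "Classify whether the evidence supports, contradicts, or is "
        "insufficient for the claim. Respond as JSON with keys "
        "'label' (support/contradict/insufficient) and 'confidence' (0-100).\n"
        f"Claim: {claim}\nEvidence: {evidence}"
    )
    raw = llm(prompt)
    try:
        parsed = json.loads(raw)
        label = parsed.get("label", "insufficient")
        confidence = float(parsed.get("confidence", 0))
    except (json.JSONDecodeError, TypeError, ValueError, AttributeError):
        # Fallback when the response is not valid structured output.
        label, confidence = "insufficient", 0.0
    if label not in VALID_LABELS:
        label = "insufficient"
    return {"label": label, "confidence": max(0.0, min(100.0, confidence))}
```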

Verdict Aggregation

Evidence classifications are aggregated to determine the final verdict:

  1. Weighted Aggregation:

    • 55% weight for count of support/contradict items
    • 45% weight for quality (confidence) of support/contradict items
  2. Confidence Calculation:

    • Formula: 1.0 - (min_score / max_score)
    • Higher confidence for consistent evidence
    • Lower confidence for mixed or insufficient evidence
  3. Final Verdict Categories:

    • "True (Based on Evidence)"
    • "False (Based on Evidence)"
    • "Uncertain"
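
A minimal sketch of the aggregation logic using the 55/45 weighting and the confidence formula above; the exact scoring details in the implementation may differ:

```python
# Sketch of verdict aggregation; weights and formula follow the description
# above, but the real implementation's scoring details may differ.
def aggregate_verdict(classifications: list) -> tuple:
    """classifications: list of {'label': ..., 'confidence': 0-100} dicts."""
    support = [c for c in classifications if c["label"] == "support"]
    contradict = [c for c in classifications if c["label"] == "contradict"]

    def combined_score(items):
        if not items:
            return 0.0
        count_score = len(items)
        quality_score = sum(c["confidence"] for c in items) / (100.0 * len(items))
        # 55% weight on how many items agree, 45% on how confident they are.
        return 0.55 * count_score + 0.45 * quality_score

    support_score = combined_score(support)
    contradict_score = combined_score(contradict)
    if support_score == contradict_score == 0.0:
        return "Uncertain", 0.0

    max_score = max(support_score, contradict_score)
    min_score = min(support_score, contradict_score)
    confidence = 1.0 - (min_score / max_score)  # high when evidence is consistent

    if support_score > contradict_score:
        return "True (Based on Evidence)", confidence
    if contradict_score > support_score:
        return "False (Based on Evidence)", confidence
    return "Uncertain", confidence
```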

Explanation Generation

Explanation Creation (modules/explanation.py)

Human-readable explanations are generated based on the verdict:

  1. Template Selection:

    • Different prompts for true, false, and uncertain verdicts
    • Special handling for claims containing negation
  2. Confidence Communication:

    • Translation of confidence scores to descriptive language
    • Clear communication of certainty/uncertainty
  3. Very Low Confidence Handling:

    • Special explanations for verdicts with very low confidence (<10%)
    • Strong recommendations to verify with authoritative sources
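
A minimal sketch of how a verdict-specific explanation prompt might be assembled; the template text below is illustrative, not the actual prompts in modules/explanation.py:

```python
# Sketch of explanation prompt selection by verdict; templates are illustrative.
def build_explanation_prompt(claim: str, verdict: str, confidence: float, has_negation: bool) -> str:
    if confidence < 0.10:
        preamble = (
            "Confidence in this verdict is very low; strongly recommend that the "
            "user verify the claim with authoritative sources."
        )
    elif verdict.startswith("True"):
        preamble = "Explain why the evidence supports the claim."
    elif verdict.startswith("False"):
        preamble = "Explain why the evidence contradicts the claim."
    else:
        preamble = "Explain why the evidence is insufficient or mixed."
    negation_note = (
        " The claim contains negation; reason carefully about its polarity."
        if has_negation else ""
    )
    return f"{preamble}{negation_note}\nClaim: {claim}\nConfidence: {confidence:.0%}"
```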

Result Presentation

Results are presented in the Streamlit UI with multiple components:

  1. Verdict Display (see the sketch after this list):

    • Color-coded verdict (green for true, red for false, gray for uncertain)
    • Confidence percentage
    • Explanation text
  2. Evidence Presentation:

    • Tabbed interface for different evidence views with URLs if available
    • Supporting and contradicting evidence tabs
    • Source distribution summary
  3. Input Guidance:

    • Tips for claim formatting
    • Guidance for time-sensitive claims
    • Suggestions for verb tense based on claim age
  4. Processing Insights:

    • Processing time
    • AI reasoning steps
    • Source distribution statistics
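
A minimal sketch of the color-coded verdict display referenced in item 1 above, using standard Streamlit status elements:

```python
# Sketch of the verdict display; assumes verdict, confidence, and explanation
# have already been computed upstream.
import streamlit as st

def render_verdict(verdict: str, confidence: float, explanation: str) -> None:
    if verdict.startswith("True"):
        st.success(f"{verdict} (confidence {confidence:.0%})")  # green
    elif verdict.startswith("False"):
        st.error(f"{verdict} (confidence {confidence:.0%})")    # red
    else:
        st.info(f"{verdict} (confidence {confidence:.0%})")     # neutral styling for uncertain verdicts
    st.write(explanation)
```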

Data Persistence and Privacy

AskVeracity prioritizes user privacy:

  1. No Data Storage:

    • User claims are not stored persistently
    • Results are maintained only in session state
    • No user data is collected or retained
  2. Session Management (see the sketch after this list):

    • Session state in Streamlit manages current user interaction
    • Session is cleared when starting a new verification
  3. API Interaction:

    • External API calls are subject to the respective providers' privacy policies
    • OpenAI API usage follows OpenAI's data handling practices
  4. Caching:

    • Model caching for performance
    • Resource cleanup on application termination
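
A minimal sketch of clearing session state when a new verification starts; the session keys are hypothetical:

```python
# Sketch of session cleanup between verifications; the keys are hypothetical,
# not the actual session schema.
import streamlit as st

def start_new_verification() -> None:
    for key in ("claim", "evidence", "verdict", "explanation"):
        if key in st.session_state:
            del st.session_state[key]
```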

Performance Tracking

The system includes a performance tracking utility (utils/performance.py):

  1. Metrics Tracked:

    • Claims processed count
    • Evidence retrieval success rates
    • Processing times
    • Confidence scores
    • Source types used
    • Temporal relevance
  2. Usage:

    • Performance metrics are logged during processing
    • Summary of select metrics available in the final result
    • Used for system optimization
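
A minimal sketch of such a tracker; the attribute and method names are assumptions rather than the actual utils/performance.py interface:

```python
# Sketch of a performance tracking helper; names are assumptions, not the
# actual utils/performance.py interface.
import time

class PerformanceTracker:
    def __init__(self):
        self.claims_processed = 0
        self.processing_times = []
        self.confidence_scores = []
        self.source_counts = {}

    def log_claim(self, start_time: float, confidence: float, sources: list) -> None:
        self.claims_processed += 1
        self.processing_times.append(time.time() - start_time)
        self.confidence_scores.append(confidence)
        for source in sources:
            self.source_counts[source] = self.source_counts.get(source, 0) + 1

    def summary(self) -> dict:
        n = max(self.claims_processed, 1)
        return {
            "claims_processed": self.claims_processed,
            "avg_processing_time": sum(self.processing_times) / n,
            "avg_confidence": sum(self.confidence_scores) / n,
            "source_distribution": self.source_counts,
        }
```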

Performance Evaluation

The system includes a performance evaluation script (evaluate_performance.py):

  1. Test Claims:

    • Predefined set of test claims with known ground truth labels
    • Claims categorized as "True", "False", or "Uncertain"
  2. Metrics (see the sketch after this list):

    • Overall accuracy: Percentage of claims correctly classified according to ground truth
    • Safety rate: Percentage of claims either correctly classified or safely categorized as "Uncertain" rather than making an incorrect assertion
    • Per-class accuracy and safety rates
    • Average processing time
    • Average confidence score
    • Classification distributions
  3. Visualization:

    • Charts for accuracy by classification type
    • Charts for safety rate by classification type
    • Processing time by classification type
    • Confidence scores by classification type
  4. Results Storage:

    • Detailed results saved to JSON file
    • Visualization charts saved as PNG files
    • All results stored in the results/ directory
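
A minimal sketch of the accuracy and safety-rate calculation described in item 2 above; evaluate_performance.py may compute these differently:

```python
# Sketch of overall accuracy and safety rate; the results schema is assumed.
def evaluate(results: list) -> dict:
    """results: list of {'expected': label, 'predicted': label} dicts,
    where labels are 'True', 'False', or 'Uncertain'."""
    total = len(results)
    correct = sum(1 for r in results if r["predicted"] == r["expected"])
    # "Safe" means correctly classified, or classified as Uncertain
    # rather than making an incorrect assertion.
    safe = sum(
        1 for r in results
        if r["predicted"] == r["expected"] or r["predicted"] == "Uncertain"
    )
    return {
        "accuracy": correct / total if total else 0.0,
        "safety_rate": safe / total if total else 0.0,
    }
```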

Error Handling and Resilience

The system implements robust error handling:

  1. API Error Handling (utils/api_utils.py):

    • Decorator-based error handling (see the sketch after this list)
    • Exponential backoff for retries
    • Rate limiting that respects API constraints
  2. Safe JSON Parsing:

    • Defensive parsing of API responses
    • Fallback mechanisms for invalid responses
  3. Graceful Degradation:

    • Multiple fallback strategies
    • Core functionality preservation even when some sources fail
  4. Fallback Mechanisms:

    • Fallback for truth classification when the classifier is not invoked
    • Fallback for explanation generation when the explanation generator is not invoked
    • Ensures complete results even with partial component failures
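
A minimal sketch of a decorator-based retry with exponential backoff, as referenced in item 1 above; the decorator name, parameters, and the decorated function are illustrative, not the actual utils/api_utils.py API:

```python
# Sketch of decorator-based retries with exponential backoff; names and
# parameters are illustrative, not the actual utils/api_utils.py API.
import functools
import time

def with_retries(max_attempts: int = 3, base_delay: float = 1.0):
    """Retry the wrapped call on exception, doubling the delay each attempt."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # give up after the final attempt
                    time.sleep(delay)
                    delay *= 2  # exponential backoff
        return wrapper
    return decorator

@with_retries(max_attempts=3)
def fetch_news(query: str) -> list:
    ...  # hypothetical API call protected by the decorator
```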