metadata

license: cc-by-4.0
tags:
  - text
  - news
  - global
  - knowledge-graph
  - geopolitics
dataset_info:
  features:
    - name: GKGRECORDID
      dtype: string
    - name: DATE
      dtype: string
    - name: SourceCollectionIdentifier
      dtype: string
    - name: SourceCommonName
      dtype: string
    - name: DocumentIdentifier
      dtype: string
    - name: V1Counts
      dtype: string
    - name: V2.1Counts
      dtype: string
    - name: V1Themes
      dtype: string
    - name: V2EnhancedThemes
      dtype: string
    - name: V1Locations
      dtype: string
    - name: V2EnhancedLocations
      dtype: string
    - name: V1Persons
      dtype: string
    - name: V2EnhancedPersons
      dtype: string
    - name: V1Organizations
      dtype: string
    - name: V2EnhancedOrganizations
      dtype: string
    - name: V1.5Tone
      dtype: string
    - name: V2.1EnhancedDates
      dtype: string
    - name: V2GCAM
      dtype: string
    - name: V2.1SharingImage
      dtype: string
    - name: V2.1Quotations
      dtype: string
    - name: V2.1AllNames
      dtype: string
    - name: V2.1Amounts
      dtype: string

Dataset Card for dwb2023/gdelt-gkg-2025-v2

Dataset Details

Dataset Description

This dataset contains GDELT Global Knowledge Graph (GKG) data covering February 2025. It captures global event interactions, actor relationships, and contextual narratives to support temporal, spatial, and thematic analysis.

Curated by: dwb2023

Dataset Sources

Repository: http://data.gdeltproject.org/gdeltv2
GKG Documentation: GDELT 2.0 Overview, GDELT GKG Codebook

Uses

Direct Use

This dataset is suitable for:

Temporal analysis of global events
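
As a minimal sketch of the temporal analysis mentioned above, the snippet below counts records per calendar day. It assumes the DATE column holds strings in the GKG `YYYYMMDDHHMMSS` form; the helper names and the sample values are illustrative and not taken from the dataset.

```python
from collections import Counter
from datetime import datetime


def parse_gkg_date(value: str) -> datetime:
    """Parse a DATE string, assumed to be in YYYYMMDDHHMMSS form."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


def records_per_day(dates):
    """Count records per calendar day, keyed by ISO date string."""
    return Counter(parse_gkg_date(d).date().isoformat() for d in dates)


# Hypothetical sample values, not real rows from the dataset.
sample_dates = ["20250201000000", "20250201001500", "20250202120000"]
daily = records_per_day(sample_dates)
```

The same counter can be fed from the DATE column of any loaded split to see how record volume varies across the month.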

Out-of-Scope Use

Not designed for real-time monitoring, because it is a static, historical snapshot
Not intended for medical diagnosis or predictive health modeling

Dataset Structure

Features and Relationships

This dataset focuses on a subset of features from the source GDELT dataset.

Name	Type	Aspect	Description
DATE	string	Metadata	Publication date of the article/document
SourceCollectionIdentifier	string	Metadata	Numeric code for the type of source collection (for example, 1 = web)
SourceCommonName	string	Metadata	Common/display name of the source
DocumentIdentifier	string	Metadata	Unique URL/identifier of the document
V1Counts	string	Metrics	Counts of people or objects mentioned in the text (for example, number of protesters)
V2.1Counts	string	Metrics	Same counts as V1Counts, with approximate character offsets in the document
V1Themes	string	Classification	Original thematic categorization
V2EnhancedThemes	string	Classification	Themes with approximate character offsets in the document
V1Locations	string	Entities	Original geographic mentions
V2EnhancedLocations	string	Entities	Enhanced location extraction with coordinates
V1Persons	string	Entities	Original person name mentions
V2EnhancedPersons	string	Entities	Enhanced person name extraction
V1Organizations	string	Entities	Original organization mentions
V2EnhancedOrganizations	string	Entities	Enhanced organization name extraction
V1.5Tone	string	Sentiment	Comma-delimited tone scores (average tone, positive and negative scores, polarity, and related measures)
V2.1EnhancedDates	string	Temporal	Temporal reference extraction
V2GCAM	string	Sentiment	Global Content Analysis Measures
V2.1SharingImage	string	Content	URL of document image
V2.1Quotations	string	Content	Direct quote extraction
V2.1AllNames	string	Entities	Comprehensive named entity extraction
V2.1Amounts	string	Metrics	Quantity and measurement extraction
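
Because every feature is stored as a string, multi-valued fields need to be split before analysis. The sketch below assumes the GKG conventions of semicolon-delimited theme lists and a comma-delimited tone field with seven values; the field labels in `TONE_FIELDS`, the helper names, and the sample strings are illustrative assumptions.

```python
# Labels assumed for the seven comma-delimited values in V1.5Tone.
TONE_FIELDS = (
    "tone",
    "positive_score",
    "negative_score",
    "polarity",
    "activity_reference_density",
    "self_group_reference_density",
    "word_count",
)


def split_semicolon_list(value):
    """Split a semicolon-delimited field such as V1Themes, dropping empty items."""
    if not value:
        return []
    return [item for item in value.split(";") if item]


def parse_tone(value):
    """Map the comma-delimited V1.5Tone string to named float values."""
    if not value:
        return {}
    numbers = [float(part) for part in value.split(",")]
    return dict(zip(TONE_FIELDS, numbers))


# Hypothetical sample strings, not real rows from the dataset.
themes = split_semicolon_list("TAX_FNCACT;EPU_POLICY;")
tone = parse_tone("-2.5,1.5,4.0,5.5,20.0,0.5,400")
```

The enhanced (V2) fields carry extra details such as character offsets inside each item, so they would need a second split per item.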

Aspects Overview:

Metadata: Core document information
Metrics: Numerical measurements and counts
Classification: Categorical and thematic analysis
Entities: Named entity recognition (locations, persons, organizations)
Sentiment: Emotional and tone analysis
Temporal: Time-related information
Content: Direct content extraction
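
The aspect grouping above can be kept as a small lookup so that a workflow selects columns by aspect rather than by name. This is only a sketch: the mapping copies the Name and Aspect columns of the feature table, and `columns_for` is a hypothetical helper.

```python
# Column -> aspect, copied from the feature table in this card.
COLUMN_ASPECTS = {
    "DATE": "Metadata",
    "SourceCollectionIdentifier": "Metadata",
    "SourceCommonName": "Metadata",
    "DocumentIdentifier": "Metadata",
    "V1Counts": "Metrics",
    "V2.1Counts": "Metrics",
    "V1Themes": "Classification",
    "V2EnhancedThemes": "Classification",
    "V1Locations": "Entities",
    "V2EnhancedLocations": "Entities",
    "V1Persons": "Entities",
    "V2EnhancedPersons": "Entities",
    "V1Organizations": "Entities",
    "V2EnhancedOrganizations": "Entities",
    "V1.5Tone": "Sentiment",
    "V2.1EnhancedDates": "Temporal",
    "V2GCAM": "Sentiment",
    "V2.1SharingImage": "Content",
    "V2.1Quotations": "Content",
    "V2.1AllNames": "Entities",
    "V2.1Amounts": "Metrics",
}


def columns_for(aspect):
    """Return the column names tagged with the given aspect, in table order."""
    return [name for name, tag in COLUMN_ASPECTS.items() if tag == aspect]


sentiment_columns = columns_for("Sentiment")
```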

Dataset Creation

Curation Rationale

This dataset was curated to capture the rapidly evolving global narrative during February 2025. Focusing on this single month offers a granular view of how geopolitical events, actor relationships, and thematic discussions shifted over the period. The enhanced GKG features further enable entity, sentiment, and thematic analysis, making the dataset a resource for studying the socio-political and economic impacts of emergent LLM capabilities.

Curation Approach

A targeted subset of GDELT’s columns was selected to streamline analysis of key entities (locations, persons, organizations), thematic tags, and sentiment scores, which are core components of many knowledge-graph and text analytics workflows. This approach balances comprehensive coverage with manageable data size and performance. The ETL pipeline that produced the dataset is documented here: https://gist.github.com/donbr/5293468436a1a39bd2d9f4959cbd4923.
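
Column selection of this kind can be sketched as follows. This is not the pipeline from the gist, only an illustration under stated assumptions: the input is headerless, tab-delimited text, the caller supplies the raw column order, and `select_columns`, `HypotheticalExtra`, and the sample row are made up for the example.

```python
import csv
import io


def select_columns(tsv_text, raw_columns, keep):
    """Yield dicts holding only the `keep` columns of headerless tab-delimited text."""
    # QUOTE_NONE treats quote characters as ordinary data.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    indices = [raw_columns.index(name) for name in keep]
    for row in reader:
        yield {name: row[i] for name, i in zip(keep, indices)}


# Toy four-column layout; the real raw files have many more columns.
raw_columns = ["GKGRECORDID", "DATE", "HypotheticalExtra", "SourceCommonName"]
keep = ["GKGRECORDID", "DATE", "SourceCommonName"]
sample = "rec-1\t20250201000000\tdropped\texample.com\n"

rows = list(select_columns(sample, raw_columns, keep))
```

Passing the column order in explicitly keeps the sketch independent of any particular raw file layout.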

Citation

When using this dataset, please cite both the dataset and the original GDELT project:

@misc{gdelt-gkg-2025-v2,
    title = {GDELT Global Knowledge Graph 2025 Dataset},
    author = {dwb2023},
    year = {2025},
    publisher = {Hugging Face},
    url = {https://huggingface.co/datasets/dwb2023/gdelt-gkg-2025-v2}
}

Dataset Card Contact

For questions and comments about this dataset card, please contact dwb2023 through the Hugging Face platform.