---
license: cc-by-4.0
tags:
- text
- news
- global
- knowledge-graph
- geopolitics
dataset_info:
  features:
  - name: GKGRECORDID
    dtype: string
  - name: DATE
    dtype: string
  - name: SourceCollectionIdentifier
    dtype: string
  - name: SourceCommonName
    dtype: string
  - name: DocumentIdentifier
    dtype: string
  - name: V1Counts
    dtype: string
  - name: V2.1Counts
    dtype: string
  - name: V1Themes
    dtype: string
  - name: V2EnhancedThemes
    dtype: string
  - name: V1Locations
    dtype: string
  - name: V2EnhancedLocations
    dtype: string
  - name: V1Persons
    dtype: string
  - name: V2EnhancedPersons
    dtype: string
  - name: V1Organizations
    dtype: string
  - name: V2EnhancedOrganizations
    dtype: string
  - name: V1.5Tone
    dtype: string
  - name: V2GCAM
    dtype: string
  - name: V2.1EnhancedDates
    dtype: string
  - name: V2.1Quotations
    dtype: string
  - name: V2.1AllNames
    dtype: string
  - name: V2.1Amounts
    dtype: string
  - name: tone
    dtype: float64
  splits:
  - name: train
    num_bytes: 3331097194
    num_examples: 281215
  - name: negative_tone
    num_bytes: 3331097194
    num_examples: 281215
  download_size: 2229048020
  dataset_size: 6662194388
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: negative_tone
    path: data/negative_tone-*
---
# Dataset Card for dwb2023/gdelt-gkg-march2020-v2
## Dataset Details
### Dataset Description
This dataset contains GDELT Global Knowledge Graph (GKG) data covering March 10-22, 2020, during the early phase of the COVID-19 pandemic. It captures global event interactions, actor relationships, and contextual narratives to support temporal, spatial, and thematic analysis.
- **Curated by:** dwb2023
### Dataset Sources
- **Repository:** [http://data.gdeltproject.org/gdeltv2](http://data.gdeltproject.org/gdeltv2)
- **GKG Documentation:** [GDELT 2.0 Overview](https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/), [GDELT GKG Codebook](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf)
## Uses
### Direct Use
This dataset is suitable for:
- Temporal analysis of global events
- Relationship mapping of key actors in supply chain and logistics
- Sentiment and thematic analysis of COVID-19 pandemic narratives
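As a starting point for temporal analysis, the sketch below buckets records by calendar day. It assumes the `DATE` column keeps the GKG `YYYYMMDDHHMMSS` string layout; the card does not state whether the ETL pipeline changed it, so check a few rows first. The helper names and sample values are illustrative and not part of the dataset.

```python
from collections import Counter
from datetime import datetime


def parse_gkg_date(value: str) -> datetime:
    """Parse a GKG timestamp string, assumed to be YYYYMMDDHHMMSS."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


def daily_counts(dates):
    """Count records per calendar day, keyed by ISO date string."""
    return Counter(parse_gkg_date(d).date().isoformat() for d in dates)


# Hypothetical DATE values standing in for rows of the dataset.
sample_dates = ["20200310001500", "20200310143000", "20200322234500"]
counts = daily_counts(sample_dates)
print(sorted(counts.items()))
```

With the real data, the same function could be applied to the `DATE` column after loading the `train` split with the `datasets` library.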
### Out-of-Scope Use
- Not designed for real-time monitoring, because it is a static, historical snapshot
- Not intended for medical diagnosis or predictive health modeling
## Dataset Structure
### Features and Relationships
- This dataset retains a subset of the fields from the source GDELT GKG records, described in the table below.
| Name | Type | Aspect | Description |
|------|------|---------|-------------|
| DATE | string | Metadata | Publication date of the article/document |
| SourceCollectionIdentifier | string | Metadata | Numeric code identifying the source collection the document came from (e.g., 1 = web) |
| SourceCommonName | string | Metadata | Common/display name of the source |
| DocumentIdentifier | string | Metadata | Unique URL/identifier of the document |
| V1Counts | string | Metrics | Original extraction of counts mentioned in the document (e.g., number of deaths or protesters) |
| V2.1Counts | string | Metrics | Enhanced numeric pattern extraction |
| V1Themes | string | Classification | Original thematic categorization |
| V2EnhancedThemes | string | Classification | Expanded theme taxonomy and classification |
| V1Locations | string | Entities | Original geographic mentions |
| V2EnhancedLocations | string | Entities | Enhanced location extraction with coordinates |
| V1Persons | string | Entities | Original person name mentions |
| V2EnhancedPersons | string | Entities | Enhanced person name extraction |
| V1Organizations | string | Entities | Original organization mentions |
| V2EnhancedOrganizations | string | Entities | Enhanced organization name extraction |
| V1.5Tone | string | Sentiment | Comma-delimited tone scores (average tone, positive score, negative score, polarity, and related measures) |
| V2GCAM | string | Sentiment | Global Content Analysis Measures |
| V2.1EnhancedDates | string | Temporal | Temporal reference extraction |
| V2.1Quotations | string | Content | Direct quote extraction |
| V2.1AllNames | string | Entities | Comprehensive named entity extraction |
| V2.1Amounts | string | Metrics | Quantity and measurement extraction |
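Most of these columns are delimited strings rather than structured values, so they need parsing before analysis. The sketch below splits a `V1.5Tone` value into named fields, following the field order given in the GKG 2.1 codebook. The field names in `TONE_FIELDS` and the sample string are illustrative choices, not identifiers defined by the dataset. The card does not say how the separate `tone` float column was derived; it may correspond to the first of these values.

```python
# Field order as described in the GKG 2.1 codebook; names are our own labels.
TONE_FIELDS = (
    "tone",
    "positive_score",
    "negative_score",
    "polarity",
    "activity_reference_density",
    "self_group_reference_density",
    "word_count",
)


def parse_tone(raw: str) -> dict:
    """Split a comma-delimited V1.5Tone string into a dict of floats."""
    parts = [p.strip() for p in raw.split(",")]
    # zip() stops at the shorter sequence, so short or empty input is tolerated.
    return {name: float(v) for name, v in zip(TONE_FIELDS, parts) if v != ""}


# Hypothetical value, not taken from the dataset.
sample_tone = "-2.5,1.5,4.0,5.5,20.0,0.5,400"
parsed = parse_tone(sample_tone)
print(parsed["tone"], parsed["polarity"])
```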
### Aspects Overview
- **Metadata**: Core document information
- **Metrics**: Numerical measurements and counts
- **Classification**: Categorical and thematic analysis
- **Entities**: Named entity recognition (locations, persons, organizations)
- **Sentiment**: Emotional and tone analysis
- **Temporal**: Time-related information
- **Content**: Direct content extraction
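The Classification columns can be unpacked in a similar way. The sketch below assumes `V2EnhancedThemes` follows the GKG 2.1 codebook layout: semicolon-delimited blocks, each holding a theme code and a character offset separated by a comma. The function name and the sample string are illustrative only.

```python
from typing import List, Tuple


def parse_enhanced_themes(raw: str) -> List[Tuple[str, int]]:
    """Turn 'THEME,offset;THEME,offset;' into a list of (theme, offset) pairs."""
    themes = []
    for block in raw.split(";"):
        if not block:
            continue  # skip the empty piece left by a trailing semicolon
        name, _, offset = block.rpartition(",")
        if not name or not offset.isdigit():
            continue  # skip malformed blocks rather than raising
        themes.append((name, int(offset)))
    return themes


# Hypothetical value, not taken from the dataset.
sample_themes = "HEALTH_PANDEMIC,120;ECON_STOCKMARKET,455;"
pairs = parse_enhanced_themes(sample_themes)
print(pairs)
```

Skipping malformed blocks instead of raising is a design choice for exploratory work; a stricter pipeline might prefer to log or reject such rows.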
## Dataset Creation
### Curation Rationale
This dataset was curated to capture the rapidly evolving global narrative during the early phase of the COVID-19 pandemic, focusing specifically on March 10–22, 2020. By zeroing in on this critical period, it offers a granular perspective on how geopolitical events, actor relationships, and thematic discussions shifted amid the escalating pandemic. The enhanced GKG features further enable advanced entity, sentiment, and thematic analysis, making it a valuable resource for studying the socio-political and economic impacts of COVID-19 during a pivotal point in global history.
### Curation Approach
A targeted subset of GDELT’s columns was selected to streamline analysis of key entities (locations, persons, organizations), thematic tags, and sentiment scores, which are core components of many knowledge-graph and text analytics workflows. This approach balances comprehensive coverage with manageable data size and performance. The ETL pipeline used to produce these transformations is documented here:
[https://gist.github.com/donbr/e2af2bbe441f90b8664539a25957a6c0](https://gist.github.com/donbr/e2af2bbe441f90b8664539a25957a6c0).
## Citation
When using this dataset, please cite both this dataset and the original GDELT project:
```bibtex
@misc{gdelt-gkg-march2020,
  title = {GDELT Global Knowledge Graph March 2020 Dataset},
  author = {dwb2023},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/dwb2023/gdelt-gkg-march2020-v2}
}
```
## Dataset Card Contact
For questions and comments about this dataset card, please contact dwb2023 through the Hugging Face platform.