File size: 6,314 Bytes
3bb5fb5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
---
license: cc-by-4.0
tags:
- text
- news
- global
- knowledge-graph
- geopolitics
dataset_info:
  features:
  - name: GKGRECORDID
    dtype: string
  - name: DATE
    dtype: string
  - name: SourceCollectionIdentifier
    dtype: string
  - name: SourceCommonName
    dtype: string
  - name: DocumentIdentifier
    dtype: string
  - name: V1Counts
    dtype: string
  - name: V2.1Counts
    dtype: string
  - name: V1Themes
    dtype: string
  - name: V2EnhancedThemes
    dtype: string
  - name: V1Locations
    dtype: string
  - name: V2EnhancedLocations
    dtype: string
  - name: V1Persons
    dtype: string
  - name: V2EnhancedPersons
    dtype: string
  - name: V1Organizations
    dtype: string
  - name: V2EnhancedOrganizations
    dtype: string
  - name: V1.5Tone
    dtype: string
  - name: V2GCAM
    dtype: string
  - name: V2.1EnhancedDates
    dtype: string
  - name: V2.1Quotations
    dtype: string
  - name: V2.1AllNames
    dtype: string
  - name: V2.1Amounts
    dtype: string
  - name: tone
    dtype: float64
  splits:
  - name: train
    num_bytes: 3331097194
    num_examples: 281215
  - name: negative_tone
    num_bytes: 3331097194
    num_examples: 281215
  download_size: 2229048020
  dataset_size: 6662194388
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: negative_tone
    path: data/negative_tone-*
---

# Dataset Card for dwb2023/gdelt-gkg-march2020-v2

## Dataset Details

### Dataset Description

This dataset contains GDELT Global Knowledge Graph (GKG) data covering March 10-22, 2020, during the early phase of the COVID-19 pandemic. It captures global event interactions, actor relationships, and contextual narratives to support temporal, spatial, and thematic analysis.

- **Curated by:** dwb2023

### Dataset Sources

- **Repository:** [http://data.gdeltproject.org/gdeltv2](http://data.gdeltproject.org/gdeltv2)
- **GKG Documentation:** [GDELT 2.0 Overview](https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/), [GDELT GKG Codebook](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf)

## Uses

### Direct Use

This dataset is suitable for:

- Temporal analysis of global events
- Relationship mapping of key actors in supply chain and logistics
- Sentiment and thematic analysis of COVID-19 pandemic narratives

### Out-of-Scope Use

- Not designed for real-time monitoring due to its historic and static nature
- Not intended for medical diagnosis or predictive health modeling

## Dataset Structure

### Features and Relationships

- this dataset focuses on a subset of features from the source GDELT dataset.

| Name | Type | Aspect | Description |
|------|------|---------|-------------|
| DATE | string | Metadata | Publication date of the article/document |
| SourceCollectionIdentifier | string | Metadata | Unique identifier for the source collection |
| SourceCommonName | string | Metadata | Common/display name of the source |
| DocumentIdentifier | string | Metadata | Unique URL/identifier of the document |
| V1Counts | string | Metrics | Original count mentions of numeric values |
| V2.1Counts | string | Metrics | Enhanced numeric pattern extraction |
| V1Themes | string | Classification | Original thematic categorization |
| V2EnhancedThemes | string | Classification | Expanded theme taxonomy and classification |
| V1Locations | string | Entities | Original geographic mentions |
| V2EnhancedLocations | string | Entities | Enhanced location extraction with coordinates |
| V1Persons | string | Entities | Original person name mentions |
| V2EnhancedPersons | string | Entities | Enhanced person name extraction |
| V1Organizations | string | Entities | Original organization mentions |
| V2EnhancedOrganizations | string | Entities | Enhanced organization name extraction |
| V1.5Tone | string | Sentiment | Original emotional tone scoring |
| V2GCAM | string | Sentiment | Global Content Analysis Measures |
| V2.1EnhancedDates | string | Temporal | Temporal reference extraction |
| V2.1Quotations | string | Content | Direct quote extraction |
| V2.1AllNames | string | Entities | Comprehensive named entity extraction |
| V2.1Amounts | string | Metrics | Quantity and measurement extraction |

### Aspects Overview:
- **Metadata**: Core document information
- **Metrics**: Numerical measurements and counts
- **Classification**: Categorical and thematic analysis
- **Entities**: Named entity recognition (locations, persons, organizations)
- **Sentiment**: Emotional and tone analysis
- **Temporal**: Time-related information
- **Content**: Direct content extraction

## Dataset Creation

### Curation Rationale
This dataset was curated to capture the rapidly evolving global narrative during the early phase of the COVID-19 pandemic, focusing specifically on March 10–22, 2020. By zeroing in on this critical period, it offers a granular perspective on how geopolitical events, actor relationships, and thematic discussions shifted amid the escalating pandemic. The enhanced GKG features further enable advanced entity, sentiment, and thematic analysis, making it a valuable resource for studying the socio-political and economic impacts of COVID-19 during a pivotal point in global history.

### Curation Approach
A targeted subset of GDELT’s columns was selected to streamline analysis on key entities (locations, persons, organizations), thematic tags, and sentiment scores—core components of many knowledge-graph and text analytics workflows. This approach balances comprehensive coverage with manageable data size and performance. The ETL pipeline used to produce these transformations is documented here:
[https://gist.github.com/donbr/e2af2bbe441f90b8664539a25957a6c0](https://gist.github.com/donbr/e2af2bbe441f90b8664539a25957a6c0).

## Citation

When using this dataset, please cite both the dataset and original GDELT project:

```bibtex
@misc{gdelt-gkg-march2020,
    title = {GDELT Global Knowledge Graph March 2020 Dataset},
    author = {dwb2023},
    year = {2025},
    publisher = {Hugging Face},
    url = {https://huggingface.co/datasets/dwb2023/gdelt-gkg-march2020-v2}
}
```

## Dataset Card Contact

For questions and comments about this dataset card, please contact dwb2023 through the Hugging Face platform.