Commit · 3bb5fb5
Parent(s):
Initial commit for Hugging Face Spaces

Files changed:
- .gitignore +6 -0
- .python-version +1 -0
- README.md +61 -0
- app.py +24 -0
- data_access.py +118 -0
- graph_builder.py +220 -0
- graph_config.py +60 -0
- pages/1_🗺️_COVID_Navigator.py +174 -0
- pages/2_🔍_COVID_Event_Graph.py +189 -0
- pages/3_🌐_COVID_Network_Analysis.py +349 -0
- pages/4_🗺️_Feb_2025_Navigator.py +185 -0
- pages/5_🔍_Feb_2025_Event_Graph.py +198 -0
- pages/6_🧪_Feb_2025_Dataset_Explorer.py +250 -0
- requirements.txt +10 -0
- solution_component_notes/gdelt_gkg_duckdb_networkx_v5.ipynb +388 -0
- solution_component_notes/gdelt_prefect_extract_to_hf_ds.py +303 -0
- solution_component_notes/hf_gdelt_dataset_2020_covid.md +166 -0
- solution_component_notes/hf_gdelt_dataset_2025_february.md +149 -0
.gitignore
ADDED
@@ -0,0 +1,6 @@
+.venv/
+# add pycache and pyc files
+__pycache__/
+*.pyc
+# add lib directory
+lib/
.python-version
ADDED
@@ -0,0 +1 @@
+3.11
README.md
ADDED
@@ -0,0 +1,61 @@
+---
+title: >-
+  Unveiling Global Narratives Through Knowledge Graphs: A Case Study Using GDELT
+  and Streamlit
+emoji: 🔮
+colorFrom: indigo
+colorTo: blue
+sdk: streamlit
+sdk_version: 1.42.0
+app_file: app.py
+pinned: false
+license: cc-by-4.0
+short_description: using knowledge graphs for insight
+---
+
+**Title:** Unveiling Global Narratives Through Knowledge Graphs: A Case Study Using GDELT and Streamlit
+**Keywords:** GDELT, Knowledge Graphs, Network Analysis, Sentiment Analysis, Prefect, Hugging Face datasets, DuckDB, Streamlit, Neo4j, NetworkX, st-link-analysis, streamlit-aggrid, pyvis, pandas
+
+## Abstract
+The global landscape is increasingly shaped by evolving narratives driven by interconnected events and entities. To better understand these dynamics, we introduce **GDELT Insight Explorer**, a knowledge graph-based platform built using Streamlit, DuckDB, and NetworkX. This paper presents a detailed case study on using the platform to analyze GDELT Global Knowledge Graph (GKG) data from March 2020. We focus on uncovering global narratives and relationships between actors and themes during the early phase of the COVID-19 pandemic. Our findings emphasize the utility of real-time event data visualization and network analysis in tracing narrative propagation and identifying key influencers in global events.
+
+## 1. Introduction
+Understanding global narratives requires tools that can capture the complexity of events, their associated entities, and evolving sentiment over time. Traditional tabular analysis methods are often insufficient for capturing these relationships at scale. Knowledge graphs offer a robust solution for modeling and visualizing the interconnected nature of real-world events. This paper documents the development and application of **GDELT Insight Explorer**, a platform designed to leverage GDELT data for interactive exploration and insight generation.
+
+## 2. Methodology
+
+### 2.1 Data Source and Processing
+The application is powered by the GDELT Global Knowledge Graph (GKG) dataset, focusing on data from March 10–22, 2020. The dataset includes key features such as themes, locations, persons, organizations, and sentiment scores. Our ETL pipeline, implemented using Prefect and DuckDB, extracts and transforms the data into a Parquet format for efficient querying and filtering.
+
+- **Data Filtering:** We prioritize events with a tone score below -6 to identify highly negative narratives.
+- **Data Storage:** DuckDB is used for in-memory querying, enabling real-time analysis of filtered datasets.
+- **Graph Construction:** NetworkX and Neo4j are employed for graph creation, with relationships categorized into entities such as persons, organizations, and locations.
+
+### 2.2 Platform Architecture
+The platform is built using Streamlit, with a modular architecture that supports multiple analysis modes:
+- **Event Navigator:** Provides a tabular overview of filtered events with interactive search and filtering.
+- **Event Graph Explorer:** Visualizes events and their associated entities in a graph format.
+- **Community Detection and Network Analysis:** Employs NetworkX to detect communities and analyze network metrics such as centrality and density.
+
+## 3. Findings
+
+### 3.1 Narrative Detection and Sentiment Analysis
+The negative tone filter helped identify early COVID-related narratives, revealing clusters of related events involving key global actors. By visualizing these relationships, we observed recurring themes of public health concerns and geopolitical tensions.
+
+### 3.2 Community Detection
+Using the Louvain method for community detection, we identified cohesive subgroups within the network. These communities often corresponded to specific geographic regions or thematic clusters, providing deeper insights into localized narratives.
+
+### 3.3 Real-Time Filtering and Exploration
+The integration of DuckDB allowed for seamless data filtering and exploration within the Streamlit interface. Users could drill down from high-level overviews to individual event records, facilitating rapid insight generation.
+
+## 4. Conclusion and Future Work
+The **GDELT Insight Explorer** demonstrates the potential of combining knowledge graphs and real-time data exploration for uncovering global narratives. Future work will focus on expanding the temporal range of the dataset, integrating additional data sources, and incorporating machine learning models for predictive analysis. The open-source nature of the platform encourages further development and adaptation across different domains.
+
+## References
+1. GDELT Project. (n.d.). [https://www.gdeltproject.org](https://www.gdeltproject.org)
+2. Newman, M. E. J. (2018). *Networks: An Introduction*. Oxford University Press.
+3. DuckDB. (n.d.). [https://duckdb.org](https://duckdb.org)
+4. Prefect. (n.d.). [https://www.prefect.io](https://www.prefect.io)
+
+## Appendix: Application Architecture and Code
+For implementation details, please refer to the open-source repository: [https://huggingface.co/spaces/dwb2023/insight](https://huggingface.co/spaces/dwb2023/insight).
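The filtering step described in §2.1 of the README reduces to a single DuckDB query over the published Parquet files. A minimal sketch, assuming only the Hugging Face dataset path that appears elsewhere in this commit; the selected columns, the -6 cutoff, and the row limit are illustrative rather than the exact pipeline query:

```python
import duckdb

# Minimal sketch of the tone-based filtering described in README §2.1.
# Dataset path is the one used in data_access.py; columns/threshold are examples.
con = duckdb.connect(database=":memory:")
df = con.execute("""
    SELECT GKGRECORDID, DATE, SourceCommonName, tone
    FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2@~parquet/default/negative_tone/*.parquet')
    WHERE tone <= -6
    LIMIT 20;
""").fetchdf()
print(df.head())
```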
app.py
ADDED
@@ -0,0 +1,24 @@
+import streamlit as st
+
+st.set_page_config(
+    page_title="GDELT Insight Explorer",
+    layout="wide",
+    page_icon="🔮"
+)
+
+st.title("GDELT Insight Explorer: Unveiling Global Event Narratives")
+st.markdown("""
+Welcome to the **GDELT Insight Explorer**, a multi-faceted platform that leverages knowledge graph techniques to analyze global events and trends.
+
+**How to Get Started:**
+- Use the sidebar to switch between different analysis modes.
+- Explore datasets, visualize event relationships, and analyze network structures.
+
+**Available Pages:**
+- **🗺️ COVID Navigator:** Dive into curated COVID-related event data.
+- **🔍 COVID Event Graph Explorer:** Inspect detailed event records and their interconnections.
+- **🌐 Global Network Analysis:** Visualize and analyze the global network of events.
+- **🗺️ Feb 2025 Navigator:** Investigate recent event data with advanced filtering.
+- **🔍 Feb 2025 Event Graph Explorer:** Inspect detailed event records and their interconnections.
+- **🧪 Feb 2025 Dataset Experimentation:** An experiment using the HF dataset directly to investigate impact on query behavior and performance.
+""")
data_access.py
ADDED
@@ -0,0 +1,118 @@
+"""
+Data access module for GDELT data retrieval and filtering
+"""
+import duckdb
+import pandas as pd
+
+def get_gdelt_data(
+    limit=10,
+    tone_threshold=-7.0,
+    start_date=None,
+    end_date=None,
+    source_filter=None,
+    themes_filter=None,
+    persons_filter=None,
+    organizations_filter=None,
+    locations_filter=None
+):
+    """Get filtered GDELT data from DuckDB with dynamic query parameters."""
+    con = duckdb.connect(database=':memory:')
+
+    # Create view of the dataset
+    con.execute("""
+        CREATE VIEW negative_tone AS (
+            SELECT *
+            FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2@~parquet/default/negative_tone/*.parquet')
+        );
+    """)
+
+    # Base query components
+    base_conditions = [
+        "SourceCollectionIdentifier IS NOT NULL",
+        "DATE IS NOT NULL",
+        "SourceCommonName IS NOT NULL",
+        "DocumentIdentifier IS NOT NULL",
+        "V1Counts IS NOT NULL",
+        "V1Themes IS NOT NULL",
+        "V1Locations IS NOT NULL",
+        "V1Persons IS NOT NULL",
+        "V1Organizations IS NOT NULL",
+        "V2GCAM IS NOT NULL",
+        "\"V2.1Quotations\" IS NOT NULL",
+        "tone <= ?"
+    ]
+    params = [tone_threshold]
+    extra_conditions = []
+
+    # Add optional filters
+    if start_date:
+        extra_conditions.append("DATE >= ?")
+        params.append(start_date)
+    if end_date:
+        extra_conditions.append("DATE <= ?")
+        params.append(end_date)
+    if source_filter:
+        extra_conditions.append("SourceCommonName ILIKE ?")
+        params.append(f"%{source_filter}%")
+    if themes_filter:
+        extra_conditions.append("(V1Themes ILIKE ? OR V2EnhancedThemes ILIKE ?)")
+        params.extend([f"%{themes_filter}%", f"%{themes_filter}%"])
+    if persons_filter:
+        extra_conditions.append("(V1Persons ILIKE ? OR V2EnhancedPersons ILIKE ?)")
+        params.extend([f"%{persons_filter}%", f"%{persons_filter}%"])
+    if organizations_filter:
+        extra_conditions.append("(V1Organizations ILIKE ? OR V2EnhancedOrganizations ILIKE ?)")
+        params.extend([f"%{organizations_filter}%", f"%{organizations_filter}%"])
+    if locations_filter:
+        extra_conditions.append("(V1Locations ILIKE ? OR V2EnhancedLocations ILIKE ?)")
+        params.extend([f"%{locations_filter}%", f"%{locations_filter}%"])
+
+    # Combine all conditions
+    all_conditions = base_conditions + extra_conditions
+    where_clause = " AND ".join(all_conditions) if all_conditions else "1=1"
+
+    # Build final query
+    query = f"""
+        SELECT *
+        FROM negative_tone
+        WHERE {where_clause}
+        LIMIT ?;
+    """
+    params.append(limit)
+
+    # Execute query with parameters
+    results_df = con.execute(query, params).fetchdf()
+    con.close()
+
+    return results_df
+
+def filter_dataframe(df, source_filter=None, date_filter=None, tone_min=None, tone_max=None):
+    """Filter dataframe based on provided criteria"""
+    display_df = df[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone']].copy()
+    display_df.columns = ['ID', 'Date', 'Source', 'Tone']
+
+    if source_filter:
+        display_df = display_df[display_df['Source'].str.contains(source_filter, case=False, na=False)]
+    if date_filter:
+        display_df = display_df[display_df['Date'].str.contains(date_filter, na=False)]
+    if tone_min is not None and tone_max is not None:
+        display_df = display_df[
+            (display_df['Tone'] >= tone_min) &
+            (display_df['Tone'] <= tone_max)
+        ]
+
+    return display_df
+
+# Constants for raw data categories
+GDELT_CATEGORIES = {
+    "Metadata": ["GKGRECORDID", "DATE", "SourceCommonName", "DocumentIdentifier", "V2.1Quotations", "tone"],
+    "Persons": ["V2EnhancedPersons", "V1Persons"],
+    "Organizations": ["V2EnhancedOrganizations", "V1Organizations"],
+    "Locations": ["V2EnhancedLocations", "V1Locations"],
+    "Themes": ["V2EnhancedThemes", "V1Themes"],
+    "Names": ["V2.1AllNames"],
+    "Counts": ["V2.1Counts", "V1Counts"],
+    "Amounts": ["V2.1Amounts"],
+    "V2GCAM": ["V2GCAM"],
+    "V2.1EnhancedDates": ["V2.1EnhancedDates"],
+}
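A minimal usage sketch for the module above: `get_gdelt_data` builds and runs the parameterized DuckDB query, and `filter_dataframe` trims the result to the four display columns. The argument values are arbitrary examples; date strings are passed straight into the `DATE >= ? / <= ?` comparisons.

```python
# Hypothetical usage of data_access.py; parameter values are examples only.
from data_access import get_gdelt_data, filter_dataframe

raw_df = get_gdelt_data(
    limit=25,
    tone_threshold=-7.0,
    start_date="20200314",   # compared directly against the DATE column
    end_date="20200315",
    themes_filter="HEALTH",
)

# Narrow the result to ID/Date/Source/Tone and a tone band.
display_df = filter_dataframe(raw_df, source_filter="bbc", tone_min=-10.0, tone_max=-7.0)
print(display_df.head())
```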
graph_builder.py
ADDED
@@ -0,0 +1,220 @@
+"""
+Graph builder module for converting GDELT data to graph formats
+"""
+import pandas as pd
+import networkx as nx
+import json
+import logging  # used by Neo4jBuilder (missing from the original imports)
+
+class GraphBuilder:
+    """Base class for building graph from GDELT data"""
+    def process_entities(self, row):
+        """Process entities from a row and return nodes and relationships"""
+        nodes = []
+        relationships = []
+        event_id = row["GKGRECORDID"]
+        event_date = row["DATE"]
+        event_source = row["SourceCommonName"]
+        event_document_id = row["DocumentIdentifier"]
+        # event_image = row["V2.1SharingImage"] if pd.notna(row["V2.1SharingImage"]) else ""
+        event_quotations = row["V2.1Quotations"] if pd.notna(row["V2.1Quotations"]) else ""
+        event_tone = float(row["tone"]) if pd.notna(row["tone"]) else 0.0
+
+        # Add event node
+        nodes.append({
+            "id": event_id,
+            "type": "event",
+            "properties": {
+                "date": event_date,
+                "source": event_source,
+                "document": event_document_id,
+                # "image": event_image,
+                "quotations": event_quotations,
+                "tone": event_tone
+            }
+        })
+
+        # Process each entity type
+        entity_mappings = {
+            "V2EnhancedPersons": ("Person", "MENTIONED_IN"),
+            "V2EnhancedOrganizations": ("Organization", "MENTIONED_IN"),
+            "V2EnhancedLocations": ("Location", "LOCATED_IN"),
+            "V2EnhancedThemes": ("Theme", "CATEGORIZED_AS"),
+            "V2.1AllNames": ("Name", "MENTIONED_IN"),
+            "V2.1Counts": ("Count", "MENTIONED_IN"),
+            "V2.1Amounts": ("Amount", "MENTIONED_IN"),
+        }
+
+        for field, (label, relationship) in entity_mappings.items():
+            if pd.notna(row[field]):
+                entities = [e.strip() for e in row[field].split(';') if e.strip()]
+                for entity in entities:
+                    nodes.append({
+                        "id": entity,
+                        "type": label.lower(),
+                        "properties": {"name": entity}
+                    })
+                    relationships.append({
+                        "from": entity,
+                        "to": event_id,
+                        "type": relationship,
+                        "properties": {"created_at": event_date}
+                    })
+
+        return nodes, relationships
+
+class NetworkXBuilder(GraphBuilder):
+    """Builder for NetworkX graphs"""
+    def build_graph(self, df):
+        G = nx.Graph()
+
+        for _, row in df.iterrows():
+            nodes, relationships = self.process_entities(row)
+
+            # Add nodes
+            for node in nodes:
+                G.add_node(node["id"],
+                           type=node["type"],
+                           **node["properties"])
+
+            # Add relationships
+            for rel in relationships:
+                G.add_edge(rel["from"],
+                           rel["to"],
+                           relationship=rel["type"],
+                           **rel["properties"])
+
+        return G
+
+class Neo4jBuilder(GraphBuilder):
+    def __init__(self, uri, user, password):
+        # Imported lazily so the Streamlit pages, which only use the NetworkX and
+        # st-link builders, do not require the neo4j driver to be installed.
+        from neo4j import GraphDatabase
+        self.driver = GraphDatabase.driver(uri, auth=(user, password))
+        self.logger = logging.getLogger(__name__)
+
+    def close(self):
+        self.driver.close()
+
+    def build_graph(self, df):
+        with self.driver.session() as session:
+            for _, row in df.iterrows():
+                nodes, relationships = self.process_entities(row)
+
+                # Create nodes and relationships in Neo4j
+                try:
+                    session.execute_write(self._create_graph_elements,
+                                          nodes, relationships)
+                except Exception as e:
+                    self.logger.error(f"Error processing row {row['GKGRECORDID']}: {str(e)}")
+
+    def _create_graph_elements(self, tx, nodes, relationships):
+        # Create nodes
+        for node in nodes:
+            query = f"""
+            MERGE (n:{node['type']} {{id: $id}})
+            SET n += $properties
+            """
+            tx.run(query, id=node["id"], properties=node["properties"])
+
+        # Create relationships
+        for rel in relationships:
+            query = f"""
+            MATCH (a {{id: $from_id}})
+            MATCH (b {{id: $to_id}})
+            MERGE (a)-[r:{rel['type']}]->(b)
+            SET r += $properties
+            """
+            tx.run(query,
+                   from_id=rel["from"],
+                   to_id=rel["to"],
+                   properties=rel["properties"])
+
+class StreamlitGraphBuilder:
+    """Adapted graph builder for Streamlit visualization"""
+    def __init__(self):
+        self.G = nx.Graph()
+
+    def process_row(self, row):
+        """Process a single row of data"""
+        event_id = row["GKGRECORDID"]
+        event_props = {
+            "type": "event",  # already in lowercase
+            "date": row["DATE"],
+            "source": row["SourceCommonName"],
+            "document": row["DocumentIdentifier"],
+            "tone": row["tone"],
+            # Store display name in its original format if needed.
+            "name": row["SourceCommonName"]
+        }
+
+        self.G.add_node(event_id, **event_props)
+
+        # Use lowercase node types for consistency in lookups.
+        entity_types = {
+            "V2EnhancedPersons": ("person", "MENTIONED_IN"),
+            "V2EnhancedOrganizations": ("organization", "MENTIONED_IN"),
+            "V2EnhancedLocations": ("location", "LOCATED_IN"),
+            "V2EnhancedThemes": ("theme", "CATEGORIZED_AS"),
+            "V2.1AllNames": ("name", "MENTIONED_IN"),
+            "V2.1Counts": ("count", "MENTIONED_IN"),
+            "V2.1Amounts": ("amount", "MENTIONED_IN"),
+        }
+
+        for col, (node_type, rel_type) in entity_types.items():
+            if pd.notna(row[col]):
+                # The actual display value (in its original casing) is preserved in the "name" attribute.
+                entities = [e.strip() for e in row[col].split(';') if e.strip()]
+                for entity in entities:
+                    self.G.add_node(entity, type=node_type, name=entity)
+                    self.G.add_edge(entity, event_id,
+                                    relationship=rel_type,
+                                    date=row["DATE"])
+
+class StLinkBuilder(GraphBuilder):
+    """Builder for st-link-analysis compatible graphs"""
+    def build_graph(self, df):
+        """Build graph in st-link-analysis format"""
+        all_nodes = []
+        all_edges = []
+        edge_counter = 0
+
+        # Track nodes we've already added to avoid duplicates
+        added_nodes = set()
+
+        for _, row in df.iterrows():
+            nodes, relationships = self.process_entities(row)
+
+            # Process nodes
+            for node in nodes:
+                if node["id"] not in added_nodes:
+                    stlink_node = {
+                        "data": {
+                            "id": str(node["id"]),
+                            "label": node["type"].upper(),
+                            **node["properties"]
+                        }
+                    }
+                    all_nodes.append(stlink_node)
+                    added_nodes.add(node["id"])
+
+            # Process relationships/edges
+            for rel in relationships:
+                edge_counter += 1
+                stlink_edge = {
+                    "data": {
+                        "id": f"e{edge_counter}",
+                        "source": str(rel["from"]),
+                        "target": str(rel["to"]),
+                        "label": rel["type"],
+                        **rel["properties"]
+                    }
+                }
+                all_edges.append(stlink_edge)
+
+        return {
+            "nodes": all_nodes,
+            "edges": all_edges
+        }
+
+    def write_json(self, graph_data, filename):
+        """Write graph to JSON file"""
+        with open(filename, 'w') as f:
+            json.dump(graph_data, f, indent=2)
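A sketch of how these builders compose downstream: fetch filtered rows, build a NetworkX graph for metrics, then build the st-link-analysis payload for the interactive viewer. It assumes the script runs inside a Streamlit app (as the event-graph pages do) and reuses the styles defined in graph_config.py; the limit and threshold values are arbitrary.

```python
# Hypothetical wiring of the builders above; assumes a Streamlit script context.
from st_link_analysis import st_link_analysis

from data_access import get_gdelt_data
from graph_builder import NetworkXBuilder, StLinkBuilder
from graph_config import NODE_STYLES, EDGE_STYLES

df = get_gdelt_data(limit=5, tone_threshold=-7.0)

# NetworkX graph for metrics (degree, communities, ...)
G = NetworkXBuilder().build_graph(df)
print(G.number_of_nodes(), G.number_of_edges())

# st-link-analysis payload for the interactive viewer
elements = StLinkBuilder().build_graph(df)
st_link_analysis(elements=elements, layout="cose",
                 node_styles=NODE_STYLES, edge_styles=EDGE_STYLES)
```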
graph_config.py
ADDED
@@ -0,0 +1,60 @@
+"""
+Configuration module for graph visualization styles
+"""
+from st_link_analysis import NodeStyle, EdgeStyle
+
+# Node styles configuration
+NODE_STYLES = [
+    NodeStyle("EVENT", "#FF7F3E", "name", "description"),
+    NodeStyle("PERSON", "#4CAF50", "name", "person"),
+    NodeStyle("ORGANIZATION", "#9C27B0", "name", "business"),
+    NodeStyle("LOCATION", "#2196F3", "name", "place"),
+    NodeStyle("THEME", "#FFC107", "name", "sell"),
+    NodeStyle("COUNT", "#795548", "name", "inventory"),
+    NodeStyle("AMOUNT", "#607D8B", "name", "wallet"),
+]
+
+NODE_TYPES = {
+    'event': {
+        'color': '#1f77b4',
+        'description': 'GDELT Events'
+    },
+    'person': {
+        'color': '#2ca02c',
+        'description': 'Named Persons'
+    },
+    'organization': {
+        'color': '#ffa500',
+        'description': 'Organizations'
+    },
+    'location': {
+        'color': '#ff0000',
+        'description': 'Geographic Locations'
+    },
+    'theme': {
+        'color': '#800080',
+        'description': 'Event Themes'
+    }
+}
+
+# Edge styles configuration
+EDGE_STYLES = [
+    EdgeStyle("MENTIONED_IN", caption="label", directed=True),
+    EdgeStyle("LOCATED_IN", caption="label", directed=True),
+    EdgeStyle("CATEGORIZED_AS", caption="label", directed=True)
+]
+
+# Layout options
+LAYOUT_OPTIONS = ["cose", "circle", "grid", "breadthfirst", "concentric"]
+
+# Default graph display settings
+DEFAULT_GRAPH_HEIGHT = 500
+DEFAULT_LAYOUT = "cose"
+
+# Column configuration for data grid
+GRID_COLUMNS = {
+    "ID": {"width": "medium"},
+    "Date": {"width": "small"},
+    "Source": {"width": "medium"},
+    "Tone": {"width": "small", "format": "%.2f"}
+}
pages/1_🗺️_COVID_Navigator.py
ADDED
@@ -0,0 +1,174 @@
+import streamlit as st
+import duckdb
+import pandas as pd
+from st_aggrid import AgGrid, GridOptionsBuilder, GridUpdateMode
+
+# Constants for raw data categories
+GDELT_CATEGORIES = {
+    "Metadata": ["GKGRECORDID", "DATE", "SourceCommonName", "DocumentIdentifier", "V2.1Quotations", "tone"],
+    "Persons": ["V2EnhancedPersons", "V1Persons"],
+    "Organizations": ["V2EnhancedOrganizations", "V1Organizations"],
+    "Locations": ["V2EnhancedLocations", "V1Locations"],
+    "Themes": ["V2EnhancedThemes", "V1Themes"],
+    "Names": ["V2.1AllNames"],
+    "Counts": ["V2.1Counts", "V1Counts"],
+    "Amounts": ["V2.1Amounts"],
+    "V2GCAM": ["V2GCAM"],
+    "V2.1EnhancedDates": ["V2.1EnhancedDates"],
+}
+
+def initialize_db():
+    """Initialize database connection and create dataset view"""
+    con = duckdb.connect()
+    con.execute("""
+        CREATE VIEW negative_tone AS (
+            SELECT *
+            FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2@~parquet/default/negative_tone/*.parquet')
+        );
+    """)
+    return con
+
+def fetch_data(con, source_filter=None, themes_filter=None,
+               start_date=None, end_date=None, limit=50, include_all_columns=False):
+    """Fetch filtered data from the database"""
+    if include_all_columns:
+        columns = "*"
+    else:
+        # Double quotes so the dotted column name is read as an identifier, not a string literal.
+        columns = 'GKGRECORDID, DATE, SourceCommonName, tone, DocumentIdentifier, "V2.1Quotations", SourceCollectionIdentifier'
+
+    query = f"""
+        SELECT {columns}
+        FROM negative_tone
+        WHERE TRUE
+    """
+    params = []
+
+    if source_filter:
+        query += " AND SourceCommonName ILIKE ?"
+        params.append(f"%{source_filter}%")
+    if start_date:
+        query += " AND DATE >= ?"
+        params.append(start_date)
+    if end_date:
+        query += " AND DATE <= ?"
+        params.append(end_date)
+    if limit:
+        query += f" LIMIT {limit}"
+
+    try:
+        result = con.execute(query, params)
+        return result.fetchdf()
+    except Exception as e:
+        st.error(f"Query execution failed: {str(e)}")
+        return pd.DataFrame()
+
+def render_data_grid(df):
+    """
+    Render an interactive data grid (with built-in filtering) and return the selected row.
+    The grid is configured to show only the desired columns (ID, Date, Source, Tone)
+    and allow filtering/search on each.
+    """
+    st.subheader("Search and Filter Records")
+
+    # Build grid options with AgGrid
+    gb = GridOptionsBuilder.from_dataframe(df)
+    gb.configure_default_column(filter=True, sortable=True, resizable=True)
+    # Enable single row selection
+    gb.configure_selection('single', use_checkbox=False)
+    grid_options = gb.build()
+
+    # Render AgGrid (the grid will have a filter field for each column)
+    grid_response = AgGrid(
+        df,
+        gridOptions=grid_options,
+        update_mode=GridUpdateMode.SELECTION_CHANGED,
+        height=400,
+        fit_columns_on_grid_load=True
+    )
+
+    selected = grid_response.get('selected_rows')
+    if selected is not None:
+        # If selected is a DataFrame, use iloc to get the first row.
+        if isinstance(selected, pd.DataFrame):
+            if not selected.empty:
+                return selected.iloc[0].to_dict()
+        # Otherwise, if it's a list, get the first element.
+        elif isinstance(selected, list) and len(selected) > 0:
+            return selected[0]
+    return None
+
+def render_raw_data(record):
+    """Render raw GDELT data in expandable sections."""
+    st.header("Full Record Details")
+    for category, fields in GDELT_CATEGORIES.items():
+        with st.expander(f"{category}"):
+            for field in fields:
+                if field in record:
+                    st.markdown(f"**{field}:**")
+                    st.text(record[field])
+                    st.divider()
+
+def main():
+    st.title("🗺️ COVID Dataset Navigator")
+    st.markdown("""
+    **Explore and Analyze COVID-19 Event Data**
+
+    Use the interactive filters on the sidebar to search, sort, and inspect individual records from the GDELT Global Knowledge Graph. Adjust the parameters below to uncover detailed event insights.
+    """)
+
+    # Initialize database connection using context manager
+    with initialize_db() as con:
+        if con is not None:
+            # Add UI components
+
+            # Sidebar controls
+            with st.sidebar:
+                st.header("Search Filters")
+                source = st.text_input("Filter by source name")
+                start_date = st.text_input("Start date (YYYYMMDD)", "20200314")
+                end_date = st.text_input("End date (YYYYMMDD)", "20200315")
+                limit = st.slider("Number of results to display", 10, 500, 100)
+
+            # Fetch initial data view
+            df_initial = fetch_data(
+                con=con,
+                source_filter=source,
+                start_date=start_date,
+                end_date=end_date,
+                limit=limit,
+                include_all_columns=False
+            )
+
+            # Fetch full records for selection
+            df_full = fetch_data(
+                con=con,
+                source_filter=source,
+                start_date=start_date,
+                end_date=end_date,
+                limit=limit,
+                include_all_columns=True
+            )
+
+            # Create a DataFrame for the grid with only the key columns
+            grid_df = df_initial[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone', 'DocumentIdentifier', 'SourceCollectionIdentifier']].copy()
+            grid_df.columns = ['ID', 'Date', 'Source', 'Tone', 'Doc ID', 'Source Collection ID']
+
+            # Render the interactive data grid at the top
+            selected_row = render_data_grid(grid_df)
+
+            if selected_row:
+                # Find the full record in the original DataFrame using the selected ID
+                selected_id = selected_row['ID']
+                full_record = df_full[df_full['GKGRECORDID'] == selected_id].iloc[0]
+
+                # Display the raw data below the grid
+                render_raw_data(full_record)
+            else:
+                st.info("Select a record above to view its complete details.")
+        else:
+            st.warning("No matching records found.")
+
+        # Close database connection
+        con.close()
+
+main()
pages/2_🔍_COVID_Event_Graph.py
ADDED
@@ -0,0 +1,189 @@
+import streamlit as st
+import duckdb
+import pandas as pd
+from st_aggrid import AgGrid, GridOptionsBuilder, GridUpdateMode
+from st_link_analysis import st_link_analysis, NodeStyle, EdgeStyle
+from graph_builder import StLinkBuilder
+
+# Node styles configuration
+NODE_STYLES = [
+    NodeStyle("EVENT", "#FF7F3E", "name", "description"),
+    NodeStyle("PERSON", "#4CAF50", "name", "person"),
+    NodeStyle("NAME", "#2A629A", "created_at", "badge"),
+    NodeStyle("ORGANIZATION", "#9C27B0", "name", "business"),
+    NodeStyle("LOCATION", "#2196F3", "name", "place"),
+    NodeStyle("THEME", "#FFC107", "name", "sell"),
+    NodeStyle("COUNT", "#795548", "name", "inventory"),
+    NodeStyle("AMOUNT", "#607D8B", "name", "wallet"),
+]
+
+# Edge styles configuration
+EDGE_STYLES = [
+    EdgeStyle("MENTIONED_IN", caption="label", directed=True),
+    EdgeStyle("LOCATED_IN", caption="label", directed=True),
+    EdgeStyle("CATEGORIZED_AS", caption="label", directed=True)
+]
+
+def initialize_db():
+    """Initialize database connection and create dataset view"""
+    con = duckdb.connect()
+    con.execute("""
+        CREATE VIEW negative_tone AS (
+            SELECT *
+            FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2@~parquet/default/negative_tone/*.parquet')
+        );
+    """)
+    return con
+
+def fetch_data(con, source_filter=None,
+               start_date=None, end_date=None, limit=50, include_all_columns=False):
+    """Fetch filtered data from the database"""
+    if include_all_columns:
+        columns = "*"
+    else:
+        # Double quotes so the dotted column name is read as an identifier, not a string literal.
+        columns = 'GKGRECORDID, DATE, SourceCommonName, tone, DocumentIdentifier, "V2.1Quotations", SourceCollectionIdentifier'
+
+    query = f"""
+        SELECT {columns}
+        FROM negative_tone
+        WHERE TRUE
+    """
+    params = []
+
+    if source_filter:
+        query += " AND SourceCommonName ILIKE ?"
+        params.append(f"%{source_filter}%")
+    if start_date:
+        query += " AND DATE >= ?"
+        params.append(start_date)
+    if end_date:
+        query += " AND DATE <= ?"
+        params.append(end_date)
+    if limit:
+        query += f" LIMIT {limit}"
+
+    try:
+        result = con.execute(query, params)
+        return result.fetchdf()
+    except Exception as e:
+        st.error(f"Query execution failed: {str(e)}")
+        return pd.DataFrame()
+
+def render_data_grid(df):
+    """
+    Render an interactive data grid (with built-in filtering) and return the selected row.
+    The grid is configured to show only the desired columns (ID, Date, Source, Tone)
+    and allow filtering/search on each.
+    """
+    st.subheader("Search and Filter Records")
+
+    # Build grid options with AgGrid
+    gb = GridOptionsBuilder.from_dataframe(df)
+    gb.configure_default_column(filter=True, sortable=True, resizable=True)
+    # Enable single row selection
+    gb.configure_selection('single', use_checkbox=False)
+    grid_options = gb.build()
+
+    # Render AgGrid (the grid will have a filter field for each column)
+    grid_response = AgGrid(
+        df,
+        gridOptions=grid_options,
+        update_mode=GridUpdateMode.SELECTION_CHANGED,
+        height=400,
+        fit_columns_on_grid_load=True
+    )
+
+    selected = grid_response.get('selected_rows')
+    if selected is not None:
+        # If selected is a DataFrame, use iloc to get the first row.
+        if isinstance(selected, pd.DataFrame):
+            if not selected.empty:
+                return selected.iloc[0].to_dict()
+        # Otherwise, if it's a list, get the first element.
+        elif isinstance(selected, list) and len(selected) > 0:
+            return selected[0]
+    return None
+
+def render_graph(record):
+    """
+    Render a graph visualization for the selected record.
+    Uses StLinkBuilder to convert the record into graph format and then
+    displays the graph using st_link_analysis.
+    """
+    st.subheader(f"Event Graph: {record.get('GKGRECORDID', 'Unknown')}")
+    stlink_builder = StLinkBuilder()
+    # Convert the record (a Series) into a DataFrame with one row
+    record_df = pd.DataFrame([record])
+    graph_data = stlink_builder.build_graph(record_df)
+    return st_link_analysis(
+        elements=graph_data,
+        layout="fcose",  # other supported layouts: cose, breadthfirst, cola
+        node_styles=NODE_STYLES,
+        edge_styles=EDGE_STYLES
+    )
+
+def main():
+    st.title("🔍 COVID Event Graph Explorer")
+    st.markdown("""
+    **Interactive Event Graph Viewer**
+
+    Filter and select individual COVID-19 event records to display their detailed graph representations. Analyze relationships between events and associated entities using the interactive graph below.
+    """)
+
+    # Initialize database connection using context manager
+    with initialize_db() as con:
+        if con is not None:
+            # Add UI components
+
+            # Sidebar controls
+            with st.sidebar:
+                st.header("Search Filters")
+                source = st.text_input("Filter by source name")
+                start_date = st.text_input("Start date (YYYYMMDD)", "20200314")
+                end_date = st.text_input("End date (YYYYMMDD)", "20200315")
+                limit = st.slider("Number of results to display", 10, 500, 100)
+
+            # Fetch initial data view
+            df_initial = fetch_data(
+                con=con,
+                source_filter=source,
+                start_date=start_date,
+                end_date=end_date,
+                limit=limit,
+                include_all_columns=False
+            )
+
+            # Fetch full records for selection
+            df_full = fetch_data(
+                con=con,
+                source_filter=source,
+                start_date=start_date,
+                end_date=end_date,
+                limit=limit,
+                include_all_columns=True
+            )
+
+            # Create a DataFrame for the grid with only the key columns
+            grid_df = df_initial[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone', 'DocumentIdentifier', 'SourceCollectionIdentifier']].copy()
+            grid_df.columns = ['ID', 'Date', 'Source', 'Tone', 'Doc ID', 'Source Collection ID']
+
+            # Render the interactive data grid at the top
+            selected_row = render_data_grid(grid_df)
+
+            if selected_row:
+                # Find the full record in the original DataFrame using the selected ID
+                selected_id = selected_row['ID']
+                full_record = df_full[df_full['GKGRECORDID'] == selected_id].iloc[0]
+
+                # Display the graph and raw data below the grid
+                render_graph(full_record)
+            else:
+                st.info("Use the grid filters above to search and select a record.")
+
+        else:
+            st.warning("No matching records found.")
+
+        # Close database connection
+        con.close()
+
+main()
pages/3_🌐_COVID_Network_Analysis.py
ADDED
@@ -0,0 +1,349 @@
+"""
+Network Analysis Page - GDELT Graph Analysis
+This module provides interactive network analysis of GDELT event data.
+"""
+import streamlit as st
+import networkx as nx
+from pyvis.network import Network
+import pandas as pd
+from datetime import datetime
+import tempfile
+import json
+from typing import Dict, List, Set, Tuple, Optional
+from pathlib import Path
+
+from data_access import get_gdelt_data, filter_dataframe, GDELT_CATEGORIES
+from graph_builder import StreamlitGraphBuilder
+from graph_config import NODE_TYPES
+
+# Type aliases for clarity
+NodeID = str
+CommunityID = int
+Community = Set[NodeID]
+Communities = List[Community]
+
+def create_legend_html() -> str:
+    """Create HTML for the visualization legend."""
+    legend_html = """
+    <div style="
+        position: absolute;
+        top: 10px;
+        right: 10px;
+        background-color: rgba(255, 255, 255, 0.9);
+        padding: 10px;
+        border-radius: 5px;
+        border: 1px solid #ddd;
+        z-index: 1000;
+    ">
+        <h3 style="margin: 0 0 10px 0;">Legend</h3>
+    """
+
+    for node_type, info in NODE_TYPES.items():
+        legend_html += f"""
+        <div style="margin: 5px 0;">
+            <span style="
+                display: inline-block;
+                width: 12px;
+                height: 12px;
+                background-color: {info['color']};
+                border-radius: 50%;
+                margin-right: 5px;
+            "></span>
+            <span>{info['description']}</span>
+        </div>
+        """
+
+    legend_html += "</div>"
+    return legend_html
+
+class CommunityAnalyzer:
+    """Handles community detection and analysis for GDELT network graphs."""
+
+    def __init__(self, G: nx.Graph):
+        self.G = G
+        self._communities: Optional[Communities] = None
+        self._analysis: Optional[List[Dict]] = None
+
+    @property
+    def communities(self) -> Communities:
+        """Cached access to detected communities."""
+        if self._communities is None:
+            self._communities = nx.community.louvain_communities(self.G)
+        return self._communities
+
+    def analyze_composition(self) -> List[Dict]:
+        """Perform detailed analysis of each community's composition."""
+        if self._analysis is not None:
+            return self._analysis
+
+        analysis_results = []
+
+        for idx, community in enumerate(self.communities):
+            try:
+                # Initialize analysis containers
+                node_types = {ntype: 0 for ntype in NODE_TYPES.keys()}
+                themes: Set[str] = set()
+                entities: Dict[str, int] = {}
+
+                # Analyze community nodes
+                for node in community:
+                    attrs = self.G.nodes[node]
+                    node_type = attrs.get('type', 'unknown')
+
+                    # Update type counts
+                    if node_type in node_types:
+                        node_types[node_type] += 1
+
+                    # Collect themes
+                    if node_type == 'theme':
+                        theme_name = attrs.get('name', '')
+                        if theme_name:
+                            themes.add(theme_name)
+
+                    # Track entity connections
+                    if node_type in {'person', 'organization', 'location'}:
+                        name = attrs.get('name', node)
+                        entities[name] = self.G.degree(node)
+
+                # Calculate community metrics
+                subgraph = self.G.subgraph(community)
+                n = len(community)
+                possible_edges = (n * (n - 1)) / 2 if n > 1 else 0
+                density = (subgraph.number_of_edges() / possible_edges) if possible_edges > 0 else 0
+
+                # Get top entities by degree
+                top_entities = dict(sorted(entities.items(), key=lambda x: x[1], reverse=True)[:5])
+
+                analysis_results.append({
+                    'id': idx,
+                    'size': len(community),
+                    'node_types': node_types,
+                    'themes': sorted(themes),
+                    'top_entities': top_entities,
+                    'density': density,
+                    'internal_edges': subgraph.number_of_edges(),
+                    'external_edges': sum(1 for u in community
+                                          for v in self.G[u]
+                                          if v not in community)
+                })
+
+            except Exception as e:
+                st.error(f"Error analyzing community {idx}: {str(e)}")
+                continue
+
+        self._analysis = analysis_results
+        return analysis_results
+
+def display_community_analysis(analysis: List[Dict]) -> None:
+    """Display detailed community analysis in Streamlit."""
+    # Display summary metrics
+    total_nodes = sum(comm['size'] for comm in analysis)
+    col1, col2, col3 = st.columns(3)
+    with col1:
+        st.metric("Total Communities", len(analysis))
+    with col2:
+        st.metric("Total Nodes", total_nodes)
+    with col3:
+        largest_comm = max(comm['size'] for comm in analysis)
+        st.metric("Largest Community", largest_comm)
+
+    # Display each community in tabs
+    st.subheader("Community Details")
+    tabs = st.tabs([f"Community {comm['id']}" for comm in analysis])
+    for tab, comm in zip(tabs, analysis):
+        with tab:
+            cols = st.columns(2)
+
+            # Left column: Composition
+            with cols[0]:
+                st.subheader("Composition")
+                node_types_df = pd.DataFrame([comm['node_types']]).T
+                node_types_df.columns = ['Count']
+                st.bar_chart(node_types_df)
+
+                st.markdown("**Metrics:**")
+                st.write(f"- Size: {comm['size']} nodes")
+                st.write(f"- Density: {comm['density']:.3f}")
+                st.write(f"- Internal edges: {comm['internal_edges']}")
+                st.write(f"- External edges: {comm['external_edges']}")
+                st.write(f"- % of network: {(comm['size']/total_nodes)*100:.1f}%")
+
+            # Right column: Entities and Themes
+            with cols[1]:
+                if comm['top_entities']:
+                    st.subheader("Key Entities")
+                    for entity, degree in comm['top_entities'].items():
+                        st.write(f"- {entity} ({degree} connections)")
+
+                if comm['themes']:
+                    st.subheader("Themes")
+                    for theme in sorted(comm['themes']):
+                        st.write(f"- {theme}")
+
+def visualize_with_pyvis(G: nx.Graph, physics: bool = True) -> str:
+    """Create interactive PyVis visualization with legend."""
+    net = Network(height="600px", width="100%", notebook=False, directed=False)
+    net.from_nx(G)
+
+    # Configure nodes
+    for node in net.nodes:
+        node_type = node.get("type", "unknown")
+        node["color"] = NODE_TYPES.get(node_type, {}).get('color', "#cccccc")
+        node["size"] = 20 if node_type == "event" else 15
+        title_attrs = {k: v for k, v in node.items() if k != "id"}
+        node["title"] = "<br>".join(f"{k}: {v}" for k, v in title_attrs.items())
+
+    # Configure edges
+    for edge in net.edges:
+        edge["title"] = edge.get("relationship", "")
+        edge["color"] = {"color": "#666666", "opacity": 0.5}
+
+    # Physics settings
+    if physics:
+        net.show_buttons(filter_=['physics'])
+    else:
+        net.toggle_physics(False)
+
+    # Generate HTML
+    with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as f:
+        net.save_graph(f.name)
+        html_content = Path(f.name).read_text(encoding='utf-8')
+
+    # Add legend
+    legend = create_legend_html()
+    html_content = html_content.replace('</body>', f'{legend}</body>')
+
+    return html_content
+
+def main():
+    st.title("🌐 Global Network Analysis")
+    st.markdown("""
+    **Explore Global Event Networks**
+
+    Dive deep into the interconnected world of negative sentiment events as captured by GDELT. Utilize interactive visualizations and community analysis tools to understand key metrics, structures, and interrelationships.
+    """)
+
+    # Initialize session state
+    if 'vis_html' not in st.session_state:
+        st.session_state.vis_html = None
+
+    # Sidebar controls
+    with st.sidebar:
+        st.header("Graph Controls")
+        limit = st.slider("Max records to load", 1, 25, 5)
+        tone_threshold = st.slider("Max tone score", -10.0, -5.0, -7.0)
+        show_physics = st.checkbox("Enable physics", value=True)
+
+        st.header("Advanced Filters")
+        source_filter = st.text_input("Filter by source name")
+        themes_filter = st.text_input("Filter by theme/keyword")
+        start_date = st.text_input("Start date (YYYYMMDD)")
+        end_date = st.text_input("End date (YYYYMMDD)")
+
+    try:
+        # Load and process data
+        df = get_gdelt_data(
+            limit=limit,
+            tone_threshold=tone_threshold,
+            start_date=start_date if start_date else None,
+            end_date=end_date if end_date else None,
+            source_filter=source_filter,
+            themes_filter=themes_filter
+        )
+
+        # Build graph
+        with st.spinner("Building knowledge graph..."):
+            builder = StreamlitGraphBuilder()
+            for _, row in df.iterrows():
+                builder.process_row(row)
+            G = builder.G
+
+        if G.number_of_nodes() == 0:
+            st.warning("No data found matching the specified criteria.")
+            return
+
+        # Display basic metrics
+        col1, col2, col3 = st.columns(3)
+        with col1:
+            st.metric("Total Nodes", G.number_of_nodes())
+        with col2:
+            st.metric("Total Edges", G.number_of_edges())
+        with col3:
+            event_count = sum(1 for _, attr in G.nodes(data=True)
+                              if attr.get("type") == "event")
+            st.metric("Negative Events", event_count)
+
+        # Analysis section
+        st.header("NetworkX Graph Analysis")
+
+        # Centrality analysis
+        with st.expander("Centrality Analysis"):
+            degree_centrality = nx.degree_centrality(G)
+            top_nodes = sorted(degree_centrality.items(),
+                               key=lambda x: x[1], reverse=True)[:5]
+
+            st.write("Most Connected Nodes:")
+            for node, centrality in top_nodes:
+                node_type = G.nodes[node].get("type", "unknown")
+                st.write(f"- `{node[:30]}` ({node_type}): {centrality:.3f}")
+
+        # Community analysis
+        with st.expander("Community Analysis"):
+            try:
+                analyzer = CommunityAnalyzer(G)
+                analysis = analyzer.analyze_composition()
+                display_community_analysis(analysis)
+            except Exception as e:
+                st.error(f"Community analysis failed: {str(e)}")
+                st.error("Please check the graph structure and try again.")
+
+        # Export options
+        st.header("Export Options")
+        with st.expander("Export Data"):
+            col1, col2, col3 = st.columns(3)
+
+            with col1:
+                # GraphML export
+                graphml_string = "".join(nx.generate_graphml(G))
+                st.download_button(
+                    label="Download GraphML",
+                    data=graphml_string.encode('utf-8'),
+                    file_name=f"gdelt_graph_{datetime.now().isoformat()}.graphml",
+                    mime="application/xml"
+                )
+
+            with col2:
+                # JSON network export
+                json_string = json.dumps(nx.node_link_data(G, edges="edges"))
+                st.download_button(
+                    label="Download JSON",
+                    data=json_string.encode('utf-8'),
+                    file_name=f"gdelt_graph_{datetime.now().isoformat()}.json",
+                    mime="application/json"
+                )
+
+            with col3:
+                # Community analysis export
+                if 'analysis' in locals():
+                    analysis_json = json.dumps(analysis, indent=2)
+                    st.download_button(
+                        label="Download Analysis",
+                        data=analysis_json.encode('utf-8'),
+                        file_name=f"community_analysis_{datetime.now().isoformat()}.json",
+                        mime="application/json"
+                    )
+
+        # Interactive visualization
+        st.header("Network Visualization")
+        with st.expander("Interactive Network", expanded=False):
+            if st.session_state.vis_html is None:
+                with st.spinner("Generating visualization..."):
+                    st.session_state.vis_html = visualize_with_pyvis(G, physics=show_physics)
+            st.components.v1.html(st.session_state.vis_html, height=600, scrolling=True)
+
+    except Exception as e:
+        st.error(f"An error occurred: {str(e)}")
+        st.error("Please adjust your filters and try again.")
+
+main()
ADDED
@@ -0,0 +1,185 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import streamlit as st
|
2 |
+
import duckdb
|
3 |
+
import pandas as pd
|
4 |
+
from st_aggrid import AgGrid, GridOptionsBuilder, GridUpdateMode
|
5 |
+
|
6 |
+
# Constants for raw data categories
|
7 |
+
GDELT_CATEGORIES = {
|
8 |
+
"Metadata": ["GKGRECORDID", "DATE", "SourceCommonName", "DocumentIdentifier", "V2.1Quotations", "tone"],
|
9 |
+
"Persons": ["V2EnhancedPersons", "V1Persons"],
|
10 |
+
"Organizations": ["V2EnhancedOrganizations", "V1Organizations"],
|
11 |
+
"Locations": ["V2EnhancedLocations", "V1Locations"],
|
12 |
+
"Themes": ["V2EnhancedThemes", "V1Themes"],
|
13 |
+
"Names": ["V2.1AllNames"],
|
14 |
+
"Counts": ["V2.1Counts", "V1Counts"],
|
15 |
+
"Amounts": ["V2.1Amounts"],
|
16 |
+
"V2GCAM": ["V2GCAM"],
|
17 |
+
"V2.1EnhancedDates": ["V2.1EnhancedDates"],
|
18 |
+
}
|
19 |
+
|
20 |
+
def initialize_db():
|
21 |
+
"""Initialize database connection and create dataset view with optimized tone extraction"""
|
22 |
+
con = duckdb.connect()
|
23 |
+
con.execute("""
|
24 |
+
CREATE VIEW tone_vw AS (
|
25 |
+
SELECT
|
26 |
+
* EXCLUDE ("V1.5Tone"),
|
27 |
+
TRY_CAST(
|
28 |
+
CASE
|
29 |
+
WHEN POSITION(',' IN "V1.5Tone") > 0
|
30 |
+
THEN SUBSTRING("V1.5Tone", 1, POSITION(',' IN "V1.5Tone") - 1)
|
31 |
+
ELSE "V1.5Tone"
|
32 |
+
END
|
33 |
+
AS FLOAT
|
34 |
+
) AS tone
|
35 |
+
FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-2025-v2/**/*.parquet')
|
36 |
+
);
|
37 |
+
""")
|
38 |
+
return con
|
39 |
+
|
40 |
+
def fetch_data(con, source_filter=None,
|
41 |
+
start_date=None, end_date=None, limit=10, include_all_columns=False):
|
42 |
+
"""Fetch filtered data from the database"""
|
43 |
+
if include_all_columns:
|
44 |
+
columns = "*"
|
45 |
+
else:
|
46 |
+
# Changed column specification: use double quotes for column names with periods.
|
47 |
+
columns = 'GKGRECORDID, DATE, SourceCommonName, tone, DocumentIdentifier, "V2.1SharingImage", "V2.1Quotations", SourceCollectionIdentifier'
|
48 |
+
|
49 |
+
query = f"""
|
50 |
+
SELECT {columns}
|
51 |
+
FROM tone_vw
|
52 |
+
WHERE TRUE
|
53 |
+
"""
|
54 |
+
params = []
|
55 |
+
|
56 |
+
if source_filter:
|
57 |
+
query += " AND SourceCommonName ILIKE ?"
|
58 |
+
params.append(f"%{source_filter}%")
|
59 |
+
if start_date:
|
60 |
+
query += " AND DATE >= ?"
|
61 |
+
params.append(start_date)
|
62 |
+
if end_date:
|
63 |
+
query += " AND DATE <= ?"
|
64 |
+
params.append(end_date)
|
65 |
+
if limit:
|
66 |
+
query += f" LIMIT {limit}"
|
67 |
+
|
68 |
+
try:
|
69 |
+
result = con.execute(query, params)
|
70 |
+
return result.fetchdf()
|
71 |
+
except Exception as e:
|
72 |
+
st.error(f"Query execution failed: {str(e)}")
|
73 |
+
return pd.DataFrame()
|
74 |
+
|
75 |
+
def render_data_grid(df):
|
76 |
+
"""
|
77 |
+
Render an interactive data grid (with built‑in filtering) and return the selected row.
|
78 |
+
The grid is configured to show a subset of key columns (ID, Date, Source, Tone, etc.)
|
79 |
+
and allow filtering/search on each.
|
80 |
+
"""
|
81 |
+
st.subheader("Search and Filter Records")
|
82 |
+
|
83 |
+
# Build grid options with AgGrid
|
84 |
+
gb = GridOptionsBuilder.from_dataframe(df)
|
85 |
+
gb.configure_default_column(filter=True, sortable=True, resizable=True)
|
86 |
+
# Enable single row selection
|
87 |
+
gb.configure_selection('single', use_checkbox=False)
|
88 |
+
grid_options = gb.build()
|
89 |
+
|
90 |
+
# Render AgGrid (the grid will have a filter field for each column)
|
91 |
+
grid_response = AgGrid(
|
92 |
+
df,
|
93 |
+
gridOptions=grid_options,
|
94 |
+
update_mode=GridUpdateMode.SELECTION_CHANGED,
|
95 |
+
height=400,
|
96 |
+
fit_columns_on_grid_load=True
|
97 |
+
)
|
98 |
+
|
99 |
+
selected = grid_response.get('selected_rows')
|
100 |
+
if selected is not None:
|
101 |
+
# If selected is a DataFrame, use iloc to get the first row.
|
102 |
+
if isinstance(selected, pd.DataFrame):
|
103 |
+
if not selected.empty:
|
104 |
+
return selected.iloc[0].to_dict()
|
105 |
+
# Otherwise, if it's a list, get the first element.
|
106 |
+
elif isinstance(selected, list) and len(selected) > 0:
|
107 |
+
return selected[0]
|
108 |
+
return None
|
109 |
+
|
110 |
+
def render_raw_data(record):
|
111 |
+
"""Render raw GDELT data in expandable sections."""
|
112 |
+
st.header("Full Record Details")
|
113 |
+
for category, fields in GDELT_CATEGORIES.items():
|
114 |
+
with st.expander(f"{category}"):
|
115 |
+
for field in fields:
|
116 |
+
if field in record:
|
117 |
+
st.markdown(f"**{field}:**")
|
118 |
+
st.text(record[field])
|
119 |
+
st.divider()
|
120 |
+
|
121 |
+
def main():
|
122 |
+
st.title("🗺️ GDELT Feb 2025 Navigator")
|
123 |
+
st.markdown("""
|
124 |
+
**Investigate Recent Global Events (Feb 2025)**
|
125 |
+
|
126 |
+
Leverage advanced filters and interactive grids to explore the latest data from the GDELT Global Knowledge Graph. This navigator is optimized for recent events, offering insights into evolving global narratives.
|
127 |
+
""")
|
128 |
+
|
129 |
+
|
130 |
+
# Initialize database connection using context manager
|
131 |
+
with initialize_db() as con:
|
132 |
+
if con is not None:
|
133 |
+
# Add UI components
|
134 |
+
|
135 |
+
# Sidebar controls
|
136 |
+
with st.sidebar:
|
137 |
+
st.header("Search Filters")
|
138 |
+
source = st.text_input("Filter by source name")
|
139 |
+
start_date = st.text_input("Start date (YYYYMMDD)", "20250210")
|
140 |
+
end_date = st.text_input("End date (YYYYMMDD)", "20250211")
|
141 |
+
limit = st.slider("Number of results to display", 10, 500, 10)
|
142 |
+
|
143 |
+
# Fetch initial data view
|
144 |
+
df_initial = fetch_data(
|
145 |
+
con=con,
|
146 |
+
source_filter=source,
|
147 |
+
start_date=start_date,
|
148 |
+
end_date=end_date,
|
149 |
+
limit=limit,
|
150 |
+
include_all_columns=False
|
151 |
+
)
|
152 |
+
|
153 |
+
# Fetch full records for selection
|
154 |
+
df_full = fetch_data(
|
155 |
+
con=con,
|
156 |
+
source_filter=source,
|
157 |
+
start_date=start_date,
|
158 |
+
end_date=end_date,
|
159 |
+
limit=limit,
|
160 |
+
include_all_columns=True
|
161 |
+
)
|
162 |
+
|
163 |
+
# Create a DataFrame for the grid with only the key columns
|
164 |
+
grid_df = df_initial[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone', 'DocumentIdentifier', "V2.1SharingImage", 'SourceCollectionIdentifier']].copy()
|
165 |
+
grid_df.columns = ['ID', 'Date', 'Source', 'Tone', 'Doc ID', 'Image', 'Source Collection ID']
|
166 |
+
|
167 |
+
# Render the interactive data grid at the top
|
168 |
+
selected_row = render_data_grid(grid_df)
|
169 |
+
|
170 |
+
if selected_row:
|
171 |
+
# Find the full record in the original DataFrame using the selected ID
|
172 |
+
selected_id = selected_row['ID']
|
173 |
+
full_record = df_full[df_full['GKGRECORDID'] == selected_id].iloc[0]
|
174 |
+
|
175 |
+
# Display the raw data below the grid
|
176 |
+
render_raw_data(full_record)
|
177 |
+
else:
|
178 |
+
st.info("Select a record above to view its complete details.")
|
179 |
+
else:
|
180 |
+
st.warning("No matching records found.")
|
181 |
+
|
182 |
+
# Close database connection
|
183 |
+
con.close()
|
184 |
+
|
185 |
+
main()
|
pages/5_🔍_Feb_2025_Event_Graph.py
ADDED
@@ -0,0 +1,198 @@
1 |
+
import streamlit as st
|
2 |
+
import duckdb
|
3 |
+
import pandas as pd
|
4 |
+
from st_aggrid import AgGrid, GridOptionsBuilder, GridUpdateMode
|
5 |
+
from st_link_analysis import st_link_analysis, NodeStyle, EdgeStyle
|
6 |
+
from graph_builder import StLinkBuilder
|
7 |
+
|
8 |
+
# Node styles configuration
|
9 |
+
NODE_STYLES = [
|
10 |
+
NodeStyle("EVENT", "#FF7F3E", "name", "description"),
|
11 |
+
NodeStyle("PERSON", "#4CAF50", "name", "person"),
|
12 |
+
NodeStyle("NAME", "#2A629A", "created_at", "badge"),
|
13 |
+
NodeStyle("ORGANIZATION", "#9C27B0", "name", "business"),
|
14 |
+
NodeStyle("LOCATION", "#2196F3", "name", "place"),
|
15 |
+
NodeStyle("THEME", "#FFC107", "name", "sell"),
|
16 |
+
NodeStyle("COUNT", "#795548", "name", "inventory"),
|
17 |
+
NodeStyle("AMOUNT", "#607D8B", "name", "wallet"),
|
18 |
+
]
|
19 |
+
|
20 |
+
# Edge styles configuration
|
21 |
+
EDGE_STYLES = [
|
22 |
+
EdgeStyle("MENTIONED_IN", caption="label", directed=True),
|
23 |
+
EdgeStyle("LOCATED_IN", caption="label", directed=True),
|
24 |
+
EdgeStyle("CATEGORIZED_AS", caption="label", directed=True)
|
25 |
+
]
|
26 |
+
|
27 |
+
def initialize_db():
|
28 |
+
"""Initialize database connection and create dataset view with optimized tone extraction"""
|
29 |
+
con = duckdb.connect()
|
30 |
+
con.execute("""
|
31 |
+
CREATE VIEW tone_vw AS (
|
32 |
+
SELECT
|
33 |
+
* EXCLUDE ("V1.5Tone"),
|
34 |
+
TRY_CAST(
|
35 |
+
CASE
|
36 |
+
WHEN POSITION(',' IN "V1.5Tone") > 0
|
37 |
+
THEN SUBSTRING("V1.5Tone", 1, POSITION(',' IN "V1.5Tone") - 1)
|
38 |
+
ELSE "V1.5Tone"
|
39 |
+
END
|
40 |
+
AS FLOAT
|
41 |
+
) AS tone
|
42 |
+
FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-2025-v2/**/*.parquet')
|
43 |
+
);
|
44 |
+
""")
|
45 |
+
return con
|
46 |
+
|
47 |
+
def fetch_data(con, source_filter=None,
|
48 |
+
start_date=None, end_date=None, limit=50, include_all_columns=False):
|
49 |
+
"""Fetch filtered data from the database"""
|
50 |
+
if include_all_columns:
|
51 |
+
columns = "*"
|
52 |
+
else:
|
53 |
+
columns = "GKGRECORDID, DATE, SourceCommonName, tone, DocumentIdentifier, 'V2.1Quotations', SourceCollectionIdentifier"
|
54 |
+
|
55 |
+
query = f"""
|
56 |
+
SELECT {columns}
|
57 |
+
FROM tone_vw
|
58 |
+
WHERE TRUE
|
59 |
+
"""
|
60 |
+
params = []
|
61 |
+
|
62 |
+
if source_filter:
|
63 |
+
query += " AND SourceCommonName ILIKE ?"
|
64 |
+
params.append(f"%{source_filter}%")
|
65 |
+
if start_date:
|
66 |
+
query += " AND DATE >= ?"
|
67 |
+
params.append(start_date)
|
68 |
+
if end_date:
|
69 |
+
query += " AND DATE <= ?"
|
70 |
+
params.append(end_date)
|
71 |
+
if limit:
|
72 |
+
query += f" LIMIT {limit}"
|
73 |
+
|
74 |
+
try:
|
75 |
+
result = con.execute(query, params)
|
76 |
+
return result.fetchdf()
|
77 |
+
except Exception as e:
|
78 |
+
st.error(f"Query execution failed: {str(e)}")
|
79 |
+
return pd.DataFrame()
|
80 |
+
|
81 |
+
def render_data_grid(df):
|
82 |
+
"""
|
83 |
+
Render an interactive data grid (with built‑in filtering) and return the selected row.
|
84 |
+
The grid is configured to show a subset of key columns (ID, Date, Source, Tone, etc.)
|
85 |
+
and allow filtering/search on each.
|
86 |
+
"""
|
87 |
+
st.subheader("Search and Filter Records")
|
88 |
+
|
89 |
+
# Build grid options with AgGrid
|
90 |
+
gb = GridOptionsBuilder.from_dataframe(df)
|
91 |
+
gb.configure_default_column(filter=True, sortable=True, resizable=True)
|
92 |
+
# Enable single row selection
|
93 |
+
gb.configure_selection('single', use_checkbox=False)
|
94 |
+
grid_options = gb.build()
|
95 |
+
|
96 |
+
# Render AgGrid (the grid will have a filter field for each column)
|
97 |
+
grid_response = AgGrid(
|
98 |
+
df,
|
99 |
+
gridOptions=grid_options,
|
100 |
+
update_mode=GridUpdateMode.SELECTION_CHANGED,
|
101 |
+
height=400,
|
102 |
+
fit_columns_on_grid_load=True
|
103 |
+
)
|
104 |
+
|
105 |
+
selected = grid_response.get('selected_rows')
|
106 |
+
if selected is not None:
|
107 |
+
# If selected is a DataFrame, use iloc to get the first row.
|
108 |
+
if isinstance(selected, pd.DataFrame):
|
109 |
+
if not selected.empty:
|
110 |
+
return selected.iloc[0].to_dict()
|
111 |
+
# Otherwise, if it's a list, get the first element.
|
112 |
+
elif isinstance(selected, list) and len(selected) > 0:
|
113 |
+
return selected[0]
|
114 |
+
return None
|
115 |
+
|
116 |
+
def render_graph(record):
|
117 |
+
"""
|
118 |
+
Render a graph visualization for the selected record.
|
119 |
+
Uses StLinkBuilder to convert the record into graph format and then
|
120 |
+
displays the graph using st_link_analysis.
|
121 |
+
"""
|
122 |
+
st.subheader(f"Event Graph: {record.get('GKGRECORDID', 'Unknown')}")
|
123 |
+
stlink_builder = StLinkBuilder()
|
124 |
+
# Convert the record (a Series) into a DataFrame with one row
|
125 |
+
record_df = pd.DataFrame([record])
|
126 |
+
graph_data = stlink_builder.build_graph(record_df)
|
127 |
+
return st_link_analysis(
|
128 |
+
elements=graph_data,
|
129 |
+
layout="fcose", # Column configuration for data grid - cose, fcose, breadthfirst, cola
|
130 |
+
node_styles=NODE_STYLES,
|
131 |
+
edge_styles=EDGE_STYLES
|
132 |
+
)
|
133 |
+
|
134 |
+
def main():
|
135 |
+
st.title("🔍 GDELT Feb 2025 Event Graph Explorer")
|
136 |
+
st.markdown("""
|
137 |
+
**Investigate Recent Global Events (Feb 2025) in an Interactive Event Graph Viewer**
|
138 |
+
|
139 |
+
Filter and select individual event records to display their detailed graph representations. Analyze relationships between events and associated entities using the interactive graph below.
|
140 |
+
""")
|
141 |
+
|
142 |
+
# Initialize database connection using context manager
|
143 |
+
with initialize_db() as con:
|
144 |
+
if con is not None:
|
145 |
+
# Add UI components
|
146 |
+
|
147 |
+
# Sidebar controls
|
148 |
+
with st.sidebar:
|
149 |
+
st.header("Search Filters")
|
150 |
+
source = st.text_input("Filter by source name")
|
151 |
+
start_date = st.text_input("Start date (YYYYMMDD)", "20250210")
|
152 |
+
end_date = st.text_input("End date (YYYYMMDD)", "20250211")
|
153 |
+
limit = st.slider("Number of results to display", 10, 500, 100)
|
154 |
+
|
155 |
+
# Fetch initial data view
|
156 |
+
df_initial = fetch_data(
|
157 |
+
con=con,
|
158 |
+
source_filter=source,
|
159 |
+
start_date=start_date,
|
160 |
+
end_date=end_date,
|
161 |
+
limit=limit,
|
162 |
+
include_all_columns=False
|
163 |
+
)
|
164 |
+
|
165 |
+
# Fetch full records for selection
|
166 |
+
df_full = fetch_data(
|
167 |
+
con=con,
|
168 |
+
source_filter=source,
|
169 |
+
start_date=start_date,
|
170 |
+
end_date=end_date,
|
171 |
+
limit=limit,
|
172 |
+
include_all_columns=True
|
173 |
+
)
|
174 |
+
|
175 |
+
# Create a DataFrame for the grid with only the key columns
|
176 |
+
grid_df = df_initial[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone', 'DocumentIdentifier', 'SourceCollectionIdentifier']].copy()
|
177 |
+
grid_df.columns = ['ID', 'Date', 'Source', 'Tone', 'Doc ID', 'Source Collection ID']
|
178 |
+
|
179 |
+
# Render the interactive data grid at the top
|
180 |
+
selected_row = render_data_grid(grid_df)
|
181 |
+
|
182 |
+
if selected_row:
|
183 |
+
# Find the full record in the original DataFrame using the selected ID
|
184 |
+
selected_id = selected_row['ID']
|
185 |
+
full_record = df_full[df_full['GKGRECORDID'] == selected_id].iloc[0]
|
186 |
+
|
187 |
+
# Display the graph and raw data below the grid
|
188 |
+
render_graph(full_record)
|
189 |
+
else:
|
190 |
+
st.info("Use the grid filters above to search and select a record.")
|
191 |
+
|
192 |
+
else:
|
193 |
+
st.warning("No matching records found.")
|
194 |
+
|
195 |
+
# Close database connection
|
196 |
+
con.close()
|
197 |
+
|
198 |
+
main()
|
pages/6_🧪_Feb_2025_Dataset_Explorer.py
ADDED
@@ -0,0 +1,250 @@
1 |
+
import streamlit as st
|
2 |
+
import pandas as pd
|
3 |
+
from datasets import load_dataset
|
4 |
+
import re
|
5 |
+
from datetime import datetime, date
|
6 |
+
from io import StringIO
|
7 |
+
from typing import Optional, Tuple, List, Dict, Any
|
8 |
+
|
9 |
+
# Constants
|
10 |
+
DEFAULT_SAMPLE_SIZE = 1000
|
11 |
+
DATE_FORMAT = "%Y%m%d"
|
12 |
+
FULL_DATE_FORMAT = f"{DATE_FORMAT}%H%M%S"
|
13 |
+
|
14 |
+
# Load dataset with enhanced caching and validation
|
15 |
+
@st.cache_data(ttl=3600, show_spinner="Loading dataset...")
|
16 |
+
def load_data(sample_size: int = DEFAULT_SAMPLE_SIZE) -> pd.DataFrame:
|
17 |
+
"""
|
18 |
+
Load and validate dataset with error handling.
|
19 |
+
|
20 |
+
Args:
|
21 |
+
sample_size (int): Number of records to load
|
22 |
+
|
23 |
+
Returns:
|
24 |
+
pd.DataFrame: Loaded and validated dataframe
|
25 |
+
"""
|
26 |
+
try:
|
27 |
+
dataset = load_dataset(
|
28 |
+
"dwb2023/gdelt-gkg-2025-v2",
|
29 |
+
data_files={
|
30 |
+
"train": [
|
31 |
+
"gdelt_gkg_20250210.parquet",
|
32 |
+
"gdelt_gkg_20250211.parquet",
|
33 |
+
]
|
34 |
+
},
|
35 |
+
split="train"
|
36 |
+
)
|
37 |
+
df = pd.DataFrame(dataset)
|
38 |
+
|
39 |
+
# Basic data validation
|
40 |
+
if df.empty:
|
41 |
+
st.error("Loaded dataset is empty")
|
42 |
+
return pd.DataFrame()
|
43 |
+
|
44 |
+
if "DATE" not in df.columns:
|
45 |
+
st.error("Dataset missing required DATE column")
|
46 |
+
return pd.DataFrame()
|
47 |
+
|
48 |
+
return df
|
49 |
+
|
50 |
+
except Exception as e:
|
51 |
+
st.error(f"Error loading dataset: {str(e)}")
|
52 |
+
st.stop()
|
53 |
+
return pd.DataFrame()
|
54 |
+
|
55 |
+
def initialize_app(df: pd.DataFrame) -> None:
|
56 |
+
"""Initialize the Streamlit app interface."""
|
57 |
+
st.title("GDELT GKG 2025 Dataset Explorer")
|
58 |
+
|
59 |
+
with st.sidebar:
|
60 |
+
st.header("Search Criteria")
|
61 |
+
st.markdown("🔍 Filter dataset using the controls below")
|
62 |
+
|
63 |
+
def extract_unique_themes(df: pd.DataFrame, column: str) -> List[str]:
|
64 |
+
"""
|
65 |
+
Extract and clean unique themes from semicolon-separated column.
|
66 |
+
|
67 |
+
Args:
|
68 |
+
df (pd.DataFrame): Input dataframe
|
69 |
+
column (str): Column name containing themes
|
70 |
+
|
71 |
+
Returns:
|
72 |
+
List[str]: Sorted list of unique themes
|
73 |
+
"""
|
74 |
+
if df.empty:
|
75 |
+
return []
|
76 |
+
|
77 |
+
return sorted({
|
78 |
+
theme.split(",")[0].strip()
|
79 |
+
for themes in df[column].dropna().str.split(";")
|
80 |
+
for theme in themes if theme.strip()
|
81 |
+
})
|
82 |
+
|
83 |
+
def get_date_range(df: pd.DataFrame, date_col: str) -> Tuple[date, date]:
|
84 |
+
"""
|
85 |
+
Get min/max dates from dataset with fallback defaults.
|
86 |
+
|
87 |
+
Args:
|
88 |
+
df (pd.DataFrame): Input dataframe
|
89 |
+
date_col (str): Column name containing dates
|
90 |
+
|
91 |
+
Returns:
|
92 |
+
Tuple[date, date]: (min_date, max_date) as date objects
|
93 |
+
"""
|
94 |
+
try:
|
95 |
+
# Convert YYYYMMDDHHMMSS string format to datetime using constant
|
96 |
+
dates = pd.to_datetime(df[date_col], format=FULL_DATE_FORMAT)
|
97 |
+
return dates.min().date(), dates.max().date()
|
98 |
+
except Exception as e:
|
99 |
+
st.warning(f"Date range detection failed: {str(e)}")
|
100 |
+
return datetime(2025, 2, 10).date(), datetime(2025, 2, 11).date()
|
101 |
+
|
102 |
+
def create_filters(df: pd.DataFrame) -> Dict[str, Any]:
|
103 |
+
"""
|
104 |
+
Generate sidebar filters and return filter state.
|
105 |
+
|
106 |
+
Args:
|
107 |
+
df (pd.DataFrame): Input dataframe
|
108 |
+
|
109 |
+
Returns:
|
110 |
+
Dict[str, Any]: Dictionary of filter settings
|
111 |
+
"""
|
112 |
+
filters = {}
|
113 |
+
|
114 |
+
with st.sidebar:
|
115 |
+
# Theme multi-select
|
116 |
+
filters["themes"] = st.multiselect(
|
117 |
+
"V2EnhancedThemes (exact match)",
|
118 |
+
options=extract_unique_themes(df, "V2EnhancedThemes"),
|
119 |
+
help="Select exact themes to include (supports multiple selection)"
|
120 |
+
)
|
121 |
+
|
122 |
+
# Text-based filters
|
123 |
+
text_filters = {
|
124 |
+
"source_common_name": ("SourceCommonName", "partial name match"),
|
125 |
+
"document_identifier": ("DocumentIdentifier", "partial identifier match"),
|
126 |
+
"sharing_image": ("V2.1SharingImage", "partial image URL match")
|
127 |
+
}
|
128 |
+
|
129 |
+
for key, (label, help_text) in text_filters.items():
|
130 |
+
filters[key] = st.text_input(
|
131 |
+
f"{label} ({help_text})",
|
132 |
+
placeholder=f"Enter {help_text}...",
|
133 |
+
help=f"Case-insensitive {help_text}"
|
134 |
+
)
|
135 |
+
|
136 |
+
# Date range with dataset-based defaults
|
137 |
+
date_col = "DATE"
|
138 |
+
min_date, max_date = get_date_range(df, date_col)
|
139 |
+
|
140 |
+
filters["date_range"] = st.date_input(
|
141 |
+
"Date range",
|
142 |
+
value=(min_date, max_date),
|
143 |
+
min_value=min_date,
|
144 |
+
max_value=max_date,
|
145 |
+
)
|
146 |
+
|
147 |
+
# Record limit
|
148 |
+
filters["record_limit"] = st.number_input(
|
149 |
+
"Max records to display",
|
150 |
+
min_value=100,
|
151 |
+
max_value=5000,
|
152 |
+
value=1000,
|
153 |
+
step=100,
|
154 |
+
help="Limit results for better performance"
|
155 |
+
)
|
156 |
+
|
157 |
+
return filters
|
158 |
+
|
159 |
+
def apply_filters(df: pd.DataFrame, filters: Dict[str, Any]) -> pd.DataFrame:
|
160 |
+
"""
|
161 |
+
Apply all filters to dataframe using vectorized operations.
|
162 |
+
|
163 |
+
Args:
|
164 |
+
df (pd.DataFrame): Input dataframe to filter
|
165 |
+
filters (Dict[str, Any]): Dictionary containing filter parameters:
|
166 |
+
- themes (list): List of themes to match exactly
|
167 |
+
- source_common_name (str): Partial match for source name
|
168 |
+
- document_identifier (str): Partial match for document ID
|
169 |
+
- sharing_image (str): Partial match for image URL
|
170 |
+
- date_range (tuple): (start_date, end_date) tuple
|
171 |
+
- record_limit (int): Maximum number of records to return
|
172 |
+
|
173 |
+
Returns:
|
174 |
+
pd.DataFrame: Filtered dataframe
|
175 |
+
"""
|
176 |
+
filtered_df = df.copy()
|
177 |
+
|
178 |
+
# Theme exact match filter - set regex groups to be non-capturing using (?:) syntax
|
179 |
+
if filters["themes"]:
|
180 |
+
pattern = r'(?:^|;)(?:{})(?:$|,|;)'.format('|'.join(map(re.escape, filters["themes"])))
|
181 |
+
filtered_df = filtered_df[filtered_df["V2EnhancedThemes"].str.contains(pattern, na=False)]
|
182 |
+
|
183 |
+
# Text partial match filters
|
184 |
+
text_columns = {
|
185 |
+
"source_common_name": "SourceCommonName",
|
186 |
+
"document_identifier": "DocumentIdentifier",
|
187 |
+
"sharing_image": "V2.1SharingImage"
|
188 |
+
}
|
189 |
+
|
190 |
+
for filter_key, col_name in text_columns.items():
|
191 |
+
if value := filters.get(filter_key):
|
192 |
+
filtered_df = filtered_df[
|
193 |
+
filtered_df[col_name]
|
194 |
+
.str.contains(re.escape(value), case=False, na=False)
|
195 |
+
]
|
196 |
+
|
197 |
+
# Date range filter with validation
|
198 |
+
if len(filters["date_range"]) == 2:
|
199 |
+
start_date, end_date = filters["date_range"]
|
200 |
+
|
201 |
+
# Validate date range
|
202 |
+
if start_date > end_date:
|
203 |
+
st.error("Start date must be before end date")
|
204 |
+
return filtered_df
|
205 |
+
|
206 |
+
date_col = "DATE"
|
207 |
+
try:
|
208 |
+
# Convert full datetime strings to datetime objects using constant
|
209 |
+
date_series = pd.to_datetime(filtered_df[date_col], format=FULL_DATE_FORMAT)
|
210 |
+
|
211 |
+
# Create timestamps for start/end of day
|
212 |
+
start_timestamp = pd.Timestamp(start_date).normalize() # Start of day
|
213 |
+
end_timestamp = pd.Timestamp(end_date) + pd.Timedelta(days=1) - pd.Timedelta(seconds=1) # End of day
|
214 |
+
|
215 |
+
filtered_df = filtered_df[
|
216 |
+
(date_series >= start_timestamp) &
|
217 |
+
(date_series <= end_timestamp)
|
218 |
+
]
|
219 |
+
except Exception as e:
|
220 |
+
st.error(f"Error applying date filter: {str(e)}")
|
221 |
+
return filtered_df
|
222 |
+
|
223 |
+
# Apply record limit
|
224 |
+
return filtered_df.head(filters["record_limit"])
|
225 |
+
|
226 |
+
def main():
|
227 |
+
"""Main application entry point."""
|
228 |
+
df = load_data()
|
229 |
+
if df.empty:
|
230 |
+
st.warning("No data available - check data source")
|
231 |
+
return
|
232 |
+
|
233 |
+
initialize_app(df)
|
234 |
+
filters = create_filters(df)
|
235 |
+
filtered_df = apply_filters(df, filters)
|
236 |
+
|
237 |
+
# Display results
|
238 |
+
st.subheader(f"Results: {len(filtered_df)} records")
|
239 |
+
|
240 |
+
st.dataframe(filtered_df, use_container_width=True)
|
241 |
+
|
242 |
+
st.download_button(
|
243 |
+
label="Download CSV",
|
244 |
+
data=filtered_df.to_csv(index=False).encode(),
|
245 |
+
file_name="filtered_results.csv",
|
246 |
+
mime="text/csv",
|
247 |
+
help="Download filtered results as CSV"
|
248 |
+
)
|
249 |
+
|
250 |
+
main()
|
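The theme filter in `pages/6_🧪_Feb_2025_Dataset_Explorer.py` relies on the non-capturing regex built in `apply_filters()`. A minimal standalone sketch of how that pattern behaves; the sample `V2EnhancedThemes` strings below are hypothetical, but follow the semicolon-separated "THEME,offset" layout the explorer already assumes:

```python
import re

# Themes in V2EnhancedThemes are semicolon-separated "THEME,offset" pairs, so the
# pattern anchors on start-of-string/";" before the theme and on "$", "," or ";"
# after it, which prevents partial matches against longer theme names.
selected_themes = ["EPU_POLICY"]
pattern = r'(?:^|;)(?:{})(?:$|,|;)'.format('|'.join(map(re.escape, selected_themes)))

assert re.search(pattern, "TAX_FNCACT,12;EPU_POLICY,45")   # exact theme -> match
assert not re.search(pattern, "EPU_POLICY_GOVERNMENT,45")  # superstring -> no match
```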
requirements.txt
ADDED
@@ -0,0 +1,10 @@
1 |
+
streamlit
|
2 |
+
duckdb
|
3 |
+
networkx
|
4 |
+
pandas
|
5 |
+
pyvis
|
6 |
+
datasets
|
7 |
+
huggingface_hub
|
8 |
+
python-dateutil
|
9 |
+
st-link-analysis
|
10 |
+
streamlit-aggrid
|
solution_component_notes/gdelt_gkg_duckdb_networkx_v5.ipynb
ADDED
@@ -0,0 +1,388 @@
1 |
+
{
|
2 |
+
"cells": [
|
3 |
+
{
|
4 |
+
"cell_type": "markdown",
|
5 |
+
"metadata": {
|
6 |
+
"id": "N20l3SsqSVUM"
|
7 |
+
},
|
8 |
+
"source": [
|
9 |
+
"## Leveraging DuckDB with HF Datasets - GDELT Global KG\n",
|
10 |
+
"\n",
|
11 |
+
"This notebook demonstrates how to seamlessly transform **GDELT** knowledge graph data into a coherent format that can be pushed to both **NetworkX** and **Neo4j**. It provides a **referenceable pipeline** for data professionals, researchers, and solution architects who need to:\n",
|
12 |
+
"\n",
|
13 |
+
"1. **Ingest and Query Data Efficiently** \n",
|
14 |
+
" - Utilize **DuckDB** to load just the required portions of large Parquet datasets, enabling targeted data exploration and analysis.\n",
|
15 |
+
" - It also allows for iteratively honing in on a specific segment of data using splits - helping to maximize performance / cost / efficiency.\n",
|
16 |
+
"\n",
|
17 |
+
"2. **Maintain Consistent Graph Modeling** \n",
|
18 |
+
" - Leverage a shared parsing and entity extraction layer to build consistent node and relationship structures in both an **in-memory** graph (NetworkX) and a **Neo4j** database. (not a requirement per se - but an approach I wanted to start with)\n",
|
19 |
+
"\n",
|
20 |
+
"3. **Run Advanced Queries and Analytics** \n",
|
21 |
+
" - Illustrate critical tasks like **centrality** and **community detection** to pinpoint influential nodes and groupings, and execute **Cypher** queries for real-time insights.\n",
|
22 |
+
"\n",
|
23 |
+
"4. **Visualize and Export** \n",
|
24 |
+
" - Produce simple web-based **PyVis** visualizations or **matplotlib** plots.\n",
|
25 |
+
" - more importantly the data can also be exported in **JSON** and GraphML for integration with other graph tooling. (D3.js, Cytoscape, etc.)"
|
26 |
+
]
|
27 |
+
},
|
28 |
+
{
|
29 |
+
"cell_type": "code",
|
30 |
+
"execution_count": null,
|
31 |
+
"metadata": {
|
32 |
+
"id": "DCPEB5tpfW44"
|
33 |
+
},
|
34 |
+
"outputs": [],
|
35 |
+
"source": [
|
36 |
+
"%pip install -q duckdb networkx pandas neo4j pyvis"
|
37 |
+
]
|
38 |
+
},
|
39 |
+
{
|
40 |
+
"cell_type": "code",
|
41 |
+
"execution_count": null,
|
42 |
+
"metadata": {
|
43 |
+
"id": "A1vEyOkm7LPV"
|
44 |
+
},
|
45 |
+
"outputs": [],
|
46 |
+
"source": [
|
47 |
+
"from google.colab import userdata\n",
|
48 |
+
"\n",
|
49 |
+
"URI = userdata.get('NEO4J_URI')\n",
|
50 |
+
"USER = 'neo4j'\n",
|
51 |
+
"PASSWORD = userdata.get('NEO4J_PASSWORD')"
|
52 |
+
]
|
53 |
+
},
|
54 |
+
{
|
55 |
+
"cell_type": "code",
|
56 |
+
"execution_count": null,
|
57 |
+
"metadata": {
|
58 |
+
"id": "cm8t66uPy_C7"
|
59 |
+
},
|
60 |
+
"outputs": [],
|
61 |
+
"source": [
|
62 |
+
"import duckdb\n",
|
63 |
+
"import networkx as nx\n",
|
64 |
+
"from neo4j import GraphDatabase\n",
|
65 |
+
"import logging\n",
|
66 |
+
"from datetime import datetime\n",
|
67 |
+
"import pandas as pd\n",
|
68 |
+
"from pyvis.network import Network\n",
|
69 |
+
"\n",
|
70 |
+
"def get_gdelt_data(limit=100):\n",
|
71 |
+
" \"\"\"Get data from DuckDB with specified limit\"\"\"\n",
|
72 |
+
" con = duckdb.connect(database=':memory:')\n",
|
73 |
+
"\n",
|
74 |
+
" # Create view of the dataset\n",
|
75 |
+
" con.execute(\"\"\"\n",
|
76 |
+
" CREATE VIEW train AS (\n",
|
77 |
+
" SELECT *\n",
|
78 |
+
" FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2/*.parquet')\n",
|
79 |
+
" );\n",
|
80 |
+
" \"\"\")\n",
|
81 |
+
"\n",
|
82 |
+
" # Single query with limit\n",
|
83 |
+
" query = f\"\"\"\n",
|
84 |
+
" SELECT\n",
|
85 |
+
" GKGRECORDID,\n",
|
86 |
+
" DATE,\n",
|
87 |
+
" SourceCommonName,\n",
|
88 |
+
" DocumentIdentifier,\n",
|
89 |
+
" V2EnhancedPersons,\n",
|
90 |
+
" V2EnhancedOrganizations,\n",
|
91 |
+
" V2EnhancedLocations,\n",
|
92 |
+
" V2EnhancedThemes,\n",
|
93 |
+
" CAST(SPLIT_PART(\"V1.5Tone\", ',', 1) AS FLOAT) as tone\n",
|
94 |
+
" FROM train\n",
|
95 |
+
" LIMIT {limit}\n",
|
96 |
+
" \"\"\"\n",
|
97 |
+
"\n",
|
98 |
+
" results_df = con.execute(query).fetchdf()\n",
|
99 |
+
" con.close()\n",
|
100 |
+
" return results_df\n",
|
101 |
+
"\n",
|
102 |
+
"class GraphBuilder:\n",
|
103 |
+
" \"\"\"Base class for building graph from GDELT data\"\"\"\n",
|
104 |
+
" def process_entities(self, row):\n",
|
105 |
+
" \"\"\"Process entities from a row and return nodes and relationships\"\"\"\n",
|
106 |
+
" nodes = []\n",
|
107 |
+
" relationships = []\n",
|
108 |
+
" event_id = row[\"GKGRECORDID\"]\n",
|
109 |
+
" event_date = row[\"DATE\"]\n",
|
110 |
+
" event_source = row[\"SourceCommonName\"]\n",
|
111 |
+
" event_document_id = row[\"DocumentIdentifier\"]\n",
|
112 |
+
" event_tone = float(row[\"tone\"]) if pd.notna(row[\"tone\"]) else 0.0\n",
|
113 |
+
"\n",
|
114 |
+
" # Add event node\n",
|
115 |
+
" nodes.append({\n",
|
116 |
+
" \"id\": event_id,\n",
|
117 |
+
" \"type\": \"event\",\n",
|
118 |
+
" \"properties\": {\n",
|
119 |
+
" \"date\": event_date,\n",
|
120 |
+
" \"source\": event_source,\n",
|
121 |
+
" \"document\": event_document_id,\n",
|
122 |
+
" \"tone\": event_tone\n",
|
123 |
+
" }\n",
|
124 |
+
" })\n",
|
125 |
+
"\n",
|
126 |
+
" # Process each entity type\n",
|
127 |
+
" entity_mappings = {\n",
|
128 |
+
" \"V2EnhancedPersons\": (\"Person\", \"MENTIONED_IN\"),\n",
|
129 |
+
" \"V2EnhancedOrganizations\": (\"Organization\", \"MENTIONED_IN\"),\n",
|
130 |
+
" \"V2EnhancedLocations\": (\"Location\", \"LOCATED_IN\"),\n",
|
131 |
+
" \"V2EnhancedThemes\": (\"Theme\", \"CATEGORIZED_AS\")\n",
|
132 |
+
" }\n",
|
133 |
+
"\n",
|
134 |
+
" for field, (label, relationship) in entity_mappings.items():\n",
|
135 |
+
" if pd.notna(row[field]):\n",
|
136 |
+
" entities = [e.strip() for e in row[field].split(';') if e.strip()]\n",
|
137 |
+
" for entity in entities:\n",
|
138 |
+
" nodes.append({\n",
|
139 |
+
" \"id\": entity,\n",
|
140 |
+
" \"type\": label.lower(),\n",
|
141 |
+
" \"properties\": {\"name\": entity}\n",
|
142 |
+
" })\n",
|
143 |
+
" relationships.append({\n",
|
144 |
+
" \"from\": entity,\n",
|
145 |
+
" \"to\": event_id,\n",
|
146 |
+
" \"type\": relationship,\n",
|
147 |
+
" \"properties\": {\"created_at\": event_date}\n",
|
148 |
+
" })\n",
|
149 |
+
"\n",
|
150 |
+
" return nodes, relationships\n",
|
151 |
+
"\n",
|
152 |
+
"class NetworkXBuilder(GraphBuilder):\n",
|
153 |
+
" def build_graph(self, df):\n",
|
154 |
+
" G = nx.Graph()\n",
|
155 |
+
"\n",
|
156 |
+
" for _, row in df.iterrows():\n",
|
157 |
+
" nodes, relationships = self.process_entities(row)\n",
|
158 |
+
"\n",
|
159 |
+
" # Add nodes\n",
|
160 |
+
" for node in nodes:\n",
|
161 |
+
" G.add_node(node[\"id\"],\n",
|
162 |
+
" type=node[\"type\"],\n",
|
163 |
+
" **node[\"properties\"])\n",
|
164 |
+
"\n",
|
165 |
+
" # Add relationships\n",
|
166 |
+
" for rel in relationships:\n",
|
167 |
+
" G.add_edge(rel[\"from\"],\n",
|
168 |
+
" rel[\"to\"],\n",
|
169 |
+
" relationship=rel[\"type\"],\n",
|
170 |
+
" **rel[\"properties\"])\n",
|
171 |
+
"\n",
|
172 |
+
" return G\n",
|
173 |
+
"\n",
|
174 |
+
"class Neo4jBuilder(GraphBuilder):\n",
|
175 |
+
" def __init__(self, uri, user, password):\n",
|
176 |
+
" self.driver = GraphDatabase.driver(uri, auth=(user, password))\n",
|
177 |
+
" self.logger = logging.getLogger(__name__)\n",
|
178 |
+
"\n",
|
179 |
+
" def close(self):\n",
|
180 |
+
" self.driver.close()\n",
|
181 |
+
"\n",
|
182 |
+
" def build_graph(self, df):\n",
|
183 |
+
" with self.driver.session() as session:\n",
|
184 |
+
" for _, row in df.iterrows():\n",
|
185 |
+
" nodes, relationships = self.process_entities(row)\n",
|
186 |
+
"\n",
|
187 |
+
" # Create nodes and relationships in Neo4j\n",
|
188 |
+
" try:\n",
|
189 |
+
" session.execute_write(self._create_graph_elements,\n",
|
190 |
+
" nodes, relationships)\n",
|
191 |
+
" except Exception as e:\n",
|
192 |
+
" self.logger.error(f\"Error processing row {row['GKGRECORDID']}: {str(e)}\")\n",
|
193 |
+
"\n",
|
194 |
+
" def _create_graph_elements(self, tx, nodes, relationships):\n",
|
195 |
+
" # Create nodes\n",
|
196 |
+
" for node in nodes:\n",
|
197 |
+
" query = f\"\"\"\n",
|
198 |
+
" MERGE (n:{node['type']} {{id: $id}})\n",
|
199 |
+
" SET n += $properties\n",
|
200 |
+
" \"\"\"\n",
|
201 |
+
" tx.run(query, id=node[\"id\"], properties=node[\"properties\"])\n",
|
202 |
+
"\n",
|
203 |
+
" # Create relationships\n",
|
204 |
+
" for rel in relationships:\n",
|
205 |
+
" query = f\"\"\"\n",
|
206 |
+
" MATCH (a {{id: $from_id}})\n",
|
207 |
+
" MATCH (b {{id: $to_id}})\n",
|
208 |
+
" MERGE (a)-[r:{rel['type']}]->(b)\n",
|
209 |
+
" SET r += $properties\n",
|
210 |
+
" \"\"\"\n",
|
211 |
+
" tx.run(query,\n",
|
212 |
+
" from_id=rel[\"from\"],\n",
|
213 |
+
" to_id=rel[\"to\"],\n",
|
214 |
+
" properties=rel[\"properties\"])"
|
215 |
+
]
|
216 |
+
},
|
217 |
+
{
|
218 |
+
"cell_type": "code",
|
219 |
+
"execution_count": null,
|
220 |
+
"metadata": {
|
221 |
+
"id": "ghbLZNLe23x1"
|
222 |
+
},
|
223 |
+
"outputs": [],
|
224 |
+
"source": [
|
225 |
+
"if __name__ == \"__main__\":\n",
|
226 |
+
" # Get data once\n",
|
227 |
+
" df = get_gdelt_data(limit=25) # Get 25 records\n",
|
228 |
+
"\n",
|
229 |
+
" # Build NetworkX graph\n",
|
230 |
+
" nx_builder = NetworkXBuilder()\n",
|
231 |
+
" G = nx_builder.build_graph(df)\n",
|
232 |
+
"\n",
|
233 |
+
" # Print graph information\n",
|
234 |
+
" print(f\"NetworkX Graph Summary:\")\n",
|
235 |
+
" print(f\"Nodes: {G.number_of_nodes()}\")\n",
|
236 |
+
" print(f\"Edges: {G.number_of_edges()}\")\n",
|
237 |
+
"\n",
|
238 |
+
" # Print node types distribution\n",
|
239 |
+
" node_types = {}\n",
|
240 |
+
" for _, attr in G.nodes(data=True):\n",
|
241 |
+
" node_type = attr.get('type', 'unknown')\n",
|
242 |
+
" node_types[node_type] = node_types.get(node_type, 0) + 1\n",
|
243 |
+
"\n",
|
244 |
+
" print(\"\\nNode types distribution:\")\n",
|
245 |
+
" for ntype, count in node_types.items():\n",
|
246 |
+
" print(f\"{ntype}: {count}\")\n",
|
247 |
+
"\n",
|
248 |
+
" # Build Neo4j graph\n",
|
249 |
+
" neo4j_builder = Neo4jBuilder(URI, USER, PASSWORD)\n",
|
250 |
+
" try:\n",
|
251 |
+
" neo4j_builder.build_graph(df)\n",
|
252 |
+
" finally:\n",
|
253 |
+
" neo4j_builder.close()"
|
254 |
+
]
|
255 |
+
},
|
256 |
+
{
|
257 |
+
"cell_type": "code",
|
258 |
+
"execution_count": null,
|
259 |
+
"metadata": {
|
260 |
+
"id": "mkJKz_soTsAY"
|
261 |
+
},
|
262 |
+
"outputs": [],
|
263 |
+
"source": [
|
264 |
+
"# run cypher query for validation\n",
|
265 |
+
"\n",
|
266 |
+
"from neo4j import GraphDatabase\n",
|
267 |
+
"\n",
|
268 |
+
"class Neo4jQuery:\n",
|
269 |
+
" def __init__(self, uri, user, password):\n",
|
270 |
+
" self.driver = GraphDatabase.driver(uri, auth=(user, password))\n",
|
271 |
+
" self.logger = logging.getLogger(__name__)\n",
|
272 |
+
"\n",
|
273 |
+
" def close(self):\n",
|
274 |
+
" self.driver.close()\n",
|
275 |
+
"\n",
|
276 |
+
" def run_query(self, query):\n",
|
277 |
+
" with self.driver.session() as session:\n",
|
278 |
+
" result = session.run(query)\n",
|
279 |
+
" return result.data()\n",
|
280 |
+
"\n",
|
281 |
+
"query_1 = \"\"\"\n",
|
282 |
+
"// Count nodes by type\n",
|
283 |
+
"MATCH (n)\n",
|
284 |
+
"RETURN labels(n) as type, count(*) as count\n",
|
285 |
+
"ORDER BY count DESC;\n",
|
286 |
+
"\"\"\"\n"
|
287 |
+
]
|
288 |
+
},
|
289 |
+
{
|
290 |
+
"cell_type": "code",
|
291 |
+
"execution_count": null,
|
292 |
+
"metadata": {
|
293 |
+
"id": "mrlWADO93ize"
|
294 |
+
},
|
295 |
+
"outputs": [],
|
296 |
+
"source": [
|
297 |
+
"def visualize_graph(G, output_file='gdelt_network.html'):\n",
|
298 |
+
" \"\"\"Visualize NetworkX graph using Pyvis\"\"\"\n",
|
299 |
+
" # Create Pyvis network\n",
|
300 |
+
" net = Network(notebook=True,\n",
|
301 |
+
" height='750px',\n",
|
302 |
+
" width='100%',\n",
|
303 |
+
" bgcolor='#ffffff',\n",
|
304 |
+
" font_color='#000000')\n",
|
305 |
+
"\n",
|
306 |
+
" # Configure physics\n",
|
307 |
+
" net.force_atlas_2based(gravity=-50,\n",
|
308 |
+
" central_gravity=0.01,\n",
|
309 |
+
" spring_length=100,\n",
|
310 |
+
" spring_strength=0.08,\n",
|
311 |
+
" damping=0.4,\n",
|
312 |
+
" overlap=0)\n",
|
313 |
+
"\n",
|
314 |
+
" # Color mapping for node types\n",
|
315 |
+
" color_map = {\n",
|
316 |
+
" 'event': '#1f77b4', # Blue\n",
|
317 |
+
" 'person': '#00ff00', # Green\n",
|
318 |
+
" 'organization': '#ffa500', # Orange\n",
|
319 |
+
" 'location': '#ff0000', # Red\n",
|
320 |
+
" 'theme': '#800080' # Purple\n",
|
321 |
+
" }\n",
|
322 |
+
"\n",
|
323 |
+
" # Add nodes\n",
|
324 |
+
" for node, attr in G.nodes(data=True):\n",
|
325 |
+
" node_type = attr.get('type', 'unknown')\n",
|
326 |
+
" title = f\"Type: {node_type}\\n\"\n",
|
327 |
+
" for k, v in attr.items():\n",
|
328 |
+
" if k != 'type':\n",
|
329 |
+
" title += f\"{k}: {v}\\n\"\n",
|
330 |
+
"\n",
|
331 |
+
" net.add_node(node,\n",
|
332 |
+
" title=title,\n",
|
333 |
+
" label=str(node)[:20] + '...' if len(str(node)) > 20 else str(node),\n",
|
334 |
+
" color=color_map.get(node_type, '#gray'),\n",
|
335 |
+
" size=20 if node_type == 'event' else 15)\n",
|
336 |
+
"\n",
|
337 |
+
" # Add edges\n",
|
338 |
+
" for source, target, attr in G.edges(data=True):\n",
|
339 |
+
" net.add_edge(source,\n",
|
340 |
+
" target,\n",
|
341 |
+
" title=f\"{attr.get('relationship', '')}\\nDate: {attr.get('created_at', '')}\",\n",
|
342 |
+
" color='#666666')\n",
|
343 |
+
"\n",
|
344 |
+
" # Save visualization\n",
|
345 |
+
" net.show(output_file)\n",
|
346 |
+
" return f\"Graph visualization saved to {output_file}\"\n",
|
347 |
+
"\n",
|
348 |
+
"# Usage example:\n",
|
349 |
+
"if __name__ == \"__main__\":\n",
|
350 |
+
" visualize_graph(G)"
|
351 |
+
]
|
352 |
+
},
|
353 |
+
{
|
354 |
+
"cell_type": "code",
|
355 |
+
"execution_count": null,
|
356 |
+
"metadata": {
|
357 |
+
"id": "RqFRO1atnIIT"
|
358 |
+
},
|
359 |
+
"outputs": [],
|
360 |
+
"source": [
|
361 |
+
"!pip show duckdb"
|
362 |
+
]
|
363 |
+
},
|
364 |
+
{
|
365 |
+
"cell_type": "code",
|
366 |
+
"execution_count": null,
|
367 |
+
"metadata": {
|
368 |
+
"id": "95ML8u0LnKif"
|
369 |
+
},
|
370 |
+
"outputs": [],
|
371 |
+
"source": []
|
372 |
+
}
|
373 |
+
],
|
374 |
+
"metadata": {
|
375 |
+
"colab": {
|
376 |
+
"provenance": []
|
377 |
+
},
|
378 |
+
"kernelspec": {
|
379 |
+
"display_name": "Python 3",
|
380 |
+
"name": "python3"
|
381 |
+
},
|
382 |
+
"language_info": {
|
383 |
+
"name": "python"
|
384 |
+
}
|
385 |
+
},
|
386 |
+
"nbformat": 4,
|
387 |
+
"nbformat_minor": 0
|
388 |
+
}
|
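The notebook's introduction mentions centrality, community detection, and JSON/GraphML export, which the cells above stop short of demonstrating. A minimal sketch of how those follow-on steps could look on the NetworkX graph `G` built by `NetworkXBuilder`; the output file names and the particular algorithms chosen here are illustrative, not part of the committed notebook:

```python
import json
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Rank nodes by degree centrality to surface the most connected entities/events
degree_centrality = nx.degree_centrality(G)
top_nodes = sorted(degree_centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
for node, score in top_nodes:
    print(f"{node}: {score:.3f}")

# One of several community detection options available in NetworkX
communities = greedy_modularity_communities(G)
print(f"Detected {len(communities)} communities")

# Export for downstream graph tooling (D3.js, Cytoscape, Gephi, ...)
nx.write_graphml(G, "gdelt_graph.graphml")
with open("gdelt_graph.json", "w") as f:
    json.dump(nx.node_link_data(G), f, default=str)
```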
solution_component_notes/gdelt_prefect_extract_to_hf_ds.py
ADDED
@@ -0,0 +1,303 @@
1 |
+
import os
|
2 |
+
import asyncio
|
3 |
+
from prefect import flow, task, get_run_logger
|
4 |
+
from prefect.tasks import task_input_hash
|
5 |
+
from prefect.blocks.system import Secret, JSON
|
6 |
+
from prefect.task_runners import ConcurrentTaskRunner
|
7 |
+
from prefect.concurrency.sync import concurrency
|
8 |
+
from pathlib import Path
|
9 |
+
import datetime
|
10 |
+
from datetime import timedelta
|
11 |
+
import pandas as pd
|
12 |
+
from tqdm import tqdm
|
13 |
+
from huggingface_hub import HfApi, hf_hub_url, list_datasets
|
14 |
+
import requests
|
15 |
+
import zipfile
|
16 |
+
from typing import List, Dict, Optional
|
17 |
+
|
18 |
+
# --- Constants ---
|
19 |
+
# Set a global concurrency limit for Hugging Face uploads
|
20 |
+
REPO_ID = "dwb2023/gdelt-gkg-march2020-v2"
|
21 |
+
|
22 |
+
BASE_URL = "http://data.gdeltproject.org/gdeltv2"
|
23 |
+
|
24 |
+
# Complete Column List
|
25 |
+
GKG_COLUMNS = [
|
26 |
+
'GKGRECORDID', # Unique identifier
|
27 |
+
'DATE', # Publication date
|
28 |
+
'SourceCollectionIdentifier', # Source type
|
29 |
+
'SourceCommonName', # Source name
|
30 |
+
'DocumentIdentifier', # Document URL/ID
|
31 |
+
'V1Counts', # Counts of various types
|
32 |
+
'V2.1Counts', # Enhanced counts with positions
|
33 |
+
'V1Themes', # Theme tags
|
34 |
+
'V2EnhancedThemes', # Themes with positions
|
35 |
+
'V1Locations', # Location mentions
|
36 |
+
'V2EnhancedLocations', # Locations with positions
|
37 |
+
'V1Persons', # Person names
|
38 |
+
'V2EnhancedPersons', # Persons with positions
|
39 |
+
'V1Organizations', # Organization names
|
40 |
+
'V2EnhancedOrganizations', # Organizations with positions
|
41 |
+
'V1.5Tone', # Emotional dimensions
|
42 |
+
'V2.1EnhancedDates', # Date mentions
|
43 |
+
'V2GCAM', # Global Content Analysis Measures
|
44 |
+
'V2.1SharingImage', # Publisher selected image
|
45 |
+
'V2.1RelatedImages', # Article images
|
46 |
+
'V2.1SocialImageEmbeds', # Social media images
|
47 |
+
'V2.1SocialVideoEmbeds', # Social media videos
|
48 |
+
'V2.1Quotations', # Quote extractions
|
49 |
+
'V2.1AllNames', # Named entities
|
50 |
+
'V2.1Amounts', # Numeric amounts
|
51 |
+
'V2.1TranslationInfo', # Translation metadata
|
52 |
+
'V2ExtrasXML' # Additional XML data
|
53 |
+
]
|
54 |
+
|
55 |
+
# Priority Columns
|
56 |
+
PRIORITY_COLUMNS = [
|
57 |
+
'GKGRECORDID', # Unique identifier
|
58 |
+
'DATE', # Publication date
|
59 |
+
'SourceCollectionIdentifier', # Source type
|
60 |
+
'SourceCommonName', # Source name
|
61 |
+
'DocumentIdentifier', # Document URL/ID
|
62 |
+
'V1Counts', # Numeric mentions
|
63 |
+
'V2.1Counts', # Enhanced counts
|
64 |
+
'V1Themes', # Theme tags
|
65 |
+
'V2EnhancedThemes', # Enhanced themes
|
66 |
+
'V1Locations', # Geographic data
|
67 |
+
'V2EnhancedLocations', # Enhanced locations
|
68 |
+
'V1Persons', # Person mentions
|
69 |
+
'V2EnhancedPersons', # Enhanced persons
|
70 |
+
'V1Organizations', # Organization mentions
|
71 |
+
'V2EnhancedOrganizations', # Enhanced organizations
|
72 |
+
'V1.5Tone', # Sentiment scores
|
73 |
+
'V2.1EnhancedDates', # Date mentions
|
74 |
+
'V2GCAM', # Enhanced sentiment
|
75 |
+
'V2.1Quotations', # Direct quotes
|
76 |
+
'V2.1AllNames', # All named entities
|
77 |
+
'V2.1Amounts' # Numeric data
|
78 |
+
]
|
79 |
+
|
80 |
+
# --- Tasks ---
|
81 |
+
|
82 |
+
@task(retries=3, retry_delay_seconds=30, log_prints=True)
|
83 |
+
def setup_directories(base_path: Path) -> dict:
|
84 |
+
"""Create processing directories."""
|
85 |
+
logger = get_run_logger()
|
86 |
+
try:
|
87 |
+
raw_dir = base_path / "gdelt_raw"
|
88 |
+
processed_dir = base_path / "gdelt_processed"
|
89 |
+
raw_dir.mkdir(parents=True, exist_ok=True)
|
90 |
+
processed_dir.mkdir(parents=True, exist_ok=True)
|
91 |
+
logger.info("Directories created successfully")
|
92 |
+
return {"raw": raw_dir, "processed": processed_dir}
|
93 |
+
except Exception as e:
|
94 |
+
logger.error(f"Directory creation failed: {str(e)}")
|
95 |
+
raise
|
96 |
+
|
97 |
+
@task(retries=2, log_prints=True)
|
98 |
+
def generate_gdelt_urls(start_date: datetime.datetime, end_date: datetime.datetime) -> Dict[datetime.date, List[str]]:
|
99 |
+
"""
|
100 |
+
Generate a dictionary keyed by date. Each value is a list of URLs (one per 15-minute interval).
|
101 |
+
"""
|
102 |
+
logger = get_run_logger()
|
103 |
+
url_groups = {}
|
104 |
+
try:
|
105 |
+
current_date = start_date.date()
|
106 |
+
while current_date <= end_date.date():
|
107 |
+
urls = [
|
108 |
+
f"{BASE_URL}/{current_date.strftime('%Y%m%d')}{hour:02}{minute:02}00.gkg.csv.zip"
|
109 |
+
for hour in range(24)
|
110 |
+
for minute in [0, 15, 30, 45]
|
111 |
+
]
|
112 |
+
url_groups[current_date] = urls
|
113 |
+
current_date += timedelta(days=1)
|
114 |
+
logger.info(f"Generated URL groups for dates: {list(url_groups.keys())}")
|
115 |
+
return url_groups
|
116 |
+
except Exception as e:
|
117 |
+
logger.error(f"URL generation failed: {str(e)}")
|
118 |
+
raise
|
119 |
+
|
120 |
+
@task(retries=3, retry_delay_seconds=30, log_prints=True)
|
121 |
+
def download_file(url: str, raw_dir: Path) -> Path:
|
122 |
+
"""Download a single CSV (zip) file from the given URL."""
|
123 |
+
logger = get_run_logger()
|
124 |
+
try:
|
125 |
+
response = requests.get(url, timeout=10)
|
126 |
+
response.raise_for_status()
|
127 |
+
filename = Path(url).name
|
128 |
+
zip_path = raw_dir / filename
|
129 |
+
with zip_path.open('wb') as f:
|
130 |
+
f.write(response.content)
|
131 |
+
logger.info(f"Downloaded {filename}")
|
132 |
+
|
133 |
+
# Optionally, extract the CSV from the ZIP archive.
|
134 |
+
with zipfile.ZipFile(zip_path, 'r') as z:
|
135 |
+
# Assuming the zip contains one CSV file.
|
136 |
+
csv_names = z.namelist()
|
137 |
+
if csv_names:
|
138 |
+
extracted_csv = raw_dir / csv_names[0]
|
139 |
+
z.extractall(path=raw_dir)
|
140 |
+
logger.info(f"Extracted {csv_names[0]}")
|
141 |
+
return extracted_csv
|
142 |
+
else:
|
143 |
+
raise ValueError("Zip file is empty.")
|
144 |
+
except Exception as e:
|
145 |
+
logger.error(f"Error downloading {url}: {str(e)}")
|
146 |
+
raise
|
147 |
+
|
148 |
+
@task(retries=2, log_prints=True)
|
149 |
+
def convert_and_filter_combined(csv_paths: List[Path], processed_dir: Path, date: datetime.date) -> Path:
|
150 |
+
"""
|
151 |
+
Combine multiple CSV files (for one day) into a single DataFrame,
|
152 |
+
filter to only the required columns, optimize data types,
|
153 |
+
and write out as a single Parquet file.
|
154 |
+
"""
|
155 |
+
logger = get_run_logger()
|
156 |
+
try:
|
157 |
+
dfs = []
|
158 |
+
for csv_path in csv_paths:
|
159 |
+
df = pd.read_csv(
|
160 |
+
csv_path,
|
161 |
+
sep='\t',
|
162 |
+
names=GKG_COLUMNS,
|
163 |
+
dtype='string',
|
164 |
+
quoting=3,
|
165 |
+
na_values=[''],
|
166 |
+
encoding='utf-8',
|
167 |
+
encoding_errors='replace'
|
168 |
+
)
|
169 |
+
dfs.append(df)
|
170 |
+
combined_df = pd.concat(dfs, ignore_index=True)
|
171 |
+
filtered_df = combined_df[PRIORITY_COLUMNS].copy()
|
172 |
+
# Convert the date field to datetime; adjust the format if necessary.
|
173 |
+
if 'V2.1DATE' in filtered_df.columns:
|
174 |
+
filtered_df['V2.1DATE'] = pd.to_datetime(
|
175 |
+
filtered_df['V2.1DATE'], format='%Y%m%d%H%M%S', errors='coerce'
|
176 |
+
)
|
177 |
+
output_filename = f"gdelt_gkg_{date.strftime('%Y%m%d')}.parquet"
|
178 |
+
output_path = processed_dir / output_filename
|
179 |
+
filtered_df.to_parquet(output_path, engine='pyarrow', compression='snappy', index=False)
|
180 |
+
logger.info(f"Converted and filtered data for {date} into {output_filename}")
|
181 |
+
return output_path
|
182 |
+
except Exception as e:
|
183 |
+
logger.error(f"Error processing CSVs for {date}: {str(e)}")
|
184 |
+
raise
|
185 |
+
|
186 |
+
@task(retries=3, retry_delay_seconds=30, log_prints=True)
|
187 |
+
def upload_to_hf(file_path: Path, token: str) -> bool:
|
188 |
+
"""Upload task with global concurrency limit."""
|
189 |
+
logger = get_run_logger()
|
190 |
+
try:
|
191 |
+
with concurrency("hf_uploads", occupy=1):
|
192 |
+
# Enable the optimized HF Transfer backend.
|
193 |
+
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
|
194 |
+
|
195 |
+
api = HfApi()
|
196 |
+
api.upload_file(
|
197 |
+
path_or_fileobj=str(file_path),
|
198 |
+
path_in_repo=file_path.name,
|
199 |
+
repo_id=REPO_ID,
|
200 |
+
repo_type="dataset",
|
201 |
+
token=token,
|
202 |
+
)
|
203 |
+
logger.info(f"Uploaded {file_path.name}")
|
204 |
+
return True
|
205 |
+
except Exception as e:
|
206 |
+
logger.error(f"Upload failed for {file_path.name}: {str(e)}")
|
207 |
+
raise
|
208 |
+
|
209 |
+
@task(retries=3, retry_delay_seconds=120, log_prints=True)
|
210 |
+
def create_hf_repo(token: str) -> bool:
|
211 |
+
"""
|
212 |
+
Validate that the Hugging Face dataset repository exists; create it if not.
|
213 |
+
"""
|
214 |
+
logger = get_run_logger()
|
215 |
+
try:
|
216 |
+
api = HfApi()
|
217 |
+
datasets = [ds.id for ds in list_datasets(token=token)]
|
218 |
+
if REPO_ID in datasets:
|
219 |
+
logger.info(f"Dataset repository '{REPO_ID}' already exists.")
|
220 |
+
return True
|
221 |
+
# Create the repository if it doesn't exist.
|
222 |
+
api.create_repo(repo_id=REPO_ID, repo_type="dataset", token=token, private=False)
|
223 |
+
logger.info(f"Successfully created dataset repository: {REPO_ID}")
|
224 |
+
return True
|
225 |
+
except Exception as e:
|
226 |
+
logger.error(f"Failed to create or validate dataset repo '{REPO_ID}': {str(e)}")
|
227 |
+
raise RuntimeError(f"Repository validation/creation failed for '{REPO_ID}'") from e
|
228 |
+
|
229 |
+
@flow(name="Process Single Day", log_prints=True)
|
230 |
+
def process_single_day(
|
231 |
+
date: datetime.date, urls: List[str], directories: dict, hf_token: str
|
232 |
+
) -> bool:
|
233 |
+
"""
|
234 |
+
Process one day's data by:
|
235 |
+
1. Downloading all CSV files concurrently.
|
236 |
+
2. Merging, filtering, and optimizing the CSVs.
|
237 |
+
3. Writing out a single daily Parquet file.
|
238 |
+
4. Uploading the file to the Hugging Face Hub.
|
239 |
+
"""
|
240 |
+
logger = get_run_logger()
|
241 |
+
try:
|
242 |
+
# Download and process data (unlimited concurrency)
|
243 |
+
csv_paths = [download_file(url, directories["raw"]) for url in urls]
|
244 |
+
daily_parquet = convert_and_filter_combined(csv_paths, directories["processed"], date)
|
245 |
+
|
246 |
+
# Upload with global concurrency limit
|
247 |
+
upload_to_hf(daily_parquet, hf_token) # <-- Throttled to 2 concurrent
|
248 |
+
|
249 |
+
logger.info(f"Completed {date}")
|
250 |
+
return True
|
251 |
+
except Exception as e:
|
252 |
+
logger.error(f"Day {date} failed: {str(e)}")
|
253 |
+
raise
|
254 |
+
|
255 |
+
@flow(
|
256 |
+
name="Process Date Range",
|
257 |
+
task_runner=ConcurrentTaskRunner(), # Parallel subflows
|
258 |
+
log_prints=True
|
259 |
+
)
|
260 |
+
def process_date_range(base_path: Path = Path("data")):
|
261 |
+
"""
|
262 |
+
Main ETL flow:
|
263 |
+
1. Load parameters and credentials.
|
264 |
+
2. Validate (or create) the Hugging Face repository.
|
265 |
+
3. Setup directories.
|
266 |
+
4. Generate URL groups by date.
|
267 |
+
5. Process each day concurrently.
|
268 |
+
"""
|
269 |
+
logger = get_run_logger()
|
270 |
+
|
271 |
+
# Load parameters from a JSON block.
|
272 |
+
json_block = JSON.load("gdelt-etl-parameters")
|
273 |
+
params = json_block.value
|
274 |
+
start_date = datetime.datetime.fromisoformat(params.get("start_date", "2020-03-16T00:00:00"))
|
275 |
+
end_date = datetime.datetime.fromisoformat(params.get("end_date", "2020-03-22T00:00:00"))
|
276 |
+
|
277 |
+
# Load the Hugging Face token from a Secret block.
|
278 |
+
secret_block = Secret.load("huggingface-token")
|
279 |
+
hf_token = secret_block.get()
|
280 |
+
|
281 |
+
# Validate or create the repository.
|
282 |
+
create_hf_repo(hf_token)
|
283 |
+
|
284 |
+
directories = setup_directories(base_path)
|
285 |
+
url_groups = generate_gdelt_urls(start_date, end_date)
|
286 |
+
|
287 |
+
# Process days concurrently (subflows)
|
288 |
+
futures = [process_single_day(date, urls, directories, hf_token)
|
289 |
+
for date, urls in url_groups.items()]
|
290 |
+
|
291 |
+
# Wait for completion (optional error handling)
|
292 |
+
for future in futures:
|
293 |
+
try:
|
294 |
+
future.result()
|
295 |
+
except Exception as e:
|
296 |
+
logger.error(f"Failed day: {str(e)}")
|
297 |
+
|
298 |
+
# --- Entry Point ---
|
299 |
+
if __name__ == "__main__":
|
300 |
+
process_date_range.serve(
|
301 |
+
name="gdelt-etl-production-v2",
|
302 |
+
tags=["gdelt", "etl", "production"],
|
303 |
+
)
|
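The flow above loads its runtime configuration from a Prefect JSON block (`gdelt-etl-parameters`) and a Secret block (`huggingface-token`). A one-time setup sketch for registering those blocks, assuming a configured Prefect 2.x environment; the token value is a placeholder:

```python
from prefect.blocks.system import JSON, Secret

# Date range consumed by process_date_range (ISO format, matching the flow's defaults)
JSON(value={
    "start_date": "2020-03-16T00:00:00",
    "end_date": "2020-03-22T00:00:00",
}).save("gdelt-etl-parameters", overwrite=True)

# Hugging Face token used for dataset uploads (placeholder value)
Secret(value="hf_your_token_here").save("huggingface-token", overwrite=True)
```

Once the blocks exist, running the script starts the served deployment defined in its `__main__` block.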
solution_component_notes/hf_gdelt_dataset_2020_covid.md
ADDED
@@ -0,0 +1,166 @@
1 |
+
---
|
2 |
+
license: cc-by-4.0
|
3 |
+
tags:
|
4 |
+
- text
|
5 |
+
- news
|
6 |
+
- global
|
7 |
+
- knowledge-graph
|
8 |
+
- geopolitics
|
9 |
+
dataset_info:
|
10 |
+
features:
|
11 |
+
- name: GKGRECORDID
|
12 |
+
dtype: string
|
13 |
+
- name: DATE
|
14 |
+
dtype: string
|
15 |
+
- name: SourceCollectionIdentifier
|
16 |
+
dtype: string
|
17 |
+
- name: SourceCommonName
|
18 |
+
dtype: string
|
19 |
+
- name: DocumentIdentifier
|
20 |
+
dtype: string
|
21 |
+
- name: V1Counts
|
22 |
+
dtype: string
|
23 |
+
- name: V2.1Counts
|
24 |
+
dtype: string
|
25 |
+
- name: V1Themes
|
26 |
+
dtype: string
|
27 |
+
- name: V2EnhancedThemes
|
28 |
+
dtype: string
|
29 |
+
- name: V1Locations
|
30 |
+
dtype: string
|
31 |
+
- name: V2EnhancedLocations
|
32 |
+
dtype: string
|
33 |
+
- name: V1Persons
|
34 |
+
dtype: string
|
35 |
+
- name: V2EnhancedPersons
|
36 |
+
dtype: string
|
37 |
+
- name: V1Organizations
|
38 |
+
dtype: string
|
39 |
+
- name: V2EnhancedOrganizations
|
40 |
+
dtype: string
|
41 |
+
- name: V1.5Tone
|
42 |
+
dtype: string
|
43 |
+
- name: V2GCAM
|
44 |
+
dtype: string
|
45 |
+
- name: V2.1EnhancedDates
|
46 |
+
dtype: string
|
47 |
+
- name: V2.1Quotations
|
48 |
+
dtype: string
|
49 |
+
- name: V2.1AllNames
|
50 |
+
dtype: string
|
51 |
+
- name: V2.1Amounts
|
52 |
+
dtype: string
|
53 |
+
- name: tone
|
54 |
+
dtype: float64
|
55 |
+
splits:
|
56 |
+
- name: train
|
57 |
+
num_bytes: 3331097194
|
58 |
+
num_examples: 281215
|
59 |
+
- name: negative_tone
|
60 |
+
num_bytes: 3331097194
|
61 |
+
num_examples: 281215
|
62 |
+
download_size: 2229048020
|
63 |
+
dataset_size: 6662194388
|
64 |
+
configs:
|
65 |
+
- config_name: default
|
66 |
+
data_files:
|
67 |
+
- split: train
|
68 |
+
path: data/train-*
|
69 |
+
- split: negative_tone
|
70 |
+
path: data/negative_tone-*
|
71 |
+
---
|
72 |
+
|
73 |
+
# Dataset Card for dwb2023/gdelt-gkg-march2020-v2

## Dataset Details

### Dataset Description

This dataset contains GDELT Global Knowledge Graph (GKG) data covering March 10-22, 2020, during the early phase of the COVID-19 pandemic. It captures global event interactions, actor relationships, and contextual narratives to support temporal, spatial, and thematic analysis.

- **Curated by:** dwb2023

### Dataset Sources

- **Repository:** [http://data.gdeltproject.org/gdeltv2](http://data.gdeltproject.org/gdeltv2)
- **GKG Documentation:** [GDELT 2.0 Overview](https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/), [GDELT GKG Codebook](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf)

## Uses

### Direct Use

This dataset is suitable for:

- Temporal analysis of global events
- Relationship mapping of key actors in supply chain and logistics
- Sentiment and thematic analysis of COVID-19 pandemic narratives

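As a concrete starting point, the records can be pulled straight from the Hub with the `datasets` library. A minimal loading sketch using the split names and the pre-computed `tone` column declared in the front matter; the -5 threshold is purely illustrative, not a dataset convention.

```python
from datasets import load_dataset

# Load the March 2020 GKG subset (roughly 281k records per split).
ds = load_dataset("dwb2023/gdelt-gkg-march2020-v2", split="train")

# The float64 `tone` column supports quick sentiment slicing;
# -5 is an arbitrary illustrative cut-off.
strongly_negative = ds.filter(lambda row: row["tone"] < -5)
print(f"{len(strongly_negative)} strongly negative records out of {len(ds)}")
```
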
### Out-of-Scope Use

- Not designed for real-time monitoring due to its historic and static nature
- Not intended for medical diagnosis or predictive health modeling

## Dataset Structure

### Features and Relationships

This dataset focuses on a subset of features from the source GDELT dataset.

| Name | Type | Aspect | Description |
|------|------|--------|-------------|
| DATE | string | Metadata | Publication date of the article/document |
| SourceCollectionIdentifier | string | Metadata | Unique identifier for the source collection |
| SourceCommonName | string | Metadata | Common/display name of the source |
| DocumentIdentifier | string | Metadata | Unique URL/identifier of the document |
| V1Counts | string | Metrics | Original count mentions of numeric values |
| V2.1Counts | string | Metrics | Enhanced numeric pattern extraction |
| V1Themes | string | Classification | Original thematic categorization |
| V2EnhancedThemes | string | Classification | Expanded theme taxonomy and classification |
| V1Locations | string | Entities | Original geographic mentions |
| V2EnhancedLocations | string | Entities | Enhanced location extraction with coordinates |
| V1Persons | string | Entities | Original person name mentions |
| V2EnhancedPersons | string | Entities | Enhanced person name extraction |
| V1Organizations | string | Entities | Original organization mentions |
| V2EnhancedOrganizations | string | Entities | Enhanced organization name extraction |
| V1.5Tone | string | Sentiment | Original emotional tone scoring |
| V2GCAM | string | Sentiment | Global Content Analysis Measures |
| V2.1EnhancedDates | string | Temporal | Temporal reference extraction |
| V2.1Quotations | string | Content | Direct quote extraction |
| V2.1AllNames | string | Entities | Comprehensive named entity extraction |
| V2.1Amounts | string | Metrics | Quantity and measurement extraction |

### Aspects Overview

- **Metadata**: Core document information
- **Metrics**: Numerical measurements and counts
- **Classification**: Categorical and thematic analysis
- **Entities**: Named entity recognition (locations, persons, organizations)
- **Sentiment**: Emotional and tone analysis
- **Temporal**: Time-related information
- **Content**: Direct content extraction

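Most of the V1/V2 columns above are stored as GDELT's raw delimited strings rather than parsed structures. The sketch below shows one way to unpack two of them, assuming the delimiters described in the GKG codebook linked earlier (semicolon-separated `theme,offset` blocks for V2EnhancedThemes, and a comma-separated numeric list for V1.5Tone whose first value is the average tone). Verify these assumptions against the codebook before relying on them.

```python
def parse_enhanced_themes(raw: str) -> list[tuple[str, int]]:
    """Split a V2EnhancedThemes cell into (theme, character_offset) pairs."""
    pairs = []
    for block in (raw or "").split(";"):
        if not block:
            continue
        theme, _, offset = block.partition(",")
        pairs.append((theme, int(offset) if offset.isdigit() else -1))
    return pairs


def average_tone(raw: str) -> float | None:
    """Return the first value of a V1.5Tone cell, assumed to be the average tone."""
    first = (raw or "").split(",")[0]
    return float(first) if first else None
```
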
## Dataset Creation

### Curation Rationale

This dataset was curated to capture the rapidly evolving global narrative during the early phase of the COVID-19 pandemic, focusing specifically on March 10–22, 2020. By zeroing in on this critical period, it offers a granular perspective on how geopolitical events, actor relationships, and thematic discussions shifted amid the escalating pandemic. The enhanced GKG features further enable advanced entity, sentiment, and thematic analysis, making it a valuable resource for studying the socio-political and economic impacts of COVID-19 during a pivotal point in global history.

### Curation Approach

A targeted subset of GDELT’s columns was selected to streamline analysis on key entities (locations, persons, organizations), thematic tags, and sentiment scores, the core components of many knowledge-graph and text analytics workflows. This approach balances comprehensive coverage with manageable data size and performance. The ETL pipeline used to produce these transformations is documented here:
[https://gist.github.com/donbr/e2af2bbe441f90b8664539a25957a6c0](https://gist.github.com/donbr/e2af2bbe441f90b8664539a25957a6c0).

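The published Parquet shards can also be queried directly with DuckDB rather than through the `datasets` library. A minimal sketch, assuming `duckdb` and `huggingface_hub` are installed and that the shard filenames match the `data/train-*` pattern declared in the front matter; the aggregation itself is purely illustrative.

```python
import duckdb
from huggingface_hub import snapshot_download

# Download only the train split's Parquet shards from the dataset repo.
local_dir = snapshot_download(
    repo_id="dwb2023/gdelt-gkg-march2020-v2",
    repo_type="dataset",
    allow_patterns=["data/train-*"],
)

# Illustrative aggregation: average tone and document count per source.
con = duckdb.connect()
top_sources = con.execute(
    f"""
    SELECT SourceCommonName, AVG(tone) AS avg_tone, COUNT(*) AS n_docs
    FROM read_parquet('{local_dir}/data/train-*.parquet')
    GROUP BY SourceCommonName
    ORDER BY n_docs DESC
    LIMIT 20
    """
).df()
print(top_sources)
```
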
## Citation

When using this dataset, please cite both the dataset and original GDELT project:

```bibtex
@misc{gdelt-gkg-march2020,
  title     = {GDELT Global Knowledge Graph March 2020 Dataset},
  author    = {dwb2023},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/dwb2023/gdelt-gkg-march2020-v2}
}
```

## Dataset Card Contact

For questions and comments about this dataset card, please contact dwb2023 through the Hugging Face platform.
solution_component_notes/hf_gdelt_dataset_2025_february.md
ADDED
@@ -0,0 +1,149 @@
---
license: cc-by-4.0
tags:
- text
- news
- global
- knowledge-graph
- geopolitics
dataset_info:
  features:
  - name: GKGRECORDID
    dtype: string
  - name: DATE
    dtype: string
  - name: SourceCollectionIdentifier
    dtype: string
  - name: SourceCommonName
    dtype: string
  - name: DocumentIdentifier
    dtype: string
  - name: V1Counts
    dtype: string
  - name: V2.1Counts
    dtype: string
  - name: V1Themes
    dtype: string
  - name: V2EnhancedThemes
    dtype: string
  - name: V1Locations
    dtype: string
  - name: V2EnhancedLocations
    dtype: string
  - name: V1Persons
    dtype: string
  - name: V2EnhancedPersons
    dtype: string
  - name: V1Organizations
    dtype: string
  - name: V2EnhancedOrganizations
    dtype: string
  - name: V1.5Tone
    dtype: string
  - name: V2.1EnhancedDates
    dtype: string
  - name: V2GCAM
    dtype: string
  - name: V2.1SharingImage
    dtype: string
  - name: V2.1Quotations
    dtype: string
  - name: V2.1AllNames
    dtype: string
  - name: V2.1Amounts
    dtype: string
---

|
58 |
+
|
59 |
+
## Dataset Details
|
60 |
+
|
61 |
+
### Dataset Description
|
62 |
+
|
63 |
+
This dataset contains GDELT Global Knowledge Graph (GKG) data covering February 2025. It captures global event interactions, actor relationships, and contextual narratives to support temporal, spatial, and thematic analysis.
|
64 |
+
|
65 |
+
- **Curated by:** dwb2023
|
66 |
+
|
67 |
+
### Dataset Sources
|
68 |
+
|
69 |
+
- **Repository:** [http://data.gdeltproject.org/gdeltv2](http://data.gdeltproject.org/gdeltv2)
|
70 |
+
- **GKG Documentation:** [GDELT 2.0 Overview](https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/), [GDELT GKG Codebook](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf)
|
71 |
+
|
72 |
+
## Uses
|
73 |
+
|
74 |
+
### Direct Use
|
75 |
+
|
76 |
+
This dataset is suitable for:
|
77 |
+
|
78 |
+
- Temporal analysis of global events
|
79 |
+
|
80 |
+
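A full month of GKG records is substantially larger than the March 2020 subset, so streaming access is a safer default. A minimal loading sketch, assuming a `train` split (the front matter above does not declare splits) and GDELT's `YYYYMMDDHHMMSS` format for the DATE column:

```python
from datasets import load_dataset

# Stream records instead of downloading the full month up front.
ds = load_dataset("dwb2023/gdelt-gkg-2025-v2", split="train", streaming=True)

# Keep only records timestamped 14 February 2025 (DATE assumed to be a
# YYYYMMDDHHMMSS string, per the GDELT GKG codebook).
feb_14 = ds.filter(lambda row: (row["DATE"] or "").startswith("20250214"))

for record in feb_14.take(5):
    print(record["SourceCommonName"], record["DocumentIdentifier"])
```
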
### Out-of-Scope Use

- Not designed for real-time monitoring due to its historic and static nature
- Not intended for medical diagnosis or predictive health modeling

## Dataset Structure

### Features and Relationships

This dataset focuses on a subset of features from the source GDELT dataset.

| Name | Type | Aspect | Description |
|------|------|--------|-------------|
| DATE | string | Metadata | Publication date of the article/document |
| SourceCollectionIdentifier | string | Metadata | Unique identifier for the source collection |
| SourceCommonName | string | Metadata | Common/display name of the source |
| DocumentIdentifier | string | Metadata | Unique URL/identifier of the document |
| V1Counts | string | Metrics | Original count mentions of numeric values |
| V2.1Counts | string | Metrics | Enhanced numeric pattern extraction |
| V1Themes | string | Classification | Original thematic categorization |
| V2EnhancedThemes | string | Classification | Expanded theme taxonomy and classification |
| V1Locations | string | Entities | Original geographic mentions |
| V2EnhancedLocations | string | Entities | Enhanced location extraction with coordinates |
| V1Persons | string | Entities | Original person name mentions |
| V2EnhancedPersons | string | Entities | Enhanced person name extraction |
| V1Organizations | string | Entities | Original organization mentions |
| V2EnhancedOrganizations | string | Entities | Enhanced organization name extraction |
| V1.5Tone | string | Sentiment | Original emotional tone scoring |
| V2.1EnhancedDates | string | Temporal | Temporal reference extraction |
| V2GCAM | string | Sentiment | Global Content Analysis Measures |
| V2.1SharingImage | string | Content | URL of document image |
| V2.1Quotations | string | Content | Direct quote extraction |
| V2.1AllNames | string | Entities | Comprehensive named entity extraction |
| V2.1Amounts | string | Metrics | Quantity and measurement extraction |

### Aspects Overview

- **Metadata**: Core document information
- **Metrics**: Numerical measurements and counts
- **Classification**: Categorical and thematic analysis
- **Entities**: Named entity recognition (locations, persons, organizations)
- **Sentiment**: Emotional and tone analysis
- **Temporal**: Time-related information
- **Content**: Direct content extraction

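The entity columns above are the natural inputs for graph construction. The sketch below illustrates one way to turn a single record's person and organization mentions into a NetworkX co-occurrence graph; it assumes the `name,offset;name,offset` layout described in the GKG codebook and is only an illustration, not the graph-building logic used elsewhere in this repository. The sample values are placeholders.

```python
import itertools

import networkx as nx


def entity_names(raw: str) -> list[str]:
    """Extract names from a V2EnhancedPersons / V2EnhancedOrganizations cell."""
    return [block.split(",")[0] for block in (raw or "").split(";") if block]


def add_record(graph: nx.Graph, record: dict) -> None:
    """Connect every pair of entities mentioned in the same document."""
    entities = set(entity_names(record.get("V2EnhancedPersons", ""))) | \
               set(entity_names(record.get("V2EnhancedOrganizations", "")))
    for a, b in itertools.combinations(sorted(entities), 2):
        prior = graph.get_edge_data(a, b, default={"weight": 0})["weight"]
        graph.add_edge(a, b, weight=prior + 1)


g = nx.Graph()
add_record(g, {  # placeholder values, not real dataset content
    "V2EnhancedPersons": "Person A,100;Person B,250",
    "V2EnhancedOrganizations": "Organization X,300",
})
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```
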
## Dataset Creation

### Curation Rationale

This dataset was curated to capture the rapidly evolving global narrative during February 2025. By zeroing in on this period, it offers a granular perspective on how geopolitical events, actor relationships, and thematic discussions shifted over the course of the month. The enhanced GKG features further enable advanced entity, sentiment, and thematic analysis, making it a valuable resource for studying the socio-political and economic impacts of emergent LLM capabilities.

### Curation Approach

A targeted subset of GDELT’s columns was selected to streamline analysis on key entities (locations, persons, organizations), thematic tags, and sentiment scores, the core components of many knowledge-graph and text analytics workflows. This approach balances comprehensive coverage with manageable data size and performance. The ETL pipeline used to produce these transformations is documented here:
[https://gist.github.com/donbr/5293468436a1a39bd2d9f4959cbd4923](https://gist.github.com/donbr/5293468436a1a39bd2d9f4959cbd4923).

## Citation

When using this dataset, please cite both the dataset and original GDELT project:

```bibtex
@misc{gdelt-gkg-2025-v2,
  title     = {GDELT Global Knowledge Graph 2025 Dataset},
  author    = {dwb2023},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/dwb2023/gdelt-gkg-2025-v2}
}
```

## Dataset Card Contact

For questions and comments about this dataset card, please contact dwb2023 through the Hugging Face platform.