Commit 3bb5fb5 by dwb2023 (0 parents)

Initial commit for Hugging Face Spaces

.gitignore ADDED
@@ -0,0 +1,6 @@
+ .venv/
+ # add pycache and pyc files
+ __pycache__/
+ *.pyc
+ # add lib directory
+ lib/
.python-version ADDED
@@ -0,0 +1 @@
+ 3.11
README.md ADDED
@@ -0,0 +1,61 @@
+ ---
+ title: >-
+   Unveiling Global Narratives Through Knowledge Graphs: A Case Study Using GDELT
+   and Streamlit
+ emoji: 🔮
+ colorFrom: indigo
+ colorTo: blue
+ sdk: streamlit
+ sdk_version: 1.42.0
+ app_file: app.py
+ pinned: false
+ license: cc-by-4.0
+ short_description: using knowledge graphs for insight
+ ---
+
+ **Title:** Unveiling Global Narratives Through Knowledge Graphs: A Case Study Using GDELT and Streamlit
+ **Keywords:** GDELT, Knowledge Graphs, Network Analysis, Sentiment Analysis, Prefect, Hugging Face datasets, DuckDB, Streamlit, Neo4j, NetworkX, st-link-analysis, streamlit-aggrid, pyvis, pandas
+
+ ## Abstract
+ The global landscape is increasingly shaped by evolving narratives driven by interconnected events and entities. To better understand these dynamics, we introduce **GDELT Insight Explorer**, a knowledge graph-based platform built using Streamlit, DuckDB, and NetworkX. This paper presents a detailed case study on using the platform to analyze GDELT Global Knowledge Graph (GKG) data from March 2020. We focus on uncovering global narratives and relationships between actors and themes during the early phase of the COVID-19 pandemic. Our findings emphasize the utility of real-time event data visualization and network analysis in tracing narrative propagation and identifying key influencers in global events.
+
+ ## 1. Introduction
+ Understanding global narratives requires tools that can capture the complexity of events, their associated entities, and evolving sentiment over time. Traditional tabular analysis methods are often insufficient for capturing these relationships at scale. Knowledge graphs offer a robust solution for modeling and visualizing the interconnected nature of real-world events. This paper documents the development and application of **GDELT Insight Explorer**, a platform designed to leverage GDELT data for interactive exploration and insight generation.
+
+ ## 2. Methodology
+
+ ### 2.1 Data Source and Processing
+ The application is powered by the GDELT Global Knowledge Graph (GKG) dataset, focusing on data from March 10–22, 2020. The dataset includes key features such as themes, locations, persons, organizations, and sentiment scores. Our ETL pipeline, implemented using Prefect and DuckDB, extracts and transforms the data into Parquet format for efficient querying and filtering.
+
+ - **Data Filtering:** We prioritize events with a tone score below -6 to identify highly negative narratives (see the query sketch after this list).
+ - **Data Storage:** DuckDB is used for in-memory querying, enabling real-time analysis of filtered datasets.
+ - **Graph Construction:** NetworkX and Neo4j are employed for graph creation, with relationships categorized into entities such as persons, organizations, and locations.
+
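To make the filtering step concrete, here is a minimal sketch of the kind of tone query the app issues, reading the same published Parquet dataset that `data_access.py` uses; the -6 cutoff and 25-row limit are illustrative values only, not application defaults.

```python
import duckdb

con = duckdb.connect(database=":memory:")
# The Space reads the curated March 2020 GKG slice directly from the Hugging Face Hub.
df = con.execute("""
    SELECT GKGRECORDID, DATE, SourceCommonName, tone
    FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2@~parquet/default/negative_tone/*.parquet')
    WHERE tone <= ?          -- keep only strongly negative coverage
    ORDER BY tone ASC
    LIMIT ?
""", [-6.0, 25]).fetchdf()
con.close()
print(df.head())
```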
+ ### 2.2 Platform Architecture
+ The platform is built using Streamlit, with a modular architecture that supports multiple analysis modes:
+ - **Event Navigator:** Provides a tabular overview of filtered events with interactive search and filtering.
+ - **Event Graph Explorer:** Visualizes events and their associated entities in a graph format.
+ - **Community Detection and Network Analysis:** Employs NetworkX to detect communities and analyze network metrics such as centrality and density.
+
+ ## 3. Findings
+
+ ### 3.1 Narrative Detection and Sentiment Analysis
+ The negative tone filter helped identify early COVID-related narratives, revealing clusters of related events involving key global actors. By visualizing these relationships, we observed recurring themes of public health concerns and geopolitical tensions.
+
+ ### 3.2 Community Detection
+ Using the Louvain method for community detection, we identified cohesive subgroups within the network. These communities often corresponded to specific geographic regions or thematic clusters, providing deeper insights into localized narratives. A minimal example is sketched below.
+
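A rough illustration of this step, assuming a graph built with the repository's `NetworkXBuilder` from records returned by `get_gdelt_data()`; the record limit, tone threshold, and seed are arbitrary example values.

```python
import networkx as nx

from data_access import get_gdelt_data
from graph_builder import NetworkXBuilder

# Build a small entity/event graph from a handful of strongly negative records.
df = get_gdelt_data(limit=25, tone_threshold=-7.0)
G = NetworkXBuilder().build_graph(df)

# Louvain community detection, as used on the network analysis page.
communities = nx.community.louvain_communities(G, seed=42)
for i, community in enumerate(sorted(communities, key=len, reverse=True)[:5]):
    print(f"Community {i}: {len(community)} nodes")
```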
+ ### 3.3 Real-Time Filtering and Exploration
+ The integration of DuckDB allowed for seamless data filtering and exploration within the Streamlit interface. Users could drill down from high-level overviews to individual event records, facilitating rapid insight generation.
+
+ ## 4. Conclusion and Future Work
+ The **GDELT Insight Explorer** demonstrates the potential of combining knowledge graphs and real-time data exploration for uncovering global narratives. Future work will focus on expanding the temporal range of the dataset, integrating additional data sources, and incorporating machine learning models for predictive analysis. The open-source nature of the platform encourages further development and adaptation across different domains.
+
+ ## References
+ 1. GDELT Project. (n.d.). [https://www.gdeltproject.org](https://www.gdeltproject.org)
+ 2. Newman, M. E. J. (2018). *Networks: An Introduction*. Oxford University Press.
+ 3. DuckDB. (n.d.). [https://duckdb.org](https://duckdb.org)
+ 4. Prefect. (n.d.). [https://www.prefect.io](https://www.prefect.io)
+
+ ## Appendix: Application Architecture and Code
+ For implementation details, please refer to the open-source repository: [https://huggingface.co/spaces/dwb2023/insight](https://huggingface.co/spaces/dwb2023/insight).
app.py ADDED
@@ -0,0 +1,24 @@
+ import streamlit as st
+
+ st.set_page_config(
+     page_title="GDELT Insight Explorer",
+     layout="wide",
+     page_icon="🔮"
+ )
+
+ st.title("GDELT Insight Explorer: Unveiling Global Event Narratives")
+ st.markdown("""
+ Welcome to the **GDELT Insight Explorer**, a multi-faceted platform that leverages knowledge graph techniques to analyze global events and trends.
+
+ **How to Get Started:**
+ - Use the sidebar to switch between different analysis modes.
+ - Explore datasets, visualize event relationships, and analyze network structures.
+
+ **Available Pages:**
+ - **🗺️ COVID Navigator:** Dive into curated COVID-related event data.
+ - **🔍 COVID Event Graph Explorer:** Inspect detailed event records and their interconnections.
+ - **🌐 Global Network Analysis:** Visualize and analyze the global network of events.
+ - **🗺️ Feb 2025 Navigator:** Investigate recent event data with advanced filtering.
+ - **🔍 Feb 2025 Event Graph Explorer:** Inspect detailed event records and their interconnections.
+ - **🧪 Feb 2025 Dataset Experimentation:** An experiment that uses the HF dataset directly to investigate the impact on query behavior and performance.
+ """)
data_access.py ADDED
@@ -0,0 +1,118 @@
+ """
+ Data access module for GDELT data retrieval and filtering
+ """
+ import duckdb
+ import pandas as pd
+
+ def get_gdelt_data(
+     limit=10,
+     tone_threshold=-7.0,
+     start_date=None,
+     end_date=None,
+     source_filter=None,
+     themes_filter=None,
+     persons_filter=None,
+     organizations_filter=None,
+     locations_filter=None
+ ):
+     """Get filtered GDELT data from DuckDB with dynamic query parameters."""
+     con = duckdb.connect(database=':memory:')
+
+     # Create view of the dataset
+     con.execute("""
+         CREATE VIEW negative_tone AS (
+             SELECT *
+             FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2@~parquet/default/negative_tone/*.parquet')
+         );
+     """)
+
+     # Base query components
+     base_conditions = [
+         "SourceCollectionIdentifier IS NOT NULL",
+         "DATE IS NOT NULL",
+         "SourceCommonName IS NOT NULL",
+         "DocumentIdentifier IS NOT NULL",
+         "V1Counts IS NOT NULL",
+         "V1Themes IS NOT NULL",
+         "V1Locations IS NOT NULL",
+         "V1Persons IS NOT NULL",
+         "V1Organizations IS NOT NULL",
+         "V2GCAM IS NOT NULL",
+         "\"V2.1Quotations\" IS NOT NULL",
+         "tone <= ?"
+     ]
+     params = [tone_threshold]
+     extra_conditions = []
+
+     # Add optional filters
+     if start_date:
+         extra_conditions.append("DATE >= ?")
+         params.append(start_date)
+     if end_date:
+         extra_conditions.append("DATE <= ?")
+         params.append(end_date)
+     if source_filter:
+         extra_conditions.append("SourceCommonName ILIKE ?")
+         params.append(f"%{source_filter}%")
+     if themes_filter:
+         extra_conditions.append("(V1Themes ILIKE ? OR V2EnhancedThemes ILIKE ?)")
+         params.extend([f"%{themes_filter}%", f"%{themes_filter}%"])
+     if persons_filter:
+         extra_conditions.append("(V1Persons ILIKE ? OR V2EnhancedPersons ILIKE ?)")
+         params.extend([f"%{persons_filter}%", f"%{persons_filter}%"])
+     if organizations_filter:
+         extra_conditions.append("(V1Organizations ILIKE ? OR V2EnhancedOrganizations ILIKE ?)")
+         params.extend([f"%{organizations_filter}%", f"%{organizations_filter}%"])
+     if locations_filter:
+         extra_conditions.append("(V1Locations ILIKE ? OR V2EnhancedLocations ILIKE ?)")
+         params.extend([f"%{locations_filter}%", f"%{locations_filter}%"])
+
+     # Combine all conditions
+     all_conditions = base_conditions + extra_conditions
+     where_clause = " AND ".join(all_conditions) if all_conditions else "1=1"
+
+     # Build final query
+     query = f"""
+         SELECT *
+         FROM negative_tone
+         WHERE {where_clause}
+         LIMIT ?;
+     """
+     params.append(limit)
+
+     # Execute query with parameters
+     results_df = con.execute(query, params).fetchdf()
+     con.close()
+
+     return results_df
+
+ def filter_dataframe(df, source_filter=None, date_filter=None, tone_min=None, tone_max=None):
+     """Filter dataframe based on provided criteria"""
+     display_df = df[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone']].copy()
+     display_df.columns = ['ID', 'Date', 'Source', 'Tone']
+
+     if source_filter:
+         display_df = display_df[display_df['Source'].str.contains(source_filter, case=False, na=False)]
+     if date_filter:
+         display_df = display_df[display_df['Date'].str.contains(date_filter, na=False)]
+     if tone_min is not None and tone_max is not None:
+         display_df = display_df[
+             (display_df['Tone'] >= tone_min) &
+             (display_df['Tone'] <= tone_max)
+         ]
+
+     return display_df
+
+ # Constants for raw data categories
+ GDELT_CATEGORIES = {
+     "Metadata": ["GKGRECORDID", "DATE", "SourceCommonName", "DocumentIdentifier", "V2.1Quotations", "tone"],
+     "Persons": ["V2EnhancedPersons", "V1Persons"],
+     "Organizations": ["V2EnhancedOrganizations", "V1Organizations"],
+     "Locations": ["V2EnhancedLocations", "V1Locations"],
+     "Themes": ["V2EnhancedThemes", "V1Themes"],
+     "Names": ["V2.1AllNames"],
+     "Counts": ["V2.1Counts", "V1Counts"],
+     "Amounts": ["V2.1Amounts"],
+     "V2GCAM": ["V2GCAM"],
+     "V2.1EnhancedDates": ["V2.1EnhancedDates"],
+ }
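A hedged usage sketch for this module; the filter values below are illustrative examples, not application defaults.

```python
from data_access import get_gdelt_data, filter_dataframe

# Pull a small slice of strongly negative March 2020 records mentioning a given organization.
df = get_gdelt_data(limit=25, tone_threshold=-7.0,
                    start_date="20200314", end_date="20200315",
                    organizations_filter="world health organization")

# Narrow to display columns and clamp the tone range for a grid view.
display_df = filter_dataframe(df, source_filter="bbc", tone_min=-10.0, tone_max=-7.0)
print(display_df.head())
```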
graph_builder.py ADDED
@@ -0,0 +1,220 @@
+ """
+ Graph builder module for converting GDELT data to graph formats
+ """
+ import json
+ import logging
+
+ import pandas as pd
+ import networkx as nx
+ from neo4j import GraphDatabase  # required by Neo4jBuilder below
+
+ class GraphBuilder:
+     """Base class for building graph from GDELT data"""
+     def process_entities(self, row):
+         """Process entities from a row and return nodes and relationships"""
+         nodes = []
+         relationships = []
+         event_id = row["GKGRECORDID"]
+         event_date = row["DATE"]
+         event_source = row["SourceCommonName"]
+         event_document_id = row["DocumentIdentifier"]
+         # event_image = row["V2.1SharingImage"] if pd.notna(row["V2.1SharingImage"]) else ""
+         event_quotations = row["V2.1Quotations"] if pd.notna(row["V2.1Quotations"]) else ""
+         event_tone = float(row["tone"]) if pd.notna(row["tone"]) else 0.0
+
+         # Add event node
+         nodes.append({
+             "id": event_id,
+             "type": "event",
+             "properties": {
+                 "date": event_date,
+                 "source": event_source,
+                 "document": event_document_id,
+                 # "image": event_image,
+                 "quotations": event_quotations,
+                 "tone": event_tone
+             }
+         })
+
+         # Process each entity type
+         entity_mappings = {
+             "V2EnhancedPersons": ("Person", "MENTIONED_IN"),
+             "V2EnhancedOrganizations": ("Organization", "MENTIONED_IN"),
+             "V2EnhancedLocations": ("Location", "LOCATED_IN"),
+             "V2EnhancedThemes": ("Theme", "CATEGORIZED_AS"),
+             "V2.1AllNames": ("Name", "MENTIONED_IN"),
+             "V2.1Counts": ("Count", "MENTIONED_IN"),
+             "V2.1Amounts": ("Amount", "MENTIONED_IN"),
+         }
+
+         for field, (label, relationship) in entity_mappings.items():
+             if pd.notna(row[field]):
+                 entities = [e.strip() for e in row[field].split(';') if e.strip()]
+                 for entity in entities:
+                     nodes.append({
+                         "id": entity,
+                         "type": label.lower(),
+                         "properties": {"name": entity}
+                     })
+                     relationships.append({
+                         "from": entity,
+                         "to": event_id,
+                         "type": relationship,
+                         "properties": {"created_at": event_date}
+                     })
+
+         return nodes, relationships
+
+ class NetworkXBuilder(GraphBuilder):
+     """Builder for NetworkX graphs"""
+     def build_graph(self, df):
+         G = nx.Graph()
+
+         for _, row in df.iterrows():
+             nodes, relationships = self.process_entities(row)
+
+             # Add nodes
+             for node in nodes:
+                 G.add_node(node["id"],
+                            type=node["type"],
+                            **node["properties"])
+
+             # Add relationships
+             for rel in relationships:
+                 G.add_edge(rel["from"],
+                            rel["to"],
+                            relationship=rel["type"],
+                            **rel["properties"])
+
+         return G
+
+ class Neo4jBuilder(GraphBuilder):
+     def __init__(self, uri, user, password):
+         self.driver = GraphDatabase.driver(uri, auth=(user, password))
+         self.logger = logging.getLogger(__name__)
+
+     def close(self):
+         self.driver.close()
+
+     def build_graph(self, df):
+         with self.driver.session() as session:
+             for _, row in df.iterrows():
+                 nodes, relationships = self.process_entities(row)
+
+                 # Create nodes and relationships in Neo4j
+                 try:
+                     session.execute_write(self._create_graph_elements,
+                                           nodes, relationships)
+                 except Exception as e:
+                     self.logger.error(f"Error processing row {row['GKGRECORDID']}: {str(e)}")
+
+     def _create_graph_elements(self, tx, nodes, relationships):
+         # Create nodes
+         for node in nodes:
+             query = f"""
+             MERGE (n:{node['type']} {{id: $id}})
+             SET n += $properties
+             """
+             tx.run(query, id=node["id"], properties=node["properties"])
+
+         # Create relationships
+         for rel in relationships:
+             query = f"""
+             MATCH (a {{id: $from_id}})
+             MATCH (b {{id: $to_id}})
+             MERGE (a)-[r:{rel['type']}]->(b)
+             SET r += $properties
+             """
+             tx.run(query,
+                    from_id=rel["from"],
+                    to_id=rel["to"],
+                    properties=rel["properties"])
+
+ class StreamlitGraphBuilder:
+     """Adapted graph builder for Streamlit visualization"""
+     def __init__(self):
+         self.G = nx.Graph()
+
+     def process_row(self, row):
+         """Process a single row of data"""
+         event_id = row["GKGRECORDID"]
+         event_props = {
+             "type": "event",  # already in lowercase
+             "date": row["DATE"],
+             "source": row["SourceCommonName"],
+             "document": row["DocumentIdentifier"],
+             "tone": row["tone"],
+             # Store display name in its original format if needed.
+             "name": row["SourceCommonName"]
+         }
+
+         self.G.add_node(event_id, **event_props)
+
+         # Use lowercase node types for consistency in lookups.
+         entity_types = {
+             "V2EnhancedPersons": ("person", "MENTIONED_IN"),
+             "V2EnhancedOrganizations": ("organization", "MENTIONED_IN"),
+             "V2EnhancedLocations": ("location", "LOCATED_IN"),
+             "V2EnhancedThemes": ("theme", "CATEGORIZED_AS"),
+             "V2.1AllNames": ("name", "MENTIONED_IN"),
+             "V2.1Counts": ("count", "MENTIONED_IN"),
+             "V2.1Amounts": ("amount", "MENTIONED_IN"),
+         }
+
+         for col, (node_type, rel_type) in entity_types.items():
+             if pd.notna(row[col]):
+                 # The actual display value (which may be in its original casing) is preserved in the "name" attribute.
+                 entities = [e.strip() for e in row[col].split(';') if e.strip()]
+                 for entity in entities:
+                     self.G.add_node(entity, type=node_type, name=entity)
+                     self.G.add_edge(entity, event_id,
+                                     relationship=rel_type,
+                                     date=row["DATE"])
+
+ class StLinkBuilder(GraphBuilder):
+     """Builder for st-link-analysis compatible graphs"""
+     def build_graph(self, df):
+         """Build graph in st-link-analysis format"""
+         all_nodes = []
+         all_edges = []
+         edge_counter = 0
+
+         # Track nodes we've already added to avoid duplicates
+         added_nodes = set()
+
+         for _, row in df.iterrows():
+             nodes, relationships = self.process_entities(row)
+
+             # Process nodes
+             for node in nodes:
+                 if node["id"] not in added_nodes:
+                     stlink_node = {
+                         "data": {
+                             "id": str(node["id"]),
+                             "label": node["type"].upper(),
+                             **node["properties"]
+                         }
+                     }
+                     all_nodes.append(stlink_node)
+                     added_nodes.add(node["id"])
+
+             # Process relationships/edges
+             for rel in relationships:
+                 edge_counter += 1
+                 stlink_edge = {
+                     "data": {
+                         "id": f"e{edge_counter}",
+                         "source": str(rel["from"]),
+                         "target": str(rel["to"]),
+                         "label": rel["type"],
+                         **rel["properties"]
+                     }
+                 }
+                 all_edges.append(stlink_edge)
+
+         return {
+             "nodes": all_nodes,
+             "edges": all_edges
+         }
+
+     def write_json(self, graph_data, filename):
+         """Write graph to JSON file"""
+         with open(filename, 'w') as f:
+             json.dump(graph_data, f, indent=2)
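For orientation, a minimal sketch of driving `StLinkBuilder` with a single hand-rolled record; all field values below are toy examples, not real GKG data.

```python
import pandas as pd

from graph_builder import StLinkBuilder

# One GKG-style record carrying the columns process_entities() expects (toy values).
record = {
    "GKGRECORDID": "20200314120000-42",
    "DATE": "20200314120000",
    "SourceCommonName": "example.com",
    "DocumentIdentifier": "https://example.com/story",
    "V2.1Quotations": "",
    "tone": -8.2,
    "V2EnhancedPersons": "Tedros Adhanom",
    "V2EnhancedOrganizations": "World Health Organization",
    "V2EnhancedLocations": "Geneva, Switzerland",
    "V2EnhancedThemes": "HEALTH_PANDEMIC",
    "V2.1AllNames": None,
    "V2.1Counts": None,
    "V2.1Amounts": None,
}

builder = StLinkBuilder()
elements = builder.build_graph(pd.DataFrame([record]))
builder.write_json(elements, "example_graph.json")
```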
graph_config.py ADDED
@@ -0,0 +1,60 @@
+ """
+ Configuration module for graph visualization styles
+ """
+ from st_link_analysis import NodeStyle, EdgeStyle
+
+ # Node styles configuration
+ NODE_STYLES = [
+     NodeStyle("EVENT", "#FF7F3E", "name", "description"),
+     NodeStyle("PERSON", "#4CAF50", "name", "person"),
+     NodeStyle("ORGANIZATION", "#9C27B0", "name", "business"),
+     NodeStyle("LOCATION", "#2196F3", "name", "place"),
+     NodeStyle("THEME", "#FFC107", "name", "sell"),
+     NodeStyle("COUNT", "#795548", "name", "inventory"),
+     NodeStyle("AMOUNT", "#607D8B", "name", "wallet"),
+ ]
+
+ NODE_TYPES = {
+     'event': {
+         'color': '#1f77b4',
+         'description': 'GDELT Events'
+     },
+     'person': {
+         'color': '#2ca02c',
+         'description': 'Named Persons'
+     },
+     'organization': {
+         'color': '#ffa500',
+         'description': 'Organizations'
+     },
+     'location': {
+         'color': '#ff0000',
+         'description': 'Geographic Locations'
+     },
+     'theme': {
+         'color': '#800080',
+         'description': 'Event Themes'
+     }
+ }
+
+ # Edge styles configuration
+ EDGE_STYLES = [
+     EdgeStyle("MENTIONED_IN", caption="label", directed=True),
+     EdgeStyle("LOCATED_IN", caption="label", directed=True),
+     EdgeStyle("CATEGORIZED_AS", caption="label", directed=True)
+ ]
+
+ # Layout options
+ LAYOUT_OPTIONS = ["cose", "circle", "grid", "breadthfirst", "concentric"]
+
+ # Default graph display settings
+ DEFAULT_GRAPH_HEIGHT = 500
+ DEFAULT_LAYOUT = "cose"
+
+ # Column configuration for data grid
+ GRID_COLUMNS = {
+     "ID": {"width": "medium"},
+     "Date": {"width": "small"},
+     "Source": {"width": "medium"},
+     "Tone": {"width": "small", "format": "%.2f"}
+ }
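A short sketch of how these constants are consumed, mirroring the `st_link_analysis` call made in the event graph pages; the two-node `elements` payload here is a hand-written stand-in for `StLinkBuilder.build_graph` output, and the snippet is meant to run inside a Streamlit page.

```python
from st_link_analysis import st_link_analysis

from graph_config import NODE_STYLES, EDGE_STYLES, DEFAULT_LAYOUT

# Minimal elements payload in the same shape StLinkBuilder.build_graph returns.
elements = {
    "nodes": [
        {"data": {"id": "evt-1", "label": "EVENT", "name": "example event", "tone": -8.0}},
        {"data": {"id": "World Health Organization", "label": "ORGANIZATION",
                  "name": "World Health Organization"}},
    ],
    "edges": [
        {"data": {"id": "e1", "source": "World Health Organization",
                  "target": "evt-1", "label": "MENTIONED_IN"}},
    ],
}

st_link_analysis(elements=elements, layout=DEFAULT_LAYOUT,
                 node_styles=NODE_STYLES, edge_styles=EDGE_STYLES)
```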
pages/1_🗺️_COVID_Navigator.py ADDED
@@ -0,0 +1,174 @@
1
+ import streamlit as st
2
+ import duckdb
3
+ import pandas as pd
4
+ from st_aggrid import AgGrid, GridOptionsBuilder, GridUpdateMode
5
+
6
+ # Constants for raw data categories
7
+ GDELT_CATEGORIES = {
8
+ "Metadata": ["GKGRECORDID", "DATE", "SourceCommonName", "DocumentIdentifier", "V2.1Quotations", "tone"],
9
+ "Persons": ["V2EnhancedPersons", "V1Persons"],
10
+ "Organizations": ["V2EnhancedOrganizations", "V1Organizations"],
11
+ "Locations": ["V2EnhancedLocations", "V1Locations"],
12
+ "Themes": ["V2EnhancedThemes", "V1Themes"],
13
+ "Names": ["V2.1AllNames"],
14
+ "Counts": ["V2.1Counts", "V1Counts"],
15
+ "Amounts": ["V2.1Amounts"],
16
+ "V2GCAM": ["V2GCAM"],
17
+ "V2.1EnhancedDates": ["V2.1EnhancedDates"],
18
+ }
19
+
20
+ def initialize_db():
21
+ """Initialize database connection and create dataset view"""
22
+ con = duckdb.connect()
23
+ con.execute("""
24
+ CREATE VIEW negative_tone AS (
25
+ SELECT *
26
+ FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2@~parquet/default/negative_tone/*.parquet')
27
+ );
28
+ """)
29
+ return con
30
+
31
+ def fetch_data(con, source_filter=None, themes_filter=None,
32
+ start_date=None, end_date=None, limit=50, include_all_columns=False):
33
+ """Fetch filtered data from the database"""
34
+ if include_all_columns:
35
+ columns = "*"
36
+ else:
+ # Quote "V2.1Quotations" as an identifier so DuckDB selects the column rather than a string literal.
+ columns = 'GKGRECORDID, DATE, SourceCommonName, tone, DocumentIdentifier, "V2.1Quotations", SourceCollectionIdentifier'
38
+
39
+ query = f"""
40
+ SELECT {columns}
41
+ FROM negative_tone
42
+ WHERE TRUE
43
+ """
44
+ params = []
45
+
46
+ if source_filter:
47
+ query += " AND SourceCommonName ILIKE ?"
48
+ params.append(f"%{source_filter}%")
49
+ if start_date:
50
+ query += " AND DATE >= ?"
51
+ params.append(start_date)
52
+ if end_date:
53
+ query += " AND DATE <= ?"
54
+ params.append(end_date)
55
+ if limit:
56
+ query += f" LIMIT {limit}"
57
+
58
+ try:
59
+ result = con.execute(query, params)
60
+ return result.fetchdf()
61
+ except Exception as e:
62
+ st.error(f"Query execution failed: {str(e)}")
63
+ return pd.DataFrame()
64
+
65
+ def render_data_grid(df):
66
+ """
67
+ Render an interactive data grid (with built‑in filtering) and return the selected row.
68
+ The grid is configured to show only the desired columns (ID, Date, Source, Tone)
69
+ and allow filtering/search on each.
70
+ """
71
+ st.subheader("Search and Filter Records")
72
+
73
+ # Build grid options with AgGrid
74
+ gb = GridOptionsBuilder.from_dataframe(df)
75
+ gb.configure_default_column(filter=True, sortable=True, resizable=True)
76
+ # Enable single row selection
77
+ gb.configure_selection('single', use_checkbox=False)
78
+ grid_options = gb.build()
79
+
80
+ # Render AgGrid (the grid will have a filter field for each column)
81
+ grid_response = AgGrid(
82
+ df,
83
+ gridOptions=grid_options,
84
+ update_mode=GridUpdateMode.SELECTION_CHANGED,
85
+ height=400,
86
+ fit_columns_on_grid_load=True
87
+ )
88
+
89
+ selected = grid_response.get('selected_rows')
90
+ if selected is not None:
91
+ # If selected is a DataFrame, use iloc to get the first row.
92
+ if isinstance(selected, pd.DataFrame):
93
+ if not selected.empty:
94
+ return selected.iloc[0].to_dict()
95
+ # Otherwise, if it's a list, get the first element.
96
+ elif isinstance(selected, list) and len(selected) > 0:
97
+ return selected[0]
98
+ return None
99
+
100
+ def render_raw_data(record):
101
+ """Render raw GDELT data in expandable sections."""
102
+ st.header("Full Record Details")
103
+ for category, fields in GDELT_CATEGORIES.items():
104
+ with st.expander(f"{category}"):
105
+ for field in fields:
106
+ if field in record:
107
+ st.markdown(f"**{field}:**")
108
+ st.text(record[field])
109
+ st.divider()
110
+
111
+ def main():
112
+ st.title("🗺️ COVID Dataset Navigator")
113
+ st.markdown("""
114
+ **Explore and Analyze COVID-19 Event Data**
115
+
116
+ Use the interactive filters on the sidebar to search, sort, and inspect individual records from the GDELT Global Knowledge Graph. Adjust the parameters below to uncover detailed event insights.
117
+ """)
118
+
119
+ # Initialize database connection using context manager
120
+ with initialize_db() as con:
121
+ if con is not None:
122
+ # Add UI components
123
+
124
+ # Sidebar controls
125
+ with st.sidebar:
126
+ st.header("Search Filters")
127
+ source = st.text_input("Filter by source name")
128
+ start_date = st.text_input("Start date (YYYYMMDD)", "20200314")
129
+ end_date = st.text_input("End date (YYYYMMDD)", "20200315")
130
+ limit = st.slider("Number of results to display", 10, 500, 100)
131
+
132
+ # Fetch initial data view
133
+ df_initial = fetch_data(
134
+ con=con,
135
+ source_filter=source,
136
+ start_date=start_date,
137
+ end_date=end_date,
138
+ limit=limit,
139
+ include_all_columns=False
140
+ )
141
+
142
+ # Fetch full records for selection
143
+ df_full = fetch_data(
144
+ con=con,
145
+ source_filter=source,
146
+ start_date=start_date,
147
+ end_date=end_date,
148
+ limit=limit,
149
+ include_all_columns=True
150
+ )
151
+
152
+ # Create a DataFrame for the grid with only the key columns
153
+ grid_df = df_initial[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone', 'DocumentIdentifier', 'SourceCollectionIdentifier']].copy()
154
+ grid_df.columns = ['ID', 'Date', 'Source', 'Tone', 'Doc ID', 'Source Collection ID']
155
+
156
+ # Render the interactive data grid at the top
157
+ selected_row = render_data_grid(grid_df)
158
+
159
+ if selected_row:
160
+ # Find the full record in the original DataFrame using the selected ID
161
+ selected_id = selected_row['ID']
162
+ full_record = df_full[df_full['GKGRECORDID'] == selected_id].iloc[0]
163
+
164
+ # Display the raw data below the grid
165
+ render_raw_data(full_record)
166
+ else:
167
+ st.info("Select a record above to view its complete details.")
168
+ else:
169
+ st.warning("No matching records found.")
170
+
171
+ # Close database connection
172
+ con.close()
173
+
174
+ main()
pages/2_🔍_COVID_Event_Graph.py ADDED
@@ -0,0 +1,189 @@
1
+ import streamlit as st
2
+ import duckdb
3
+ import pandas as pd
4
+ from st_aggrid import AgGrid, GridOptionsBuilder, GridUpdateMode
5
+ from st_link_analysis import st_link_analysis, NodeStyle, EdgeStyle
6
+ from graph_builder import StLinkBuilder
7
+
8
+ # Node styles configuration
9
+ NODE_STYLES = [
10
+ NodeStyle("EVENT", "#FF7F3E", "name", "description"),
11
+ NodeStyle("PERSON", "#4CAF50", "name", "person"),
12
+ NodeStyle("NAME", "#2A629A", "created_at", "badge"),
13
+ NodeStyle("ORGANIZATION", "#9C27B0", "name", "business"),
14
+ NodeStyle("LOCATION", "#2196F3", "name", "place"),
15
+ NodeStyle("THEME", "#FFC107", "name", "sell"),
16
+ NodeStyle("COUNT", "#795548", "name", "inventory"),
17
+ NodeStyle("AMOUNT", "#607D8B", "name", "wallet"),
18
+ ]
19
+
20
+ # Edge styles configuration
21
+ EDGE_STYLES = [
22
+ EdgeStyle("MENTIONED_IN", caption="label", directed=True),
23
+ EdgeStyle("LOCATED_IN", caption="label", directed=True),
24
+ EdgeStyle("CATEGORIZED_AS", caption="label", directed=True)
25
+ ]
26
+
27
+ def initialize_db():
28
+ """Initialize database connection and create dataset view"""
29
+ con = duckdb.connect()
30
+ con.execute("""
31
+ CREATE VIEW negative_tone AS (
32
+ SELECT *
33
+ FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2@~parquet/default/negative_tone/*.parquet')
34
+ );
35
+ """)
36
+ return con
37
+
38
+ def fetch_data(con, source_filter=None,
39
+ start_date=None, end_date=None, limit=50, include_all_columns=False):
40
+ """Fetch filtered data from the database"""
41
+ if include_all_columns:
42
+ columns = "*"
43
+ else:
+ # Quote "V2.1Quotations" as an identifier so DuckDB selects the column rather than a string literal.
+ columns = 'GKGRECORDID, DATE, SourceCommonName, tone, DocumentIdentifier, "V2.1Quotations", SourceCollectionIdentifier'
45
+
46
+ query = f"""
47
+ SELECT {columns}
48
+ FROM negative_tone
49
+ WHERE TRUE
50
+ """
51
+ params = []
52
+
53
+ if source_filter:
54
+ query += " AND SourceCommonName ILIKE ?"
55
+ params.append(f"%{source_filter}%")
56
+ if start_date:
57
+ query += " AND DATE >= ?"
58
+ params.append(start_date)
59
+ if end_date:
60
+ query += " AND DATE <= ?"
61
+ params.append(end_date)
62
+ if limit:
63
+ query += f" LIMIT {limit}"
64
+
65
+ try:
66
+ result = con.execute(query, params)
67
+ return result.fetchdf()
68
+ except Exception as e:
69
+ st.error(f"Query execution failed: {str(e)}")
70
+ return pd.DataFrame()
71
+
72
+ def render_data_grid(df):
73
+ """
74
+ Render an interactive data grid (with built‑in filtering) and return the selected row.
75
+ The grid is configured to show only the desired columns (ID, Date, Source, Tone)
76
+ and allow filtering/search on each.
77
+ """
78
+ st.subheader("Search and Filter Records")
79
+
80
+ # Build grid options with AgGrid
81
+ gb = GridOptionsBuilder.from_dataframe(df)
82
+ gb.configure_default_column(filter=True, sortable=True, resizable=True)
83
+ # Enable single row selection
84
+ gb.configure_selection('single', use_checkbox=False)
85
+ grid_options = gb.build()
86
+
87
+ # Render AgGrid (the grid will have a filter field for each column)
88
+ grid_response = AgGrid(
89
+ df,
90
+ gridOptions=grid_options,
91
+ update_mode=GridUpdateMode.SELECTION_CHANGED,
92
+ height=400,
93
+ fit_columns_on_grid_load=True
94
+ )
95
+
96
+ selected = grid_response.get('selected_rows')
97
+ if selected is not None:
98
+ # If selected is a DataFrame, use iloc to get the first row.
99
+ if isinstance(selected, pd.DataFrame):
100
+ if not selected.empty:
101
+ return selected.iloc[0].to_dict()
102
+ # Otherwise, if it's a list, get the first element.
103
+ elif isinstance(selected, list) and len(selected) > 0:
104
+ return selected[0]
105
+ return None
106
+
107
+ def render_graph(record):
108
+ """
109
+ Render a graph visualization for the selected record.
110
+ Uses StLinkBuilder to convert the record into graph format and then
111
+ displays the graph using st_link_analysis.
112
+ """
113
+ st.subheader(f"Event Graph: {record.get('GKGRECORDID', 'Unknown')}")
114
+ stlink_builder = StLinkBuilder()
115
+ # Convert the record (a Series) into a DataFrame with one row
116
+ record_df = pd.DataFrame([record])
117
+ graph_data = stlink_builder.build_graph(record_df)
118
+ return st_link_analysis(
119
+ elements=graph_data,
120
+ layout="fcose", # Column configuration for data grid - cose, fcose, breadthfirst, cola
121
+ node_styles=NODE_STYLES,
122
+ edge_styles=EDGE_STYLES
123
+ )
124
+
125
+ def main():
126
+ st.title("🔍 COVID Event Graph Explorer")
127
+ st.markdown("""
128
+ **Interactive Event Graph Viewer**
129
+
130
+ Filter and select individual COVID-19 event records to display their detailed graph representations. Analyze relationships between events and associated entities using the interactive graph below.
131
+ """)
132
+
133
+ # Initialize database connection using context manager
134
+ with initialize_db() as con:
135
+ if con is not None:
136
+ # Add UI components
137
+
138
+ # Sidebar controls
139
+ with st.sidebar:
140
+ st.header("Search Filters")
141
+ source = st.text_input("Filter by source name")
142
+ start_date = st.text_input("Start date (YYYYMMDD)", "20200314")
143
+ end_date = st.text_input("End date (YYYYMMDD)", "20200315")
144
+ limit = st.slider("Number of results to display", 10, 500, 100)
145
+
146
+ # Fetch initial data view
147
+ df_initial = fetch_data(
148
+ con=con,
149
+ source_filter=source,
150
+ start_date=start_date,
151
+ end_date=end_date,
152
+ limit=limit,
153
+ include_all_columns=False
154
+ )
155
+
156
+ # Fetch full records for selection
157
+ df_full = fetch_data(
158
+ con=con,
159
+ source_filter=source,
160
+ start_date=start_date,
161
+ end_date=end_date,
162
+ limit=limit,
163
+ include_all_columns=True
164
+ )
165
+
166
+ # Create a DataFrame for the grid with only the key columns
167
+ grid_df = df_initial[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone', 'DocumentIdentifier', 'SourceCollectionIdentifier']].copy()
168
+ grid_df.columns = ['ID', 'Date', 'Source', 'Tone', 'Doc ID', 'Source Collection ID']
169
+
170
+ # Render the interactive data grid at the top
171
+ selected_row = render_data_grid(grid_df)
172
+
173
+ if selected_row:
174
+ # Find the full record in the original DataFrame using the selected ID
175
+ selected_id = selected_row['ID']
176
+ full_record = df_full[df_full['GKGRECORDID'] == selected_id].iloc[0]
177
+
178
+ # Display the graph and raw data below the grid
179
+ render_graph(full_record)
180
+ else:
181
+ st.info("Use the grid filters above to search and select a record.")
182
+
183
+ else:
184
+ st.warning("No matching records found.")
185
+
186
+ # Close database connection
187
+ con.close()
188
+
189
+ main()
pages/3_🌐_COVID_Network_Analysis.py ADDED
@@ -0,0 +1,349 @@
1
+ """
2
+ Network Analysis Page - GDELT Graph Analysis
3
+ This module provides interactive network analysis of GDELT event data.
4
+ """
5
+ import streamlit as st
6
+ import networkx as nx
7
+ from pyvis.network import Network
8
+ import pandas as pd
9
+ from datetime import datetime
10
+ import tempfile
11
+ import json
12
+ from typing import Dict, List, Set, Tuple, Optional
13
+ from pathlib import Path
14
+
15
+ from data_access import get_gdelt_data, filter_dataframe, GDELT_CATEGORIES
16
+ from graph_builder import StreamlitGraphBuilder
17
+ from graph_config import NODE_TYPES
18
+
19
+ # Type aliases for clarity
20
+ NodeID = str
21
+ CommunityID = int
22
+ Community = Set[NodeID]
23
+ Communities = List[Community]
24
+
25
+ def create_legend_html() -> str:
26
+ """Create HTML for the visualization legend."""
27
+ legend_html = """
28
+ <div style="
29
+ position: absolute;
30
+ top: 10px;
31
+ right: 10px;
32
+ background-color: rgba(255, 255, 255, 0.9);
33
+ padding: 10px;
34
+ border-radius: 5px;
35
+ border: 1px solid #ddd;
36
+ z-index: 1000;
37
+ ">
38
+ <h3 style="margin: 0 0 10px 0;">Legend</h3>
39
+ """
40
+
41
+ for node_type, info in NODE_TYPES.items():
42
+ legend_html += f"""
43
+ <div style="margin: 5px 0;">
44
+ <span style="
45
+ display: inline-block;
46
+ width: 12px;
47
+ height: 12px;
48
+ background-color: {info['color']};
49
+ border-radius: 50%;
50
+ margin-right: 5px;
51
+ "></span>
52
+ <span>{info['description']}</span>
53
+ </div>
54
+ """
55
+
56
+ legend_html += "</div>"
57
+ return legend_html
58
+
59
+ class CommunityAnalyzer:
60
+ """Handles community detection and analysis for GDELT network graphs."""
61
+
62
+ def __init__(self, G: nx.Graph):
63
+ self.G = G
64
+ self._communities: Optional[Communities] = None
65
+ self._analysis: Optional[List[Dict]] = None
66
+
67
+ @property
68
+ def communities(self) -> Communities:
69
+ """Cached access to detected communities."""
70
+ if self._communities is None:
71
+ self._communities = nx.community.louvain_communities(self.G)
72
+ return self._communities
73
+
74
+ def analyze_composition(self) -> List[Dict]:
75
+ """Perform detailed analysis of each community's composition."""
76
+ if self._analysis is not None:
77
+ return self._analysis
78
+
79
+ analysis_results = []
80
+
81
+ for idx, community in enumerate(self.communities):
82
+ try:
83
+ # Initialize analysis containers
84
+ node_types = {ntype: 0 for ntype in NODE_TYPES.keys()}
85
+ themes: Set[str] = set()
86
+ entities: Dict[str, int] = {}
87
+
88
+ # Analyze community nodes
89
+ for node in community:
90
+ attrs = self.G.nodes[node]
91
+ node_type = attrs.get('type', 'unknown')
92
+
93
+ # Update type counts
94
+ if node_type in node_types:
95
+ node_types[node_type] += 1
96
+
97
+ # Collect themes
98
+ if node_type == 'theme':
99
+ theme_name = attrs.get('name', '')
100
+ if theme_name:
101
+ themes.add(theme_name)
102
+
103
+ # Track entity connections
104
+ if node_type in {'person', 'organization', 'location'}:
105
+ name = attrs.get('name', node)
106
+ entities[name] = self.G.degree(node)
107
+
108
+ # Calculate community metrics
109
+ subgraph = self.G.subgraph(community)
110
+ n = len(community)
111
+ possible_edges = (n * (n - 1)) / 2 if n > 1 else 0
112
+ density = (subgraph.number_of_edges() / possible_edges) if possible_edges > 0 else 0
113
+
114
+ # Get top entities by degree
115
+ top_entities = dict(sorted(entities.items(), key=lambda x: x[1], reverse=True)[:5])
116
+
117
+ analysis_results.append({
118
+ 'id': idx,
119
+ 'size': len(community),
120
+ 'node_types': node_types,
121
+ 'themes': sorted(themes),
122
+ 'top_entities': top_entities,
123
+ 'density': density,
124
+ 'internal_edges': subgraph.number_of_edges(),
125
+ 'external_edges': sum(1 for u in community
126
+ for v in self.G[u]
127
+ if v not in community)
128
+ })
129
+
130
+ except Exception as e:
131
+ st.error(f"Error analyzing community {idx}: {str(e)}")
132
+ continue
133
+
134
+ self._analysis = analysis_results
135
+ return analysis_results
136
+
137
+ def display_community_analysis(analysis: List[Dict]) -> None:
138
+ """Display detailed community analysis in Streamlit."""
139
+ # Display summary metrics
140
+ total_nodes = sum(comm['size'] for comm in analysis)
141
+ col1, col2, col3 = st.columns(3)
142
+ with col1:
143
+ st.metric("Total Communities", len(analysis))
144
+ with col2:
145
+ st.metric("Total Nodes", total_nodes)
146
+ with col3:
147
+ largest_comm = max(comm['size'] for comm in analysis)
148
+ st.metric("Largest Community", largest_comm)
149
+
150
+ # Display each community in tabs
151
+ st.subheader("Community Details")
152
+ tabs = st.tabs([f"Community {comm['id']}" for comm in analysis])
153
+ for tab, comm in zip(tabs, analysis):
154
+ with tab:
155
+ cols = st.columns(2)
156
+
157
+ # Left column: Composition
158
+ with cols[0]:
159
+ st.subheader("Composition")
160
+ node_types_df = pd.DataFrame([comm['node_types']]).T
161
+ node_types_df.columns = ['Count']
162
+ st.bar_chart(node_types_df)
163
+
164
+ st.markdown("**Metrics:**")
165
+ st.write(f"- Size: {comm['size']} nodes")
166
+ st.write(f"- Density: {comm['density']:.3f}")
167
+ st.write(f"- Internal edges: {comm['internal_edges']}")
168
+ st.write(f"- External edges: {comm['external_edges']}")
169
+ st.write(f"- % of network: {(comm['size']/total_nodes)*100:.1f}%")
170
+
171
+ # Right column: Entities and Themes
172
+ with cols[1]:
173
+ if comm['top_entities']:
174
+ st.subheader("Key Entities")
175
+ for entity, degree in comm['top_entities'].items():
176
+ st.write(f"- {entity} ({degree} connections)")
177
+
178
+ if comm['themes']:
179
+ st.subheader("Themes")
180
+ for theme in sorted(comm['themes']):
181
+ st.write(f"- {theme}")
182
+
183
+ def visualize_with_pyvis(G: nx.Graph, physics: bool = True) -> str:
184
+ """Create interactive PyVis visualization with legend."""
185
+ net = Network(height="600px", width="100%", notebook=False, directed=False)
186
+ net.from_nx(G)
187
+
188
+ # Configure nodes
189
+ for node in net.nodes:
190
+ node_type = node.get("type", "unknown")
191
+ node["color"] = NODE_TYPES.get(node_type, {}).get('color', "#cccccc")
192
+ node["size"] = 20 if node_type == "event" else 15
193
+ title_attrs = {k: v for k, v in node.items() if k != "id"}
194
+ node["title"] = "<br>".join(f"{k}: {v}" for k, v in title_attrs.items())
195
+
196
+ # Configure edges
197
+ for edge in net.edges:
198
+ edge["title"] = edge.get("relationship", "")
199
+ edge["color"] = {"color": "#666666", "opacity": 0.5}
200
+
201
+ # Physics settings
202
+ if physics:
203
+ net.show_buttons(filter_=['physics'])
204
+ else:
205
+ net.toggle_physics(False)
206
+
207
+ # Generate HTML
208
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as f:
209
+ net.save_graph(f.name)
210
+ html_content = Path(f.name).read_text(encoding='utf-8')
211
+
212
+ # Add legend
213
+ legend = create_legend_html()
214
+ html_content = html_content.replace('</body>', f'{legend}</body>')
215
+
216
+ return html_content
217
+
218
+ def main():
219
+ st.title("🌐 Global Network Analysis")
220
+ st.markdown("""
221
+ **Explore Global Event Networks**
222
+
223
+ Dive deep into the interconnected world of negative sentiment events as captured by GDELT. Utilize interactive visualizations and community analysis tools to understand key metrics, structures, and interrelationships.
224
+ """)
225
+
226
+
227
+ # Initialize session state
228
+ if 'vis_html' not in st.session_state:
229
+ st.session_state.vis_html = None
230
+
231
+ # Sidebar controls
232
+ with st.sidebar:
233
+ st.header("Graph Controls")
234
+ limit = st.slider("Max records to load", 1, 25, 5)
235
+ tone_threshold = st.slider("Max tone score", -10.0, -5.0, -7.0)
236
+ show_physics = st.checkbox("Enable physics", value=True)
237
+
238
+ st.header("Advanced Filters")
239
+ source_filter = st.text_input("Filter by source name")
240
+ themes_filter = st.text_input("Filter by theme/keyword")
241
+ start_date = st.text_input("Start date (YYYYMMDD)")
242
+ end_date = st.text_input("End date (YYYYMMDD)")
243
+
244
+ try:
245
+ # Load and process data
246
+ df = get_gdelt_data(
247
+ limit=limit,
248
+ tone_threshold=tone_threshold,
249
+ start_date=start_date if start_date else None,
250
+ end_date=end_date if end_date else None,
251
+ source_filter=source_filter,
252
+ themes_filter=themes_filter
253
+ )
254
+
255
+ # Build graph
256
+ with st.spinner("Building knowledge graph..."):
257
+ builder = StreamlitGraphBuilder()
258
+ for _, row in df.iterrows():
259
+ builder.process_row(row)
260
+ G = builder.G
261
+
262
+ if G.number_of_nodes() == 0:
263
+ st.warning("No data found matching the specified criteria.")
264
+ return
265
+
266
+ # Display basic metrics
267
+ col1, col2, col3 = st.columns(3)
268
+ with col1:
269
+ st.metric("Total Nodes", G.number_of_nodes())
270
+ with col2:
271
+ st.metric("Total Edges", G.number_of_edges())
272
+ with col3:
273
+ event_count = sum(1 for _, attr in G.nodes(data=True)
274
+ if attr.get("type") == "event")
275
+ st.metric("Negative Events", event_count)
276
+
277
+ # Analysis section
278
+ st.header("NetworkX Graph Analysis")
279
+
280
+ # Centrality analysis
281
+ with st.expander("Centrality Analysis"):
282
+ degree_centrality = nx.degree_centrality(G)
283
+ top_nodes = sorted(degree_centrality.items(),
284
+ key=lambda x: x[1], reverse=True)[:5]
285
+
286
+ st.write("Most Connected Nodes:")
287
+ for node, centrality in top_nodes:
288
+ node_type = G.nodes[node].get("type", "unknown")
289
+ st.write(f"- `{node[:30]}` ({node_type}): {centrality:.3f}")
290
+
291
+ # Community analysis
292
+ with st.expander("Community Analysis"):
293
+ try:
294
+ analyzer = CommunityAnalyzer(G)
295
+ analysis = analyzer.analyze_composition()
296
+ display_community_analysis(analysis)
297
+ except Exception as e:
298
+ st.error(f"Community analysis failed: {str(e)}")
299
+ st.error("Please check the graph structure and try again.")
300
+
301
+ # Export options
302
+ st.header("Export Options")
303
+ with st.expander("Export Data"):
304
+ col1, col2, col3 = st.columns(3)
305
+
306
+ with col1:
307
+ # GraphML export
308
+ graphml_string = "".join(nx.generate_graphml(G))
309
+ st.download_button(
310
+ label="Download GraphML",
311
+ data=graphml_string.encode('utf-8'),
312
+ file_name=f"gdelt_graph_{datetime.now().isoformat()}.graphml",
313
+ mime="application/xml"
314
+ )
315
+
316
+ with col2:
317
+ # JSON network export
318
+ json_string = json.dumps(nx.node_link_data(G, edges="edges"))
319
+ st.download_button(
320
+ label="Download JSON",
321
+ data=json_string.encode('utf-8'),
322
+ file_name=f"gdelt_graph_{datetime.now().isoformat()}.json",
323
+ mime="application/json"
324
+ )
325
+
326
+ with col3:
327
+ # Community analysis export
328
+ if 'analysis' in locals():
329
+ analysis_json = json.dumps(analysis, indent=2)
330
+ st.download_button(
331
+ label="Download Analysis",
332
+ data=analysis_json.encode('utf-8'),
333
+ file_name=f"community_analysis_{datetime.now().isoformat()}.json",
334
+ mime="application/json"
335
+ )
336
+
337
+ # Interactive visualization
338
+ st.header("Network Visualization")
339
+ with st.expander("Interactive Network", expanded=False):
340
+ if st.session_state.vis_html is None:
341
+ with st.spinner("Generating visualization..."):
342
+ st.session_state.vis_html = visualize_with_pyvis(G, physics=show_physics)
343
+ st.components.v1.html(st.session_state.vis_html, height=600, scrolling=True)
344
+
345
+ except Exception as e:
346
+ st.error(f"An error occurred: {str(e)}")
347
+ st.error("Please adjust your filters and try again.")
348
+
349
+ main()
pages/4_🗺️_Feb_2025_Navigator.py ADDED
@@ -0,0 +1,185 @@
1
+ import streamlit as st
2
+ import duckdb
3
+ import pandas as pd
4
+ from st_aggrid import AgGrid, GridOptionsBuilder, GridUpdateMode
5
+
6
+ # Constants for raw data categories
7
+ GDELT_CATEGORIES = {
8
+ "Metadata": ["GKGRECORDID", "DATE", "SourceCommonName", "DocumentIdentifier", "V2.1Quotations", "tone"],
9
+ "Persons": ["V2EnhancedPersons", "V1Persons"],
10
+ "Organizations": ["V2EnhancedOrganizations", "V1Organizations"],
11
+ "Locations": ["V2EnhancedLocations", "V1Locations"],
12
+ "Themes": ["V2EnhancedThemes", "V1Themes"],
13
+ "Names": ["V2.1AllNames"],
14
+ "Counts": ["V2.1Counts", "V1Counts"],
15
+ "Amounts": ["V2.1Amounts"],
16
+ "V2GCAM": ["V2GCAM"],
17
+ "V2.1EnhancedDates": ["V2.1EnhancedDates"],
18
+ }
19
+
20
+ def initialize_db():
21
+ """Initialize database connection and create dataset view with optimized tone extraction"""
22
+ con = duckdb.connect()
23
+ con.execute("""
24
+ CREATE VIEW tone_vw AS (
25
+ SELECT
26
+ * EXCLUDE ("V1.5Tone"),
27
+ TRY_CAST(
28
+ CASE
29
+ WHEN POSITION(',' IN "V1.5Tone") > 0
30
+ THEN SUBSTRING("V1.5Tone", 1, POSITION(',' IN "V1.5Tone") - 1)
31
+ ELSE "V1.5Tone"
32
+ END
33
+ AS FLOAT
34
+ ) AS tone
35
+ FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-2025-v2/**/*.parquet')
36
+ );
37
+ """)
38
+ return con
39
+
40
+ def fetch_data(con, source_filter=None,
41
+ start_date=None, end_date=None, limit=10, include_all_columns=False):
42
+ """Fetch filtered data from the database"""
43
+ if include_all_columns:
44
+ columns = "*"
45
+ else:
46
+ # Changed column specification: use double quotes for column names with periods.
47
+ columns = 'GKGRECORDID, DATE, SourceCommonName, tone, DocumentIdentifier, "V2.1SharingImage", "V2.1Quotations", SourceCollectionIdentifier'
48
+
49
+ query = f"""
50
+ SELECT {columns}
51
+ FROM tone_vw
52
+ WHERE TRUE
53
+ """
54
+ params = []
55
+
56
+ if source_filter:
57
+ query += " AND SourceCommonName ILIKE ?"
58
+ params.append(f"%{source_filter}%")
59
+ if start_date:
60
+ query += " AND DATE >= ?"
61
+ params.append(start_date)
62
+ if end_date:
63
+ query += " AND DATE <= ?"
64
+ params.append(end_date)
65
+ if limit:
66
+ query += f" LIMIT {limit}"
67
+
68
+ try:
69
+ result = con.execute(query, params)
70
+ return result.fetchdf()
71
+ except Exception as e:
72
+ st.error(f"Query execution failed: {str(e)}")
73
+ return pd.DataFrame()
74
+
75
+ def render_data_grid(df):
76
+ """
77
+ Render an interactive data grid (with built‑in filtering) and return the selected row.
78
+ The grid is configured to show only the desired columns (ID, Date, Source, Tone)
79
+ and allow filtering/search on each.
80
+ """
81
+ st.subheader("Search and Filter Records")
82
+
83
+ # Build grid options with AgGrid
84
+ gb = GridOptionsBuilder.from_dataframe(df)
85
+ gb.configure_default_column(filter=True, sortable=True, resizable=True)
86
+ # Enable single row selection
87
+ gb.configure_selection('single', use_checkbox=False)
88
+ grid_options = gb.build()
89
+
90
+ # Render AgGrid (the grid will have a filter field for each column)
91
+ grid_response = AgGrid(
92
+ df,
93
+ gridOptions=grid_options,
94
+ update_mode=GridUpdateMode.SELECTION_CHANGED,
95
+ height=400,
96
+ fit_columns_on_grid_load=True
97
+ )
98
+
99
+ selected = grid_response.get('selected_rows')
100
+ if selected is not None:
101
+ # If selected is a DataFrame, use iloc to get the first row.
102
+ if isinstance(selected, pd.DataFrame):
103
+ if not selected.empty:
104
+ return selected.iloc[0].to_dict()
105
+ # Otherwise, if it's a list, get the first element.
106
+ elif isinstance(selected, list) and len(selected) > 0:
107
+ return selected[0]
108
+ return None
109
+
110
+ def render_raw_data(record):
111
+ """Render raw GDELT data in expandable sections."""
112
+ st.header("Full Record Details")
113
+ for category, fields in GDELT_CATEGORIES.items():
114
+ with st.expander(f"{category}"):
115
+ for field in fields:
116
+ if field in record:
117
+ st.markdown(f"**{field}:**")
118
+ st.text(record[field])
119
+ st.divider()
120
+
121
+ def main():
122
+ st.title("🗺️ GDELT Feb 2025 Navigator")
123
+ st.markdown("""
124
+ **Investigate Recent Global Events (Feb 2025)**
125
+
126
+ Leverage advanced filters and interactive grids to explore the latest data from the GDELT Global Knowledge Graph. This navigator is optimized for recent events, offering insights into evolving global narratives.
127
+ """)
128
+
129
+
130
+ # Initialize database connection using context manager
131
+ with initialize_db() as con:
132
+ if con is not None:
133
+ # Add UI components
134
+
135
+ # Sidebar controls
136
+ with st.sidebar:
137
+ st.header("Search Filters")
138
+ source = st.text_input("Filter by source name")
139
+ start_date = st.text_input("Start date (YYYYMMDD)", "20250210")
140
+ end_date = st.text_input("End date (YYYYMMDD)", "20250211")
141
+ limit = st.slider("Number of results to display", 10, 500, 10)
142
+
143
+ # Fetch initial data view
144
+ df_initial = fetch_data(
145
+ con=con,
146
+ source_filter=source,
147
+ start_date=start_date,
148
+ end_date=end_date,
149
+ limit=limit,
150
+ include_all_columns=False
151
+ )
152
+
153
+ # Fetch full records for selection
154
+ df_full = fetch_data(
155
+ con=con,
156
+ source_filter=source,
157
+ start_date=start_date,
158
+ end_date=end_date,
159
+ limit=limit,
160
+ include_all_columns=True
161
+ )
162
+
163
+ # Create a DataFrame for the grid with only the key columns
164
+ grid_df = df_initial[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone', 'DocumentIdentifier', "V2.1SharingImage", 'SourceCollectionIdentifier']].copy()
165
+ grid_df.columns = ['ID', 'Date', 'Source', 'Tone', 'Doc ID', 'Image', 'Source Collection ID']
166
+
167
+ # Render the interactive data grid at the top
168
+ selected_row = render_data_grid(grid_df)
169
+
170
+ if selected_row:
171
+ # Find the full record in the original DataFrame using the selected ID
172
+ selected_id = selected_row['ID']
173
+ full_record = df_full[df_full['GKGRECORDID'] == selected_id].iloc[0]
174
+
175
+ # Display the raw data below the grid
176
+ render_raw_data(full_record)
177
+ else:
178
+ st.info("Select a record above to view its complete details.")
179
+ else:
180
+ st.warning("No matching records found.")
181
+
182
+ # Close database connection
183
+ con.close()
184
+
185
+ main()
pages/5_🔍_Feb_2025_Event_Graph.py ADDED
@@ -0,0 +1,198 @@
1
+ import streamlit as st
2
+ import duckdb
3
+ import pandas as pd
4
+ from st_aggrid import AgGrid, GridOptionsBuilder, GridUpdateMode
5
+ from st_link_analysis import st_link_analysis, NodeStyle, EdgeStyle
6
+ from graph_builder import StLinkBuilder
7
+
8
+ # Node styles configuration
9
+ NODE_STYLES = [
10
+ NodeStyle("EVENT", "#FF7F3E", "name", "description"),
11
+ NodeStyle("PERSON", "#4CAF50", "name", "person"),
12
+ NodeStyle("NAME", "#2A629A", "created_at", "badge"),
13
+ NodeStyle("ORGANIZATION", "#9C27B0", "name", "business"),
14
+ NodeStyle("LOCATION", "#2196F3", "name", "place"),
15
+ NodeStyle("THEME", "#FFC107", "name", "sell"),
16
+ NodeStyle("COUNT", "#795548", "name", "inventory"),
17
+ NodeStyle("AMOUNT", "#607D8B", "name", "wallet"),
18
+ ]
19
+
20
+ # Edge styles configuration
21
+ EDGE_STYLES = [
22
+ EdgeStyle("MENTIONED_IN", caption="label", directed=True),
23
+ EdgeStyle("LOCATED_IN", caption="label", directed=True),
24
+ EdgeStyle("CATEGORIZED_AS", caption="label", directed=True)
25
+ ]
26
+
27
+ def initialize_db():
28
+ """Initialize database connection and create dataset view with optimized tone extraction"""
29
+ con = duckdb.connect()
30
+ con.execute("""
31
+ CREATE VIEW tone_vw AS (
32
+ SELECT
33
+ * EXCLUDE ("V1.5Tone"),
34
+ TRY_CAST(
35
+ CASE
36
+ WHEN POSITION(',' IN "V1.5Tone") > 0
37
+ THEN SUBSTRING("V1.5Tone", 1, POSITION(',' IN "V1.5Tone") - 1)
38
+ ELSE "V1.5Tone"
39
+ END
40
+ AS FLOAT
41
+ ) AS tone
42
+ FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-2025-v2/**/*.parquet')
43
+ );
44
+ """)
45
+ return con
46
+
47
+ def fetch_data(con, source_filter=None,
48
+ start_date=None, end_date=None, limit=50, include_all_columns=False):
49
+ """Fetch filtered data from the database"""
50
+ if include_all_columns:
51
+ columns = "*"
52
+ else:
+ # Quote "V2.1Quotations" as an identifier so DuckDB selects the column rather than a string literal.
+ columns = 'GKGRECORDID, DATE, SourceCommonName, tone, DocumentIdentifier, "V2.1Quotations", SourceCollectionIdentifier'
54
+
55
+ query = f"""
56
+ SELECT {columns}
57
+ FROM tone_vw
58
+ WHERE TRUE
59
+ """
60
+ params = []
61
+
62
+ if source_filter:
63
+ query += " AND SourceCommonName ILIKE ?"
64
+ params.append(f"%{source_filter}%")
65
+ if start_date:
66
+ query += " AND DATE >= ?"
67
+ params.append(start_date)
68
+ if end_date:
69
+ query += " AND DATE <= ?"
70
+ params.append(end_date)
71
+ if limit:
72
+ query += f" LIMIT {limit}"
73
+
74
+ try:
75
+ result = con.execute(query, params)
76
+ return result.fetchdf()
77
+ except Exception as e:
78
+ st.error(f"Query execution failed: {str(e)}")
79
+ return pd.DataFrame()
80
+
81
+ def render_data_grid(df):
82
+ """
83
+ Render an interactive data grid (with built‑in filtering) and return the selected row.
84
+ The grid is configured to show only the desired columns (ID, Date, Source, Tone)
85
+ and allow filtering/search on each.
86
+ """
87
+ st.subheader("Search and Filter Records")
88
+
89
+ # Build grid options with AgGrid
90
+ gb = GridOptionsBuilder.from_dataframe(df)
91
+ gb.configure_default_column(filter=True, sortable=True, resizable=True)
92
+ # Enable single row selection
93
+ gb.configure_selection('single', use_checkbox=False)
94
+ grid_options = gb.build()
95
+
96
+ # Render AgGrid (the grid will have a filter field for each column)
97
+ grid_response = AgGrid(
98
+ df,
99
+ gridOptions=grid_options,
100
+ update_mode=GridUpdateMode.SELECTION_CHANGED,
101
+ height=400,
102
+ fit_columns_on_grid_load=True
103
+ )
104
+
105
+ selected = grid_response.get('selected_rows')
106
+ if selected is not None:
107
+ # If selected is a DataFrame, use iloc to get the first row.
108
+ if isinstance(selected, pd.DataFrame):
109
+ if not selected.empty:
110
+ return selected.iloc[0].to_dict()
111
+ # Otherwise, if it's a list, get the first element.
112
+ elif isinstance(selected, list) and len(selected) > 0:
113
+ return selected[0]
114
+ return None
115
+
116
+ def render_graph(record):
117
+ """
118
+ Render a graph visualization for the selected record.
119
+ Uses StLinkBuilder to convert the record into graph format and then
120
+ displays the graph using st_link_analysis.
121
+ """
122
+ st.subheader(f"Event Graph: {record.get('GKGRECORDID', 'Unknown')}")
123
+ stlink_builder = StLinkBuilder()
124
+ # Convert the record (a Series) into a DataFrame with one row
125
+ record_df = pd.DataFrame([record])
126
+ graph_data = stlink_builder.build_graph(record_df)
127
+ return st_link_analysis(
128
+ elements=graph_data,
129
+ layout="fcose", # Column configuration for data grid - cose, fcose, breadthfirst, cola
130
+ node_styles=NODE_STYLES,
131
+ edge_styles=EDGE_STYLES
132
+ )
133
+
134
+ def main():
135
+ st.title("🔍 GDELT Feb 2025 Event Graph Explorer")
136
+ st.markdown("""
137
+ **Investigate Recent Global Events (Feb 2025) in an Interactive Event Graph Viewer**
138
+
139
+ Filter and select individual event records to display their detailed graph representations. Analyze relationships between events and associated entities using the interactive graph below.
140
+ """)
141
+
142
+ # Initialize database connection using context manager
143
+ with initialize_db() as con:
144
+ if con is not None:
145
+ # Add UI components
146
+
147
+ # Sidebar controls
148
+ with st.sidebar:
149
+ st.header("Search Filters")
150
+ source = st.text_input("Filter by source name")
151
+ start_date = st.text_input("Start date (YYYYMMDD)", "20250210")
152
+ end_date = st.text_input("End date (YYYYMMDD)", "20250211")
153
+ limit = st.slider("Number of results to display", 10, 500, 100)
154
+
155
+ # Fetch initial data view
156
+ df_initial = fetch_data(
157
+ con=con,
158
+ source_filter=source,
159
+ start_date=start_date,
160
+ end_date=end_date,
161
+ limit=limit,
162
+ include_all_columns=False
163
+ )
164
+
165
+ # Fetch full records for selection
166
+ df_full = fetch_data(
167
+ con=con,
168
+ source_filter=source,
169
+ start_date=start_date,
170
+ end_date=end_date,
171
+ limit=limit,
172
+ include_all_columns=True
173
+ )
174
+
175
+ # Create a DataFrame for the grid with only the key columns
176
+ grid_df = df_initial[['GKGRECORDID', 'DATE', 'SourceCommonName', 'tone', 'DocumentIdentifier', 'SourceCollectionIdentifier']].copy()
177
+ grid_df.columns = ['ID', 'Date', 'Source', 'Tone', 'Doc ID', 'Source Collection ID']
178
+
179
+ # Render the interactive data grid at the top
180
+ selected_row = render_data_grid(grid_df)
181
+
182
+ if selected_row:
183
+ # Find the full record in the original DataFrame using the selected ID
184
+ selected_id = selected_row['ID']
185
+ full_record = df_full[df_full['GKGRECORDID'] == selected_id].iloc[0]
186
+
187
+ # Display the graph for the selected record below the grid
188
+ render_graph(full_record)
189
+ else:
190
+ st.info("Use the grid filters above to search and select a record.")
191
+
192
+ else:
193
+ st.warning("No matching records found.")
194
+
195
+ # Close database connection
196
+ con.close()
197
+
198
+ main()
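The `tone_vw` view above extracts the leading value of the comma-separated `V1.5Tone` field and casts it to a float. A standalone sketch of that expression against an illustrative tone string (the sample value and literal subquery are made up for demonstration):

```python
import duckdb

# Apply the core of the tone-extraction expression from initialize_db() to a sample
# "V1.5Tone"-style value (tone, positive score, negative score, polarity, ...).
con = duckdb.connect()
tone = con.execute("""
    SELECT TRY_CAST(SUBSTRING(v, 1, POSITION(',' IN v) - 1) AS FLOAT) AS tone
    FROM (SELECT '-3.5,2.1,5.6,7.7' AS v)
""").fetchone()[0]
print(tone)  # -3.5
con.close()
```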
pages/6_🧪_Feb_2025_Dataset_Explorer.py ADDED
@@ -0,0 +1,250 @@
1
+ import streamlit as st
2
+ import pandas as pd
3
+ from datasets import load_dataset
4
+ import re
5
+ from datetime import datetime, date
6
+ from io import StringIO
7
+ from typing import Optional, Tuple, List, Dict, Any
8
+
9
+ # Constants
10
+ DEFAULT_SAMPLE_SIZE = 1000
11
+ DATE_FORMAT = "%Y%m%d"
12
+ FULL_DATE_FORMAT = f"{DATE_FORMAT}%H%M%S"
13
+
14
+ # Load dataset with enhanced caching and validation
15
+ @st.cache_data(ttl=3600, show_spinner="Loading dataset...")
16
+ def load_data(sample_size: int = DEFAULT_SAMPLE_SIZE) -> pd.DataFrame:
17
+ """
18
+ Load and validate dataset with error handling.
19
+
20
+ Args:
21
+ sample_size (int): Number of records to load
22
+
23
+ Returns:
24
+ pd.DataFrame: Loaded and validated dataframe
25
+ """
26
+ try:
27
+ dataset = load_dataset(
28
+ "dwb2023/gdelt-gkg-2025-v2",
29
+ data_files={
30
+ "train": [
31
+ "gdelt_gkg_20250210.parquet",
32
+ "gdelt_gkg_20250211.parquet",
33
+ ]
34
+ },
35
+ split="train"
36
+ )
37
+ df = pd.DataFrame(dataset)
38
+
39
+ # Basic data validation
40
+ if df.empty:
41
+ st.error("Loaded dataset is empty")
42
+ return pd.DataFrame()
43
+
44
+ if "DATE" not in df.columns:
45
+ st.error("Dataset missing required DATE column")
46
+ return pd.DataFrame()
47
+
48
+ return df
49
+
50
+ except Exception as e:
51
+ st.error(f"Error loading dataset: {str(e)}")
52
+ st.stop()
53
+ return pd.DataFrame()
54
+
55
+ def initialize_app(df: pd.DataFrame) -> None:
56
+ """Initialize the Streamlit app interface."""
57
+ st.title("GDELT GKG 2025 Dataset Explorer")
58
+
59
+ with st.sidebar:
60
+ st.header("Search Criteria")
61
+ st.markdown("🔍 Filter dataset using the controls below")
62
+
63
+ def extract_unique_themes(df: pd.DataFrame, column: str) -> List[str]:
64
+ """
65
+ Extract and clean unique themes from semicolon-separated column.
66
+
67
+ Args:
68
+ df (pd.DataFrame): Input dataframe
69
+ column (str): Column name containing themes
70
+
71
+ Returns:
72
+ List[str]: Sorted list of unique themes
73
+ """
74
+ if df.empty:
75
+ return []
76
+
77
+ return sorted({
78
+ theme.split(",")[0].strip()
79
+ for themes in df[column].dropna().str.split(";")
80
+ for theme in themes if theme.strip()
81
+ })
82
+
83
+ def get_date_range(df: pd.DataFrame, date_col: str) -> Tuple[date, date]:
84
+ """
85
+ Get min/max dates from dataset with fallback defaults.
86
+
87
+ Args:
88
+ df (pd.DataFrame): Input dataframe
89
+ date_col (str): Column name containing dates
90
+
91
+ Returns:
92
+ Tuple[date, date]: (min_date, max_date) as date objects
93
+ """
94
+ try:
95
+ # Convert YYYYMMDDHHMMSS string format to datetime using constant
96
+ dates = pd.to_datetime(df[date_col], format=FULL_DATE_FORMAT)
97
+ return dates.min().date(), dates.max().date()
98
+ except Exception as e:
99
+ st.warning(f"Date range detection failed: {str(e)}")
100
+ return datetime(2025, 2, 10).date(), datetime(2025, 2, 11).date()
101
+
102
+ def create_filters(df: pd.DataFrame) -> Dict[str, Any]:
103
+ """
104
+ Generate sidebar filters and return filter state.
105
+
106
+ Args:
107
+ df (pd.DataFrame): Input dataframe
108
+
109
+ Returns:
110
+ Dict[str, Any]: Dictionary of filter settings
111
+ """
112
+ filters = {}
113
+
114
+ with st.sidebar:
115
+ # Theme multi-select
116
+ filters["themes"] = st.multiselect(
117
+ "V2EnhancedThemes (exact match)",
118
+ options=extract_unique_themes(df, "V2EnhancedThemes"),
119
+ help="Select exact themes to include (supports multiple selection)"
120
+ )
121
+
122
+ # Text-based filters
123
+ text_filters = {
124
+ "source_common_name": ("SourceCommonName", "partial name match"),
125
+ "document_identifier": ("DocumentIdentifier", "partial identifier match"),
126
+ "sharing_image": ("V2.1SharingImage", "partial image URL match")
127
+ }
128
+
129
+ for key, (label, help_text) in text_filters.items():
130
+ filters[key] = st.text_input(
131
+ f"{label} ({help_text})",
132
+ placeholder=f"Enter {help_text}...",
133
+ help=f"Case-insensitive {help_text}"
134
+ )
135
+
136
+ # Date range with dataset-based defaults
137
+ date_col = "DATE"
138
+ min_date, max_date = get_date_range(df, date_col)
139
+
140
+ filters["date_range"] = st.date_input(
141
+ "Date range",
142
+ value=(min_date, max_date),
143
+ min_value=min_date,
144
+ max_value=max_date,
145
+ )
146
+
147
+ # Record limit
148
+ filters["record_limit"] = st.number_input(
149
+ "Max records to display",
150
+ min_value=100,
151
+ max_value=5000,
152
+ value=1000,
153
+ step=100,
154
+ help="Limit results for better performance"
155
+ )
156
+
157
+ return filters
158
+
159
+ def apply_filters(df: pd.DataFrame, filters: Dict[str, Any]) -> pd.DataFrame:
160
+ """
161
+ Apply all filters to dataframe using vectorized operations.
162
+
163
+ Args:
164
+ df (pd.DataFrame): Input dataframe to filter
165
+ filters (Dict[str, Any]): Dictionary containing filter parameters:
166
+ - themes (list): List of themes to match exactly
167
+ - source_common_name (str): Partial match for source name
168
+ - document_identifier (str): Partial match for document ID
169
+ - sharing_image (str): Partial match for image URL
170
+ - date_range (tuple): (start_date, end_date) tuple
171
+ - record_limit (int): Maximum number of records to return
172
+
173
+ Returns:
174
+ pd.DataFrame: Filtered dataframe
175
+ """
176
+ filtered_df = df.copy()
177
+
178
+ # Theme exact match filter - set regex groups to be non-capturing using (?:) syntax
179
+ if filters["themes"]:
180
+ pattern = r'(?:^|;)(?:{})(?:$|,|;)'.format('|'.join(map(re.escape, filters["themes"])))
181
+ filtered_df = filtered_df[filtered_df["V2EnhancedThemes"].str.contains(pattern, na=False)]
182
+
183
+ # Text partial match filters
184
+ text_columns = {
185
+ "source_common_name": "SourceCommonName",
186
+ "document_identifier": "DocumentIdentifier",
187
+ "sharing_image": "V2.1SharingImage"
188
+ }
189
+
190
+ for filter_key, col_name in text_columns.items():
191
+ if value := filters.get(filter_key):
192
+ filtered_df = filtered_df[
193
+ filtered_df[col_name]
194
+ .str.contains(re.escape(value), case=False, na=False)
195
+ ]
196
+
197
+ # Date range filter with validation
198
+ if len(filters["date_range"]) == 2:
199
+ start_date, end_date = filters["date_range"]
200
+
201
+ # Validate date range
202
+ if start_date > end_date:
203
+ st.error("Start date must be before end date")
204
+ return filtered_df
205
+
206
+ date_col = "DATE"
207
+ try:
208
+ # Convert full datetime strings to datetime objects using constant
209
+ date_series = pd.to_datetime(filtered_df[date_col], format=FULL_DATE_FORMAT)
210
+
211
+ # Create timestamps for start/end of day
212
+ start_timestamp = pd.Timestamp(start_date).normalize() # Start of day
213
+ end_timestamp = pd.Timestamp(end_date) + pd.Timedelta(days=1) - pd.Timedelta(seconds=1) # End of day
214
+
215
+ filtered_df = filtered_df[
216
+ (date_series >= start_timestamp) &
217
+ (date_series <= end_timestamp)
218
+ ]
219
+ except Exception as e:
220
+ st.error(f"Error applying date filter: {str(e)}")
221
+ return filtered_df
222
+
223
+ # Apply record limit
224
+ return filtered_df.head(filters["record_limit"])
225
+
226
+ def main():
227
+ """Main application entry point."""
228
+ df = load_data()
229
+ if df.empty:
230
+ st.warning("No data available - check data source")
231
+ return
232
+
233
+ initialize_app(df)
234
+ filters = create_filters(df)
235
+ filtered_df = apply_filters(df, filters)
236
+
237
+ # Display results
238
+ st.subheader(f"Results: {len(filtered_df)} records")
239
+
240
+ st.dataframe(filtered_df, use_container_width=True)
241
+
242
+ st.download_button(
243
+ label="Download CSV",
244
+ data=filtered_df.to_csv(index=False).encode(),
245
+ file_name="filtered_results.csv",
246
+ mime="text/csv",
247
+ help="Download filtered results as CSV"
248
+ )
249
+
250
+ main()
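The exact-match theme filter in `apply_filters` assumes `V2EnhancedThemes` values of the form `THEME,offset;THEME,offset;...`. A small sketch of how that regex behaves (the theme names and sample strings below are illustrative):

```python
import re

# Rebuild the pattern from apply_filters(): a theme matches only when bounded by the
# start of the string or ';' on the left and by ',', ';' or the end on the right.
selected_themes = ["EPU_POLICY", "TAX_FNCACT"]
pattern = r'(?:^|;)(?:{})(?:$|,|;)'.format('|'.join(map(re.escape, selected_themes)))

print(bool(re.search(pattern, "TAX_FNCACT_PRESIDENT,100;EPU_POLICY,250")))  # True: exact EPU_POLICY hit
print(bool(re.search(pattern, "TAX_FNCACT_PRESIDENT,100")))                 # False: prefix match only
```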
requirements.txt ADDED
@@ -0,0 +1,10 @@
1
+ streamlit
2
+ duckdb
3
+ networkx
4
+ pandas
5
+ pyvis
6
+ datasets
7
+ huggingface_hub
8
+ python-dateutil
9
+ st-link-analysis
10
+ streamlit-aggrid
solution_component_notes/gdelt_gkg_duckdb_networkx_v5.ipynb ADDED
@@ -0,0 +1,388 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "N20l3SsqSVUM"
7
+ },
8
+ "source": [
9
+ "## Leveraging DuckDB with HF Datasets - GDELT Global KG\n",
10
+ "\n",
11
+ "This notebook demonstrates how to seamlessly transform **GDELT** knowledge graph data into a coherent format that can be pushed to both **NetworkX** and **Neo4j**. It provides a **referenceable pipeline** for data professionals, researchers, and solution architects who need to:\n",
12
+ "\n",
13
+ "1. **Ingest and Query Data Efficiently** \n",
14
+ " - Utilize **DuckDB** to load just the required portions of large Parquet datasets, enabling targeted data exploration and analysis.\n",
15
+ " - It also allows for iteratively honing in on a specific segment of data using splits - helping to maximize performance / cost / efficiency.\n",
16
+ "\n",
17
+ "2. **Maintain Consistent Graph Modeling** \n",
18
+ " - Leverage a shared parsing and entity extraction layer to build consistent node and relationship structures in both an **in-memory** graph (NetworkX) and a **Neo4j** database. (not a requirement per se - but an approach I wanted to start with)\n",
19
+ "\n",
20
+ "3. **Run Advanced Queries and Analytics** \n",
21
+ " - Illustrate critical tasks like **centrality** and **community detection** to pinpoint influential nodes and groupings, and execute **Cypher** queries for real-time insights.\n",
22
+ "\n",
23
+ "4. **Visualize and Export** \n",
24
+ " - Produce simple web-based **PyVis** visualizations or **matplotlib** plots.\n",
25
+ " - more importantly the data can also be exported in **JSON** and GraphML for integration with other graph tooling. (D3.js, Cytoscape, etc.)"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "code",
30
+ "execution_count": null,
31
+ "metadata": {
32
+ "id": "DCPEB5tpfW44"
33
+ },
34
+ "outputs": [],
35
+ "source": [
36
+ "%pip install -q duckdb networkx pandas neo4j pyvis"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "code",
41
+ "execution_count": null,
42
+ "metadata": {
43
+ "id": "A1vEyOkm7LPV"
44
+ },
45
+ "outputs": [],
46
+ "source": [
47
+ "from google.colab import userdata\n",
48
+ "\n",
49
+ "URI = userdata.get('NEO4J_URI')\n",
50
+ "USER = 'neo4j'\n",
51
+ "PASSWORD = userdata.get('NEO4J_PASSWORD')"
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "code",
56
+ "execution_count": null,
57
+ "metadata": {
58
+ "id": "cm8t66uPy_C7"
59
+ },
60
+ "outputs": [],
61
+ "source": [
62
+ "import duckdb\n",
63
+ "import networkx as nx\n",
64
+ "from neo4j import GraphDatabase\n",
65
+ "import logging\n",
66
+ "from datetime import datetime\n",
67
+ "import pandas as pd\n",
68
+ "from pyvis.network import Network\n",
69
+ "\n",
70
+ "def get_gdelt_data(limit=100):\n",
71
+ " \"\"\"Get data from DuckDB with specified limit\"\"\"\n",
72
+ " con = duckdb.connect(database=':memory:')\n",
73
+ "\n",
74
+ " # Create view of the dataset\n",
75
+ " con.execute(\"\"\"\n",
76
+ " CREATE VIEW train AS (\n",
77
+ " SELECT *\n",
78
+ " FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-march2020-v2/*.parquet')\n",
79
+ " );\n",
80
+ " \"\"\")\n",
81
+ "\n",
82
+ " # Single query with limit\n",
83
+ " query = f\"\"\"\n",
84
+ " SELECT\n",
85
+ " GKGRECORDID,\n",
86
+ " DATE,\n",
87
+ " SourceCommonName,\n",
88
+ " DocumentIdentifier,\n",
89
+ " V2EnhancedPersons,\n",
90
+ " V2EnhancedOrganizations,\n",
91
+ " V2EnhancedLocations,\n",
92
+ " V2EnhancedThemes,\n",
93
+ " CAST(SPLIT_PART(\"V1.5Tone\", ',', 1) AS FLOAT) as tone\n",
94
+ " FROM train\n",
95
+ " LIMIT {limit}\n",
96
+ " \"\"\"\n",
97
+ "\n",
98
+ " results_df = con.execute(query).fetchdf()\n",
99
+ " con.close()\n",
100
+ " return results_df\n",
101
+ "\n",
102
+ "class GraphBuilder:\n",
103
+ " \"\"\"Base class for building graph from GDELT data\"\"\"\n",
104
+ " def process_entities(self, row):\n",
105
+ " \"\"\"Process entities from a row and return nodes and relationships\"\"\"\n",
106
+ " nodes = []\n",
107
+ " relationships = []\n",
108
+ " event_id = row[\"GKGRECORDID\"]\n",
109
+ " event_date = row[\"DATE\"]\n",
110
+ " event_source = row[\"SourceCommonName\"]\n",
111
+ " event_document_id = row[\"DocumentIdentifier\"]\n",
112
+ " event_tone = float(row[\"tone\"]) if pd.notna(row[\"tone\"]) else 0.0\n",
113
+ "\n",
114
+ " # Add event node\n",
115
+ " nodes.append({\n",
116
+ " \"id\": event_id,\n",
117
+ " \"type\": \"event\",\n",
118
+ " \"properties\": {\n",
119
+ " \"date\": event_date,\n",
120
+ " \"source\": event_source,\n",
121
+ " \"document\": event_document_id,\n",
122
+ " \"tone\": event_tone\n",
123
+ " }\n",
124
+ " })\n",
125
+ "\n",
126
+ " # Process each entity type\n",
127
+ " entity_mappings = {\n",
128
+ " \"V2EnhancedPersons\": (\"Person\", \"MENTIONED_IN\"),\n",
129
+ " \"V2EnhancedOrganizations\": (\"Organization\", \"MENTIONED_IN\"),\n",
130
+ " \"V2EnhancedLocations\": (\"Location\", \"LOCATED_IN\"),\n",
131
+ " \"V2EnhancedThemes\": (\"Theme\", \"CATEGORIZED_AS\")\n",
132
+ " }\n",
133
+ "\n",
134
+ " for field, (label, relationship) in entity_mappings.items():\n",
135
+ " if pd.notna(row[field]):\n",
136
+ " entities = [e.strip() for e in row[field].split(';') if e.strip()]\n",
137
+ " for entity in entities:\n",
138
+ " nodes.append({\n",
139
+ " \"id\": entity,\n",
140
+ " \"type\": label.lower(),\n",
141
+ " \"properties\": {\"name\": entity}\n",
142
+ " })\n",
143
+ " relationships.append({\n",
144
+ " \"from\": entity,\n",
145
+ " \"to\": event_id,\n",
146
+ " \"type\": relationship,\n",
147
+ " \"properties\": {\"created_at\": event_date}\n",
148
+ " })\n",
149
+ "\n",
150
+ " return nodes, relationships\n",
151
+ "\n",
152
+ "class NetworkXBuilder(GraphBuilder):\n",
153
+ " def build_graph(self, df):\n",
154
+ " G = nx.Graph()\n",
155
+ "\n",
156
+ " for _, row in df.iterrows():\n",
157
+ " nodes, relationships = self.process_entities(row)\n",
158
+ "\n",
159
+ " # Add nodes\n",
160
+ " for node in nodes:\n",
161
+ " G.add_node(node[\"id\"],\n",
162
+ " type=node[\"type\"],\n",
163
+ " **node[\"properties\"])\n",
164
+ "\n",
165
+ " # Add relationships\n",
166
+ " for rel in relationships:\n",
167
+ " G.add_edge(rel[\"from\"],\n",
168
+ " rel[\"to\"],\n",
169
+ " relationship=rel[\"type\"],\n",
170
+ " **rel[\"properties\"])\n",
171
+ "\n",
172
+ " return G\n",
173
+ "\n",
174
+ "class Neo4jBuilder(GraphBuilder):\n",
175
+ " def __init__(self, uri, user, password):\n",
176
+ " self.driver = GraphDatabase.driver(uri, auth=(user, password))\n",
177
+ " self.logger = logging.getLogger(__name__)\n",
178
+ "\n",
179
+ " def close(self):\n",
180
+ " self.driver.close()\n",
181
+ "\n",
182
+ " def build_graph(self, df):\n",
183
+ " with self.driver.session() as session:\n",
184
+ " for _, row in df.iterrows():\n",
185
+ " nodes, relationships = self.process_entities(row)\n",
186
+ "\n",
187
+ " # Create nodes and relationships in Neo4j\n",
188
+ " try:\n",
189
+ " session.execute_write(self._create_graph_elements,\n",
190
+ " nodes, relationships)\n",
191
+ " except Exception as e:\n",
192
+ " self.logger.error(f\"Error processing row {row['GKGRECORDID']}: {str(e)}\")\n",
193
+ "\n",
194
+ " def _create_graph_elements(self, tx, nodes, relationships):\n",
195
+ " # Create nodes\n",
196
+ " for node in nodes:\n",
197
+ " query = f\"\"\"\n",
198
+ " MERGE (n:{node['type']} {{id: $id}})\n",
199
+ " SET n += $properties\n",
200
+ " \"\"\"\n",
201
+ " tx.run(query, id=node[\"id\"], properties=node[\"properties\"])\n",
202
+ "\n",
203
+ " # Create relationships\n",
204
+ " for rel in relationships:\n",
205
+ " query = f\"\"\"\n",
206
+ " MATCH (a {{id: $from_id}})\n",
207
+ " MATCH (b {{id: $to_id}})\n",
208
+ " MERGE (a)-[r:{rel['type']}]->(b)\n",
209
+ " SET r += $properties\n",
210
+ " \"\"\"\n",
211
+ " tx.run(query,\n",
212
+ " from_id=rel[\"from\"],\n",
213
+ " to_id=rel[\"to\"],\n",
214
+ " properties=rel[\"properties\"])"
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "code",
219
+ "execution_count": null,
220
+ "metadata": {
221
+ "id": "ghbLZNLe23x1"
222
+ },
223
+ "outputs": [],
224
+ "source": [
225
+ "if __name__ == \"__main__\":\n",
226
+ " # Get data once\n",
227
+ " df = get_gdelt_data(limit=25) # Get 25 records\n",
228
+ "\n",
229
+ " # Build NetworkX graph\n",
230
+ " nx_builder = NetworkXBuilder()\n",
231
+ " G = nx_builder.build_graph(df)\n",
232
+ "\n",
233
+ " # Print graph information\n",
234
+ " print(f\"NetworkX Graph Summary:\")\n",
235
+ " print(f\"Nodes: {G.number_of_nodes()}\")\n",
236
+ " print(f\"Edges: {G.number_of_edges()}\")\n",
237
+ "\n",
238
+ " # Print node types distribution\n",
239
+ " node_types = {}\n",
240
+ " for _, attr in G.nodes(data=True):\n",
241
+ " node_type = attr.get('type', 'unknown')\n",
242
+ " node_types[node_type] = node_types.get(node_type, 0) + 1\n",
243
+ "\n",
244
+ " print(\"\\nNode types distribution:\")\n",
245
+ " for ntype, count in node_types.items():\n",
246
+ " print(f\"{ntype}: {count}\")\n",
247
+ "\n",
248
+ " # Build Neo4j graph\n",
249
+ " neo4j_builder = Neo4jBuilder(URI, USER, PASSWORD)\n",
250
+ " try:\n",
251
+ " neo4j_builder.build_graph(df)\n",
252
+ " finally:\n",
253
+ " neo4j_builder.close()"
254
+ ]
255
+ },
256
+ {
257
+ "cell_type": "code",
258
+ "execution_count": null,
259
+ "metadata": {
260
+ "id": "mkJKz_soTsAY"
261
+ },
262
+ "outputs": [],
263
+ "source": [
264
+ "# run cypher query for validation\n",
265
+ "\n",
266
+ "from neo4j import GraphDatabase\n",
267
+ "\n",
268
+ "class Neo4jQuery:\n",
269
+ " def __init__(self, uri, user, password):\n",
270
+ " self.driver = GraphDatabase.driver(uri, auth=(user, password))\n",
271
+ " self.logger = logging.getLogger(__name__)\n",
272
+ "\n",
273
+ " def close(self):\n",
274
+ " self.driver.close()\n",
275
+ "\n",
276
+ " def run_query(self, query):\n",
277
+ " with self.driver.session() as session:\n",
278
+ " result = session.run(query)\n",
279
+ " return result.data()\n",
280
+ "\n",
281
+ "query_1 = \"\"\"\n",
282
+ "// Count nodes by type\n",
283
+ "MATCH (n)\n",
284
+ "RETURN labels(n) as type, count(*) as count\n",
285
+ "ORDER BY count DESC;\n",
286
+ "\"\"\"\n"
287
+ ]
288
+ },
289
+ {
290
+ "cell_type": "code",
291
+ "execution_count": null,
292
+ "metadata": {
293
+ "id": "mrlWADO93ize"
294
+ },
295
+ "outputs": [],
296
+ "source": [
297
+ "def visualize_graph(G, output_file='gdelt_network.html'):\n",
298
+ " \"\"\"Visualize NetworkX graph using Pyvis\"\"\"\n",
299
+ " # Create Pyvis network\n",
300
+ " net = Network(notebook=True,\n",
301
+ " height='750px',\n",
302
+ " width='100%',\n",
303
+ " bgcolor='#ffffff',\n",
304
+ " font_color='#000000')\n",
305
+ "\n",
306
+ " # Configure physics\n",
307
+ " net.force_atlas_2based(gravity=-50,\n",
308
+ " central_gravity=0.01,\n",
309
+ " spring_length=100,\n",
310
+ " spring_strength=0.08,\n",
311
+ " damping=0.4,\n",
312
+ " overlap=0)\n",
313
+ "\n",
314
+ " # Color mapping for node types\n",
315
+ " color_map = {\n",
316
+ " 'event': '#1f77b4', # Blue\n",
317
+ " 'person': '#00ff00', # Green\n",
318
+ " 'organization': '#ffa500', # Orange\n",
319
+ " 'location': '#ff0000', # Red\n",
320
+ " 'theme': '#800080' # Purple\n",
321
+ " }\n",
322
+ "\n",
323
+ " # Add nodes\n",
324
+ " for node, attr in G.nodes(data=True):\n",
325
+ " node_type = attr.get('type', 'unknown')\n",
326
+ " title = f\"Type: {node_type}\\n\"\n",
327
+ " for k, v in attr.items():\n",
328
+ " if k != 'type':\n",
329
+ " title += f\"{k}: {v}\\n\"\n",
330
+ "\n",
331
+ " net.add_node(node,\n",
332
+ " title=title,\n",
333
+ " label=str(node)[:20] + '...' if len(str(node)) > 20 else str(node),\n",
334
+ " color=color_map.get(node_type, '#gray'),\n",
335
+ " size=20 if node_type == 'event' else 15)\n",
336
+ "\n",
337
+ " # Add edges\n",
338
+ " for source, target, attr in G.edges(data=True):\n",
339
+ " net.add_edge(source,\n",
340
+ " target,\n",
341
+ " title=f\"{attr.get('relationship', '')}\\nDate: {attr.get('created_at', '')}\",\n",
342
+ " color='#666666')\n",
343
+ "\n",
344
+ " # Save visualization\n",
345
+ " net.show(output_file)\n",
346
+ " return f\"Graph visualization saved to {output_file}\"\n",
347
+ "\n",
348
+ "# Usage example:\n",
349
+ "if __name__ == \"__main__\":\n",
350
+ " visualize_graph(G)"
351
+ ]
352
+ },
353
+ {
354
+ "cell_type": "code",
355
+ "execution_count": null,
356
+ "metadata": {
357
+ "id": "RqFRO1atnIIT"
358
+ },
359
+ "outputs": [],
360
+ "source": [
361
+ "!pip show duckdb"
362
+ ]
363
+ },
364
+ {
365
+ "cell_type": "code",
366
+ "execution_count": null,
367
+ "metadata": {
368
+ "id": "95ML8u0LnKif"
369
+ },
370
+ "outputs": [],
371
+ "source": []
372
+ }
373
+ ],
374
+ "metadata": {
375
+ "colab": {
376
+ "provenance": []
377
+ },
378
+ "kernelspec": {
379
+ "display_name": "Python 3",
380
+ "name": "python3"
381
+ },
382
+ "language_info": {
383
+ "name": "python"
384
+ }
385
+ },
386
+ "nbformat": 4,
387
+ "nbformat_minor": 0
388
+ }
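The notebook's introduction mentions centrality, community detection, and GraphML/JSON export, but those cells fall outside the excerpt above. A minimal sketch of what they might look like, assuming the NetworkX graph `G` produced earlier by `NetworkXBuilder.build_graph(df)`:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Rank influential nodes by degree centrality (G comes from NetworkXBuilder.build_graph).
centrality = nx.degree_centrality(G)
top_nodes = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
print("Top nodes by degree centrality:", top_nodes)

# Group tightly connected events and entities into communities.
communities = greedy_modularity_communities(G)
print(f"Detected {len(communities)} communities")

# Export for external graph tooling (Gephi, Cytoscape, D3.js, ...).
nx.write_graphml(G, "gdelt_network.graphml")
```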
solution_component_notes/gdelt_prefect_extract_to_hf_ds.py ADDED
@@ -0,0 +1,303 @@
1
+ import os
2
+ import asyncio
3
+ from prefect import flow, task, get_run_logger
4
+ from prefect.tasks import task_input_hash
5
+ from prefect.blocks.system import Secret, JSON
6
+ from prefect.task_runners import ConcurrentTaskRunner
7
+ from prefect.concurrency.sync import concurrency
8
+ from pathlib import Path
9
+ import datetime
10
+ from datetime import timedelta
11
+ import pandas as pd
12
+ from tqdm import tqdm
13
+ from huggingface_hub import HfApi, hf_hub_url, list_datasets
14
+ import requests
15
+ import zipfile
16
+ from typing import List, Dict, Optional
17
+
18
+ # --- Constants ---
19
+ # Target Hugging Face dataset repository; uploads are throttled via the "hf_uploads" global concurrency limit
20
+ REPO_ID = "dwb2023/gdelt-gkg-march2020-v2"
21
+
22
+ BASE_URL = "http://data.gdeltproject.org/gdeltv2"
23
+
24
+ # Complete Column List
25
+ GKG_COLUMNS = [
26
+ 'GKGRECORDID', # Unique identifier
27
+ 'DATE', # Publication date
28
+ 'SourceCollectionIdentifier', # Source type
29
+ 'SourceCommonName', # Source name
30
+ 'DocumentIdentifier', # Document URL/ID
31
+ 'V1Counts', # Counts of various types
32
+ 'V2.1Counts', # Enhanced counts with positions
33
+ 'V1Themes', # Theme tags
34
+ 'V2EnhancedThemes', # Themes with positions
35
+ 'V1Locations', # Location mentions
36
+ 'V2EnhancedLocations', # Locations with positions
37
+ 'V1Persons', # Person names
38
+ 'V2EnhancedPersons', # Persons with positions
39
+ 'V1Organizations', # Organization names
40
+ 'V2EnhancedOrganizations', # Organizations with positions
41
+ 'V1.5Tone', # Emotional dimensions
42
+ 'V2.1EnhancedDates', # Date mentions
43
+ 'V2GCAM', # Global Content Analysis Measures
44
+ 'V2.1SharingImage', # Publisher selected image
45
+ 'V2.1RelatedImages', # Article images
46
+ 'V2.1SocialImageEmbeds', # Social media images
47
+ 'V2.1SocialVideoEmbeds', # Social media videos
48
+ 'V2.1Quotations', # Quote extractions
49
+ 'V2.1AllNames', # Named entities
50
+ 'V2.1Amounts', # Numeric amounts
51
+ 'V2.1TranslationInfo', # Translation metadata
52
+ 'V2ExtrasXML' # Additional XML data
53
+ ]
54
+
55
+ # Priority Columns
56
+ PRIORITY_COLUMNS = [
57
+ 'GKGRECORDID', # Unique identifier
58
+ 'DATE', # Publication date
59
+ 'SourceCollectionIdentifier', # Source type
60
+ 'SourceCommonName', # Source name
61
+ 'DocumentIdentifier', # Document URL/ID
62
+ 'V1Counts', # Numeric mentions
63
+ 'V2.1Counts', # Enhanced counts
64
+ 'V1Themes', # Theme tags
65
+ 'V2EnhancedThemes', # Enhanced themes
66
+ 'V1Locations', # Geographic data
67
+ 'V2EnhancedLocations', # Enhanced locations
68
+ 'V1Persons', # Person mentions
69
+ 'V2EnhancedPersons', # Enhanced persons
70
+ 'V1Organizations', # Organization mentions
71
+ 'V2EnhancedOrganizations', # Enhanced organizations
72
+ 'V1.5Tone', # Sentiment scores
73
+ 'V2.1EnhancedDates', # Date mentions
74
+ 'V2GCAM', # Enhanced sentiment
75
+ 'V2.1Quotations', # Direct quotes
76
+ 'V2.1AllNames', # All named entities
77
+ 'V2.1Amounts' # Numeric data
78
+ ]
79
+
80
+ # --- Tasks ---
81
+
82
+ @task(retries=3, retry_delay_seconds=30, log_prints=True)
83
+ def setup_directories(base_path: Path) -> dict:
84
+ """Create processing directories."""
85
+ logger = get_run_logger()
86
+ try:
87
+ raw_dir = base_path / "gdelt_raw"
88
+ processed_dir = base_path / "gdelt_processed"
89
+ raw_dir.mkdir(parents=True, exist_ok=True)
90
+ processed_dir.mkdir(parents=True, exist_ok=True)
91
+ logger.info("Directories created successfully")
92
+ return {"raw": raw_dir, "processed": processed_dir}
93
+ except Exception as e:
94
+ logger.error(f"Directory creation failed: {str(e)}")
95
+ raise
96
+
97
+ @task(retries=2, log_prints=True)
98
+ def generate_gdelt_urls(start_date: datetime.datetime, end_date: datetime.datetime) -> Dict[datetime.date, List[str]]:
99
+ """
100
+ Generate a dictionary keyed by date. Each value is a list of URLs (one per 15-minute interval).
101
+ """
102
+ logger = get_run_logger()
103
+ url_groups = {}
104
+ try:
105
+ current_date = start_date.date()
106
+ while current_date <= end_date.date():
107
+ urls = [
108
+ f"{BASE_URL}/{current_date.strftime('%Y%m%d')}{hour:02}{minute:02}00.gkg.csv.zip"
109
+ for hour in range(24)
110
+ for minute in [0, 15, 30, 45]
111
+ ]
112
+ url_groups[current_date] = urls
113
+ current_date += timedelta(days=1)
114
+ logger.info(f"Generated URL groups for dates: {list(url_groups.keys())}")
115
+ return url_groups
116
+ except Exception as e:
117
+ logger.error(f"URL generation failed: {str(e)}")
118
+ raise
119
+
120
+ @task(retries=3, retry_delay_seconds=30, log_prints=True)
121
+ def download_file(url: str, raw_dir: Path) -> Path:
122
+ """Download a single CSV (zip) file from the given URL."""
123
+ logger = get_run_logger()
124
+ try:
125
+ response = requests.get(url, timeout=10)
126
+ response.raise_for_status()
127
+ filename = Path(url).name
128
+ zip_path = raw_dir / filename
129
+ with zip_path.open('wb') as f:
130
+ f.write(response.content)
131
+ logger.info(f"Downloaded {filename}")
132
+
133
+ # Optionally, extract the CSV from the ZIP archive.
134
+ with zipfile.ZipFile(zip_path, 'r') as z:
135
+ # Assuming the zip contains one CSV file.
136
+ csv_names = z.namelist()
137
+ if csv_names:
138
+ extracted_csv = raw_dir / csv_names[0]
139
+ z.extractall(path=raw_dir)
140
+ logger.info(f"Extracted {csv_names[0]}")
141
+ return extracted_csv
142
+ else:
143
+ raise ValueError("Zip file is empty.")
144
+ except Exception as e:
145
+ logger.error(f"Error downloading {url}: {str(e)}")
146
+ raise
147
+
148
+ @task(retries=2, log_prints=True)
149
+ def convert_and_filter_combined(csv_paths: List[Path], processed_dir: Path, date: datetime.date) -> Path:
150
+ """
151
+ Combine multiple CSV files (for one day) into a single DataFrame,
152
+ filter to only the required columns, optimize data types,
153
+ and write out as a single Parquet file.
154
+ """
155
+ logger = get_run_logger()
156
+ try:
157
+ dfs = []
158
+ for csv_path in csv_paths:
159
+ df = pd.read_csv(
160
+ csv_path,
161
+ sep='\t',
162
+ names=GKG_COLUMNS,
163
+ dtype='string',
164
+ quoting=3,
165
+ na_values=[''],
166
+ encoding='utf-8',
167
+ encoding_errors='replace'
168
+ )
169
+ dfs.append(df)
170
+ combined_df = pd.concat(dfs, ignore_index=True)
171
+ filtered_df = combined_df[PRIORITY_COLUMNS].copy()
172
+ # Convert the date field to datetime if present; note 'V2.1DATE' is not in PRIORITY_COLUMNS, so this branch is a no-op unless that column is added upstream.
173
+ if 'V2.1DATE' in filtered_df.columns:
174
+ filtered_df['V2.1DATE'] = pd.to_datetime(
175
+ filtered_df['V2.1DATE'], format='%Y%m%d%H%M%S', errors='coerce'
176
+ )
177
+ output_filename = f"gdelt_gkg_{date.strftime('%Y%m%d')}.parquet"
178
+ output_path = processed_dir / output_filename
179
+ filtered_df.to_parquet(output_path, engine='pyarrow', compression='snappy', index=False)
180
+ logger.info(f"Converted and filtered data for {date} into {output_filename}")
181
+ return output_path
182
+ except Exception as e:
183
+ logger.error(f"Error processing CSVs for {date}: {str(e)}")
184
+ raise
185
+
186
+ @task(retries=3, retry_delay_seconds=30, log_prints=True)
187
+ def upload_to_hf(file_path: Path, token: str) -> bool:
188
+ """Upload task with global concurrency limit."""
189
+ logger = get_run_logger()
190
+ try:
191
+ with concurrency("hf_uploads", occupy=1):
192
+ # Enable the optimized HF Transfer backend.
193
+ os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
194
+
195
+ api = HfApi()
196
+ api.upload_file(
197
+ path_or_fileobj=str(file_path),
198
+ path_in_repo=file_path.name,
199
+ repo_id=REPO_ID,
200
+ repo_type="dataset",
201
+ token=token,
202
+ )
203
+ logger.info(f"Uploaded {file_path.name}")
204
+ return True
205
+ except Exception as e:
206
+ logger.error(f"Upload failed for {file_path.name}: {str(e)}")
207
+ raise
208
+
209
+ @task(retries=3, retry_delay_seconds=120, log_prints=True)
210
+ def create_hf_repo(token: str) -> bool:
211
+ """
212
+ Validate that the Hugging Face dataset repository exists; create it if not.
213
+ """
214
+ logger = get_run_logger()
215
+ try:
216
+ api = HfApi()
217
+ datasets = [ds.id for ds in list_datasets(token=token)]
218
+ if REPO_ID in datasets:
219
+ logger.info(f"Dataset repository '{REPO_ID}' already exists.")
220
+ return True
221
+ # Create the repository if it doesn't exist.
222
+ api.create_repo(repo_id=REPO_ID, repo_type="dataset", token=token, private=False)
223
+ logger.info(f"Successfully created dataset repository: {REPO_ID}")
224
+ return True
225
+ except Exception as e:
226
+ logger.error(f"Failed to create or validate dataset repo '{REPO_ID}': {str(e)}")
227
+ raise RuntimeError(f"Repository validation/creation failed for '{REPO_ID}'") from e
228
+
229
+ @flow(name="Process Single Day", log_prints=True)
230
+ def process_single_day(
231
+ date: datetime.date, urls: List[str], directories: dict, hf_token: str
232
+ ) -> bool:
233
+ """
234
+ Process one day's data by:
235
+ 1. Downloading all CSV files concurrently.
236
+ 2. Merging, filtering, and optimizing the CSVs.
237
+ 3. Writing out a single daily Parquet file.
238
+ 4. Uploading the file to the Hugging Face Hub.
239
+ """
240
+ logger = get_run_logger()
241
+ try:
242
+ # Download and process data (unlimited concurrency)
243
+ csv_paths = [download_file(url, directories["raw"]) for url in urls]
244
+ daily_parquet = convert_and_filter_combined(csv_paths, directories["processed"], date)
245
+
246
+ # Upload with global concurrency limit
247
+ upload_to_hf(daily_parquet, hf_token)  # throttled via the "hf_uploads" global concurrency limit
248
+
249
+ logger.info(f"Completed {date}")
250
+ return True
251
+ except Exception as e:
252
+ logger.error(f"Day {date} failed: {str(e)}")
253
+ raise
254
+
255
+ @flow(
256
+ name="Process Date Range",
257
+ task_runner=ConcurrentTaskRunner(), # Parallel subflows
258
+ log_prints=True
259
+ )
260
+ def process_date_range(base_path: Path = Path("data")):
261
+ """
262
+ Main ETL flow:
263
+ 1. Load parameters and credentials.
264
+ 2. Validate (or create) the Hugging Face repository.
265
+ 3. Setup directories.
266
+ 4. Generate URL groups by date.
267
+ 5. Process each day concurrently.
268
+ """
269
+ logger = get_run_logger()
270
+
271
+ # Load parameters from a JSON block.
272
+ json_block = JSON.load("gdelt-etl-parameters")
273
+ params = json_block.value
274
+ start_date = datetime.datetime.fromisoformat(params.get("start_date", "2020-03-16T00:00:00"))
275
+ end_date = datetime.datetime.fromisoformat(params.get("end_date", "2020-03-22T00:00:00"))
276
+
277
+ # Load the Hugging Face token from a Secret block.
278
+ secret_block = Secret.load("huggingface-token")
279
+ hf_token = secret_block.get()
280
+
281
+ # Validate or create the repository.
282
+ create_hf_repo(hf_token)
283
+
284
+ directories = setup_directories(base_path)
285
+ url_groups = generate_gdelt_urls(start_date, end_date)
286
+
287
+ # Process each day as a subflow. Direct subflow calls return their result (a bool here),
+ # not a future, so wrap each call in try/except instead of calling .result().
+ for date, urls in url_groups.items():
+ try:
+ process_single_day(date, urls, directories, hf_token)
+ except Exception as e:
+ logger.error(f"Day {date} failed: {str(e)}")
297
+
298
+ # --- Entry Point ---
299
+ if __name__ == "__main__":
300
+ process_date_range.serve(
301
+ name="gdelt-etl-production-v2",
302
+ tags=["gdelt", "etl", "production"],
303
+ )
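The flow above loads its run parameters and credentials from pre-registered Prefect blocks (`gdelt-etl-parameters` and `huggingface-token`) and relies on a global concurrency limit named `hf_uploads`, which must be created separately (via the Prefect UI or CLI). A one-time setup sketch for the two blocks, with a placeholder token value:

```python
from prefect.blocks.system import JSON, Secret

# Register the parameter block read by process_date_range (dates mirror the flow's defaults).
JSON(value={
    "start_date": "2020-03-16T00:00:00",
    "end_date": "2020-03-22T00:00:00",
}).save("gdelt-etl-parameters", overwrite=True)

# Register the Hugging Face token consumed by upload_to_hf (replace the placeholder).
Secret(value="hf_xxxxxxxxxxxxxxxx").save("huggingface-token", overwrite=True)
```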
solution_component_notes/hf_gdelt_dataset_2020_covid.md ADDED
@@ -0,0 +1,166 @@
1
+ ---
2
+ license: cc-by-4.0
3
+ tags:
4
+ - text
5
+ - news
6
+ - global
7
+ - knowledge-graph
8
+ - geopolitics
9
+ dataset_info:
10
+ features:
11
+ - name: GKGRECORDID
12
+ dtype: string
13
+ - name: DATE
14
+ dtype: string
15
+ - name: SourceCollectionIdentifier
16
+ dtype: string
17
+ - name: SourceCommonName
18
+ dtype: string
19
+ - name: DocumentIdentifier
20
+ dtype: string
21
+ - name: V1Counts
22
+ dtype: string
23
+ - name: V2.1Counts
24
+ dtype: string
25
+ - name: V1Themes
26
+ dtype: string
27
+ - name: V2EnhancedThemes
28
+ dtype: string
29
+ - name: V1Locations
30
+ dtype: string
31
+ - name: V2EnhancedLocations
32
+ dtype: string
33
+ - name: V1Persons
34
+ dtype: string
35
+ - name: V2EnhancedPersons
36
+ dtype: string
37
+ - name: V1Organizations
38
+ dtype: string
39
+ - name: V2EnhancedOrganizations
40
+ dtype: string
41
+ - name: V1.5Tone
42
+ dtype: string
43
+ - name: V2GCAM
44
+ dtype: string
45
+ - name: V2.1EnhancedDates
46
+ dtype: string
47
+ - name: V2.1Quotations
48
+ dtype: string
49
+ - name: V2.1AllNames
50
+ dtype: string
51
+ - name: V2.1Amounts
52
+ dtype: string
53
+ - name: tone
54
+ dtype: float64
55
+ splits:
56
+ - name: train
57
+ num_bytes: 3331097194
58
+ num_examples: 281215
59
+ - name: negative_tone
60
+ num_bytes: 3331097194
61
+ num_examples: 281215
62
+ download_size: 2229048020
63
+ dataset_size: 6662194388
64
+ configs:
65
+ - config_name: default
66
+ data_files:
67
+ - split: train
68
+ path: data/train-*
69
+ - split: negative_tone
70
+ path: data/negative_tone-*
71
+ ---
72
+
73
+ # Dataset Card for dwb2023/gdelt-gkg-march2020-v2
74
+
75
+ ## Dataset Details
76
+
77
+ ### Dataset Description
78
+
79
+ This dataset contains GDELT Global Knowledge Graph (GKG) data covering March 10-22, 2020, during the early phase of the COVID-19 pandemic. It captures global event interactions, actor relationships, and contextual narratives to support temporal, spatial, and thematic analysis.
80
+
81
+ - **Curated by:** dwb2023
82
+
83
+ ### Dataset Sources
84
+
85
+ - **Repository:** [http://data.gdeltproject.org/gdeltv2](http://data.gdeltproject.org/gdeltv2)
86
+ - **GKG Documentation:** [GDELT 2.0 Overview](https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/), [GDELT GKG Codebook](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf)
87
+
88
+ ## Uses
89
+
90
+ ### Direct Use
91
+
92
+ This dataset is suitable for:
93
+
94
+ - Temporal analysis of global events
95
+ - Relationship mapping of key actors in supply chain and logistics
96
+ - Sentiment and thematic analysis of COVID-19 pandemic narratives
97
+
98
+ ### Out-of-Scope Use
99
+
100
+ - Not designed for real-time monitoring due to its historic and static nature
101
+ - Not intended for medical diagnosis or predictive health modeling
102
+
103
+ ## Dataset Structure
104
+
105
+ ### Features and Relationships
106
+
107
+ - this dataset focuses on a subset of features from the source GDELT dataset.
108
+
109
+ | Name | Type | Aspect | Description |
110
+ |------|------|---------|-------------|
111
+ | DATE | string | Metadata | Publication date of the article/document |
112
+ | SourceCollectionIdentifier | string | Metadata | Unique identifier for the source collection |
113
+ | SourceCommonName | string | Metadata | Common/display name of the source |
114
+ | DocumentIdentifier | string | Metadata | Unique URL/identifier of the document |
115
+ | V1Counts | string | Metrics | Original count mentions of numeric values |
116
+ | V2.1Counts | string | Metrics | Enhanced numeric pattern extraction |
117
+ | V1Themes | string | Classification | Original thematic categorization |
118
+ | V2EnhancedThemes | string | Classification | Expanded theme taxonomy and classification |
119
+ | V1Locations | string | Entities | Original geographic mentions |
120
+ | V2EnhancedLocations | string | Entities | Enhanced location extraction with coordinates |
121
+ | V1Persons | string | Entities | Original person name mentions |
122
+ | V2EnhancedPersons | string | Entities | Enhanced person name extraction |
123
+ | V1Organizations | string | Entities | Original organization mentions |
124
+ | V2EnhancedOrganizations | string | Entities | Enhanced organization name extraction |
125
+ | V1.5Tone | string | Sentiment | Original emotional tone scoring |
126
+ | V2GCAM | string | Sentiment | Global Content Analysis Measures |
127
+ | V2.1EnhancedDates | string | Temporal | Temporal reference extraction |
128
+ | V2.1Quotations | string | Content | Direct quote extraction |
129
+ | V2.1AllNames | string | Entities | Comprehensive named entity extraction |
130
+ | V2.1Amounts | string | Metrics | Quantity and measurement extraction |
131
+
132
+ ### Aspects Overview:
133
+ - **Metadata**: Core document information
134
+ - **Metrics**: Numerical measurements and counts
135
+ - **Classification**: Categorical and thematic analysis
136
+ - **Entities**: Named entity recognition (locations, persons, organizations)
137
+ - **Sentiment**: Emotional and tone analysis
138
+ - **Temporal**: Time-related information
139
+ - **Content**: Direct content extraction
140
+
141
+ ## Dataset Creation
142
+
143
+ ### Curation Rationale
144
+ This dataset was curated to capture the rapidly evolving global narrative during the early phase of the COVID-19 pandemic, focusing specifically on March 10–22, 2020. By zeroing in on this critical period, it offers a granular perspective on how geopolitical events, actor relationships, and thematic discussions shifted amid the escalating pandemic. The enhanced GKG features further enable advanced entity, sentiment, and thematic analysis, making it a valuable resource for studying the socio-political and economic impacts of COVID-19 during a pivotal point in global history.
145
+
146
+ ### Curation Approach
147
+ A targeted subset of GDELT’s columns was selected to streamline analysis on key entities (locations, persons, organizations), thematic tags, and sentiment scores—core components of many knowledge-graph and text analytics workflows. This approach balances comprehensive coverage with manageable data size and performance. The ETL pipeline used to produce these transformations is documented here:
148
+ [https://gist.github.com/donbr/e2af2bbe441f90b8664539a25957a6c0](https://gist.github.com/donbr/e2af2bbe441f90b8664539a25957a6c0).
149
+
150
+ ## Citation
151
+
152
+ When using this dataset, please cite both the dataset and original GDELT project:
153
+
154
+ ```bibtex
155
+ @misc{gdelt-gkg-march2020,
156
+ title = {GDELT Global Knowledge Graph March 2020 Dataset},
157
+ author = {dwb2023},
158
+ year = {2025},
159
+ publisher = {Hugging Face},
160
+ url = {https://huggingface.co/datasets/dwb2023/gdelt-gkg-march2020-v2}
161
+ }
162
+ ```
163
+
164
+ ## Dataset Card Contact
165
+
166
+ For questions and comments about this dataset card, please contact dwb2023 through the Hugging Face platform.
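As a quick-start companion to the card above, a minimal loading sketch (assuming the `datasets` library and the split names declared in the YAML header):

```python
from datasets import load_dataset

# Load the pre-filtered negative-tone split described in the dataset card.
ds = load_dataset("dwb2023/gdelt-gkg-march2020-v2", split="negative_tone")
print(ds)
print(ds[0]["SourceCommonName"], ds[0]["tone"])
```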
solution_component_notes/hf_gdelt_dataset_2025_february.md ADDED
@@ -0,0 +1,149 @@
1
+ ---
2
+ license: cc-by-4.0
3
+ tags:
4
+ - text
5
+ - news
6
+ - global
7
+ - knowledge-graph
8
+ - geopolitics
9
+ dataset_info:
10
+ features:
11
+ - name: GKGRECORDID
12
+ dtype: string
13
+ - name: DATE
14
+ dtype: string
15
+ - name: SourceCollectionIdentifier
16
+ dtype: string
17
+ - name: SourceCommonName
18
+ dtype: string
19
+ - name: DocumentIdentifier
20
+ dtype: string
21
+ - name: V1Counts
22
+ dtype: string
23
+ - name: V2.1Counts
24
+ dtype: string
25
+ - name: V1Themes
26
+ dtype: string
27
+ - name: V2EnhancedThemes
28
+ dtype: string
29
+ - name: V1Locations
30
+ dtype: string
31
+ - name: V2EnhancedLocations
32
+ dtype: string
33
+ - name: V1Persons
34
+ dtype: string
35
+ - name: V2EnhancedPersons
36
+ dtype: string
37
+ - name: V1Organizations
38
+ dtype: string
39
+ - name: V2EnhancedOrganizations
40
+ dtype: string
41
+ - name: V1.5Tone
42
+ dtype: string
43
+ - name: V2.1EnhancedDates
44
+ dtype: string
45
+ - name: V2GCAM
46
+ dtype: string
47
+ - name: V2.1SharingImage
48
+ dtype: string
49
+ - name: V2.1Quotations
50
+ dtype: string
51
+ - name: V2.1AllNames
52
+ dtype: string
53
+ - name: V2.1Amounts
54
+ dtype: string
55
+ ---
56
+
57
+ # Dataset Card for dwb2023/gdelt-gkg-2025-v2
58
+
59
+ ## Dataset Details
60
+
61
+ ### Dataset Description
62
+
63
+ This dataset contains GDELT Global Knowledge Graph (GKG) data covering February 2025. It captures global event interactions, actor relationships, and contextual narratives to support temporal, spatial, and thematic analysis.
64
+
65
+ - **Curated by:** dwb2023
66
+
67
+ ### Dataset Sources
68
+
69
+ - **Repository:** [http://data.gdeltproject.org/gdeltv2](http://data.gdeltproject.org/gdeltv2)
70
+ - **GKG Documentation:** [GDELT 2.0 Overview](https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/), [GDELT GKG Codebook](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf)
71
+
72
+ ## Uses
73
+
74
+ ### Direct Use
75
+
76
+ This dataset is suitable for:
77
+
78
+ - Temporal analysis of global events
79
+
80
+ ### Out-of-Scope Use
81
+
82
+ - Not designed for real-time monitoring due to its historic and static nature
83
+ - Not intended for medical diagnosis or predictive health modeling
84
+
85
+ ## Dataset Structure
86
+
87
+ ### Features and Relationships
88
+
89
+ - this dataset focuses on a subset of features from the source GDELT dataset.
90
+
91
+ | Name | Type | Aspect | Description |
92
+ |------|------|---------|-------------|
93
+ | DATE | string | Metadata | Publication date of the article/document |
94
+ | SourceCollectionIdentifier | string | Metadata | Unique identifier for the source collection |
95
+ | SourceCommonName | string | Metadata | Common/display name of the source |
96
+ | DocumentIdentifier | string | Metadata | Unique URL/identifier of the document |
97
+ | V1Counts | string | Metrics | Original count mentions of numeric values |
98
+ | V2.1Counts | string | Metrics | Enhanced numeric pattern extraction |
99
+ | V1Themes | string | Classification | Original thematic categorization |
100
+ | V2EnhancedThemes | string | Classification | Expanded theme taxonomy and classification |
101
+ | V1Locations | string | Entities | Original geographic mentions |
102
+ | V2EnhancedLocations | string | Entities | Enhanced location extraction with coordinates |
103
+ | V1Persons | string | Entities | Original person name mentions |
104
+ | V2EnhancedPersons | string | Entities | Enhanced person name extraction |
105
+ | V1Organizations | string | Entities | Original organization mentions |
106
+ | V2EnhancedOrganizations | string | Entities | Enhanced organization name extraction |
107
+ | V1.5Tone | string | Sentiment | Original emotional tone scoring |
108
+ | V2.1EnhancedDates | string | Temporal | Temporal reference extraction |
109
+ | V2GCAM | string | Sentiment | Global Content Analysis Measures |
110
+ | V2.1SharingImage | string | Content | URL of document image |
111
+ | V2.1Quotations | string | Content | Direct quote extraction |
112
+ | V2.1AllNames | string | Entities | Comprehensive named entity extraction |
113
+ | V2.1Amounts | string | Metrics | Quantity and measurement extraction |
114
+
115
+ ### Aspects Overview:
116
+ - **Metadata**: Core document information
117
+ - **Metrics**: Numerical measurements and counts
118
+ - **Classification**: Categorical and thematic analysis
119
+ - **Entities**: Named entity recognition (locations, persons, organizations)
120
+ - **Sentiment**: Emotional and tone analysis
121
+ - **Temporal**: Time-related information
122
+ - **Content**: Direct content extraction
123
+
124
+ ## Dataset Creation
125
+
126
+ ### Curation Rationale
127
+ This dataset was curated to capture the rapidly evolving global narrative during February 2025. By zeroing in on this period, it offers a granular perspective on how geopolitical events, actor relationships, and thematic discussions shifted from week to week. The enhanced GKG features further enable advanced entity, sentiment, and thematic analysis, making it a valuable resource for studying the socio-political and economic impacts of emergent LLM capabilities.
128
+
129
+ ### Curation Approach
130
+ A targeted subset of GDELT’s columns was selected to streamline analysis on key entities (locations, persons, organizations), thematic tags, and sentiment scores—core components of many knowledge-graph and text analytics workflows. This approach balances comprehensive coverage with manageable data size and performance. The ETL pipeline used to produce these transformations is documented here:
131
+ [https://gist.github.com/donbr/5293468436a1a39bd2d9f4959cbd4923](https://gist.github.com/donbr/5293468436a1a39bd2d9f4959cbd4923).
132
+
133
+ ## Citation
134
+
135
+ When using this dataset, please cite both the dataset and original GDELT project:
136
+
137
+ ```bibtex
138
+ @misc{gdelt-gkg-2025-v2,
139
+ title = {GDELT Global Knowledge Graph 2025 Dataset},
140
+ author = {dwb2023},
141
+ year = {2025},
142
+ publisher = {Hugging Face},
143
+ url = {https://huggingface.co/datasets/dwb2023/gdelt-gkg-2025-v2}
144
+ }
145
+ ```
146
+
147
+ ## Dataset Card Contact
148
+
149
+ For questions and comments about this dataset card, please contact dwb2023 through the Hugging Face platform.
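The February 2025 files can also be queried in place with DuckDB, mirroring the `tone_vw` view used by the Event Graph page; the source-name filter below is illustrative:

```python
import duckdb

# Query the Feb 2025 GKG parquet files directly over the Hugging Face hub.
con = duckdb.connect()
df = con.execute("""
    SELECT GKGRECORDID, DATE, SourceCommonName, DocumentIdentifier
    FROM read_parquet('hf://datasets/dwb2023/gdelt-gkg-2025-v2/**/*.parquet')
    WHERE SourceCommonName ILIKE '%bbc%'
    LIMIT 20
""").fetchdf()
print(df.head())
con.close()
```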