Srihari Thyagarajan committed on
Commit 2ddafd0 · unverified · 2 Parent(s): a797de9 d4325de

Merge pull request #73 from julius383/duckdb_loading_json

Files changed (1)
  1. duckdb/009_loading_json.py +281 -0
duckdb/009_loading_json.py ADDED
@@ -0,0 +1,281 @@
+ # /// script
+ # requires-python = ">=3.12"
+ # dependencies = [
+ #     "duckdb==1.2.1",
+ #     "marimo",
+ #     "polars[pyarrow]==1.25.2",
+ #     "sqlglot==26.11.1",
+ # ]
+ # ///
+ # /// script
+ # requires-python = ">=3.11"
+ # dependencies = [
+ #     "marimo",
+ #     "duckdb==1.2.1",
+ #     "sqlglot==26.11.1",
+ #     "polars[pyarrow]==1.25.2",
+ # ]
+ # ///
+
+ import marimo
+
+ __generated_with = "0.12.8"
+ app = marimo.App(width="medium")
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         # Loading JSON
+
+         DuckDB supports reading and writing JSON through the `json` extension, which is included in most distributions and autoloaded on first use. If it is missing, you can [install and load](https://duckdb.org/docs/stable/data/json/installing_and_loading.html) it manually like any other extension.
+
+         In this tutorial we'll cover four ways to move JSON data in and out of DuckDB:
+
+         - [`FROM`](https://duckdb.org/docs/stable/sql/query_syntax/from.html) clause.
+         - [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function.
+         - [`COPY`](https://duckdb.org/docs/stable/sql/statements/copy#copy--from) statement.
+         - [`IMPORT DATABASE`](https://duckdb.org/docs/stable/sql/statements/export.html) statement.
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## Using `FROM`
+
+         Loading data using `FROM` is simple and straightforward. We use a path or URL to the file we want to load where we'd normally put a table name. When we do this, DuckDB attempts to infer the right way to read the file including the correct format and column types. In most cases this is all we need to load data into DuckDB.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(mo):
+     _df = mo.sql(
+         f"""
+         SELECT * FROM 'https://raw.githubusercontent.com/vega/vega-datasets/refs/heads/main/data/cars.json';
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## Using `read_json`
+
+         For greater control over how the JSON is read, we can call the [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function directly. It accepts a number of arguments; some common ones are:
+
+         - `format='array'` or `format='newline_delimited'` - the former tells DuckDB to read rows from a top-level JSON array, while the latter tells it to read rows from JSON objects separated by newlines (JSONL/NDJSON).
+         - `ignore_errors=true` - skips lines with parse errors when reading newline-delimited JSON.
+         - `columns={columnName: type, ...}` - lets you set types for individual columns manually.
+         - `dateformat` and `timestampformat` - control how DuckDB attempts to parse [Date](https://duckdb.org/docs/stable/sql/data_types/date) and [Timestamp](https://duckdb.org/docs/stable/sql/data_types/timestamp) types. Use the format specifiers listed in the [docs](https://duckdb.org/docs/stable/sql/functions/dateformat.html#format-specifiers).
+
+         We could rewrite the previous query more explicitly as:
+         """
+     )
+     return
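To make the `format` and `ignore_errors` options concrete, here is a minimal sketch that uses only Python's standard `json` module rather than DuckDB itself. It shows the two file layouts and what skipping unparseable lines means. The sample rows and the helper names `parse_array` and `parse_ndjson` are hypothetical, and DuckDB's own reader may differ in detail.

```python
import json

# Hypothetical sample rows, not taken from cars.json.
rows = [
    {"Name": "car a", "Miles_per_Gallon": 18.0},
    {"Name": "car b", "Miles_per_Gallon": None},
]

# Layout 1: a single top-level JSON array holding every row.
array_text = json.dumps(rows)

# Layout 2: newline-delimited JSON (JSONL/NDJSON), one object per line.
ndjson_text = "\n".join(json.dumps(row) for row in rows)


def parse_array(text):
    """Parse a document whose top-level value is an array of row objects."""
    return json.loads(text)


def parse_ndjson(text, ignore_errors=False):
    """Parse one JSON object per non-empty line, optionally skipping bad lines."""
    parsed = []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            if not ignore_errors:
                raise
    return parsed


# A third, malformed line is dropped only when ignore_errors is set.
broken_text = ndjson_text + "\n{not valid json}"

print(parse_array(array_text) == rows)
print(parse_ndjson(ndjson_text) == rows)
print(len(parse_ndjson(broken_text, ignore_errors=True)))
```

Both layouts carry the same rows; only the framing differs, which is why the reader has to be told (or has to infer) which one it is looking at.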
+
+
+ @app.cell
+ def _(mo):
+     cars_df = mo.sql(
+         f"""
+         SELECT *
+         FROM
+             read_json(
+                 'https://raw.githubusercontent.com/vega/vega-datasets/refs/heads/main/data/cars.json',
+                 format = 'array',
+                 columns = {{
+                     Name:'VARCHAR',
+                     Miles_per_Gallon:'FLOAT',
+                     Cylinders:'FLOAT',
+                     Displacement:'FLOAT',
+                     Horsepower:'FLOAT',
+                     Weight_in_lbs:'FLOAT',
+                     Acceleration:'FLOAT',
+                     Year:'DATE',
+                     Origin:'VARCHAR'
+                 }},
+                 dateformat = '%Y-%m-%d'
+             )
+         ;
+         """
+     )
+     return (cars_df,)
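The `dateformat` argument takes strftime-style specifiers. As a rough illustration outside DuckDB, Python's `datetime.strptime` accepts the same `%Y-%m-%d` pattern. The sample string and the helper `parse_date` are hypothetical, and the two specifier sets are similar but should not be assumed identical.

```python
from datetime import date, datetime

# Same pattern that the query above passes as `dateformat`.
DATE_FORMAT = "%Y-%m-%d"


def parse_date(text, fmt=DATE_FORMAT):
    """Parse a date string with the given strftime-style format."""
    return datetime.strptime(text, fmt).date()


# Hypothetical value shaped like a year-month-day string.
sample = "1970-01-01"
parsed = parse_date(sample)
print(parsed)

# A string in a different layout is rejected rather than guessed at.
try:
    parse_date("01/01/1970")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```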
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Besides single files, we can read [multiple files](https://duckdb.org/docs/stable/data/multiple_files/overview.html) at once by passing either a list of files or a UNIX glob pattern.""")
+     return
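As a small illustration of the two ways to name several files, the sketch below uses Python's `pathlib` (not DuckDB) to show that an explicit list and a glob pattern can select the same files. The file names are hypothetical, and DuckDB's glob matching may differ in detail from Python's.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    data_dir = Path(tmp)

    # Hypothetical files: two JSON files plus one unrelated file.
    for name in ("cars_1970.json", "cars_1971.json", "notes.txt"):
        (data_dir / name).write_text("[]")

    # Option 1: name every file explicitly.
    explicit_list = [data_dir / "cars_1970.json", data_dir / "cars_1971.json"]

    # Option 2: let a glob pattern pick the files.
    globbed = sorted(data_dir.glob("cars_*.json"))

    matched_names = [path.name for path in globbed]
    same_files = globbed == explicit_list

print(matched_names)
print(same_files)
```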
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## Using `COPY`
+
+         `COPY` is useful for both importing and exporting data in a variety of formats, including JSON. For example, we can import data into an existing table from a JSON file.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(mo):
+     _df = mo.sql(
+         f"""
+         CREATE OR REPLACE TABLE cars2 (
+             Name VARCHAR,
+             Miles_per_Gallon VARCHAR,
+             Cylinders VARCHAR,
+             Displacement FLOAT,
+             Horsepower FLOAT,
+             Weight_in_lbs FLOAT,
+             Acceleration FLOAT,
+             Year DATE,
+             Origin VARCHAR
+         );
+         """
+     )
+     return (cars2,)
+
+
+ @app.cell
+ def _(cars2, mo):
+     _df = mo.sql(
+         f"""
+         COPY cars2 FROM 'https://raw.githubusercontent.com/vega/vega-datasets/refs/heads/main/data/cars.json' (FORMAT json, ARRAY true, DATEFORMAT '%Y-%m-%d');
+         SELECT * FROM cars2;
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Similarly, we can write data from a table or `SELECT` statement to a JSON file. For example, we can create a new JSONL file containing just the car names and miles per gallon. We first create a temporary directory to avoid cluttering our project directory.""")
+     return
+
+
+ @app.cell
+ def _(Path):
+     from tempfile import TemporaryDirectory
+
+     TMP_DIR = TemporaryDirectory()
+     COPY_PATH = Path(TMP_DIR.name) / "cars_mpg.jsonl"
+     print(COPY_PATH)
+     return COPY_PATH, TMP_DIR, TemporaryDirectory
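The temporary directory exists until `cleanup()` is called, at which point the directory and everything written into it are removed. A minimal standalone sketch of that lifecycle follows; the lowercase names and the file contents are hypothetical stand-ins, not the notebook's own variables or data.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

# Create the directory and build a file path inside it.
tmp_dir = TemporaryDirectory()
copy_path = Path(tmp_dir.name) / "cars_mpg.jsonl"

# Hypothetical stand-in for the file that an export would write.
copy_path.write_text('{"car_name": "example", "mpg": 18.0}\n')
exists_before = copy_path.exists()

# cleanup() deletes the directory together with its contents.
tmp_dir.cleanup()
exists_after = copy_path.exists()

print(exists_before, exists_after)
```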
+
+
+ @app.cell
+ def _(COPY_PATH, cars2, mo):
+     _df = mo.sql(
+         f"""
+         COPY (
+             SELECT
+                 Name AS car_name,
+                 "Miles_per_Gallon" AS mpg
+             FROM cars2
+             WHERE mpg IS NOT null
+         ) TO '{COPY_PATH}' (FORMAT json);
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(COPY_PATH, Path):
+     Path(COPY_PATH).exists()
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## Using `IMPORT DATABASE`
+
+         The last method we can use to load JSON data is the `IMPORT DATABASE` statement. It works in conjunction with `EXPORT DATABASE` to save and load an entire database to and from a directory. For example, let's export our default in-memory database.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(Path, TMP_DIR):
+     EXPORT_PATH = Path(TMP_DIR.name) / "cars_export"
+     print(EXPORT_PATH)
+     return (EXPORT_PATH,)
+
+
+ @app.cell
+ def _(EXPORT_PATH, mo):
+     _df = mo.sql(
+         f"""
+         EXPORT DATABASE '{EXPORT_PATH}' (FORMAT json);
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(EXPORT_PATH, Path):
+     list(Path(EXPORT_PATH).iterdir())
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""We can then load the database back into DuckDB.""")
+     return
+
+
+ @app.cell
+ def _(EXPORT_PATH, mo):
+     _df = mo.sql(
+         f"""
+         DROP TABLE IF EXISTS cars2;
+         IMPORT DATABASE '{EXPORT_PATH}';
+         SELECT * FROM cars2;
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(TMP_DIR):
+     TMP_DIR.cleanup()
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## Further Reading
+
+         - Complete information on JSON support in DuckDB can be found in the official [documentation](https://duckdb.org/docs/stable/data/json/overview.html).
+         - You can also learn more about using SQL in marimo from the [examples](https://github.com/marimo-team/marimo/tree/main/examples/sql).
+         """
+     )
+     return
+
+
+ @app.cell
+ def _():
+     import marimo as mo
+     from pathlib import Path
+     return Path, mo
+
+
+ if __name__ == "__main__":
+     app.run()