{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PySpark Data Engineering Assessment\n",
"\n",
"## Tasks\n",
"\n",
"1. Read the CSV data (in `../data/titanic.csv`) into:\n",
" - a Pandas DataFrame\n",
" - a Spark DataFrame\n",
"\n",
"2. Perform some data cleaning (e.g., drop rows with nulls in `Age` or `Fare`).\n",
"\n",
"3. Run basic aggregations, for example:\n",
" - the average `Fare` by `Pclass`\n",
" - the survival rate by `Sex` and `Pclass`\n",
"\n",
"4. Write the cleaned Spark DataFrame to a Parquet file.\n",
"\n",
"5. Bonus tasks:\n",
" - Create a temporary Spark SQL view and query it with SQL syntax.\n",
" - Provide a quick EDA (e.g., the distribution of `Age`).\n",
"\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}