{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PySpark Data Engineering Assessment\n",
    "\n",
    "## Tasks\n",
    "\n",
    "1. Read the CSV data (in `../data/titanic.csv`) into:\n",
    "   - a Pandas DataFrame\n",
    "   - a Spark DataFrame\n",
    "\n",
    "2. Clean the data (e.g., drop rows with nulls in `Age` or `Fare`).\n",
    "\n",
    "3. Run basic aggregations:\n",
    "   - Find the average `Fare` by `Pclass`\n",
    "   - Find the survival rate by `Sex` and `Pclass`\n",
    "   - etc.\n",
    "\n",
    "4. Write the cleaned Spark DataFrame to a Parquet file.\n",
    "\n",
    "5. Bonus tasks:\n",
    "   - Create a temporary Spark SQL table/view and query it with SQL syntax.\n",
    "   - Provide a quick EDA (e.g., the distribution of `Age`).\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}