{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PySpark Data Engineering Assessment\n",
    "\n",
    "## Tasks\n",
    "\n",
    "1. Read the CSV data (in `../data/titanic.csv`) into:\n",
    "   - a Pandas DataFrame\n",
    "   - a Spark DataFrame\n",
    "\n",
    "2. Clean the data (e.g., drop rows with nulls in `Age` or `Fare`).\n",
    "\n",
    "3. Run basic aggregations, for example:\n",
    "   - Average `Fare` by `Pclass`\n",
    "   - Survival rate by `Sex` and `Pclass`\n",
    "\n",
    "4. Write the cleaned Spark DataFrame to a Parquet file.\n",
    "\n",
    "5. Bonus tasks:\n",
    "   - Create a temporary Spark SQL view and query it with SQL.\n",
    "   - Provide a quick EDA (e.g., the distribution of `Age`).\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}