{
 "cells": [
  {
   "cell_type": "markdown",
   "id": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data Science Analysis Notebook\n",
    "\n",
    "This notebook contains some example Python code for data analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "#  Create a function to summarize the code.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "At first, we will start by importing the pandas and numpy modules.\n",
    " Then we will use the seaborn library.\n",
    " Next step is to set the style of the visualization.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Set visualization style\n",
    "sns.set(style='whitegrid')\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a function summarize and load the dataset.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "To Load the dataset\n",
    " To display the basic information, use the print statement in the function.\n",
    " To print the dataset shape and head method.\n",
    "\n",
    " Create a new dataframe with the shape of the dataframe and the head method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset\n",
    "df = pd.read_csv('housing_data.csv')\n",
    "\n",
    "# Display basic information\n",
    "print(f\"Dataset shape: {df.shape}\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a function summarize to perform the data cleaning.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "In the for loop we iterate through the dataframe and fill missing values with median.\n",
    " For each column in the dataframe, we check if the column is float64 or int64 type. If it is then we use the mode() function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform data cleaning\n",
    "# Fill missing values with median\n",
    "for column in df.columns:\n",
    "    if df[column].dtype in ['float64', 'int64']:\n",
    "        df[column].fillna(df[column].median(), inplace=True)\n",
    "    else:\n",
    "        df[column].fillna(df[column].mode()[0], inplace=True)\n",
    "\n",
    "# Check for remaining missing values\n",
    "print(\"Missing values after cleaning:\")\n",
    "print(df.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "#  Create a function to summarize the data.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "For each column in the dataframe, create a list of numeric columns.\n",
    " Then create a correlation matrix.\n",
    " Next step is to create a function that takes in a dataframe and returns the correlation matrix as an argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exploratory data analysis\n",
    "# Create correlation matrix\n",
    "numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns\n",
    "correlation_matrix = df[numeric_columns].corr()\n",
    "\n",
    "# Plot heatmap\n",
    "plt.figure(figsize=(12, 10))\n",
    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)\n",
    "plt.title('Correlation Matrix of Numeric Features', fontsize=18)\n",
    "plt.xticks(rotation=45, ha='right')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "#  Create a variable called bedrooms_ratio and rooms_per_household.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "If 'bedrooms' in the column and total_rooms is the column then create a new feature and scale it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature engineering\n",
    "# Create new features\n",
    "if 'bedrooms' in df.columns and 'total_rooms' in df.columns:\n",
    "    df['bedrooms_ratio'] = df['bedrooms'] / df['total_rooms']\n",
    "\n",
    "if 'total_rooms' in df.columns and 'households' in df.columns:\n",
    "    df['rooms_per_household'] = df['total_rooms'] / df['households']\n",
    "\n",
    "# Scale numeric features\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "scaler = StandardScaler()\n",
    "df[numeric_columns] = scaler.fit_transform(df[numeric_columns])\n",
    "\n",
    "# Display transformed data\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "#  Create a simple prediction model\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "This function will build a model that can be used to train and evaluate the model.\n",
    " Next step is to split the dataframe into training and test data and predict the median_house_value column using the train_test_split function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build a simple prediction model\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "\n",
    "# Assume we're predicting median_house_value\n",
    "if 'median_house_value' in df.columns:\n",
    "    # Prepare features and target\n",
    "    X = df.drop('median_house_value', axis=1)\n",
    "    y = df['median_house_value']\n",
    "    \n",
    "    # Split the data\n",
    "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "    \n",
    "    # Train the model\n",
    "    model = LinearRegression()\n",
    "    model.fit(X_train, y_train)\n",
    "    \n",
    "    # Make predictions\n",
    "    y_pred = model.predict(X_test)\n",
    "    \n",
    "    # Evaluate the model\n",
    "    mse = mean_squared_error(y_test, y_pred)\n",
    "    r2 = r2_score(y_test, y_pred)\n",
    "    \n",
    "    print(f\"Mean Squared Error: {mse:.2f}\")\n",
    "    print(f\"R² Score: {r2:.2f}\")\n",
    "    \n",
    "    # Plot actual vs predicted values\n",
    "    plt.figure(figsize=(10, 6))\n",
    "    plt.scatter(y_test, y_pred, alpha=0.5)\n",
    "    plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')\n",
    "    plt.xlabel('Actual Values')\n",
    "    plt.ylabel('Predicted Values')\n",
    "    plt.title('Actual vs Predicted Values')\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}