# Data Science Analysis Notebook

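This notebook walks through an example data analysis in Python: loading a housing dataset, cleaning it, exploring correlations, engineering features, and fitting a simple linear regression model.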

# Import libraries and set the visualization style


We start by importing pandas and NumPy for data handling, and matplotlib and seaborn for plotting.
We then set seaborn's `whitegrid` style and enable inline plots.


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set(style='whitegrid')
%matplotlib inline

# Load the dataset


We load the dataset from `housing_data.csv` into a DataFrame.
To display basic information, we print the shape of the DataFrame and show its first rows with the `head()` method.

In [None]:
# Load the dataset
df = pd.read_csv('housing_data.csv')

# Display basic information
print(f"Dataset shape: {df.shape}")
df.head()
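
The shape and preview shown above can be complemented with per-column dtypes and missing-value counts. The following is a minimal sketch, not part of the original notebook: `summarize_dataframe` is a hypothetical helper, and the small in-memory CSV is a made-up stand-in for `housing_data.csv` so the example runs on its own.

```python
import io

import pandas as pd

# Toy stand-in for housing_data.csv (values are made up for illustration)
csv_text = """total_rooms,bedrooms,households,ocean_proximity
880,129,126,NEAR BAY
7099,,1138,NEAR BAY
1467,190,177,INLAND
"""


def summarize_dataframe(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
    })


toy_df = pd.read_csv(io.StringIO(csv_text))
print(f"Dataset shape: {toy_df.shape}")
print(summarize_dataframe(toy_df))
```

On the real data, the same helper could be called as `summarize_dataframe(df)` right after loading.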

# Clean the data


In the for loop we iterate through the columns of the dataframe and fill in missing values.
For each column, we check whether it is of type `float64` or `int64`. If it is, we fill missing values with the column median; otherwise we fill them with the most frequent value, taken from the `mode()` function.
Finally, we print the number of missing values left in each column.

In [None]:
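# Perform data cleaning
# Fill missing values: median for numeric columns, most frequent value otherwise
for column in df.columns:
    if df[column].dtype in ['float64', 'int64']:
        # Assign the result back; inplace=True on a selected column may not update df
        df[column] = df[column].fillna(df[column].median())
    else:
        # mode() returns a Series; [0] picks the most frequent value
        df[column] = df[column].fillna(df[column].mode()[0])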

# Check for remaining missing values
print("Missing values after cleaning:")
print(df.isnull().sum())
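
To see the imputation rule on data small enough to check by hand, here is a minimal sketch on a toy DataFrame. The helper name `fill_missing` and the toy values are assumptions made for illustration. The sketch also swaps the dtype-string comparison for `pd.api.types.is_numeric_dtype` and skips columns that have no mode, which the cell above does not do.

```python
import numpy as np
import pandas as pd


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs filled: median for numeric columns, mode otherwise."""
    filled = frame.copy()
    for column in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            modes = filled[column].mode()
            if not modes.empty:  # an all-missing column has no mode
                filled[column] = filled[column].fillna(modes.iloc[0])
    return filled


toy = pd.DataFrame({
    "rooms": [2.0, np.nan, 4.0, 10.0],           # median of 2, 4, 10 is 4
    "area": ["north", "south", None, "north"],   # most frequent value is "north"
})
cleaned = fill_missing(toy)
print(cleaned)
```

Returning a copy leaves the input untouched, which makes it easy to compare the data before and after cleaning.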

# Explore correlations between numeric features


First, we create a list of the numeric columns (`float64` and `int64`) in the dataframe.
Then we create a correlation matrix for those columns.
The next step is to plot the correlation matrix as an annotated heatmap.

In [None]:
# Exploratory data analysis
# Create correlation matrix
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
correlation_matrix = df[numeric_columns].corr()

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of Numeric Features', fontsize=18)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
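
The correlation step can also be wrapped in a function that takes a dataframe and returns the correlation matrix. The following is a minimal sketch under that assumption; `correlation_matrix_of` and the toy columns are hypothetical names chosen for illustration, and the toy values are picked so that the expected correlations are obvious.

```python
import pandas as pd


def correlation_matrix_of(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the Pearson correlation matrix of the numeric columns."""
    numeric = frame.select_dtypes(include="number")
    return numeric.corr()


toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],        # perfectly correlated with x
    "reversed_x": [4, 3, 2, 1],      # perfectly anti-correlated with x
    "label": ["a", "b", "c", "d"],   # non-numeric, so it is left out
})
corr = correlation_matrix_of(toy)
print(corr.round(2))
```

The returned matrix can be passed to `sns.heatmap` in the same way as `correlation_matrix` above.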

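# Engineer and scale features


If both the `bedrooms` and `total_rooms` columns are present, we create a new feature `bedrooms_ratio`. If both `total_rooms` and `households` are present, we create `rooms_per_household`.
We then standardize the columns in `numeric_columns` with `StandardScaler`. That list was built before the ratio features were added, so the new features are left unscaled, and it includes the target if `median_house_value` is numeric.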


In [None]:
# Feature engineering
# Create new features
if 'bedrooms' in df.columns and 'total_rooms' in df.columns:
    df['bedrooms_ratio'] = df['bedrooms'] / df['total_rooms']

if 'total_rooms' in df.columns and 'households' in df.columns:
    df['rooms_per_household'] = df['total_rooms'] / df['households']

# Scale numeric features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Display transformed data
df.head()
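
Two details of this step are easier to see on a toy example: what happens when a denominator is zero, and what standardization does to a column. The sketch below uses made-up values and computes the z-score by hand as (x - mean) / std with the population standard deviation (`ddof=0`), which is, to our knowledge, the convention `StandardScaler` follows. Replacing zeros with NaN is an assumption added here, not something the cell above does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0],
    "bedrooms": [20.0, 50.0, 60.0],
    "households": [10.0, 0.0, 30.0],  # the zero would cause a division by zero
})

# Ratio feature, computed as in the cell above
toy["bedrooms_ratio"] = toy["bedrooms"] / toy["total_rooms"]

# Replace a zero denominator with NaN so the result is NaN instead of inf
toy["rooms_per_household"] = toy["total_rooms"] / toy["households"].replace(0, np.nan)

# Standardize by hand: z = (x - mean) / std, with the population std (ddof=0)
columns = ["total_rooms", "bedrooms"]
standardized = (toy[columns] - toy[columns].mean()) / toy[columns].std(ddof=0)

print(toy[["bedrooms_ratio", "rooms_per_household"]])
print(standardized.round(3))
```

After standardization each column has mean 0 and standard deviation 1, so features measured on different scales become comparable.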

# Create a simple prediction model


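This cell builds a linear regression model that predicts the `median_house_value` column, then trains and evaluates it.
We split the dataframe into training (80%) and test (20%) data with the `train_test_split` function, fit the model on the training data, and evaluate its predictions on the test data with the mean squared error and the R² score. Finally, we plot the predicted values against the actual values.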

In [None]:
# Build a simple prediction model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Assume we're predicting median_house_value
if 'median_house_value' in df.columns:
    # Prepare features and target; keep numeric columns only,
    # since LinearRegression cannot be fit on string-valued columns
    X = df.drop('median_house_value', axis=1).select_dtypes(include='number')
    y = df['median_house_value']

    # Split the data (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"Mean Squared Error: {mse:.2f}")
    print(f"R² Score: {r2:.2f}")

    # Plot actual vs predicted values; the dashed line marks perfect predictions
    plt.figure(figsize=(10, 6))
    plt.scatter(y_test, y_pred, alpha=0.5)
    plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title('Actual vs Predicted Values')
    plt.tight_layout()
    plt.show()
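
To make the two reported metrics concrete, the sketch below computes them by hand with NumPy on four made-up values. The names `y_true`, `y_hat`, `mse_manual` and `r2_manual` are chosen for illustration only. It uses the standard definitions: the mean squared error is the average of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat                        # 1, 0, -1, 0
mse_manual = np.mean(residuals ** 2)              # mean squared error
ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot                 # coefficient of determination

print(f"Mean Squared Error: {mse_manual:.2f}")    # prints 0.50
print(f"R² Score: {r2_manual:.2f}")               # prints 0.90
```

An R² of 1 means the predictions match the actual values exactly, while 0 means the model does no better than always predicting the mean of the actual values.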