import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io
import sys
from contextlib import redirect_stdout

# Initialize session state for notebook-like cells
if 'cells' not in st.session_state:
    st.session_state.cells = []
if 'df' not in st.session_state:
    st.session_state.df = None


def capture_output(code, df=None):
    """Run user-supplied code and return whatever it printed to stdout."""
    f = io.StringIO()
    # Names made available to the executed snippet. `st` is included because
    # the visualization snippets call st.pyplot(fig) themselves.
    variables = {'st': st, 'pd': pd, 'np': np, 'plt': plt, 'sns': sns}
    if df is not None:
        variables['df'] = df
    with redirect_stdout(f):
        try:
            exec(code, variables)
        except Exception as e:
            # Keep any output printed before the failure, then report the error
            return f.getvalue() + f"Error: {e}"
    return f.getvalue()


def show():
    st.title("Week 3: Data Cleaning and Exploratory Data Analysis")

    # Section 1: Introduction to EDA
    st.header("1. Introduction to Exploratory Data Analysis")
    st.markdown("""
    Exploratory Data Analysis (EDA) is a crucial step in any data science project. Whether EDA is the main purpose of your project or is being used for feature selection/feature engineering in a machine learning context, it's important to understand the relationships between your features and target variables.

    In this module, we'll focus on:
    - Understanding categorical variables
    - Data cleaning techniques
    - Visualizing relationships in data
    - Identifying patterns and insights
    """)

    # Section 2: The Titanic Dataset
    st.header("2. Working with the Titanic Dataset")
    st.markdown("""
    We'll use the famous Titanic dataset to demonstrate data cleaning and EDA techniques. This dataset contains information about passengers aboard the Titanic and whether they survived.

    ### Dataset Description

    | Variable | Definition | Key |
    | -------- | ---------- | --- |
    | Survived | Survival | 0 = No, 1 = Yes |
    | Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
    | Sex | Sex | |
    | Age | Age in years | |
    | SibSp | # of siblings / spouses aboard | |
    | Parch | # of parents / children aboard | |
    | Ticket | Ticket number | |
    | Fare | Passenger fare | |
    | Cabin | Cabin number | |
    | Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
    """)

    # Load and display the dataset (cached so reruns don't re-download the CSV)
    @st.cache_data
    def load_data():
        return pd.read_csv("https://raw.githubusercontent.com/hoffm386/eda-with-categorical-variables/master/titanic.csv")

    df = load_data()
    st.session_state.df = df
    st.subheader("Dataset Preview")
    st.dataframe(df.head())

    # Interactive Data Loading Example
    st.subheader("Try loading the data yourself!")
    load_code = st.text_area("Try loading the Titanic dataset:",
                             'import pandas as pd\n\ndf = pd.read_csv("https://raw.githubusercontent.com/hoffm386/eda-with-categorical-variables/master/titanic.csv")\nprint(df.head())',
                             height=100)
    st.code(load_code, language="python", line_numbers=True)
    if st.button("Run Data Loading Code"):
        output = capture_output(load_code, df)
        st.code(output, language="python", line_numbers=True)

    # Basic Dataset Information
    st.subheader("Dataset Information")
    st.markdown("""
    Let's explore some basic information about our dataset. Try these commands:
    """)
    info_code = st.text_area("Try getting dataset information:",
                             'print("Dataset Shape:", df.shape)\nprint("\\nColumn Names:", df.columns.tolist())\nprint("\\nData Types:\\n", df.dtypes)\nprint("\\nMissing Values:\\n", df.isnull().sum())',
                             height=150)
    st.code(info_code, language="python", line_numbers=True)
    if st.button("Run Info Code"):
        output = capture_output(info_code, df)
        st.code(output, language="python", line_numbers=True)

    # Section 3: Data Cleaning
    st.header("3. Data Cleaning Techniques")

    # Missing Value Handling
    st.subheader("Missing Value Analysis")
    st.markdown("""
    Let's analyze and handle missing values in our dataset. Try these examples:
    """)
    # Note: the fill is assigned back to the column rather than using a chained
    # fillna(..., inplace=True), which is deprecated in recent pandas versions.
    missing_code = st.text_area("Try analyzing missing values:",
                                'missing_percent = (df.isnull().sum() / len(df)) * 100\nprint("Percentage of missing values:\\n", missing_percent[missing_percent > 0])\n\n# Try filling missing values\ndf_filled = df.copy()\ndf_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].median())\nprint("\\nMissing values after filling Age:", df_filled["Age"].isnull().sum())',
                                height=150)
    st.code(missing_code, language="python", line_numbers=True)
    if st.button("Run Missing Value Code"):
        output = capture_output(missing_code, df)
        st.code(output, language="python", line_numbers=True)

    # Data Type Conversion
    st.subheader("Data Type Conversion")
    st.markdown("""
    Let's convert categorical variables to the appropriate data types:
    """)
    type_code = st.text_area("Try converting data types:",
                             'df_cat = df.copy()\ndf_cat["Sex"] = df_cat["Sex"].astype("category")\ndf_cat["Embarked"] = df_cat["Embarked"].astype("category")\nprint("Data types after conversion:\\n", df_cat.dtypes)',
                             height=100)
    st.code(type_code, language="python", line_numbers=True)
    if st.button("Run Type Conversion Code"):
        output = capture_output(type_code, df)
        st.code(output, language="python", line_numbers=True)

    # Section 4: EDA with Categorical Variables
    st.header("4. EDA with Categorical Variables")

    # Interactive Visualizations
    st.subheader("Create Your Own Visualizations")
    st.markdown("""
    Let's explore different types of visualizations to understand our data better:

    1. **Basic Count Plots**

       First, let's look at the number of passengers by sex and the survival rate by passenger class:
    """)
    viz_code = st.text_area("Try creating basic visualizations:",
'''import matplotlib.pyplot as plt
import seaborn as sns

# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Count plot for Sex
sns.countplot(data=df, x="Sex", ax=ax1)
ax1.set_title("Passenger Count by Sex")

# Bar plot for survival rate by Pclass
sns.barplot(data=df, x="Pclass", y="Survived", ax=ax2)
ax2.set_title("Survival Rate by Passenger Class")

plt.tight_layout()
st.pyplot(fig)''',
                            height=200)
    st.code(viz_code, language="python", line_numbers=True)
    if st.button("Run Basic Visualization Code"):
        output = capture_output(viz_code, df)
        st.pyplot(plt.gcf())

    # Advanced Visualizations
    st.subheader("Advanced Visualizations")
    st.markdown("""
    Now let's create more complex visualizations to understand relationships between variables:

    2. **Survival Analysis by Class**

       Let's analyze survival across passenger classes with a grouped count plot, followed by a stacked bar chart of survival proportions:
    """)
    advanced_viz_code = st.text_area("Try creating advanced visualizations:",
'''import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Patch

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Create countplot with custom colors
sns.countplot(x="Pclass", hue="Survived", data=df,
              palette={1: "blue", 0: "red"}, ax=ax)

# Customize the plot
ax.set_xlabel("Passenger Class")
ax.set_title("Survival Distribution by Passenger Class")

# Create custom legend
legend_elements = [
    Patch(facecolor="blue", label="Survived"),
    Patch(facecolor="red", label="Did Not Survive")
]
ax.legend(handles=legend_elements)
plt.tight_layout()
st.pyplot(fig)

# Create a second figure for the proportion analysis
fig2, ax2 = plt.subplots(figsize=(10, 6))

# Calculate survival proportions within each class (values between 0 and 1)
survival_by_class = df.groupby("Pclass")["Survived"].value_counts(normalize=True).unstack()
survival_by_class.plot(kind="bar", stacked=True, ax=ax2)

# Customize the plot
ax2.set_xlabel("Passenger Class")
ax2.set_ylabel("Proportion of passengers")
ax2.set_title("Survival Rate by Passenger Class")
ax2.legend(title="Survived", labels=["No", "Yes"])
plt.tight_layout()
st.pyplot(fig2)''',
                                     height=400)
    st.code(advanced_viz_code, language="python", line_numbers=True)
    if st.button("Run Advanced Visualization Code"):
        output = capture_output(advanced_viz_code, df)
        st.pyplot(plt.gcf())

    # Age Distribution Analysis
    st.subheader("Age Distribution Analysis")
    st.markdown("""
    3. **Age Distribution by Survival**

       Let's examine how age relates to survival:
    """)
    age_viz_code = st.text_area("Try creating age distribution visualizations:",
'''import matplotlib.pyplot as plt

# Create figure and axis
fig, ax = plt.subplots()

# Plot histograms for survived and non-survived passengers
ax.hist(df[df["Survived"]==1]["Age"], bins=15, alpha=0.5, color="blue", label="survived")
ax.hist(df[df["Survived"]==0]["Age"], bins=15, alpha=0.5, color="green", label="did not survive")

# Customize the plot
ax.set_xlabel("Age")
ax.set_ylabel("Count of passengers")
ax.set_title("Age vs. Survival for Titanic Passengers")
ax.legend()
plt.tight_layout()
st.pyplot(fig)''',
                                height=200)
    st.code(age_viz_code, language="python", line_numbers=True)
    if st.button("Run Age Distribution Code"):
        output = capture_output(age_viz_code, df)
        st.pyplot(plt.gcf())

    # Age and Fare Analysis
    st.subheader("Age and Fare Analysis")
    st.markdown("""
    4. **Survival by Age and Fare**

       Let's analyze how both age and fare relate to survival:
    """)
    age_fare_viz_code = st.text_area("Try creating age and fare visualizations:",
'''import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 5))

# Plot scatter points for survived and non-survived passengers
ax.scatter(df[df["Survived"]==1]["Age"], df[df["Survived"]==1]["Fare"],
           c="blue", alpha=0.5, label="survived")
ax.scatter(df[df["Survived"]==0]["Age"], df[df["Survived"]==0]["Fare"],
           c="green", alpha=0.5, label="did not survive")

# Customize the plot
ax.set_xlabel("Age")
ax.set_ylabel("Fare")
ax.set_title("Survival by Age and Fare for Titanic Passengers")

# Create custom legend
color_patches = [
    Line2D([0], [0], marker='o', color='w', label='survived',
           markerfacecolor='b', markersize=10),
    Line2D([0], [0], marker='o', color='w', label='did not survive',
           markerfacecolor='g', markersize=10)
]
ax.legend(handles=color_patches)
plt.tight_layout()
st.pyplot(fig)''',
                                     height=250)
    st.code(age_fare_viz_code, language="python", line_numbers=True)
    if st.button("Run Age and Fare Visualization Code"):
        output = capture_output(age_fare_viz_code, df)
        st.pyplot(plt.gcf())

    # Section 5: Hands-on Exercise
    st.header("5. Hands-on Exercise")
    st.markdown("""
    ### Tasks for this week:

    1. **Data Cleaning Exercise**
        - Load the dataset used for your research
        - Identify and handle missing values
        - Convert categorical variables
        - Create summary statistics
    2. **Exploratory Data Analysis**
        - Create visualizations for key variables
        - Analyze relationships between variables
        - Identify patterns in survival rates
    3. **Report Writing**
        - Document your findings
        - Create a presentation of key insights
        - Suggest potential next steps
    """)

    # Interactive Exercise
    st.subheader("Try Your Own Analysis")
    exercise_code = st.text_area("Write your own analysis code here:",
                                 '# Your code here\n# Try analyzing the relationship between Age and Survival\n# Or create your own visualizations\n# Or perform any other analysis you find interesting',
                                 height=150)
    st.code(exercise_code, language="python", line_numbers=True)
    if st.button("Run Exercise Code"):
        output = capture_output(exercise_code, df)
        st.code(output, language="python", line_numbers=True)

    # Section 6: Homework
    st.header("6. Homework This Week")
    st.markdown("""
    1. Please use your research dataset to complete the following tasks:
        - Analyze the data for any missing values
        - Get basic information about the dataset (Hint: use the [Dataset Information](#dataset-information) section above)
        - Create visualizations to understand the data
            - Hint: use the [Create Your Own Visualizations](#create-your-own-visualizations) section above
        - Write a report of your findings and save the graphs produced
            - Your report should cover what you find interesting about the data
            - It should also list possible research questions
        - Please submit your homework on Canvas
    """)

    # Section 7: Resources
    st.header("7. Additional Resources")
    st.markdown("""
    - [EDA with Categorical Variables](https://github.com/hoffm386/eda-with-categorical-variables)
    - [Kaggle EDA Tutorial](https://www.kaggle.com/code/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)
    - [Pandas Documentation](https://pandas.pydata.org/docs/)
    - [Seaborn Documentation](https://seaborn.pydata.org/)
    """)


if __name__ == "__main__":
    show()