Breast Cancer Classification Models
Overview
This repository contains a collection of logistic regression models trained on the Breast Cancer Wisconsin dataset for binary classification of tumors (malignant vs. benign). The models were developed using scikit-learn as part of a machine learning assignment to explore different optimization techniques and regularization methods.
Models
The following models and preprocessing objects are included in this repository:
- scaler.joblib: A
StandardScaler
used for preprocessing features (required for non-pipeline models). - batch_model.joblib: Logistic Regression trained with Batch Gradient Descent (using the
lbfgs
solver). - sgd_model.joblib: Logistic Regression trained with Stochastic Gradient Descent (SGD).
- mini_batch_model.joblib: Logistic Regression trained with Mini-batch Gradient Descent (approximated using SGD).
- poly_pipeline.joblib: A pipeline combining
PolynomialFeatures
(degree=2),StandardScaler
, and Logistic Regression. - l2_model.joblib: Logistic Regression with L2 (Ridge) regularization.
- es_model.joblib: Logistic Regression with Early Stopping.
Dataset
- Source: Breast Cancer Wisconsin dataset, accessed via
sklearn.datasets.load_breast_cancer
. - Features: 30 numerical features (e.g., mean radius, mean texture, mean perimeter).
- Target: Binary classification (0 = malignant, 1 = benign).
- Size: 569 samples.
- Split: 80% training (455 samples), 20% validation (114 samples).
- Preprocessing: Features were standardized using
StandardScaler
(except for thepoly_pipeline
, which handles scaling internally).
Training Details
Library:
scikit-learn
.Optimization Techniques:
- Batch Gradient Descent: Used
lbfgs
solver withmax_iter=100
. - Stochastic Gradient Descent: Used
SGDClassifier
withloss='log_loss'
, constant learning rate (eta0=0.01
), andmax_iter=100
. - Mini-batch Gradient Descent: Approximated using
SGDClassifier
with shuffling enabled. - Polynomial Features: Added degree-2 polynomial features, followed by scaling and logistic regression.
- L2 Regularization: Applied with
C=1.0
andmax_iter=1000
. - Early Stopping: Used
SGDClassifier
withearly_stopping=True
, validation fraction of 0.1, andn_iter_no_change=10
.
- Batch Gradient Descent: Used
Random State: Set to 42 for reproducibility across all models.
Evaluation Metrics
The models were evaluated on the validation set (114 samples) using accuracy and confusion matrices. Below are the accuracy scores:
Model | Accuracy |
---|---|
Batch GD | 97.37% |
SGD | 98.25% |
Mini-batch GD | 98.25% |
Polynomial GD | 97.37% |
Early Stopping | 99.12% |
Usage
Installation
Ensure you have scikit-learn
and joblib
installed:
pip install scikit-learn joblib
Loading and Using Non-Pipeline Models
For models like batch_model, sgd_model, mini_batch_model, l2_model, and es_model, you need the scaler for preprocessing:
import joblib
import numpy as np
# Load the scaler and model
scaler = joblib.load('scaler.joblib')
model = joblib.load('batch_model.joblib') # Replace with desired model
# Example: Preprocess new data (replace with your data)
X_new = np.array([[17.99, 10.38, 122.80, ...]]) # 30 features
X_new_scaled = scaler.transform(X_new)
# Make predictions
predictions = model.predict(X_new_scaled)
print(predictions) # 0 (malignant) or 1 (benign)
Loading and Using the Pipeline Model
- The poly_pipeline includes its own preprocessing steps, so the scaler is not needed:
import joblib
import numpy as np
# Load the pipeline
poly_pipeline = joblib.load('poly_pipeline.joblib')
# Example: New data (replace with your data)
X_new = np.array([[17.99, 10.38, 122.80, ...]]) # 30 features
# Make predictions directly
predictions = poly_pipeline.predict(X_new)
print(predictions) # 0 (malignant) or 1 (benign)
Intended Use
- These models are intended for educational purposes, demonstrating the application of logistic regression with various optimization techniques on a medical dataset. They can be used for:
- Classifying breast tumors as malignant or benign based on 30 features.
- Comparing the performance of different gradient descent methods and regularization techniques.
Limitations
- Dataset Size: The dataset is relatively small (569 samples), which may limit model generalization.
- Feature Engineering: Only polynomial features (degree=2) were explored; other feature engineering techniques might improve performance.
- Model Complexity: Logistic regression is a linear model and may not capture complex patterns as well as non-linear models (e.g., SVM, neural networks).
- Evaluation: Performance was evaluated on a single validation split; cross-validation could provide a more robust assessment.
License
- This project is licensed under the MIT License.
Author
- Created by Arsal Mairaj on April 11, 2025.
For questions or contributions, please open an issue in the repository.