Breast Cancer Classification Models

Overview

This repository contains a collection of logistic regression models trained on the Breast Cancer Wisconsin dataset for binary classification of tumors (malignant vs. benign). The models were developed using scikit-learn as part of a machine learning assignment to explore different optimization techniques and regularization methods.

Models

The following models and preprocessing objects are included in this repository:

  • scaler.joblib: A StandardScaler used for preprocessing features (required for non-pipeline models).
  • batch_model.joblib: Logistic Regression trained with Batch Gradient Descent (using the lbfgs solver).
  • sgd_model.joblib: Logistic Regression trained with Stochastic Gradient Descent (SGD).
  • mini_batch_model.joblib: Logistic Regression trained with Mini-batch Gradient Descent (approximated using SGD).
  • poly_pipeline.joblib: A pipeline combining PolynomialFeatures (degree=2), StandardScaler, and Logistic Regression.
  • l2_model.joblib: Logistic Regression with L2 (Ridge) regularization.
  • es_model.joblib: Logistic Regression with Early Stopping.

Dataset

  • Source: Breast Cancer Wisconsin dataset, accessed via sklearn.datasets.load_breast_cancer.
  • Features: 30 numerical features (e.g., mean radius, mean texture, mean perimeter).
  • Target: Binary classification (0 = malignant, 1 = benign).
  • Size: 569 samples.
  • Split: 80% training (455 samples), 20% validation (114 samples).
  • Preprocessing: Features were standardized using StandardScaler (except for the poly_pipeline, which handles scaling internally).
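
The split and scaling described above can be reproduced roughly as follows. This is a minimal sketch, not the author's original script: it assumes `train_test_split` was called with `test_size=0.2` and `random_state=42` and without stratification, none of which the list above states explicitly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the 569 x 30 feature matrix and the binary target (0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)

# 80/20 split. Assumptions: seed 42 and no stratification.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the validation split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

print(X_train_scaled.shape, X_val_scaled.shape)  # (455, 30) (114, 30)
```

Fitting the scaler on the training split alone keeps validation statistics out of the preprocessing step.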

Training Details

  • Library: scikit-learn.

  • Optimization Techniques:

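    • Batch Gradient Descent: Used the lbfgs solver with max_iter=100. L-BFGS is a full-batch quasi-Newton method, so this model stands in for batch gradient descent rather than implementing it literally.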
    • Stochastic Gradient Descent: Used SGDClassifier with loss='log_loss', constant learning rate (eta0=0.01), and max_iter=100.
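    • Mini-batch Gradient Descent: Approximated using SGDClassifier with shuffling enabled (shuffle=True). SGDClassifier still updates the weights one sample at a time, so this is an approximation rather than a true mini-batch update.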
    • Polynomial Features: Added degree-2 polynomial features, followed by scaling and logistic regression.
    • L2 Regularization: Applied with C=1.0 and max_iter=1000.
    • Early Stopping: Used SGDClassifier with early_stopping=True, validation_fraction=0.1, and n_iter_no_change=10, so 10% of the training data is held out internally to decide when to stop.
  • Random State: Set to 42 for reproducibility across all models.

Evaluation Metrics

The models were evaluated on the validation set (114 samples) using accuracy and confusion matrices. Below are the accuracy scores:

Model            Accuracy
Batch GD         97.37%
SGD              98.25%
Mini-batch GD    98.25%
Polynomial GD    97.37%
Early Stopping   99.12%
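
With 114 validation samples, each misclassification costs about 0.88 percentage points, so the reported accuracies of 97.37%, 98.25%, and 99.12% correspond to 3, 2, and 1 misclassified samples. The sketch below shows how accuracy and a confusion matrix could be computed for one model. It retrains a logistic regression instead of loading a saved file, and the split settings (`test_size=0.2`, `random_state=42`, no stratification) are assumptions, so the numbers it prints are not guaranteed to match the table.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=100, random_state=42)
model.fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_val))
acc = accuracy_score(y_val, y_pred)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign).
cm = confusion_matrix(y_val, y_pred, labels=[0, 1])

# Off-diagonal entries are the misclassified samples.
n_errors = int(cm[0, 1] + cm[1, 0])

print(f"accuracy: {acc:.2%}")
print(cm)
print(f"misclassified: {n_errors} of {cm.sum()}")
```

For a medical dataset the off-diagonal cells are worth reading separately: `cm[0, 1]` counts malignant tumors predicted as benign, which accuracy alone does not distinguish from the opposite error.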

Usage

Installation

Ensure you have scikit-learn and joblib installed:

pip install scikit-learn joblib

Loading and Using Non-Pipeline Models

For models like batch_model, sgd_model, mini_batch_model, l2_model, and es_model, you need the scaler for preprocessing:

import joblib
from sklearn.datasets import load_breast_cancer

# Load the scaler and model
scaler = joblib.load('scaler.joblib')
model = joblib.load('batch_model.joblib')  # Replace with desired model

# Example: one sample with all 30 features, in the dataset's column order.
# The first dataset row (17.99, 10.38, 122.80, ...) is used as a stand-in;
# replace it with your own data of shape (n_samples, 30).
X_new = load_breast_cancer().data[:1]
X_new_scaled = scaler.transform(X_new)

# Make predictions
predictions = model.predict(X_new_scaled)
print(predictions)  # 0 (malignant) or 1 (benign)

Loading and Using the Pipeline Model

The poly_pipeline includes its own preprocessing steps, so the scaler is not needed:

import joblib
from sklearn.datasets import load_breast_cancer

# Load the pipeline
poly_pipeline = joblib.load('poly_pipeline.joblib')

# Example: one unscaled sample with all 30 features, in the dataset's column order.
# The first dataset row (17.99, 10.38, 122.80, ...) is used as a stand-in;
# replace it with your own data of shape (n_samples, 30).
X_new = load_breast_cancer().data[:1]

# Make predictions directly
predictions = poly_pipeline.predict(X_new)
print(predictions)  # 0 (malignant) or 1 (benign)
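
For readers without the .joblib files at hand, the save/load round trip can be tried end to end with a stand-in model. This is an illustrative sketch only: the model is trained on the full dataset for brevity, the files are written to a temporary directory under hypothetical names, and none of it reflects how the repository's artifacts were actually produced.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Stand-in artifacts (fitted on all data purely to keep the example short).
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000, random_state=42).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = Path(tmp) / "scaler.joblib"
    model_path = Path(tmp) / "model.joblib"
    joblib.dump(scaler, scaler_path)
    joblib.dump(model, model_path)

    # Reload both objects, as a user of the repository would.
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

X_new = X[:5]
preds = loaded_model.predict(loaded_scaler.transform(X_new))

# Map 0/1 predictions to the dataset's class names ('malignant', 'benign').
labels = data.target_names[preds]

# The reloaded objects should reproduce the in-memory predictions.
same = np.array_equal(preds, model.predict(scaler.transform(X_new)))
print(list(zip(preds.tolist(), labels.tolist())), same)
```

Files written by joblib are pickles, so they are best loaded with a scikit-learn version close to the one that saved them, and only from sources you trust.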

Intended Use

These models are intended for educational purposes, demonstrating the application of logistic regression with various optimization techniques on a medical dataset. They can be used for:

  • Classifying breast tumors as malignant or benign based on 30 features.
  • Comparing the performance of different gradient descent methods and regularization techniques.

Limitations

  • Dataset Size: The dataset is relatively small (569 samples), which may limit model generalization.
  • Feature Engineering: Only polynomial features (degree=2) were explored; other feature engineering techniques might improve performance.
  • Model Complexity: Logistic regression is a linear model and may not capture complex patterns as well as non-linear models (e.g., kernel SVMs, neural networks).
  • Evaluation: Performance was evaluated on a single validation split; cross-validation could provide a more robust assessment.
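
The cross-validation mentioned in the last point could look like the sketch below. It is not part of the original project: the 5-fold stratified setup, the seed, and `max_iter=1000` are choices made here for illustration. Scaling sits inside the pipeline so that each fold fits its own scaler on that fold's training portion.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and classifier are refit from scratch on every training fold.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

# Stratification keeps the malignant/benign ratio similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print(f"fold accuracies: {scores.round(4)}")
print(f"mean = {scores.mean():.4f}, std = {scores.std():.4f}")
```

The spread across folds gives a sense of how much a single 114-sample validation score can move by chance.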

License

  • This project is licensed under the MIT License.

Author

  • Created by Arsal Mairaj on April 11, 2025.

For questions or contributions, please open an issue in the repository.
