Spaces: rongguangw / GenMIND

rongguangw committed on Jul 10, 2024

Commit 5ad4396 (verified) · 1 Parent(s): df36903

add training script

Files changed (1)

synthetic_data_generation.ipynb +193 -0

synthetic_data_generation.ipynb ADDED

	@@ -0,0 +1,193 @@

+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "eaa22304",
+      "metadata": {
+        "id": "eaa22304"
+      },
+      "source": [
+        "### Kernel Density Estimation  \n",
+        "Given n data points $X \\in R^{n\\times m}$, estimate the probability density function of the data, i.e. $P(x)$.\n",
+        "\n",
+        "In KDE, the pdf is given by $P(x) = \\frac{1}{nh^m}\\sum_{i=1}^{n}K(\\frac{X_i-x}{h})$,\n",
+        "where K is the kernel function and h is the smoothing bandwidth (a small h undersmooths, a large h oversmooths)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "e139aff4",
+      "metadata": {
+        "id": "e139aff4"
+      },
+      "outputs": [],
+      "source": [
+        "import sklearn\n",
+        "import fnmatch\n",
+        "import numpy as np\n",
+        "import pandas as pd\n",
+        "import seaborn as sns\n",
+        "import statsmodels.api as sm\n",
+        "import matplotlib.pyplot as plt\n",
+        "from sklearn.decomposition import PCA\n",
+        "from sklearn.neighbors import KernelDensity\n",
+        "from sklearn.model_selection import GridSearchCV"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "14a48fb0",
+      "metadata": {
+        "id": "14a48fb0"
+      },
+      "source": [
+        "#### Load the real data and select samples for a specific race and sex"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "c784a28b",
+      "metadata": {
+        "id": "c784a28b"
+      },
+      "outputs": [],
+      "source": [
+        "df = pd.read_csv('istaging_all.csv') # load istaging data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "0159223c",
+      "metadata": {
+        "id": "0159223c"
+      },
+      "outputs": [],
+      "source": [
+        "# select black females\n",
+        "df = df[((df.Race == 'Black') & (df.Sex == 'F'))].reset_index(drop=True)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "0758d447",
+      "metadata": {
+        "id": "0758d447"
+      },
+      "outputs": [],
+      "source": [
+        "# select baseline data for each subject\n",
+        "df.Date = pd.to_datetime(df.Date)\n",
+        "df_tp1 = df.loc[df.groupby('PTID')['Date'].idxmin()].reset_index(drop=True)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "9611e735",
+      "metadata": {
+        "scrolled": true,
+        "id": "9611e735"
+      },
+      "outputs": [],
+      "source": [
+        "# split the data into train and test sets; the train set is used to learn the probability distribution of the real data\n",
+        "df_train, df_test = sklearn.model_selection.train_test_split(df_tp1, test_size=0.3, random_state=40)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "0800ceda",
+      "metadata": {
+        "id": "0800ceda"
+      },
+      "source": [
+        "#### Fit a KDE model to estimate the joint probability density of Age and ROI volumes."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "b37250f8",
+      "metadata": {
+        "id": "b37250f8"
+      },
+      "outputs": [],
+      "source": [
+        "## standardized ROI grid search\n",
+        "# use grid search to select the bandwidth\n",
+        "cols = ['Age']\n",
+        "roi_cols = [] # fill in with the ROI column names\n",
+        "cols.extend(roi_cols) # select the ROI volumes; fnmatch.filter expects one pattern string, not a list\n",
+        "data = df_train.loc[:, cols].to_numpy()\n",
+        "data_standard = pd.DataFrame()\n",
+        "# standardize the data\n",
+        "data_standard['Age'] = (df_train['Age'] - df_train.loc[:, 'Age'].mean()) / df_train.loc[:, 'Age'].std()\n",
+        "data_standard[cols[1:]] =  ((df_train.loc[:, cols[1:]] - df_train.loc[:, cols[1:]].mean()) / df_train.loc[:, cols[1:]].std())\n",
+        "data_standard = data_standard.to_numpy()\n",
+        "\n",
+        "# Use a Gaussian kernel; drop the first grid point because the bandwidth must be strictly positive\n",
+        "kde = GridSearchCV(KernelDensity(kernel='gaussian'), {'bandwidth': np.linspace(0, 3, 100)[1:]}, cv=5)\n",
+        "kde.fit(data_standard)\n",
+        "kde = kde.best_estimator_\n",
+        "print(f'optimal bandwidth of kernel estimated via grid search is {kde.bandwidth}')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "32c78445",
+      "metadata": {
+        "id": "32c78445"
+      },
+      "source": [
+        "#### Generate synthetic data using a KDE model for the specified category of race and sex"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "e06523c2",
+      "metadata": {
+        "id": "e06523c2"
+      },
+      "outputs": [],
+      "source": [
+        "# sample 3000 data points\n",
+        "sample = kde.sample(3000, random_state=0)\n",
+        "sample[:, :] = np.multiply(sample[:, :], df_train.loc[:, cols[:]].std().tolist()) + df_train.loc[:, cols[:]].mean().tolist()\n",
+        "cov_list = np.array([[f'Synth_{i+1}', 'F', 'Black'] for i in range(3000)])\n",
+        "synthetic_data = np.concatenate([cov_list, sample], axis=1)\n",
+        "cols=['PTID', 'Sex', 'Race', 'Age']\n",
+        "cols.extend(roi_cols)\n",
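+        "df_kde_synth = pd.DataFrame(synthetic_data, columns=cols).astype({c: float for c in cols[3:]}) # the string concatenation above turned Age and the ROI volumes into text"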
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.8.8"
+    },
+    "colab": {
+      "provenance": []
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
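
The notebook's steps (standardize, choose a bandwidth by cross-validation, sample, map back to the original units) can be tried without access to `istaging_all.csv`. The following is a minimal sketch under stated assumptions, not the notebook's actual data or settings: the `toy` table, its `ROI_A` and `ROI_B` columns, the grid bounds, and the sample counts are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Hypothetical stand-in for the training table: Age plus two made-up ROI columns.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(70, 8, n)
toy = pd.DataFrame({
    'Age': age,
    'ROI_A': 1000 - 3 * age + rng.normal(0, 20, n),
    'ROI_B': 500 + rng.normal(0, 15, n),
})
cols = list(toy.columns)

# Standardize each column so a single scalar bandwidth is meaningful for all of them.
mean = toy[cols].mean().to_numpy()
std = toy[cols].std().to_numpy()
z = (toy[cols].to_numpy() - mean) / std

# Pick the bandwidth by 5-fold cross-validated log-likelihood.
# The grid starts above zero because the bandwidth must be strictly positive.
search = GridSearchCV(KernelDensity(kernel='gaussian'),
                      {'bandwidth': np.linspace(0.05, 3, 60)}, cv=5)
search.fit(z)
kde = search.best_estimator_

# Sample in standardized space, then map back to the original units.
synth = kde.sample(1000, random_state=0) * std + mean
df_synth = pd.DataFrame(synth, columns=cols)  # numeric columns stay floats

print('selected bandwidth:', kde.bandwidth)
print(df_synth.describe().loc[['mean', 'std']])
```

Building the frame from the numeric array and adding identifier columns afterwards avoids mixing strings and floats in one NumPy array, which would turn every value into text.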