{
"cells": [
{
"cell_type": "markdown",
"id": "eaa22304",
"metadata": {
"id": "eaa22304"
},
"source": [
"### Kernel Density Estimation \n",
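"Given $n$ data points $X \\in \\mathbb{R}^{n\\times m}$, estimate the probability density function of the data, i.e. $P(x)$.\n",
"\n",
"In KDE, the pdf is estimated as $P(x) = \\frac{1}{nh^{m}}\\sum_{i=1}^{n}K\\left(\\frac{X_i-x}{h}\\right)$,\n",
"where $K$ is the kernel function, $X_i$ is the $i$-th data point (a row of $X$), and $h$ is the smoothing bandwidth: a small $h$ undersmooths (a spiky, high-variance estimate), while a large $h$ oversmooths (fine structure is blurred out). For $m = 1$ the normalization reduces to the familiar $\\frac{1}{nh}$."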
]
},
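{
"cell_type": "markdown",
"id": "kde-gaussian-kernel-note",
"metadata": {
"id": "kde-gaussian-kernel-note"
},
"source": [
"For reference, the fit later in this notebook uses a Gaussian kernel. As a sketch in the notation above (with $u \\in \\mathbb{R}^{m}$ standing for the scaled difference $(X_i - x)/h$, a symbol introduced here only for illustration), the standard $m$-dimensional Gaussian kernel is\n",
"\n",
"$$K(u) = (2\\pi)^{-m/2}\\exp\\left(-\\tfrac{1}{2}\\lVert u\\rVert^{2}\\right),$$\n",
"\n",
"which is non-negative and integrates to 1 over $\\mathbb{R}^{m}$, so the resulting estimate is itself a valid density."
]
},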
{
"cell_type": "code",
"execution_count": null,
"id": "e139aff4",
"metadata": {
"id": "e139aff4"
},
"outputs": [],
"source": [
"import sklearn\n",
"import fnmatch\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import statsmodels.api as sm\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.neighbors import KernelDensity\n",
"from sklearn.model_selection import GridSearchCV"
]
},
{
"cell_type": "markdown",
"id": "14a48fb0",
"metadata": {
"id": "14a48fb0"
},
"source": [
"#### Load the real data and select samples for a specific race and sex"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c784a28b",
"metadata": {
"id": "c784a28b"
},
"outputs": [],
"source": [
"df = pd.read_csv('istaging_all.csv') # load istaging data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0159223c",
"metadata": {
"id": "0159223c"
},
"outputs": [],
"source": [
"# select black females\n",
"df = df[((df.Race == 'Black') & (df.Sex == 'F'))].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0758d447",
"metadata": {
"id": "0758d447"
},
"outputs": [],
"source": [
"# select baseline data for each subject\n",
"df.Date = pd.to_datetime(df.Date)\n",
"df_tp1 = df.loc[df.groupby('PTID')['Date'].idxmin()].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9611e735",
"metadata": {
"scrolled": true,
"id": "9611e735"
},
"outputs": [],
"source": [
"# split the data into train and test sets; the train set is used to learn the probability distribution of the real data\n",
"df_train, df_test = sklearn.model_selection.train_test_split(df_tp1, test_size=0.3, random_state=40)"
]
},
{
"cell_type": "markdown",
"id": "0800ceda",
"metadata": {
"id": "0800ceda"
},
"source": [
"#### Fit a KDE model to estimate the joint probability density of Age and ROI volumes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b37250f8",
"metadata": {
"id": "b37250f8"
},
"outputs": [],
"source": [
"## standardized ROI grid search\n",
"# use grid search to select the bandwidth\n",
"cols = ['Age']\n",
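"roi_cols = [] # fill in with the ROI column names\n",
"# fnmatch.filter expects a single pattern string, so match each entry of roi_cols in turn\n",
"cols.extend(c for pattern in roi_cols for c in fnmatch.filter(df_train.columns, pattern)) # select the ROI volumes\n",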
"data = df_train.loc[:, cols].to_numpy()\n",
"data_standard = pd.DataFrame()\n",
"# standardize the data\n",
"data_standard['Age'] = (df_train['Age'] - df_train.loc[:, 'Age'].mean()) / df_train.loc[:, 'Age'].std()\n",
"data_standard[cols[1:]] = ((df_train.loc[:, cols[1:]] - df_train.loc[:, cols[1:]].mean()) / df_train.loc[:, cols[1:]].std())\n",
"data_standard = data_standard.to_numpy()\n",
"\n",
"# Use a Gaussian kernel; the bandwidth must be strictly positive, so drop the 0 endpoint of the grid\n",
"kde = GridSearchCV(KernelDensity(kernel='gaussian'), {'bandwidth': np.linspace(0, 3, 100)[1:]}, cv=5)\n",
"kde.fit(data_standard)\n",
"kde = kde.best_estimator_\n",
"print(f'optimal bandwidth of kernel estimated via grid search is {kde.bandwidth}')"
]
},
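{
"cell_type": "markdown",
"id": "kde-bandwidth-cv-note",
"metadata": {
"id": "kde-bandwidth-cv-note"
},
"source": [
"It may help to spell out what the grid search above optimizes. With no `scoring` argument, `GridSearchCV` falls back to the estimator's own `score` method, which for `KernelDensity` is the total log-likelihood of the held-out samples. As a sketch (the fold notation $V_k$ and $\\hat{P}_{h}^{(-k)}$ is introduced here for illustration and does not appear in the code), the selected bandwidth is\n",
"\n",
"$$h^{*} = \\arg\\max_{h}\\; \\frac{1}{5}\\sum_{k=1}^{5}\\sum_{i \\in V_k} \\log \\hat{P}_{h}^{(-k)}(X_i),$$\n",
"\n",
"where $V_k$ is the $k$-th validation fold and $\\hat{P}_{h}^{(-k)}$ is the KDE with bandwidth $h$ fitted to the remaining four folds. A bandwidth that is too small gives held-out points near-zero density, and one that is too large flattens the estimate, so both extremes are penalized."
]
},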
{
"cell_type": "markdown",
"id": "32c78445",
"metadata": {
"id": "32c78445"
},
"source": [
"#### Generate synthetic data using a KDE model for the specified category of race and sex"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e06523c2",
"metadata": {
"id": "e06523c2"
},
"outputs": [],
"source": [
"# sample 3000 synthetic data points in the standardized space\n",
"n_synth = 3000\n",
"sample = kde.sample(n_synth, random_state=0)\n",
"# map the samples back to the original scale using the training mean and std\n",
"sample = sample * df_train[cols].std().to_numpy() + df_train[cols].mean().to_numpy()\n",
"# build the DataFrame from the numeric samples so Age and the ROI volumes stay numeric\n",
"# (concatenating them with a string array would cast every value to str)\n",
"df_kde_synth = pd.DataFrame(sample, columns=cols)\n",
"df_kde_synth.insert(0, 'PTID', [f'Synth_{i+1}' for i in range(n_synth)])\n",
"df_kde_synth.insert(1, 'Sex', 'F')\n",
"df_kde_synth.insert(2, 'Race', 'Black')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
},
"colab": {
"provenance": []
}
},
"nbformat": 4,
"nbformat_minor": 5
}