{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Predicting Hospital Readmission with a Binary Classification Model\n", "\n", "In this tutorial, we'll be looking at hospital admission data for patients with diabetes. This dataset was collected from 130 hospitals in the United States from 1999 to 2008. More details can be found on the UCI Machine Learning Repository [website](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).\n", "\n", "This walkthrough is divided into two parts:\n", "\n", "- **Data preprocessing (Steps 1-5)**, which involves data cleaning, exploration, and feature engineering\n", "- **Data modelling (Steps 6-10)**, where we will train a machine learning model to predict whether a patient will be readmitted to the hospital" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Importing Dependencies\n", "\n", "Before getting started, we'll need to import several packages. 
These include:\n", "\n", "- [pandas](https://pandas.pydata.org/pandas-docs/stable/) - a package for performing data analysis and manipulation\n", "- [numpy](https://docs.scipy.org/doc/numpy/) - a package for scientific computing\n", "- [matplotlib](https://matplotlib.org/) - the standard Python plotting package\n", "- [seaborn](https://seaborn.pydata.org/) - a dataframe-centric visualization package that is built on top of **matplotlib**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Load the Data\n", "\n", "We will be loading in the data as a [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).\n", "\n", "The data is stored in a CSV file, which we can access locally (see data/patient_data.csv) or in the cloud (stored in an AWS S3 bucket). We'll import the S3 version of this data using the pandas function `read_csv`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"https://s3.us-east-2.amazonaws.com/explore.datasets/diabetes/patient_data.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a glimpse of our data, we can use the `head()` method, which shows the first 5 rows of the dataframe." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " encounter_id patient_nbr race gender age weight \\\n", "0 2278392 8222157 Caucasian Female [0-10) NaN \n", "1 149190 55629189 Caucasian Female [10-20) NaN \n", "2 64410 86047875 AfricanAmerican Female [20-30) NaN \n", "3 500364 82442376 Caucasian Male [30-40) NaN \n", "4 16680 42519267 Caucasian Male [40-50) NaN \n", "\n", " admission_type_id discharge_disposition_id time_in_hospital \\\n", "0 6 25 1 \n", "1 1 1 3 \n", "2 1 1 2 \n", "3 1 1 2 \n", "4 1 1 1 \n", "\n", " medical_specialty ... examide citoglipton insulin \\\n", "0 Pediatrics-Endocrinology ... No No No \n", "1 NaN ... No No Up \n", "2 NaN ... No No No \n", "3 NaN ... No No Up \n", "4 NaN ... No No Steady \n", "\n", " glyburide-metformin glipizide-metformin glimepiride-pioglitazone \\\n", "0 No No No \n", "1 No No No \n", "2 No No No \n", "3 No No No \n", "4 No No No \n", "\n", " metformin-rosiglitazone metformin-pioglitazone diabetesMed readmitted \n", "0 No No No NO \n", "1 No No Yes >30 \n", "2 No No Yes NO \n", "3 No No Yes NO \n", "4 No No Yes NO \n", "\n", "[5 rows x 44 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### How many rows and columns are in our dataset?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(101766, 44)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our dataset has 101,766 rows and 45 columns. Each row represents a unique hospital admission. Columns represent patient demographics, medical details, and admission-specific information such as length of stay (`time_in_hospital`). We can see a list of all columns by applying `.columns` to our dataframe." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns: ['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'time_in_hospital', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'diabetesMed', 'readmitted']\n" ] } ], "source": [ "print(f\"Columns: {data.columns.tolist()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the columns, we can see that a large proportion are medication names. Let's store these column names as a separate list, which we'll get back to in a bit." 
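, "\n", "\n", "Typing 23 names by hand is error-prone, so here is a minimal sketch of an alternative. It assumes, as in the column listing printed above, that the medication columns sit next to each other. The `toy` dataframe and the `toy_medications` name are made up for illustration and are not part of this tutorial's dataset.\n", "\n", "
```python
import pandas as pd

# Toy stand-in for the real dataset: three medication columns placed
# between two ordinary columns (only a subset of the real names is used).
toy = pd.DataFrame({
    'age': ['[0-10)', '[10-20)'],
    'metformin': ['No', 'Steady'],
    'insulin': ['Up', 'No'],
    'metformin-pioglitazone': ['No', 'No'],
    'diabetesMed': ['No', 'Yes'],
})

# Label-based slicing with .loc includes both endpoints, so this selects
# every column from 'metformin' through 'metformin-pioglitazone'.
toy_medications = toy.loc[:, 'metformin':'metformin-pioglitazone'].columns.tolist()
print(toy_medications)
```
", "\n", "The slice only works while those columns stay contiguous and in the same order, which is one reason the explicit list is a safer choice for the rest of this walkthrough."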
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 23 medications represented as columns in the dataset.\n" ] } ], "source": [ "medications = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', \n", " 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', \n", " 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', \n", " 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin',\n", " 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone']\n", "\n", "print(f\"There are {len(medications)} medications represented as columns in the dataset.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### How many hospital admissions and unique patients are in the dataset? " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hospital admissions: 101,766\n", "Number of unique patients: 71,518\n" ] } ], "source": [ "n_admissions = data['encounter_id'].nunique()\n", "n_patients = data['patient_nbr'].nunique()\n", "\n", "print(f\"Number of hospital admissions: {n_admissions:,}\")\n", "print(f\"Number of unique patients: {n_patients:,}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### How many patients have had more than one hospital admission?" 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Count admissions per patient, then keep patients with more than one\n", "admissions_per_patient = data['patient_nbr'].value_counts().reset_index()\n", "admissions_per_patient.columns = ['patient_nbr', 'count']\n", "multiple_admissions = admissions_per_patient[admissions_per_patient['count'] > 1]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Proportion of patients that have multiple admissions: 23.45%\n", "Maximum number of admissions for a given patient: 40\n" ] } ], "source": [ "print(f\"Proportion of patients that have multiple admissions: {multiple_admissions['patient_nbr'].nunique()/n_patients:.2%}\")\n", "print(f\"Maximum number of admissions for a given patient: {multiple_admissions['count'].max()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Almost one-quarter of the patients (23.45%) have had more than 1 hospital admission. The maximum number of hospital admissions for a given patient is 40." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: Data Cleaning\n", "\n", "Data cleaning is a crucial step in the machine learning pipeline, and it often takes more time and effort than any other part of a data science project." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Decoding Admission Type\n", "\n", "The `admission_type_id` column describes the type of admission and is represented by integers. The `id` column links to descriptors found in a separate file. We'll update this column so that it represents the descriptor name instead of simply the id number.\n", "\n", "Our mapper files are located in `data/id_mappers/`. They are also stored in the cloud in an AWS S3 bucket." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " admission_type_id description\n", "0 1 Emergency\n", "1 2 Urgent\n", "2 3 Elective\n", "3 4 Newborn\n", "4 5 Not Available\n", "5 6 NaN\n", "6 7 Trauma Center\n", "7 8 Not Mapped" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "admission_type = pd.read_csv(\"https://s3.us-east-2.amazonaws.com/explore.datasets/diabetes/id_mappers/admission_type_id.csv\")\n", "admission_type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the admission type mapper file has 3 values which represent missing data:\n", "\n", "1. NaN\n", "2. 'Not Mapped'\n", "3. 'Not Available'\n", "\n", "Let's collapse these into one category that represents a missing value. We can use `pandas` [replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html) method to do this. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "missing_values = ['nan', 'Not Available', 'Not Mapped']\n", "admission_type['description'] = admission_type['description'].replace(missing_values, np.nan)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "admission_type.columns = ['admission_type_id', 'admission_type']" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "data = data.merge(admission_type, on='admission_type_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a \"clean\" mapper, we can apply it to our dataset. We can [map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) `admission_type_id` values in our original dataframe to the descriptors in our `admission_type_mapper` dictionary." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Emergency 53990\n", "Elective 18869\n", "Urgent 18480\n", "Trauma Center 21\n", "Newborn 10\n", "Name: admission_type, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['admission_type'].value_counts()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.countplot(x='admission_type', data=data, palette='magma')\n", "plt.xlabel('type of hospital admission')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Decoding Discharge Location" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " discharge_disposition_id \\\n", "13 14 \n", "8 9 \n", "14 15 \n", "20 21 \n", "29 29 \n", "\n", " description \n", "13 Hospice / medical facility \n", "8 Admitted as an inpatient to this hospital \n", "14 Discharged/transferred within this institution... \n", "20 Expired, place unknown. Medicaid only, hospice. \n", "29 Discharged/transferred to a Critical Access Ho... " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "discharge_disposition = pd.read_csv(\"https://s3.us-east-2.amazonaws.com/explore.datasets/diabetes/id_mappers/discharge_disposition_id.csv\")\n", "discharge_disposition.sample(n=5, random_state=416)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In medicine, \"expired\" is a term that describes a patient who has died. We only want to predict hospital readmission for living patients so we're going to remove hospital admissions in which the patient was recorded as \"expired\" upon being discharged in our dataset.\n", "\n", "We'll first convert our `description` column to lowercase (`str.lower()`), then we'll search for rows that contain \"expired\" (`str.contains(\"expired\")`). " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "discharge_disposition['expired'] = discharge_disposition['description'].str.lower().str.contains('expired')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's take a look at all discharge dispositions that indicate an expired patient. We'll create a new dataframe that filters for rows in which the expired column is True." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " discharge_disposition_id \\\n", "10 11 \n", "18 19 \n", "19 20 \n", "20 21 \n", "\n", " description expired \n", "10 Expired True \n", "18 Expired at home. Medicaid only, hospice. True \n", "19 Expired in a medical facility. Medicaid only, ... True \n", "20 Expired, place unknown. Medicaid only, hospice. True " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "discharge_expired = discharge_disposition[discharge_disposition['expired']==True]\n", "discharge_expired" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "discharge_disposition_id's that indicate an expired patient: [11, 19, 20, 21]\n" ] } ], "source": [ "expired_ids = discharge_expired['discharge_disposition_id'].tolist()\n", "print(f\"discharge_disposition_id's that indicate an expired patient: {expired_ids}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to remove all rows in our original dataset that has `discharge_disposition_id` equal to one of the values in our `expired_ids` list." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "data = data[~data['discharge_disposition_id'].isin(expired_ids)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After removing expired patients, how many patients do we have in our dataset?" 
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original number of patients: 71,518\n", "Number of expired patients: 1,079\n", "After filtering out expired patients: 70,439\n" ] } ], "source": [ "n_patients_nonexpired = data['patient_nbr'].nunique()\n", "print(f\"Original number of patients: {n_patients:,}\")\n", "print(f\"Number of expired patients: {n_patients-n_patients_nonexpired:,}\")\n", "print(f\"After filtering out expired patients: {n_patients_nonexpired:,}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We had 1,079 expired patients in our dataset. After removing them, our dataset has 70,439 patients." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Converting Medication Features From Categorical to Boolean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that list of medications we created when we loaded our data in Step 2? We're going to convert these medication columns into boolean variables." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "No 80216\n", "Steady 18256\n", "Up 1067\n", "Down 575\n", "Name: metformin, dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[medications[0]].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Medication columns currently hold categorical values with the following categories:\n", "\n", "- \"No\" (not taking the medication)\n", "- \"Up\" (increased medication dose)\n", "- \"Down\" (decreased medication dose)\n", "- \"Steady\" (no change in dose)\n", "\n", "To keep things simple, we'll recode each column as \"0\" (not taking the medication) or \"1\" (taking the medication). 
We lose the information about dose changes, but it's a compromise we're willing to make in order to simplify our dataset.\n", "\n", "We can use [numpy.where](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html) to convert all instances of \"No\" to `0` and everything else (i.e., \"Up\", \"Down\", \"Steady\") to `1`. Let's loop through all medications and convert each column to boolean." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "for m in medications:\n", "    data[f'{m}_bool'] = np.where(data[m]=='No', 0, 1)\n", "    data = data.drop(columns=m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our medication data are now represented as boolean features. Let's take a look at the prevalence of these medications. We'll calculate the proportion of patients taking each type of medication. Because some patients have had multiple hospital admissions in this dataset, we'll need to do some wrangling to determine whether a patient was on a given medication during any of their admissions. 
The wrangling process consists of the following steps:\n", "\n", "- apply `groupby` to `patient_nbr` and sum, for each patient, the number of admissions in which the medication was administered\n", "- convert that count to boolean, so that \"0\" becomes False and any positive count becomes True\n", "- sum the boolean column to get the number of patients on that specific medication\n", "- divide by the number of patients to get the proportion who were administered that medication" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "prevalence = []\n", "\n", "for m in medications:\n", "    patient_meds = data.groupby('patient_nbr')[f'{m}_bool'].sum().reset_index()\n", "    patient_meds[f'{m}_bool'] = patient_meds[f'{m}_bool'].astype(bool)\n", "    n_patients_on_med = patient_meds[f'{m}_bool'].sum()\n", "    # n_patients was computed in Step 2, before expired patients were removed\n", "    proportion = n_patients_on_med/n_patients\n", "    prevalence.append(proportion)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a list of medication prevalence, we can create a dataframe and sort by prevalence to determine which medications are most prevalent in our dataset." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " medication prevalence\n", "17 insulin 0.543779\n", "0 metformin 0.229523\n", "6 glipizide 0.138916\n", "7 glyburide 0.118851\n", "9 pioglitazone 0.082525" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "medication_counts = pd.DataFrame({'medication': medications, 'prevalence':prevalence})\n", "medication_counts = medication_counts.sort_values(by='prevalence', ascending=False)\n", "medication_counts.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also visualize the top 10 most prevalent medications. We'll use seaborn's [barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html) method." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.barplot(x='medication', y='prevalence', data=medication_counts.head(10), palette='viridis')\n", "plt.xticks(rotation=90)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that `insulin` is by far the most prevalent medication followed by `metformin`. More than half of the patients in our dataset were prescribed insulin whie 22% were prescribed metformin. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[MeSH](https://en.wikipedia.org/wiki/Medical_Subject_Headings) (or Medical Subject Headings) are a type of \"tag\" that describes a medical term. We'll use RxNav's API to further investigate which MeSH terms are assocaited with our list of medications. We'll cretea a function called `get_mesh_from_drug_name` which returns relevant MeSH terms for a given drug name." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "import json\n", "import requests\n", "\n", "def get_mesh_from_drug_name(drug_name):\n", " drug_name = drug_name.strip()\n", " rxclass_list = []\n", " try:\n", " r = requests.get(f\"https://rxnav.nlm.nih.gov/REST/rxclass/class/byDrugName.json?drugName={drug_name}&relaSource=MESH\")\n", " response = r.json()\n", " all_concepts = response['rxclassDrugInfoList']['rxclassDrugInfo']\n", " for i in all_concepts:\n", " rxclass_list.append(i['rxclassMinConceptItem']['className'])\n", " except:\n", " pass\n", " return list(set(rxclass_list))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll pass the top 10 medications into `get_mesh_from_drug_name` and get their relevant MeSH terms. The results will be stored in a dictionary that we'll call `med_mesh_descriptors`." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "top_ten_meds = medication_counts.head(10)['medication'].tolist()\n", "\n", "med_mesh_descriptors = dict()\n", "for m in top_ten_meds:\n", "    med_mesh_descriptors[m] = get_mesh_from_drug_name(m)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'insulin': [],\n", " 'metformin': ['Hypoglycemic Agents'],\n", " 'glipizide': ['Hypoglycemic Agents'],\n", " 'glyburide': ['Hypoglycemic Agents'],\n", " 'pioglitazone': ['Hypoglycemic Agents'],\n", " 'rosiglitazone': ['Hypoglycemic Agents'],\n", " 'glimepiride': ['Immunosuppressive Agents',\n", " 'Hypoglycemic Agents',\n", " 'Anti-Arrhythmia Agents'],\n", " 'repaglinide': ['Hypoglycemic Agents'],\n", " 'nateglinide': ['Hypoglycemic Agents'],\n", " 'glyburide-metformin': ['Hypoglycemic Agents']}" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "med_mesh_descriptors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results above show that every medication except `insulin`, for which the lookup returned no terms, has the MeSH term [hypoglycemic agent](https://en.wikipedia.org/wiki/Anti-diabetic_medication), which means it's an anti-diabetic medication. Interestingly, the medication `glimepiride` also has two other associated MeSH terms, 'Anti-Arrhythmia Agents' and 'Immunosuppressive Agents', in addition to being a hypoglycemic agent.\n", "\n", "If you want to learn more about each medication in our dataset, check out the [RxNav dashboard](https://mor.nlm.nih.gov/RxNav/) which gives an overview of medication properties and interactions.\n", "\n", "![](images/rxnav_dashboard.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating a Target Variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of our model will be to predict whether a patient will get readmitted to the hospital. 
Looking at the `readmitted` column, we see that there are 3 possible values:\n", "\n", "1. `NO` (not readmitted)\n", "2. `>30` (readmitted more than 30 days after being discharged)\n", "3. `<30` (readmitted within 30 days of being discharged)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NO 53212\n", ">30 35545\n", "<30 11357\n", "Name: readmitted, dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['readmitted'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To keep things simple, we'll view this as a binary classification problem: did the patient get readmitted? We're going to use `numpy.where` to convert all instances of \"NO\" to 0 (the patient was not readmitted) and everything else to 1 (the patient was readmitted)." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 53212\n", "1 46902\n", "Name: readmitted_bool, dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['readmitted_bool'] = np.where(data['readmitted']=='NO', 0, 1)\n", "data['readmitted_bool'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4: Data Exploration and Visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Assessing Missing Values\n", "\n", "To get a better sense of the missing values in our data, let's visualize them using [missingno](https://github.com/ResidentMario/missingno)'s \"nullity\" matrix." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import missingno as msno\n", "\n", "msno.matrix(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data-dense columns are fully black, while the sparse columns (with missing values) have a mixture of white and black. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Patient Demographics: Age and Gender" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAA1MAAAEWCAYAAACDss1qAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAIABJREFUeJzt3XmYJWV99//3RwYFRTYZeZBFUFEz6iPqCLglKIYtKmiI4qMyKgYXcInm94gxihtGs2iiKHmIjoBRUVF0VBQRQSUJyyDIKmGCqIMICAi4O/j9/VF3Q9F095w+M6e7p+f9uq66us5d27fqnD73+VbddVeqCkmSJEnS9NxjtgOQJEmSpHWRyZQkSZIkDcFkSpIkSZKGYDIlSZIkSUMwmZIkSZKkIZhMSZIkSdIQTKa03kryr0nespbWtUOSXyTZoL0+M8nL1sa62/q+mmTJ2lrfNLb7riQ/S/LTmd62JM1H1j0DbXdO1T1r+7hqfjGZ0ryU5Ookv05yW5KfJ/nPJK9IcsdnvqpeUVXvHHBdT59qnqr6UVVtUlW3r4XY35bk38etf9+qOn5N1z3NOHYA3gAsqqr/NcV8OyX5Q5JjZi46SZp7rHvW3CB1T5L7JnlfO0a/TPKjJCcl2W0mY5XAZErz2zOr6r7AA4H3AG8EPrq2N5Jkwdpe5xyxA3BjVV2/mvkOBm4GnpfkXqMPS5LmNOueNTNl3dPqmW8CjwKeAWwK/BFwIrDvTAU5iHn8HqnHZErzXlXdUlXLgOcBS5I8EiDJcUne1ca3SvLldibxpiTfSXKPJB+n+2L/UmtK8X+T7JikkhyS5EfAN3tl/S/OByc5N8mtSb6YZMu2rT2SrOzHOHYGMsk+wN/QJSa/SPK9Nv2OJgYtrr9N8sMk1yc5IclmbdpYHEvambqfJXnzZMcmyWZt+Rva+v62rf/pwGnAA1ocx02yfOiSqb8Ffg88c9z0vZJckeSWJB9O8q1+U4kkL01yeZKbk5ya5IFTvpmStI6w7hlZ3fMiYDvggKq6pKpur6pfVtVJVfW23jYenuS0dlyvSPLc3rTjknwoyVfSXUU8J8mDe9P/NMn3W911NJBx8U9ad7XjcFiSK4ErJzsGmj9MprTeqKpzgZXAUyaY/IY2bSGwNV2lUlX1IuBHdGcaN6mqv+8t8yd0Z8P2nmSTBwMvBbYBVgEfGCDGrwHvBj7dtvfoCWZ7cRueCjwI2AQ4etw8TwYeBuwJvDXJH02yyQ8Cm7X1/EmL+SVV9Q26M3w/aXG8eJLln0xXqZ0IfAa4o219kq2Ak4A3AfcDrgCe2Ju+P91xfg7dcf8O8KlJtiNJ6yTrngmtSd3zdODUqvrlZPuT5D50SdkngfsDBwEfTrKoN9tBwNuBLYAVwFFt2a2Az9OdJNwK+B/gSb11D1J3HQDsBixC857JlN
Y3PwG2nKD893QVzwOr6vdV9Z2qqtWs623tbNivJ5n+8XbW7JfAW4Dnpt0kvIZeALyvqq6qql/QJSsHjTsz+faq+nVVfQ/4HnC3irHFchDwpqq6raquBv6J7qzfoJYAX62qm+kqrX2S3L9N2w+4tKo+X1VjFXr/ZuJXAH9XVZe36e8GdvHqlKR5yLqnWQt1z1b06pIku7Qre7cmuaIVPwO4uqo+VlWrquoC4HPAX/TWc3JVndvqn08Au7TysbrrpKr6PfDPTL/u+ruqummK90jziMmU1jfbAjdNUP4PdGemvp7kqiRHDLCuH09j+g+BDekqgTX1gLa+/roX0J3VHNP/4v8V3RnE8bZqMY1f17aDBJFkY7qK6RMAVfVfdGdS/08vzjuOQfuB0G9i8kDgX1ol+HO69yWDbl+S1iHWPXdao7oHuJEuAQWgqi6sqs3prhSN3bf7QGC3sfql1TEvAPodWkwW60R1V/+YDlJ3re490jxiMqX1RpLH033ZnTV+Wjs79oaqehDwLOD1SfYcmzzJKld39nD73vgOdGcgfwb8Erh3L64N6JoKDLren9B9mffXvQq4bjXLjfezFtP4dV0z4PLPprvx98NJfpquC9ttubOp37V0TQCBO+6v2q63/I+Bl1fV5r1h46r6z2nuhyTNWdY9d7Omdc/pwF6tKd9kfgx8a1z9sklVvXKA9V9L7xi2uqt/TAepu1Z3LDWPmExp3kuyaZJn0N3X8+9VdfEE8zwjyUPal+YtwO3AH9rk6+jadU/XC5MsSnJv4B3ASa372v8GNkryZ0k2pGuX3e8F7zpgx/S60h3nU8BfpeuSfBPubOe+ajrBtVg+AxyVrpvZBwKvB/596iXvsARYStej0i5teBLw6CSPAr4CPCrJAa0ZyGHc9azgvwJvSvIIuOOG5H4TDElaZ1n3TGwt1D0n0CU8Jyd5ZJINkmwELO7N82XgoUlelGTDNjx+inu4+r4CPCLJc1rd9RqsuzQFkynNZ19KchvdWaQ3A+8DXjLJvDsD3wB+AfwX8OGqOqNN+zvgb9sl/b+exvY/DhxH15RgI7ovZKrqFuBVwEfozsT9krs2f/ts+3tjku9OsN6lbd3fBn4A/AZ49TTi6nt12/5VdGdNP9nWP6Uk29LdYPzPVfXT3nA+8DVgSVX9jK4Z4N/TNctYBCwHfgtQVScD7wVOTHIrcAlzrFtbSRqCdc/qDVX3AFTVb+g6wbiMLvG5la6Do8cDz23z3AbsRXdv1k/ojsV7uWvyONn6x+qu99DVXTsD/9Gbbt2lu8jq73OUpDXXznauBF7Q+7EgSZK0zvLKlKSRSbJ3ks3TPWTxb+hu0j17lsOSJElaK0ymJI3SE+ie0fEzugf6HmBXsZIkab6wmZ8kSZIkDcErU5IkSZI0hAWrn2V+2WqrrWrHHXec7TAkab12/vnn/6yqFq5+zvWP9ZQkzb5B66n1LpnacccdWb58+WyHIUnrtSQ/nO0Y5irrKUmafYPWUzbzkyRJkqQhmExJkiRJ0hBMpiRJkiRpCCZTkiRJkjQEkylJkiRJGoLJlCRJkiQNwWRKkiRJkoZgMiVJkiRJQzCZkiRJkqQhLJjtAHSnP9vnLTO6va987Z0zuj1J0sx5ysv9jtf0fef/zexvEWldN7IrU0m2T3JGksuSXJrkta38bUmuSXJhG/brLfOmJCuSXJFk7175Pq1sRZIjeuU7JTmnlX86yT1HtT+SJEmS1DfKZn6rgDdU1SJgd+CwJIvatPdX1S5tOAWgTTsIeASwD/DhJBsk2QD4ELAvsAh4fm89723reghwM3DICPdHkiRJku4wsmSqqq6tqu+28duAy4Ftp1hkf+DEqvptVf0AWAHs2oYVVXVVVf0OOBHYP0mApwEnteWPBw4Yzd5IkiRJ0l3NSAcUSXYEHgOc04oOT3JRkqVJtmhl2wI/7i22spVNVn4/4OdVtWpc+UTbPzTJ8iTLb7jhhrWwR5IkSZLWdyNPppJsAn
wOeF1V3QocAzwY2AW4FvinUcdQVcdW1eKqWrxw4cJRb06SJEnSemCkvfkl2ZAukfpEVX0eoKqu603/N+DL7eU1wPa9xbdrZUxSfiOweZIF7epUf35JkiRJGqlR9uYX4KPA5VX1vl75Nr3Zng1c0saXAQcluVeSnYCdgXOB84CdW89996TrpGJZVRVwBnBgW34J8MVR7Y8kSZIk9Y3yytSTgBcBFye5sJX9DV1vfLsABVwNvBygqi5N8hngMrqeAA+rqtsBkhwOnApsACytqkvb+t4InJjkXcAFdMmbJEmSJI3cyJKpqjoLyASTTplimaOAoyYoP2Wi5arqKrre/iRJkiRpRs1Ib36SJEmSNN+YTEmS1ltJrk5ycZILkyxvZVsmOS3Jle3vFq08ST6QZEV7vMdje+tZ0ua/MsmSXvnj2vpXtGUnarEhSVpHmUxJktZ3T62qXapqcXt9BHB6Ve0MnN5eA+xL1znSzsChdI/6IMmWwJHAbnRNz4/sPUPxGOAve8vtM/rdkSTNFJMpSZLuan/g+DZ+PHBAr/yE6pxN93iObYC9gdOq6qaquhk4DdinTdu0qs5uPdCe0FuXJGkeMJmSJK3PCvh6kvOTHNrKtq6qa9v4T4Gt2/i2wI97y65sZVOVr5ygXJI0T4z0ob2SJM1xT66qa5LcHzgtyff7E6uqktSog2iJ3KEAO+yww6g3J0laS7wyJUlab1XVNe3v9cDJdPc8XTf2gPn29/o2+zXA9r3Ft2tlU5VvN0H5RHEcW1WLq2rxwoUL13S3JEkzxGRKkrReSnKfJPcdGwf2Ai4BlgFjPfItAb7YxpcBB7de/XYHbmnNAU8F9kqyRet4Yi/g1Dbt1iS7t178Du6tS5I0D9jMT5K0vtoaOLn1Vr4A+GRVfS3JecBnkhwC/BB4bpv/FGA/YAXwK+AlAFV1U5J3Aue1+d5RVTe18VcBxwEbA19tgyRpnjCZkiStl6rqKuDRE5TfCOw5QXkBh02yrqXA0gnKlwOPXONgJUlzks38JEmSJGkIJlOSJEmSNASTKUmSJEkagsmUJEmSJA3BZEqSJEmShmAyJUmSJElDMJmSJEmSpCH4nClN6OkveOeMb/Mbn3jLjG9TkiRJGpZXpiRJkiRpCF6ZkiRJ0pyz14lvmu0QtA76+kF/N6Pb88qUJEmSJA3BZEqSJEmShmAyJUmSJElDMJmSJEmSpCGYTEmSJEnSEEymJEmSJGkIJlOSJEmSNASTKUmSJEkagsmUJEmSJA3BZEqSJEmShmAyJUmSJElDMJmSJEmSpCGMLJlKsn2SM5JcluTSJK9t5VsmOS3Jle3vFq08ST6QZEWSi5I8treuJW3+K5Ms6ZU/LsnFbZkPJMmo9keSJEmS+kZ5ZWoV8IaqWgTsDhyWZBFwBHB6Ve0MnN5eA+wL7NyGQ4FjoEu+gCOB3YBdgSPHErA2z1/2lttnhPsjSZIkSXcYWTJVVddW1Xfb+G3A5cC2wP7A8W2244ED2vj+wAnVORvYPMk2wN7AaVV1U1XdDJwG7NOmbVpVZ1dVASf01iVJkiRJIzUj90wl2RF4DHAOsHVVXdsm/RTYuo1vC/y4t9jKVjZV+coJyifa/qFJlidZfsMNN6zRvkiSJEkSzEAylWQT4HPA66rq1v60dkWpRh1DVR1bVYuravHChQtHvTlJkiRJ64GRJlNJNqRLpD5RVZ9vxde1Jnq0v9e38muA7XuLb9fKpirfboJySZIkSRq5UfbmF+CjwOVV9b7epGXAWI98S4Av9soPbr367Q7c0poDngrslWSL1vHEXsCpbdqtSXZv2zq4ty5JkiRJGqkFI1z3k4AXARcnubCV/Q3wHuAzSQ4Bfgg8t007BdgPWAH8CngJQFXdlOSdwHltvndU1U1t/FXAccDGwFfbIEmSJEkjN7JkqqrOAiZ77tOeE8xfwGGTrGspsHSC8uXAI9cgTEnSei7JBsBy4JqqekaSnYATgfsB5wMvqqrfJbkXXc
+xjwNuBJ5XVVe3dbwJOAS4HXhNVZ3ayvcB/gXYAPhIVb1nRndOkjRSM9KbnyRJc9hr6R7fMea9wPur6iHAzXRJEu3vza38/W0+2jMUDwIeQfe8ww8n2aAlaR+ie47iIuD5bV5J0jxhMiVJWm8l2Q74M+Aj7XWApwEntVnGPw9x7DmJJwF7tvn3B06sqt9W1Q/omqvv2oYVVXVVVf2O7mrX/qPfK0nSTDGZkiStz/4Z+L/AH9rr+wE/r6pV7XX/GYZ3PPewTb+lzT/d5yTejc9DlKR1k8mUJGm9lOQZwPVVdf5sx+LzECVp3TTK3vwkSZrLngQ8K8l+wEbApnSdRWyeZEG7+tR/huHYcw9XJlkAbEbXEcVkz0NkinJJ0jzglSlJ0nqpqt5UVdtV1Y50HUh8s6peAJwBHNhmG/88xLHnJB7Y5q9WflCSe7WeAHcGzqV7pMfOSXZKcs+2jWUzsGuSpBnilSlJku7qjcCJSd4FXED3AHra348nWQHcRJccUVWXJvkMcBmwCjisqm4HSHI43cPnNwCWVtWlM7onkqSRMpmSJK33qupM4Mw2fhVdT3zj5/kN8BeTLH8UcNQE5afQPZRekjQP2cxPkiRJkoZgMiVJkiRJQzCZkiRJkqQhmExJkiRJ0hBMpiRJkiRpCCZTkiRJkjQEkylJkiRJGoLJlCRJkiQNwWRKkiRJkoawYLYDkNY1j3vzO2Z8m+cf9dYZ36YkSZKm5pUpSZIkSRqCyZQkSZIkDcFmfpLWmicf9+YZ3+ZZLz5qxrcpSZIEXpmSJEmSpKGYTEmSJEnSEEymJEmSJGkIJlOSJEmSNASTKUmSJEkagsmUJEmSJA3BZEqSJEmShmAyJUmSJElDMJmSJEmSpCGYTEmSJEnSEEaWTCVZmuT6JJf0yt6W5JokF7Zhv960NyVZkeSKJHv3yvdpZSuSHNEr3ynJOa3800nuOap9kSRJkqTxRnll6jhgnwnK319Vu7ThFIAki4CDgEe0ZT6cZIMkGwAfAvYFFgHPb/MCvLet6yHAzcAhI9wXSZIkSbqLkSVTVfVt4KYBZ98fOLGqfltVPwBWALu2YUVVXVVVvwNOBPZPEuBpwElt+eOBA9bqDkiSJEnSFGbjnqnDk1zUmgFu0cq2BX7cm2dlK5us/H7Az6tq1bjyCSU5NMnyJMtvuOGGtbUfkiRJktZjM51MHQM8GNgFuBb4p5nYaFUdW1WLq2rxwoULZ2KTkiRJkua5gZKpJKcPUrY6VXVdVd1eVX8A/o2uGR/ANcD2vVm3a2WTld8IbJ5kwbhySdJ6aG3VU5IkTceUyVSSjZJsCWyVZIskW7ZhR6ZoVjfF+rbpvXw2MNbT3zLgoCT3SrITsDNwLnAesHPrue+edJ1ULKuqAs4ADmzLLwG+ON14JEnrtjWpp9qy5yb5XpJLk7y9lU/YW2yroz7dys9p2xhb17R6pJUkzQ8LVjP95cDrgAcA5wNp5bcCR0+1YJJPAXvQVXArgSOBPZLsAhRwdVs/VXVpks8AlwGrgMOq6va2nsOBU4ENgKVVdWnbxBuBE5O8C7gA+OhguyxJmkeGrqeA3wJPq6pfJNkQOCvJV4HX0/UWe2KSf6XrLfaY9vfmqnpIkoPoepV93rgeaR8AfCPJQ9s2PgT8Kd29veclWVZVl62VPZckzbopk6mq+hfgX5K8uqo+OJ0VV9XzJyieNOGpqqOAoyYoPwU4ZYLyq7izmaAkaT20hvVUAb9oLzdsQ9H1Fvt/WvnxwNvokqn92zh0vcke3XqXvaNHWuAHScZ6pIXWIy1AkhPbvCZTkjRPrO7KFABV9cEkTwR27C9TVSeMKC5JkgY2bD3Vnmd4PvAQuqtI/8PkvcXe0cNsVa1Kcgtd77LbAmf3VttfZnyPtLtNEsehwKEAO+yww1QhS5LmkIGSqSQfp+uF70Lg9lZcgMmUJGnWDVtPtSbluyTZHDgZePgo45wijmOBYwEWL15csxGDJGn6BkqmgMXAot
YkQpKkuWaN6qmq+nmSM4An0HqLbVen+r3FjvUwu7L1JrsZXe+yk/U8yxTlkqR5YNDnTF0C/K9RBiJJ0hqYdj2VZGG7IkWSjek6iricyXuLXdZe06Z/syVv0+qRdsj9kyTNQYNemdoKuCzJuXS9HwFQVc8aSVSSJE3PMPXUNsDx7b6pewCfqaovJ7mMiXuL/Sjw8dbBxE10ydGwPdJKkuaBQZOpt40yCEmS1tDbprtAVV0EPGaC8gl7i62q3wB/Mcm6ptUjrSRpfhi0N79vjToQSZKGZT0lSZoNg/bmdxtdr0gA96R7Fscvq2rTUQUmSdKgrKckSbNh0CtT9x0b7z2gcPdRBSVJ0nRYT0mSZsOgvfndoTpfAPYeQTySJK0R6ylJ0kwZtJnfc3ov70H3PI/fjCQiSZKmyXpKkjQbBu3N75m98VXA1XRNKCRJmguspyRJM27Qe6ZeMupAJEkalvWUJGk2DHTPVJLtkpyc5Po2fC7JdqMOTpKkQVhPSZJmw6AdUHwMWAY8oA1famWSJM0F1lOSpBk3aDK1sKo+VlWr2nAcsHCEcUmSNB3WU5KkGTdoMnVjkhcm2aANLwRuHGVgkiRNg/WUJGnGDZpMvRR4LvBT4FrgQODFI4pJkqTpsp6SJM24QbtGfwewpKpuBkiyJfCPdJWXJEmzzXpKkjTjBr0y9b/HKiiAqroJeMxoQpIkadqspyRJM27QZOoeSbYYe9HO+A16VUuSpFGznpIkzbhBK5p/Av4ryWfb678AjhpNSJIkTZv1lCRpxg2UTFXVCUmWA09rRc+pqstGF5YkSYOznpIkzYaBm0C0SsmKSZI0J1lPSZJm2qD3TEmSJEmSekymJEmSJGkIJlOSJEmSNASTKUmSJEkagsmUJEmSJA3BZEqSJEmShmAyJUmSJElDMJmSJEmSpCGMLJlKsjTJ9Uku6ZVtmeS0JFe2v1u08iT5QJIVSS5K8tjeMkva/FcmWdIrf1ySi9syH0iSUe2LJEmSJI03yitTxwH7jCs7Aji9qnYGTm+vAfYFdm7DocAx0CVfwJHAbsCuwJFjCVib5y97y43fliRJkiSNzMiSqar6NnDTuOL9gePb+PHAAb3yE6pzNrB5km2AvYHTquqmqroZOA3Yp03btKrOrqoCTuitS5IkSZJGbqbvmdq6qq5t4z8Ftm7j2wI/7s23spVNVb5ygvIJJTk0yfIky2+44YY12wNJkiRJYhY7oGhXlGqGtnVsVS2uqsULFy6ciU1Kkua4JNsnOSPJZUkuTfLaVu79vZKkgcx0MnVda6JH+3t9K78G2L4333atbKry7SYolyRpUKuAN1TVImB34LAki/D+XknSgBbM8PaWAUuA97S/X+yVH57kRLrK6JaqujbJqcC7e5XSXsCbquqmJLcm2R04BzgY+OBM7ohm1hMPf+eMbu8/j37LjG5P0sxrzc6vbeO3Jbmcrsn4/sAebbbjgTOBN9K7vxc4O8nY/b170O7vBUgydn/vmbT7e1v52P29X52J/ZMkjd7Ikqkkn6KrYLZKspLurN17gM8kOQT4IfDcNvspwH7ACuBXwEsAWtL0TuC8Nt87xior4FV0PQZuTFcxWTlJkoaSZEfgMXQn6Gb8/t4kh9Jd7WKHHXYYfkckSTNqZMlUVT1/kkl7TjBvAYdNsp6lwNIJypcDj1yTGCVJSrIJ8DngdVV1a/+2pqqqJCO/v7eqjgWOBVi8ePGM3E8sSVpzs9YBhSRJsy3JhnSJ1Ceq6vOt2Pt7JUkDMZmSJK2XWs96HwUur6r39SaN3d8Ld7+/9+DWq9/utPt7gVOBvZJs0e7x3Qs4tU27NcnubVsH99YlSZoHZroDCkmS5oonAS8CLk5yYSv7G7y/V5I0IJMpSdJ6qarOAiZ77pP390qSVstmfpIkSZI0BJMpSZIkSRqCyZQkSZIkDcFkSpIkSZKGYDIlSZIkSUMwmZIkSZKkIZhMSZIkSdIQTKYkSZIkaQgmU5IkSZI0BJ
MpSZIkSRqCyZQkSZIkDcFkSpIkSZKGYDIlSZIkSUMwmZIkSZKkIZhMSZIkSdIQTKYkSZIkaQgmU5IkSZI0BJMpSZIkSRqCyZQkSZIkDcFkSpIkSZKGYDIlSZIkSUMwmZIkSZKkIZhMSZIkSdIQTKYkSZIkaQgmU5IkSZI0BJMpSZIkSRqCyZQkSZIkDWFWkqkkVye5OMmFSZa3si2TnJbkyvZ3i1aeJB9IsiLJRUke21vPkjb/lUmWzMa+SJIkSVo/zeaVqadW1S5Vtbi9PgI4vap2Bk5vrwH2BXZuw6HAMdAlX8CRwG7ArsCRYwmYJEmSJI3aXGrmtz9wfBs/HjigV35Cdc4GNk+yDbA3cFpV3VRVNwOnAfvMdNCSpHVXkqVJrk9ySa9srbWUSPK41hJjRVs2M7uHkqRRmq1kqoCvJzk/yaGtbOuquraN/xTYuo1vC/y4t+zKVjZZ+d0kOTTJ8iTLb7jhhrW1D5Kkdd9x3P1E3NpsKXEM8Je95TzpJ0nzyGwlU0+uqsfSVUyHJfnj/sSqKrqEa62oqmOranFVLV64cOHaWq0kaR1XVd8GbhpXvFZaSrRpm1bV2a1eO6G3LknSPDAryVRVXdP+Xg+cTHcm77pW8dD+Xt9mvwbYvrf4dq1ssnJJktbE2mopsW0bH19+N7agkKR104wnU0nuk+S+Y+PAXsAlwDJgrJ35EuCLbXwZcHBrq747cEur5E4F9kqyRWtOsVcrkyRprVjbLSWm2I4tKCRpHbRgFra5NXByuwd3AfDJqvpakvOAzyQ5BPgh8Nw2/ynAfsAK4FfASwCq6qYk7wTOa/O9o6rGN9WQJGm6rkuyTVVdO42WEnuMKz+zlW83wfySpHlixpOpqroKePQE5TcCe05QXsBhk6xrKbB0bccoSVqvjbWUeA93bylxeJIT6TqbuKUlXKcC7+51OrEX8KZ20u/W1qriHOBg4IMzuSOSpNGajStTkiTNCUk+RXdVaaskK+l65XsPa6+lxKvoegzcGPhqGyRJ84TJlCRpvVVVz59k0lppKVFVy4FHrkmMkqS5ay49tFeSJEmS1hkmU5IkSZI0BJv5Seu4R//jkTO6ve/99dtndHuSJElzlVemJEmSJGkIJlOSJEmSNASTKUmSJEkagsmUJEmSJA3BZEqSJEmShmAyJUmSJElDMJmSJEmSpCGYTEmSJEnSEHxor6R56yVfff2Mbu9j+75vRrcnSZJml1emJEmSJGkIJlOSJEmSNASTKUmSJEkagsmUJEmSJA3BZEqSJEmShmAyJUmSJElDMJmSJEmSpCH4nClJmgHvPeuFM77NNz7532d8m5IkrU+8MiVJkiRJQzCZkiRJkqQhmExJkiRJ0hBMpiRJkiRpCCZTkiRJkjQEkylJkiRJGoLJlCRJkiQNwedMSdJ66EvnPHlGt/fM3c6a0e1JkjQTvDIlSZIkSUMwmZIkSZKkIazzyVSSfZJckWRFkiNmOx5JkvqspyRp/lqnk6kkGwAfAvYFFgHPT7JodqOSJKljPSVJ89u63gHFrsCKqroKIMmJwP7AZYMsvN+jXjnC0CZ2ysXHzPg2JUmzZo3qKUnS3Jaqmu0YhpbkQGCfqnpZe/0iYLeqOnzcfIcCh7aXDwOuWMNNbwX8bA3XsTbNpXjmUiwwt+KZS7GA8UzeABLtAAAQ9ElEQVRlLsUCcyuetRXLA6tq4VpYz5w2i/WUpjaX/qekYfgZHr2B6ql1/crUQKrqWODYtbW+JMuravHaWt+amkvxzKVYYG7FM5diAeOZylyKBeZWPHMplvlkbddTmpqfY63r/AzPHev0PVPANcD2vdfbtTJJkuYC6ylJmsfW9WTqPGDnJDsluSdwELBslmOSJGmM9ZQkzWPrdDO/qlqV5HDgVGADYGlVXToDm55rTTHmUjxzKRaYW/HMpVjAeKYyl2KBuRXPXIplzpvFekpT83OsdZ2f4Tline6AQpIkSZJmy7rezE+SJEmSZoXJlCRJki
QNwWRKkiStE5LcnuTC3rDjCLf14iRHj2r90nhJKsm/914vSHJDki+vZrk9VjePRme9TqaS7Jjk10kubK/3SXJFkhVJjphiua8l+fn4D27rremctvynW89NJDk8yUtXF0OSh42rJG5N8ro235ZJTktyZfu7xSTr2zPJd9vyZyV5SCu/V4tpRYtxx1b+qCTHTXI8Nk9yUpLvJ7k8yROmGct3evvykyRfaOVJ8oEWy0VJHtvKFyb52mTvTyvbIMkF/WM/2XGfIJ4z2/s7FtP91+DYXJ3k4rae5b1tDHpskuSoJP/dju1rhj02STZKcm6S7yW5NMnbp3tsevMvS3LJ6vYnyTOSvGOSY/NXLY5LknwqyUbTfJ+e1/b90iTv7ZVP9j7tnWRVb/uvbdu+NO3/Z5rvzeFtG5Vkq3Hv2d3emzZtSVvvlUmW9MrP6r1P2yc5I8llLbbXDhHbR9v7fFG6/81NVnNs7vgMT/JeLU1yff89n2Y8n0j3P3VJW9eGUx2r8Z9jaQi/rqpdesPVsx2QtBb9Enhkko3b6z/FRynMfVW13g7AjsAlbXwD4H+ABwH3BL4HLJpkuT2BZwJfHlf+GeCgNv6vwCvb+L2BC1YXw7jyDYCf0j19GeDvgSPa+BHAeydZ338Df9TGXwUc1xv/1zZ+EPDp3jLfAHYYHwtwPPCyNn5PYPPpxDIurs8BB7fx/YCvAgF2B87pzfcx4EmTHRvg9cAn+8d+suM+QQxnAosnKB/m2FwNbDXBugZ9n14CnADco72+/7DHps27SRvfEDgH2H06x6ZNf047tpesbn/aNi+g+2z3Y9kW+AGwcW/7Lx40FuB+wI+Ahb3P4J5TvU9t+79o79MjgUtaXAva+/eQab43j2nrvMt7PNl7A2wJXNX+btHGt2jT3gBc18a3AR7bxu9L97+6aJqxbdobf19vmdV+hif6nwL+GHgsd/8/GzSe/drxCPAp7vzOG+hz7OAw3QH4xQRlGwD/QNcN/UXAy1v5HsC3gC+2/8v3AC8AzgUuBh7c5nsm3ffmBe3/ZetW/mLg6Da+kK4eO68NfoYd1vpAV5e9GziwvT4BeCPtNw+wK/Bf7bP6n8DDWvkevXnuAyxtn/MLgP1ne7/m+7BeX5kaZ1dgRVVdVVW/A04E9p9oxqo6HbitX5YkwNOAk1rR8cABbf5fAVcn2XUa8ewJ/E9V/bC93r+t8y7rnig8YNM2vhnwkwmWPwnYs8UM8CW6H2D9/dmM7ofWR9s+/K6qfj7NWMbWtSndsflCb/kTqnM2sHmSbdq0L9BVdhOtZzvgz4CP9MomPe7TMK1jM411TRXLK4F3VNUfAKrq+t7y0zo2bd5ftJcbtqGmc2zaFY7XA+8aZH+q+8Y+E3jGBKtbAGycZAFdUvOTacTyIODKqrqhvf4G8OcTxDL+fbqN7n36I7of7r+qqlV0P6SeM9W+jFdVF9TEZ7sne2/2Bk6rqpuq6mbgNGCftsxpdP+HVNW1VfXdNn4bcDld8jmd2G6FOz73G9P9v6/u2Ez6Ga6qbwM3TbKvg8RzSjseRVdxb9dbftr/49IANs6drQtObmWHALdU1eOBxwN/mWSnNu3RwCvovhteBDy0qnalq0de3eY5i+4E1GPo6v7/O8F2/wV4f9vGn9Orh6S17ETgoNaq43/TJfpjvg88pX1W30qXeI33ZuCb7XP+VOAfktxnxDGv10ym7rQt8OPe65Xc+UNnEPcDft5+wE20/HLgKdNY30F0Z3rHbF1V17bxnwJbT7Lcy4BTkqykqzje08rv2L8W4y0t5sli2wm4AfhYumZ1H+n9Mw4ay5gDgNPHfggy9bGe6jj9M10l94de2eqO+3gfa5XwW3o/Nqd7bKD7Efv1JOcnObRXPuixeTDwvCTLk3w1yc7jY5lgfyY9NumaP14IXE/3w/4cpnds3gn8E/CrceVT7c/d4qmqa4B/pLu6dC3dD5yvTyOWFcDDWnO0BXSfne3btKnep1+3WC4Bnp
LkfknuTXeFZGz56X5ux5vsvZnqPbuVLve5X286rRneY7izkhw4tiQfa/M8HPjg+Nim8RmeyrSOVWve9yJgrAnfsP/j0ur0m/k9u5XtBRzcvgPHvvvGvlPPaycyfkvX+uTrrfxiuiu10J0EODXJxcD/Bzxigu0+HTi6bWMZsOlYM1tpbaqqi+g+m88HThk3eTPgs61p9vuZ+LO6F3BE+6yeCWxE13JDI2IyNXOuBx4wyIzp7iV5FvDZiaa3s8CTPSDsr4D9qmo7uuY07xsytgV0zX+OaWdAfknX3Gc6sYx5PndNDKcbC0meAVxfVecPuJ6JvKCqHkX3Q+4pdD/+hooHeHJVPRbYFzgsyR+Pn2E1x+ZewG+qajHwb3SX5IeNhaq6vap2oftRsGuSRw6wPgCS7ELX3OXkqeabYH/uFk+7t2Z/umT8AcB9krxw0FjalZ1XAp8GvkPX1O72ARZdBTygqi4H3kv3g+lrwIUTLT/g53ZtWUXvOLUfYJ8DXtc7wTBwbFX1kra+y4HnDbD9gb97JtneIMfqw8C3q+o7o45HmkCAV/eSrJ3aSRyA3/bm+0Pv9R/o6jnoTkoc3eqHl9P9+BzvHnRXr8a2sW2vRYC0ti2jOzE5/rfTO4EzquqRdM1TJ/qsBvjz3md1h1Y3akRMpu50DXeewYbuR+k1SXbrNSl41hTL30jXlGVBf/ne9I3ozp4PYl/gu1V1Xa/surFmMu3v9W381BbbR5IsBB7drkpA94P0ieP3r8W4WYt5sthWAit76zqJLrkaKJaxlaS7gX9X4Cu9dU94rKeIBeBJwLOSXE13Cfxp6Xq8mfC4j12pacM74I6rJmNNrD7Z4hrm2PTXdT1wcm9dgx6blcDn2/jJdJfyhz02/bh+DpxB18xs0GPzBGBxO7ZnAQ9NcuZU+zNFPE8HflBVN1TV79s+PnEasVBVX6qq3arqCcAVdPcW3eXYTPA+ZSyWqvpoVT2uqv4YuLm3/MCf20lM9t5M9Z5B9z3767atDekSqU9U1ed780wrtqq6ne7/YKwJ5LQ/w6sxnf/xI+nuJ3l9b/k1+hxL03Qq8Mrc2QHKQ6fZrGkz7vx8Lplknq9zZ7PAsZNQ0qgsBd5eVRePK+9/Vl88ybKnAq8ea32T5DEjiVB3MJm603nAzul6HLsnXTO7ZVV1Ti+7XzbZwu3s7RnAga1oCd1Nr2MeStcEaRATXclZxp1f8nesu6r2brG9jO6H42ZJHtrm+1O6s9fjlz+Qrj3t2Nnmu8VWVT8FfpzkYa1oT+CyacQy5kC6myJ/M25fDk5nd7qmYGNNiiY8TlX1pqrarqp2pHtvvllVL5zsuI9dqWnDW9N1L7oV3PGD9hm97Uzr2CS5T5L7jo3TXVKfaF1THZsv0LVlBvgT7vzBP+1jk66HtM3b+MZ07/v3Bz02VXVMVT2gHdsnA/9dVXtMtT9TxPMjYPck925f5HsClw8aS9uHsV4Wt6DrWGHsh/tU79O9xmLpLb8Dd3aqMem+TPK5nchk782pwF5Jtmgx79XKxiygu2cydPcgXl5V468Yrza2tt2x3jlDd/X6+wMcm+l89wwcT4vjZXT3jD2/2v1/veWn9TmW1sBH6Oqn76Zr/vT/uPOq0yDeRtd06nzgZ5PM8xq6k04XJbmM7j4saSSqamVVfWCCSX8P/F2SC5j8M/5OununL0pyaXutUao50AvGbA3cvWer/eh+1P4P8OYplvsO3f1Ev6a7wrB3K38Q3U3YK+ia6N2rt8x3gfsNEMN96M4obzZuvvsBpwNX0t2Uv+UksT2bri349+jayj6olW/UYlrRYnxQb5mj6S4Xj49lF7r7Gy6i+/G/xXRiafOeCewzrizAh9pxvpheD3vAX9M117jbsenNswd37c1v0uM+7rie3/blUrqbiTcY5ti07X2vDZf2PyvTeJ82p7tadzFdzzyPHvbY0F3VuqDt2yXAW6dzbFbzeZx0f4AvA4+aYJm30/3IvwT4+Ng2B4
2F7kTCZW04qFc+4fvUtn8j8Mze/+dl7f3Zc4j35jV0/9er6Dpw+cgA781LW1wrgJf0yp9Jl0hAl6hWe58ubMN+g8ZGd/LrP9q2LwE+Qevdb7Jj0/8MT/L+foru3rbft30+ZJrHalU7HmP789bpfI4dHBwcHBzW9SFVM3XbwNyT7ibwL1fX9nSU23kM8Pqquts9OjMVw2SS3Iuux7Mn0zXFmbVYWjzfpuvG82aPzd3imUvHZmvgk1W15xyI5aF0P+Q3rTs7uJgTkiwFnlpVO6125tFs/47PcFWtmu33qsV0x+d4tmKQJGltWd+b+d1O1yzuwtXOuWa2At4yyzFMZge658msmu1Y0t3z9b7ejyyPTTNHj80b5kgsW9M9m2P56macBd8HFsyRzzDMvc+xJEnrtPX6ypQkSZIkDWt9vzIlSZIkSUMxmZIkSZKkIZhMSZIkaaSSHJfkwNXPKa1bTKYkSZI0p/Qe8i7NaSZT0hyU5AtJzk9yaZJDW9khSf47yblJ/i3J0a18YZLPJTmvDU+a3eglSeuyJG9JckWSs5J8KslfJ3lwkq+1uuk7SR7e5j0uyQeS/GeSq8auPrWHdh/d1vMN4P699T8uybfauk5Nsk0rPzPJPydZDrx2NvZdmi6zfmluemlV3ZRkY+C8JF+h617/scBtwDfpHkoL3QOI319VZyXZATgV+KPZCFqStG5L8njgz4FHAxsC36V76P2xwCuq6sokuwEfBp7WFtuG7pmMDweWAScBzwYeBiyie4TFZcDSJBsCH6R73twNSZ4HHEX38HOAe1bV4pHvqLSWmExJc9Nrkjy7jW8PvAj4VlXdBJDks8BD2/SnA4uSjC27aZJNquoXMxmwJGleeBLwxar6DfCbJF8CNgKeCHy2V9fcq7fMF6rqD8Bl7aHuAH8MfKqqbgd+kuSbrfxhwCOB09q6NgCu7a3r0yPYJ2lkTKakOSbJHnQJ0hOq6ldJzqR7+OtkV5vuAezeKj5Jkta2ewA/r6pdJpn+2954JpmnP/3SqnrCJNN/Od3gpNnkPVPS3LMZcHNLpB4O7A7cB/iTJFu0m3L/vDf/14FXj71IMlllJ0nS6vwH8MwkGyXZBHgG8CvgB0n+Au64H+rRq1nPt4HnJdmg3RP11FZ+BbAwyRPaujZM8oiR7Ik0A0ympLnna8CCJJcD7wHOBq4B3g2cS1fRXQ3c0uZ/DbA4yUVJLgNeMeMRS5Lmhao6j+6+p4uArwIX09U3LwAOSfI94FJg/9Ws6mTgSrp7pU4A/qut/3fAgcB727oupGtCKK2TUlWzHYOkAYzdB9WuTJ0MLK2qk2c7LknS/NKrb+5Nd4Xp0Kr67mzHJc1F3jMlrTveluTpdDcCfx34wizHI0man45NsoiuvjneREqanFemJEmSJGkI3jMlSZIkSUMwmZIkSZKkIZhMSZIkSdIQTKYkSZIkaQgmU5IkSZI0hP8f2nCP1w0mWmMAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(14,4))\n", "plt.subplot(1,2,1)\n", "sns.countplot(x='age', data=data, palette='viridis')\n", "plt.title(\"Distribution of Age\")\n", "\n", "plt.subplot(1,2,2)\n", "sns.countplot(data['gender'], palette='viridis')\n", "plt.title(\"Distribution of Gender\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Female 0.538013\n", "Male 0.461987\n", "Name: gender, dtype: float64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['gender'].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The age distribution plot shows that our dataset represents an aging population. The most common age range is 70-80 years old. Our population also has a higher proportion of females than males." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### How long were hospital stays for a given admission?" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(8,6))\n", "sns.countplot(data['time_in_hospital'], palette='viridis')\n", "plt.xlabel(\"time in hospital (days)\")\n", "plt.title(\"Length of Hospital Stay\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean time in hospital: 4.39\n" ] } ], "source": [ "print(f\"Mean time in hospital: {data['time_in_hospital'].mean():.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Patients stayed on average 4.4 days in hospital. The longest stay was 14 days." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Number of Diagnoses, Procedures, Medications" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(14,5))\n", "\n", "plt.subplot(1,2,1)\n", "sns.kdeplot(data['num_medications'], shade=True, legend=False)\n", "plt.title(f\"Number of Medications, mean: {data['num_medications'].mean():.2f}\", size=14)\n", "\n", "plt.subplot(1,2,2)\n", "sns.kdeplot(data['num_lab_procedures'], shade=True, legend=False)\n", "plt.title(f\"Number of Lab Procedures, mean: {data['num_lab_procedures'].mean():.2f}\", size=14)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Patients on average were administered 16 medications during their hospital stay. The average number of lab procedures was 43." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What was the most common medical specialty?\n", "\n", "We also have information on the medical specialty of a patient's attending physician. This can give us a sense of the nature of a patient's illness during their hospital stay. For example, \"orthopedics\" would suggest that the patient's presenting issue was bone-related, while \"nephrology\" suggests a kidney problem." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 72 medical specialties.\n" ] }, { "data": { "text/html": [ "
<table>\n", "  <thead>\n", "    <tr><th></th><th>specialty</th><th>count</th><th>prevalence</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>InternalMedicine</td><td>14328</td><td>0.143117</td></tr>\n",
"    <tr><th>1</th><td>Emergency/Trauma</td><td>7449</td><td>0.074405</td></tr>\n",
"    <tr><th>2</th><td>Family/GeneralPractice</td><td>7302</td><td>0.072937</td></tr>\n",
"    <tr><th>3</th><td>Cardiology</td><td>5296</td><td>0.052900</td></tr>\n",
"    <tr><th>4</th><td>Surgery-General</td><td>3068</td><td>0.030645</td></tr>\n",
"    <tr><th>5</th><td>Nephrology</td><td>1544</td><td>0.015422</td></tr>\n",
"    <tr><th>6</th><td>Orthopedics</td><td>1394</td><td>0.013924</td></tr>\n",
"    <tr><th>7</th><td>Orthopedics-Reconstructive</td><td>1231</td><td>0.012296</td></tr>\n",
"    <tr><th>8</th><td>Radiologist</td><td>1129</td><td>0.011277</td></tr>\n",
"    <tr><th>9</th><td>Pulmonology</td><td>856</td><td>0.008550</td></tr>\n",
"  </tbody>\n", "</table>\n", "
" ], "text/plain": [ " specialty count prevalence\n", "0 InternalMedicine 14328 0.143117\n", "1 Emergency/Trauma 7449 0.074405\n", "2 Family/GeneralPractice 7302 0.072937\n", "3 Cardiology 5296 0.052900\n", "4 Surgery-General 3068 0.030645\n", "5 Nephrology 1544 0.015422\n", "6 Orthopedics 1394 0.013924\n", "7 Orthopedics-Reconstructive 1231 0.012296\n", "8 Radiologist 1129 0.011277\n", "9 Pulmonology 856 0.008550" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "medical_specialties = data['medical_specialty'].value_counts().reset_index()\n", "medical_specialties.columns = ['specialty', 'count']\n", "medical_specialties['prevalence'] = medical_specialties['count']/len(data)\n", "print(f\"There are {data['medical_specialty'].nunique()} medical specialties.\")\n", "medical_specialties.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What proportion of patients were on diabetes medication during their hospital stay?" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Yes 0.77184\n", "No 0.22816\n", "Name: diabetesMed, dtype: float64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['diabetesMed'].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "77% of patients were on diabetes medication during their stay." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Do patients have normal A1C levels?\n", "\n", "The A1C blood test is used to diagnose whether a patient has type I or II diabetes, and represents the average levels of blood sugar over the past 3 months. The higher the A1C level, the poorer a patient's blood sugar control which indicates a higher risk of diabetes complications. 
The table below summarizes Mayo Clinic's [guideline](https://www.mayoclinic.org/tests-procedures/a1c-test/about/pac-20384643) for interpreting A1C levels:\n", "\n", "|interpretation|A1C level (%)|\n", "|-----------|--------|\n", "|no diabetes|<5.7|\n", "|pre-diabetes|5.7-6.4|\n", "|diabetes|≥6.5|\n", "|well-managed diabetes|<7|\n", "|poorly managed diabetes|>8|\n", "\n", "Our dataset has an `A1Cresult` column, which reflects a patient's A1C level during their hospital stay. " ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ ">8 0.482994\n", "Norm 0.292783\n", ">7 0.224224\n", "Name: A1Cresult, dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['A1Cresult'].value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Proportion of hospital admissions with missing A1C result: 83.14%\n" ] } ], "source": [ "print(f\"Proportion of hospital admissions with missing A1C result: {data['A1Cresult'].isna().sum()/len(data):.2%}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of the hospital admissions where A1C was measured, almost half had an A1C level greater than 8, which suggests that these patients' diabetes was poorly managed. However, A1C data is sparse in our dataset (83% of admissions have no result), so we may want to leave it out of the first iteration of our model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5: Feature Selection and Engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our dataset contains quite a few categorical variables such as `race`, `age`, and `admission_type`. Most machine learning models can't handle string-valued categorical variables directly, so we'll use one-hot encoding and label encoding to convert them to numerical features. One-hot encoding avoids imposing an order on the categories, while label encoding is suitable when the categories already have an order or the variable is binary. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### One-hot Encoding\n", "\n", "Let's say we want to convert a patient's race to a numerical feature. We could use label encoding to map each race to an integer (0, 1, 2, ...), but this suggests an inherent order among races that does not exist. With one-hot encoding, each race becomes an independent binary feature." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "categorical = ['race', 'admission_type']\n", "\n", "for c in categorical:\n", "    # append one indicator column per category, e.g. race_Asian\n", "    data = pd.concat([data, pd.get_dummies(data[c], prefix=c)], axis=1)\n", "    # drop() returns a new dataframe, so the result must be assigned\n", "    data = data.drop(columns=c)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[70-80) 25562\n", "[60-70) 22185\n", "[50-60) 17102\n", "[80-90) 16706\n", "[40-50) 9626\n", "[30-40) 3765\n", "[90-100) 2668\n", "[20-30) 1650\n", "[10-20) 690\n", "[0-10) 160\n", "Name: age, dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['age'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Label Encoding\n", "\n", "There are a couple of features where label encoding is applicable. \n", "\n", "- age (from `[0-10)` to `[90-100)`) has a natural order, so it can be treated as ordinal\n", "- gender (`Male` or `Female`) is binary, so it converts to 0 and 1\n", "\n", "Let's go ahead and apply scikit-learn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to these two columns."
 ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "label_encoder = LabelEncoder()\n", "\n", "data['age_label'] = label_encoder.fit_transform(data['age'])\n", "data['gender_bool'] = label_encoder.fit_transform(data['gender'].astype(str))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because our `gender` column contained some missing values, it was considered to be a mixed datatype. We had to convert it to a string datatype in order for label encoding to work. Note that missing values become the string `'nan'` and receive their own label, so `gender_bool` is strictly binary only for rows where gender was recorded." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Modelling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 6: Defining the X and y Variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a good understanding of our dataset, we can start building predictive models. The first step is to separate our features and target into the variables `X` and `y`, respectively."
 ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "med_features = ['metformin_bool', 'repaglinide_bool',\n", " 'nateglinide_bool', 'chlorpropamide_bool', 'glimepiride_bool',\n", " 'acetohexamide_bool', 'glipizide_bool', 'glyburide_bool',\n", " 'tolbutamide_bool', 'pioglitazone_bool', 'rosiglitazone_bool',\n", " 'acarbose_bool', 'miglitol_bool', 'troglitazone_bool',\n", " 'tolazamide_bool', 'examide_bool', 'citoglipton_bool', 'insulin_bool',\n", " 'glyburide-metformin_bool', 'glipizide-metformin_bool',\n", " 'glimepiride-pioglitazone_bool', 'metformin-rosiglitazone_bool',\n", " 'metformin-pioglitazone_bool']\n", "\n", "demographic_features = ['race_AfricanAmerican', 'race_Asian',\n", " 'race_Caucasian', 'race_Hispanic', 'race_Other', 'age_label',\n", " 'admission_type_Elective', 'admission_type_Newborn',\n", " 'admission_type_Trauma Center', 'admission_type_Urgent', 'gender_bool']\n", "\n", "other_features = ['num_lab_procedures', 'num_procedures',\n", " 'num_medications', 'number_outpatient', 'number_emergency',\n", " 'number_inpatient', 'number_diagnoses']\n", "\n", "all_features = med_features + demographic_features + other_features\n", "\n", "X = data[all_features]\n", "y = data['readmitted_bool']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 7: Choosing our Model\n", "\n", "When building a binary classification model, there is a wide selection of machine learning models to choose from:\n", "\n", "- Random Forest Classification\n", "- Logistic Regression\n", "- Linear Discriminant Analysis\n", "- Support Vector Machines (SVM)\n", "- Gaussian Naive Bayes\n", "- k-Nearest Neighbours\n", "\n", "We'll test out the [Random Forest Classifier (RFC)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for this dataset. RFC is an ensemble learning technique that works by creating a \"forest\" of decision trees. 
Each tree evaluates the data for a given patient and outputs a 0 or 1. Random Forest looks at the output of all trees and gives the majority vote as its result. Let's say we have a forest with 3 trees and 2 of them predict the patient will be readmitted. The majority vote is that the patient will be readmitted.\n", "\n", "![](images/random_forest.png)\n", "\n", "We're choosing Random Forest because:\n", "\n", "- it is robust to outliers\n", "- it can be adapted to unbalanced datasets (for example, through the `class_weight` parameter)\n", "- it measures feature importance\n", "\n", "We'll import RFC from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), a comprehensive Python library for machine learning and data analysis." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can inspect the default parameters for RandomForestClassifier by creating an instance of the class and calling `get_params()`:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bootstrap': True,\n", " 'class_weight': None,\n", " 'criterion': 'gini',\n", " 'max_depth': None,\n", " 'max_features': 'auto',\n", " 'max_leaf_nodes': None,\n", " 'min_impurity_decrease': 0.0,\n", " 'min_impurity_split': None,\n", " 'min_samples_leaf': 1,\n", " 'min_samples_split': 2,\n", " 'min_weight_fraction_leaf': 0.0,\n", " 'n_estimators': 'warn',\n", " 'n_jobs': None,\n", " 'oob_score': False,\n", " 'random_state': None,\n", " 'verbose': 0,\n", " 'warm_start': False}" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "RandomForestClassifier().get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can keep the default values for most of these parameters. 
But there are a few that are set before training and can have a real impact on model performance. These are called [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)). Some RFC hyperparameters include:\n", "\n", "- `n_estimators`: number of trees in the forest\n", "- `max_depth`: maximum number of levels in each decision tree\n", "- `max_features`: maximum number of features considered for splitting a node\n", "- `min_samples_split`: minimum number of samples a node must contain before it can be split\n", "\n", "These are external configurations that can't be learned from training the model. To select good values for them, we'll need to use a technique called hyperparameter tuning. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 8: Hyperparameter Tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hyperparameter tuning is a critical step in the machine learning pipeline. It describes the process of choosing a set of optimal hyperparameters for a model. The hyperparameters that you select can have a significant impact on your model's performance. \n", "\n", "We're going to be testing out two hyperparameter tuning techniques offered by scikit-learn:\n", "\n", "- [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)\n", "- [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)\n", "\n", "\n", "\n", "With grid search, you define your search space as a grid of values and evaluate every grid point to find the best combination of values. Let's say we want to tune `max_depth` and `n_estimators` in our RandomForestClassifier. We'll set our search space as follows:\n", "\n", "- n_estimators = [5,10,50]\n", "- max_depth = [3,5,10]\n", "\n", "This gives us 3 × 3 = 9 configurations to test, and with cross-validation each configuration is trained once per fold. 
We'll choose the combination of `n_estimators` and `max_depth` that gives us the best model performance.\n", "\n", "Let's implement this with scikit-learn's GridSearchCV. We first need to define our search space as a dictionary. We also need to initialize our model." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "search_space = {\n", " 'n_estimators': [5,10,50],\n", " 'max_depth': [3,5,10]\n", "}\n", "\n", "rfc = RandomForestClassifier(random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we set up grid search. GridSearchCV also performs [cross-validation](https://machinelearningmastery.com/k-fold-cross-validation/) (hence the `'CV'`) so we can specify how many folds we want in our analysis. We'll set our number of folds to 3. The more folds you use, the longer it will take to compute results." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "grid_search = GridSearchCV(rfc, search_space, cv=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last step is to run `fit`. We'll pass in `X` and `y`, which we created in *Step 6*. GridSearchCV will use RFC's default metric, [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.score). If we want to optimize our model using another metric, we can specify `scoring='precision'` (or whichever metric we're interested in) inside GridSearchCV. Let's stick with accuracy for now."
 ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=3, error_score='raise-deprecating',\n", " estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,\n", " oob_score=False, random_state=42, verbose=0, warm_start=False),\n", " fit_params=None, iid='warn', n_jobs=None,\n", " param_grid={'n_estimators': [5, 10, 50], 'max_depth': [3, 5, 10]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", " scoring=None, verbose=0)" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the optimal hyperparameters?" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimal hyperparameters: {'max_depth': 5, 'n_estimators': 50}\n", "Best score: 0.617\n" ] } ], "source": [ "print(f\"Optimal hyperparameters: {grid_search.best_params_}\")\n", "print(f\"Best score: {grid_search.best_score_:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Within the search space we defined for GridSearchCV, a `max_depth` of 5 and `n_estimators` of 50 are our optimal hyperparameters, giving a mean cross-validated accuracy of 0.617. \n", "\n", "We can also see a thorough report of our results with `cv_results_`. It shows fit time, score time, and mean train/test score (averaged over all folds). We'll sort by `mean_test_score`." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_fit_timestd_fit_timemean_score_timestd_score_timeparam_max_depthparam_n_estimatorsparamssplit0_test_scoresplit1_test_scoresplit2_test_scoremean_test_scorestd_test_scorerank_test_scoresplit0_train_scoresplit1_train_scoresplit2_train_scoremean_train_scorestd_train_score
51.1323750.0095370.1216200.001741550{'max_depth': 5, 'n_estimators': 50}0.6090140.6332140.6083730.6168670.01156210.6336940.6216830.6274370.6276050.004905
20.8218720.0077620.1003300.000355350{'max_depth': 3, 'n_estimators': 50}0.6002640.6274610.6130470.6135910.01111020.6245090.6118390.6191210.6184900.005192
81.9070940.0711640.2023620.0287451050{'max_depth': 10, 'n_estimators': 50}0.6103320.6217670.6068140.6129710.00638330.6553440.6478280.6472590.6501440.003685
40.2510500.0080860.0355880.000535510{'max_depth': 5, 'n_estimators': 10}0.6061070.6245540.5966860.6091160.01157440.6315960.6153750.6229120.6232940.006628
70.3995700.0041940.0455920.0006991010{'max_depth': 10, 'n_estimators': 10}0.6095530.6142160.6015100.6084260.00524850.6541610.6440680.6421950.6468080.005255
\n", "
" ], "text/plain": [ " mean_fit_time std_fit_time mean_score_time std_score_time \\\n", "5 1.132375 0.009537 0.121620 0.001741 \n", "2 0.821872 0.007762 0.100330 0.000355 \n", "8 1.907094 0.071164 0.202362 0.028745 \n", "4 0.251050 0.008086 0.035588 0.000535 \n", "7 0.399570 0.004194 0.045592 0.000699 \n", "\n", " param_max_depth param_n_estimators params \\\n", "5 5 50 {'max_depth': 5, 'n_estimators': 50} \n", "2 3 50 {'max_depth': 3, 'n_estimators': 50} \n", "8 10 50 {'max_depth': 10, 'n_estimators': 50} \n", "4 5 10 {'max_depth': 5, 'n_estimators': 10} \n", "7 10 10 {'max_depth': 10, 'n_estimators': 10} \n", "\n", " split0_test_score split1_test_score split2_test_score mean_test_score \\\n", "5 0.609014 0.633214 0.608373 0.616867 \n", "2 0.600264 0.627461 0.613047 0.613591 \n", "8 0.610332 0.621767 0.606814 0.612971 \n", "4 0.606107 0.624554 0.596686 0.609116 \n", "7 0.609553 0.614216 0.601510 0.608426 \n", "\n", " std_test_score rank_test_score split0_train_score split1_train_score \\\n", "5 0.011562 1 0.633694 0.621683 \n", "2 0.011110 2 0.624509 0.611839 \n", "8 0.006383 3 0.655344 0.647828 \n", "4 0.011574 4 0.631596 0.615375 \n", "7 0.005248 5 0.654161 0.644068 \n", "\n", " split2_train_score mean_train_score std_train_score \n", "5 0.627437 0.627605 0.004905 \n", "2 0.619121 0.618490 0.005192 \n", "8 0.647259 0.650144 0.003685 \n", "4 0.622912 0.623294 0.006628 \n", "7 0.642195 0.646808 0.005255 " ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = pd.DataFrame(grid_search.cv_results_).sort_values(by='mean_test_score', ascending=False)\n", "results.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In 2012, Bergstra and Bengio from the University of Montreal proposed a new technique called [random search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) which is similar to grid search but instead of sampling over a discrete set of values, you’re now randomly sampling from 
a distribution of values. Random search is effective in situations where not all hyperparameters are equally important.\n", "\n", "![](images/random_grid_search.png)\n", "\n", "The visualization above gives an example of when random search can perform better. With grid search, you’re only looking at three different values of a given hyperparameter. But with random search you’re looking at nine different values. As you increase the number of samples in your random search, you increase the probability of finding the optimal hyperparameters for your model. \n", "\n", "Let's test out random search using scikit-learn's RandomizedSearchCV. We'll define our search space over a uniform distribution of integer values. We'll run 9 iterations, matching the 9 combinations that grid search evaluated. " ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimal hyperparameters: {'max_depth': 4, 'n_estimators': 59}\n", "Best score: 0.615\n" ] } ], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "from scipy.stats import randint\n", "\n", "search_space = {\n", " \"n_estimators\": randint(10,100),\n", " \"max_depth\": randint(1, 11)\n", "}\n", "\n", "random_search = RandomizedSearchCV(rfc, param_distributions=search_space, n_iter=9, cv=3)\n", "random_search.fit(X,y)\n", "\n", "print(f\"Optimal hyperparameters: {random_search.best_params_}\")\n", "print(f\"Best score: {random_search.best_score_:.3f}\")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_fit_timestd_fit_timemean_score_timestd_score_timeparam_max_depthparam_n_estimatorsparamssplit0_test_scoresplit1_test_scoresplit2_test_scoremean_test_scorestd_test_scorerank_test_scoresplit0_train_scoresplit1_train_scoresplit2_train_scoremean_train_scorestd_train_score
71.1497460.0213920.1291110.002811459{'max_depth': 4, 'n_estimators': 59}0.6061070.6298580.6098410.6152690.01042810.6298730.6147460.6248900.6231700.006294
01.1365150.0573280.1265260.004007458{'max_depth': 4, 'n_estimators': 58}0.6051780.6300680.6091520.6147990.01091820.6296340.6147910.6251740.6232000.006218
12.2006760.4307520.1663570.003467665{'max_depth': 6, 'n_estimators': 65}0.6098530.6238950.6096910.6144790.00665830.6365110.6259380.6300290.6308260.004353
62.0045840.0410660.2052040.006243680{'max_depth': 6, 'n_estimators': 80}0.6103620.6222470.6083130.6136400.00614340.6366610.6262230.6303130.6310660.004294
42.8345320.1513900.2631070.004990977{'max_depth': 9, 'n_estimators': 77}0.6101520.6239250.6061250.6134010.00762150.6484220.6412210.6415200.6437210.003326
\n", "
" ], "text/plain": [ " mean_fit_time std_fit_time mean_score_time std_score_time \\\n", "7 1.149746 0.021392 0.129111 0.002811 \n", "0 1.136515 0.057328 0.126526 0.004007 \n", "1 2.200676 0.430752 0.166357 0.003467 \n", "6 2.004584 0.041066 0.205204 0.006243 \n", "4 2.834532 0.151390 0.263107 0.004990 \n", "\n", " param_max_depth param_n_estimators params \\\n", "7 4 59 {'max_depth': 4, 'n_estimators': 59} \n", "0 4 58 {'max_depth': 4, 'n_estimators': 58} \n", "1 6 65 {'max_depth': 6, 'n_estimators': 65} \n", "6 6 80 {'max_depth': 6, 'n_estimators': 80} \n", "4 9 77 {'max_depth': 9, 'n_estimators': 77} \n", "\n", " split0_test_score split1_test_score split2_test_score mean_test_score \\\n", "7 0.606107 0.629858 0.609841 0.615269 \n", "0 0.605178 0.630068 0.609152 0.614799 \n", "1 0.609853 0.623895 0.609691 0.614479 \n", "6 0.610362 0.622247 0.608313 0.613640 \n", "4 0.610152 0.623925 0.606125 0.613401 \n", "\n", " std_test_score rank_test_score split0_train_score split1_train_score \\\n", "7 0.010428 1 0.629873 0.614746 \n", "0 0.010918 2 0.629634 0.614791 \n", "1 0.006658 3 0.636511 0.625938 \n", "6 0.006143 4 0.636661 0.626223 \n", "4 0.007621 5 0.648422 0.641221 \n", "\n", " split2_train_score mean_train_score std_train_score \n", "7 0.624890 0.623170 0.006294 \n", "0 0.625174 0.623200 0.006218 \n", "1 0.630029 0.630826 0.004353 \n", "6 0.630313 0.631066 0.004294 \n", "4 0.641520 0.643721 0.003326 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = pd.DataFrame(random_search.cv_results_).sort_values(by='mean_test_score', ascending=False)\n", "results.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grid Search and Random Search are uninformed methods which means that they do not take into consideration results from past evaluations. 
When you're working with a very large search space, you might want to consider a \"smarter\" approach to hyperparameter tuning such as Sequential Model-Based Optimization (SMBO). The SMBO approach keeps track of results from previous iterations and uses them to choose which hyperparameters to sample at the current iteration. In other words, SMBO tries to reduce the number of iterations by sampling the most promising hyperparameters based on past results. You can check out [scikit-optimize](https://scikit-optimize.github.io/) to learn more about how to implement SMBO hyperparameter tuning with scikit-learn models. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 9: Evaluating Model Performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several metrics that we can use to evaluate model performance:\n", "\n", "- accuracy\n", "- precision\n", "- recall\n", "- F1-score\n", "- ROC AUC score\n", "\n", "A comprehensive list of classification metrics can be found in scikit-learn's metrics module [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics). \n", "\n", "In this walkthrough, we'll look at accuracy, precision and recall. But before we can start evaluating our model, let's split our data into two parts: 1) a training set and 2) a test set. We'll fit our model on the training data, and evaluate its performance using the test set."
] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=4, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=59, n_jobs=None,\n", " oob_score=False, random_state=99, verbose=0, warm_start=False)" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", "\n", "rfc = RandomForestClassifier(**random_search.best_params_, random_state=99)\n", "rfc.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accuracy is RFC's default evaluation metric. Using the `score` method, we get the accuracy of the trained model on the test set. " ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.620\n" ] } ], "source": [ "accuracy = rfc.score(X_test, y_test)\n", "\n", "print(f\"Accuracy: {accuracy:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An accuracy score of 0.62 means that ~62% of hospital admissions were correctly labeled. The dataset that we're working with is relatively balanced, but let's say we have an imbalanced dataset where 90% of patients ended up being readmitted. Using accuracy to evaluate a model trained on imbalanced data is a problem because if we automatically predicted all patients to be readmitted, our accuracy would be 90% by default. Precision and recall are better ways to evaluate performance of models trained on imbalanced data. \n", "\n", "#### Precision and Recall \n", "\n", "Precision and recall are information retrieval metrics that evaluate classification models. 
\n", "\n", "- [Precision](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall) is the \"fraction of relevant instances among the retrieved instances\".\n", " - What proportion of predicted readmitted patients were actually readmitted?\n", "- Recall is the \"fraction of the total amount of relevant instances that were actually retrieved\".\n", " - What proportion of readmitted patients were identified correctly?\n", "\n", "Looking at the equations below, we can see that precision aims to minimize the number of **False Positives**, while recall aims to minimize the number of **False Negatives**.\n", "\n", "![](images/precision_recall.png)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.621\n", "Recall: 0.451\n" ] } ], "source": [ "from sklearn.metrics import precision_score, recall_score, confusion_matrix\n", "\n", "y_pred = rfc.predict(X_test)\n", "\n", "precision = precision_score(y_true=y_test, y_pred=y_pred)\n", "recall = recall_score(y_true=y_test, y_pred=y_pred)\n", "\n", "print(f\"Precision: {precision:.3f}\")\n", "print(f\"Recall: {recall:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Of the patients who were labelled as readmitted, ~62% were actually readmitted.\n", "- Of the patients who were actually readmitted, 45% were labelled as readmitted. \n", "\n", "Another way to assess our model's performance is to visualize our results with a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix)." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(33.0, 0.5, 'actual')" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "confusion = confusion_matrix(y_true=y_test, y_pred=y_pred)\n", "labels = np.array([['TN','FP'],['FN','TP']])\n", "\n", "sns.heatmap(confusion,annot=labels, fmt='', linewidths=2, cmap=\"Blues\")\n", "plt.xlabel(\"predicted\")\n", "plt.ylabel(\"actual\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The heatmap above shows that we have more False Negatives than False Positives. This means that when we predict a patient will be readmitted, there's a good chance that we got it right. But when we predict that a patient won't be readmitted, we're missing quite a few patients who actually do return to the hospital. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 10: Examining Feature Importance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With RandomForestClassifier, we can dig further to examine which features were the most important in classification." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
featuresimportance
39number_inpatient0.502475
38number_emergency0.145222
37number_outpatient0.103304
40number_diagnoses0.095260
36num_medications0.049601
28age_label0.021661
29admission_type_Elective0.019275
34num_lab_procedures0.018061
35num_procedures0.016270
17insulin_bool0.009118
0metformin_bool0.005045
25race_Caucasian0.003034
1repaglinide_bool0.001752
6glipizide_bool0.001735
24race_Asian0.001074
\n", "
" ], "text/plain": [ " features importance\n", "39 number_inpatient 0.502475\n", "38 number_emergency 0.145222\n", "37 number_outpatient 0.103304\n", "40 number_diagnoses 0.095260\n", "36 num_medications 0.049601\n", "28 age_label 0.021661\n", "29 admission_type_Elective 0.019275\n", "34 num_lab_procedures 0.018061\n", "35 num_procedures 0.016270\n", "17 insulin_bool 0.009118\n", "0 metformin_bool 0.005045\n", "25 race_Caucasian 0.003034\n", "1 repaglinide_bool 0.001752\n", "6 glipizide_bool 0.001735\n", "24 race_Asian 0.001074" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_importances = {\n", " 'features': list(X.columns.values),\n", " 'importance': list(rfc.feature_importances_)\n", "}\n", "\n", "important_features = pd.DataFrame(feature_importances)\n", "important_features.sort_values(by='importance', ascending=False).head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most important features appear to be `number_inpatient` and `number_emergency`, which represent the number of inpatient and emergency visits of the patient in the year preceding the encounter. This suggests that patients who were admitted to the hospital in the past are more likely to be readmitted in the future. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }