Predicting Hospital Readmission with a Binary Classification Model

In this tutorial, we’ll be looking at hospital admission data in patients with diabetes. This dataset was collected from 130 hospitals in the United States from 1999 to 2008. More details can be found on the UCI Machine Learning Repository website.

This walkthrough is divided into two parts:

  • Data preprocessing (Steps 1-5), which involves data cleaning, exploration, and feature engineering
  • Data modelling (Steps 6-10), where we will train a machine learning model to predict whether a patient will be readmitted to the hospital

Data Preprocessing

Step 1: Importing Dependencies

Before getting started, we’ll need to import several packages. These include:

  • pandas - a package for performing data analysis and manipulation
  • numpy - a package for scientific computing
  • matplotlib - the standard Python plotting package
  • seaborn - a dataframe-centric visualization package built on top of matplotlib
In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Load the Data

We will be loading in the data as a pandas.DataFrame.

The data is stored in a csv file, which we can access locally (see data/patient_data.csv) or in the cloud (stored in an AWS S3 bucket). We’ll load the S3 version of this data using the pandas read_csv method.

In [3]:
data = pd.read_csv("https://s3.us-east-2.amazonaws.com/explore.datasets/diabetes/patient_data.csv")

To get a glimpse of our data, we can use the head() method, which shows the first 5 rows of the dataframe.

In [4]:
data.head()
Out[4]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id time_in_hospital medical_specialty ... examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone diabetesMed readmitted
0 2278392 8222157 Caucasian Female [0-10) NaN 6 25 1 Pediatrics-Endocrinology ... No No No No No No No No No NO
1 149190 55629189 Caucasian Female [10-20) NaN 1 1 3 NaN ... No No Up No No No No No Yes >30
2 64410 86047875 AfricanAmerican Female [20-30) NaN 1 1 2 NaN ... No No No No No No No No Yes NO
3 500364 82442376 Caucasian Male [30-40) NaN 1 1 2 NaN ... No No Up No No No No No Yes NO
4 16680 42519267 Caucasian Male [40-50) NaN 1 1 1 NaN ... No No Steady No No No No No Yes NO

5 rows × 44 columns

How many rows and columns are in our dataset?

In [5]:
data.shape
Out[5]:
(101766, 44)

Our dataset has 101,766 rows and 44 columns. Each row represents a unique hospital admission. Columns represent patient demographics, medical details, and admission-specific information such as length of stay (time_in_hospital). We can see a list of all columns by applying .columns to our dataframe.

In [6]:
print(f"Columns: {data.columns.tolist()}")
Columns: ['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'time_in_hospital', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'diabetesMed', 'readmitted']

Looking at the columns, we can see that a large proportion are medication names. Let’s store these column names as a separate list, which we’ll get back to in a bit.

In [7]:
medications = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
    'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
    'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
    'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
    'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone']

print(f"There are {len(medications)} medications represented as columns in the dataset.")
There are 23 medications represented as columns in the dataset.

How many hospital admissions and unique patients are in the dataset?

In [8]:
n_admissions = data['encounter_id'].nunique()
n_patients = data['patient_nbr'].nunique()

print(f"Number of hospital admissions: {n_admissions:,}")
print(f"Number of unique patients: {n_patients:,}")
Number of hospital admissions: 101,766
Number of unique patients: 71,518

How many patients have had more than one hospital admission?

In [9]:
admissions_per_patient = data['patient_nbr'].value_counts().reset_index()
admissions_per_patient.columns = ['patient_nbr', 'count']
multiple_admissions = admissions_per_patient[admissions_per_patient['count'] > 1]
In [10]:
print(f"Proportion of patients that have multiple admissions: {multiple_admissions['patient_nbr'].nunique()/n_patients:.2%}")
print(f"Maximum number of admissions for a given patient: {multiple_admissions['count'].max()}")
Proportion of patients that have multiple admissions: 23.45%
Maximum number of admissions for a given patient: 40

Almost one-quarter of the patients (23.45%) have had more than one hospital admission. The maximum number of hospital admissions for a given patient is 40.

Step 3: Data Cleaning

Data cleaning is a crucial step in the machine learning pipeline, and typically requires the most time and effort in any data science project.

Decoding Admission Type

The admission_type_id column describes the type of admission and is represented by integers. The id column links to descriptors found in a separate file. We’ll update this column so that it represents the descriptor name instead of simply the id number.

Our mapper files are located in data/id_mappers/. They are also stored in the cloud in an AWS S3 bucket.

In [11]:
admission_type = pd.read_csv("https://s3.us-east-2.amazonaws.com/explore.datasets/diabetes/id_mappers/admission_type_id.csv")
admission_type
Out[11]:
admission_type_id description
0 1 Emergency
1 2 Urgent
2 3 Elective
3 4 Newborn
4 5 Not Available
5 6 NaN
6 7 Trauma Center
7 8 Not Mapped

We can see that the admission type mapper file has 3 values which represent missing data:

  1. NaN
  2. ‘Not Mapped’
  3. ‘Not Available’

Let’s collapse these into one category that represents a missing value, using pandas’ replace method.

In [12]:
# 'Not Available' and 'Not Mapped' become NaN; the blank description for
# id 6 was already parsed as NaN by read_csv
missing_values = ['Not Available', 'Not Mapped']
admission_type['description'] = admission_type['description'].replace(missing_values, np.nan)
In [13]:
admission_type.columns = ['admission_type_id', 'admission_type']
In [14]:
data = data.merge(admission_type, on='admission_type_id')

Now that the cleaned mapper has been merged into our dataset, each admission has a human-readable admission_type column. Let’s look at the distribution of admission types.

In [15]:
data['admission_type'].value_counts()
Out[15]:
Emergency        53990
Elective         18869
Urgent           18480
Trauma Center       21
Newborn             10
Name: admission_type, dtype: int64
In [16]:
sns.countplot(x='admission_type', data=data, palette='magma')
plt.xlabel('type of hospital admission')
plt.show()
[figure: countplot of admission types]

Decoding Discharge Location

In [17]:
discharge_disposition = pd.read_csv("https://s3.us-east-2.amazonaws.com/explore.datasets/diabetes/id_mappers/discharge_disposition_id.csv")
discharge_disposition.sample(n=5, random_state=416)
Out[17]:
discharge_disposition_id description
13 14 Hospice / medical facility
8 9 Admitted as an inpatient to this hospital
14 15 Discharged/transferred within this institution...
20 21 Expired, place unknown. Medicaid only, hospice.
29 29 Discharged/transferred to a Critical Access Ho...

In medicine, “expired” is a term that describes a patient who has died. Since we only want to predict hospital readmission for living patients, we’ll remove admissions in which the patient was recorded as expired upon discharge.

We’ll first convert our description column to lowercase (str.lower()), then we’ll search for rows that contain “expired” (str.contains("expired")).

In [18]:
discharge_disposition['expired'] = discharge_disposition['description'].str.lower().str.contains('expired')

Now let’s take a look at all discharge dispositions that indicate an expired patient. We’ll create a new dataframe that filters for rows in which the expired column is True.

In [19]:
discharge_expired = discharge_disposition[discharge_disposition['expired']==True]
discharge_expired
Out[19]:
discharge_disposition_id description expired
10 11 Expired True
18 19 Expired at home. Medicaid only, hospice. True
19 20 Expired in a medical facility. Medicaid only, ... True
20 21 Expired, place unknown. Medicaid only, hospice. True
In [20]:
expired_ids = discharge_expired['discharge_disposition_id'].tolist()
print(f"discharge_disposition_id's that indicate an expired patient: {expired_ids}")
discharge_disposition_id's that indicate an expired patient: [11, 19, 20, 21]

The next step is to remove all rows in our original dataset that have a discharge_disposition_id equal to one of the values in our expired_ids list.

In [21]:
data = data[~data['discharge_disposition_id'].isin(expired_ids)]

After removing expired patients, how many patients do we have in our dataset?

In [22]:
n_patients_nonexpired = data['patient_nbr'].nunique()
print(f"Original number of patients: {n_patients:,}")
print(f"Number of expired patients: {n_patients-n_patients_nonexpired:,}")
print(f"After filtering out expired patients: {n_patients_nonexpired:,}")
Original number of patients: 71,518
Number of expired patients: 1,079
After filtering out expired patients: 70,439

We had 1,079 expired patients in our dataset. After removing them, our dataset has 70,439 patients.

Converting Medication Features From Categorical to Boolean

Remember that list of medications we created when we loaded our data in Step 2? We’re going to convert these medication columns into boolean variables.

In [23]:
data[medications[0]].value_counts()
Out[23]:
No        80216
Steady    18256
Up         1067
Down        575
Name: metformin, dtype: int64

Medication columns are currently categorical datatypes that have several possible categories including:

  • “No” (not taking the medication)
  • “Up” (increased medication dose)
  • “Down” (decreased medication dose)
  • “Steady” (no changes in dose)

To keep things simple, we’ll recode each column as 0 (not taking the medication) or 1 (taking the medication). We lose the information about dose changes, but it’s a compromise we’re willing to make in order to simplify our dataset.

We can use numpy.where to convert all instances of “No” to 0 and everything else (i.e., “Up”, “Down”, “Steady”) to 1. Let’s loop through all medications and convert each column to boolean.

In [24]:
for m in medications:
    data[f'{m}_bool'] = np.where(data[m]=='No', 0, 1)
    data = data.drop(columns=m)

Our medication data are now represented as boolean features. Let’s take a look at the prevalence of these medications by calculating the proportion of patients taking each one. Because some patients have had multiple hospital admissions in this dataset, we’ll need to do some wrangling to determine whether a patient was on a given medication during any of their admissions. The wrangling process consists of the following steps:

  • group by patient_nbr and sum the number of admissions in which the patient was administered the medication
  • convert the summed column to boolean, so that 0 becomes False and any positive count becomes True
  • sum the boolean column to count the patients on that medication
  • divide by the total number of patients to get the medication’s prevalence
In [25]:
prevalence = []

for m in medications:
    patient_meds = data.groupby('patient_nbr')[f'{m}_bool'].sum().reset_index()
    patient_meds[f'{m}_bool'] = patient_meds[f'{m}_bool'].astype(bool)
    n_patients_on_med = patient_meds[f'{m}_bool'].sum()
    proportion = n_patients_on_med/n_patients
    prevalence.append(proportion)

Now that we have a list of medication prevalence, we can create a dataframe and sort by prevalence to determine which medications are most prevalent in our dataset.

In [26]:
medication_counts = pd.DataFrame({'medication': medications, 'prevalence':prevalence})
medication_counts = medication_counts.sort_values(by='prevalence', ascending=False)
medication_counts.head()
Out[26]:
medication prevalence
17 insulin 0.543779
0 metformin 0.229523
6 glipizide 0.138916
7 glyburide 0.118851
9 pioglitazone 0.082525

Let’s also visualize the top 10 most prevalent medications. We’ll use seaborn’s barplot method.

In [27]:
sns.barplot(x='medication', y='prevalence', data=medication_counts.head(10), palette='viridis')
plt.xticks(rotation=90)
plt.show()
[figure: barplot of the top 10 most prevalent medications]

We can see that insulin is by far the most prevalent medication, followed by metformin. More than half of the patients in our dataset were prescribed insulin, while 23% were prescribed metformin.

MeSH (Medical Subject Headings) terms are a type of “tag” that describes a medical concept. We’ll use RxNav’s API to further investigate which MeSH terms are associated with our list of medications. We’ll create a function called get_mesh_from_drug_name which returns the relevant MeSH terms for a given drug name.

In [28]:
import requests

def get_mesh_from_drug_name(drug_name):
    """Return the MeSH class names associated with a drug, via the RxNav API."""
    drug_name = drug_name.strip()
    rxclass_list = []
    try:
        r = requests.get(f"https://rxnav.nlm.nih.gov/REST/rxclass/class/byDrugName.json?drugName={drug_name}&relaSource=MESH")
        response = r.json()
        all_concepts = response['rxclassDrugInfoList']['rxclassDrugInfo']
        for i in all_concepts:
            rxclass_list.append(i['rxclassMinConceptItem']['className'])
    except (requests.RequestException, KeyError, ValueError):
        # No MeSH classes found for this drug (or the request failed)
        pass
    return list(set(rxclass_list))

We’ll pass the top 10 medications into get_mesh_from_drug_name and get their relevant MeSH terms. The results will be stored in a dictionary that we’ll call med_mesh_descriptors.

In [29]:
top_ten_meds = medication_counts.head(10)['medication'].tolist()

med_mesh_descriptors = dict()
for m in top_ten_meds:
    med_mesh_descriptors[m] = get_mesh_from_drug_name(m)
In [30]:
med_mesh_descriptors
Out[30]:
{'insulin': [],
 'metformin': ['Hypoglycemic Agents'],
 'glipizide': ['Hypoglycemic Agents'],
 'glyburide': ['Hypoglycemic Agents'],
 'pioglitazone': ['Hypoglycemic Agents'],
 'rosiglitazone': ['Hypoglycemic Agents'],
 'glimepiride': ['Immunosuppressive Agents',
  'Hypoglycemic Agents',
  'Anti-Arrhythmia Agents'],
 'repaglinide': ['Hypoglycemic Agents'],
 'nateglinide': ['Hypoglycemic Agents'],
 'glyburide-metformin': ['Hypoglycemic Agents']}

The results above show that every medication that returned results carries the MeSH term ‘Hypoglycemic Agents’, which means they are anti-diabetic medications (insulin returned no MeSH terms here). Interestingly, glimepiride has two additional associated MeSH terms, ‘Anti-Arrhythmia Agents’ and ‘Immunosuppressive Agents’, on top of being a hypoglycemic agent.

If you want to learn more about each medication in our dataset, check out the RxNav dashboard which gives an overview of medication properties and interactions.

[figure: RxNav dashboard screenshot]

Creating a Target Variable

The goal of our model will be to predict whether a patient will get readmitted to the hospital. Looking at the readmitted column, we see that there are 3 possible values:

  1. NO (not readmitted)
  2. >30 (readmitted more than 30 days after being discharged)
  3. <30 (readmitted within 30 days of being discharged)
In [31]:
data['readmitted'].value_counts()
Out[31]:
NO     53212
>30    35545
<30    11357
Name: readmitted, dtype: int64

To keep things simple, we’ll view this as a binary classification problem: did the patient get readmitted? We’re going to use numpy.where to convert all instances of “NO” to 0 (the patient was not readmitted) and everything else to 1 (the patient was readmitted).

In [32]:
data['readmitted_bool'] = np.where(data['readmitted']=='NO', 0, 1)
data['readmitted_bool'].value_counts()
Out[32]:
0    53212
1    46902
Name: readmitted_bool, dtype: int64

Step 4: Data Exploration and Visualization

Assessing Missing Values

To get a better sense of the missing values in our data, let’s visualize it using missingno’s “nullity” matrix.

In [33]:
import missingno as msno

msno.matrix(data)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1145977d0>
[figure: missingno nullity matrix]

The data-dense columns are fully black, while the sparse columns (with missing values) have a mixture of white and black.
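
As a numeric complement to the nullity matrix, pandas can quantify per-column missingness directly. A quick one-liner (not one of the original notebook cells):

# Proportion of missing values per column, most-missing first
data.isna().mean().sort_values(ascending=False).head()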

Patient Demographics: Age and Gender

In [34]:
plt.figure(figsize=(14,4))
plt.subplot(1,2,1)
sns.countplot(x='age', data=data, palette='viridis')
plt.title("Distribution of Age")

plt.subplot(1,2,2)
sns.countplot(data['gender'], palette='viridis')
plt.title("Distribution of Gender")
plt.show()
[figure: countplots of age and gender distributions]
In [35]:
data['gender'].value_counts(normalize=True)
Out[35]:
Female    0.538013
Male      0.461987
Name: gender, dtype: float64

The age distribution plot shows that our dataset represents an aging population. The most common age range is 70-80 years old. Our population also has a higher proportion of females than males.

How long were hospital stays for a given admission?

In [36]:
plt.figure(figsize=(8,6))
sns.countplot(data['time_in_hospital'], palette='viridis')
plt.xlabel("time in hospital (days)")
plt.title("Length of Hospital Stay")
plt.show()
[figure: countplot of length of hospital stay]
In [37]:
print(f"Mean time in hospital: {data['time_in_hospital'].mean():.2f}")
Mean time in hospital: 4.39

Patients stayed in hospital for 4.4 days on average; the longest stay was 14 days.
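
You can verify the maximum stay directly:

print(f"Longest stay: {data['time_in_hospital'].max()} days")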

Number of Diagnoses, Procedures, Medications

In [38]:
plt.figure(figsize=(14,5))

plt.subplot(1,2,1)
sns.kdeplot(data['num_medications'], shade=True, legend=False)
plt.title(f"Number of Medications, mean: {data['num_medications'].mean():.2f}", size=14)

plt.subplot(1,2,2)
sns.kdeplot(data['num_lab_procedures'], shade=True, legend=False)
plt.title(f"Number of Lab Procedures, mean: {data['num_lab_procedures'].mean():.2f}", size=14)
plt.show()
[figure: KDE plots of number of medications and number of lab procedures]

Patients on average were administered 16 medications during their hospital stay. The average number of lab procedures was 43.

What was the most common medical specialty?

We also have information on the medical specialty of a patient’s attending physician. This can give us a sense of the nature of a patient’s illness during their hospital stay. For example, “orthopedics” would suggest that the patient’s presenting issue was bone-related, while “nephrology” suggests a kidney problem.

In [39]:
medical_specialties = data['medical_specialty'].value_counts().reset_index()
medical_specialties.columns = ['specialty', 'count']
medical_specialties['prevalence'] = medical_specialties['count']/len(data)
print(f"There are {data['medical_specialty'].nunique()} medical specialties.")
medical_specialties.head(10)
There are 72 medical specialties.
Out[39]:
specialty count prevalence
0 InternalMedicine 14328 0.143117
1 Emergency/Trauma 7449 0.074405
2 Family/GeneralPractice 7302 0.072937
3 Cardiology 5296 0.052900
4 Surgery-General 3068 0.030645
5 Nephrology 1544 0.015422
6 Orthopedics 1394 0.013924
7 Orthopedics-Reconstructive 1231 0.012296
8 Radiologist 1129 0.011277
9 Pulmonology 856 0.008550

What proportion of patients were on diabetes medication during their hospital stay?

In [40]:
data['diabetesMed'].value_counts(normalize=True)
Out[40]:
Yes    0.77184
No     0.22816
Name: diabetesMed, dtype: float64

77% of patients were on diabetes medication during their stay.

Do patients have normal A1C levels?

The A1C blood test is used to help diagnose type 1 and type 2 diabetes, and reflects a patient’s average blood sugar levels over the past 3 months. The higher the A1C level, the poorer the patient’s blood sugar control, and the higher their risk of diabetes complications. The table below summarizes Mayo Clinic’s guideline for interpreting A1C levels:

interpretation            A1C level (%)
no diabetes               <5.7
pre-diabetes              5.7-6.4
diabetes                  ≥6.5
well-managed diabetes     <7
poorly managed diabetes   >8
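
To make the guideline concrete, here’s a small hypothetical helper (not part of the dataset pipeline, since A1Cresult only stores binned categories) that maps a numeric A1C level to the diagnostic bands above:

def interpret_a1c(a1c):
    """Map a numeric A1C level (%) to Mayo Clinic's diagnostic bands."""
    if a1c < 5.7:
        return "no diabetes"
    elif a1c < 6.5:
        return "pre-diabetes"
    return "diabetes"

print(interpret_a1c(8.1))  # 'diabetes' (and poorly managed, since >8)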

Our dataset has an A1Cresult column which reflects a patient’s A1C level during their hospital stay.

In [41]:
data['A1Cresult'].value_counts(normalize=True)
Out[41]:
>8      0.482994
Norm    0.292783
>7      0.224224
Name: A1Cresult, dtype: float64
In [42]:
print(f"Proportion of hospital admissions with missing A1C result: {data['A1Cresult'].isna().sum()/len(data):.2%}")
Proportion of hospital admissions with missing A1C result: 83.14%

Of the hospital admissions where A1C was measured, almost half had an A1C level greater than 8, which suggests that the patient’s diabetes was poorly managed. However, A1C data is sparse in our dataset, so we may want to leave it out of the first iteration of our model.

Step 5: Feature Selection and Engineering

Our dataset contains quite a few categorical variables such as race, age, and admission_type. In general, machine learning models can’t handle string-valued categorical variables directly, so we’ll use one-hot and label encoding to convert these features to numerical ones.

One-hot Encoding

Let’s say we want to convert a patient’s race to a numerical feature. We could use label encoding to convert each race to values 0-5 but this suggests an inherent order among races that does not exist. With one-hot encoding, each race becomes an independent feature.

In [43]:
categorical = ['race', 'admission_type']

for c in categorical:
    # Append the one-hot columns, then drop the original categorical column
    data = pd.concat([data, pd.get_dummies(data[c], prefix=c)], axis=1)
    data = data.drop(columns=c)
In [44]:
data['age'].value_counts()
Out[44]:
[70-80)     25562
[60-70)     22185
[50-60)     17102
[80-90)     16706
[40-50)      9626
[30-40)      3765
[90-100)     2668
[20-30)      1650
[10-20)       690
[0-10)        160
Name: age, dtype: int64

Label Encoding

There are a couple of features where label encoding is applicable.

  • age (from [0-10) to [70-80)) can assume an ordinal relationship
  • gender (Male or Female) converts to 0 and 1 which is binary

Let’s go ahead and apply scikit-learn’s LabelEncoder to these two columns.

In [45]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

data['age_label'] = label_encoder.fit_transform(data['age'])
data['gender_bool'] = label_encoder.fit_transform(data['gender'].astype(str))

Because our gender column contained some missing values, it was considered a mixed datatype. We had to convert it to strings in order for label encoding to work.

Data Modelling

Step 6: Defining the X and y Variables

Now that we have a good understanding of our dataset, we can start building predictive models. The first step is to separate our features and target into the variables X and y, respectively.

In [46]:
med_features = ['metformin_bool', 'repaglinide_bool',
    'nateglinide_bool', 'chlorpropamide_bool', 'glimepiride_bool',
    'acetohexamide_bool', 'glipizide_bool', 'glyburide_bool',
    'tolbutamide_bool', 'pioglitazone_bool', 'rosiglitazone_bool',
    'acarbose_bool', 'miglitol_bool', 'troglitazone_bool',
    'tolazamide_bool', 'examide_bool', 'citoglipton_bool', 'insulin_bool',
    'glyburide-metformin_bool', 'glipizide-metformin_bool',
    'glimepiride-pioglitazone_bool', 'metformin-rosiglitazone_bool',
    'metformin-pioglitazone_bool']

demographic_features = ['race_AfricanAmerican', 'race_Asian',
    'race_Caucasian', 'race_Hispanic', 'race_Other', 'age_label',
    'admission_type_Elective', 'admission_type_Newborn',
    'admission_type_Trauma Center', 'admission_type_Urgent', 'gender_bool']

other_features = ['num_lab_procedures', 'num_procedures',
       'num_medications', 'number_outpatient', 'number_emergency',
       'number_inpatient', 'number_diagnoses']

all_features = med_features + demographic_features + other_features

X = data[all_features]
y = data['readmitted_bool']

Step 7: Choosing our Model

When building a binary classification model, there is a wide selection of machine learning algorithms to choose from:

  • Random Forest Classification
  • Logistic Regression
  • Linear Discriminant Analysis
  • Support Vector Machines (SVM)
  • Gaussian Naive Bayes
  • k-Nearest Neighbours

We’ll test out the Random Forest Classifier (RFC) for this dataset. RFC is an ensemble learning technique that works by creating a “forest” of decision trees. Each tree evaluates the data for a given patient and outputs a 0 or 1. Random Forest looks at the output of all trees and gives the majority vote as its result. Let’s say we have a forest with 3 trees and 2 of them predict the patient will be readmitted. The majority vote is that the patient will be readmitted.

[figure: random forest majority-vote diagram]
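
To make the voting concrete, here’s a toy sketch of a 3-tree majority vote (illustrative only; scikit-learn handles this internally):

# Votes from 3 hypothetical trees: 1 = will be readmitted, 0 = won't be
tree_votes = np.array([1, 1, 0])

# Majority vote: predict readmission if more than half the trees agree
prediction = int(tree_votes.sum() > len(tree_votes) / 2)
print(prediction)  # 1 -> the forest predicts readmission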

We’re choosing Random Forest because:

  • it is robust to outliers
  • it is able to handle unbalanced datasets
  • it measures feature importance

We’ll import RFC from scikit-learn, a comprehensive Python library for data mining and data analysis.

In [47]:
from sklearn.ensemble import RandomForestClassifier

We can inspect the default parameters for RandomForestClassifier by creating an instance of the class and applying get_params():

In [48]:
RandomForestClassifier().get_params()
Out[48]:
{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 'warn',
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

You can keep the default values for most of these parameters. But there are a few that can be tuned prior to training and can significantly impact model performance. These are called hyperparameters. Some RFC hyperparameters include:

  • n_estimators: number of trees in the forest
  • max_depth: maximum number of levels in each decision tree
  • max_features: maximum number of features considered for splitting a node
  • min_samples_split: minimum number of samples required to split a node

These are external configurations that can’t be learned by training the model. To select optimal values for these hyperparameters, we’ll need to use a technique called hyperparameter tuning.

Step 8: Hyperparameter Tuning

Hyperparameter tuning is a critical step in the machine learning pipeline. It describes the process of choosing a set of optimal hyperparameters for a model. The hyperparameters you select can have a significant impact on your model’s performance.

We’re going to be testing out two hyperparameter tuning techniques offered by scikit-learn:

  1. grid search (GridSearchCV)
  2. random search (RandomizedSearchCV)

With grid search, you define your search space as a grid of values and iterate over each grid point until you find the optimal combination of values. Let’s say we want to tune max_depth and n_estimators in our RandomForestClassifier. We’ll set our search space as follows:

  • n_estimators = [5,10,50]
  • max_depth = [3,5,10]

This means grid search will evaluate 9 hyperparameter configurations, one for every combination of values (each configuration is additionally fit once per cross-validation fold). We’ll choose the combination of n_estimators and max_depth that gives us the best model performance.

Let’s implement this with scikit-learn’s GridSearchCV. We first need to define our search space as a dictionary. We also need to initialize our model.

In [49]:
from sklearn.model_selection import GridSearchCV

search_space = {
    'n_estimators': [5,10,50],
    'max_depth': [3,5,10]
}

rfc = RandomForestClassifier(random_state=42)

Now, we set up grid search. GridSearchCV also performs cross-validation (hence the 'CV') so we can specify how many folds we want in our analysis. We’ll set our number of folds to 3. The more folds you use, the longer it will take to compute results.

In [50]:
grid_search = GridSearchCV(rfc, search_space, cv=3)

The last step is to run fit. We’ll pass in X and y which we created in Step 6. GridSearchCV will use RFC’s default metric, accuracy. If we want to optimize our model using another metric, we can specify scoring = 'precision' (or whichever metric we’re interested in) inside GridSearchCV. Let’s stick with accuracy for now.
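
For illustration, a precision-optimized search would look like this (a hypothetical variant we won’t run here):

# Same grid, but optimized for precision instead of accuracy
grid_search_precision = GridSearchCV(rfc, search_space, cv=3, scoring='precision')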

In [51]:
grid_search.fit(X, y)
Out[51]:
GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_estimators': [5, 10, 50], 'max_depth': [3, 5, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

What are the optimal hyperparameters?

In [52]:
print(f"Optimal hyperparameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
Optimal hyperparameters: {'max_depth': 5, 'n_estimators': 50}
Best score: 0.617

Based on the search space we defined for GridSearchCV, it looks like a max_depth of 5 and n_estimators of 50 are our optimal hyperparameters, giving us an accuracy of 0.617.

We can also see a thorough report of our results with cv_results_. It shows fit time, score time, and mean train/test score (averaged over all folds). We’ll sort by mean_test_score.

In [53]:
results = pd.DataFrame(grid_search.cv_results_).sort_values(by='mean_test_score', ascending=False)
results.head()
Out[53]:
mean_fit_time std_fit_time mean_score_time std_score_time param_max_depth param_n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score mean_train_score std_train_score
5 1.132375 0.009537 0.121620 0.001741 5 50 {'max_depth': 5, 'n_estimators': 50} 0.609014 0.633214 0.608373 0.616867 0.011562 1 0.633694 0.621683 0.627437 0.627605 0.004905
2 0.821872 0.007762 0.100330 0.000355 3 50 {'max_depth': 3, 'n_estimators': 50} 0.600264 0.627461 0.613047 0.613591 0.011110 2 0.624509 0.611839 0.619121 0.618490 0.005192
8 1.907094 0.071164 0.202362 0.028745 10 50 {'max_depth': 10, 'n_estimators': 50} 0.610332 0.621767 0.606814 0.612971 0.006383 3 0.655344 0.647828 0.647259 0.650144 0.003685
4 0.251050 0.008086 0.035588 0.000535 5 10 {'max_depth': 5, 'n_estimators': 10} 0.606107 0.624554 0.596686 0.609116 0.011574 4 0.631596 0.615375 0.622912 0.623294 0.006628
7 0.399570 0.004194 0.045592 0.000699 10 10 {'max_depth': 10, 'n_estimators': 10} 0.609553 0.614216 0.601510 0.608426 0.005248 5 0.654161 0.644068 0.642195 0.646808 0.005255

In 2012, Bergstra and Bengio from the University of Montreal proposed a technique called random search, which is similar to grid search except that instead of iterating over a discrete grid of values, you randomly sample from a distribution of values. Random search is effective in situations where not all hyperparameters are equally important.

[figure: grid search vs. random search sampling]

The visualization above gives an example of when random search can perform better. With grid search, you only look at 3 distinct values of a given hyperparameter, but with random search you look at 9 different values. As you increase the number of samples in your random search, you increase the probability of finding the optimal hyperparameters for your model.

Let’s test out random search using scikit-learn’s RandomizedSearchCV. We’ll define our search space over a uniform distribution of values. We’ll iterate 9 times, just like we did for grid search.

In [54]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

search_space = {
    "n_estimators": randint(10,100),
    "max_depth": randint(1, 11)
}

random_search = RandomizedSearchCV(rfc, param_distributions=search_space, n_iter=9, cv=3)
random_search.fit(X,y)

print(f"Optimal hyperparameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.3f}")
Optimal hyperparameters: {'max_depth': 4, 'n_estimators': 59}
Best score: 0.615
In [55]:
results = pd.DataFrame(random_search.cv_results_).sort_values(by='mean_test_score', ascending=False)
results.head()
Out[55]:
mean_fit_time std_fit_time mean_score_time std_score_time param_max_depth param_n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score mean_train_score std_train_score
7 1.149746 0.021392 0.129111 0.002811 4 59 {'max_depth': 4, 'n_estimators': 59} 0.606107 0.629858 0.609841 0.615269 0.010428 1 0.629873 0.614746 0.624890 0.623170 0.006294
0 1.136515 0.057328 0.126526 0.004007 4 58 {'max_depth': 4, 'n_estimators': 58} 0.605178 0.630068 0.609152 0.614799 0.010918 2 0.629634 0.614791 0.625174 0.623200 0.006218
1 2.200676 0.430752 0.166357 0.003467 6 65 {'max_depth': 6, 'n_estimators': 65} 0.609853 0.623895 0.609691 0.614479 0.006658 3 0.636511 0.625938 0.630029 0.630826 0.004353
6 2.004584 0.041066 0.205204 0.006243 6 80 {'max_depth': 6, 'n_estimators': 80} 0.610362 0.622247 0.608313 0.613640 0.006143 4 0.636661 0.626223 0.630313 0.631066 0.004294
4 2.834532 0.151390 0.263107 0.004990 9 77 {'max_depth': 9, 'n_estimators': 77} 0.610152 0.623925 0.606125 0.613401 0.007621 5 0.648422 0.641221 0.641520 0.643721 0.003326

Grid search and random search are uninformed methods, which means they do not take results from past evaluations into consideration. When you’re working with a very large search space, you might want to consider a “smarter” approach to hyperparameter tuning such as Sequential Model-Based Optimization (SMBO). The SMBO approach keeps track of previous iterations’ results and uses them to sample hyperparameters at the current iteration. In other words, SMBO tries to reduce the number of iterations by sampling the most promising hyperparameters based on past results. You can check out scikit-optimize to learn more about how to implement SMBO hyperparameter tuning with scikit-learn models.
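
As a taste of what SMBO looks like in practice, here is a minimal sketch using scikit-optimize’s BayesSearchCV, with a search space mirroring our random search (assumes scikit-optimize is installed; not run in this walkthrough):

# Sketch only: requires `pip install scikit-optimize`
from skopt import BayesSearchCV
from skopt.space import Integer

bayes_search = BayesSearchCV(
    rfc,
    {'n_estimators': Integer(10, 100), 'max_depth': Integer(1, 11)},
    n_iter=9,  # each iteration is informed by the previous ones
    cv=3,
)
# bayes_search.fit(X, y)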

Step 9: Evaluating Model Performance

There are several metrics that we can use to evaluate model performance:

  • accuracy
  • precision
  • recall
  • F1-score
  • ROC AUC score

A comprehensive list of classification metrics can be found in scikit-learn’s metrics module documentation.

In this walkthrough, we’ll look at accuracy, precision and recall. But before we can start evaluating our model, let’s split our data into two parts: 1) a training set and 2) a test set. We’ll fit our model on the training data, and evaluate its performance using the test set.

In [56]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rfc = RandomForestClassifier(**random_search.best_params_)
rfc.fit(X_train, y_train)
Out[56]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=59, n_jobs=None,
            oob_score=False, random_state=99, verbose=0, warm_start=False)

Accuracy is RFC’s default evaluation metric. Using the score method, we get the accuracy of the trained model on the test set.

In [57]:
accuracy = rfc.score(X_test, y_test)

print(f"Accuracy: {accuracy:.3f}")
Accuracy: 0.620

An accuracy score of 0.62 means that ~62% of hospital admissions were correctly labeled. The dataset we’re working with is relatively balanced, but suppose we had an imbalanced dataset where 90% of patients ended up being readmitted. Accuracy would be a misleading metric there: a model that blindly predicted every patient to be readmitted would score 90% by default. Precision and recall are better ways to evaluate the performance of models trained on imbalanced data.
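
One way to sanity-check against this failure mode is to compare your model to a majority-class baseline; here is a short sketch using scikit-learn’s DummyClassifier:

# A baseline that always predicts the most frequent class.
# On heavily imbalanced data, its accuracy equals the majority-class share.
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.3f}")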

Precision and Recall

Precision and recall are information retrieval metrics that evaluate classification models.

  • Precision is the “fraction of relevant instances among the retrieved instances”.
    • What proportion of predicted readmitted patients were actually readmitted?
  • Recall is the “fraction of the total amount of relevant instances that were actually retrieved”.
    • What proportion of readmitted patients were identified correctly?

Looking at the equations below, we can see that precision aims to minimize the number of False Positives, while recall aims to minimize the number of False Negatives.

precision = TP / (TP + FP)
recall    = TP / (TP + FN)

In [58]:
from sklearn.metrics import precision_score, recall_score, confusion_matrix

y_pred = rfc.predict(X_test)

precision = precision_score(y_true=y_test, y_pred=y_pred)
recall = recall_score(y_true=y_test, y_pred=y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
Precision: 0.621
Recall: 0.451
  • Of the patients who were labelled as readmitted, ~62% were actually readmitted.
  • Of the patients who were actually readmitted, 45% were labelled as readmitted.

Another way to assess our model’s performance is to visualize our results with a confusion matrix.

In [59]:
confusion = confusion_matrix(y_true=y_test, y_pred=y_pred)
labels = np.array([['TN','FP'],['FN','TP']])

sns.heatmap(confusion,annot=labels, fmt='', linewidths=2, cmap="Blues")
plt.xlabel("predicted")
plt.ylabel("actual")
Out[59]:
Text(33.0, 0.5, 'actual')
[figure: confusion matrix heatmap]

The heatmap above shows that we have more False Negatives than False Positives. This means that when we predict a patient will be readmitted, there’s a good chance we got it right. But when we predict that a patient won’t be readmitted, we’re missing quite a few patients who actually do return to the hospital.

Step 10: Examining Feature Importance

With RandomForestClassifier, we can dig further to examine which features were the most important in classification.

In [60]:
feature_importances = {
    'features': list(X.columns.values),
    'importance': list(rfc.feature_importances_)
}

important_features = pd.DataFrame(feature_importances)
important_features.sort_values(by='importance', ascending=False).head(15)
Out[60]:
features importance
39 number_inpatient 0.502475
38 number_emergency 0.145222
37 number_outpatient 0.103304
40 number_diagnoses 0.095260
36 num_medications 0.049601
28 age_label 0.021661
29 admission_type_Elective 0.019275
34 num_lab_procedures 0.018061
35 num_procedures 0.016270
17 insulin_bool 0.009118
0 metformin_bool 0.005045
25 race_Caucasian 0.003034
1 repaglinide_bool 0.001752
6 glipizide_bool 0.001735
24 race_Asian 0.001074

The most important features appear to be number_inpatient and number_emergency, which represent the number of inpatient and emergency visits of the patient in the year preceding the encounter. So if a patient was admitted to the hospital in the past, their chance of being readmitted in the future increases.
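
If you’d like a visual summary, the top features can be plotted with the same seaborn barplot style we used earlier (an optional sketch):

# Optional: plot the 10 most important features
top = important_features.sort_values(by='importance', ascending=False).head(10)
sns.barplot(x='importance', y='features', data=top, palette='viridis')
plt.title("Top 10 Feature Importances")
plt.show()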