This is an extension of my previous blog post on Gaussian processes. In this post we will cover how to use GPyTorch to model multiple vital signs at the same time. You would want to do this when you have multiple features that are correlated or have obvious dependencies. Another way of looking at it: this approach suits a set of features whose missing values are Missing at Random.
This is an alternative to using a single Gaussian process model to impute each feature independently. Other alternatives include linear interpolation, k-nearest neighbors, Bayesian ridge regression, random forests, etc. More can be found here from scikit-learn.
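For comparison, here is a minimal sketch of two of those simpler baselines. This is my own illustration, not from the original post, assuming vitals is a DataFrame with one row per hour and one column per vital sign:

import pandas as pd
from sklearn.impute import KNNImputer

# Linear interpolation along the time axis
linear = vitals.interpolate(method='linear')

# K-nearest-neighbors imputation, borrowing information across features
knn = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(vitals),
                   columns=vitals.columns, index=vitals.index)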

We'll be using public data from the PhysioNet 2019 competition on the early prediction of sepsis. Data can be found here.

Sources

The original paper for Multi-task Gaussian processes.

This application of MTGPs to vital signs is based on the approach from this paper.

Most of the GPyTorch code comes from this example.

Data

Here is a short code snippet to prepare the data:

#collapse
import pandas as pd
import os
import numpy as np
import re
from cache_em_all import Cachable

DATA_DIR = "training_setB/"  # Path to the data

def load_single_file(file_path):
    df = pd.read_csv(file_path, sep='|')
    df['hours'] = df.index
    df['patient'] = re.search('p(.*?).psv', file_path).group(1)
    return df
    
def get_data_files():
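    # Keep files whose numeric patient id is not divisible by 5 (roughly 80% of the set)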
    return [os.path.join(DATA_DIR, x) for x in sorted(os.listdir(DATA_DIR)) if int(x[1:-4]) % 5 > 0]

@Cachable('training_setB.csv')
def load_data():
    data = get_data_files()
    data_frames = [load_single_file(d) for d in data]
    merged = pd.concat(data_frames)
    return merged

We'll be looking at patient 100019, who has a good amount of missing data. We'll only be modeling 6 vital signs:

  • Heart rate (HR)
  • Temperature (Temp)
  • Respiratory Rate (Resp)
  • Systolic Blood Pressure (SBP)
  • Diastolic Blood Pressure (DBP)
  • Mean Arterial Pressure (MAP)

df = load_data()
df = df.loc[df.patient == 100019]
df[['HR', 'Temp', 'SBP', 'DBP', 'MAP', 'Resp']].plot().legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

The goal will be to model these 6 features simultaneously to fill in the missing values.

Defining our Model

We'll be using the GPyTorch package for our Gaussian processes; it's the only package in Python I'm aware of that supports multi-task Gaussian processes.

Our model will consist of a Python class with a constructor and a forward method that is called in our training loop later.

What's important to note here is that we must specify how many tasks we want to learn; in our case, that is 6. Another difference from the exact GP model is that we define two covariance modules: the standard RBFKernel, and the IndexKernel, which according to the documentation "is a lookup table containing inter-task covariance".

Our final covariance kernel is the product of these two covariance kernels: $k([x,i], [x', j]) = k_{inputs}(x, x') * k_{tasks}(i,j)$

In our code: covar = covar_x.mul(covar_i)
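For reference, IndexKernel with rank=1 parameterizes the task covariance as a low-rank-plus-diagonal matrix (per the GPyTorch documentation): $k_{tasks} = BB^{\top} + \mathrm{diag}(\mathbf{v})$, where $B$ is a learned $6 \times 1$ factor and $\mathbf{v}$ is a learned per-task variance. Raising rank allows richer inter-task correlation structure.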

import torch
import gpytorch

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.RBFKernel()
        self.task_covar_module = gpytorch.kernels.IndexKernel(num_tasks=6, rank=1)


    def forward(self,x,i):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        covar_i = self.task_covar_module(i)
        covar = covar_x.mul(covar_i)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar)

Data Preparation

This is the trickiest part.

I'll explain what constitutes full_train_x, full_train_i, and full_train_y in the instantiation of our multitask model:

MultitaskGPModel((full_train_x, full_train_i), full_train_y, likelihood)

df = load_data()
df = df.loc[df.patient == 100019]

Assuming that we will repeat this process with many patients, there will inevitably be a patient for whom an entire feature is missing. We will want to exclude that feature here, and impute it with the average or median for that feature afterwards.

impute_features = ['HR', 'DBP', 'Temp', 'SBP', 'MAP', 'Resp']
missing_feats = df.columns[df.isna().all()].tolist()
non_empty_feats = [f for f in set(impute_features) - set(missing_feats)]
non_empty_feats
['HR', 'Resp', 'SBP', 'DBP', 'MAP', 'Temp']
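That post-hoc fill isn't implemented in this post; a minimal sketch, assuming a hypothetical population_medians Series precomputed over all patients, might look like:

# Hypothetical fallback: fill features that are entirely missing for this
# patient with a population-level median.
for f in set(impute_features) & set(missing_feats):
    df[f] = population_medians[f]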

We then collect each of our features into a list of torch tensors; the first tensor in the list here happens to be heart rate:

feats = []
for f in non_empty_feats:
    feats.append(torch.tensor(df[f].values, dtype=torch.float))
feats[0]
tensor([nan, 56., 58., 56., 54., 52., 52., 58., 54., 54., 52., 58., 54., 58.,
        58., 58., 56., 50., 54., 52., 52., 52., 54., 58., 58., 70., 68., 68.,
        64., nan, 64., nan, 61., nan, 56., 58., 62., 58., 62., nan, 58., 60.,
        60., 62., 68., nan])

We define our full_train_x as the indices of the non-null values. Here, we see that our heart rate feature has non-null values at indices 1, 2, 3, and so on, with gaps at the missing hours (0, 29, 31, 33, 39, and 45).

full_train_x = [torch.where(~torch.isnan(f))[0] for f in feats]
full_train_x[0]
tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 32, 34, 35, 36, 37, 38, 40,
        41, 42, 43, 44])

We then concatenate all the feature index tensors together into one:

full_train_x = torch.cat(full_train_x)
full_train_x
tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 32, 34, 35, 36, 37, 38, 40,
        41, 42, 43, 44,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 14, 16,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 37, 38, 40, 41, 42, 43, 44,  1,  2,
         3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
        21, 22, 23, 24, 25, 26, 27, 28, 30, 32, 34, 35, 37, 38, 40, 41, 42, 43,
        44,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 32, 34, 35, 37, 38, 40,
        41, 42, 43, 44,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
        15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 32, 34, 35,
        37, 38, 40, 41, 42, 43, 44,  3, 12, 14, 18, 20, 22, 34, 37, 42])

full_train_i is the concatenated tensor of task indices specifying which value belongs to which feature, as if we had given an ordinal label to each feature.

In this example, we have assigned 0 to heart rate, which has 40 non-null values in our observation; 1 to the respiratory rate, 2 to SBP, 3 to DBP, 4 to MAP, and 5 to temperature, which has only 9 non-null values.

full_train_i = []
for i, f in enumerate(feats):
    full_train_i.append(torch.full_like(f[~torch.isnan(f)], dtype=torch.long, fill_value=i))
full_train_i = torch.cat(full_train_i)
full_train_i
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5,
        5, 5, 5, 5])

Finally, full_train_y holds the non-null observed values for each feature, concatenated together.

full_train_y = [f[~torch.isnan(f)] for f in feats]
full_train_y = torch.cat(full_train_y)
full_train_y
tensor([ 56.0000,  58.0000,  56.0000,  54.0000,  52.0000,  52.0000,  58.0000,
         54.0000,  54.0000,  52.0000,  58.0000,  54.0000,  58.0000,  58.0000,
         58.0000,  56.0000,  50.0000,  54.0000,  52.0000,  52.0000,  52.0000,
         54.0000,  58.0000,  58.0000,  70.0000,  68.0000,  68.0000,  64.0000,
         64.0000,  61.0000,  56.0000,  58.0000,  62.0000,  58.0000,  62.0000,
         58.0000,  60.0000,  60.0000,  62.0000,  68.0000,  22.0000,  24.0000,
         22.0000,  20.0000,  20.0000,  22.0000,  20.0000,  18.0000,  16.0000,
         16.0000,  14.0000,  20.0000,  18.0000,  18.0000,  20.0000,  18.0000,
         24.0000,  22.0000,  21.0000,  19.0000,  22.0000,  18.0000,  16.0000,
         18.0000,  16.0000,  19.0000,  22.0000,  18.0000,  20.0000,  18.0000,
        145.0000, 163.0000, 151.0000, 112.0000, 113.0000, 109.0000, 158.0000,
        127.0000, 129.0000, 101.0000, 176.0000, 190.0000, 188.0000, 198.0000,
        164.0000, 160.0000, 156.0000, 191.0000, 147.0000, 170.0000, 158.0000,
        172.0000, 176.0000, 155.5000, 182.0000, 143.0000, 133.0000, 156.0000,
        160.0000, 135.5000, 153.0000, 180.0000, 177.0000, 194.0000, 158.0000,
        222.0000, 170.0000, 156.0000, 153.0000,  74.0000,  77.0000,  63.0000,
         57.0000,  64.0000,  63.0000,  77.0000,  63.0000,  65.0000,  63.0000,
         82.0000, 105.0000,  92.0000,  85.0000,  74.0000,  71.0000,  65.0000,
         86.0000,  68.0000,  68.0000,  71.0000,  85.0000, 119.0000,  80.0000,
         85.0000,  72.0000,  69.0000,  87.0000,  78.0000,  76.5000,  71.0000,
         78.0000,  81.0000,  77.0000,  71.5000,  93.0000,  76.0000,  69.0000,
         72.0000, 103.0000, 116.0000, 105.0000,  79.0000,  84.0000,  95.0000,
        112.0000,  86.0000,  92.0000,  84.0000, 100.0000, 147.0000, 146.0000,
        142.0000,  99.0000,  97.0000, 101.0000, 140.0000,  91.0000,  95.0000,
        118.0000, 116.0000, 152.0000, 116.0000, 134.0000,  92.0000,  98.0000,
        137.0000, 118.0000, 101.5000,  98.0000, 102.0000, 113.5000, 110.0000,
         92.0000, 142.0000, 105.0000, 100.0000, 101.0000,  36.8000,  36.4000,
         36.5000,  36.4000,  36.0000,  36.5000,  36.7000,  36.6000,  36.5000])

Important: double-check that your features appear in the same order in each of full_train_x, full_train_i, and full_train_y.
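A cheap sanity check (my own suggestion) is to assert that the three tensors at least have matching lengths, since each element of full_train_y must line up with its time index and task index:

# All three tensors must line up element-for-element.
assert full_train_x.shape == full_train_i.shape == full_train_y.shape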

Now that we have all our training data in the correct format, we just need to instantiate our likelihood. Note that the train_x argument for the MultitaskGPModel must be a tuple of (full_train_x, full_train_i).

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = MultitaskGPModel((full_train_x, full_train_i), full_train_y, likelihood)

Training Loop

Now we fit our model to the data. Our model input is the training tuple (full_train_x, full_train_i), which we feed into the ExactMarginalLogLikelihood to compute our loss by comparison with full_train_y. 10,000 iterations seems to give a good fit visually.

training_iterations = 10_000

model.train()
likelihood.train()

optimizer = torch.optim.Adam([
    {'params': model.parameters()},  
], lr=0.1)

mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(full_train_x, full_train_i)
    loss = -mll(output, full_train_y)
    loss.backward()
    optimizer.step()
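If you want to watch convergence while this runs, you can log the loss periodically inside the loop (an optional addition of mine, mirroring the GPyTorch tutorials):

    # Inside the for loop, after optimizer.step():
    if i % 1000 == 0:
        print(f'Iter {i}/{training_iterations} - Loss: {loss.item():.3f}')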

Model Prediction

Now we place our model and likelihood in evaluation mode, and create test_x, a tensor that holds the full list of time indices (both observed and missing), and test_i_tasks, a list of tensors of task indices identifying each feature, each tensor spanning the full length of the time series.

model.eval()
likelihood.eval()

timesteps = len(df)
test_x = torch.tensor(list(range(timesteps)), dtype=torch.long)
test_x
tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45])
test_i_tasks = []
for i in range(len(non_empty_feats)):
    test_i_tasks.append(torch.full_like(test_x, dtype=torch.long, fill_value=i))
test_i_tasks
[tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 tensor([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]),
 tensor([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]),
 tensor([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5])]

Finally, we create a list of predictive distributions (MultivariateNormals) in observed_pred_ys, and we use the mean of each to impute our features:

observed_pred_ys = []
with torch.no_grad(), gpytorch.settings.fast_pred_samples():
    for test_i_task in test_i_tasks:
        observed_pred_ys.append(likelihood(model(test_x, test_i_task)))

for f, preds in zip(non_empty_feats, observed_pred_ys):
    df[f] = preds.mean.detach().numpy()

df[impute_features].plot().legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

From our imputed graph we can easily see that where there were previously missing values, we have now filled them in, guided by the trends in the other vital signs. Note that, as written, the loop above replaces the observed values with the GP posterior mean as well, which smooths the whole series.
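If you would rather keep the raw observations and only fill the gaps, a small variation on the loop above (my own sketch, not from the original code) does the trick:

# Only overwrite the entries that were actually missing.
for f, preds in zip(non_empty_feats, observed_pred_ys):
    mask = df[f].isna()
    df.loc[mask, f] = preds.mean.detach().numpy()[mask.to_numpy()]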

Going beyond 1 patient

We can put all our previous code in a function MTGP_impute(df, timesteps), which takes as input the df for a single patient and the number of rows (timesteps) we have observed for that patient:

#collapse
def MTGP_impute(df, timesteps):
    impute_features = ['HR', 'DBP', 'Temp', 'SBP', 'MAP', 'Resp']
    df[impute_features].plot().legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    missing_feats = df.columns[df.isna().all()].tolist()
    non_empty_feats = [f for f in set(impute_features) - set(missing_feats)]

    feats = []
    for f in non_empty_feats:
        feats.append(torch.tensor(df[f].values, dtype=torch.float))
    
    full_train_x = [torch.where(~torch.isnan(f))[0] for f in feats]
    
    if full_train_x == []:
        return df
    
    full_train_x = torch.cat(full_train_x)

    full_train_i = []
    for i, f in enumerate(feats):
        full_train_i.append(torch.full_like(f[~torch.isnan(f)], dtype=torch.long, fill_value=i))
    full_train_i = torch.cat(full_train_i)


    full_train_y = [f[~torch.isnan(f)] for f in feats]
    full_train_y = torch.cat(full_train_y)

    # Instantiate likelihood and model
    likelihood = gpytorch.likelihoods.GaussianLikelihood()

    model = MultitaskGPModel((full_train_x, full_train_i), full_train_y, likelihood)

    training_iterations = 10000

    model.train()
    likelihood.train()

    optimizer = torch.optim.Adam([
        {'params': model.parameters()},  # Includes GaussianLikelihood parameters
    ], lr=0.1)

    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

    for i in range(training_iterations):
        optimizer.zero_grad()
        output = model(full_train_x, full_train_i)
        loss = -mll(output, full_train_y)
        loss.backward()
        optimizer.step()

    model.eval()
    likelihood.eval()

    test_x = torch.tensor(list(range(timesteps)), dtype=torch.long)
    
    test_i_tasks = []
    for i in range(len(non_empty_feats)):
        test_i_tasks.append(torch.full_like(test_x, dtype=torch.long, fill_value=i))
        
    observed_pred_ys = []
    with torch.no_grad(), gpytorch.settings.fast_pred_samples():
        for test_i_task in test_i_tasks:
            observed_pred_ys.append(likelihood(model(test_x, test_i_task)))
    
    for f, preds in zip(non_empty_feats, observed_pred_ys):
        df[f] = preds.mean.detach().numpy()
        
    df[impute_features].plot().legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

    
    return df
df = load_data()
df = df.loc[df.patient == 100019]
MTGP_impute(df, len(df))

We can use our function to iterate through each patient in our full data frame. This is the second data set provided by PhysioNet, and it contains 16,000 patients:

from tqdm import tqdm

df = load_data()
impute_df = [MTGP_impute(df_group, len(df_group)) for patient, df_group in tqdm(df.groupby('patient'))]
impute_df = pd.concat(impute_df)

0%| | 3/16000 [01:37<141:45:35, 31.90s/it]

It looks like this might take 141 hours, or 5-6 days. That will be unacceptable for most people. Enter: Joblib.

We can parallelize our function across multiple cores. My current machine has 8 cores:

from joblib import Parallel, delayed

impute_df = Parallel(n_jobs=8)(delayed(MTGP_impute)(df_group, len(df_group)) for patient, df_group in tqdm(df.groupby('patient')))
impute_df = pd.concat(impute_df)

0%| | 24/16000 [01:21<6:56:02, 1.56s/it]

Now it looks like we can finish our imputation in under 7 hours! Also, instead of running 10,000 iterations for each patient, we could stop early once the loss changes by only a small amount between iterations, as sketched below.
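A minimal sketch of that early-stopping idea, assuming a tolerance tol on the per-iteration change in loss (the threshold value here is illustrative):

# Stop once the training loss improves by less than `tol` between iterations.
tol = 1e-4
prev_loss = float('inf')
for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(full_train_x, full_train_i)
    loss = -mll(output, full_train_y)
    loss.backward()
    optimizer.step()
    if abs(prev_loss - loss.item()) < tol:
        break
    prev_loss = loss.item()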

I hope you have found this useful, and please let me know of any mistakes or clarifications I can make!