# Multi-Task Gaussian Processes for Vital Sign Imputation

How to model a multiple vital signs with multi-task gaussian processes

- Sources
- Data
- Defining our Model
- Data Preparation
- Training Loop
- Model Prediction
- Going beyond 1 patient

This is an extension of my previous blog post on gaussian processes. In this post we will cover how to use GPytorch to model multiple vital signs at the same time. One would want to do this when you have multiple features that are correlated or have obvious dependencies. Another of looking at would be if you have a set of features where the missing values are Missing at Random.

This is an alternative to using a single gaussian process model to impute each feature independently. Other alternatives include linear interpolation, K-nearest-neighbors, Bayesian Ridge Regression, Random Forest, etc. More can be found here from sci-kit-learn.

We'll be using public data from the Physionet 2019 competition on the early prediction of sepsis. Data can be found here

Here is a short code snippet to prepare the data:

```
#collapse
import pandas as pd
import os
import numpy as np
import re
from cache_em_all import Cachable
DATA_DIR = "training_setB/" # Path to the data
def load_single_file(file_path):
df = pd.read_csv(file_path, sep='|')
df['hours'] = df.index
df['patient'] = re.search('p(.*?).psv', file_path).group(1)
return df
def get_data_files():
return [os.path.join(DATA_DIR, x) for x in sorted(os.listdir(DATA_DIR)) if int(x[1:-4]) % 5 > 0]
@Cachable('training_setB.csv')
def load_data():
data = get_data_files()
data_frames = [load_single_file(d) for d in data]
merged = pd.concat(data_frames)
return merged
```

We'll be looking patient 10019, who has a good amount of missing data. We'll only be modeling 6 vital signs:

- Heart rate (HR)
- Temperature (Temp)
- Respiratory Rate (Resp)
- Systolic Blood Pressure (SBP)
- Diastolic blood Pressure (DBP)
- Mean Arteiral Pressure (MAP)

```
df = load_data()
df = df.loc[df.patient == 100019]\
[['HR', 'Temp', 'SBP', 'DBP', 'MAP', 'Resp']].plot().legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
```

The goal will be to model these 6 features simultaneously to fill in the missing values.

We'll be using the GPyTorch package for our guassian processes, the only package I'm aware of in Python that supports multi-task learning for gaussian processes.

Our model will consist of a python class with a constructor and a forward method that is called in our training loop later.

What's important to note here is we must specify how many tasks we want to learn. In our case, that is 6. Another difference from the Exact GP model is that we are defining two covariance modules, the standard RBFKernal, and the index kernal, which according to the documentation "is a lookup table containing inter-task covariance".

Our final covariance kernal is a multiplication of these two covariance kernals: $k([x,i], [x', j]) = k_{inputs}(x, x') * k_{tasks}(i,j)$

In our code: `covar = covar_x.mul(covar_i)`

```
class MultitaskGPModel(gpytorch.models.ExactGP):
def __init__(self, train_x, train_y, likelihood):
super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
self.mean_module = gpytorch.means.ConstantMean()
self.covar_module = gpytorch.kernels.RBFKernel()
self.task_covar_module = gpytorch.kernels.IndexKernel(num_tasks=6, rank=1)
def forward(self,x,i):
mean_x = self.mean_module(x)
covar_x = self.covar_module(x)
covar_i = self.task_covar_module(i)
covar = covar_x.mul(covar_i)
return gpytorch.distributions.MultivariateNormal(mean_x, covar)
```

This is the most tricky part.

I'll explain what to contitutes full_train_x, full_train_i, and full_train_y in our instatiation of our Multitask model:

`MultitaskGPModel((full_train_x, full_train_i), full_train_y, likelihood)`

```
df = load_data()
df = df.loc[df.patient == 100019]
```

Assuming that we will repeat this process with many patients, there inevitably will be a patient for which an entire feature is missing. We will want to exclude that, and impute with the average or median for that feature after.

```
impute_features = ['HR', 'DBP', 'Temp', 'SBP', 'MAP', 'Resp']
missing_feats = df.columns[df.isna().all()].tolist()
non_empty_feats = [f for f in set(impute_features) - set(missing_feats)]
non_empty_feats
```

We than collect each of our features into a list of features that are torch tensors, our first tensor in the list being temperature:

```
feats = []
for f in non_empty_feats:
feats.append(torch.tensor(df[f].values, dtype=torch.float))
feats[0]
```

We define our `full_train_x`

as the *indicies with non-null values*. Here, we see that our temperature feature has non-null values at index 3, 12, 14, etc.

```
full_train_x = [torch.where(~torch.isnan(f))[0] for f in feats]
full_train_x[0]
```

We then concatenate all the features index tensors together into one:

```
full_train_x = torch.cat(full_train_x)
full_train_x
```

For `full_train_i`

, this is the concatenated tensor of *indicies to specify which value belongs to which feature*, as if we had given an ordinal label to each feature.

In this example, we have assigned 0 to the temperature feature, and there is 9 non-null temperature values in our observation. We assign 1 to DBP, 2 to Resp, etc.

```
full_train_i = []
for i, f in enumerate(feats):
full_train_i.append(torch.full_like(f[~torch.isnan(f)], dtype=torch.long, fill_value=i))
full_train_i = torch.cat(full_train_i)
full_train_i
```

Finally, `full_train_y`

are the *non-null observed values* for each feature, concatenated together.

```
full_train_y = [f[~torch.isnan(f)] for f in feats]
full_train_y = torch.cat(full_train_y)
full_train_y
```

Important: Double check to make sure your features are in the same order for each of `full_train_x`

, `full_train_i`

, and `full_train_y`

Now that we have all our training data is the correct format, we just need to instantiate our likelihood. Note that our train_x argument for the MultitaskGPModel must be a tuple of (full_train_x, full_train_i).

```
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = MultitaskGPModel((full_train_x, full_train_i), full_train_y, likelihood)
```

Now fit our model to the data. Our model input is the training tuple `(full_train_x, full_train_i)`

, which we then feed into the `ExactMarginalLogLikelihood`

MLE to determine our loss, by comparison with full_train_y. 10,000 iterations seems to work for a good fit visually.

```
training_iterations = 10_000
model.train()
likelihood.train()
optimizer = torch.optim.Adam([
{'params': model.parameters()},
], lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for i in range(training_iterations):
optimizer.zero_grad()
output = model(full_train_x, full_train_i)
loss = -mll(output, full_train_y)
loss.backward()
optimizer.step()
```

Now we place our model and likelihood in evaluation mode, and create `test_x`

, a tensor that holds all *full list (non null and null) of indicies*, and
`test_i_tasks`

, which contains a list of tensors with indicies that indentify our features with *full length*, which means they are the length of features with non and null values.

```
model.eval()
likelihood.eval()
timesteps = len(df)
test_x = torch.tensor(list(range(timesteps)), dtype=torch.long)
test_x
```

```
test_i_tasks = []
for i in range(len(non_empty_feats)):
test_i_tasks.append(torch.full_like(test_x, dtype=torch.long, fill_value=i))
test_i_tasks
```

Finally, we create a list of MultivariateNormal models in `observed_pred_ys`

, and we predict the mean of each of these models to impute our features:

```
observed_pred_ys = []
with torch.no_grad(), gpytorch.settings.fast_pred_samples():
for test_i_task in test_i_tasks:
observed_pred_ys.append(likelihood(model(test_x, test_i_task)))
for f, preds in zip(non_empty_feats, observed_pred_ys):
df[f] = preds.mean.detach().numpy()
df[impute_features].plot().legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
```

From our imputed graph we can easily compare and see that where there were previously missing values, we have no filled them in with reference to the other trends in vitals signs.

We can put all our previous code in a function `MTGP_impute(df, timesteps)`

, which takes as input our df for a patient and the number of rows (timesteps) we have observed for that patient:

```
#collapse
def MTGP_impute(df, timesteps):
impute_features = ['HR', 'DBP', 'Temp', 'SBP', 'MAP', 'Resp']
df[impute_features].plot().legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
missing_feats = df.columns[df.isna().all()].tolist()
non_empty_feats = [f for f in set(impute_features) - set(missing_feats)]
feats = []
for f in non_empty_feats:
feats.append(torch.tensor(df[f].values, dtype=torch.float))
full_train_x = [torch.where(~torch.isnan(f))[0] for f in feats]
if full_train_x == []:
return df
full_train_x = torch.cat(full_train_x)
full_train_i = []
for i, f in enumerate(feats):
full_train_i.append(torch.full_like(f[~torch.isnan(f)], dtype=torch.long, fill_value=i))
full_train_i = torch.cat(full_train_i)
full_train_y = [f[~torch.isnan(f)] for f in feats]
full_train_y = torch.cat(full_train_y)
# Instantiate likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = MultitaskGPModel((full_train_x, full_train_i), full_train_y, likelihood)
training_iterations = 10000
model.train()
likelihood.train()
optimizer = torch.optim.Adam([
{'params': model.parameters()}, # Includes GaussianLikelihood parameters
], lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for i in range(training_iterations):
optimizer.zero_grad()
output = model(full_train_x, full_train_i)
loss = -mll(output, full_train_y)
loss.backward()
optimizer.step()
model.eval()
likelihood.eval()
test_x = torch.tensor(list(range(timesteps)), dtype=torch.long)
test_i_tasks = []
for i in range(len(non_empty_feats)):
test_i_tasks.append(torch.full_like(test_x, dtype=torch.long, fill_value=i))
observed_pred_ys = []
with torch.no_grad(), gpytorch.settings.fast_pred_samples():
for test_i_task in test_i_tasks:
observed_pred_ys.append(likelihood(model(test_x, test_i_task)))
for f, preds in zip(non_empty_feats, observed_pred_ys):
df[f] = preds.mean.detach().numpy()
df[impute_features].plot().legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
return df
```

```
df = load_data()
df = df.loc[df.patient == 100019]
```

```
MTGP_impute(df, len(df))
```

We can use our function to iterate through each patient in our full data frame. This is the second data set prodicded by Physionet, and contains 16,000 patients:

```
df = load_data()
impute_df = [MTGP_impute(df_group, len(df_group)) for patient, df_group in tqdm(df.groupby('patient'))]
impute_df = pd.concat(impute_df)
```

0%| | 3/16000 [01:37<141:45:35, 31.90s/it]

Looks like might take 141 hours, or 5-6 days. That will be unacceptable for most people. Enter: Joblib

We can parallelize our fucntion across multiple cores. My current machine has 8 cores:

```
impute_df = Parallel(n_jobs=8)(delayed(MTGP_impute)(df_group, len(df_group)) for patient, df_group in tqdm(df.groupby('patient')))
```

0%| | 24/16000 [01:21<6:56:02, 1.56s/it]

Now, its looks like we can finish our imputation in under 7 hours! Also, instead of doing 10,000 iterations for each patient, we could stop iterating when our loss only changes a small amount each iteration.

I hope you have found this useful, and please let me know of any mistakes or clarifications I can make!