Working with missing data
This tutorial covers the torchtime.data.UEA class; however, the imputation examples also apply to other data sets.
Simulating missing data
PhysioNet data sets feature missing data; however, most UEA/UCR data sets are regularly sampled and fully observed.
We often need our models to handle time series that are irregularly sampled, partially observed and of unequal length. To aid model development, missing data can be simulated in UEA/UCR data sets using the missing argument.
Note
Data points are dropped independently at random. The missing argument represents the probability that a data point is missing. Results can be reproduced using the seed argument.
Regularly sampled data with missing time points
If missing is a single value, data are dropped across all channels. This simulates regularly sampled data where some time points are not recorded, e.g. dropped data over a network. Using the CharacterTrajectories data set as an example:
from torch.utils.data import DataLoader
from torchtime.data import UEA
char_traj = UEA(
    dataset="CharacterTrajectories",
    split="train",
    train_prop=0.7,
    missing=0.5,  # each time point is missing with probability 0.5
    seed=123,
)
dataloader = DataLoader(char_traj, batch_size=32)
print(next(iter(dataloader))["X"][0, 0:10])
Output:
...
tensor([[ 0.0000, -0.1849, 0.1978, 0.3263],
[ 1.0000, nan, nan, nan],
[ 2.0000, -0.3744, 0.2511, 0.4260],
[ 3.0000, nan, nan, nan],
[ 4.0000, nan, nan, nan],
[ 5.0000, nan, nan, nan],
[ 6.0000, nan, nan, nan],
[ 7.0000, -1.0270, -0.1670, 0.2144],
[ 8.0000, nan, nan, nan],
[ 9.0000, -1.3501, -0.4994, 0.2447]])
Regularly sampled data with partial observation
Alternatively, data can be dropped independently for each channel by passing a list of missing probabilities, one per channel. This simulates regularly sampled data with partial observation, i.e. not all channels are recorded at each time point.
char_traj = UEA(
    dataset="CharacterTrajectories",
    split="train",
    train_prop=0.7,
    missing=[0.8, 0.2, 0.5],  # probability missing: 80%, 20% and 50% for channels 1 to 3
    seed=123,
)
dataloader = DataLoader(char_traj, batch_size=32)
print(next(iter(dataloader))["X"][0, 0:10])
Output:
...
tensor([[ 0.0000, nan, 0.1978, 0.3263],
[ 1.0000, nan, 0.2399, nan],
[ 2.0000, nan, 0.2511, nan],
[ 3.0000, nan, nan, 0.4016],
[ 4.0000, nan, nan, 0.3410],
[ 5.0000, nan, 0.0824, 0.2739],
[ 6.0000, nan, -0.0302, 0.2281],
[ 7.0000, nan, -0.1670, 0.2144],
[ 8.0000, nan, nan, nan],
[ 9.0000, -1.3501, -0.4994, 0.2447]])
Note that the number of observed channels varies between time points.
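The two forms of the missing argument can be illustrated with a small plain-Python sketch. This is not torchtime's implementation: the function name simulate_missing and the list-of-lists representation are assumptions made for the example, and it only shows the behaviour described above (one random draw per time point for a single value, one draw per channel for a list).

```python
import math
import random


def simulate_missing(series, missing, seed=None):
    """Return a copy of `series` with values replaced by NaN at random.

    `series` is a list of time points, each a list of channel values.
    `missing` is either one probability (whole time points are dropped)
    or a list with one probability per channel.
    """
    rng = random.Random(seed)  # seeded generator so results are reproducible
    out = []
    for row in series:
        if isinstance(missing, (int, float)):
            # One draw per time point: all channels are dropped together.
            drop = [rng.random() < missing] * len(row)
        else:
            # One draw per channel: channels are dropped independently.
            drop = [rng.random() < p for p in missing]
        out.append([math.nan if d else v for v, d in zip(row, drop)])
    return out


# Hypothetical three-channel series with 1000 time points.
series = [[float(t), float(t) * 2, float(t) * 3] for t in range(1000)]
by_time = simulate_missing(series, 0.5, seed=123)
by_channel = simulate_missing(series, [0.8, 0.2, 0.5], seed=123)
```

With a single probability every row is either fully observed or fully missing, whereas with a list each row can contain any mix of observed and missing channels.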
Missing data masks
In some applications, the absence/presence of data can itself be informative. For example, a doctor may be more likely to order a particular diagnostic test if they believe the patient has a medical condition. Missing data/observational masks can be used to inform models of missing data. These are appended by setting the mask argument to True.
char_traj = UEA(
    dataset="CharacterTrajectories",
    split="train",
    train_prop=0.7,
    missing=[0.8, 0.2, 0.5],
    mask=True,
    seed=123,
)
dataloader = DataLoader(char_traj, batch_size=32)
print(next(iter(dataloader))["X"][0, 0:10])
Output:
...
tensor([[ 0.0000, nan, 0.1978, 0.3263, 0.0000, 1.0000, 1.0000],
[ 1.0000, nan, 0.2399, nan, 0.0000, 1.0000, 0.0000],
[ 2.0000, nan, 0.2511, nan, 0.0000, 1.0000, 0.0000],
[ 3.0000, nan, nan, 0.4016, 0.0000, 0.0000, 1.0000],
[ 4.0000, nan, nan, 0.3410, 0.0000, 0.0000, 1.0000],
[ 5.0000, nan, 0.0824, 0.2739, 0.0000, 1.0000, 1.0000],
[ 6.0000, nan, -0.0302, 0.2281, 0.0000, 1.0000, 1.0000],
[ 7.0000, nan, -0.1670, 0.2144, 0.0000, 1.0000, 1.0000],
[ 8.0000, nan, nan, nan, 0.0000, 0.0000, 0.0000],
[ 9.0000, -1.3501, -0.4994, 0.2447, 1.0000, 1.0000, 1.0000]])
Note that the final three channels indicate whether each value was recorded (1) or missing (0).
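As a rough illustration of what the mask channels contain, the sketch below derives a mask from the NaN pattern of the first three time points in the output above. The helper name observation_mask is hypothetical and plain Python lists stand in for tensors; torchtime builds the mask internally when mask=True.

```python
import math


def observation_mask(series):
    """Return 1.0 where a value was recorded and 0.0 where it is NaN."""
    return [[0.0 if math.isnan(v) else 1.0 for v in row] for row in series]


nan = math.nan
# First three time points of the example output (time stamp channel removed).
X = [
    [nan, 0.1978, 0.3263],
    [nan, 0.2399, nan],
    [nan, 0.2511, nan],
]
mask = observation_mask(X)
# Append the mask channels after the time series channels.
X_with_mask = [row + m for row, m in zip(X, mask)]
print(mask)
```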
Time deltas
Some models, e.g. GRU-D, require the time since the previous observation as an input. This can be added using the delta argument. See Che et al., 2018 for implementation details.
char_traj = UEA(
    dataset="CharacterTrajectories",
    split="train",
    train_prop=0.7,
    missing=[0.8, 0.2, 0.5],
    delta=True,
    seed=123,
)
dataloader = DataLoader(char_traj, batch_size=32)
print(next(iter(dataloader))["X"][0, 0:10])
Output:
...
tensor([[ 0.0000, nan, 0.1978, 0.3263, 0.0000, 0.0000, 0.0000],
[ 1.0000, nan, 0.2399, nan, 1.0000, 1.0000, 1.0000],
[ 2.0000, nan, 0.2511, nan, 2.0000, 1.0000, 2.0000],
[ 3.0000, nan, nan, 0.4016, 3.0000, 1.0000, 3.0000],
[ 4.0000, nan, nan, 0.3410, 4.0000, 2.0000, 1.0000],
[ 5.0000, nan, 0.0824, 0.2739, 5.0000, 3.0000, 1.0000],
[ 6.0000, nan, -0.0302, 0.2281, 6.0000, 1.0000, 1.0000],
[ 7.0000, nan, -0.1670, 0.2144, 7.0000, 1.0000, 1.0000],
[ 8.0000, nan, nan, nan, 8.0000, 1.0000, 1.0000],
[ 9.0000, -1.3501, -0.4994, 0.2447, 9.0000, 2.0000, 2.0000]])
Note that the second channel is observed at time 2 and not again until time 5, so the time delta at time 5 is 3, i.e. 3 time units since the last observation. The time delta is 0 at time 0 by definition.
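The time delta values above are consistent with the recurrence given by Che et al., 2018: the delta is zero at the first time point, equals the time step if the channel was observed at the previous time point, and otherwise accumulates the time step onto the previous delta. A minimal plain-Python sketch of that rule follows; the function name time_delta is an assumption for illustration and this is not torchtime's own code.

```python
def time_delta(times, mask):
    """Time since the previous observation for each channel.

    `times` is a list of time stamps and `mask` a list of rows holding
    1 (observed) or 0 (missing) for each channel.
    """
    n_channels = len(mask[0])
    delta = [[0.0] * n_channels]  # zero at the first time point by definition
    for t in range(1, len(times)):
        step = times[t] - times[t - 1]
        row = []
        for c in range(n_channels):
            if mask[t - 1][c] == 1:
                row.append(step)  # observed at the previous time point
            else:
                row.append(step + delta[t - 1][c])  # keep accumulating
        delta.append(row)
    return delta


# Observation mask of the example output shown in the previous section.
mask = [
    [0, 1, 1], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1],
    [0, 1, 1], [0, 1, 1], [0, 1, 1], [0, 0, 0], [1, 1, 1],
]
times = [float(t) for t in range(10)]
delta = time_delta(times, mask)
print(delta[5])  # time delta for each channel at time 5
```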
Combining output options
The time, mask and delta arguments can be combined as required. The channel order is always time stamp, time series data, missing data mask then time delta.
char_traj = UEA(
    dataset="CharacterTrajectories",
    split="train",
    train_prop=0.7,
    missing=[0.8, 0.2, 0.5],
    time=False,
    mask=True,
    delta=True,
    seed=123,
)
dataloader = DataLoader(char_traj, batch_size=32)
print(next(iter(dataloader))["X"][0, 0:10])
Output:
...
tensor([[ nan, 0.1978, 0.3263, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000,
0.0000],
[ nan, 0.2399, nan, 0.0000, 1.0000, 0.0000, 1.0000, 1.0000,
1.0000],
[ nan, 0.2511, nan, 0.0000, 1.0000, 0.0000, 2.0000, 1.0000,
2.0000],
[ nan, nan, 0.4016, 0.0000, 0.0000, 1.0000, 3.0000, 1.0000,
3.0000],
[ nan, nan, 0.3410, 0.0000, 0.0000, 1.0000, 4.0000, 2.0000,
1.0000],
[ nan, 0.0824, 0.2739, 0.0000, 1.0000, 1.0000, 5.0000, 3.0000,
1.0000],
[ nan, -0.0302, 0.2281, 0.0000, 1.0000, 1.0000, 6.0000, 1.0000,
1.0000],
[ nan, -0.1670, 0.2144, 0.0000, 1.0000, 1.0000, 7.0000, 1.0000,
1.0000],
[ nan, nan, nan, 0.0000, 0.0000, 0.0000, 8.0000, 1.0000,
1.0000],
[-1.3501, -0.4994, 0.2447, 1.0000, 1.0000, 1.0000, 9.0000, 2.0000,
2.0000]])
Note that the time stamp channel is not returned, but the missing data mask and time delta channels are appended to the data.
Note
The channel order is always time stamp (if specified), time series, missing data mask (if specified) and time delta (if specified). The time stamp is a single channel; the missing data mask and time delta each have the same number of channels as the time series.
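The rule in the note can be written as a one-line calculation of the width of the returned X tensor. The helper below is hypothetical (it is not part of torchtime); its defaults mirror the behaviour seen in the examples above, where the time stamp is returned unless time=False and the mask and time delta channels only appear when requested.

```python
def n_output_channels(n_series, time=True, mask=False, delta=False):
    """Number of channels in X for a series with `n_series` data channels."""
    # One time stamp channel, plus the data channels and one copy of the
    # data channel count for each of the mask and time delta options.
    return int(time) + n_series * (1 + int(mask) + int(delta))


# CharacterTrajectories has three time series channels.
print(n_output_channels(3))                                    # 4
print(n_output_channels(3, mask=True))                         # 7
print(n_output_channels(3, time=False, mask=True, delta=True)) # 9
```

These counts match the widths of the example outputs above (4, 7 and 9 columns respectively).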
Imputing missing data
Missing data can be imputed using the impute argument. torchtime currently supports zero, mean and forward imputation as well as custom imputation functions.
Warning
By design, imputation has no impact on the missing data mask or time delta channels!
Zero imputation
Missing values are set to zero:
char_traj = UEA(
    dataset="CharacterTrajectories",
    split="train",
    train_prop=0.7,
    missing=[0.8, 0.2, 0.5],
    impute="zero",
    seed=123,
)
dataloader = DataLoader(char_traj, batch_size=32)
print(next(iter(dataloader))["X"][0, 0:10])
Output:
...
tensor([[ 0.0000, 0.0000, 0.1978, 0.3263],
[ 1.0000, 0.0000, 0.2399, 0.0000],
[ 2.0000, 0.0000, 0.2511, 0.0000],
[ 3.0000, 0.0000, 0.0000, 0.4016],
[ 4.0000, 0.0000, 0.0000, 0.3410],
[ 5.0000, 0.0000, 0.0824, 0.2739],
[ 6.0000, 0.0000, -0.0302, 0.2281],
[ 7.0000, 0.0000, -0.1670, 0.2144],
[ 8.0000, 0.0000, 0.0000, 0.0000],
[ 9.0000, -1.3501, -0.4994, 0.2447]])
Mean imputation
Under mean imputation, missing data are replaced with the training data channel mean:
char_traj = UEA(
    dataset="CharacterTrajectories",
    split="train",
    train_prop=0.7,
    missing=[0.8, 0.2, 0.5],
    impute="mean",
    seed=123,
)
dataloader = DataLoader(char_traj, batch_size=32)
print(next(iter(dataloader))["X"][0, 0:10])
Output:
...
tensor([[ 0.0000, 0.1163, 0.1978, 0.3263],
[ 1.0000, 0.1163, 0.2399, -0.2935],
[ 2.0000, 0.1163, 0.2511, -0.2935],
[ 3.0000, 0.1163, -0.0722, 0.4016],
[ 4.0000, 0.1163, -0.0722, 0.3410],
[ 5.0000, 0.1163, 0.0824, 0.2739],
[ 6.0000, 0.1163, -0.0302, 0.2281],
[ 7.0000, 0.1163, -0.1670, 0.2144],
[ 8.0000, 0.1163, -0.0722, -0.2935],
[ 9.0000, -1.3501, -0.4994, 0.2447]])
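To make the mean imputation step concrete, here is a small plain-Python sketch under the assumption that the channel mean is taken over all observed (non-NaN) values in the training split. The names channel_means and mean_impute are hypothetical, the toy data are invented for the example, and torchtime's own implementation may differ in detail.

```python
import math


def channel_means(train):
    """Mean of the observed values in each channel across all training series."""
    n_channels = len(train[0][0])
    sums = [0.0] * n_channels
    counts = [0] * n_channels
    for series in train:
        for row in series:
            for c, v in enumerate(row):
                if not math.isnan(v):  # ignore missing values
                    sums[c] += v
                    counts[c] += 1
    return [s / n if n else math.nan for s, n in zip(sums, counts)]


def mean_impute(series, fill):
    """Replace each NaN with the fill value for its channel."""
    return [[fill[c] if math.isnan(v) else v for c, v in enumerate(row)]
            for row in series]


nan = math.nan
# Toy training split: two series with two channels each.
train = [
    [[1.0, nan], [3.0, 4.0]],
    [[nan, 8.0]],
]
means = channel_means(train)
imputed = mean_impute(train[0], means)
print(means, imputed)
```

The same fill values would be reused for the validation and test splits, so no information from those splits enters the imputation.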
Forward imputation
Under forward imputation, missing values are replaced with the previous observation in the same channel. Forward imputation cannot fill missing values that precede the first observation in a channel, so these are replaced with the training data channel mean.
This approach ensures that knowledge of the time series at times t > i is not used when imputing values at time i. This is required when developing models that make online predictions.
char_traj = UEA(
    dataset="CharacterTrajectories",
    split="train",
    train_prop=0.7,
    missing=[0.8, 0.2, 0.5],
    impute="forward",
    seed=123,
)
dataloader = DataLoader(char_traj, batch_size=32)
print(next(iter(dataloader))["X"][0, 0:10])
Output:
...
tensor([[ 0.0000, 0.1163, 0.1978, 0.3263],
[ 1.0000, 0.1163, 0.2399, 0.3263],
[ 2.0000, 0.1163, 0.2511, 0.3263],
[ 3.0000, 0.1163, 0.2511, 0.4016],
[ 4.0000, 0.1163, 0.2511, 0.3410],
[ 5.0000, 0.1163, 0.0824, 0.2739],
[ 6.0000, 0.1163, -0.0302, 0.2281],
[ 7.0000, 0.1163, -0.1670, 0.2144],
[ 8.0000, 0.1163, -0.1670, 0.2144],
[ 9.0000, -1.3501, -0.4994, 0.2447]])
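The forward fill rule, including the fallback for initial missing values, can be sketched in plain Python as follows. The function name forward_impute is an assumption for illustration and this is not torchtime's implementation. The input rows are taken from the partially observed example above, and the fill values are the channel means visible in the mean imputation output.

```python
import math


def forward_impute(series, fill):
    """Carry the last observation in each channel forward over NaNs.

    Values missing before the first observation fall back to `fill`.
    """
    last = list(fill)  # nothing observed yet, so start from the fill values
    out = []
    for row in series:
        # Only the current and earlier time points are used, never later ones.
        last = [last[c] if math.isnan(v) else v for c, v in enumerate(row)]
        out.append(list(last))
    return out


nan = math.nan
X = [
    [nan, 0.1978, 0.3263],
    [nan, 0.2399, nan],
    [nan, 0.2511, nan],
    [nan, nan, 0.4016],
    [nan, nan, 0.3410],
    [nan, 0.0824, 0.2739],
    [nan, -0.0302, 0.2281],
    [nan, -0.1670, 0.2144],
    [nan, nan, nan],
    [-1.3501, -0.4994, 0.2447],
]
fill = [0.1163, -0.0722, -0.2935]  # training data channel means
imputed = forward_impute(X, fill)
print(imputed[3])
```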
Note
torchtime.impute includes imputation functions for tensors with missing data. See the API for more information.
Handling categorical variables
The mean and forward imputation options above assume all variables are continuous. To impute missing values for a categorical variable using the channel mode (rather than the channel mean), pass the channel indices for each categorical channel in a list to the categorical argument.
For additional flexibility, the calculated channel mean/mode can be overridden using the override argument. This accepts a dictionary as in the example below and can be used to impute missing data with a fixed value.
For example, assuming mean imputation is used for a data set with 10 channels where the 2nd and 8th are categorical:
The categorical argument should be [1, 7] as channels are indexed from zero.
To replace missing values in the 5th channel with a value of 100 rather than the channel mean, pass the dictionary {4: 100} to the override argument.
For a real-world example see the implementation of the PhysioNet2012 data set, where channel 20 (MechVent) is categorical (i.e. yes or no) with an overridden mode of zero.
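The way the categorical and override options interact can be sketched as a calculation of one fill value per channel: an overridden value takes precedence, otherwise the mode is used for categorical channels and the mean for all others. The function fill_values below is a hypothetical plain-Python illustration on invented data, not torchtime's code.

```python
import math
from statistics import mean, mode


def fill_values(train, categorical=(), override=None):
    """One fill value per channel: override, else mode or mean of observed data."""
    override = override or {}
    n_channels = len(train[0][0])
    fill = []
    for c in range(n_channels):
        if c in override:
            fill.append(override[c])  # a fixed value replaces the statistic
            continue
        observed = [row[c] for series in train for row in series
                    if not math.isnan(row[c])]
        # Assumes every channel has at least one observed value.
        fill.append(mode(observed) if c in categorical else mean(observed))
    return fill


nan = math.nan
# Toy training split with three channels; channel 1 is categorical (0 or 1).
train = [
    [[1.0, 0.0, nan], [3.0, 1.0, 5.0]],
    [[nan, 1.0, 7.0]],
]
fill = fill_values(train, categorical=[1], override={2: 100})
print(fill)
```

Here channel 0 is filled with its mean, channel 1 with its mode and channel 2 with the fixed value 100.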
Custom imputation functions
Alternatively, a custom imputation function can be passed to impute. This must accept X (raw time series), y (labels), fill (training data means/modes for each channel after overriding values as above) and select (the channels to impute), and return the X and y tensors post imputation.
def five_imputation(X, y, fill, select):
    # fill and select are unused here: every missing value is set to five
    return X.nan_to_num(5), y.nan_to_num(5)

char_traj = UEA(
    dataset="CharacterTrajectories",
    split="train",
    train_prop=0.7,
    missing=[0.8, 0.2, 0.5],
    impute=five_imputation,
    seed=123,
)
dataloader = DataLoader(char_traj, batch_size=32)
print(next(iter(dataloader))["X"][0, 0:10])
Output:
...
tensor([[ 0.0000, 5.0000, 0.1978, 0.3263],
[ 1.0000, 5.0000, 0.2399, 5.0000],
[ 2.0000, 5.0000, 0.2511, 5.0000],
[ 3.0000, 5.0000, 5.0000, 0.4016],
[ 4.0000, 5.0000, 5.0000, 0.3410],
[ 5.0000, 5.0000, 0.0824, 0.2739],
[ 6.0000, 5.0000, -0.0302, 0.2281],
[ 7.0000, 5.0000, -0.1670, 0.2144],
[ 8.0000, 5.0000, 5.0000, 5.0000],
[ 9.0000, -1.3501, -0.4994, 0.2447]])