Time series data sets

PhysioNet2012
PhysioNet2019
PhysioNet2019Binary
UEA

class torchtime.data.PhysioNet2012(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)[source]

Returns the PhysioNet Challenge 2012 data as a PyTorch Dataset. See the PhysioNet website for a description of the data set.

The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split, specify train_prop only. For a training/validation/test split, specify both train_prop and val_prop.

For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
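The relationship between the two arguments and the resulting split can be sketched in a few lines. The helper split_proportions below is hypothetical and not part of torchtime; it only reproduces the arithmetic described above and does not perform the stratified sampling.

```python
def split_proportions(train_prop, val_prop=None):
    """Return the (train, val, test) proportions implied by the two arguments."""
    if not 0 < train_prop < 1:
        raise ValueError("train_prop must be between 0 and 1")
    if val_prop is None:
        # Two-way split: everything outside the training set is validation data.
        return train_prop, 1 - train_prop, 0.0
    if not 0 < val_prop < 1 - train_prop:
        raise ValueError("val_prop must leave room for a test set")
    # Three-way split: the remainder becomes the test set.
    return train_prop, val_prop, 1 - train_prop - val_prop


two_way = [round(p, 2) for p in split_proportions(0.8)]
three_way = [round(p, 2) for p in split_proportions(0.8, 0.1)]
print(two_way)    # train/validation only
print(three_way)  # train/validation/test
```

The proportions are rounded only for display, since floating point subtraction does not give exactly 0.2 or 0.1.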

When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.
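A minimal usage sketch, assuming torch and torchtime are installed. The argument values, the batch size and the helper names make_train_loader and first_batch_shapes are illustrative choices rather than recommendations. Constructing the data set may download and cache data under path on first use.

```python
def make_train_loader(batch_size=32):
    """Build a DataLoader over the PhysioNet2012 training split."""
    # Imported lazily so the sketch can be loaded without the libraries installed.
    from torch.utils.data import DataLoader
    from torchtime.data import PhysioNet2012

    train = PhysioNet2012(
        split="train",
        train_prop=0.8,   # 80/10/10% train/validation/test split
        val_prop=0.1,
        impute="forward",
        seed=123,
    )
    return DataLoader(train, batch_size=batch_size, shuffle=True)


def first_batch_shapes(loader):
    """Return the shapes of X, y and length in the first batch."""
    batch = next(iter(loader))
    return {key: tuple(batch[key].shape) for key in ("X", "y", "length")}
```

Calling first_batch_shapes(make_train_loader()) should then report one shape per key of the batch dictionary.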

Missing data can be imputed using the impute argument. See the missing data tutorial for more information.
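To illustrate what the zero, mean and forward options do conceptually, here is a pure-Python sketch on a single channel. It is not the torchtime implementation: the function impute_channel is hypothetical, the mean is taken over the channel itself, and the assumption that leading gaps in forward imputation fall back to the mean is ours.

```python
import math


def impute_channel(values, method="forward"):
    """Fill NaNs in one channel (a list of floats) with a simple strategy."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed) if observed else 0.0
    if method == "zero":
        return [0.0 if math.isnan(v) else v for v in values]
    if method == "mean":
        return [mean if math.isnan(v) else v for v in values]
    if method == "forward":
        # Assumption: values missing before the first observation use the mean.
        filled, last = [], mean
        for v in values:
            if not math.isnan(v):
                last = v
            filled.append(last)
        return filled
    raise ValueError(f"unknown method: {method}")


channel = [math.nan, 2.0, math.nan, 4.0, math.nan]
zero_filled = impute_channel(channel, "zero")
mean_filled = impute_channel(channel, "mean")
forward_filled = impute_channel(channel, "forward")
```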

Data channels are in the following order:

0. Mins:: Minutes since ICU admission. Derived from the PhysioNet time stamp.
1. Albumin:: Albumin (g/dL)
2. ALP:: Alkaline phosphatase (IU/L)
3. ALT:: Alanine transaminase (IU/L)
4. AST:: Aspartate transaminase (IU/L)
5. Bilirubin:: Bilirubin (mg/dL)
6. BUN:: Blood urea nitrogen (mg/dL)
7. Cholesterol:: Cholesterol (mg/dL)
8. Creatinine:: Serum creatinine (mg/dL)
9. DiasABP:: Invasive diastolic arterial blood pressure (mmHg)
10. FiO2:: Fractional inspired O₂ (0-1)
11. GCS:: Glasgow Coma Score (3-15)
12. Glucose:: Serum glucose (mg/dL)
13. HCO3:: Serum bicarbonate (mmol/L)
14. HCT:: Hematocrit (%)
15. HR:: Heart rate (bpm)
16. K:: Serum potassium (mEq/L)
17. Lactate:: Lactate (mmol/L)
18. Mg:: Serum magnesium (mmol/L)
19. MAP:: Invasive mean arterial blood pressure (mmHg)
20. MechVent:: Mechanical ventilation respiration (0:false, or 1:true)
21. Na:: Serum sodium (mEq/L)
22. NIDiasABP:: Non-invasive diastolic arterial blood pressure (mmHg)
23. NIMAP:: Non-invasive mean arterial blood pressure (mmHg)
24. NISysABP:: Non-invasive systolic arterial blood pressure (mmHg)
25. PaCO2:: Partial pressure of arterial CO₂ (mmHg)
26. PaO2:: Partial pressure of arterial O₂ (mmHg)
27. pH:: Arterial pH (0-14)
28. Platelets:: Platelets (cells/nL)
29. RespRate:: Respiration rate (bpm)
30. SaO2:: O₂ saturation in hemoglobin (%)
31. SysABP:: Invasive systolic arterial blood pressure (mmHg)
32. Temp:: Temperature (°C)
33. TroponinI:: Troponin-I (μg/L). Note this is labelled TropI in the PhysioNet data dictionary.
34. TroponinT:: Troponin-T (μg/L). Note this is labelled TropT in the PhysioNet data dictionary.
35. Urine:: Urine output (mL)
36. WBC:: White blood cell count (cells/nL)
37. Weight:: Weight (kg)
38. Age:: Age (years) at ICU admission
39. Gender:: Gender (0: female, or 1: male)
40. Height:: Height (cm) at ICU admission
41. ICUType1:: Type of ICU unit (1: Coronary Care Unit)
42. ICUType2:: Type of ICU unit (2: Cardiac Surgery Recovery Unit)
43. ICUType3:: Type of ICU unit (3: Medical ICU)
44. ICUType4:: Type of ICU unit (4: Surgical ICU)
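When selecting channels by name, a lookup table built from the list above avoids hard-coded indices. The names CHANNELS and CHANNEL_INDEX are illustrative and are not provided by torchtime. The indices refer to the list above; positions in X may be offset if additional channels such as a time stamp are prepended.

```python
# Channel names in the order listed above (index 0 to 44).
CHANNELS = [
    "Mins", "Albumin", "ALP", "ALT", "AST",
    "Bilirubin", "BUN", "Cholesterol", "Creatinine", "DiasABP",
    "FiO2", "GCS", "Glucose", "HCO3", "HCT",
    "HR", "K", "Lactate", "Mg", "MAP",
    "MechVent", "Na", "NIDiasABP", "NIMAP", "NISysABP",
    "PaCO2", "PaO2", "pH", "Platelets", "RespRate",
    "SaO2", "SysABP", "Temp", "TroponinI", "TroponinT",
    "Urine", "WBC", "Weight", "Age", "Gender",
    "Height", "ICUType1", "ICUType2", "ICUType3", "ICUType4",
]

# Map each channel name to its position in the list.
CHANNEL_INDEX = {name: i for i, name in enumerate(CHANNELS)}

vital_signs = [CHANNEL_INDEX[name] for name in ("HR", "RespRate", "Temp")]
print(vital_signs)
```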

Note

Channels 38 to 44 do not vary with time.

Variables 11 (GCS) and 27 (pH) are assumed to be ordinal and are imputed using the same method as a continuous variable.

Variable 20 (MechVent) has value NaN (the majority of values) or 1. It is assumed that value 1 indicates that mechanical ventilation has been used and NaN indicates either missing data or no mechanical ventilation. Accordingly, the channel mode is assumed to be zero.

Variables 41-44 are the one-hot encoded value of ICUType.

Parameters:

split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).

X

A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the time since admission in minutes). See above for the order of the PhysioNet channels. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).

A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 127) and the channels are in the order: time stamp, time series, missing data mask, time deltas.

Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.

Type:: Tensor

y

In-hospital survival (the In-hospital_death variable) for each patient. y = 1 indicates an in-hospital death. A tensor of shape (n, 1).

Type:: Tensor

length

Length of each trajectory prior to padding. A tensor of shape (n).

Type:: Tensor
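The interaction between NaN padding in X and the length attribute can be sketched with nested lists standing in for tensors. The function pad_trajectories is hypothetical and only mirrors the behaviour described above.

```python
import math


def pad_trajectories(trajectories):
    """Pad variable-length trajectories with NaN rows and record their lengths.

    Each trajectory is a list of time steps; each time step is a list of channels.
    """
    longest = max(len(t) for t in trajectories)
    n_channels = len(trajectories[0][0])
    lengths = [len(t) for t in trajectories]
    padded = [
        t + [[math.nan] * n_channels for _ in range(longest - len(t))]
        for t in trajectories
    ]
    return padded, lengths


trajectories = [
    [[0.0, 1.5], [1.0, 1.7], [2.0, 1.6]],  # 3 time steps, 2 channels
    [[0.0, 2.1]],                          # 1 time step, 2 channels
]
padded, lengths = pad_trajectories(trajectories)

# Use the stored length to strip the padding from the second trajectory.
unpadded = padded[1][: lengths[1]]
```

Slicing with the stored length recovers the original trajectory, which is typically what a model or loss function needs in order to ignore padded time steps.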

Note

X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.

Returns:: A PyTorch Dataset object which can be passed to a DataLoader.

class torchtime.data.PhysioNet2019(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)[source]

Returns the PhysioNet Challenge 2019 data as a PyTorch Dataset. See the PhysioNet website for a description of the data set.

The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split, specify train_prop only. For a training/validation/test split, specify both train_prop and val_prop.

For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.

When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.

Missing data can be imputed using the impute argument. See the missing data tutorial for more information.

Parameters:

split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).

X

A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the ICULOS time stamp). The channels are ordered as set out on the PhysioNet website. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).

A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 121) and the channels are in the order: time stamp, time series, missing data mask, time deltas.

Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.

Type:: Tensor
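The channel layout of X can be worked out from c and the time, mask and delta flags. The sketch below uses c = 40, which follows from 3 * c + 1 = 121. The function channel_layout is hypothetical, and it assumes that when a block is switched off the remaining blocks keep the same relative order.

```python
def channel_layout(c, time=True, mask=False, delta=False):
    """Return {block name: (start, stop)} index ranges along the last axis of X."""
    blocks = [
        ("time stamp", 1, time),
        ("time series", c, True),
        ("missing data mask", c, mask),
        ("time deltas", c, delta),
    ]
    layout, start = {}, 0
    for name, width, enabled in blocks:
        if enabled:
            layout[name] = (start, start + width)
            start += width
    return layout


layout = channel_layout(c=40, time=True, mask=True, delta=True)
total_channels = max(stop for _, stop in layout.values())
print(layout)
print(total_channels)
```

Each (start, stop) pair can be used directly as a slice on the last axis, for example X[:, :, start:stop] to extract the missing data mask.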

y

SepsisLabel at each time point. A tensor of shape (n, s, 1).

Type:: Tensor

length

Length of each trajectory prior to padding. A tensor of shape (n).

Type:: Tensor

Note

X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.

Returns:: A PyTorch Dataset object which can be passed to a DataLoader.

class torchtime.data.PhysioNet2019Binary(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)[source]

Returns a binary prediction variant of the PhysioNet Challenge 2019 data as a PyTorch Dataset.

In contrast with the full challenge, the first 72 hours of data are used to predict whether a patient develops sepsis at any point during the period of hospitalisation as in Kidger et al (2020). See the PhysioNet website for a description of the data set.
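The difference between the two tasks can be sketched for a single patient. This is an illustration of the idea rather than the torchtime pre-processing: the function to_binary_task is hypothetical and it assumes one row per hour, so that the first 72 rows correspond to the first 72 hours.

```python
def to_binary_task(sepsis_labels, hours=72):
    """Return (number of observed rows, binary label) for one patient.

    sepsis_labels is the per-time-step SepsisLabel sequence (0 or 1),
    assumed here to contain one entry per hour.
    """
    # Only the first `hours` rows are used as model input.
    observed_rows = min(len(sepsis_labels), hours)
    # The label is positive if sepsis occurs at any point in the full stay.
    label = int(any(sepsis_labels))
    return observed_rows, label


long_stay = [0] * 90 + [1] * 10   # sepsis develops after the 72 hour window
short_stay = [0] * 30             # no sepsis during the stay

long_result = to_binary_task(long_stay)
short_result = to_binary_task(short_stay)
```

Note that the first patient is labelled positive even though no positive SepsisLabel falls inside the observed window.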

The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split, specify train_prop only. For a training/validation/test split, specify both train_prop and val_prop.

For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.

When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.

Missing data can be imputed using the impute argument. See the missing data tutorial for more information.

Parameters:

split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).

X

A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the ICULOS time stamp). The channels are ordered as set out on the PhysioNet website. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).

A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 121) and the channels are in the order: time stamp, time series, missing data mask, time deltas.

Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.

Type:: Tensor
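The time delta channels record the time since each channel was last observed. A minimal sketch for one channel, following the recursive definition in Che et al (2018) as we understand it: the delta is zero at the first time step, equals the gap since the previous step if that step was observed, and otherwise accumulates. The function time_deltas is hypothetical and may differ in detail from the library code.

```python
def time_deltas(times, observed):
    """Time since the previous observation of one channel at each time step.

    times: time stamp of each step; observed: True where the channel has a value.
    """
    deltas = [0.0]  # the delta is defined as zero at the first time step
    for t in range(1, len(times)):
        gap = times[t] - times[t - 1]
        # Reset after an observed step, otherwise accumulate the elapsed time.
        deltas.append(gap if observed[t - 1] else gap + deltas[-1])
    return deltas


times = [0.0, 1.0, 3.0, 6.0, 7.0]
observed = [True, False, False, True, True]
deltas = time_deltas(times, observed)
print(deltas)
```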

y

Whether patient is diagnosed with sepsis at any time during hospitalisation. A tensor of shape (n, 1).

Type:: Tensor

length

Length of each trajectory prior to padding. A tensor of shape (n).

Type:: Tensor

Note

X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.

Returns:: A PyTorch Dataset object which can be passed to a DataLoader.

class torchtime.data.UEA(dataset, split, train_prop, val_prop=None, missing=0.0, impute='none', categorical=[], channel_means={}, time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)[source]

Returns a time series classification data set from the UEA/UCR repository as a PyTorch Dataset. See the UEA/UCR repository website for the data sets.

The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split, specify train_prop only. For a training/validation/test split, specify both train_prop and val_prop.

For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.

When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.

Missing data can be simulated by dropping data at random. Support is also provided to impute missing data. These options are controlled by the missing and impute arguments. See the missing data tutorial for more information.
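The idea behind missing data simulation can be sketched in pure Python. This is not the torchtime implementation: the function drop_at_random is hypothetical, and it reflects our reading that a single proportion drops whole time steps across all channels, whereas a list of proportions drops values independently in each channel.

```python
import math
import random


def drop_at_random(series, missing, seed=None):
    """Replace a proportion of values with NaN.

    series: list of time steps, each a list of channel values.
    missing: a single proportion, or a list with one proportion per channel.
    """
    rng = random.Random(seed)
    steps = range(len(series))
    n_channels = len(series[0])
    result = [list(step) for step in series]  # copy, leave the input untouched
    if isinstance(missing, list):
        # Independent time steps are dropped in each channel.
        for channel, prop in enumerate(missing):
            for t in rng.sample(steps, round(prop * len(series))):
                result[t][channel] = math.nan
    else:
        # The same time steps are dropped in every channel.
        for t in rng.sample(steps, round(missing * len(series))):
            result[t] = [math.nan] * n_channels
    return result


series = [[float(t), float(t) * 2, float(t) * 3] for t in range(10)]
per_channel = drop_at_random(series, [0.5, 0.2, 0.8], seed=0)
all_channels = drop_at_random(series, 0.3, seed=0)

# Count the missing values in each channel.
per_channel_counts = [
    sum(math.isnan(step[ch]) for step in per_channel) for ch in range(3)
]
```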

Warning

Mean imputation is unsuitable for categorical variables. To impute missing values for a categorical variable with the channel mode (rather than the channel mean), pass the channel indices to the categorical argument. Note this is also required for forward imputation to appropriately impute initial missing values.

Alternatively, the calculated channel mean/mode can be overridden using the channel_means argument. This can be used to impute missing data with a fixed value.
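The choice of fill value described in this warning can be sketched as follows. The function fill_values is hypothetical and not part of torchtime; it only shows how the categorical and channel_means arguments might interact, under the assumption that an override takes precedence over the calculated mean or mode.

```python
import math
from statistics import mean, mode


def fill_values(channels, categorical=(), channel_means=None):
    """Choose the value used to fill missing data in each channel.

    channels: dict mapping channel index to a list of floats (NaN = missing).
    """
    overrides = channel_means or {}
    fill = {}
    for index, values in channels.items():
        observed = [v for v in values if not math.isnan(v)]
        if index in overrides:
            fill[index] = overrides[index]   # fixed value supplied by the user
        elif index in categorical:
            fill[index] = mode(observed)     # most common category
        else:
            fill[index] = mean(observed)     # mean of the observed values
    return fill


channels = {
    0: [1.0, math.nan, 2.0, 6.0],        # continuous channel
    1: [0.0, 1.0, 1.0, math.nan],        # categorical channel
    2: [math.nan, 5.0, 7.0, math.nan],   # channel with an overridden value
}
fill = fill_values(channels, categorical=[1], channel_means={2: 4.5})
print(fill)
```

Without the categorical argument, channel 1 would be filled with a mean of about 0.67, which is not a valid category.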

Parameters:

dataset (str) – The name of the UEA/UCR data set to return, as listed on the UEA/UCR repository website.
split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
missing (Union[float, List[float]]) – The proportion of data to drop at random. If missing is a single value, data are dropped from all channels. To drop data independently across each channel, pass a list of the proportion missing for each channel e.g. [0.5, 0.2, 0.8]. Default 0 i.e. no missing data simulation.
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”). See warning above.
categorical (List[int]) – List with channel indices of categorical variables. Only required if imputing data. Default [] i.e. no categorical variables.
channel_means (Dict[int, float]) – Override the calculated channel mean/mode when imputing data. Only used if imputing data. Dictionary with channel indices and values e.g. {1: 4.5, 3: 7.2} (default {} i.e. no overridden channel mean/modes).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).

X

A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).

A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the data set. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1) and the channels are in the order: time stamp, time series, missing data mask, time deltas.

Where trajectories are of unequal lengths they are padded with NaNs to the length of the longest trajectory in the data.

Type:: Tensor

y

One-hot encoded label data. A tensor of shape (n, l) where l is the number of classes.

Type:: Tensor
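For readers unfamiliar with the (n, l) label format, a one-hot encoding can be sketched as below. The function one_hot and the class names are illustrative, and the alphabetical ordering of classes is an assumption that may not match the ordering used by the library.

```python
def one_hot(labels):
    """One-hot encode a list of class labels.

    Returns the encoded rows and the class order used for the columns.
    """
    classes = sorted(set(labels))  # assumption: columns follow sorted class names
    column = {name: i for i, name in enumerate(classes)}
    encoded = [
        [1 if column[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]
    return encoded, classes


labels = ["walk", "run", "walk", "sit"]
encoded, classes = one_hot(labels)

# Recover the class index of each row, e.g. for use with a loss function
# that expects integer targets.
class_index = [row.index(1) for row in encoded]
```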

length

Length of each trajectory prior to padding. A tensor of shape (n).

Type:: Tensor

Note

X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.

Returns:: A PyTorch Dataset object which can be passed to a DataLoader.