Time series data sets

class torchtime.data.PhysioNet2012(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)

Returns the PhysioNet Challenge 2012 data as a PyTorch Dataset. See the PhysioNet website for a description of the data set.

The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split specify train_prop only. For a training/validation/test split specify both train_prop and val_prop.

For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
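The split arithmetic can be sketched in plain Python. The helper split_proportions below is hypothetical (it is not part of torchtime) and only shows how the remaining proportion is assigned to the last split. It does not perform the stratified sampling itself.

```python
from typing import Dict, Optional


def split_proportions(train_prop: float, val_prop: Optional[float] = None) -> Dict[str, float]:
    """Return the proportion of data in each split (hypothetical helper)."""
    if not 0 < train_prop < 1:
        raise ValueError("train_prop must be between 0 and 1")
    if val_prop is None:
        # Training/validation split: validation takes the remainder
        return {"train": train_prop, "val": round(1 - train_prop, 10)}
    if not 0 < val_prop < 1 - train_prop:
        raise ValueError("train_prop + val_prop must be less than 1")
    # Training/validation/test split: test takes the remainder
    return {
        "train": train_prop,
        "val": val_prop,
        "test": round(1 - train_prop - val_prop, 10),
    }


print(split_proportions(0.8))       # 80/20% train/validation
print(split_proportions(0.8, 0.1))  # 80/10/10% train/validation/test
```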

When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.

Missing data can be imputed using the impute argument. See the missing data tutorial for more information.
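As an illustration of what forward imputation means for a single channel, here is a minimal plain-Python sketch. The function forward_impute and its initial_fill parameter are hypothetical names for this example, not the torchtime implementation. Leading missing values, which have no earlier observation to carry forward, take the caller-supplied initial_fill.

```python
import math
from typing import List


def forward_impute(values: List[float], initial_fill: float = math.nan) -> List[float]:
    """Carry the last observed value forward over NaNs (sketch)."""
    filled = []
    last = initial_fill  # used until the first observation is seen
    for value in values:
        if not math.isnan(value):
            last = value
        filled.append(last)
    return filled


nan = float("nan")
print(forward_impute([nan, 2.0, nan, nan, 5.0, nan], initial_fill=3.0))
# [3.0, 2.0, 2.0, 2.0, 5.0, 5.0]
```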

Data channels are in the following order:

0. Mins

Minutes since ICU admission. Derived from the PhysioNet time stamp.

1. Albumin

Albumin (g/dL)

2. ALP

Alkaline phosphatase (IU/L)

3. ALT

Alanine transaminase (IU/L)

4. AST

Aspartate transaminase (IU/L)

5. Bilirubin

Bilirubin (mg/dL)

6. BUN

Blood urea nitrogen (mg/dL)

7. Cholesterol

Cholesterol (mg/dL)

8. Creatinine

Serum creatinine (mg/dL)

9. DiasABP

Invasive diastolic arterial blood pressure (mmHg)

10. FiO2

Fractional inspired O2 (0-1)

11. GCS

Glasgow Coma Score (3-15)

12. Glucose

Serum glucose (mg/dL)

13. HCO3

Serum bicarbonate (mmol/L)

14. HCT

Hematocrit (%)

15. HR

Heart rate (bpm)

16. K

Serum potassium (mEq/L)

17. Lactate

Lactate (mmol/L)

18. Mg

Serum magnesium (mmol/L)

19. MAP

Invasive mean arterial blood pressure (mmHg)

20. MechVent

Mechanical ventilation respiration (0: false, or 1: true)

21. Na

Serum sodium (mEq/L)

22. NIDiasABP

Non-invasive diastolic arterial blood pressure (mmHg)

23. NIMAP

Non-invasive mean arterial blood pressure (mmHg)

24. NISysABP

Non-invasive systolic arterial blood pressure (mmHg)

25. PaCO2

Partial pressure of arterial CO2 (mmHg)

26. PaO2

Partial pressure of arterial O2 (mmHg)

27. pH

Arterial pH (0-14)

28. Platelets

Platelets (cells/nL)

29. RespRate

Respiration rate (bpm)

30. SaO2

O2 saturation in hemoglobin (%)

31. SysABP

Invasive systolic arterial blood pressure (mmHg)

32. Temp

Temperature (°C)

33. TroponinI

Troponin-I (μg/L). Note this is labelled TropI in the PhysioNet data dictionary.

34. TroponinT

Troponin-T (μg/L). Note this is labelled TropT in the PhysioNet data dictionary.

35. Urine

Urine output (mL)

36. WBC

White blood cell count (cells/nL)

37. Weight

Weight (kg)

38. Age

Age (years) at ICU admission

39. Gender

Gender (0: female, or 1: male)

40. Height

Height (cm) at ICU admission

41. ICUType1

Type of ICU unit (1: Coronary Care Unit)

42. ICUType2

Type of ICU unit (2: Cardiac Surgery Recovery Unit)

43. ICUType3

Type of ICU unit (3: Medical ICU)

44. ICUType4

Type of ICU unit (4: Surgical ICU)

Note

Channels 38 to 44 (Age, Gender, Height and the ICUType indicators) do not vary with time.

Variables 11 (GCS) and 27 (pH) are assumed to be ordinal and are imputed using the same method as a continuous variable.

Variable 20 (MechVent) takes the value NaN (the majority of values) or 1. It is assumed that 1 indicates that mechanical ventilation has been used and that NaN indicates either missing data or no mechanical ventilation. Accordingly, the channel mode is assumed to be zero.

Variables 41-44 are the one-hot encoded value of ICUType.
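The one-hot encoding of ICUType into channels 41 to 44 can be sketched as follows. The function name one_hot_icu_type is hypothetical and the code only illustrates the encoding, not how torchtime builds the channels.

```python
from typing import List


def one_hot_icu_type(icu_type: int) -> List[int]:
    """One-hot encode ICUType (1 to 4) as ICUType1 to ICUType4 (sketch)."""
    if icu_type not in (1, 2, 3, 4):
        raise ValueError("ICUType must be 1, 2, 3 or 4")
    # One indicator per ICU type, in the order of channels 41 to 44
    return [int(icu_type == k) for k in (1, 2, 3, 4)]


print(one_hot_icu_type(3))  # Medical ICU -> [0, 0, 1, 0]
```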

Parameters
  • split (str) – The data split to return, either train, val (validation) or test.

  • train_prop (float) – Proportion of data in the training set.

  • val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).

  • impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).

  • time (bool) – Append time stamp in the first channel (default True).

  • mask (bool) – Append missing data mask for each channel (default False).

  • delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.

  • standardise (bool) – Standardise the time series (default False).

  • overwrite_cache (bool) – Overwrite saved cache (default False).

  • path (str) – Location of the .torchtime cache directory (default “.”).

  • seed (Optional[int]) – Random seed for reproducibility (optional).
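For the delta argument, the time since the previous observation in Che et al (2018) is zero at the first time step, and afterwards either restarts from the previous time step (if that value was observed) or keeps accumulating (if it was missing). A minimal single-channel sketch in plain Python follows. The function time_deltas is a hypothetical name, and treating NaN as "missing" is an assumption of this example rather than a statement about torchtime internals.

```python
import math
from typing import List


def time_deltas(times: List[float], values: List[float]) -> List[float]:
    """Time since the previous observation for one channel (sketch)."""
    if not times:
        return []
    deltas = [0.0]  # delta is zero at the first time step
    for t in range(1, len(times)):
        step = times[t] - times[t - 1]
        if math.isnan(values[t - 1]):
            # Previous value missing: keep accumulating elapsed time
            deltas.append(step + deltas[t - 1])
        else:
            # Previous value observed: the gap restarts from that step
            deltas.append(step)
    return deltas


nan = float("nan")
print(time_deltas([0.0, 1.0, 3.0, 6.0, 10.0], [47.0, nan, nan, 40.0, nan]))
# [0.0, 1.0, 3.0, 6.0, 4.0]
```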

X

A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the time since admission in minutes). See above for the order of the PhysioNet channels. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).

A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 127) and the channels are in the order: time stamp, time series, missing data mask, time deltas.

Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.
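The padding and the accompanying length values can be illustrated with nested lists standing in for tensors. This is a sketch under that assumption, and pad_trajectories is a hypothetical helper rather than a torchtime function.

```python
import math
from typing import List, Tuple

Trajectory = List[List[float]]  # time steps, each a list of channel values


def pad_trajectories(trajectories: List[Trajectory]) -> Tuple[List[Trajectory], List[int]]:
    """Pad each trajectory with NaN rows to the longest length (sketch)."""
    lengths = [len(traj) for traj in trajectories]  # lengths before padding
    longest = max(lengths)
    n_channels = len(trajectories[0][0])
    padded = [
        traj + [[math.nan] * n_channels for _ in range(longest - len(traj))]
        for traj in trajectories
    ]
    return padded, lengths


short = [[1.0, 2.0]]
long = [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
X, length = pad_trajectories([long, short])
print(length)            # [3, 1]
print(len(X[1]), X[1])   # the short trajectory now has 3 rows, 2 of them NaN
```

Downstream code can use length to ignore the padded rows, for example by slicing each trajectory to its original length.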

Type

Tensor

y

In-hospital survival (the In-hospital_death variable) for each patient. y = 1 indicates an in-hospital death. A tensor of shape (n, 1).

Type

Tensor

length

Length of each trajectory prior to padding. A tensor of shape (n).

Type

Tensor

Note

X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.

Returns

A PyTorch Dataset object which can be passed to a DataLoader.

class torchtime.data.PhysioNet2019(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)

Returns the PhysioNet Challenge 2019 data as a PyTorch Dataset. See the PhysioNet website for a description of the data set.

The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split specify train_prop only. For a training/validation/test split specify both train_prop and val_prop.

For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.

When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.

Missing data can be imputed using the impute argument. See the missing data tutorial for more information.

Parameters
  • split (str) – The data split to return, either train, val (validation) or test.

  • train_prop (float) – Proportion of data in the training set.

  • val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).

  • impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).

  • time (bool) – Append time stamp in the first channel (default True).

  • mask (bool) – Append missing data mask for each channel (default False).

  • delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.

  • standardise (bool) – Standardise the time series (default False).

  • overwrite_cache (bool) – Overwrite saved cache (default False).

  • path (str) – Location of the .torchtime cache directory (default “.”).

  • seed (Optional[int]) – Random seed for reproducibility (optional).

X

A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the ICULOS time stamp). The channels are ordered as set out on the PhysioNet website. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).

A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 121) and the channels are in the order: time stamp, time series, missing data mask, time deltas.
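The size of the last dimension of X follows directly from the time, mask and delta flags. The helper n_channels below is hypothetical and only restates that arithmetic. The value c = 40 is the one implied by 3 * c + 1 = 121 above.

```python
def n_channels(c: int, time: bool = True, mask: bool = False, delta: bool = False) -> int:
    """Size of the last dimension of X for data with c channels (sketch)."""
    total = c
    if time:
        total += 1  # time stamp in the first channel
    if mask:
        total += c  # one missing data mask channel per data channel
    if delta:
        total += c  # one time delta channel per data channel
    return total


print(n_channels(40))                                   # 41, i.e. c + 1
print(n_channels(40, time=False))                       # 40, i.e. c
print(n_channels(40, time=True, mask=True, delta=True)) # 121, i.e. 3 * c + 1
```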

Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.

Type

Tensor

y

SepsisLabel at each time point. A tensor of shape (n, s, 1).

Type

Tensor

length

Length of each trajectory prior to padding. A tensor of shape (n).

Type

Tensor

Note

X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.

Returns

A PyTorch Dataset object which can be passed to a DataLoader.

class torchtime.data.PhysioNet2019Binary(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)

Returns a binary prediction variant of the PhysioNet Challenge 2019 data as a PyTorch Dataset.

In contrast with the full challenge, the first 72 hours of data are used to predict whether a patient develops sepsis at any point during the period of hospitalisation as in Kidger et al (2020). See the PhysioNet website for a description of the data set.
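The relationship between the full challenge data and the binary variant can be sketched as below. This is an illustration only: the function to_binary_task is a hypothetical name, and the sketch assumes one row per hour (so the first 72 hours are the first 72 rows), which may not match how torchtime prepares the data.

```python
from typing import List, Tuple


def to_binary_task(
    rows: List[List[float]], sepsis_labels: List[int], max_hours: int = 72
) -> Tuple[List[List[float]], int]:
    """Truncate features to the first hours and reduce labels to one value (sketch)."""
    features = rows[:max_hours]       # assumes one row per hour
    label = int(any(sepsis_labels))   # 1 if sepsis occurs at any time point
    return features, label


# Hypothetical 100-hour stay with sepsis labelled from hour 90 onwards
rows = [[float(hour)] for hour in range(100)]
labels = [0] * 90 + [1] * 10
features, label = to_binary_task(rows, labels)
print(len(features), label)  # 72 1
```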

The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split specify train_prop only. For a training/validation/test split specify both train_prop and val_prop.

For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.

When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.

Missing data can be imputed using the impute argument. See the missing data tutorial for more information.

Parameters
  • split (str) – The data split to return, either train, val (validation) or test.

  • train_prop (float) – Proportion of data in the training set.

  • val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).

  • impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).

  • time (bool) – Append time stamp in the first channel (default True).

  • mask (bool) – Append missing data mask for each channel (default False).

  • delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.

  • standardise (bool) – Standardise the time series (default False).

  • overwrite_cache (bool) – Overwrite saved cache (default False).

  • path (str) – Location of the .torchtime cache directory (default “.”).

  • seed (Optional[int]) – Random seed for reproducibility (optional).

X

A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the ICULOS time stamp). The channels are ordered as set out on the PhysioNet website. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).

A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 121) and the channels are in the order: time stamp, time series, missing data mask, time deltas.

Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.

Type

Tensor

y

Whether patient is diagnosed with sepsis at any time during hospitalisation. A tensor of shape (n, 1).

Type

Tensor

length

Length of each trajectory prior to padding. A tensor of shape (n).

Type

Tensor

Note

X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.

Returns

A PyTorch Dataset object which can be passed to a DataLoader.

class torchtime.data.UEA(dataset, split, train_prop, val_prop=None, missing=0.0, impute='none', categorical=[], channel_means={}, time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)

Returns a time series classification data set from the UEA/UCR repository as a PyTorch Dataset. See the UEA/UCR repository website for the data sets.

The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split specify train_prop only. For a training/validation/test split specify both train_prop and val_prop.

For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.

When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.

Missing data can be simulated by dropping data at random. Support is also provided to impute missing data. These options are controlled by the missing and impute arguments. See the missing data tutorial for more information.
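To make the idea of dropping data at random concrete, here is a plain-Python sketch for the case where a proportion is given for each channel. The function drop_at_random is hypothetical, and drawing each value independently with the given probability is an assumption of this example. torchtime may select the dropped values differently.

```python
import math
import random
from typing import List, Optional


def drop_at_random(
    series: List[List[float]], missing: List[float], seed: Optional[int] = None
) -> List[List[float]]:
    """Replace values with NaN, channel by channel, with given probabilities (sketch)."""
    if len(missing) != len(series[0]):
        raise ValueError("pass one proportion per channel")
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        [math.nan if rng.random() < prop else value for value, prop in zip(row, missing)]
        for row in series
    ]


series = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Keep channel 0 intact and drop everything in channel 1
print(drop_at_random(series, missing=[0.0, 1.0], seed=0))
```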

Warning

Mean imputation is unsuitable for categorical variables. To impute missing values for a categorical variable with the channel mode (rather than the channel mean), pass the channel indices to the categorical argument. Note this is also required for forward imputation to appropriately impute initial missing values.

Alternatively, the calculated channel mean/mode can be overridden using the channel_means argument. This can be used to impute missing data with a fixed value.
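The difference between imputing with the channel mean, the channel mode and a fixed override can be seen on a single binary channel. The sketch below uses only the standard library. The function impute_channel and its arguments are hypothetical stand-ins for the roles played by the categorical and channel_means arguments, and the statistics are computed from the values passed in, which may differ from how torchtime calculates them.

```python
import math
from statistics import mean, mode
from typing import List, Optional


def impute_channel(
    values: List[float], categorical: bool = False, override: Optional[float] = None
) -> List[float]:
    """Fill NaNs with the channel mean, mode or a fixed value (sketch)."""
    observed = [v for v in values if not math.isnan(v)]
    if override is not None:
        fill = override        # fixed value replaces the calculated statistic
    elif categorical:
        fill = mode(observed)  # most common observed value
    else:
        fill = mean(observed)  # arithmetic mean of observed values
    return [fill if math.isnan(v) else v for v in values]


channel = [0.0, 1.0, 1.0, float("nan"), 1.0]
print(impute_channel(channel))                    # missing value becomes 0.75
print(impute_channel(channel, categorical=True))  # missing value becomes 1.0
print(impute_channel(channel, override=4.5))      # missing value becomes 4.5
```

The mean of 0.75 is not a valid category for a binary channel, which is why the mode is the safer choice for categorical variables.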

Parameters
  • dataset (str) – The name of the UEA/UCR data set to load (see the UEA/UCR repository website for the list of available data sets).

  • split (str) – The data split to return, either train, val (validation) or test.

  • train_prop (float) – Proportion of data in the training set.

  • val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).

  • missing (Union[float, List[float]]) – The proportion of data to drop at random. If missing is a single value, data are dropped from all channels. To drop data independently across each channel, pass a list of the proportion missing for each channel e.g. [0.5, 0.2, 0.8]. Default 0 i.e. no missing data simulation.

  • impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”). See warning above.

  • categorical (List[int]) – List with channel indices of categorical variables. Only required if imputing data. Default [] i.e. no categorical variables.

  • channel_means (Dict[int, float]) – Override the calculated channel mean/mode when imputing data. Only used if imputing data. Dictionary with channel indices and values e.g. {1: 4.5, 3: 7.2} (default {} i.e. no overridden channel mean/modes).

  • time (bool) – Append time stamp in the first channel (default True).

  • mask (bool) – Append missing data mask for each channel (default False).

  • delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.

  • standardise (bool) – Standardise the time series (default False).

  • overwrite_cache (bool) – Overwrite saved cache (default False).

  • path (str) – Location of the .torchtime cache directory (default “.”).

  • seed (Optional[int]) – Random seed for reproducibility (optional).

X

A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).

A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the data set. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1) and the channels are in the order: time stamp, time series, missing data mask, time deltas.

Where trajectories are of unequal lengths they are padded with NaNs to the length of the longest trajectory in the data.

Type

Tensor

y

One-hot encoded label data. A tensor of shape (n, l) where l is the number of classes.

Type

Tensor

length

Length of each trajectory prior to padding. A tensor of shape (n).

Type

Tensor

Note

X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.

Returns

A PyTorch Dataset object which can be passed to a DataLoader.