Time series data sets
- class torchtime.data.PhysioNet2012(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)
Returns the PhysioNet Challenge 2012 data as a PyTorch Dataset. See the PhysioNet website for a description of the data set.
The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split specify train_prop only. For a training/validation/test split specify both train_prop and val_prop.
For example, train_prop=0.8 generates an 80/20% train/validation split, but train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.
Missing data can be imputed using the impute argument. See the missing data tutorial for more information.
Data channels are in the following order:
- 0. Mins:
Minutes since ICU admission. Derived from the PhysioNet time stamp.
- 1. Albumin:
Albumin (g/dL)
- 2. ALP:
Alkaline phosphatase (IU/L)
- 3. ALT:
Alanine transaminase (IU/L)
- 4. AST:
Aspartate transaminase (IU/L)
- 5. Bilirubin:
Bilirubin (mg/dL)
- 6. BUN:
Blood urea nitrogen (mg/dL)
- 7. Cholesterol:
Cholesterol (mg/dL)
- 8. Creatinine:
Serum creatinine (mg/dL)
- 9. DiasABP:
Invasive diastolic arterial blood pressure (mmHg)
- 10. FiO2:
Fractional inspired O2 (0-1)
- 11. GCS:
Glasgow Coma Score (3-15)
- 12. Glucose:
Serum glucose (mg/dL)
- 13. HCO3:
Serum bicarbonate (mmol/L)
- 14. HCT:
Hematocrit (%)
- 15. HR:
Heart rate (bpm)
- 16. K:
Serum potassium (mEq/L)
- 17. Lactate:
Lactate (mmol/L)
- 18. Mg:
Serum magnesium (mmol/L)
- 19. MAP:
Invasive mean arterial blood pressure (mmHg)
- 20. MechVent:
Mechanical ventilation respiration (0:false, or 1:true)
- 21. Na:
Serum sodium (mEq/L)
- 22. NIDiasABP:
Non-invasive diastolic arterial blood pressure (mmHg)
- 23. NIMAP:
Non-invasive mean arterial blood pressure (mmHg)
- 24. NISysABP:
Non-invasive systolic arterial blood pressure (mmHg)
- 25. PaCO2:
Partial pressure of arterial CO2 (mmHg)
- 26. PaO2:
Partial pressure of arterial O2 (mmHg)
- 27. pH:
Arterial pH (0-14)
- 28. Platelets:
Platelets (cells/nL)
- 29. RespRate:
Respiration rate (bpm)
- 30. SaO2:
O2 saturation in hemoglobin (%)
- 31. SysABP:
Invasive systolic arterial blood pressure (mmHg)
- 32. Temp:
Temperature (°C)
- 33. TroponinI:
Troponin-I (μg/L). Note this is labelled TropI in the PhysioNet data dictionary.
- 34. TroponinT:
Troponin-T (μg/L). Note this is labelled TropT in the PhysioNet data dictionary.
- 35. Urine:
Urine output (mL)
- 36. WBC:
White blood cell count (cells/nL)
- 37. Weight:
Weight (kg)
- 38. Age:
Age (years) at ICU admission
- 39. Gender:
Gender (0: female, or 1: male)
- 40. Height:
Height (cm) at ICU admission
- 41. ICUType1:
Type of ICU unit (1: Coronary Care Unit)
- 42. ICUType2:
Type of ICU unit (2: Cardiac Surgery Recovery Unit)
- 43. ICUType3:
Type of ICU unit (3: Medical ICU)
- 44. ICUType4:
Type of ICU unit (4: Surgical ICU)
Note
Channels 38 to 44 (Age, Gender, Height and the one-hot encoded ICUType) do not vary with time.
Variables 11 (GCS) and 27 (pH) are assumed to be ordinal and are imputed using the same method as a continuous variable.
Variable 20 (MechVent) has value NaN (the majority of values) or 1. It is assumed that value 1 indicates that mechanical ventilation has been used and NaN indicates either missing data or no mechanical ventilation. Accordingly, the channel mode is assumed to be zero.
Variables 41-44 are the one-hot encoded value of ICUType.
- Parameters:
split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).
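As a rough illustration of what the zero, mean and forward options of the impute argument mean, the sketch below applies each to a single channel held as a plain Python list. It is not the torchtime implementation: the function names are hypothetical, the mean is taken over the series itself, and the fallback argument for leading gaps is an assumption of this sketch.

```python
import math


def impute_zero(series):
    """Replace every missing value with zero."""
    return [0.0 if math.isnan(v) else v for v in series]


def impute_mean(series):
    """Replace every missing value with the mean of the observed values."""
    observed = [v for v in series if not math.isnan(v)]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if math.isnan(v) else v for v in series]


def impute_forward(series, fallback):
    """Carry the last observation forward; leading gaps take `fallback`."""
    filled, last = [], fallback
    for v in series:
        if not math.isnan(v):
            last = v
        filled.append(last)
    return filled
```

A custom imputation function passed to impute would instead take and return a Tensor, as the type annotation above indicates.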
- X
A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the time since admission in minutes). See above for the order of the PhysioNet channels. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).
A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 127) and the channels are in the order: time stamp, time series, missing data mask, time deltas.
Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.
- Type:
Tensor
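The size of the final dimension of X follows directly from the time, mask and delta arguments. The arithmetic can be sketched with a small helper; the name n_channels is hypothetical and not part of torchtime.

```python
def n_channels(c, time=True, mask=False, delta=False):
    """Size of the last dimension of X for a data set with c channels."""
    total = c
    if time:
        total += 1  # time stamp appended as the first channel
    if mask:
        total += c  # one missing data mask channel per data channel
    if delta:
        total += c  # one time delta channel per data channel
    return total
```

With all three options enabled the result is 3 * c + 1, so the figure of 127 quoted above corresponds to c = 42.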
- y
In-hospital survival (the In-hospital_death variable) for each patient. y = 1 indicates an in-hospital death. A tensor of shape (n, 1).
- Type:
Tensor
- length
Length of each trajectory prior to padding. A tensor of shape (n).
- Type:
Tensor
Note
X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.
- Returns:
A PyTorch Dataset object which can be passed to a DataLoader.
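The relationship between train_prop, val_prop and the implied test proportion described for this class can be written out as a short sketch. The helper split_proportions is hypothetical; the validation checks are assumptions about sensible inputs rather than documented torchtime behaviour.

```python
def split_proportions(train_prop, val_prop=None):
    """Return (train, validation, test) proportions implied by the arguments."""
    if not 0 < train_prop < 1:
        raise ValueError("train_prop must lie strictly between 0 and 1")
    if val_prop is None:
        # Two-way split: everything outside the training set is validation data.
        return train_prop, 1 - train_prop, 0.0
    if val_prop <= 0 or train_prop + val_prop >= 1:
        raise ValueError("val_prop must be positive and leave room for a test set")
    # Three-way split: the remainder forms the test set.
    return train_prop, val_prop, 1 - train_prop - val_prop
```

So train_prop=0.8 alone gives roughly (0.8, 0.2, 0.0), and adding val_prop=0.1 gives roughly (0.8, 0.1, 0.1), up to floating point rounding.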
- class torchtime.data.PhysioNet2019(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)
Returns the PhysioNet Challenge 2019 data as a PyTorch Dataset. See the PhysioNet website for a description of the data set.
The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split specify train_prop only. For a training/validation/test split specify both train_prop and val_prop.
For example, train_prop=0.8 generates an 80/20% train/validation split, but train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.
Missing data can be imputed using the impute argument. See the missing data tutorial for more information.
- Parameters:
split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).
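The delta argument refers to the time since the previous observation as defined in Che et al (2018): the delta is zero at the first time step, and afterwards it is the gap to the previous time step, plus the previous delta if the previous value was missing. The sketch below implements that recurrence for one channel in plain Python. The function name is hypothetical, the mask convention (1 = observed) is an assumption, and torchtime's own implementation may differ in detail.

```python
def time_deltas(times, mask):
    """Time since the last observation of one channel at each time step.

    times: time stamp of each step.
    mask:  1 where the channel was observed at that step, 0 where missing.
    """
    if len(times) != len(mask):
        raise ValueError("times and mask must have the same length")
    if not times:
        return []
    deltas = [0.0]  # delta is zero at the first time step
    for t in range(1, len(times)):
        gap = times[t] - times[t - 1]
        if mask[t - 1]:
            deltas.append(gap)  # previous step observed: reset to the gap
        else:
            deltas.append(gap + deltas[t - 1])  # previous step missing: accumulate
    return deltas
```

For times 0, 1, 2, 4, 5 with observations at steps 0, 3 and 4 only, the deltas are 0, 1, 2, 4, 1.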
- X
A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the ICULOS time stamp). The channels are ordered as set out on the PhysioNet website. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).
A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 121) and the channels are in the order: time stamp, time series, missing data mask, time deltas.
Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.
- Type:
Tensor
- y
SepsisLabel at each time point. A tensor of shape (n, s, 1).
- Type:
Tensor
- length
Length of each trajectory prior to padding. A tensor of shape (n).
- Type:
Tensor
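The relationship between the NaN padding of X and the length attribute can be sketched with nested lists in place of tensors. The helper pad_trajectories is hypothetical and assumes at least one non-empty trajectory; it only illustrates the padding scheme described above.

```python
import math


def pad_trajectories(trajectories):
    """Pad trajectories with NaN rows to the length of the longest one.

    trajectories: list of trajectories, each a list of per-step channel lists.
    Returns the padded trajectories and the original lengths.
    """
    lengths = [len(trajectory) for trajectory in trajectories]
    longest = max(lengths)
    n_channels = len(trajectories[0][0])
    padded = []
    for trajectory in trajectories:
        rows = [list(row) for row in trajectory]
        # Append one all-NaN row for every missing time step.
        rows += [[math.nan] * n_channels for _ in range(longest - len(rows))]
        padded.append(rows)
    return padded, lengths
```

Slicing a padded trajectory with its entry in lengths recovers the original, unpadded data.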
Note
X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.
- Returns:
A PyTorch Dataset object which can be passed to a DataLoader.
- class torchtime.data.PhysioNet2019Binary(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)
Returns a binary prediction variant of the PhysioNet Challenge 2019 data as a PyTorch Dataset.
In contrast with the full challenge, the first 72 hours of data are used to predict whether a patient develops sepsis at any point during the period of hospitalisation as in Kidger et al (2020). See the PhysioNet website for a description of the data set.
The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split specify train_prop only. For a training/validation/test split specify both train_prop and val_prop.
For example, train_prop=0.8 generates an 80/20% train/validation split, but train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.
Missing data can be imputed using the impute argument. See the missing data tutorial for more information.
- Parameters:
split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).
- X
A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the ICULOS time stamp). The channels are ordered as set out on the PhysioNet website. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).
A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 121) and the channels are in the order: time stamp, time series, missing data mask, time deltas.
Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.
- Type:
Tensor
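The channel order described for X (time stamp, time series, missing data mask, time deltas) can be illustrated for a single time step. Both helpers below are hypothetical, and the mask convention used here (1 = observed, 0 = missing) is an assumption rather than something stated in this reference.

```python
import math


def missing_mask(values):
    """1.0 where a value is observed, 0.0 where it is NaN (assumed convention)."""
    return [0.0 if math.isnan(v) else 1.0 for v in values]


def assemble_step(time_stamp, values, deltas):
    """Concatenate one time step in the order: time, series, mask, deltas."""
    if len(values) != len(deltas):
        raise ValueError("values and deltas must have one entry per channel")
    return [time_stamp] + list(values) + missing_mask(values) + list(deltas)
```

For c data channels the assembled row has 3 * c + 1 entries, matching the shape given above.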
- y
Whether patient is diagnosed with sepsis at any time during hospitalisation. A tensor of shape (n, 1).
- Type:
Tensor
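The binary task described for this class (the first 72 hours of data as input, sepsis at any point during the stay as the label) can be sketched as follows. The helper to_binary_example is hypothetical, and it assumes one row per hour, as in the PhysioNet 2019 records; any further filtering applied by torchtime or by Kidger et al (2020) is not reproduced here.

```python
def to_binary_example(rows, sepsis_labels, max_hours=72):
    """Build one (X, y) pair for the binary task from an hourly record.

    rows:          per-hour channel lists for one patient.
    sepsis_labels: per-hour SepsisLabel values (0 or 1) for the same patient.
    """
    if len(rows) != len(sepsis_labels):
        raise ValueError("rows and sepsis_labels must have the same length")
    x = rows[:max_hours]  # inputs are restricted to the first 72 hours
    y = 1 if any(sepsis_labels) else 0  # label covers the whole stay
    return x, y
```

Note that the label can be 1 even when sepsis is only recorded after the 72-hour input window.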
- length
Length of each trajectory prior to padding. A tensor of shape (n).
- Type:
Tensor
Note
X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.
- Returns:
A PyTorch Dataset object which can be passed to a DataLoader.
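Stratified sampling, which the class description says is used to form the splits, keeps the class balance of the labels roughly equal across the training, validation and test sets. The sketch below shows one simple way to do this with the standard library. The function stratified_split is hypothetical and is not how torchtime necessarily implements it.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_prop, val_prop=None, seed=None):
    """Split indices into train/val/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(train_prop * len(indices))
        if val_prop is None:
            n_val = len(indices) - n_train  # no test set: the rest is validation
        else:
            n_val = round(val_prop * len(indices))
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]
    return splits
```

Passing the same seed reproduces the same split, which is the role of the seed argument above.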
- class torchtime.data.UEA(dataset, split, train_prop, val_prop=None, missing=0.0, impute='none', categorical=[], channel_means={}, time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)
Returns a time series classification data set from the UEA/UCR repository as a PyTorch Dataset. See the UEA/UCR repository website for the data sets.
The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split specify train_prop only. For a training/validation/test split specify both train_prop and val_prop.
For example, train_prop=0.8 generates an 80/20% train/validation split, but train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.
Missing data can be simulated by dropping data at random. Support is also provided to impute missing data. These options are controlled by the missing and impute arguments. See the missing data tutorial for more information.
Warning
Mean imputation is unsuitable for categorical variables. To impute missing values for a categorical variable with the channel mode (rather than the channel mean), pass the channel indices to the categorical argument. Note this is also required for forward imputation to appropriately impute initial missing values.
Alternatively, the calculated channel mean/mode can be overridden using the channel_means argument. This can be used to impute missing data with a fixed value.
- Parameters:
dataset (str) – The UEA/UCR data set to return, chosen from the repository's list of data sets.
split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
missing (Union[float, List[float]]) – The proportion of data to drop at random. If missing is a single value, data are dropped from all channels. To drop data independently across each channel, pass a list of the proportion missing for each channel e.g. [0.5, 0.2, 0.8]. Default 0 i.e. no missing data simulation.
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”). See warning above.
categorical (List[int]) – List with channel indices of categorical variables. Only required if imputing data. Default [] i.e. no categorical variables.
channel_means (Dict[int, float]) – Override the calculated channel mean/mode when imputing data. Only used if imputing data. Dictionary with channel indices and values e.g. {1: 4.5, 3: 7.2} (default {} i.e. no overridden channel mean/modes).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).
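The interaction between the categorical and channel_means arguments can be summarised as a choice of fill value per channel: an override if one is given, otherwise the mode for categorical channels and the mean for the rest. The sketch below expresses that rule in plain Python. The function fill_values is hypothetical, and it assumes every channel has at least one observed value.

```python
import math
from collections import Counter


def fill_values(channels, categorical=(), channel_means=None):
    """Return the value used to impute each channel, keyed by channel index.

    channels: list of per-channel value lists, with NaN marking missing data.
    """
    overrides = channel_means or {}
    fills = {}
    for index, values in enumerate(channels):
        observed = [v for v in values if not math.isnan(v)]
        if index in overrides:
            fills[index] = overrides[index]  # fixed value supplied by the caller
        elif index in categorical:
            fills[index] = Counter(observed).most_common(1)[0][0]  # channel mode
        else:
            fills[index] = sum(observed) / len(observed)  # channel mean
    return fills
```

Using the mode for a categorical channel avoids imputing a value, such as 0.67, that is not a valid category.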
- X
A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).
A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the data set. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1) and the channels are in the order: time stamp, time series, missing data mask, time deltas.
Where trajectories are of unequal lengths they are padded with NaNs to the length of the longest trajectory in the data.
- Type:
Tensor
- y
One-hot encoded label data. A tensor of shape (n, l) where l is the number of classes.
- Type:
Tensor
- length
Length of each trajectory prior to padding. A tensor of shape (n).
- Type:
Tensor
Note
X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.
- Returns:
A PyTorch Dataset object which can be passed to a DataLoader.
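The missing data simulation controlled by the missing argument of this class can be sketched as below. This is one reading of the parameter description, stated here as an assumption: a single value drops the same time steps from every channel, whereas a list drops time steps independently per channel. The function drop_at_random is hypothetical and assumes all channels have the same length.

```python
import math
import random


def drop_at_random(channels, missing, seed=None):
    """Replace a proportion of values with NaN to simulate missing data.

    channels: list of per-channel value lists of equal length.
    missing:  a single proportion, or one proportion per channel.
    """
    rng = random.Random(seed)
    length = len(channels[0])
    if isinstance(missing, (int, float)):
        # Assumed behaviour: one set of time steps shared by all channels.
        shared = set(rng.sample(range(length), round(missing * length)))
        drops = [shared] * len(channels)
    else:
        if len(missing) != len(channels):
            raise ValueError("pass one proportion per channel")
        # Independent set of time steps for each channel.
        drops = [set(rng.sample(range(length), round(p * length))) for p in missing]
    return [
        [math.nan if i in drop else v for i, v in enumerate(values)]
        for values, drop in zip(channels, drops)
    ]
```

The input lists are left unchanged, so the simulated and complete data can be compared when evaluating an imputation method.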