Time series data sets
- class torchtime.data.PhysioNet2012(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)
Returns the PhysioNet Challenge 2012 data as a PyTorch Dataset. See the PhysioNet website for a description of the data set.
The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split, specify train_prop only. For a training/validation/test split, specify both train_prop and val_prop.
For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.
Missing data can be imputed using the impute argument. See the missing data tutorial for more information.
Data channels are in the following order:
- 0. Mins:
Minutes since ICU admission. Derived from the PhysioNet time stamp.
- 1. Albumin:
Albumin (g/dL)
- 2. ALP:
Alkaline phosphatase (IU/L)
- 3. ALT:
Alanine transaminase (IU/L)
- 4. AST:
Aspartate transaminase (IU/L)
- 5. Bilirubin:
Bilirubin (mg/dL)
- 6. BUN:
Blood urea nitrogen (mg/dL)
- 7. Cholesterol:
Cholesterol (mg/dL)
- 8. Creatinine:
Serum creatinine (mg/dL)
- 9. DiasABP:
Invasive diastolic arterial blood pressure (mmHg)
- 10. FiO2:
Fractional inspired O2 (0-1)
- 11. GCS:
Glasgow Coma Score (3-15)
- 12. Glucose:
Serum glucose (mg/dL)
- 13. HCO3:
Serum bicarbonate (mmol/L)
- 14. HCT:
Hematocrit (%)
- 15. HR:
Heart rate (bpm)
- 16. K:
Serum potassium (mEq/L)
- 17. Lactate:
Lactate (mmol/L)
- 18. Mg:
Serum magnesium (mmol/L)
- 19. MAP:
Invasive mean arterial blood pressure (mmHg)
- 20. MechVent:
Mechanical ventilation respiration (0:false, or 1:true)
- 21. Na:
Serum sodium (mEq/L)
- 22. NIDiasABP:
Non-invasive diastolic arterial blood pressure (mmHg)
- 23. NIMAP:
Non-invasive mean arterial blood pressure (mmHg)
- 24. NISysABP:
Non-invasive systolic arterial blood pressure (mmHg)
- 25. PaCO2:
Partial pressure of arterial CO2 (mmHg)
- 26. PaO2:
Partial pressure of arterial O2 (mmHg)
- 27. pH:
Arterial pH (0-14)
- 28. Platelets:
Platelets (cells/nL)
- 29. RespRate:
Respiration rate (bpm)
- 30. SaO2:
O2 saturation in hemoglobin (%)
- 31. SysABP:
Invasive systolic arterial blood pressure (mmHg)
- 32. Temp:
Temperature (°C)
- 33. TroponinI:
Troponin-I (μg/L). Note this is labelled TropI in the PhysioNet data dictionary.
- 34. TroponinT:
Troponin-T (μg/L). Note this is labelled TropT in the PhysioNet data dictionary.
- 35. Urine:
Urine output (mL)
- 36. WBC:
White blood cell count (cells/nL)
- 37. Weight:
Weight (kg)
- 38. Age:
Age (years) at ICU admission
- 39. Gender:
Gender (0: female, or 1: male)
- 40. Height:
Height (cm) at ICU admission
- 41. ICUType1:
Type of ICU unit (1: Coronary Care Unit)
- 42. ICUType2:
Type of ICU unit (2: Cardiac Surgery Recovery Unit)
- 43. ICUType3:
Type of ICU unit (3: Medical ICU)
- 44. ICUType4:
Type of ICU unit (4: Surgical ICU)
Note
Channels 38 to 44 do not vary with time.
Variables 11 (GCS) and 27 (pH) are assumed to be ordinal and are imputed using the same method as a continuous variable.
Variable 20 (MechVent) has value NaN (the majority of values) or 1. It is assumed that value 1 indicates that mechanical ventilation has been used and NaN indicates either missing data or no mechanical ventilation. Accordingly, the channel mode is assumed to be zero.
Variables 41-44 are the one-hot encoded value of ICUType.
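The encoding described in the note for channels 20 and 41-44 can be illustrated with a small sketch in plain Python. This is not torchtime's implementation: the helper names one_hot_icu_type and fill_mech_vent are hypothetical, and plain lists stand in for tensors.

```python
import math


def one_hot_icu_type(icu_type):
    """Return the ICUType1..ICUType4 indicator values for an ICUType code in 1..4."""
    if icu_type not in (1, 2, 3, 4):
        raise ValueError("ICUType must be 1, 2, 3 or 4")
    return [1.0 if icu_type == k else 0.0 for k in (1, 2, 3, 4)]


def fill_mech_vent(values):
    """Replace NaN with 0.0, the assumed channel mode for MechVent."""
    return [0.0 if math.isnan(v) else v for v in values]


print(one_hot_icu_type(3))                         # [0.0, 0.0, 1.0, 0.0]
print(fill_mech_vent([math.nan, 1.0, math.nan]))   # [0.0, 1.0, 0.0]
```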
- Parameters:
split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).
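The way train_prop and val_prop determine the size of each split can be sketched as simple arithmetic. The function split_proportions below is a hypothetical helper written for illustration only; it does not perform the stratified sampling and is not part of torchtime.

```python
def split_proportions(train_prop, val_prop=None):
    """Return the (train, val, test) proportions implied by the two arguments."""
    if not 0 < train_prop < 1:
        raise ValueError("train_prop must be between 0 and 1")
    if val_prop is None:
        # Two-way split: everything outside the training set is validation data.
        return train_prop, round(1 - train_prop, 10), 0.0
    if not 0 < val_prop < 1 - train_prop:
        raise ValueError("val_prop must leave room for a test set")
    # Three-way split: the remainder becomes the test set.
    return train_prop, val_prop, round(1 - train_prop - val_prop, 10)


print(split_proportions(0.8))        # (0.8, 0.2, 0.0)
print(split_proportions(0.8, 0.1))   # (0.8, 0.1, 0.1)
```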
- X
A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the time since admission in minutes). See above for the order of the PhysioNet channels. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).
A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 127) and the channels are in the order: time stamp, time series, missing data mask, time deltas.
Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.
- Type:
Tensor
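The channel ordering of X (time stamp, time series, missing data mask, time deltas) can be made concrete by computing the index range of each block along the last axis. The sketch below is illustrative: channel_layout is a hypothetical helper, and c = 42 is used only because it is the value implied by the stated total of 127.

```python
def channel_layout(c, time=True, mask=False, delta=False):
    """Return (total_channels, {block: (start, stop)}) for the last axis of X."""
    layout = {}
    start = 0
    if time:
        layout["time"] = (start, start + 1)
        start += 1
    layout["series"] = (start, start + c)
    start += c
    if mask:
        layout["mask"] = (start, start + c)
        start += c
    if delta:
        layout["delta"] = (start, start + c)
        start += c
    return start, layout


total, layout = channel_layout(42, time=True, mask=True, delta=True)
print(total)    # 127
print(layout)   # {'time': (0, 1), 'series': (1, 43), 'mask': (43, 85), 'delta': (85, 127)}
```

With a layout like this, a block can be selected by slicing the last axis, for example X[:, :, 43:85] for the mask in the configuration above.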
- y
In-hospital survival (the In-hospital_death variable) for each patient. y = 1 indicates an in-hospital death. A tensor of shape (n, 1).
- Type:
Tensor
- length
Length of each trajectory prior to padding. A tensor of shape (n).
- Type:
Tensor
Note
X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.
- Returns:
A PyTorch Dataset object which can be passed to a DataLoader.
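A minimal usage sketch, based only on the signature and batch format documented above. The wrapper function physionet2012_loader and its default values are hypothetical, and the imports are deferred so that the function can be defined without torch or torchtime installed. Calling it is expected to populate the .torchtime cache on first use.

```python
def physionet2012_loader(batch_size=32, seed=123):
    """Build a DataLoader over the PhysioNet 2012 training split."""
    # Imported lazily so that defining the function needs neither package.
    from torch.utils.data import DataLoader
    from torchtime.data import PhysioNet2012

    train = PhysioNet2012(
        split="train",
        train_prop=0.8,    # 80% training data
        val_prop=0.1,      # 10% validation, leaving 10% for testing
        impute="forward",  # one of the documented imputation methods
        seed=seed,
    )
    return DataLoader(train, batch_size=batch_size, shuffle=True)


# Example of consuming batches, which are dictionaries with X, y and length:
# for batch in physionet2012_loader():
#     X, y, length = batch["X"], batch["y"], batch["length"]
```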
- class torchtime.data.PhysioNet2019(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)
Returns the PhysioNet Challenge 2019 data as a PyTorch Dataset. See the PhysioNet website for a description of the data set.
The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split, specify train_prop only. For a training/validation/test split, specify both train_prop and val_prop.
For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.
Missing data can be imputed using the impute argument. See the missing data tutorial for more information.
- Parameters:
split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).
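The delta argument refers to the time since the previous observation as defined in Che et al (2018). A minimal sketch of that recursion for a single channel is given below, using plain Python lists. The function name time_deltas is hypothetical, and torchtime's own implementation may differ in detail.

```python
def time_deltas(times, observed):
    """Time since the previous observation for one channel.

    times:    time stamp of each step
    observed: True where the channel was observed at that step
    """
    deltas = [0.0]  # by definition the first delta is zero
    for t in range(1, len(times)):
        gap = float(times[t] - times[t - 1])
        if observed[t - 1]:
            # The previous step was observed, so the delta restarts at the gap.
            deltas.append(gap)
        else:
            # The previous step was missing, so the gap accumulates.
            deltas.append(gap + deltas[t - 1])
    return deltas


print(time_deltas([0, 1, 2, 3, 4], [True, False, False, True, True]))
# [0.0, 1.0, 2.0, 3.0, 1.0]
```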
- X
A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the ICULOS time stamp). The channels are ordered as set out on the PhysioNet website. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).
A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 121) and the channels are in the order: time stamp, time series, missing data mask, time deltas.
Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.
- Type:
Tensor
- y
SepsisLabel at each time point. A tensor of shape (n, s, 1).
- Type:
Tensor
- length
Length of each trajectory prior to padding. A tensor of shape (n).
- Type:
Tensor
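The relationship between NaN padding and the length attribute can be sketched with nested lists in place of tensors. The helper pad_trajectories is hypothetical and only illustrates the idea: shorter trajectories are extended with NaN rows up to the longest trajectory, and the original lengths are kept so that the padding can be removed again.

```python
import math


def pad_trajectories(trajectories):
    """Pad variable-length trajectories with NaN rows; return (padded, lengths)."""
    lengths = [len(traj) for traj in trajectories]
    longest = max(lengths)
    n_channels = len(trajectories[0][0])
    padded = []
    for traj in trajectories:
        rows = [list(row) for row in traj]  # copy the observed rows
        rows += [[math.nan] * n_channels for _ in range(longest - len(traj))]
        padded.append(rows)
    return padded, lengths


padded, lengths = pad_trajectories([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0]],
])
print(lengths)                     # [3, 1]
print(padded[1][:lengths[1]])      # [[7.0, 8.0]]
```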
Note
X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.
- Returns:
A PyTorch Dataset object which can be passed to a DataLoader.
- class torchtime.data.PhysioNet2019Binary(split, train_prop, val_prop=None, impute='none', time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)
Returns a binary prediction variant of the PhysioNet Challenge 2019 data as a PyTorch Dataset.
In contrast with the full challenge, the first 72 hours of data are used to predict whether a patient develops sepsis at any point during the period of hospitalisation as in Kidger et al (2020). See the PhysioNet website for a description of the data set.
The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split, specify train_prop only. For a training/validation/test split, specify both train_prop and val_prop.
For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.
Missing data can be imputed using the impute argument. See the missing data tutorial for more information.
- Parameters:
split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).
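Forward imputation, one of the options for the impute argument, carries the last observed value forward over missing values. The sketch below shows the idea for a single channel using a plain list. It is not torchtime's implementation: forward_impute is a hypothetical name, and filling initial missing values with a supplied fallback (such as a channel mean) is an assumption made for this example.

```python
import math


def forward_impute(values, fallback):
    """Fill NaNs with the last observed value; use fallback before any observation."""
    filled = []
    last = fallback
    for v in values:
        if not math.isnan(v):
            last = v  # remember the most recent observation
        filled.append(last)
    return filled


print(forward_impute([math.nan, 1.0, math.nan, math.nan, 4.0], fallback=2.5))
# [2.5, 1.0, 1.0, 1.0, 4.0]
```

A custom function passed to impute would instead take and return a Tensor, as indicated by the Callable[[Tensor], Tensor] type.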
- X
A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels in the PhysioNet data (including the ICULOS time stamp). The channels are ordered as set out on the PhysioNet website. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).
A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the PhysioNet data. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1 = 121) and the channels are in the order: time stamp, time series, missing data mask, time deltas.
Note that PhysioNet trajectories are of unequal length and are therefore padded with NaNs to the length of the longest trajectory in the data.
- Type:
Tensor
- y
Whether patient is diagnosed with sepsis at any time during hospitalisation. A tensor of shape (n, 1).
- Type:
Tensor
- length
Length of each trajectory prior to padding. A tensor of shape (n).
- Type:
Tensor
Note
X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.
- Returns:
A PyTorch Dataset object which can be passed to a DataLoader.
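The difference between this binary variant and the full challenge data can be sketched for a single patient: the features are truncated to the first 72 hours, while the label records whether sepsis is flagged at any point of the stay. The helper below is hypothetical and assumes one row per hour; it is an illustration, not the code used by torchtime.

```python
def to_binary_example(features, sepsis_labels, max_hours=72):
    """Return (features from the first max_hours rows, 0/1 label for the whole stay)."""
    # Assumption: each row corresponds to one hour of the stay.
    truncated = features[:max_hours]
    # The label looks at the full stay, not only the truncated window.
    label = 1 if any(flag == 1 for flag in sepsis_labels) else 0
    return truncated, label


features = [[float(hour)] for hour in range(100)]
labels = [0] * 90 + [1] * 10          # sepsis flagged only after hour 72
truncated, label = to_binary_example(features, labels)
print(len(truncated), label)          # 72 1
```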
- class torchtime.data.UEA(dataset, split, train_prop, val_prop=None, missing=0.0, impute='none', categorical=[], channel_means={}, time=True, mask=False, delta=False, standardise=False, overwrite_cache=False, path='.', seed=None)
Returns a time series classification data set from the UEA/UCR repository as a PyTorch Dataset. See the UEA/UCR repository website for the data sets.
The proportions of data in the training, validation and (optional) test data sets are specified by the train_prop and val_prop arguments. For a training/validation split, specify train_prop only. For a training/validation/test split, specify both train_prop and val_prop.
For example, train_prop=0.8 generates an 80/20% train/validation split, whereas train_prop=0.8, val_prop=0.1 generates an 80/10/10% train/validation/test split. Splits are formed using stratified sampling.
When passed to a PyTorch DataLoader, batches are a named dictionary with X, y and length data. The split argument determines whether training, validation or test data are returned.
Missing data can be simulated by dropping data at random. Support is also provided to impute missing data. These options are controlled by the missing and impute arguments. See the missing data tutorial for more information.
Warning
Mean imputation is unsuitable for categorical variables. To impute missing values for a categorical variable with the channel mode (rather than the channel mean), pass the channel indices to the categorical argument. Note this is also required for forward imputation to appropriately impute initial missing values.
Alternatively, the calculated channel mean/mode can be overridden using the channel_means argument. This can be used to impute missing data with a fixed value.
- Parameters:
dataset (str) – The name of a UEA/UCR data set, chosen from the repository's list of data sets.
split (str) – The data split to return, either train, val (validation) or test.
train_prop (float) – Proportion of data in the training set.
val_prop (Optional[float]) – Proportion of data in the validation set (optional, see above).
missing (Union[float, List[float]]) – The proportion of data to drop at random. If missing is a single value, data are dropped from all channels. To drop data independently across each channel, pass a list of the proportion missing for each channel e.g. [0.5, 0.2, 0.8]. Default 0 i.e. no missing data simulation.
impute (Union[str, Callable[[Tensor], Tensor]]) – Method used to impute missing data, either none, zero, mean, forward or a custom imputation function (default “none”). See warning above.
categorical (List[int]) – List with channel indices of categorical variables. Only required if imputing data. Default [] i.e. no categorical variables.
channel_means (Dict[int, float]) – Override the calculated channel mean/mode when imputing data. Only used if imputing data. Dictionary with channel indices and values e.g. {1: 4.5, 3: 7.2} (default {} i.e. no overridden channel mean/modes).
time (bool) – Append time stamp in the first channel (default True).
mask (bool) – Append missing data mask for each channel (default False).
delta (bool) – Append time since previous observation for each channel calculated as in Che et al (2018). Default False.
standardise (bool) – Standardise the time series (default False).
overwrite_cache (bool) – Overwrite saved cache (default False).
path (str) – Location of the .torchtime cache directory (default “.”).
seed (Optional[int]) – Random seed for reproducibility (optional).
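The two forms of the missing argument can be sketched on a single trajectory stored as a list of time steps. The function drop_at_random is hypothetical, and it encodes one reading of the description above: a single value removes whole time steps (the same steps in every channel), while a list removes values independently per channel. torchtime's actual sampling may differ.

```python
import math
import random


def drop_at_random(series, missing, seed=None):
    """Return a copy of series (rows = time steps) with values replaced by NaN."""
    n_steps = len(series)
    n_channels = len(series[0])
    rng = random.Random(seed)
    out = [list(row) for row in series]  # copy so the input is left untouched

    if isinstance(missing, (int, float)):
        # Single proportion: drop the same time steps from all channels.
        for t in rng.sample(range(n_steps), round(missing * n_steps)):
            out[t] = [math.nan] * n_channels
        return out

    if len(missing) != n_channels:
        raise ValueError("need one proportion per channel")
    # One proportion per channel: drop independently in each channel.
    for ch, prop in enumerate(missing):
        for t in rng.sample(range(n_steps), round(prop * n_steps)):
            out[t][ch] = math.nan
    return out


series = [[float(t)] * 3 for t in range(10)]
dropped = drop_at_random(series, [0.5, 0.2, 0.8], seed=0)
print([sum(math.isnan(row[ch]) for row in dropped) for ch in range(3)])  # [5, 2, 8]
```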
- X
A tensor of default shape (n, s, c + 1) where n = number of trajectories, s = (longest) trajectory length and c = number of channels. By default, a time stamp is appended as the first channel. If time is False, the time stamp is omitted and the tensor has shape (n, s, c).
A missing data mask and/or time delta channels can be appended with the mask and delta arguments. These each have the same number of channels as the data set. For example, if time, mask and delta are all True, X has shape (n, s, 3 * c + 1) and the channels are in the order: time stamp, time series, missing data mask, time deltas.
Where trajectories are of unequal lengths they are padded with NaNs to the length of the longest trajectory in the data.
- Type:
Tensor
- y
One-hot encoded label data. A tensor of shape (n, l) where l is the number of classes.
- Type:
Tensor
- length
Length of each trajectory prior to padding. A tensor of shape (n).
- Type:
Tensor
Note
X, y and length are available for the training, validation and test splits by appending _train, _val and _test respectively. For example, y_val returns the labels for the validation data set. These attributes are available regardless of the split argument.
- Returns:
A PyTorch Dataset object which can be passed to a DataLoader.
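A minimal usage sketch for the UEA class, based only on the signature documented above. The wrapper uea_loaders and the chosen proportions are hypothetical, the data set name is left to the caller, and the imports are deferred so that the function can be defined without torch or torchtime installed.

```python
def uea_loaders(dataset, batch_size=32, seed=123):
    """Return (train_loader, val_loader) for a UEA/UCR data set with simulated missing data."""
    # Imported lazily so that defining the function needs neither package.
    from torch.utils.data import DataLoader
    from torchtime.data import UEA

    common = dict(
        dataset=dataset,
        train_prop=0.7,
        val_prop=0.2,     # leaves 10% for the test split
        missing=0.1,      # drop 10% of the data at random
        impute="mean",    # see the warning above for categorical channels
        seed=seed,        # same seed so both objects use the same split
    )
    train = UEA(split="train", **common)
    val = UEA(split="val", **common)
    return (
        DataLoader(train, batch_size=batch_size, shuffle=True),
        DataLoader(val, batch_size=batch_size),
    )
```

Because X_val, y_val and length_val are available regardless of the split argument, a single UEA object could also supply the validation tensors directly.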