Benchmark time series data sets for PyTorch
PyTorch data sets for supervised time series classification and prediction problems, including:
All UEA/UCR classification repository data sets
PhysioNet Challenge 2012 (in-hospital mortality)
PhysioNet Challenge 2019 (sepsis prediction)
A binary prediction variant of the 2019 PhysioNet Challenge
Why use torchtime?
Saves time. You don’t have to write your own PyTorch data classes.
Better research. Use common, reproducible implementations of data sets for a level playing field when evaluating models.
Installation
Install PyTorch followed by torchtime:
$ pip install torchtime
or
$ conda install torchtime -c conda-forge
There is currently no Windows build for conda. Feedback is welcome from conda users in particular.
Getting started
Data classes have a common API. The split argument determines whether training ("train"), validation ("val") or test ("test") data are returned. The sizes of the splits are controlled with the train_prop and (optional) val_prop arguments.
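As a rough illustration of how the two proportions could translate into split sizes, here is a minimal sketch. The split_sizes helper is hypothetical and not part of torchtime. It assumes that omitting val_prop gives a two-way training/validation split, that supplying it leaves the remainder as test data, and that counts are rounded to the nearest integer. The package's own rounding may differ.

```python
def split_sizes(n_series, train_prop, val_prop=None):
    """Return (train, val, test) counts for n_series trajectories.

    Hypothetical helper for illustration only. Assumes counts are rounded
    to the nearest integer and the last split takes whatever is left.
    """
    if not 0 < train_prop < 1:
        raise ValueError("train_prop must be between 0 and 1")
    n_train = round(n_series * train_prop)
    if val_prop is None:
        # Two-way split: everything not used for training is validation data
        return n_train, n_series - n_train, 0
    if val_prop <= 0 or train_prop + val_prop >= 1:
        raise ValueError("val_prop must be positive and train_prop + val_prop below 1")
    n_val = round(n_series * val_prop)
    # Three-way split: the remainder is held out as test data
    return n_train, n_val, n_series - n_train - n_val


two_way = split_sizes(200, train_prop=0.7)
three_way = split_sizes(200, train_prop=0.7, val_prop=0.2)
print(two_way, three_way)
```

Under these assumptions, 200 series with train_prop=0.7 alone give 140 training and 60 validation series; adding val_prop=0.2 gives 140, 40 and 20 for training, validation and test.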
PhysioNet data sets
Three PhysioNet data sets are currently supported:
torchtime.data.PhysioNet2012 returns the 2012 challenge (in-hospital mortality) [link].
torchtime.data.PhysioNet2019 returns the 2019 challenge (sepsis prediction) [link].
torchtime.data.PhysioNet2019Binary returns a binary prediction variant of the 2019 challenge.
For example, to load training data for the 2012 challenge with a 70/30% training/validation split and create a DataLoader for model training:
from torch.utils.data import DataLoader
from torchtime.data import PhysioNet2012
physionet2012 = PhysioNet2012(
    split="train",
    train_prop=0.7,
)
dataloader = DataLoader(physionet2012, batch_size=32)
UEA/UCR repository data sets
The torchtime.data.UEA class returns the UEA/UCR repository data set specified by the dataset argument, for example:
from torch.utils.data import DataLoader
from torchtime.data import UEA
arrowhead = UEA(
    dataset="ArrowHead",
    split="train",
    train_prop=0.7,
)
dataloader = DataLoader(arrowhead, batch_size=32)
Using the DataLoader
Batches are dictionaries of tensors X, y and length:
X are the time series data. The package follows the batch first convention, therefore X has shape (n, s, c) where n is batch size, s is (longest) trajectory length and c is the number of channels. By default, the first channel is a time stamp.
y are one-hot encoded labels of shape (n, l) where l is the number of classes.
length holds the length of each trajectory (before padding if sequences are of irregular length), i.e. a tensor of shape (n).
For example, ArrowHead is a univariate time series, therefore X has two channels, the time stamp followed by the time series (c = 2). Each series has 251 observations (s = 251) and there are three classes (l = 3). For a batch size of 32:
next_batch = next(iter(dataloader))
next_batch["X"].shape # torch.Size([32, 251, 2])
next_batch["y"].shape # torch.Size([32, 3])
next_batch["length"].shape # torch.Size([32])
See Using DataLoaders for more information.
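The length tensor is useful when trajectories are padded to a common length. The sketch below does not call torchtime: it builds a small synthetic batch with the X, y and length layout described above, then derives a padding mask and class indices from it. The sizes and values are made up for illustration, and how padded steps are filled in real batches is not assumed here, which is why torch.where is used rather than multiplication.

```python
import torch

# Synthetic batch with the layout described above (not real torchtime output):
# n = 4 trajectories padded to s = 6 steps, c = 2 channels, l = 3 classes.
n, s, c, l = 4, 6, 2, 3
batch = {
    "X": torch.randn(n, s, c),
    "y": torch.eye(l)[torch.tensor([0, 2, 1, 2])],  # one-hot labels, shape (n, l)
    "length": torch.tensor([6, 4, 5, 2]),           # steps before padding, shape (n)
}

# Boolean mask of shape (n, s): True for observed steps, False for padding
steps = torch.arange(s).unsqueeze(0)                # shape (1, s)
mask = steps < batch["length"].unsqueeze(1)         # shape (n, s)

# Per-trajectory channel means over observed steps only.
# torch.where also works if padded steps hold NaN rather than zeros.
observed = torch.where(mask.unsqueeze(-1), batch["X"], torch.zeros_like(batch["X"]))
channel_means = observed.sum(dim=1) / batch["length"].unsqueeze(1)  # shape (n, c)

# One-hot labels converted to class indices, e.g. for torch.nn.CrossEntropyLoss
targets = batch["y"].argmax(dim=1)                  # shape (n)
```

The same mask can be applied to per-step model outputs so that padded steps do not contribute to a loss.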
Advanced options
Missing data can be imputed by setting impute to mean (replace with training data channel means) or forward (replace with previous observation). Alternatively a custom imputation function can be passed to the impute argument.
A time stamp (added by default), missing data mask and the time since previous observation can be appended with the boolean arguments time, mask and delta respectively.
Time series data are standardised using the standardise boolean argument.
The location of cached data can be changed with the path argument, for example to share a single cache location across projects.
For reproducibility, an optional random seed can be specified.
Missing data can be simulated using the missing argument to drop data at random from UEA/UCR data sets.
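To make the imputation, mask and delta options more concrete, the following sketch computes them by hand for a single made-up channel. It is not torchtime's implementation. The delta rule follows one common convention (as in Che et al., 2018) and is an assumption about what "time since previous observation" means here. The mean used for imputation is the mean of this one series, standing in for the training data channel mean.

```python
import torch

nan = float("nan")
# One hypothetical channel observed at times 0..5; NaN marks missing values
times = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x = torch.tensor([1.0, nan, nan, 4.0, nan, 6.0])

# Missing data mask: 1 where a value was observed, 0 where it is missing
mask = (~torch.isnan(x)).float()

# Time since the previous observation (assumed convention): the gap since the
# last step, plus the accumulated gap if the last step was itself missing
delta = torch.zeros_like(times)
for t in range(1, len(times)):
    gap = times[t] - times[t - 1]
    delta[t] = gap if mask[t - 1] == 1 else gap + delta[t - 1]

# Forward imputation: replace each missing value with the previous observation
forward = x.clone()
for t in range(1, len(forward)):
    if torch.isnan(forward[t]):
        forward[t] = forward[t - 1]

# Mean imputation: replace missing values with the mean of observed values
observed_mean = x[~torch.isnan(x)].mean()
mean_imputed = torch.where(torch.isnan(x), observed_mean, x)

print(mask, delta, forward, mean_imputed, sep="\n")
```

Forward imputation cannot fill a missing value at the very first step because there is no earlier observation to carry forward; this sketch leaves such values as NaN.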
Other resources
If you’re looking for the TensorFlow equivalent for PhysioNet data sets try medical_ts_datasets.
Acknowledgements
torchtime uses some of the data processing ideas in Kidger et al, 2020 [1] and Che et al, 2018 [2].
This work is supported by the Engineering and Physical Sciences Research Council, Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University (grant number EP/L015358/1).
Citing torchtime
If you use this software, please cite the paper:
@software{darke_torchtime_2022,
  author = {Darke, Philip and Missier, Paolo and Bacardit, Jaume},
  title = "Benchmark time series data sets for {PyTorch} - the torchtime package",
  month = jul,
  year = 2022,
  publisher = {arXiv},
  doi = {10.48550/arXiv.2207.12503},
  url = {https://doi.org/10.48550/arXiv.2207.12503},
}
DOIs are also available for each version of the package here.
References
[1] Kidger, P, Morrill, J, Foster, J, et al. Neural Controlled Differential Equations for Irregular Time Series. arXiv 2005.08926 (2020). [arXiv]
[2] Che, Z, Purushotham, S, Cho, K, et al. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci Rep 8, 6085 (2018). [doi]
[3] Silva, I, Moody, G, Scott, DJ, et al. Predicting In-Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012. Comput Cardiol 39: 245-248 (2012). [hdl]
[4] Reyna, M, Josef, C, Jeter, R, et al. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge. Critical Care Medicine 48 (2): 210-217 (2019). [doi]
[5] Reyna, M, Josef, C, Jeter, R, et al. Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019 (version 1.0.0). PhysioNet (2019). [doi]
[6] Goldberger, A, Amaral, L, Glass, L, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101 (23): e215-e220 (2000). [doi]
[7] Löning, M, Bagnall, A, Ganesh, S, et al. sktime: A Unified Interface for Machine Learning with Time Series. Workshop on Systems for ML at NeurIPS 2019 (2019). [doi]
[8] Löning, M, Bagnall, A, Middlehurst, M, et al. alan-turing-institute/sktime: v0.10.1 (v0.10.1). Zenodo (2022). [doi]
License
Released under the MIT license.