Dataset
The dataset module provides the TimeSeriesDataset class for loading and processing time series data in a format suitable for PyTorch DataLoader.
TimeSeriesDataset
A PyTorch Dataset class that handles time series data extraction using chunk specifications. It supports single or multiple DataFrames and automatically extracts chunks based on the provided specifications.
Purpose:
TimeSeriesDataset provides a PyTorch-compatible interface for loading time series data with chunk-based extraction. It automatically handles the extraction of encoding, decoding, and label chunks from your DataFrames based on chunk specifications.
Key Features:
- PyTorch Compatible: Inherits from torch.utils.data.Dataset and works seamlessly with DataLoader.
- Multiple DataFrames: Supports both a single DataFrame and a list of DataFrames (useful for multiple time series).
- Automatic Chunk Extraction: Uses ChunkExtractor internally to extract chunks based on specifications.
- Time Index Tracking: Optionally returns time indices for each chunk.
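To make chunk-based extraction concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. The `CHUNKS` tuples and the `extract_sample` helper are hypothetical stand-ins for real chunk specifications and for ChunkExtractor; the tags and window lengths mirror the example further down this page.

```python
# Hypothetical chunk layout: (tag, start offset, end offset), with offsets
# relative to the sample's start position. These tuples stand in for real
# chunk specifications and are not part of the library's API.
CHUNKS = [
    ("encoding.targets", 0, 10),
    ("label.targets", 10, 15),
]


def extract_sample(values, start, chunks=CHUNKS, return_time_index=True):
    """Slice one sample out of `values`, beginning at position `start`."""
    sample = {}
    for tag, lo, hi in chunks:
        # Each chunk is a contiguous window of the series.
        sample[tag] = values[start + lo:start + hi]
        if return_time_index:
            # Record which time steps the window covers.
            sample[tag + ".time_index"] = list(range(start + lo, start + hi))
    return sample


series = [float(t) for t in range(100)]
sample = extract_sample(series, start=3)
print(len(sample["encoding.targets"]), len(sample["label.targets"]))
print(sample["label.targets.time_index"])
```

The sketch returns plain lists; the real dataset returns tensors keyed by chunk tag, as described under Methods.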
Initialization Parameters:
``data_frames`` (pd.DataFrame | list[pd.DataFrame]): Input data. Can be a single DataFrame or a list of DataFrames. If multiple DataFrames are provided, they are treated as separate time series.
``chunk_specs`` (list[BaseChunkSpec]): List of chunk specifications defining what data to extract. Typically obtained from model.make_chunk_specs().
``return_time_index`` (bool): If True, includes time index arrays in the output dictionary. Default is True. Time indices are useful for tracking which time steps correspond to predictions.
Properties:
``data_frames`` (list[pd.DataFrame]): List of input DataFrames (always a list, even if single DataFrame was provided).
``chunk_specs`` (list[BaseChunkSpec]): Chunk specifications used for extraction.
``return_time_index`` (bool): Whether to include time indices in outputs.
``chunk_extractors`` (list[ChunkExtractor]): Internal list of chunk extractors, one per DataFrame.
``lengths`` (list[int]): List of dataset lengths for each DataFrame.
``min_start_time_index`` (int): Minimum valid start time index for extraction.
Methods:
``__len__()``: Returns the total number of samples in the dataset. Calculated as the sum of lengths for all DataFrames.
``__getitem__(i)``: Returns the sample at index i. The index is mapped to the appropriate DataFrame and time position. Returns a dictionary with chunk tags as keys and tensors as values. If return_time_index=True, also includes time index arrays.
``_preprocess()``: Internal method called during initialization. Creates ChunkExtractor instances for each DataFrame and calculates dataset lengths. Automatically called by __init__().
``plot_chunks()``: Visualizes the chunk specifications as a horizontal bar chart. Useful for understanding the temporal structure of your model's input/output windows. Requires matplotlib.
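The index mapping mentioned for ``__getitem__(i)`` can be pictured with cumulative lengths. The following is a sketch under assumptions rather than the library's code: `locate` is a hypothetical helper, and any shift by ``min_start_time_index`` is not modeled.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(i, lengths):
    """Map a global sample index to (frame index, position within that frame)."""
    total = sum(lengths)
    if not 0 <= i < total:
        raise IndexError(f"index {i} out of range for {total} samples")
    ends = list(accumulate(lengths))   # cumulative sample counts, e.g. [86, 172]
    frame = bisect_right(ends, i)      # first frame whose cumulative end exceeds i
    offset = i - (ends[frame - 1] if frame else 0)
    return frame, offset


lengths = [86, 86]  # per-DataFrame sample counts (illustrative values)
print(locate(0, lengths))
print(locate(85, lengths))
print(locate(86, lengths))
```

A binary search over cumulative counts keeps the lookup cheap even with many DataFrames, and a sample never straddles two series because each frame has its own position range.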
Typical Usage:
1. Create a model (e.g., MLP), which generates chunk specifications via make_chunk_specs().
2. Create a TimeSeriesDataset with your data and the chunk specifications.
3. Use it with a PyTorch DataLoader for batching.
Example:
import pandas as pd
import numpy as np
from torch.utils.data import DataLoader

from deep_time_series.dataset import TimeSeriesDataset
from deep_time_series.model import MLP

# Create sample data
data = pd.DataFrame({
    'temperature': np.sin(np.arange(100)),
    'humidity': np.cos(np.arange(100)),
})

# Create model
model = MLP(
    hidden_size=64,
    encoding_length=10,
    decoding_length=5,
    target_names=['temperature'],
    nontarget_names=['humidity'],
    n_hidden_layers=2,
)

# Get chunk specifications from model
chunk_specs = model.make_chunk_specs()

# Create dataset
dataset = TimeSeriesDataset(
    data_frames=data,
    chunk_specs=chunk_specs,
    return_time_index=True,  # Include time indices in output
)

# Use with DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate over batches
for batch in dataloader:
    # batch is a dictionary with keys like:
    # - 'encoding.targets': tensor of shape (batch_size, 10, 1)
    # - 'label.targets': tensor of shape (batch_size, 5, 1)
    # - 'encoding.targets.time_index': tensor of shape (batch_size, 10)
    # - 'label.targets.time_index': tensor of shape (batch_size, 5)
    pass
Multiple DataFrames:
When working with multiple time series (e.g., different sensors or locations), pass a list of DataFrames:
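# Multiple time series. chunk_specs comes from the example above, so both
# DataFrames carry the columns that model uses.
t = np.arange(100)
data1 = pd.DataFrame({'temperature': np.sin(t), 'humidity': np.cos(t)})
data2 = pd.DataFrame({'temperature': np.cos(t), 'humidity': np.sin(t)})

dataset = TimeSeriesDataset(
    data_frames=[data1, data2],
    chunk_specs=chunk_specs,
)

# Samples from all series share one index range; each index is mapped back
# to its source DataFrame, and a sample never spans two DataFrames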
Dataset Length:
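The length of the dataset is determined by:
- The length of each DataFrame
- The chunk specifications (specifically, the minimum time index)
The formula is approximately len(df) - chunk_length + 1 for each DataFrame, where chunk_length is the number of time steps spanned by the chunk specifications (15 in the example above: 10 encoding steps plus 5 decoding steps). The total dataset length is the sum over all DataFrames.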
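As a worked instance of the approximate formula, the sketch below counts window start positions for two 100-row series. `n_samples` is a hypothetical helper, and two things are assumptions made for illustration: that chunk_length equals encoding_length + decoding_length, and that the count is clamped at zero for series shorter than one window.

```python
def n_samples(n_rows, chunk_length):
    """Number of window start positions that fit in a series of n_rows rows."""
    # Clamping at zero is an assumption for series shorter than one window.
    return max(n_rows - chunk_length + 1, 0)


encoding_length, decoding_length = 10, 5
chunk_length = encoding_length + decoding_length  # assumed total window span

# Two series of 100 rows each, as in the multiple-DataFrame example.
per_frame = [n_samples(n, chunk_length) for n in (100, 100)]
print(per_frame, sum(per_frame))
```

Under these assumptions each 100-row series yields 100 - 15 + 1 = 86 samples, so a dataset built from both would report a length of 172.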
Visualization:
You can visualize the chunk specifications using the plot_chunks() method:
dataset.plot_chunks() # Shows a horizontal bar chart of chunk windows
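When matplotlib is not available, the same window layout can be approximated in plain text. This is an illustrative sketch and not part of the library: `render_chunks` and the (tag, start, end) tuples are hypothetical, chosen to match the 10-step encoding and 5-step label windows of the example above.

```python
def render_chunks(chunks):
    """Return one text line per chunk, with '#' marking the covered time steps."""
    start = min(lo for _, lo, _ in chunks)
    end = max(hi for _, _, hi in chunks)
    label_width = max(len(tag) for tag, _, _ in chunks)
    lines = []
    for tag, lo, hi in chunks:
        # Pad before and after the bar so all rows share one time axis.
        bar = " " * (lo - start) + "#" * (hi - lo) + " " * (end - hi)
        lines.append(f"{tag:<{label_width}} |{bar}|")
    return lines


# Hypothetical windows: (tag, first time step, one past the last time step).
chunks = [
    ("encoding.targets", 0, 10),
    ("label.targets", 10, 15),
]
lines = render_chunks(chunks)
print("\n".join(lines))
```

Each row shows where a chunk sits relative to the sample's start, which is the same information the bar chart conveys.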
class TimeSeriesDataset(data_frames, chunk_specs, return_time_index=True)
Bases: Dataset
Parameters:
data_frames (DataFrame | list[pandas.core.frame.DataFrame])
chunk_specs (list[deep_time_series.chunk.BaseChunkSpec])
return_time_index (bool)