Data Input/Output#
This guide explains how to load, preprocess, and transform data in DeepTimeSeries.
Data Format#
DeepTimeSeries uses pandas DataFrame as the primary data format. Each DataFrame represents a time series where:
Rows correspond to time steps
Columns correspond to features (target variables and non-target features)
The index can be used for time information, but is not required
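For example, a minimal DataFrame in this format might look like the following (the column names and the optional datetime index are purely illustrative):
import numpy as np
import pandas as pd
# 50 daily observations; the datetime index is optional
index = pd.date_range('2024-01-01', periods=50, freq='D')
data = pd.DataFrame({
    'target': np.random.randn(50),    # variable to forecast
    'feature1': np.random.randn(50),  # non-target feature
}, index=index)
print(data.head())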
Loading Data#
Since DeepTimeSeries works with pandas DataFrames, you can load data from various sources using pandas:
import pandas as pd
# Load from CSV
data = pd.read_csv('data.csv')
# Load from Excel
data = pd.read_excel('data.xlsx')
# Load from Parquet
data = pd.read_parquet('data.parquet')
# Create from NumPy array
import numpy as np
data = pd.DataFrame({
'target': np.sin(np.arange(100)),
'feature1': np.cos(np.arange(100)),
'feature2': np.random.randn(100)
})
Multiple Time Series#
You can work with multiple time series by passing a list of DataFrames to TimeSeriesDataset:
import numpy as np
import pandas as pd
import deep_time_series as dts
# Create multiple time series
series1 = pd.DataFrame({'value': np.arange(100)})
series2 = pd.DataFrame({'value': np.arange(50)})
# Pass as a list (chunk_specs is defined as shown in the chunk specification sections below)
dataset = dts.TimeSeriesDataset(
data_frames=[series1, series2],
chunk_specs=chunk_specs
)
Data Preprocessing#
The ColumnTransformer class provides a convenient way to apply transformations to specific columns of your DataFrame.
Basic Usage#
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import deep_time_series as dts
# Create transformer
transformer = dts.ColumnTransformer(
transformer_tuples=[
(StandardScaler(), ['target', 'feature1']),
(MinMaxScaler(), ['feature2'])
]
)
# Fit and transform
data_transformed = transformer.fit_transform(data)
# Or fit and transform separately
transformer.fit(data)
data_transformed = transformer.transform(data)
Using Transformer Dictionary#
You can also use a dictionary to specify transformers:
from sklearn.preprocessing import StandardScaler
transformer = dts.ColumnTransformer(
transformer_dict={
'target': StandardScaler(),
'feature1': StandardScaler(),
'feature2': StandardScaler()
}
)
data_transformed = transformer.fit_transform(data)
Inverse Transform#
To convert transformed data back to original scale:
# Transform
data_transformed = transformer.transform(data)
# Inverse transform
data_original = transformer.inverse_transform(data_transformed)
Understanding Chunk Specifications#
Chunk specifications define how time series data is extracted and organized for model training and prediction. They are the core concept for understanding how DeepTimeSeries handles temporal data.
What are Chunk Specifications?#
A chunk specification (ChunkSpec) defines:
- What features to extract (via names)
- When to extract them (via range_)
- How to label them (via tag)
- What data type to use (via dtype)
Types of Chunk Specifications#
DeepTimeSeries provides three types of chunk specifications:
EncodingChunkSpec: Defines the input window for the encoder. Used for historical data that the model uses to understand patterns. Typically has negative range values (e.g., range_=(-10, 0)).
DecodingChunkSpec: Defines the input window for the decoder. Used in autoregressive models where the decoder needs previous predictions. Can have positive range values (e.g., range_=(0, 5)).
LabelChunkSpec: Defines the target window for prediction. Used for ground truth labels during training. Typically has positive range values (e.g., range_=(0, 5)).
Understanding the range_ Parameter#
The range_ parameter is a tuple (start, end) that defines the relative time window:
Negative values: Refer to past time steps relative to the current position
Zero: Represents the current time step
Positive values: Refer to future time steps relative to the current position
Examples:
import numpy as np
import deep_time_series as dts
# Extract 10 past time steps (t-10 to t-1, excluding current time)
encoding_spec = dts.EncodingChunkSpec(
tag='input',
names=['target', 'feature1'],
range_=(-10, 0), # From 10 steps ago to current (exclusive)
dtype=np.float32
)
# Extract 5 future time steps (t to t+4)
label_spec = dts.LabelChunkSpec(
tag='output',
names=['target'],
range_=(0, 5), # From current to 4 steps ahead
dtype=np.float32
)
# Extract overlapping window (t-5 to t+5)
window_spec = dts.EncodingChunkSpec(
tag='context',
names=['target'],
range_=(-5, 6), # 5 steps before to 5 steps after
dtype=np.float32
)
Visual Representation#
For a time series with data points at indices 0, 1, 2, …, and a chunk specification with range_=(-3, 2):
Position relative to t:  ...  -3  -2  -1  [t]  +1  +2  +3  ...
range_=(-3, 2) extracts: [t-3, t-2, t-1, t, t+1]
(5 time steps total; the end of the range is exclusive)
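The same arithmetic can be checked with plain pandas slicing. This is only an illustration of the indexing, not the library API; the DataFrame and the position t are made up for the example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.arange(20)})
t = 10                         # current position
start, end = -3, 2             # range_=(-3, 2)
window = df.iloc[t + start : t + end]
print(window.index.tolist())   # [7, 8, 9, 10, 11] -> t-3 ... t+1 (5 rows)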
Tag Naming Convention#
Tags are used to identify different chunks in the data dictionary. They follow a prefix pattern:
EncodingChunkSpec: automatically adds the 'encoding.' prefix
DecodingChunkSpec: automatically adds the 'decoding.' prefix
LabelChunkSpec: automatically adds the 'label.' prefix
Example:
spec = dts.EncodingChunkSpec(tag='my_feature', ...)
# spec.tag will be 'encoding.my_feature'
spec = dts.LabelChunkSpec(tag='target', ...)
# spec.tag will be 'label.target'
Tags must be unique within a list of chunk specifications. When you access data from TimeSeriesDataset, you’ll use these tags as dictionary keys.
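For example, assuming a TimeSeriesDataset has been built with an encoding spec tagged 'input' and a label spec tagged 'output' (as in the examples below), each sample is a dictionary keyed by the prefixed tags:
sample = dataset[0]
print(list(sample.keys()))
# ['encoding.input', 'label.output']
encoder_input = sample['encoding.input']  # array for the encoding window
labels = sample['label.output']           # array for the label window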
Choosing the Right Data Type#
The dtype parameter controls the NumPy data type used for the extracted chunks:
np.float32: Recommended for most cases (memory efficient, sufficient precision)
np.float64: Use when high precision is required (uses more memory)
np.int32: Use for integer features (e.g., counts, categories)
Example:
# For continuous values
continuous_spec = dts.EncodingChunkSpec(
tag='features',
names=['temperature', 'humidity'],
range_=(-10, 0),
dtype=np.float32 # Standard for neural networks
)
# For integer counts
count_spec = dts.EncodingChunkSpec(
tag='counts',
names=['sales_count', 'visitor_count'],
range_=(-10, 0),
dtype=np.int32
)
Creating Dataset#
After preprocessing, create a TimeSeriesDataset with appropriate chunk specifications:
import deep_time_series as dts
from deep_time_series.chunk import EncodingChunkSpec, LabelChunkSpec
import numpy as np
# Define chunk specifications
encoding_spec = dts.EncodingChunkSpec(
tag='input',
names=['target', 'feature1', 'feature2'],
range_=(-10, 0), # 10 time steps before current time
dtype=np.float32
)
label_spec = dts.LabelChunkSpec(
tag='output',
names=['target'],
range_=(0, 5), # 5 time steps ahead
dtype=np.float32
)
# Create dataset
dataset = dts.TimeSeriesDataset(
data_frames=data_transformed,
chunk_specs=[encoding_spec, label_spec],
return_time_index=True # Optional: include time index in chunks
)
How ChunkExtractor Works#
The ChunkExtractor class is responsible for extracting data chunks from a DataFrame according to chunk specifications. Understanding its behavior helps you debug and optimize your data pipeline.
Data Extraction Mechanism#
When you create a TimeSeriesDataset, it internally creates a ChunkExtractor for each DataFrame. The extractor:
Preprocesses the DataFrame: Converts specified columns to the required data type
Calculates window boundaries: Determines the minimum and maximum time indices needed
Extracts chunks: For each sample, extracts the specified time windows
Example of extraction process:
import pandas as pd
import numpy as np
from deep_time_series.chunk import ChunkExtractor, EncodingChunkSpec, LabelChunkSpec
# Create sample data
df = pd.DataFrame({
'target': np.arange(20),
'feature': np.arange(20) * 2
})
# Define specifications
encoding_spec = EncodingChunkSpec(
tag='input',
names=['target', 'feature'],
range_=(-3, 0), # Last 3 time steps
dtype=np.float32
)
label_spec = LabelChunkSpec(
tag='output',
names=['target'],
range_=(0, 2), # Next 2 time steps
dtype=np.float32
)
# Create extractor
extractor = ChunkExtractor(df, [encoding_spec, label_spec])
# Extract chunk at time index 5
chunk = extractor.extract(start_time_index=5, return_time_index=False)
# chunk contains:
# - 'encoding.input': shape (3, 2) - 3 time steps, 2 features
# - 'label.output': shape (2, 1) - 2 time steps, 1 feature
Understanding start_time_index#
The start_time_index parameter determines where in the time series to start extraction:
start_time_index = 0: Extract from the beginning of the series
start_time_index = 5: Extract starting from index 5
The extractor ensures that start_time_index + chunk_min_t >= 0.
For the example above with range_=(-3, 0) for encoding:
- At start_time_index=5, it extracts indices [2, 3, 4] (5-3 to 5-1)
- At start_time_index=0, it cannot extract because 0 + (-3) < 0
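Continuing the ChunkExtractor example above (20 rows, encoding range_=(-3, 0)), here is a short sketch of valid and invalid start positions, based on the assertion described above:
# Valid: 3 + (-3) = 0, so the encoding window is indices [0, 1, 2]
chunk = extractor.extract(start_time_index=3, return_time_index=False)
# Invalid: 2 + (-3) = -1 < 0, so this would require data before index 0
try:
    extractor.extract(start_time_index=2, return_time_index=False)
except AssertionError:
    print('start_time_index=2 needs data before the start of the series')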
Multiple Chunk Specifications#
When multiple chunk specifications are provided, the extractor:
Finds the overall window: [min(range[0]), max(range[1])]
Extracts data for the entire window
Slices out each specification's portion
Example:
# Specification 1: range_=(-5, 0)
# Specification 2: range_=(0, 3)
# Overall window: range_=(-5, 3) - extracts 8 time steps total
spec1 = EncodingChunkSpec(tag='past', names=['target'], range_=(-5, 0), dtype=np.float32)
spec2 = LabelChunkSpec(tag='future', names=['target'], range_=(0, 3), dtype=np.float32)
extractor = ChunkExtractor(df, [spec1, spec2])
# chunk_min_t = -5, chunk_max_t = 3, chunk_length = 8
chunk = extractor.extract(start_time_index=10)
# Extracts indices [5, 6, 7, 8, 9, 10, 11, 12]
# Then slices:
# - 'encoding.past': indices [5, 6, 7, 8, 9] (relative -5 to -1)
# - 'label.future': indices [10, 11, 12] (relative 0 to 2)
The return_time_index Option#
When return_time_index=True, the extractor also includes time index information:
chunk = extractor.extract(start_time_index=5, return_time_index=True)
# chunk contains:
# - 'encoding.input': the data array
# - 'encoding.input.time_index': array([2, 3, 4]) - actual DataFrame indices
# - 'label.output': the data array
# - 'label.output.time_index': array([5, 6]) - actual DataFrame indices
This is useful when you need to track which time points correspond to predictions.
TimeSeriesDataset Deep Dive#
The TimeSeriesDataset class wraps ChunkExtractor to provide a PyTorch-compatible Dataset interface.
Dataset Length Calculation#
The length of the dataset depends on:
DataFrame length: total number of time steps
Chunk length: maximum window size needed (chunk_max_t - chunk_min_t)
Minimum start index: ensures valid extractions (max(0, -chunk_min_t))
Formula: length = len(df) - chunk_length + 1 - min_start_time_index
Example:
df = pd.DataFrame({'value': np.arange(100)}) # 100 time steps
spec = EncodingChunkSpec(
tag='input',
names=['value'],
range_=(-10, 0), # chunk_length = 10
dtype=np.float32
)
dataset = TimeSeriesDataset(df, [spec])
# Length = 100 - 10 + 1 - 0 = 91 samples
# Can extract from indices 0 to 90
# At index 0: extracts [0:10]
# At index 90: extracts [90:100]
Multiple DataFrames Handling#
When multiple DataFrames are provided, the dataset:
Creates a separate ChunkExtractor for each DataFrame
Concatenates samples from all DataFrames
Uses cumulative indexing to map a dataset index to a DataFrame index
Example:
series1 = pd.DataFrame({'value': np.arange(50)}) # 50 time steps
series2 = pd.DataFrame({'value': np.arange(30)}) # 30 time steps
dataset = TimeSeriesDataset([series1, series2], chunk_specs)
# series1 contributes: 50 - chunk_length + 1 samples
# series2 contributes: 30 - chunk_length + 1 samples
# Total length = sum of contributions
# dataset[0] to dataset[N-1]: from series1
# dataset[N] to dataset[M-1]: from series2
Indexing Behavior#
The __getitem__ method:
Determines which DataFrame the index belongs to using a cumulative sum of per-DataFrame sample counts
Calculates the relative index within that DataFrame
Adjusts for min_start_time_index to get the actual start position
Calls the appropriate ChunkExtractor.extract()
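Conceptually, the mapping works like the following sketch. This is not the library's actual implementation; the per-DataFrame sample counts are made-up values for illustration:
import numpy as np
lengths = [40, 20]               # samples contributed by each DataFrame
cumulative = np.cumsum(lengths)  # [40, 60]
def locate(index):
    # Map a dataset index to (data_frame_index, relative_index)
    df_index = int(np.searchsorted(cumulative, index, side='right'))
    offset = cumulative[df_index - 1] if df_index > 0 else 0
    return df_index, index - offset
print(locate(0))   # (0, 0)  -> first sample of the first series
print(locate(45))  # (1, 5)  -> sixth sample of the second series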
Memory Efficiency#
The dataset stores:
DataFrames: the original data (can be large)
ChunkExtractors: lightweight objects that reference the DataFrames
Metadata: lengths and indices (minimal memory)
Data is extracted on-demand during __getitem__, so memory usage is efficient. However, for very large datasets, consider:
Using DataLoader with num_workers > 0 for parallel loading
Preprocessing and saving to disk in chunk format
Using data streaming for extremely large datasets
Model-Based Chunk Specifications#
Models can automatically generate chunk specifications:
from deep_time_series.model import MLP
# Create model
model = MLP(
encoding_length=10,
decoding_length=5,
target_names=['target'],
nontarget_names=['feature1', 'feature2'],
# ... other parameters
)
# Get chunk specifications from model
chunk_specs = model.make_chunk_specs()
# Create dataset
dataset = dts.TimeSeriesDataset(
data_frames=data_transformed,
chunk_specs=chunk_specs
)
The make_chunk_specs() method creates appropriate specifications based on:
- encoding_length: Creates EncodingChunkSpec with range_=(-encoding_length, 0)
- decoding_length: Creates DecodingChunkSpec and LabelChunkSpec with range_=(0, decoding_length)
- target_names and nontarget_names: Determines which features to include
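To see what a model generates, you can print the returned specifications. The tag attribute is documented above; the exact tag strings depend on the model, so the loop below simply inspects whatever comes back:
chunk_specs = model.make_chunk_specs()
for spec in chunk_specs:
    print(type(spec).__name__, spec.tag)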
Converting Model Outputs#
After model prediction, use ChunkInverter to convert tensor outputs back to DataFrame format:
Basic Usage#
import torch
from deep_time_series.chunk import ChunkInverter
# Create inverter (use the same chunk_specs as dataset)
inverter = ChunkInverter(chunk_specs)
# Model outputs are dictionaries with tag keys
# Example output from model
outputs = {
'head.target': torch.randn(32, 5, 1) # (batch, time, features)
}
# Convert to DataFrame
df_output = inverter.invert('head.target', outputs['head.target'])
# Returns DataFrame with columns ['target'] and MultiIndex (batch_index, time_index)
Converting Multiple Outputs#
# Convert entire output dictionary
outputs_dict = inverter.invert_dict(outputs)
# Returns dictionary of DataFrames
Tag Matching#
The inverter can match tags in different formats:
Full tag: 'head.target'
Core tag: 'target'
Tag with another prefix: 'label.target' → matches the core tag 'target'
Practical Examples#
Complete Workflow Example#
Full example from data loading to prediction:
import pandas as pd
import numpy as np
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from sklearn.preprocessing import StandardScaler
import deep_time_series as dts
from deep_time_series.model import MLP
from deep_time_series.chunk import ChunkInverter
# 1. Load data
data = pd.read_csv('timeseries_data.csv')
# 2. Preprocess
transformer = dts.ColumnTransformer(
transformer_tuples=[
(StandardScaler(), data.columns)
]
)
data_transformed = transformer.fit_transform(data)
# 3. Create model
model = MLP(
hidden_size=64,
encoding_length=10,
decoding_length=5,
target_names=['target'],
nontarget_names=['feature1', 'feature2'],
n_hidden_layers=2,
)
# 4. Create dataset
dataset = dts.TimeSeriesDataset(
data_frames=data_transformed,
chunk_specs=model.make_chunk_specs()
)
dataloader = DataLoader(dataset, batch_size=32)
# 5. Train model
trainer = pl.Trainer(max_epochs=10)
trainer.fit(model, train_dataloaders=dataloader)
# 6. Make predictions
model.eval()
batch = next(iter(dataloader))
with torch.no_grad():
predictions = model(batch)
# 7. Convert predictions to DataFrame
inverter = ChunkInverter(model.make_chunk_specs())
predictions_df = inverter.invert('head.target', predictions['head.target'])
# 8. Inverse transform if needed
predictions_original_scale = transformer.inverse_transform(
predictions_df.reset_index(level='time_index', drop=True)
)
Example: Multiple Time Series with Different Preprocessing#
When working with multiple time series that need different preprocessing:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
import deep_time_series as dts
# Load multiple series
series1 = pd.read_csv('series1.csv')
series2 = pd.read_csv('series2.csv')
# Different transformers for different series
transformer1 = dts.ColumnTransformer(
transformer_tuples=[(StandardScaler(), series1.columns)]
)
transformer2 = dts.ColumnTransformer(
transformer_tuples=[(RobustScaler(), series2.columns)]
)
# Transform separately
series1_transformed = transformer1.fit_transform(series1)
series2_transformed = transformer2.fit_transform(series2)
# Create model
model = dts.model.MLP(
encoding_length=20,
decoding_length=10,
target_names=['target'],
nontarget_names=['feature1', 'feature2'],
# ... other parameters
)
# Create dataset with multiple series
dataset = dts.TimeSeriesDataset(
data_frames=[series1_transformed, series2_transformed],
chunk_specs=model.make_chunk_specs()
)
Example: Custom Chunk Specifications#
Creating custom chunk specifications for advanced use cases:
import numpy as np
import deep_time_series as dts
from deep_time_series.chunk import (
EncodingChunkSpec,
DecodingChunkSpec,
LabelChunkSpec
)
# Define custom specifications
# Encoding: last 30 days
encoding_spec = EncodingChunkSpec(
tag='historical',
names=['sales', 'price', 'promotion'],
range_=(-30, 0),
dtype=np.float32
)
# Decoding: autoregressive input (previous predictions)
decoding_spec = DecodingChunkSpec(
tag='previous_pred',
names=['sales'],
range_=(-5, 0), # Last 5 predictions
dtype=np.float32
)
# Label: next 7 days
label_spec = LabelChunkSpec(
tag='forecast',
names=['sales'],
range_=(0, 7),
dtype=np.float32
)
# Create dataset
dataset = dts.TimeSeriesDataset(
data_frames=data,
chunk_specs=[encoding_spec, decoding_spec, label_spec]
)
# Access chunks
sample = dataset[0]
# sample['encoding.historical']: shape (30, 3)
# sample['decoding.previous_pred']: shape (5, 1)
# sample['label.forecast']: shape (7, 1)
Example: Working with Time Index#
Using time index information for tracking predictions:
import pandas as pd
import numpy as np
import deep_time_series as dts
# Create data with datetime index
dates = pd.date_range('2024-01-01', periods=100, freq='D')
data = pd.DataFrame({
'value': np.random.randn(100)
}, index=dates)
# Create dataset with return_time_index=True
# (chunk_specs as defined earlier, with tags 'input' and 'output')
dataset = dts.TimeSeriesDataset(
data_frames=data,
chunk_specs=chunk_specs,
return_time_index=True
)
# Get a sample
sample = dataset[10]
# Access time indices
encoding_times = sample['encoding.input.time_index']
label_times = sample['label.output.time_index']
# Convert to actual datetime if needed
encoding_dates = dates[encoding_times]
label_dates = dates[label_times]
print(f"Encoding period: {encoding_dates[0]} to {encoding_dates[-1]}")
print(f"Forecast period: {label_dates[0]} to {label_dates[-1]}")
Example: Handling Missing Data#
Preprocessing data with missing values:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import deep_time_series as dts
# Load data with missing values
data = pd.read_csv('data_with_missing.csv')
# Handle missing values before creating transformer
# Option 1: Forward fill
data_ffill = data.ffill()
# Option 2: Interpolation
data_interp = data.interpolate(method='linear')
# Option 3: Drop rows with missing values
data_clean = data.dropna()
# Then apply transformer
transformer = dts.ColumnTransformer(
transformer_tuples=[(StandardScaler(), data_clean.columns)]
)
data_transformed = transformer.fit_transform(data_clean)
# Create dataset
dataset = dts.TimeSeriesDataset(
data_frames=data_transformed,
chunk_specs=chunk_specs
)
Example: Visualizing Chunk Structure#
Visualizing how chunks are extracted:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import deep_time_series as dts
# Create sample data
data = pd.DataFrame({
'value': np.sin(np.arange(100) * 0.1)
})
# Create specifications
encoding_spec = dts.EncodingChunkSpec(
tag='input',
names=['value'],
range_=(-10, 0),
dtype=np.float32
)
label_spec = dts.LabelChunkSpec(
tag='output',
names=['value'],
range_=(0, 5),
dtype=np.float32
)
# Create dataset
dataset = dts.TimeSeriesDataset(
data_frames=data,
chunk_specs=[encoding_spec, label_spec]
)
# Visualize chunk structure
dataset.plot_chunks() # Shows the chunk ranges
# Get a sample and visualize
sample = dataset[50]
plt.figure(figsize=(12, 4))
plt.plot(data['value'], label='Full series', alpha=0.3)
# Plot encoding window
encoding_data = sample['encoding.input']
encoding_start = 50 - 10
plt.plot(range(encoding_start, encoding_start + 10),
encoding_data.flatten(), 'o-', label='Encoding window')
# Plot label window
label_data = sample['label.output']
plt.plot(range(50, 50 + 5),
label_data.flatten(), 's-', label='Label window')
plt.legend()
plt.title('Chunk Extraction Visualization')
plt.show()
Troubleshooting Common Issues#
This section covers common errors and how to resolve them.
Error: “range[0] >= range[1]”#
Problem: You get a ValueError about invalid range.
Cause: The range_ tuple has start >= end.
Solution: Ensure range_[0] < range_[1]:
# Wrong
spec = EncodingChunkSpec(tag='input', range_=(5, 3), ...) # Error!
# Correct
spec = EncodingChunkSpec(tag='input', range_=(-5, 3), ...)
Error: AssertionError in ChunkExtractor.extract()#
Problem: AssertionError when extracting chunks: start_time_index + chunk_min_t >= 0.
Cause: Trying to extract a chunk that requires data before index 0.
Solution: Ensure your data is long enough or adjust the range:
# If you have 100 time steps and range_=(-20, 0)
# You can only extract from index 20 onwards
# Option 1: Use longer data
# Option 2: Reduce the encoding window
spec = EncodingChunkSpec(tag='input', range_=(-10, 0), ...) # Smaller window
# Option 3: Pad your data at the beginning
# (Not recommended, consider using range_ that doesn't go negative)
Error: Column not found in DataFrame#
Problem: KeyError when creating chunk specifications.
Cause: Column names in names parameter don’t match DataFrame columns.
Solution: Verify column names match exactly:
# Check available columns
print(data.columns.tolist())
# Ensure names match exactly (case-sensitive)
spec = EncodingChunkSpec(
tag='input',
names=['target', 'feature1'], # Must match DataFrame columns exactly
...
)
Error: Dataset length is 0 or too small#
Problem: Dataset has fewer samples than expected.
Cause: Data is too short for the specified chunk ranges.
Solution: Check data length and chunk requirements:
# Check data length
print(f"Data length: {len(data)}")
# Check chunk length requirement
chunk_specs = model.make_chunk_specs()
min_required = max(-spec.range[0] for spec in chunk_specs)
print(f"Minimum data length needed: {min_required}")
# Ensure: len(data) >= min_required + decoding_length
Error: Shape mismatch in model forward pass#
Problem: Model receives data with unexpected shape.
Cause: Chunk specifications don’t match model expectations.
Solution: Use model’s make_chunk_specs() method:
# Always use model's chunk specs
chunk_specs = model.make_chunk_specs()
dataset = TimeSeriesDataset(data, chunk_specs=chunk_specs)
# Don't manually create specs unless you know what you're doing
Error: ChunkInverter cannot find tag#
Problem: ValueError: tag not found in chunk_specs when using ChunkInverter.
Cause: Tag doesn’t match any chunk specification.
Solution: Use correct tag format:
# Check available tags
tags = [spec.tag for spec in chunk_specs]
print(f"Available tags: {tags}")
# Use full tag or core tag
inverter.invert('head.target', tensor) # Full tag
inverter.invert('target', tensor) # Core tag (also works)
Performance Issues#
Problem: Data loading is slow.
Solutions:
Use DataLoader with multiple workers:
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,    # Parallel data loading
    pin_memory=True,  # Faster GPU transfer
)
Preprocess data once and save:
# Preprocess and save
data_transformed = transformer.fit_transform(data)
data_transformed.to_parquet('preprocessed_data.parquet')
# Load preprocessed data later
data_transformed = pd.read_parquet('preprocessed_data.parquet')
Reduce chunk window size if possible (smaller windows = faster extraction)
Data Type Issues#
Problem: Unexpected data types or precision loss.
Solution: Explicitly set dtype in chunk specifications:
# For neural networks, use float32 (standard)
spec = EncodingChunkSpec(
tag='input',
names=['value'],
range_=(-10, 0),
dtype=np.float32 # Explicitly set
)
# For high precision requirements
spec = EncodingChunkSpec(
tag='input',
names=['value'],
range_=(-10, 0),
dtype=np.float64 # Higher precision, more memory
)
Tips and Best Practices#
Data Validation:
- Ensure your DataFrame has no missing values before creating the dataset
- Check that data types are appropriate (numeric for most models)
- Verify column names match chunk specification names
Index Handling:
- The dataset uses integer indices internally
- If your DataFrame has a datetime index, store it separately or reset it
- Use return_time_index=True if you need to track time points
Memory Efficiency:
- For large datasets, use DataLoader with an appropriate num_workers
- Consider preprocessing and saving to disk
- Use float32 instead of float64 when possible
Preprocessing (see the sketch after this list):
- Always fit transformers on training data only
- Transform both training and validation/test data with the same transformer
- Save transformers for inference time
Chunk Specifications:
- Use the model's make_chunk_specs() when possible
- Ensure chunk ranges don't exceed your data length
- Consider the minimum data length requirement: len(data) >= max(-range[0]) + max(range[1])
Debugging:
- Use dataset.plot_chunks() to visualize the chunk structure
- Inspect a single sample: sample = dataset[0]
- Check shapes: print({k: v.shape for k, v in sample.items()})
- Verify time indices if using return_time_index=True
Multiple Time Series:
- Ensure all DataFrames have the same column structure
- Apply consistent preprocessing to all series
- Consider series length differences when interpreting results
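As a short sketch of the preprocessing advice above (the 80/20 split and the joblib file name are illustrative, and data is assumed to be a DataFrame with columns 'target', 'feature1', and 'feature2'):
import joblib
from sklearn.preprocessing import StandardScaler
import deep_time_series as dts
# Split first, then fit the transformer on the training portion only
split = int(len(data) * 0.8)
train_data, val_data = data.iloc[:split], data.iloc[split:]
transformer = dts.ColumnTransformer(
    transformer_tuples=[(StandardScaler(), ['target', 'feature1', 'feature2'])]
)
train_transformed = transformer.fit_transform(train_data)
val_transformed = transformer.transform(val_data)
# Save the fitted transformer so the same scaling can be applied at inference time
joblib.dump(transformer, 'column_transformer.joblib')
transformer = joblib.load('column_transformer.joblib')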