Data Input/Output
=================

This guide explains how to load, preprocess, and transform data in
DeepTimeSeries.

Data Format
-----------

DeepTimeSeries uses the pandas DataFrame as its primary data format. Each
DataFrame represents a time series where:

- Rows correspond to time steps
- Columns correspond to features (target variables and non-target features)
- The index can be used for time information, but is not required

Loading Data
------------

Since DeepTimeSeries works with pandas DataFrames, you can load data from
various sources using pandas:

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Load from CSV
    data = pd.read_csv('data.csv')

    # Load from Excel
    data = pd.read_excel('data.xlsx')

    # Load from Parquet
    data = pd.read_parquet('data.parquet')

    # Create from NumPy arrays
    data = pd.DataFrame({
        'target': np.sin(np.arange(100)),
        'feature1': np.cos(np.arange(100)),
        'feature2': np.random.randn(100)
    })

Multiple Time Series
--------------------

You can work with multiple time series by passing a list of DataFrames to
``TimeSeriesDataset``:

.. code-block:: python

    import numpy as np
    import pandas as pd

    import deep_time_series as dts

    # Create multiple time series.
    series1 = pd.DataFrame({'value': np.arange(100)})
    series2 = pd.DataFrame({'value': np.arange(50)})

    # Pass them as a list. Here `chunk_specs` is a list of chunk
    # specifications, explained later in this guide.
    dataset = dts.TimeSeriesDataset(
        data_frames=[series1, series2],
        chunk_specs=chunk_specs
    )

Data Preprocessing
------------------

The ``ColumnTransformer`` class provides a convenient way to apply
transformations to specific columns of your DataFrame.

Basic Usage
~~~~~~~~~~~

.. code-block:: python

    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    import deep_time_series as dts

    # Create the transformer.
    transformer = dts.ColumnTransformer(
        transformer_tuples=[
            (StandardScaler(), ['target', 'feature1']),
            (MinMaxScaler(), ['feature2'])
        ]
    )

    # Fit and transform in one step.
    data_transformed = transformer.fit_transform(data)

    # Or fit and transform separately.
    transformer.fit(data)
    data_transformed = transformer.transform(data)

Using a Transformer Dictionary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also use a dictionary to specify transformers:

.. code-block:: python

    from sklearn.preprocessing import StandardScaler

    transformer = dts.ColumnTransformer(
        transformer_dict={
            'target': StandardScaler(),
            'feature1': StandardScaler(),
            'feature2': StandardScaler()
        }
    )

    data_transformed = transformer.fit_transform(data)

Inverse Transform
~~~~~~~~~~~~~~~~~

To convert transformed data back to the original scale:

.. code-block:: python

    # Transform
    data_transformed = transformer.transform(data)

    # Inverse transform
    data_original = transformer.inverse_transform(data_transformed)
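A quick round trip can confirm that your transformers are invertible for
your columns. This is a minimal sketch, assuming ``data`` and
``transformer`` are defined as above and that all column transformers
(such as ``StandardScaler``) support ``inverse_transform``:

.. code-block:: python

    import numpy as np

    # Transform, invert, and compare against the original values.
    recovered = transformer.inverse_transform(transformer.transform(data))
    assert np.allclose(data.to_numpy(), recovered.to_numpy(), atol=1e-6)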
Understanding Chunk Specifications
----------------------------------

Chunk specifications define how time series data is extracted and organized
for model training and prediction. They are the core concept for
understanding how DeepTimeSeries handles temporal data.

What are Chunk Specifications?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A chunk specification (``ChunkSpec``) defines:

- **What features** to extract (via ``names``)
- **When to extract** them (via ``range_``)
- **How to label** them (via ``tag``)
- **What data type** to use (via ``dtype``)

Types of Chunk Specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DeepTimeSeries provides three types of chunk specifications:

1. **EncodingChunkSpec**: Defines the input window for the encoder

   - Used for historical data that the model uses to understand patterns
   - Typically has negative range values (e.g., ``range_=(-10, 0)``)

2. **DecodingChunkSpec**: Defines the input window for the decoder

   - Used in autoregressive models where the decoder needs previous
     predictions
   - Can have positive range values (e.g., ``range_=(0, 5)``)

3. **LabelChunkSpec**: Defines the target window for prediction

   - Used for ground-truth labels during training
   - Typically has positive range values (e.g., ``range_=(0, 5)``)

Understanding the ``range_`` Parameter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``range_`` parameter is a tuple ``(start, end)`` that defines the
relative time window:

- **Negative values**: Refer to past time steps relative to the current
  position
- **Zero**: Represents the current time step
- **Positive values**: Refer to future time steps relative to the current
  position

Examples:

.. code-block:: python

    import numpy as np

    import deep_time_series as dts

    # Extract the 10 past time steps (t-10 to t-1, excluding the
    # current time step).
    encoding_spec = dts.EncodingChunkSpec(
        tag='input',
        names=['target', 'feature1'],
        range_=(-10, 0),  # From 10 steps ago to the current step (exclusive)
        dtype=np.float32
    )

    # Extract 5 future time steps (t to t+4).
    label_spec = dts.LabelChunkSpec(
        tag='output',
        names=['target'],
        range_=(0, 5),  # From the current step to 4 steps ahead
        dtype=np.float32
    )

    # Extract an overlapping window (t-5 to t+5).
    window_spec = dts.EncodingChunkSpec(
        tag='context',
        names=['target'],
        range_=(-5, 6),  # 5 steps before to 5 steps after
        dtype=np.float32
    )

Visual Representation
~~~~~~~~~~~~~~~~~~~~~

For a time series with data points at indices 0, 1, 2, ..., and a chunk
specification with ``range_=(-3, 2)``::

    Time indices:   ...  -3  -2  -1   0   1   2   3  ...
    Relative to t:  ...  -3  -2  -1  [t]  1   2   3  ...

    range_=(-3, 2) extracts: [t-3, t-2, t-1, t, t+1]  (5 time steps total)

Tag Naming Convention
~~~~~~~~~~~~~~~~~~~~~

Tags are used to identify different chunks in the data dictionary. They
follow a prefix pattern:

- ``EncodingChunkSpec``: Automatically adds the ``'encoding.'`` prefix
- ``DecodingChunkSpec``: Automatically adds the ``'decoding.'`` prefix
- ``LabelChunkSpec``: Automatically adds the ``'label.'`` prefix

Example:

.. code-block:: python

    spec = dts.EncodingChunkSpec(tag='my_feature', ...)
    # spec.tag will be 'encoding.my_feature'

    spec = dts.LabelChunkSpec(tag='target', ...)
    # spec.tag will be 'label.target'

Tags must be unique within a list of chunk specifications. When you access
data from ``TimeSeriesDataset``, you'll use these tags as dictionary keys.

Choosing the Right Data Type
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``dtype`` parameter controls the NumPy data type used for the extracted
chunks:

- **np.float32**: Recommended for most cases (memory efficient, sufficient
  precision)
- **np.float64**: Use when high precision is required (uses more memory)
- **np.int32**: Use for integer features (e.g., counts, categories)

Example:

.. code-block:: python

    # For continuous values
    continuous_spec = dts.EncodingChunkSpec(
        tag='features',
        names=['temperature', 'humidity'],
        range_=(-10, 0),
        dtype=np.float32  # Standard for neural networks
    )

    # For integer counts
    count_spec = dts.EncodingChunkSpec(
        tag='counts',
        names=['sales_count', 'visitor_count'],
        range_=(-10, 0),
        dtype=np.int32
    )
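If you are unsure whether ``float32`` is enough, comparing the memory
footprint of the two float types directly can help. A minimal sketch using
plain NumPy (the array size is illustrative):

.. code-block:: python

    import numpy as np

    # One year of hourly data with 10 features.
    shape = (365 * 24, 10)
    print(np.zeros(shape, dtype=np.float32).nbytes)  # 350,400 bytes
    print(np.zeros(shape, dtype=np.float64).nbytes)  # 700,800 bytes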
Creating a Dataset
------------------

After preprocessing, create a ``TimeSeriesDataset`` with appropriate chunk
specifications:

.. code-block:: python

    import numpy as np

    import deep_time_series as dts

    # Define chunk specifications.
    encoding_spec = dts.EncodingChunkSpec(
        tag='input',
        names=['target', 'feature1', 'feature2'],
        range_=(-10, 0),  # 10 time steps before the current time
        dtype=np.float32
    )

    label_spec = dts.LabelChunkSpec(
        tag='output',
        names=['target'],
        range_=(0, 5),  # 5 time steps ahead
        dtype=np.float32
    )

    # Create the dataset.
    dataset = dts.TimeSeriesDataset(
        data_frames=data_transformed,
        chunk_specs=[encoding_spec, label_spec],
        return_time_index=True  # Optional: include time indices in chunks
    )

How ChunkExtractor Works
------------------------

The ``ChunkExtractor`` class is responsible for extracting data chunks from
a DataFrame according to chunk specifications. Understanding its behavior
helps you debug and optimize your data pipeline.

Data Extraction Mechanism
~~~~~~~~~~~~~~~~~~~~~~~~~

When you create a ``TimeSeriesDataset``, it internally creates a
``ChunkExtractor`` for each DataFrame. The extractor:

1. **Preprocesses the DataFrame**: Converts specified columns to the
   required data type
2. **Calculates window boundaries**: Determines the minimum and maximum
   time indices needed
3. **Extracts chunks**: For each sample, extracts the specified time windows

Example of the extraction process:

.. code-block:: python

    import numpy as np
    import pandas as pd

    from deep_time_series.chunk import (
        ChunkExtractor,
        EncodingChunkSpec,
        LabelChunkSpec,
    )

    # Create sample data.
    df = pd.DataFrame({
        'target': np.arange(20),
        'feature': np.arange(20) * 2
    })

    # Define specifications.
    encoding_spec = EncodingChunkSpec(
        tag='input',
        names=['target', 'feature'],
        range_=(-3, 0),  # Last 3 time steps
        dtype=np.float32
    )

    label_spec = LabelChunkSpec(
        tag='output',
        names=['target'],
        range_=(0, 2),  # Next 2 time steps
        dtype=np.float32
    )

    # Create the extractor.
    extractor = ChunkExtractor(df, [encoding_spec, label_spec])

    # Extract a chunk at time index 5.
    chunk = extractor.extract(start_time_index=5, return_time_index=False)

    # chunk contains:
    # - 'encoding.input': shape (3, 2) - 3 time steps, 2 features
    # - 'label.output': shape (2, 1) - 2 time steps, 1 feature

Understanding ``start_time_index``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``start_time_index`` parameter determines where in the time series
extraction starts:

- **start_time_index = 0**: Extract from the beginning of the series
- **start_time_index = 5**: Extract starting from index 5
- The extractor requires that ``start_time_index + chunk_min_t >= 0``
  (see the sketch below)

For the example above, with ``range_=(-3, 0)`` for encoding:

- At ``start_time_index=5``, it extracts indices [2, 3, 4] (5-3 to 5-1)
- At ``start_time_index=0``, it cannot extract, because 0 + (-3) < 0
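Before looping over a series, it can help to compute the valid range of
``start_time_index`` values up front. The following is a minimal sketch
built from the constraint above; ``valid_start_range`` is a hypothetical
helper, not part of the library:

.. code-block:: python

    def valid_start_range(df_length, chunk_min_t, chunk_max_t):
        """Return the (inclusive, exclusive) range of valid start indices."""
        first = max(0, -chunk_min_t)        # cannot reach before row 0
        last = df_length - chunk_max_t + 1  # window must fit in the frame
        return first, last

    # For the example above: range_=(-3, 0) and range_=(0, 2)
    # give chunk_min_t = -3 and chunk_max_t = 2.
    print(valid_start_range(20, -3, 2))  # (3, 19)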
Multiple Chunk Specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When multiple chunk specifications are provided, the extractor:

1. Finds the overall window: ``[min(range[0]), max(range[1])]``
2. Extracts data for the entire window
3. Then slices out each specification's portion

Example:

.. code-block:: python

    # Specification 1: range_=(-5, 0)
    # Specification 2: range_=(0, 3)
    # Overall window: range_=(-5, 3) - extracts 8 time steps total

    spec1 = EncodingChunkSpec(
        tag='past', names=['target'], range_=(-5, 0), dtype=np.float32)
    spec2 = LabelChunkSpec(
        tag='future', names=['target'], range_=(0, 3), dtype=np.float32)

    extractor = ChunkExtractor(df, [spec1, spec2])
    # chunk_min_t = -5, chunk_max_t = 3, chunk_length = 8

    chunk = extractor.extract(start_time_index=10)
    # Extracts indices [5, 6, 7, 8, 9, 10, 11, 12]
    # Then slices:
    # - 'encoding.past': indices [5, 6, 7, 8, 9] (relative -5 to -1)
    # - 'label.future': indices [10, 11, 12] (relative 0 to 2)

The ``return_time_index`` Option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``return_time_index=True``, the extractor also includes time index
information:

.. code-block:: python

    chunk = extractor.extract(start_time_index=5, return_time_index=True)

    # chunk contains:
    # - 'encoding.input': the data array
    # - 'encoding.input.time_index': array([2, 3, 4]) - actual DataFrame indices
    # - 'label.output': the data array
    # - 'label.output.time_index': array([5, 6]) - actual DataFrame indices

This is useful when you need to track which time points correspond to which
predictions.

TimeSeriesDataset Deep Dive
---------------------------

The ``TimeSeriesDataset`` class wraps ``ChunkExtractor`` to provide a
PyTorch-compatible ``Dataset`` interface.

Dataset Length Calculation
~~~~~~~~~~~~~~~~~~~~~~~~~~

The length of the dataset depends on:

1. **DataFrame length**: The total number of time steps
2. **Chunk length**: The maximum window size needed
   (``chunk_max_t - chunk_min_t``)
3. **Minimum start index**: Ensures valid extractions
   (``max(0, -chunk_min_t)``)

Formula: ``length = len(df) - chunk_length + 1 - min_start_time_index``

Example:

.. code-block:: python

    df = pd.DataFrame({'value': np.arange(100)})  # 100 time steps

    spec = EncodingChunkSpec(
        tag='input',
        names=['value'],
        range_=(-10, 0),  # chunk_length = 10
        dtype=np.float32
    )

    dataset = TimeSeriesDataset(df, [spec])
    # Length = 100 - 10 + 1 - 0 = 91 samples
    # Can extract at indices 0 to 90
    # At index 0: extracts rows [0:10]
    # At index 90: extracts rows [90:100]

Multiple DataFrames Handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When multiple DataFrames are provided, the dataset:

1. Creates a separate ``ChunkExtractor`` for each DataFrame
2. Concatenates the samples from all DataFrames
3. Uses cumulative indexing to map a dataset index to a DataFrame index

Example:

.. code-block:: python

    series1 = pd.DataFrame({'value': np.arange(50)})  # 50 time steps
    series2 = pd.DataFrame({'value': np.arange(30)})  # 30 time steps

    dataset = TimeSeriesDataset([series1, series2], chunk_specs)

    # series1 contributes: 50 - chunk_length + 1 samples
    # series2 contributes: 30 - chunk_length + 1 samples
    # Total length = sum of contributions

    # dataset[0] to dataset[N-1]: from series1
    # dataset[N] to dataset[M-1]: from series2

Indexing Behavior
~~~~~~~~~~~~~~~~~

The ``__getitem__`` method:

1. Determines which DataFrame the index belongs to, using a cumulative sum
   (sketched below)
2. Calculates the relative index within that DataFrame
3. Adjusts for ``min_start_time_index`` to get the actual start position
4. Calls the appropriate ``ChunkExtractor.extract()``
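The following is a minimal sketch of steps 1-2. It is a hypothetical
stand-alone helper, not the library's actual code, and it assumes the
per-DataFrame sample counts computed as above (here with
``chunk_length = 10``, so the two series contribute 41 and 21 samples):

.. code-block:: python

    import numpy as np

    def locate(index, samples_per_frame):
        """Map a flat dataset index to (frame index, index within frame)."""
        boundaries = np.cumsum(samples_per_frame)  # e.g. [41, 62]
        frame = int(np.searchsorted(boundaries, index, side='right'))
        offset = index - (boundaries[frame - 1] if frame > 0 else 0)
        return frame, offset

    print(locate(0, [41, 21]))   # (0, 0)  - first sample of series1
    print(locate(40, [41, 21]))  # (0, 40) - last sample of series1
    print(locate(41, [41, 21]))  # (1, 0)  - first sample of series2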
Memory Efficiency
~~~~~~~~~~~~~~~~~

The dataset stores:

- **DataFrames**: The original data (can be large)
- **ChunkExtractors**: Lightweight objects that reference the DataFrames
- **Metadata**: Lengths and indices (minimal memory)

Data is extracted on demand during ``__getitem__``, so memory usage is
efficient. However, for very large datasets, consider:

- Using ``DataLoader`` with ``num_workers > 0`` for parallel loading
- Preprocessing and saving to disk in chunk format
- Using data streaming for extremely large datasets

Model-Based Chunk Specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Models can automatically generate chunk specifications:

.. code-block:: python

    from deep_time_series.model import MLP

    # Create the model.
    model = MLP(
        encoding_length=10,
        decoding_length=5,
        target_names=['target'],
        nontarget_names=['feature1', 'feature2'],
        # ... other parameters
    )

    # Get chunk specifications from the model.
    chunk_specs = model.make_chunk_specs()

    # Create the dataset.
    dataset = dts.TimeSeriesDataset(
        data_frames=data_transformed,
        chunk_specs=chunk_specs
    )

The ``make_chunk_specs()`` method creates appropriate specifications based
on:

- ``encoding_length``: Creates an ``EncodingChunkSpec`` with
  ``range_=(-encoding_length, 0)``
- ``decoding_length``: Creates a ``DecodingChunkSpec`` and a
  ``LabelChunkSpec`` with ``range_=(0, decoding_length)``
- ``target_names`` and ``nontarget_names``: Determine which features to
  include

Converting Model Outputs
------------------------

After model prediction, use ``ChunkInverter`` to convert tensor outputs
back to DataFrame format.

Basic Usage
~~~~~~~~~~~

.. code-block:: python

    import torch

    from deep_time_series.chunk import ChunkInverter

    # Create the inverter (use the same chunk_specs as the dataset).
    inverter = ChunkInverter(chunk_specs)

    # Model outputs are dictionaries with tag keys.
    # Example output from a model:
    outputs = {
        'head.target': torch.randn(32, 5, 1)  # (batch, time, features)
    }

    # Convert to a DataFrame.
    df_output = inverter.invert('head.target', outputs['head.target'])
    # Returns a DataFrame with columns ['target'] and a
    # MultiIndex (batch_index, time_index).

Converting Multiple Outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # Convert the entire output dictionary.
    outputs_dict = inverter.invert_dict(outputs)
    # Returns a dictionary of DataFrames.

Tag Matching
~~~~~~~~~~~~

The inverter can match tags in different formats:

- Full tag: ``'head.target'``
- Core tag: ``'target'``
- Tag with a different prefix: ``'label.target'`` → matches ``'target'``

Practical Examples
------------------

Complete Workflow Example
~~~~~~~~~~~~~~~~~~~~~~~~~

A full example from data loading to prediction:

.. code-block:: python

    import pandas as pd
    import pytorch_lightning as pl
    import torch
    from sklearn.preprocessing import StandardScaler
    from torch.utils.data import DataLoader

    import deep_time_series as dts
    from deep_time_series.chunk import ChunkInverter
    from deep_time_series.model import MLP

    # 1. Load data.
    data = pd.read_csv('timeseries_data.csv')

    # 2. Preprocess.
    transformer = dts.ColumnTransformer(
        transformer_tuples=[
            (StandardScaler(), data.columns)
        ]
    )
    data_transformed = transformer.fit_transform(data)

    # 3. Create the model.
    model = MLP(
        hidden_size=64,
        encoding_length=10,
        decoding_length=5,
        target_names=['target'],
        nontarget_names=['feature1', 'feature2'],
        n_hidden_layers=2,
    )

    # 4. Create the dataset.
    dataset = dts.TimeSeriesDataset(
        data_frames=data_transformed,
        chunk_specs=model.make_chunk_specs()
    )
    dataloader = DataLoader(dataset, batch_size=32)

    # 5. Train the model.
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_dataloaders=dataloader)

    # 6. Make predictions.
    model.eval()
    batch = next(iter(dataloader))
    with torch.no_grad():
        predictions = model(batch)
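
    # Optional sanity check (an illustrative addition, not part of the
    # original example): model outputs come back as a dictionary keyed by
    # head tag, so printing the shapes here can catch chunk/model
    # mismatches early.
    for name, tensor in predictions.items():
        print(name, tuple(tensor.shape))  # e.g. head.target (32, 5, 1)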
    # 7. Convert predictions to a DataFrame.
    inverter = ChunkInverter(model.make_chunk_specs())
    predictions_df = inverter.invert('head.target', predictions['head.target'])

    # 8. Inverse transform if needed.
    predictions_original_scale = transformer.inverse_transform(
        predictions_df.reset_index(level='time_index', drop=True)
    )

Example: Multiple Time Series with Different Preprocessing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When working with multiple time series that need different preprocessing:

.. code-block:: python

    import pandas as pd
    from sklearn.preprocessing import RobustScaler, StandardScaler

    import deep_time_series as dts

    # Load multiple series.
    series1 = pd.read_csv('series1.csv')
    series2 = pd.read_csv('series2.csv')

    # Different transformers for different series.
    transformer1 = dts.ColumnTransformer(
        transformer_tuples=[(StandardScaler(), series1.columns)]
    )
    transformer2 = dts.ColumnTransformer(
        transformer_tuples=[(RobustScaler(), series2.columns)]
    )

    # Transform separately.
    series1_transformed = transformer1.fit_transform(series1)
    series2_transformed = transformer2.fit_transform(series2)

    # Create the model.
    model = dts.model.MLP(
        encoding_length=20,
        decoding_length=10,
        target_names=['target'],
        nontarget_names=['feature1', 'feature2'],
        # ... other parameters
    )

    # Create a dataset with multiple series.
    dataset = dts.TimeSeriesDataset(
        data_frames=[series1_transformed, series2_transformed],
        chunk_specs=model.make_chunk_specs()
    )

Example: Custom Chunk Specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Creating custom chunk specifications for advanced use cases:

.. code-block:: python

    import numpy as np

    import deep_time_series as dts
    from deep_time_series.chunk import (
        EncodingChunkSpec,
        DecodingChunkSpec,
        LabelChunkSpec,
    )

    # Define custom specifications.

    # Encoding: the last 30 days.
    encoding_spec = EncodingChunkSpec(
        tag='historical',
        names=['sales', 'price', 'promotion'],
        range_=(-30, 0),
        dtype=np.float32
    )

    # Decoding: autoregressive input (previous predictions).
    decoding_spec = DecodingChunkSpec(
        tag='previous_pred',
        names=['sales'],
        range_=(-5, 0),  # The last 5 predictions
        dtype=np.float32
    )

    # Label: the next 7 days.
    label_spec = LabelChunkSpec(
        tag='forecast',
        names=['sales'],
        range_=(0, 7),
        dtype=np.float32
    )

    # Create the dataset ('data' is your preprocessed DataFrame).
    dataset = dts.TimeSeriesDataset(
        data_frames=data,
        chunk_specs=[encoding_spec, decoding_spec, label_spec]
    )

    # Access chunks.
    sample = dataset[0]
    # sample['encoding.historical']: shape (30, 3)
    # sample['decoding.previous_pred']: shape (5, 1)
    # sample['label.forecast']: shape (7, 1)
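When these samples are batched with a standard PyTorch ``DataLoader``, each
tag gains a leading batch dimension. A minimal sketch, assuming the
``dataset`` defined above and PyTorch's default collation:

.. code-block:: python

    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=16)
    batch = next(iter(loader))

    # Each chunk is collated along a new batch axis:
    # batch['encoding.historical']: shape (16, 30, 3)
    # batch['decoding.previous_pred']: shape (16, 5, 1)
    # batch['label.forecast']: shape (16, 7, 1)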
Example: Working with a Time Index
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using time index information to track predictions:

.. code-block:: python

    import numpy as np
    import pandas as pd

    import deep_time_series as dts

    # Create data with a datetime index.
    dates = pd.date_range('2024-01-01', periods=100, freq='D')
    data = pd.DataFrame({
        'value': np.random.randn(100)
    }, index=dates)

    # Create a dataset with return_time_index=True.
    dataset = dts.TimeSeriesDataset(
        data_frames=data,
        chunk_specs=chunk_specs,
        return_time_index=True
    )

    # Get a sample.
    sample = dataset[10]

    # Access the time indices.
    encoding_times = sample['encoding.input.time_index']
    label_times = sample['label.output.time_index']

    # Convert to actual datetimes if needed.
    encoding_dates = dates[encoding_times]
    label_dates = dates[label_times]

    print(f"Encoding period: {encoding_dates[0]} to {encoding_dates[-1]}")
    print(f"Forecast period: {label_dates[0]} to {label_dates[-1]}")

Example: Handling Missing Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Preprocessing data with missing values:

.. code-block:: python

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    import deep_time_series as dts

    # Load data with missing values.
    data = pd.read_csv('data_with_missing.csv')

    # Handle missing values before creating the transformer.
    # Option 1: Forward fill
    data_ffill = data.ffill()

    # Option 2: Interpolation
    data_interp = data.interpolate(method='linear')

    # Option 3: Drop rows with missing values
    data_clean = data.dropna()

    # Then apply the transformer.
    transformer = dts.ColumnTransformer(
        transformer_tuples=[(StandardScaler(), data_clean.columns)]
    )
    data_transformed = transformer.fit_transform(data_clean)

    # Create the dataset.
    dataset = dts.TimeSeriesDataset(
        data_frames=data_transformed,
        chunk_specs=chunk_specs
    )

Example: Visualizing Chunk Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Visualizing how chunks are extracted:

.. code-block:: python

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    import deep_time_series as dts

    # Create sample data.
    data = pd.DataFrame({
        'value': np.sin(np.arange(100) * 0.1)
    })

    # Create the specifications.
    encoding_spec = dts.EncodingChunkSpec(
        tag='input',
        names=['value'],
        range_=(-10, 0),
        dtype=np.float32
    )

    label_spec = dts.LabelChunkSpec(
        tag='output',
        names=['value'],
        range_=(0, 5),
        dtype=np.float32
    )

    # Create the dataset.
    dataset = dts.TimeSeriesDataset(
        data_frames=data,
        chunk_specs=[encoding_spec, label_spec]
    )

    # Visualize the chunk structure.
    dataset.plot_chunks()  # Shows the chunk ranges

    # Get a sample and visualize it.
    sample = dataset[50]

    plt.figure(figsize=(12, 4))
    plt.plot(data['value'], label='Full series', alpha=0.3)

    # Plot the encoding window.
    encoding_data = sample['encoding.input']
    encoding_start = 50 - 10
    plt.plot(range(encoding_start, encoding_start + 10),
             encoding_data.flatten(), 'o-', label='Encoding window')

    # Plot the label window.
    label_data = sample['label.output']
    plt.plot(range(50, 50 + 5), label_data.flatten(), 's-',
             label='Label window')

    plt.legend()
    plt.title('Chunk Extraction Visualization')
    plt.show()

Troubleshooting Common Issues
-----------------------------

This section covers common errors and how to resolve them.

Error: "Tags are duplicated"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: You get a ``ValueError`` saying that tags are duplicated.

**Cause**: Multiple chunk specifications have the same tag.

**Solution**: Ensure that each chunk specification has a unique tag:

.. code-block:: python

    # Wrong: the same tag used twice
    spec1 = EncodingChunkSpec(tag='input', ...)
    spec2 = EncodingChunkSpec(tag='input', ...)  # Error!

    # Correct: different tags
    spec1 = EncodingChunkSpec(tag='features', ...)
    spec2 = EncodingChunkSpec(tag='targets', ...)

Error: "range[0] >= range[1]"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: You get a ``ValueError`` about an invalid range.

**Cause**: The ``range_`` tuple has ``start >= end``.

**Solution**: Ensure that ``range_[0] < range_[1]``:

.. code-block:: python

    # Wrong
    spec = EncodingChunkSpec(tag='input', range_=(5, 3), ...)  # Error!

    # Correct
    spec = EncodingChunkSpec(tag='input', range_=(-5, 3), ...)

Error: AssertionError in ChunkExtractor.extract()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: An ``AssertionError`` when extracting chunks:
``start_time_index + chunk_min_t >= 0``.

**Cause**: Trying to extract a chunk that requires data before index 0.

**Solution**: Ensure that your data is long enough, or adjust the range:

.. code-block:: python

    # If you have 100 time steps and range_=(-20, 0),
    # you can only extract from index 20 onwards.

    # Option 1: Use longer data
    # Option 2: Reduce the encoding window
    spec = EncodingChunkSpec(tag='input', range_=(-10, 0), ...)  # Smaller window
    # Option 3: Pad your data at the beginning
    # (Not recommended; consider using a range_ that doesn't go negative)

Error: Column not found in DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: A ``KeyError`` when creating chunk specifications.

**Cause**: Column names in the ``names`` parameter don't match the
DataFrame columns.

**Solution**: Verify that the column names match exactly:

.. code-block:: python

    # Check the available columns.
    print(data.columns.tolist())

    # Ensure that names match exactly (case-sensitive).
    spec = EncodingChunkSpec(
        tag='input',
        names=['target', 'feature1'],  # Must match DataFrame columns exactly
        ...
    )

Error: Dataset length is 0 or too small
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: The dataset has fewer samples than expected.

**Cause**: The data is too short for the specified chunk ranges.

**Solution**: Check the data length against the chunk requirements:

.. code-block:: python

    # Check the data length.
    print(f"Data length: {len(data)}")

    # Check the chunk length requirement.
    chunk_specs = model.make_chunk_specs()
    min_required = max(-spec.range[0] for spec in chunk_specs)
    print(f"Minimum data length needed: {min_required}")

    # Ensure: len(data) >= min_required + decoding_length

Error: Shape mismatch in the model forward pass
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: The model receives data with an unexpected shape.

**Cause**: The chunk specifications don't match the model's expectations.

**Solution**: Use the model's ``make_chunk_specs()`` method:

.. code-block:: python

    # Always use the model's chunk specs.
    chunk_specs = model.make_chunk_specs()
    dataset = TimeSeriesDataset(data, chunk_specs=chunk_specs)

    # Don't create specs manually unless you know what you're doing.

Error: ChunkInverter cannot find a tag
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: A ``ValueError`` ("tag not found in chunk_specs") when using
``ChunkInverter``.

**Cause**: The tag doesn't match any chunk specification.

**Solution**: Use the correct tag format:

.. code-block:: python

    # Check the available tags.
    tags = [spec.tag for spec in chunk_specs]
    print(f"Available tags: {tags}")

    # Use the full tag or the core tag.
    inverter.invert('head.target', tensor)  # Full tag
    inverter.invert('target', tensor)       # Core tag (also works)
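Many of the errors above can be caught early by validating your
specifications against the DataFrame before building the dataset. The
following is a minimal sketch; ``check_specs`` is a hypothetical helper
that takes plain ``(tag, names, range_)`` tuples mirroring the chunk spec
arguments used throughout this guide:

.. code-block:: python

    def check_specs(df, specs):
        """specs: a list of (tag, names, range_) tuples."""
        tags = [tag for tag, _, _ in specs]
        assert len(tags) == len(set(tags)), 'Tags are duplicated'

        for tag, names, (start, end) in specs:
            assert start < end, f'{tag}: range_[0] must be < range_[1]'
            missing = set(names) - set(df.columns)
            assert not missing, f'{tag}: columns not in DataFrame: {missing}'

        # The overall window must fit inside the DataFrame.
        chunk_min_t = min(start for _, _, (start, _) in specs)
        chunk_max_t = max(end for _, _, (_, end) in specs)
        assert len(df) >= chunk_max_t - chunk_min_t, 'Data is too short'

    check_specs(data, [
        ('input', ['target', 'feature1'], (-10, 0)),
        ('output', ['target'], (0, 5)),
    ])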
Performance Issues
~~~~~~~~~~~~~~~~~~

**Problem**: Data loading is slow.

**Solutions**:

1. **Use DataLoader with multiple workers**:

   .. code-block:: python

       dataloader = DataLoader(
           dataset,
           batch_size=32,
           num_workers=4,    # Parallel data loading
           pin_memory=True   # Faster GPU transfer
       )

2. **Preprocess the data once and save it**:

   .. code-block:: python

       # Preprocess and save.
       data_transformed = transformer.fit_transform(data)
       data_transformed.to_parquet('preprocessed_data.parquet')

       # Load the preprocessed data later.
       data_transformed = pd.read_parquet('preprocessed_data.parquet')

3. **Reduce the chunk window size** if possible (smaller windows mean
   faster extraction)

Data Type Issues
~~~~~~~~~~~~~~~~

**Problem**: Unexpected data types or precision loss.

**Solution**: Explicitly set the dtype in the chunk specifications:

.. code-block:: python

    # For neural networks, use float32 (the standard).
    spec = EncodingChunkSpec(
        tag='input',
        names=['value'],
        range_=(-10, 0),
        dtype=np.float32  # Explicitly set
    )

    # For high-precision requirements:
    spec = EncodingChunkSpec(
        tag='input',
        names=['value'],
        range_=(-10, 0),
        dtype=np.float64  # Higher precision, more memory
    )

Tips and Best Practices
-----------------------

1. **Data validation**:

   - Ensure that your DataFrame has no missing values before creating the
     dataset
   - Check that the data types are appropriate (numeric for most models)
   - Verify that the column names match the chunk specification names

2. **Index handling**:

   - The dataset uses integer indices internally
   - If your DataFrame has a datetime index, store it separately or reset it
   - Use ``return_time_index=True`` if you need to track time points

3. **Memory efficiency**:

   - For large datasets, use ``DataLoader`` with an appropriate
     ``num_workers``
   - Consider preprocessing and saving to disk
   - Use ``float32`` instead of ``float64`` when possible

4. **Preprocessing** (see the sketch after this list):

   - Always fit transformers on the training data only
   - Transform the training and validation/test data with the same fitted
     transformer
   - Save the transformers for inference time

5. **Chunk specifications**:

   - Use the model's ``make_chunk_specs()`` when possible
   - Ensure that the chunk ranges don't exceed your data length
   - Consider the minimum data length requirement:
     ``len(data) >= max(-range[0]) + max(range[1])``

6. **Debugging**:

   - Use ``dataset.plot_chunks()`` to visualize the chunk structure
   - Inspect a single sample: ``sample = dataset[0]``
   - Check the shapes: ``print({k: v.shape for k, v in sample.items()})``
   - Verify the time indices if using ``return_time_index=True``

7. **Multiple time series**:

   - Ensure that all DataFrames have the same column structure
   - Apply consistent preprocessing to all series
   - Consider series length differences when interpreting results
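To illustrate tip 4, here is a minimal sketch of fitting a transformer on
the training split only; the chronological split point is arbitrary, and
``data`` is assumed to be a DataFrame like the ones above:

.. code-block:: python

    from sklearn.preprocessing import StandardScaler

    import deep_time_series as dts

    # Split chronologically: scaling statistics must come from the past only.
    train, test = data.iloc[:800], data.iloc[800:]

    transformer = dts.ColumnTransformer(
        transformer_tuples=[(StandardScaler(), train.columns)]
    )

    # Fit on the training split; reuse the fitted transformer everywhere else.
    train_transformed = transformer.fit_transform(train)
    test_transformed = transformer.transform(test)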