
Preprocessing

The preprocessing can be used independently as a standalone feature. To start it, click the green Check scenario button in the scenarios view, found under Check and run in the navigation bar. The procedure includes comprehensive data checks, data enrichment, availability simulations for components, data aggregations, and the creation of reference and preprocessed datasets.

Check

The data check includes a series of quality gates, duplicate detections, and consistency checks. If issues are found, the frontend displays a list that may require user review to ensure that only valid data is loaded into the simulation.

The data check is automatically performed when initiating a simulation run or a dedicated check. It validates model input data across several quality dimensions, including:

  • Configuration consistency
  • Unique component naming
  • Input file completeness and format compliance
  • Verification of data type, validity, completeness, necessity, and consistency
  • Consistency, solvability, and suitability of optimization problems

A successful data check ensures that subsequent simulation runs proceed smoothly. If errors are found, the frontend displays a detailed issue list, providing targeted solution proposals. Passing the data check is required to generate reference and preprocessed datasets and to start simulation runs.
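One of the quality dimensions listed above is unique component naming. The following minimal sketch shows how such a duplicate-name gate might work; the `components` structure and field names are illustrative, not the tool's actual data model.

```python
from collections import Counter

def find_duplicate_names(components):
    """Return component names that appear more than once.

    `components` is assumed to be a list of dicts with a 'name' key;
    the real data check covers many more quality dimensions.
    """
    counts = Counter(c["name"] for c in components)
    return sorted(name for name, n in counts.items() if n > 1)

plants = [{"name": "plant_a"}, {"name": "plant_b"}, {"name": "plant_a"}]
print(find_duplicate_names(plants))  # ['plant_a']
```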

Enrichment

During input data enrichment, any empty sub-datasets are filled with assumptions based on look-ups, depending on the available data (best guess). This ensures the highest possible simulation accuracy with the given parameters. The enrichment also facilitates the use of data from various sources with differing coverage, such as varying resolutions.

The enriched dataset can be looked up in the reference dataset.
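A best-guess enrichment of this kind can be pictured as filling empty fields from a lookup table. The sketch below is hypothetical: the field names and the technology-level default efficiencies are assumptions, not values from the tool.

```python
# Hypothetical sketch: fill a missing plant efficiency from a
# technology-level lookup table (a "best guess" enrichment).
tech_defaults = {"ccgt": 0.58, "ocgt": 0.38}  # assumed lookup values

plants = [
    {"name": "p1", "technology": "ccgt", "efficiency": 0.60},
    {"name": "p2", "technology": "ocgt", "efficiency": None},
]

for plant in plants:
    if plant["efficiency"] is None:
        # fall back to the technology default when the field is empty
        plant["efficiency"] = tech_defaults[plant["technology"]]

print(plants[1]["efficiency"])  # 0.38
```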

Availabilities

Like all components of the electricity utility system, power plants, consumers, storages, and grid elements are subject to temporary unavailability due to unforeseen outages or planned revisions. During these periods, a component’s ability to generate, store, transfer, or consume electricity is reduced or unavailable, affecting unit commitment, exchanges, and prices.

To ensure realistic results in electricity wholesale market simulations, these unavailability events must be considered, as they directly impact the dispatch of a component. The primary causes of unavailability are:

  • Revisions: Scheduled maintenance activities, typically planned years in advance, which follow seasonal patterns and occur mainly in summer in Europe. These are relatively predictable and have standard durations.
  • Outages: Unplanned disruptions caused by operational or technical failures. These can occur unpredictably at any time and may vary significantly in duration.

The overall availability of a component is determined by the combination of revisions and outages. Since long-term simulations may lack precise revision dates, random unavailability events are generated during preprocessing. Timestamps and durations for both revisions and outages are determined using a random process: the start time is based on a uniform distribution, and the duration is drawn from a normal distribution, as illustrated in the following figure.

Figure 0: Revision and outage event simulation (partly known as double outage drawing)
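The drawing described above can be sketched as follows: a uniformly distributed start time and a normally distributed duration per event. Function and parameter names are illustrative, not the tool's actual API, and the duration is truncated at one hour for robustness.

```python
import random

def draw_event(horizon_hours, mean_duration, std_duration, rng):
    """Draw one unavailability event: start ~ uniform over the horizon,
    duration ~ normal (truncated at 1 hour), clipped to the horizon end."""
    start = rng.uniform(0, horizon_hours)
    duration = max(1.0, rng.gauss(mean_duration, std_duration))
    return start, min(start + duration, horizon_hours)

rng = random.Random(42)
start, end = draw_event(horizon_hours=8760, mean_duration=120,
                        std_duration=24, rng=rng)
```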

Relative generation and consumption potentials (as percentages per time range) and event time span distributions (in hours per event) are specified in input files labeled availability. The determination of revisions and outages is carried out during preprocessing in two steps. First, revisions are generated, followed by the determination of outages. Throughout this process, must-runs, revisions, and outages in the input data are preserved. The final event drawings are treated as partial non-availability events to ensure the target availability is met precisely.

To generate revision and outage events, a pseudo-random seed derived from the availability cluster is used. Each cluster is defined by component type, technology, time range, and availability type (revision or outage). This method ensures that revision and outage events can be reproduced with the same input data. Reproducibility is maintained as long as the component lists, drawing parameters, and configurations remain unchanged. While the drawing process is stochastic and event collisions may occur, the specified availability for each cluster is guaranteed. Alternatively, a Mersenne Twister 19937 method can be applied, though this approach is not reproducible. To use this method, disable the parameter preprocessing_availability_drawing_reproducible in the project configuration.
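A cluster-derived seed of this kind can be sketched as hashing the cluster key into a seed, so the same inputs always reproduce the same draw sequence. The function name and key layout below are assumptions for illustration.

```python
import hashlib
import random

def cluster_rng(component_type, technology, time_range, availability_type):
    """Derive a deterministic RNG from the availability cluster key,
    so drawings reproduce whenever the inputs are unchanged."""
    key = f"{component_type}|{technology}|{time_range}|{availability_type}"
    # hash the key and take 8 bytes as the integer seed
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return random.Random(seed)

rng_a = cluster_rng("plant", "nuclear", "2025", "revision")
rng_b = cluster_rng("plant", "nuclear", "2025", "revision")
assert rng_a.random() == rng_b.random()  # same cluster -> same sequence
```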

Aggregation

During preprocessing, built-in aggregation methods can optionally be applied to the following components and use cases:

Battery aggregation

Battery aggregation consolidates clustered batteries within each cluster into a single representative battery. The parameter battery_aggregation_opt in 90_grid_bidding_zones.csv defines the aggregation level for batteries within a bidding zone, allowing for flexible configurations based on the following values:

Value  Description
0      None (default)
1      Batteries with the same technology
2      All batteries

Setting battery_aggregation_opt to 2, for example, results in a single aggregated battery storage in the associated bidding zone.

During aggregation, associated input data such as must-runs, outages, revisions, and states of charge are merged, ensuring that the storage capacity, charging and discharging capacities, and metrics such as power-weighted average efficiencies and work costs remain as consistent as possible.
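The consistency goal for merged metrics can be sketched as follows: capacities sum, while efficiency is power-weighted. The field names are assumed for illustration and do not reflect the tool's schema.

```python
def aggregate_batteries(batteries):
    """Merge batteries into one representative unit: power and energy
    capacities sum; efficiency is the power-weighted average."""
    total_power = sum(b["power_mw"] for b in batteries)
    return {
        "power_mw": total_power,
        "capacity_mwh": sum(b["capacity_mwh"] for b in batteries),
        "efficiency": sum(b["efficiency"] * b["power_mw"]
                          for b in batteries) / total_power,
    }

agg = aggregate_batteries([
    {"power_mw": 10, "capacity_mwh": 20, "efficiency": 0.90},
    {"power_mw": 30, "capacity_mwh": 60, "efficiency": 0.86},
])
print(round(agg["efficiency"], 4))  # 0.87
```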

Hydro aggregation

The hydro power plant aggregation begins by identifying clusters, or sub-networks, of components based on the network topology. Each sub-network is then aggregated, and the resulting replacement components are integrated into the input data model. This iterative process ensures that each aggregation step builds on the previous one, capturing the topology’s details in a reproducible and deterministic way without randomization. Once aggregation is complete, the replacement components are saved to the preprocessed dataset.

The parameter hydro_aggregation_opt in 90_grid_bidding_zones.csv sets the aggregation level for all hydro power plants in a given bidding zone. The aggregation levels for hydro-connected networks are as follows:

Value  Description
0      None (default)
1      Parallel hydro paths
2      Serial hydro paths
3      Parallel hydro paths with same lower reservoirs
4      Parallel hydro paths with same upper reservoirs
5      Serial and parallel hydro paths

For higher levels, hydro-separated networks are merged as follows:

Value  Description
6      Hydro networks without inflows
7      Hydro networks with inflows
8      All hydro networks

Setting hydro_aggregation_opt to 8, for example, results in a single aggregated pumped-storage unit, including inflows. The following figure illustrates an example of hydro network aggregation for each level.

Figure 0: Exemplary identification and aggregation of a hydro power network

Hydro aggregations also merge linked input data, such as reservoir inflows, reservoir filling levels, and plant must-runs, outages, and revisions. They preserve, as far as possible, equal total power, equal generation and consumption potentials, and the same power-weighted average efficiencies and work costs.
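The first step of the hydro aggregation, identifying sub-networks from the topology, amounts to finding connected components of the reservoir/plant graph. The sketch below assumes the topology is given as simple node-pair edges; this is an illustration, not the tool's internal representation.

```python
def hydro_subnetworks(edges):
    """Identify hydro sub-networks as connected components of the
    reservoir/plant topology. `edges` is a list of (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, components = set(), []
    for node in adjacency:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:  # iterative depth-first search
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adjacency[n] - comp)
        seen |= comp
        components.append(comp)
    return components

nets = hydro_subnetworks([("upper1", "plant1"), ("plant1", "lower1"),
                          ("upper2", "plant2")])
print(len(nets))  # 2
```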

Thermal aggregation

The thermal power plant aggregation process begins by identifying clusters, or component groups, based on shared plant characteristics. Each group is then aggregated and replaced in the input data model with a single, aggregated component representing the group. This iterative process captures detailed information from the original model at each level, ensuring that aggregation builds consistently and systematically on the previous level. The aggregation is deterministic and reproducible, avoiding randomization by relying solely on specific characteristics. Finally, the aggregated replacement components are written out in the preprocessed dataset.

The thermal_aggregation_opt parameter in 90_grid_bidding_zones.csv defines the level of aggregation for thermal power plants within a bidding zone. Available options for this parameter are:

Value  Description
0      None (default)
1      Thermal power plants with the same combination of technology and fuel type
2      Thermal power plants with the same fuel type
3      All thermal power plants

For example, setting thermal_aggregation_opt to 3 aggregates all thermal plants into a single replacement component in the associated bidding zone.

Thermal aggregations also consolidate associated input data such as fuel costs, efficiencies, must-runs, outages, revisions, fuel restrictions, as well as fuel and emission limits, while maintaining power sums, generation capacities, and average generation costs as much as possible.
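The grouping step behind the aggregation levels can be sketched as a key function that coarsens with the level: by (technology, fuel), by fuel, or one group for all plants. The field names are illustrative.

```python
from collections import defaultdict

def thermal_clusters(plants, level):
    """Group thermal plants per aggregation level:
    1 -> (technology, fuel type), 2 -> fuel type, 3 -> all plants."""
    def key(p):
        if level == 1:
            return (p["technology"], p["fuel"])
        if level == 2:
            return p["fuel"]
        return "all"
    groups = defaultdict(list)
    for p in plants:
        groups[key(p)].append(p)
    return dict(groups)

plants = [
    {"name": "a", "technology": "steam", "fuel": "coal"},
    {"name": "b", "technology": "ccgt", "fuel": "gas"},
    {"name": "c", "technology": "ocgt", "fuel": "gas"},
]
print(len(thermal_clusters(plants, level=2)))  # 2
```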

DSR aggregation

The DSR aggregation merges clustered demand-side response (DSR) consumers into a single replacement DSR consumer within each cluster.

The dsr_aggregation_opt parameter in 90_grid_bidding_zones.csv controls the aggregation level for DSR consumers in each bidding zone, with the following options:

Value  Description
0      None (default)
1      DSR consumers with identical technology
2      All DSR consumers

For example, setting dsr_aggregation_opt to 2 aggregates all DSR consumers into a single replacement component in the associated bidding zone.

DSR aggregations also integrate associated input data, such as must-runs, outages, revisions and states of charge, while ensuring the same total power, generation, and consumption capacities, as well as equivalent power-weighted average efficiencies and operational costs as much as possible.

Grid aggregation

Grid aggregation applies only to Flow-Based Market Coupling (FBMC) capacities. Unlike generator, consumer, and storage aggregations, it is driven directly by the numerical precision of parameters rather than by component characteristics.

Grid aggregation rounds Power Transfer Distribution Factors (PTDFs) to a specified precision. It then clusters Critical Network Elements and Contingencies (CNECs) with identical rounded PTDF values, assigning each cluster a Remaining Available Margin (RAM) value (the average by default). Each CNEC cluster is then replaced by one CNEC with the rounded PTDF values and the assigned RAM, and the replacement components are recorded in the preprocessed dataset.

The grid_aggregation_opt parameter in 90_grid_bidding_zones.csv specifies the grid aggregation level, with these options:

Value  Description
0      None (default)
1      PTDFs rounded to 4 decimals
2      PTDFs rounded to 3 decimals
3      PTDFs rounded to 2 decimals
4      PTDFs rounded to 1 decimal

Setting the grid aggregation level to 4 results in the greatest reduction of CNECs; the PTDFs of all replacement components are then rounded to one decimal place.

Grid aggregations also merge linked Advanced Hybrid Couplings (AHCs) accordingly.
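The rounding-and-clustering step can be sketched as follows: CNECs whose rounded PTDF vectors coincide are merged, and each cluster receives the average RAM (the documented default). The data layout is illustrative.

```python
from collections import defaultdict

def aggregate_cnecs(cnecs, decimals):
    """Round PTDF vectors and merge CNECs with identical rounded
    vectors, assigning each cluster the average RAM."""
    clusters = defaultdict(list)
    for cnec in cnecs:
        key = tuple(round(v, decimals) for v in cnec["ptdf"])
        clusters[key].append(cnec["ram"])
    return [{"ptdf": list(key), "ram": sum(rams) / len(rams)}
            for key, rams in clusters.items()]

cnecs = [
    {"ptdf": [0.123, -0.051], "ram": 100.0},
    {"ptdf": [0.118, -0.062], "ram": 120.0},  # same vector after rounding
    {"ptdf": [0.410, 0.200], "ram": 80.0},
]
agg = aggregate_cnecs(cnecs, decimals=1)
print(len(agg))  # 2
```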

Reference

During the check, the application generates a reference dataset containing the loaded and enriched input data. This dataset is saved in the subfolder output/reference of each check folder and is formatted identically to the input files, enabling direct comparison and reimport if needed. The reference dataset represents the data model immediately after input loading.

Preprocessed

During the check, the application generates an output of the preprocessed input dataset. This dataset reflects applied aggregations, availability simulations, and other optional processing and simplifications based on the project configuration. It is saved in the subfolder output/preprocessed of each check folder, formatted identically to the input files to allow for direct comparison and reimport if needed. This dataset serves as the data model for subsequent optimization processes.