Are robust estimates of type I Earth system predictability possible using the CESM2 multi-year prediction system (CESM2-MP)? How effective are time series in global and regional climate change predictions?

Here, we present a new seasonal-to-multiyear Earth prediction system, CESM2-MP, based on the Community Earth System Model version 2 (CESM2). A 20-member ensemble that assimilates oceanic temperature and salinity anomalies provides the initial conditions for 5year predictions from 1960 to 2020. We analyze skills using pairwise ensemble statistics, calculated among individual ensemble members (IM), and compare the results with those obtained from the more commonly used ensemble mean (EM) approach. This comparison is motivated by the fact that an EM of a nonlinear dynamical system generates, unlike reality, a heavily smoothed trajectory, akin to the evolution of a slow manifold. However, for most autonomous nonlinear systems, the EM does not even represent a solution of the underlying physical equations, and it should therefore not be used as an estimate of the expected trajectory. The IM-based approach is less sensitive to ensemble size than EM-based skill computations, and its estimates of attainable prediction skills are closer to the actual skills. Using IM-based statistics helps to unravel the physics of predicted patterns in the CESM2-MP and their relationship to ocean-atmosphere-land interactions and climate modes of variability. Furthermore, the IM-based method emphasizes predictability of the 1stkind, which is associated with initial error sensitivity. In contrast, the EM-based method is more sensitive to the predictability of the 2ndkind, which is associated with the external forcing and time-varying boundary conditions. Calculating IM-based skills for the CESM2-MP provides new insights into the sources of predictability originating from ocean initial conditions, helping to delineate and quantify the forecast limits of internal climate variability.

CAPSULE (BAMS only)

Applying individual-based ensemble statistics to a new seasonal-to-multiyear climate prediction system reveals new sources of multiyear Earth system predictability of the 1st kind.

The purpose of this study is to introduce a new seasonal-to-multiyear Earth prediction system and present a robust method for determining predictable patterns associated with ocean initial conditions and internal climate variability. The conventional ensemble-mean-based approach utilizes smoothed information and captures a larger portion of the predictability related to external forcing and time-varying boundary conditions. We introduce the individualmember-based approach, which uses unfiltered information and emphasizes the predictability originating from the initial conditions. Therefore, the individual-member-based predictability provides more systematic and confident information associated with the ocean-atmosphereland interaction. Our results suggest a method for quantifying forecast limits related to internal climate variability for the prediction community. Additionally, our work offers practical and reliable guidance for end-users in their decision-making.

The field of climate prediction is advancing towards Earth system nowcasts on seasonal to decadal timescales (Dunstone et al. 2022; Hermanson et al. 2022). In June 2024, the World Meteorological Organization released a global annual to decadal climate update, reporting an 80% probability that the annual mean global surface temperature will exceed 1.5 °C above the pre-industrial level for at least one year in the next five years (World Meteorological Organization 2024). Indeed, 2024 was recorded as the first calendar year to be more than 1.5 °C above the preindustrial level, while long-term global warming, averaged over decades, remains below 1.5 °C (World Meteorological Organization, 2025). Given the climate risks from the combination of long-term warming due to enhanced greenhouse gas emissions and internal climate variability, there is a growing demand for decision-relevant and evidence-based nearterm climate information to support disaster management, adaptation choices, and mitigation strategies. In response to societal demand, significant scientific progress has been made in identifying the predictive power and sources of predictability of Earth system components across various time scales (e.g., Merryfield et al. 2020b; Brunet et al. 2023). Moreover, skillful physical climate predictions can be applied in the context of marine/terrestrial ecosystem changes, water availability, and human health, thereby providing valuable information for decision-making processes (Li et al. 2016; Bonan and Doney 2018; Park et al. 2019; Payne et al. 2022; Chikamoto et al. 2023).

Earth system predictability is based on the ability to predict the externally forced response (predictability of the 2nd kind) and internally generated climate variability (predictability of the 1st kind). The former is associated with the model’s sensitivity to the prescribed – albeit uncertain - forcings, which can be estimated by ensemble averaging of uninitialized forced simulations (Brune and Baehr 2020; Rodgers et al. 2021). The skill enhancement due to the information contained in the initial conditions can be calculated by removing the skill of uninitialized runs from initialized predictions (Yeager et al. 2018; Chikamoto et al. 2019; Brune and Baehr 2020). Since the degradation in forecast skill is mainly associated with the predictable limit of the internally generated component (Boer et al. 2013) and with the accumulating model errors (e.g., drift induced by the bias between observed and simulated climate state), objective and physically consistent estimates of this skill improvement are critical to investigate the growth pattern of forecast errors and provide actionable and reliable prediction information to stakeholders.

To provide robust estimates of the predictability of the 1st and 2nd kinds on seasonal to multi-year timescales, we compare two different skill assessment methods with each other. The first one exploits relationships among ensemble members to capture the fact that the real multiyear climate evolution is also just a realization of a high-dimensional ensemble of interacting multi-timescale processes. The second assumption is that climate evolution is slow (in terms of decorrelation time) relative to rapid stochastic weather fluctuations, and that the predictable component can be extracted by calculating the ensemble mean (EM) of a large ensemble size. This approach has two drawbacks for nonlinear dynamical systems: i) the ensemble mean does not represent a solution of the equations, and ii) the most probable climate state differs from the ensemble mean state in nonlinear systems (Hasselmann 1976).

In our study we i) introduce a new system for multiyear earth system prediction based on CESM2 (CESM2-MP) in 1o horizontal resolution, ii) assess its fidelity in terms of predicting large-scale modes of climate variability (e.g., El Niño Southern Oscillation (ENSO) and Atlantic Multidecadal Variability (AMV)) and iii) present a robust method to determine the predictable patterns of internal climate variability on seasonal to interannual timescales, vis-àvis externally forced signals.

Our analysis is based on a series of simulations conducted with the Community Earth System Model (CESM2) at a horizontal resolution of 1 °. These include A) a 20-member ensemble of coupled simulations with 3-dimensional ocean temperature and salinity anomaly data assimilation using a total of 2 different observational datasets and covering the period 1960-2020, B) a 20-member ensemble of 5-year initialized hindcasts initiated every January, and C) a 50-member ensemble of uninitialized historical simulations, conducted as part of the CESM2-Large Ensemble (CESM2-LENS), using historical greenhouse gas and aerosol and volcanic forcings (Rodgers et al. 2021).

To conduct initialized Earth system predictions on seasonal-to-decadal timescales, it is essentialto properly initialize the coupled model with observational estimates, Additionally, potential model drift must be reduced, as it could occur when the observational mean state is inconsistent with the simulated one and may lead to a quick deterioration of predictive skill beyond the system’s predictability limits. This issue was recently addressed by Chikamoto et al. (2019) using a bias-adjusted 3-dimensional temperature and salinity assimilation for the CESM1 model as an efficient method to minimize artificial forecast drifts. This method is applied to the CESM2 model by assimilating bias-adjusted monthly ocean temperature and salinity from EN4.2 for the period 1950-2021 and from the Japan Meteorological Agency's ProjDV7.3 for the 1955-2021 period into the POP2 ocean model using the Incremental Analysis Update scheme. The bias-adjusted ocean data are calculated by the following procedure (Chikamoto et al. 2019): the observational datasets are decomposed into a linear combination of internal variability ( 𝑂!"# ), externally forced components ( 𝑂$%# ), and climatological mean state (𝑂"). The monthly climatology for both observations and the model is calculated as the average over the 1971–2000 period. Observed and model-simulated externally forced components 𝑀$%# are estimated from the first mode of Singular Value Decomposition (SVD) analysis between the observational data and the uninitialized run for the 1960–2005 period. The first SVD mode is calculated separately for temperature and salinity anomalies at each ocean layer. After 2005, 𝑂$%# and 𝑀$%# are estimated by projecting onto the spatial pattern of these SVD modes. Finally, the bias-adjusted observations ( 𝑂∗ ) are reconstructed by linearly combining the observed internal variability, the model-simulated externally forced component, and the model-simulated climatology: 𝑂∗ = 𝑂!"# + 𝑀$%# 📷. This bias-adjusted observation is then assimilated into the CESM model to account for both mean state biases (e.g., Fig. 1a) and externally forced components.

Figure 1. (a) Spatial map of the model mean state bias in annual mean sea surface temperature (SST) climatology. SST bias is calculated as the difference between the ensemble mean of the uninitialized largeensemble simulation (CESM2-LENS) and COBE2 SST for the 1971–2010 period. Positive values indicate that the model overestimates observed SST, while negative values indicate an underestimation. (b) Time series of SST averaged between 60°S and 60°N, obtained from ocean assimilation runs (red), initialized hindcast runs (blue), and uninitialized large-ensemble simulations (gray) using CESM2. The blue lines are shown at 5-year intervals for clarity. Two bias-adjusted ocean analysis datasets (EN4.2 and ProjDv7.3) are represented by green circles and cross markers, respectively. Two observational datasets, HadISST and COBE SST2, are shown as green dashed and solid lines, respectively.

Assimilation simulations are then conducted with the fully coupled CESM2. Starting with five randomly selected initial conditions from the CESM2-LENS simulations, we assimilate the bias-adjusted temperature and salinity data into the POP2 ocean model (excluding sea ice regions), while also applying historical greenhouse gas, aerosol, and natural forcings from 1960-2014 and SSP3-7.0 forcing for the period 2015-2024 (2015–2024; van Marle et al. 2017). The aerosol forcing due to biomass burning does not follow the Coupled Model Intercomparison Project Phase 6 (CMIP6) protocol, but uses smoothed emissions instead, to avoid spurious drifts (Fasullo et al. 2022; Kim et al. 2023). The entire process described above is repeated for two different target assimilated products and 10 different ocean initial states, generating a 20-member ensemble of CESM2 assimilation simulations. Within 10 different ocean initial states, we applied two different ocean assimilation strengths for the first five and the other five members.

This procedure effectively removes initial condition biases between the model and observations from the assimilation runs. This is illustrated in Fig. 1b which shows the globally averaged SST (60ºS-60ºN) for COBE SST2 (Hirahara et al. 2014; green solid), HadISST (Rayner et al. 2003; green broken), the bias-adjusted products (observational anomaly + model climatology and externally forced signal; used for anomaly assimilation) in EN4.2 (circle) and ProjDv7.3 (cross), the uninitialized CESM2-LENS simulations (gray), and the CESM2 assimilation runs (red). In addition to the mean climatological biases, CESM2-LENS exhibits a slightly larger long-term linear trend than the observations, likely due to its higher climate sensitivity (Danabasoglu et al. 2020). After adjusting for these biases, the assimilation runs effectively capture internal variability and remain computationally stable, even during extended simulations.

Initialized on January 1stevery year (1960-2020) and using the corresponding ocean and atmospheric initial conditions of the 20 CESM2 assimilation runs, we then conducted 5-yearlong hindcast experiments. All parameter settings for the assimilation and hindcast runs, including sea ice albedo and marine biogeochemical parameters, are identical to those of the uninitialized historical simulations (CESM2-LENS), which serve as a benchmark to quantify the skill benefit of the initialized forecasts. In this study, the predictive skill of climate

185 anomalies in the hindcast runs will be compared with that of the assimilation runs (hereafter 186 referred to as the “attainable skill”) and with observational estimates (representing the “actual” 187 or “practical” skill). The observational benchmarks used for skill verification are as follows: 188 ERSSTv5 (Huang et al. 2017) for sea surface temperature (SST), Copernicus Marine Service 189 L4 gridded product (https://doi.org/10.48670/moi-00148) for sea surface height (SSH),

190 Copernicus-GlobColour (Garnesson et al. 2019) for marine net primary productivity (NPP), 191 CRU-TS4.05 (Harris et al. 2020) for surface temperature over land, ERA5 reanalysis (Hersbach

192 et al. 2020) for sea level pressure (PSL), GPCC version 2.3 (Schneider et al. 2014) for land 193 precipitation, NOAA Climate Prediction Center soil moisture (Fan and van den Dool 2004) for

194 total water storage, VODCA2GPP (Wild et al. 2022) for land gross primary productivity (GPP) 195 and the MODIS Fire_cci BA product (Chuvieco et al. 2018) for fire burned areas.

197 For skill assessment, this study uses the anomaly correlation coefficient (ACC), a scale198 invariant metric of the linear relationship commonly applied in weather and climate prediction 199 research. For a given lag 𝜏the ACC between a simulated time-discretized climate anomaly 200 variable 𝑋’((𝑡!)of a hindcast ensemble index 𝑗(20 members) and 𝑂)( (𝑡!) of a benchmark 201 ensemble index 𝑘 (20 members) is given by:

where the sum is taken over all discretized time points 𝑡! (56 initialized set 𝑖 for 1965–2020). 203 This anomaly-based metric assesses the relative association, independent of the mean value, 204 and is employed to evaluate the skill, focusing on the reproducibility of variations. Throughout

our study, we calculate anomalies as deviations from the temporal means for the evaluation 206 period. When assessing the skill of an ensemble forecast, the EM is commonly used (Seidman 1981). For linear systems, it represents the most probable predictable state. However, for nonlinear systems with non-Gaussian properties, it does not (Hasselmann 1976). By removing stochastic noise, the EM usually generates a higher ACC forecast skill than individual-member212 based metrics (Kirtman et al. 2014; Barnston et al. 2019). This overestimation has led to several 213 controversies (Scaife et al. 2014; Weisheimer and Palmer 2014; Shi et al. 2015) regarding the 214 applicability of EMs in end-user decision-making. Furthermore, suppose a signal can be decomposed into internal variability and a forced 216 component (a typical case for decadal climate predictions). In that case EM trajectories tend to filter out the less predictable internal variability component, while emphasizing predictability 218 of the 2nd kind due to time-evolving boundary conditions. This can lead to substantial 219 misrepresentation of skill, as the real signal is merely a realization in an ensemble of 220 possibilities, rather than an EM. This aspect also matters for reference (or benchmark) data. Therefore, alternative methods are required to provide robust estimates of predictability in 222 noisy environments (Kumar 2009). This study uses three different approaches to estimate prediction skills. The first approach is based on an EM < ⋯ >’ for both assimilation and While it has been widely used, this approach tends to lead to spurious predictability by (1) reducing the number of degrees of freedom in hindcast and assimilation runs, (2) removing unpredictable noise in hindcast runs and (3) potentially violating linearity assumptions for non230 Gaussian variables, such as e.g., rainfall in arid and semi-arid regions. The second approach is 231 to use individual members for assimilation runs, whereas an ensemble is used for hindcast runs 232 (EM-IM-based ACC), hereafter. In this case, there are 20 pairs of ACCs with 20 reference 233 simulations in this study, and the statistics (i.e., median or mode) of 20 pairs can be used as a 234 suitable indicator of attainable skill. Given that the observation is only one realization, 235 calculating

is more consistent with an actual forecast skill assessment than the EM-based ACC approach. The third method is to adopt the individual member-based (IM-based) ACC, calculated by the median of pairwise individual ACC sets (equation (1))𝐴𝐶𝐶’,)(𝜏) to consider both predictable and unpredictable components in the evaluation process (See “Conceptual Comparison of Ensemble Mean and Individual Member Based Skill Metrics” section). Attainable Prediction skill in our CESM2-MP can be estimated as the median of the 400 IM-based ACC values between the 20 assimilation and the 20 hindcasts runs. The attainable skill of uninitialized runs (50 members) corresponds to the median of 1000 pairs (20 x 50).

In our calculations, the forecast lead time is defined as the interval between the initialization time and the end of the target year (or month). The same evaluation period (1965–2020) was used regardless of the lead time to avoid dependence on the verification period. Statistical significance is assessed from the 𝑝-value of the Student’s t-test with effective degrees of freedom 𝑛∗ (Bretherton et al. 1999) to ensure a more robust evaluation:

Where 𝑟,and 𝑟+ are the lag-1 autocorrelation of two timeseries 𝑋and 𝑂 used for ACC, and a correlation is significant if the two-tailed 𝑝-value is less than 𝛼 = 0.1 (90% confidence level).

𝑇- = 𝐴𝐶𝐶 @1 𝑛−∗ 𝐴𝐶𝐶+B (5)

The significance of the skill difference is calculated by the method presented by Steiger (1980) to consider dependency between predictions (Siegert et al. 2017). We define the auxiliary quantity 𝑅 and test statistic 𝑇+as

𝑅 = (1 − 𝐴𝐶𝐶.)+ − 𝐴𝐶𝐶!)+ − 𝑟.!+ ) + (2 𝐴𝐶𝐶.) 𝐴𝐶𝐶!) 𝑟.!) (6)

T+ = (𝐴𝐶𝐶!) − 𝐴𝐶𝐶.))P2 Q𝑛𝑛∗∗ −− 13R 𝑅 + 14 (𝐴𝐶𝐶.) + 𝐴𝐶𝐶!))+(1 − 𝑟.!)0 (7)

255 Where 𝐴𝐶𝐶!" and 𝐴𝐶𝐶#" represent the uninitialized and initialized prediction skills, respectively, 256 relative to the benchmarks. The term 𝑟!# denotes the correlation between two predictions. Under 257 the null hypothesis of no difference in prediction skill, the test statistic 𝑇+ follows a Student’s 𝑡 258 distribution with 𝑛 − 3degrees of freedom. Note that cases where the test statistic cannot be 259 defined due to a negative 𝑅 (generally accounting for less than 0.1% of cases) are indicated by 260 hatching in the figures. Conceptual Comparison of Ensemble Mean and Individual Member Based Skill Metrics

The Lorenz ’63 model (Lorenz 1963) has served as a valuable tool for understanding the evolution of errors in weather and climate prediction (Palmer 1993, 1999) within a simple dynamical system framework. This nonlinear system exhibits chaotic behavior, where slightly different initial states can evolve into considerably divergent outcomes. Due to this chaotic nature, the forecast loses skill rapidly within a finite time interval, which is characterized by the inverse of the Lyapunov exponent. To investigate the difference between EM and IM in 269 assessing forecast skill, we run an ensemble of simulations using the nonlinear Lorenz ‘63 model (Eq. 8).

𝑥̇ = −𝜎𝑥 + 𝜎𝑦

𝑦̇ = −𝑥𝑧 + 𝑟𝑥 − 𝑦 (8)

𝑧̇ = 𝑥𝑦 − 𝑏𝑧

A dot above each variable denotes a derivative with respect to the dimensionless time 𝜏 while 𝜎 = 10, 𝑟 = 28, and 𝑏 = 8/3. We conducted a control simulation as a surrogate observation (O) from the initial condition (𝑥 = 0, 𝑦 = 10, 𝑧 = 0) to 𝑡 = 150 with 𝑑𝑡 = 0.01. The spin-up period was assumed to consider a fully evolved chaotic structure for forecast evaluation. From 𝑡 = 50 ~ 200, we performed a 1,500 set of 100-member ensemble forecast 276 𝑥/for 5 𝜏, initializing it by the control model value with small random perturbation. Figure 2 illustrates the trajectories of the control simulation (a), ensemble mean forecast 278 (b), and one of the individual forecasts (c). While the control simulation and individual forecast

exhibit chaotic regime transitions, the EM converges to a centered point due to the filtering of unpredictable chaotic characteristics. Clearly, in this case, the EM solution does not even represent a mathematically realizable solution. Figure 2(d) depicts the EM-based ACC (black line) and IM-based ACC (red line) as a function of the lead time (𝜏). As expected, the ACC skills decrease with lead time and eventually lose all memory of the initial conditions to become completely uncorrelated from the reference. Notably, the EM-based ACC is consistently larger than the IM-based ACC for all lead times, because it eliminates noise compared to individual members’ skill (Leith 1974; Murphy 1988; Leutbecher and Palmer 2008; Kirtman et al. 2014; Barnston et al. 2019). The fact that the EM is not even a solution of the underlying equations (Fig. 2b) suggests that the higher ACC values are spurious and should not be considered an indication of increased predictive skill. Moreover, as the filtered noise becomes much larger than the ensemble mean signal, the EM-based ACC cannot represent the skill for the whole system with high confidence (Scaife et al. 2014; Weisheimer and Palmer 2014; Shi et al. 2015). We can further compare the EM-based ACC with the Brier score (Brier 1950), a metric commonly used in operational forecasting to characterize the phase (sign) of a signal (𝑥here). For the Lorenz ’63 model the Brier score reaches a value of 0.5 (no skill) for lead times of 𝜏=3; but the EM-based ACC still shows nonzero values. This is because the Brier score uses unfiltered information, whereas the EM-based ACC takes filtered information. If we use the IM-based ACC, which includes both predictable and unpredictable components, it shows a skill decay pattern similar to the probabilistic Brier score-based evaluation. Higher EM-based ACC values do not necessarily imply improved forecasts. Additionally, when comparing the dependence of the EM-based ACC and the IM-based ACC on ensemble size (red boxes in Fig. 2e), we find an increase in the former and only a slow saturation towards larger ensemble sizes, whereas the latter shows robust values even for small ensembles.

The standard Lorenz ’63 model trajectory of control simulation (a), 100-member ensemble mean (b), and one of the individual members (c). All the ensemble members’ trajectories are represented as gray areas in (b). Lower panel (d) denotes initialized prediction ACC skills according to the different evaluation methods. The black line with dots represents skill using the ensemble-mean-based ACC. The boxes and whiskers indicate ACC skills for the minimum, maximum, and 25th, 50th, and 75th percentiles of the 100-member ensemble of forecasts. Red line indicates the skill evaluated by individual-based ACC. The green dashed line represents the significance threshold at 90% confidence (N = 1500). Magenta line denotes the Brier score for the phase (sign) of 𝑥 and orange dashed line indicates 0.5 which is the meaningless value for the Brier score for two possible phases (positive, negative) here. The last panel (e) represents ACC skills according to the number of ensemble members used. ACC skills are evaluated 1,000 times by repetitively subsampling for each box. Red boxes indicate ensemble mean ACC skill, and blue boxes represent individual-based ACC skill. The boxes and whiskers show the minimum, maximum, and the 25th, 50th, and 75th percentiles of the 1,000 ACC skills.

The choice of evaluation strategy depends on the research objective. In cases with large, predictable signals (e.g., responses to strong external forcings or the ENSO) and relatively small noise components, the EM-based ACC can provide valuable information. In this study, we focus on evaluating multi-year predictability using a comprehensive Earth system model, which is initialized with observational estimates. We anticipate that internal variability could be of sufficient magnitude to influence the variance at 5-year lead times. To quantify how well our model reproduces the real climate trajectory (which is not an EM), it is advisable to use the IM-based ACC, which is applicable regardless of the signal-to-noise ratio. Estimating Earth system predictability

a. Illustrating differences between EM- and IM-based ACC with SST predictions

In the following, we determine the patterns of multiyear SST predictability using the EMbased, EM-IM-based, and IM-based ACC skill metrics calculated for the 20-member hindcast simulations and the 50-member uninitialized CESM2-LENS simulations, relative to the 20member ocean data assimilation runs. For both the initialized and uninitialized simulations, we observe noticeable skill for 1-year EM forecasts across large areas of the world’s oceans (Fig. 3a, e), which can be attributed to the time-evolving boundary conditions (predictability of the 2ndkind), including volcanic eruptions, greenhouse gas, and aerosol concentrations. However, unlike for the hindcasts, the uninitialized simulations lack predictability in the tropical Pacific and in the Atlantic subpolar gyre (Fig. 3e, f). These areas stand out as regions with high levels of internal variability associated with the ENSO and the AMV, respectively, consistent with other studies (Chikamoto et al. 2019; Bethke et al. 2021; Bilbao et al. 2021; Nicolì et al. 2023). To further highlight the effect of initialization (predictability of the 1st kind), we can subtract the uninitialized skill from the initialized skill (Fig. 3i, j). However, we need to bear in mind that ACC samples are, in general, non-Gaussian variables (Fisher and Jeans 1928), and correlation differences should not be overinterpreted. EM-IM-based ACC estimates (Fig. 3b, f, j) assume that individual assimilation members can be regarded as potential realizations. The corresponding skill patterns, which are mostly dominated by external forcings, are similar to those obtained from the EM-based ACC. The skill difference between initialized and uninitialized highlights again the ENSO and AMV regions. We also observe a slight skill gain in other ocean regions (negative values in Fig. 3i, j). Interestingly, there appear to be areas where initialization introduces even a weak deterioration of skill, as indicated by the negative values in the Pacific Warm Pool region and the Benguela upwelling area.

Attainable prediction skill calculated as anomaly correlation coefficient (ACC) of annual SST from initialized hindcast (a–d), uninitialized run (predictability of the 2nd kind; e–h) relative to assimilated run. Third row (i–l) denotes skill improvement by initialization (predictability of the 1st kind) determined by subtracting skills of uninitialized run (second row) from initialized hindcast (first row). The first column (a, e, i) is the attainable skill assessed from the EM-based ACC against the ensemble mean of the assimilation run for lead year 1. The second column (b, f, j) is EM-IM based ACC, which is the median of the ensemblemean-based ACC against individual assimilated runs (20 members). The third (c, g, k) and last (d, h, l) columns are the attainable skills computed from IM-based ACC for lead years 1 and 2–5, respectively, represented by a median of pairwise ACC (400 pairs for initialized hindcast and 1000 pairs for uninitialized run). Hatched areas are locally insignificant under 90% confidence.

The pattern of the predictability estimated from IM-based ACC (Fig. 3c, g) is similar to that obtained from the EM-based ACC (Fig. 3a), but with substantially reduced magnitude. Other differences can be found in coastal polar regions (Fig. 3f, g), where we excluded ocean data assimilation, and in the eastern tropical Pacific, where large swaths of the uninitialized runs exhibit insignificant skill. The latter can be explained by the fact that for a 1-year lead, the internal variability in each ensemble member greatly exceeds the forced component. Examining the skill differences for the IM-based method (Fig. 3k), we observe that, apart from the tropical Pacific and subpolar Atlantic, all other regions also exhibit considerable skill improvements from the initialization. We can see that the skill difference for the IM-based method is generally higher than that for the EM-based method in Fig. 3. This implies that the EM-based ACC emphasizes the predictability of the 2nd kind. In contrast, the IM-based ACC is more sensitive to the predictability of the 1stkind. Moving to longer-term lead times (years 2-5, averaged within a 4-year window), the IM-based attainable skill shows an overall increase in skill (Fig. 3d) relative to the 1-year lead horizon (Fig. 3c, g), despite the longer lead time and the expected loss of predictability. This is due to the reduction in interannual variance resulting from temporal averaging. However, the temporal averaging also leads to particularly low effective degrees of freedom in the uninitialized runs, rendering most regions statistically insignificant (Fig. 3h). The comparison between initialized and uninitialized simulations reveals some further improvements in the subpolar North Atlantic and the subtropical eastern Pacific. This Atlantic signal was also identified in previous studies (Yeager et al. 2018; Bethke et al. 2021), and it can be partly linked to the realistic initialization of the Atlantic thermohaline circulation (Yeager and Robson 2017). Other areas show very little longer-term SST predictability of the 1st kind for 2-5-year forecasts. The actual skill (comparison against observed SST products, such as ERSSTv5) is very similar (Fig. ES1) to the attainable skill, because the observational SST anomalies are already included in our ocean data assimilation runs. Since reality is not an ensemble mean but rather an ensemble member, the IM-based ACC assessment should be considered a more useful metric than the EM-based skill.

b. Prediction skill of other ocean variables

Slow, dynamic, and thermodynamic adjustments characterize ocean processes. This makes them particularly promising candidates for seasonal-to-decadal-scale predictions (Chikamoto et al. 2015; Meehl et al. 2021). Figure 4 illustrates the attainable prediction skill based on the IM-based ACC for selected ocean physical and biogeochemical variables .Attainable prediction skill (odd columns) and skill improvement by initialization (even columns) for oceanic variables estimated by IM-based ACC, which is the median of pairwise ACC. 400 (20 assimilation * 20 initialized runs) and 1000 (20 assimilation * 50 uninitialized runs) pairs of ACC are used for initialized and uninitialized runs, respectively. First, second and third rows indicate Sea Surface Height (SSH; a–d), surface nitrate (NO3-; e–h) and 100m integrated marine Net Primary Productivity (NPP; i–l), respectively. The first two and last two columns represent skills for lead years 1 and 2–5, respectively. Hatched areas are locally insignificant at the 90% confidence level.

Since SSH is sensitive to the changes in temperature and salinity linked to the density and bottom pressure, and because predictable large-scale ocean dynamical processes such as Rossby waves contribute to slow sea level SSH adjustments (Chikamoto et al. 2019), the initialization considerably enhances the predictability from tropical to subarctic regions relative to the uninitialized runs with lead year 1 (Fig. 4b). High predictability in the polar areas (Fig. 4a) but low initialization skill improvement (Fig. 4b) can be explained by the fact that we do not apply data assimilation in this region and that the polar regions in general experience a stronger radiatively forced response (Stuecker et al. 2018; Chung et al. 2021). With longer lead times (Fig. 4c), there is a loss of predictability in low-mid latitudes beyond lead year 1, primarily due to the predictability limit of ENSO (Timmermann et al. 2018). Nevertheless, predictability persists in mid- and high-latitude regions for lead years 2 through 5.

Our initialized hindcast exhibits considerable predictability and initialization skill for surface nitrate (NO3-) throughout the global ocean for lead times of 1 year. The highest skill levels are observed in the tropical and southeastern Pacific, as well as the tropical and subtropical South Atlantic (Fig. 4e, f). A Substantial improvement in NO3- skill by initialization might be a consequence of changes in thermal structure, mixed layer depth and ocean circulation that contribute to NO3-transport and distribution (Chikamoto et al. 2015). The majority of high skill in the Arctic region can be attributed to external forcing, which leads to a reduction of nitrate concentrations. For longer lead times (2-5 year average), the considerable predictability in the subpolar North Atlantic associated with external forcing stands out (Fig. 4g). In addition, high predictability emerges in the Antarctic coastal region because highfrequency variability is removed in lead years 2–5.

The initialized runs show substantial predictability for the column-integrated marine NPP, in low and mid latitudes as well as the Arctic region (Fig. 4i). Interestingly, the spatial pattern of the marine NPP predictability (Fig. 4i) is highly correlated with the improvement made by the initialization (Fig. 4j) due to data assimilation. This suggests that the initialized physical properties and the associated constraints in ocean circulation provide an important source for marine NPP predictability. An exception is high latitude regions where the initialization of physical variables is not effective (Fig. 4j). The absence of skill in high latitudes is in agreement with previous studies (Séférian et al. 2014; Fransner et al. 2020; Frölicher et al. 2020; Krumhardt et al. 2020; Bethke et al. 2021), and phytoplankton growth is typically limited by light availability or Fe deposition from the atmosphere, which cannot be constrained by initialized ocean physical variables.

The predictability and initialization skill improvement for SSH, NO3-, and marine NPP based on the EM-based ACC against the ensemble mean of the assimilation runs show high values across most of the areas for lead year 1, since they are estimated based solely on predictable signals and average out stochastic unpredictable components (Fig. ES2). Even in the subpolar North Pacific – a region of strong atmospheric variability – the EM-based ACC indicates high levels of skill for marine NPP variations due to the suppression of noise. Furthermore, uncertain negative initialization skill improvement emerges for lead years 2–5 (e.g., Southern Ocean for marine NPP; Fig. ES2l).

Actual skills of ocean variables with the IM- and EM-based ACCs are represented in Fig. ES3 and Fig. ES4, respectively. Nitrate was not evaluated due to the absence of reliable longterm global observation data. Due to the relatively short periods covered by other observational datasets (SSH: 1993~2020; marine NPP: 1998~2020) and the corresponding low number of effective degrees of freedom, the actual skills for lead years 2–5 are insignificant across most of the world’s oceans. Additionally, although multiyear predictability is based on lowfrequency variability such as AMV, the observational period is not long enough to adequately capture this low-frequency variability.

c. Prediction skill of atmospheric and terrestrial variables

Most of atmospheric internal variability is essentially unpredictable beyond a couple of weeks (Simmons and Hollingsworth 2002). However, the initialization of the ocean with slowly varying ocean heat content due to its thermal and dynamical inertia can induce some signals in the atmosphere with predictability far beyond the weather timescale (Ji et al. 1998; Cassou et al. 2018; Brune and Baehr 2020; Merryfield et al. 2020a). In our model set-up, we do not apply atmospheric data assimilation. Thus, all the atmospheric predictability, which is not related to the external anthropogenic forcings, originates from the oceanic initial conditions and climate variability modes. To better isolate the longer-term predictability of the 1st kind of atmospheric processes, we focus our analysis on the IM-based ACC.

illustrates the attainable skill and initialization skill improvement of selected atmospheric variables: 2 m temperature (T2m), PSL, and precipitation for lead year 1. T2m exhibits considerable predictability globally driven by external forcing (Fig. 5a). Additional skill from initialization can be found in the tropical region, namely the ENSO influenced region, the Amazon, parts of India, Central Africa, Alaska and Southeastern Asia (Fig. 5b). For PSL and precipitation (Fig. 5 c-f), we also find increased skill in tropical areas, which can be explained by the typical response of the Walker Circulation to ENSO-related SST anomalies and the corresponding global ENSO teleconnections in precipitation (Ropelewski and Halpert 1987). The patterns of atmospheric predictability due to initialization are consistent with previous findings (Wu and Kirtman 2006; Jia and DelSole 2012). In fact, the results in Fig. 5 are dominated by the impact of predictable ENSO variability on the atmospheric circulation and rainfall patterns (Dai and Wigley 2000). In contrast, the EM-based ACC fails to identify these characteristic patterns because it does not reflect the contribution of internal variability filtered by ensemble mean in the extratropical region (Fig. ES5). The actual and attainable skill patterns for the IM-based ACC are very similar because the skills were evaluated against the unfiltered data (Fig. 5, Fig. ES6). Interestingly, for the EM-based skill enhancement due to initialization, we find many areas of little increase or even skill deterioration (negative differences; Figs. ES7). This can be explained by the fact that the EM-based method underestimates the predictability of the 1st kind.

Attainable prediction skill (left column) and skill improvement by initialization (right column) for atmospheric variables estimated by IM-based ACC which is the median of pairwise ACC. 400 (20 assimilation * 20 initialized runs) and 1000 (20 assimilation * 50 uninitialized runs) pairs of ACC are used for initialized and uninitialized runs, respectively. First, second and third rows indicate 2m temperature over land (T2m; a–b), sea level pressure (PSL; c–d) and precipitation (e–f) for lead year 1. Hatched areas are locally insignificant under 90% confidence.

For precipitation (Fig. 5e-f), the initialized runs exhibit promising predictability over the tropical regions associated with ENSO predictability (Chen and Cane 2008). The Sahel region (350–30°E, 5–15°N) shows significant attainable skill, which is consistent with Yeager et al. (2018). The Maritime continent (100–150°E, 10°S–10°N), Southern Africa (20–30°E, 25– 35°S), Florida (270–280°E, 25–30°N), and Northeastern Brazil (295–325°E, 10°S–5°N) also show significant skill. We find pronounced initialization skill improvements in Maritime continent and Northeastern Brazil as well. Precipitation shows substantial differences between IM-based ACC or EM-based ACC estimates (Fig. 5e, f, ES5e, f). Even though initialized ocean influences prediction skill globally, most of the benefits are insignificant under consideration of the dominant atmospheric noise (Fig. ES5f, Fig. 5f). In addition, for lead years 2–5, initialization skill improvements calculated by the individual-based ACC are insignificant in most of the regions (not shown).

Recent studies (Chikamoto et al. 2017; Chikamoto et al. 2020; Olson et al. 2021) have demonstrated that land hydrological processes integrate stochastic variability in precipitation and evaporation, which leads to a reddening of spectra (Hasselmann 1976; Delworth and Manabe 1988) of societally relevant land variables, such as soil moisture, NPP, runoff and even fire occurrences. This natural filtering process can suppress atmospheric noise, leading to lower-frequency variations in land variables and therefore increased land surface predictability.

Our IM-based attainable skill evaluation of land variables (Fig. 6) reveals large areas of predictability of the 1st kind, even in areas where the predictability of precipitation is insignificant (Fig. 5e, f). More specifically, substantial skill in total water storage with lead year 1 is observed in various local regions such as the dust belt (350–90°E, 20–40°N), Central

(345–45°E, 10°S–20°N) and Southern (15–35°E, 20–35°S) Africa, Maritime continent (100– 150°E, 10°S–10°N), parts of the United States (235–285°E, 20–45°N) and Northern Brazil (280–325°E, 10°S–10°N; Fig. 6a). The noticeable prediction skill, along with substantial precipitation skill over parts of the United States, is consistent with the second-year boreal summer skill reported in a previous study (Yeager et al. 2022). Predictability in the Sahel region may be associated with the response of the African summer monsoon to interdecadal oceanic variations (Giannini et al. 2003), and ENSO is likely the key source of predictability for variations in total soil moisture over the Maritime continent, North America and Northern Brazil (Chikamoto et al. 2017; Esit et al. 2021). The EM-based ACC exhibits high attainable skill and initialization skill improvement over the world because the skill is dominated by the ensemble-mean signal regardless of the signal-to-noise ratio (Fig. ES8). Even though initialization skill improvement of the IM-based ACC is consistent with the one for actual skill (Fig. ES9), the EM-based initialization skill improvement pattern for attainable skill substantially differs from actual skill (Fig. ES8, 10).

. Attainable skill (left column) and skill improvement by initialization (right column) for terrestrial variables estimated by IM-based ACC which is the median of pairwise ACC. 400 (20 assimilation * 20 initialized runs) and 1000 (20 assimilation * 50 uninitialized runs) of ACC are used for initialized and uninitialized runs, respectively. First, second and third rows indicate land total water storage (TWS; a–b), gross primary productivity (GPP; c–d) and burned area by fire (e–f) for lead year 1. Hatched areas are locally insignificant under 90% confidence.

GPP responds to variations in temperature, solar radiation, CO2 levels, and land water storage (Collalti et al. 2020; Dunkl et al. 2023). Given the close link between GPP and external factors such as atmospheric CO2 rise and global warming (Lovenduski et al. 2019), variability in most regions is dominated by the increasing trend of temperature (Fig. 6c), which also serves as the primary source of predictability. Some regions, such as Philippines (120–127°E, 5 – 20°N) and Northeastern Brazil (280–330°E, 10°S –10°N), show high predictability and overlap with added value of significant initialization skill improvement regions for both GPP and total water storage. And this can be linked to predictability from modes of internal climate variability. In other words, ENSO and other internal modes of climate variability may be the primary sources of predictability (Zhang et al. 2019). In addition, initialization skill improvement patterns for actual skill (Fig. ES9d) are consistent with initialization skill improvement of attainable skill.

Predicting wildfire is complex due to various driving factors such as ignition source (lightning strikes, anthropogenic factors), weather conditions (temperature, wind, precipitation, relative humidity), and available fuels. Wildfire predictability has previously been linked to large-scale climate patterns (Westerling et al. 2003). In the initialized runs, burned area is predictable in low-mid latitude regions for the first year of hindcast (Fig. 6e). In high attainable skill regions, such as Central Africa (350–30°E, 5°S –15°N), India (70–90°E, 10–30°N), Indochina Peninsula (90–110°E, 10–20°N) and the Amazon (280–310°E, 20°S–10°N), burned area is mostly determined by external forcing, similar to the GPP case above. Regions with noticeable skill can be found in the tropics, consistent with a recent study (Jones et al. 2022). The benefit of ocean initialization extends to promising regions such as Southern Africa,

Central Africa, Indochina Peninsula, Maritime continent, Southern United States, Northeastern Brazil similar with the pattern of other terrestrial variables in the initialized runs. Many of these regions are influenced by precipitation and drought conditions driven by ENSO (Chen et al. 2017; Taufik et al. 2017). Precipitation seasonality may also play a significant role in predicting burned areas in Central Africa (Archibald et al. 2009). In addition, the burned area in the Southern United States may be linked to low frequency variability of soil moisture induced by Atlantic/Pacific SST gradient (Chikamoto et al. 2017). In spite of some promising attainable skill, the actual skill is insignificant in most areas (Fig. ES9). It might be due to the fact that CESM2 does not include an interactive lightning scheme and that real human-induced fires are accounted for in the CESM2-MP. Moreover, considerable observational uncertainties need to be acknowledged for the relative short MODIS satellite data set.Integrating attribution and prediction information for modes of variability

Major modes of climate variability such as ENSO (Tang et al. 2018) or the AMV (Grötzner et al. 1999) are important sources of climate predictability (e.g., Palmer et al. 2004; Wang et al. 2015; Merryfield et al. 2020b). As demonstrated above, their corresponding predictive skill due to initialization may translate into other societally relevant physical and biogeochemical variables. Contrary to our findings above that IM-based skill is a more robust metric for predictability of the 1stkind, we find many applications of EM-based statistics to characterize the skill of ENSO forecasts (e.g., Barnston et al. 1999; Choi and Son 2022). Moreover, the probability distribution of ENSO is highly skewed (Timmermann 1999; Burgers et al. 2005), which is an indication of underlying nonlinear dynamics, in particular for very strong events. This results in higher predictability for El Niño than La Niña and normal events (Timmermann et al. 2018). It has been documented that at least three different types of nonlinearities need to be considered for ENSO: 1) nonlinear dynamic heating (Jin et al. 2003), 2) state-dependent noise, which effectively acts as a nonlinearity (Jin et al. 2007), 3) atmospheric nonlinearities (Timmermann et al. 2018; Takahashi et al. 2019). These nonlinearities highlight that EM-based

ACC estimates may not properly characterize the most likely evolution of ENSO. For the AMV, the situation may be the opposite, because some studies have suggested the impact of external forcings, including aerosols on the observed evolution of multidecadal Atlantic SST variability (Watanabe and Tatebe 2019; Qin et al. 2020). In this case, EM-based metrics might better characterize the actual predictability.

Here we further evaluate the differences of EM- and IM-based skill metrics in the context of ENSO and AMV prediction using the CESM2-MP. We introduce a zero-dimensional simple Ornstein Uhlenbeck process as a benchmark prediction model (Hasselmann 1976). The discretized version of this model yields a first-order autoregressive model (AR(1)). Its predictability (often referred to as damped persistence) can be estimated as the lag-1 autocorrelation (1 − 𝜆) (autocorrelation; e.g., Jin et al. 2018), where 𝜆is the inverse linear damping timescale. The lag-autocorrelation of the EM of the null hypothesis model can then be expressed as 𝑒12 4, where 𝜏 represents the lead time.

a. ENSO

Here we study ENSO predictability in the CESM2-MP using the monthly Niño 3.4 index (SST anomaly averaged over 190o–240oE and 5oS–5oN) initiated from January 1st every year for lead times up to 28 months during 1960–2020. The assimilation runs capture the observed interannual variability of ENSO reasonably well, but with remaining errors (Fig. 7a) due to uncertainties in ocean reanalysis data used for the assimilation and lack of atmospheric assimilation. The uncertainties in the assimilation runs are generally larger during boreal summer but smaller during boreal winter in terms of ACCs between the observation and assimilation runs (Fig. 7b). There is a slightly increasing trend of ENSO SST anomalies simulated by the ensemble mean of uninitialized runs (Fig. 7a). Although limited, the contribution of external forcing estimated from the determination of coefficient (R-squared) is highest during May accounting for around 4% (Fig. 7b) and 9% (Fig. ES11b) of the observed and assimilated ENSO variability, respectively. The EM of initialized runs predict ENSO well for a one-month lead time with an ACC skill of 0.9, but the ensemble spread is quite large even at a short lead (Fig. 7a).

. (a) Monthly Niño3.4 SST index anomaly obtained from observation (OBS, blue), ensemble mean of large-ensemble simulation (LE-EM, red), ensemble mean of ocean data assimilation (ODA-EM, green), ensemble mean of 1-month hindcast (HIND-EM (M1), black), and 20 individual members of 1month hindcast (HIND-IM (M1), grey) from January 1960 to December 2020. (b) Actual ACC skills for Niño 3.4 SST as a function of lead month up to 28 obtained from the multiyear prediction system for 19602020. Boxes and whiskers indicate ACC skills for the minimum, maximum, and 25th, 50th, and 75th percentile of the 20-member ensembles of hindcasts. Solid circles denote the skill for ensemble mean of hindcasts. The dashed blue and solid blue lines indicate the ACC skills for the damped persistence model based on simple autocorrelation (Pers.) and the first-order autoregressive model (AR(1)), respectively, using observed data only. The dashed red and green lines are ACC skills for LE-EM and ODA-EM, respectively, with respect to OBS. The light green shading is the range of minimum and maximum ACC skills for 20 ensembles of assimilated run.

We estimate attainable and actual prediction skill for ENSO based on both the EM and IMbased approaches (Fig. 7b and Fig. ES11). The attainable skill of initialized runs is also calculated with respect to individual members of assimilated runs (e.g., EM-IB-based ACC). The predictabilities are slightly higher with a smaller ensemble spread than the actual skill up to around 4-month lead. However, the actual skills are comparable to the predictabilities at the longer lead. The persistence in assimilation runs estimated by the damped persistence based on simple autocorrelation (Pers.) and as well as AR(1) model is generally slightly higher than that in observations, but their difference is small (Fig. ES11). While the initialized runs outperform the damped persistency after the 2-month lead, only their ensemble mean, and upper percentile have a higher skill than the AR(1) model.

Most studies have mainly focused on EM skills of individual models and the multi-model mean. For example, Choi and Son (2022) showed that the multi-model mean of CMIP6 Decadal Climate Prediction Project (DCPP) prediction for a 3-month running mean of Niño 3.4 SST index is predictable up to 12–14 months with November or January initial conditions during 1961–2020, if ACC of 0.5 is used as a threshold of useful skill. In our study, the ensemble mean prediction for monthly Niño 3.4 SST with January 1st initial condition has a skill of more than 0.5 ACC until the following January, comparable to the results in Choi and Son (2022). However, it is also important to estimate an ensemble spread of ACC skills that includes uncertainties from initial errors and season-dependent internal variability as well as model uncertainties. Even though the ensemble mean skill is high, the ensemble spread may not be negligible even for shorter lead times.

b. AMV

For constructing the AMV index we used the Global Temperature Residual method (G-Res) following Deser and Phillips (2021). Here, the pattern associated with anthropogenic climate change is removed to isolate the natural component of the AMV. The G-Res method is computed for the anomaly data as:

𝐺𝑅𝑒𝑠𝑆𝑆𝑇 𝐺𝑆𝑆𝑇𝑟𝑒𝑔(𝑥, 𝑖), (7)

Where x refers to the geographic location, 𝑡 is time, 𝑖 is the ensemble index, 𝐺(𝑡, 𝑖) is the global-mean (60oN-60oS) of SST, and 𝐺𝑆𝑆𝑇𝑟𝑒𝑔(𝑥, 𝑖) is the linear regression of 𝑆𝑆𝑇(𝑥, 𝑡, 𝑖) onto 𝐺(𝑡, 𝑖). From this 𝐺𝑅𝑒𝑠𝑆𝑆𝑇(𝑥, 𝑡, 𝑖) we constructed the AMV index as the area-weighted mean average of North Atlantic SSTA (0o-60oN and 80oW-0o). We used this simple SSTA based index instead of the widely used method suggested by Trenberth and Shea (2006) as the former method is robust to climate change than the later For example, the lag 0-ACC for the AMV_TS index between the observations and uninitialized runs is 0.45, but that for the AMV index used in this study is 0.31 from 1960 to 2020. We note that the assimilation runs tend to overestimate the contribution from external forcing. The externally forced response explains about 9% of the total variance of the observed AMV index (Fig. 8a) but about 16% of the assimilated AMV index (Fig. ES12a).

Same as Fig. 7 except for annual mean AMV index anomaly. In (a), ensemble mean of 1-year hindcast (HIND-EM (Y1), black), and 20 individual members of 1-year hindcast (HIND-IM (Y1), gray) are shown. In (b), ACC skills for the AMV SST index as a function of lead year are shown. The black boxes, whiskers, and solid circles represent the results with raw prediction, while those in grey represent the results with a 11-year backward moving average.

The assimilation runs capture the observed interannual to interdecadal variability of AMV well, with an EM-based ACC of 0.9 and an IM-based ACC range of 0.84-0.91. Still, there are notable deviations, especially during the 1960s (Fig. 8).

The persistent nature of the assimilated AMV is comparable to the observations, with the lag-1 autocorrelation of around 0.5 (Fig. ES12). The attainable ACC skill for the ensemble mean of initialized runs with respect to assimilation is about 0.7 at the 1-year lead and about 0.55 at the 5-year lead. The ensemble mean of initialized runs has the actual ACC skill of 0.55 at the 1-year lead and 0.52 at the 5-year lead. It outperforms the damped persistence based on both AR(1) and simple autocorrelation, particularly at longer leads, but the ensemble spread is quite large, even at the 1-year lead.

For the decadal prediction systems, previous studies mainly focused on the averaged AMV skill at the 2-5 lead year or the skill after applying a moving average. Our initialized prediction has a comparable skill of 0.79 at the 2-5 lead year to many other decadal prediction systems, with the skill ranges from -0.25 to 0.82 (Xin et al. 2024). When an 11-year backward moving average is applied to filter interannual variability as noise, the ACC skill of our initialized runs reaches 0.8–0.9. However, our target is prediction on seasonal to multiyear horizons, and we avoid filtering interannual variability, which should still be an important source of predictability. Discussion and conclusion

CESM2-MP, the new seasonal to multiyear prediction system developed jointly in collaboration between ICCP, Utah State University, and the Bjerknes Centre, incorporates ocean temperature and salinity anomaly assimilation to reduce unexpected forecast errors from model drift in long-term integrations. Alternatively, instead of relying on anomaly assimilation, full-field initializations using forced ocean–sea ice (FOSI) simulations are also employed to initialize prediction systems (Karspeck et al. 2015; Yeager et al. 2018; Yeager et al. 2022; Yeager et al. 2023). Because the strengths and limitations of each initialization strategy differ in multiple aspects, it is not easy to directly compare or rank their performance in a fully quantitative manner. Instead of employing ocean data assimilation, FOSI is driven by observed atmospheric heat fluxes and wind stress. While this approach successfully captures ocean variability, it introduces a noticeable initialization shock due to biases in the atmospheric forcing and systematic model errors (Yeager et al. 2018; Meehl et al. 2024). While anomaly assimilation can reduce initialization shock, its effectiveness depends on the quality and coverage of available gridded observational products. For instance, the limited number of observations in the subsurface Indian Ocean can lead to substantial uncertainties (Good et al. 2013). Furthermore, in regions with strong systematic model biases, including thermocline, halocline, and pycnocline depths, initial drifts may still occur despite anomaly assimilation, due to discrepancies between model variability and observed variability. To better understand the prediction process and improve forecast skill in CESM-based systems, it is necessary to further compare and evaluate initialization strategies in future studies, with a focus on their prediction performance.

Combining a pre-existing large ensemble (50 members) of uninitialized historical and SSP3-7.0 simulations with CESM2-MP enables us to determine the 1st (ability to predict the internal variability) and 2nd (ability to predict the externally forced response) kinds of predictability of key climate variables and modes of variability. We demonstrated that IM based ACC is a robust, ensemble-size independent measure of predictability of the 1st kind, which agrees with the Brier score metric. In contrast EM-based ACC estimates capture more the predictability of the 2nd kind. The IM-based skill benefit from initialization for atmospheric and terrestrial variables exhibits a higher consistency between attainable and actual skills compared to the EM-based ACC. The former therefore provides more systematic and confident information associated with slow oceanic dynamics and related atmospheric teleconnections. Although taking differences in ACC is intuitive and straightforward, this approach can lead to misinterpretations, suggesting that one prediction is categorically more skillful than another. Complementary methods for evaluating skill differences, such as formal statistical tests, can improve the reliability of the interpretation (DelSole and Tippett 2014).

Our results, based on IM-based ACC, should encourage the multiyear prediction community to provide more systematic and confident information about the prediction capacity for decision-making than the conventional deterministic EM approach. Moreover, consistent predictability patterns between atmospheric and terrestrial variables serve as a starting point for understanding ocean-originated predictability mechanisms and their limits due to strong interactions in the coupled ocean-atmosphere-land system. Additionally, CESM2-MP, with minimized drift and improved biogeochemical representation, offers further opportunities to deepen our scientific understanding of multiyear predictability for ecosystems, carbon sinks, and extreme phenomena such as heatwaves and heavy rains.

  • Similar topics
  • Heat
More Abbas Kashani's questions See All
Similar questions and discussions