Effects of random forest modeling decisions on biogeochemical time series predictions
Peter Regier and Matthew Duggan contributed equally to this study.
Author Contribution Statement: M.D. and P.R. co-led the manuscript and contributed equally. M.D., A.M.-P., and P.R. designed the study approach. M.D. and P.R. conducted data collection, processing, and led statistical analyses, and constructed models and code. P.R. led drafting of the manuscript, and all authors contributed to editing.
Abstract
Random forests (RF) are an increasingly popular machine learning approach used to model biogeochemical processes in the Earth system. While RF models are robust to many assumptions that complicate deterministic models, there are several important parameterization decisions for appropriate use and optimal model fit. We explored the role that parameter decisions, including training/testing data splitting strategies, variable selection, and hyperparameters, play in RF goodness-of-fit by constructing models using 1296 unique parameter combinations to predict concentrations of nitrate, a key nutrient for biogeochemical cycling in aquatic ecosystems. Models were built on long-term, publicly available water quality and meteorology time series collected by the National Estuarine Research Reserve monitoring network for two contrasting ecosystems representing freshwater and brackish estuaries. We found that accounting for temporal dependence when splitting data into training and testing subsets was key for avoiding over-estimation of model predictive power. In addition, variable selection, the ratio of training to testing data, and, to a lesser degree, variables per split and number of trees were significant parameters for optimizing RF goodness-of-fit. We also explored how model parameter decisions influenced interpretation of the relative importance of predictors to the model and of modeled relationships between predictors and the dependent variable, with results suggesting that both data structure and model parameterization influence these factors. Because much of the current RF literature is written for the computational and statistical science communities, the primary goal of this study is to provide guidelines for aquatic scientists new to machine learning to apply RF techniques appropriately to aquatic biogeochemical datasets.
High-frequency water quality information is important for understanding biogeochemical and ecological dynamics in aquatic systems, and for accurately informing management and policy decisions (Kirchner et al. 2004; Krause et al. 2015). The development and widespread adoption of in situ sensors capable of collecting water quality data relevant to aquatic biogeochemical function at sub-hourly time-scales has dramatically improved our ability to capture such high-frequency information (Kirchner et al. 2004; Rode et al. 2016). A broad range of sensors is now commercially available and used in a wide variety of applications, including large-scale ecological monitoring networks like the National Ecological Observatory Network and the National Estuarine Research Reserve (NERR) network, compliance monitoring (Xu et al. 2020), and characterizing temporally ephemeral aquatic disturbances and their environmental and management implications (Goodman et al. 2015; Kennish 2019; Regier et al. 2020, 2021; Ball et al. 2021).
Despite the increasingly widespread use of in situ sensors, many key water quality metrics, like nutrients, metals, and biological activity, remain difficult to measure in situ at ecologically relevant concentrations (e.g., Mahmud et al. 2020). There are many reasons why a specific analyte may be difficult to measure in situ, including sensor cost, the labor and costs associated with calibration, maintenance, and consumables, and measurement interferences (Downing et al. 2012; Pellerin et al. 2013; Snazelle 2018; Snyder et al. 2018; Khandelwal et al. 2020). Because of this, many key water quality parameters are still primarily measured manually via grab samples, which limits our ability to spatiotemporally resolve relevant aquatic biogeochemical processes.
An attractive alternative to increasing the temporal resolution of grab sampling is to develop robust relationships between basic water quality parameters measured at high frequency via in situ sensors and time series of grab samples, in order to create high-frequency predictions. As an example, this approach has been successfully applied across a wide range of analytes and ecosystem types to calculate nutrient and pollutant loads from temporally sparse nutrient or pollutant measurements and high-frequency water quality measurements (Pellerin et al. 2014; Regier and Jaffé 2016; Melcher and Horsburgh 2017; Vaughan et al. 2017; Robertson et al. 2018). However, common features of water quality datasets, including multi-collinearity, non-linear relationships, irregular sampling intervals, noise, and data gaps can complicate the use of standard statistical tools like multiple linear regression or auto-regressive moving average models to establish such relationships.
One potential solution to overcome these complications is machine learning algorithms. In particular, the random forest (RF) algorithm has gained significant attention in the aquatic sciences community in recent years as a simple yet effective modeling tool for multivariate water quality datasets (Bisht et al. 2018; Tyralis et al. 2019; Castrillo and García 2020; Codden et al. 2021; Green et al. 2021; Harrison et al. 2021; Maguire et al. 2022). RF, developed by Breiman (2001), constructs a forest of independent decision trees, each of which predicts the dependent variable using a randomized subset of predictors and a random subset of the dataset. The model then predicts the dependent variable using the weighted average prediction of all trees. Although RF models have many advantages over traditional statistical approaches, including robustness to noisy or sparse data and collinear predictors, and the ability to capture non-linear relationships (Tyralis et al. 2019), they assume that samples are independent and identically distributed. This assumption is not met for temporally dependent data, like time series. In addition, there are model parameter decisions that can influence model performance, including how training and testing datasets are split, the algorithms used to construct RF models, and tuning of hyperparameters (parameters that control the learning process).
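As a minimal illustration of this ensemble structure (not a prescription for any particular dataset), the sketch below fits a small forest with the ranger R package and predicts from the averaged trees; the column and object names are hypothetical placeholders.

```r
# Minimal sketch of fitting and predicting with a random forest in R.
# Column names (no3, temp, spcond, do_mgl) and data frames are hypothetical.
library(ranger)

rf <- ranger(
  no3 ~ temp + spcond + do_mgl,  # dependent variable ~ candidate predictors
  data      = train_data,        # training rows
  num.trees = 500,               # number of trees in the forest
  mtry      = 2                  # predictors considered at each split
)

# The ensemble prediction averages the predictions of the individual trees
pred <- predict(rf, data = test_data)$predictions
```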
While there are numerous studies comparing different machine learning algorithms for water quality applications (Olyaie et al. 2017; Bisht et al. 2018; Castrillo and García 2020; Chen et al. 2020; Xu et al. 2020), an in-depth look at how parameter decisions and temporal dependence influence RF models, written for the aquatic science community, is not currently available. Here, we use publicly available long-term time-series datasets of nutrients, meteorology, and water quality to understand how different parameter decisions made when constructing RF models can influence model performance and interpretation. We selected nitrate as our dependent variable, which is a key driver of aquatic biogeochemistry, and the focus of recent machine learning studies in the aquatic sciences (Green et al. 2021; Harrison et al. 2021; Maguire et al. 2022), but our general approach and findings are applicable to virtually any biogeochemical parameter of interest with sufficient data, as well as more generally to modeling of any time- or space-dependent datasets across the biogeochemical and ecological sciences.
Materials and procedures
Data collection
Data were collected from two sites maintained by the National Estuarine Research Reserve (NERR) system, which is operated by the National Oceanic and Atmospheric Administration (NERR 2021): Chesapeake Bay—Virginia (CBV), a brackish estuarine tributary located along the York River in southern Chesapeake Bay, and Old Woman Creek (OWC), a freshwater estuarine tributary to western Lake Erie (Supplementary Fig. S1). Each site contains several monitoring stations organized along an upstream-downstream gradient (Supplementary Fig. S1). We chose these sites as endmembers of the estuarine salinity spectrum, with mean specific conductivity (SpCond) values of 16.7 and 0.5 mS cm−1 for the CBV and OWC datasets, respectively. We also selected these contrasting ecosystems to represent a range of nitrate regimes (Supplementary Fig. S2) and the potential for different dominant drivers between sites (i.e., different predictor importance).
The CBV dataset covers a tributary to the largest estuary in the United States (i.e., Chesapeake Bay), while the OWC dataset represents one of two Great Lakes interfaces represented within the NERR network. Datasets used in this study were collected at both sites starting in 2002 and ending in 2019. A total of 2091 and 3197 nitrate values were available for modeling for CBV and OWC, respectively, although sample sizes vary based on predictor variable sets because rows missing a value for any dependent or predictor variable are removed prior to model construction. Both estuaries drain intensively developed watersheds where nutrient inputs are primarily anthropogenic (Robertson et al. 2018; Ator et al. 2020). They are also influenced by disturbances (both natural and anthropogenic), including storms (such as hurricanes and heavy rainfall) and heavy wind events that drive variability in water level (including seiches at the OWC site) (Miller et al. 2006; Loken et al. 2016).
All NERR data presented are publicly available, along with additional details on data collection and quality assurance, on NERR's data portal (http://cdmo.baruch.sc.edu/). These data are collected at sites representing all geographical regions of the continental US and characterize abiotic, biotic, and land use features of each site (Mills et al. 2008). Briefly, water quality parameters (temperature [Temp], specific conductivity [SpCond], dissolved oxygen [DO], depth, pH, and turbidity) were measured using either 6-series or EXO multiparameter sondes (Yellow Springs Instruments) deployed at multiple stations along each transect. Meteorological data (air temperature, relative humidity, barometric pressure, wind speed and direction, and photosynthetically active radiation) were measured at a single point for each site. Grab samples for nitrate were collected either approximately monthly, or daily by auto-samplers at select stations for certain periods; no in situ nitrate sensor data were available or used at these sites.
Model construction
All models used nitrate as the dependent variable, and models were constructed for each site using all stations rather than station-specific models to increase sample size and allow inclusion of station as a predictor variable. All data preparation, modeling, and model analysis were conducted in R version 4.0.5 (R Core Team 2021). Data (all predictors and dependent variables) were normalized to a mean of 0 and standard deviation of 1 prior to modeling, which improved model fits, helped avoid overfitting higher end values while underfitting lower end values, and helped data conform to the assumption that variables are identically distributed. Models were constructed using either the randomForest or ranger R packages (Liaw and Wiener 2002; Wright and Ziegler 2017). Goodness-of-fit and error metrics were calculated using the hydroGOF R package (Zambrano-Bigiarini 2013) based on the fits obtained from modeling the test data. Our model workflow and parameterization decision points are summarized graphically in Fig. 1. In this study, we define “parameter decisions” as the decision points made prior to and during model construction that (1) could influence model performance and interpretation, and (2) we can manipulate.
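A condensed sketch of this workflow, using the tidymodels interface with the ranger engine and hydroGOF for evaluation, is shown below; the data frame, column names, and nitrate variable ("no3") are placeholders rather than the exact objects used in the study.

```r
# Sketch of the workflow: split, normalize, fit, and evaluate on test data.
# "site_data" and the nitrate column "no3" are assumed for illustration.
library(tidymodels)
library(hydroGOF)

# Non-random (time-ordered) split: the final 10% of rows are held out for testing
split <- initial_time_split(site_data, prop = 0.9)
train <- training(split)
test  <- testing(split)

# Normalize numeric predictors to mean 0, standard deviation 1 (the study also
# normalized the dependent variable prior to modeling)
rec <- recipe(no3 ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

# Random forest specification, here with the ranger engine
spec <- rand_forest(mtry = 3, trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec) %>%
  fit(data = train)

# Goodness-of-fit on the held-out test data
preds <- predict(fit, new_data = test)$.pred
NSE(sim = preds, obs = test$no3)
```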
Predictor variables
We selected three groups of predictors to construct RF models to understand broadly how the number of predictors influences model power, and specifically whether the combination of meteorological and water quality data is better for predicting nitrate than water quality data alone. We calculated all predictor values as the mean of a time-window between the date and time each nutrient sample was collected and 24 h prior, to incorporate antecedent conditions. In addition to the variables described above, we included sine-transformed day-of-year (abbreviated as DOY and calculated as sin(π × DOY/365), following Green et al. 2021 but not exactly as published) and station as predictors for all models to incorporate the seasonal and spatial attributes of the datasets, respectively. We note that all variables used in our models except for station are continuous.
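As one way to implement these two steps, the sketch below computes 24-h antecedent means for a single nutrient sample and the seasonal DOY term; the data frames, column names, and exact form of the DOY transform are assumptions for illustration.

```r
# Sketch of predictor preparation: 24-h antecedent means and a seasonal term.
# Data frame and column names (sensor_data, nutrient_data, datetime_utc,
# sample_time, temp, spcond) are assumed placeholders.
library(dplyr)
library(lubridate)

# For a nutrient sample collected at time t0, average each sensor variable over
# the window [t0 - 24 h, t0] to incorporate antecedent conditions
antecedent_means <- function(sensor_data, t0) {
  sensor_data %>%
    filter(datetime_utc > t0 - hours(24), datetime_utc <= t0) %>%
    summarise(across(c(temp, spcond), ~ mean(.x, na.rm = TRUE)))
}

# Sine-transformed day-of-year: near 0 in winter and near 1 in mid-summer
# (the exact transform used in the study is an assumption here)
nutrient_data$doy_sin <- sin(pi * yday(nutrient_data$sample_time) / 365)
```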
Model structure
We selected two parameters to manipulate the structure of the model: the model package and the split ratio between training and testing datasets. We selected two popular R packages (randomForest and ranger), which are based on the same RF algorithm but have slightly different implementations and default parameters (Liaw and Wiener 2002; Wright and Ziegler 2017): randomForest is based on the original RF algorithm (Breiman 2001), while ranger uses a variety of RF implementations (Wright and Ziegler 2017). We constructed all models using a tidymodels workflow (Kuhn et al. 2022). We selected six different training : testing data split ratios: 0.5, 0.6, 0.7, 0.8, 0.9, and 0.95. A split ratio of 0.7 indicates that 70% of the full dataset is used to train the model, and 30% is reserved for testing how well nitrate is predicted (test data are never seen by the model during construction, see Fig. 1). In addition, we examined how the sampling method used to split the dataset influenced model performance, to understand whether ignoring the temporal structure of our datasets would over-estimate the models' predictive ability. To do this, we followed procedures presented in Kakouei et al. (2022), in which models constructed using random sampling and non-random sampling were compared by goodness-of-fit.
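The two splitting strategies can be expressed with the rsample package (part of tidymodels), as sketched below for a 0.7 training proportion; "site_data" is an assumed data frame ordered by sample date.

```r
# Sketch contrasting random and non-random (time-ordered) splits at a 0.7 ratio.
library(rsample)

# Random split: training rows are drawn from throughout the time series
random_split <- initial_split(site_data, prop = 0.7)

# Non-random split: the first 70% of rows (in time order) train the model and
# the final 30% test how well the model predicts future conditions
time_split <- initial_time_split(site_data, prop = 0.7)

train_random <- training(random_split)
test_random  <- testing(random_split)
train_time   <- training(time_split)
test_time    <- testing(time_split)
```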
Hyperparameters
We also explored two hyperparameters that control how the trees in the RF are created. First, we varied the number of variables considered at each split, referred to as “mtry” by both packages. In essence, mtry controls how many randomly selected predictor variables are available to a tree at each split. We selected mtry values of 2–4 to represent the range of default mtry values for the number of predictors in each of our predictor sets (“wq_predictors” and “met_predictors” = 8, “all_predictors” = 14), and also included mtry values of 1, 5, and 6 to explore how values below and above default values influenced model performance. The ranger package calculates default mtry values by rounding down the square root of the number of variables, while randomForest divides the number of variables by 3, then rounds down (Liaw and Wiener 2002; Wright and Ziegler 2017). We also manipulated the total number of trees (“ntree”) used to construct each RF model. Larger numbers of trees improve model performance at the expense of computational time, a trade-off that is common across many types of machine learning models.
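One way to evaluate these hyperparameters is a simple factorial grid, as sketched below for the mtry and ntree levels described here; the data frames and nitrate column are placeholders, and the other parameter decisions (package, split ratio, predictor set) could be added to the grid in the same way.

```r
# Sketch of looping over a factorial grid of mtry and ntree values and
# recording test-set goodness-of-fit for each combination.
library(ranger)
library(hydroGOF)

param_grid <- expand.grid(
  mtry  = 1:6,
  ntree = c(10, 50, 100, 500, 1000, 5000)
)

results <- do.call(rbind, lapply(seq_len(nrow(param_grid)), function(i) {
  rf <- ranger(no3 ~ ., data = train,              # assumed training data
               mtry      = param_grid$mtry[i],
               num.trees = param_grid$ntree[i])
  pred <- predict(rf, data = test)$predictions     # assumed testing data
  data.frame(param_grid[i, ], nse = NSE(sim = pred, obs = test$no3))
}))
```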
Model evaluation
We examined four different model evaluation metrics to understand how well models predicted nitrate, including two measures of goodness-of-fit (the coefficient of determination [R2] and Nash–Sutcliffe Efficiency [NSE]) and two measures of error (root mean square error [RMSE] and mean absolute error [MAE]). R2 is a metric of the variance associated with an ordinary least squares line, while NSE is a metric of the variance associated with a 1:1 line. For this reason, we focused our evaluation of model goodness-of-fit on NSE, as it is a more appropriate metric for assessing how accurately a model predicts the dependent variable. However, we also included R2 values for context as a more commonly used goodness-of-fit metric in many fields. In addition to metrics measuring how well the models predict nitrate, we examined useful information gleaned from the models, including the relative importance of each predictor (“feature”) to a model, calculated via the Gini index (“feature importance”), and the relationship between individual predictors and nitrate developed by each model (“partial dependency”), to understand how parameterization decisions influence model interpretation. Finally, although we constructed all models presented in this study using the three groups of predictor variables explained above, we assessed variable selection algorithms for the wq_predictors variable set using the VSURF (Genuer et al. 2015) and Boruta (Kursa and Rudnicki 2010) R packages. Briefly, VSURF proposes a set of variables by first eliminating irrelevant variables, then interpreting the response of variables, and finally refining selected variables to remove redundancy (Genuer et al. 2015). Boruta is a wrapper algorithm that selects variables by duplicating the dataset, shuffling values in each column, then training and evaluating changes in model importance (Kursa and Rudnicki 2010).
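For reference, the four evaluation metrics described above can be computed from test-set predictions with hydroGOF and base R, as sketched below for assumed vectors of predicted ("pred") and observed ("obs") nitrate.

```r
# Sketch of the four evaluation metrics for test-set predictions.
library(hydroGOF)

nse_val  <- NSE(sim = pred, obs = obs)   # 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
r2_val   <- cor(pred, obs)^2             # variance explained relative to an OLS fit
rmse_val <- rmse(sim = pred, obs = obs)  # root mean square error
mae_val  <- mae(sim = pred, obs = obs)   # mean absolute error
```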
Statistics
Comparisons of means between two groups (e.g., when comparing goodness-of-fit between two parameter choices across all models) were conducted using Wilcoxon tests, while comparisons of means between more than two groups were conducted using Kruskal–Wallis tests, all calculated in R. Both tests are non-parametric and do not assume normally distributed data. We used a significance threshold of p < 0.05, and report exact p values for p > 0.0001 except when multiple p values are involved.
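For reference, both tests are available in base R; the sketch below assumes a data frame of model results ("model_results") with an NSE column and grouping factors.

```r
# Sketch of the significance tests applied to model goodness-of-fit values.
# "model_results", "nse", "strategy", and "ratio" are assumed names.
wilcox.test(nse ~ strategy, data = model_results)  # two groups (e.g., random vs. non-random splitting)
kruskal.test(nse ~ ratio, data = model_results)    # more than two groups (e.g., six training:testing ratios)
```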
Assessment
Temporal dependence and dataset splitting
We explored the role of temporal dependence on RF performance by splitting training and testing data randomly and non-randomly prior to model construction, following Kakouei et al. (2022). Fig. 2A,B visualizes how these two sampling strategies split a dataset into training and testing subsets. Random sampling ignores any temporal structure when splitting, meaning the model is trained on portions of all years within the dataset, and therefore may not be capable of accurately predicting future behavior (i.e., Kakouei et al. 2022). In contrast, non-random sampling explicitly reserves a cohesive subset of data at the end of the time series for testing the model (e.g., Fig. 2B), and therefore provides a more accurate representation of the model's ability to predict future behavior.
We quantified the impact of splitting strategy on model performance by constructing 1296 model permutations (see Table 1 for parameter decisions) using random and non-random splitting for each dataset, then compared goodness-of-fit using NSE values (Fig. 2C). Average NSE values were significantly lower (p < 0.0001 for both CBV and OWC datasets) for non-random splitting than random splitting, indicating that temporal dependence within these datasets led to over-estimation of the models' predictive power using a random splitting strategy. For the CBV dataset, median NSE for models constructed using random splitting was 0.74, while median NSE for models constructed using non-random splitting was 0.45 (Supplementary Table S1). For the OWC dataset, median NSE for models constructed using random splitting was 0.63, while median NSE for models constructed using non-random splitting was 0.12 (Supplementary Table S1). Equivalent values for R2, MAE, and RMSE are presented in Supplementary Table S1, and show similar patterns, where non-random splitting yields a decrease in goodness-of-fit and increase in model error.
Table 1. Model parameter decisions, the levels tested for each, and the rationale for including them.

| Parameter | Levels | Rationale |
| --- | --- | --- |
| Predictor variables | Water quality, meteorology, both | Random forests only use a subset of predictors for each tree |
| Model package | randomForest, ranger | Different packages use different implementations and default parameters |
| Training : testing ratio | 0.5, 0.6, 0.7, 0.8, 0.9, 0.95 | The ratio can influence over-fitting or under-fitting |
| Variables per split (mtry) | 1, 2, 3, 4, 5, 6 | The number of variables for each split influences what data a tree receives |
| Number of trees (ntree) | 10, 50, 100, 500, 1000, 5000 | Represents a trade-off between model performance and computational time |
Because random sampling significantly improved model fits, we determined that splitting strategy played an important role in model construction. We note that for applications where the goal is developing high-frequency estimates for an existing dataset, random splitting is appropriate and a substantially more powerful approach. However, for time series, where prediction is often the goal, a non-random splitting strategy provides a more accurate representation of model power for future time-periods. Thus, all models for the remainder of this study were constructed using non-random splitting only to account for the temporal dependence inherent in our datasets.
The importance of model parameter choices
To assess the relative importance of the five parameter decisions listed in Table 1, we conducted non-parametric Kruskal–Wallis tests to look for significant differences between NSE for each level of each parameter within each dataset (Fig. 3). For predictor variables, we found that models constructed with water quality predictors (“wq_predictors”) significantly outperformed models constructed with meteorological predictors (“met_predictors”) based on NSE values for both datasets (p < 0.0001). In addition, “wq_predictors” models significantly outperformed models constructed with the combination of water quality and meteorological predictors (“all_predictors”) for both datasets (p < 0.0001). We did not observe any significant differences between the model packages used (Fig. 3).
We also found that the proportion of training to testing data was an important consideration (Fig. 3). CBV dataset models had the highest NSE values for training : testing ratios of 0.8 and 0.9 (0.59 for both), which significantly outperformed ratios of 0.5 and 0.6 (all p values < 0.0008), while the 0.9 ratio also significantly outperformed the 0.7 and 0.95 ratios (p = 0.0004 and 0.003, respectively). For OWC, all combinations of training : testing ratios were significantly different, with the highest median performance for 0.95 (median NSE = 0.54) followed by 0.9 (median NSE = 0.51). OWC dataset models using a 0.9 ratio significantly outperformed models using a 0.8 ratio (p = 0.0315) but not a 0.7 ratio (p = 0.1900). It is important to note that the training : testing ratio controls training dataset size, where more data available to train a model on a wider range of conditions likely results in a stronger model.
For the hyperparameters, we observed significant differences for variables per split (mtry) for CBV (p < 0.0001) but not for OWC, while significant differences between the number of trees (ntree) were observed for both datasets (p < 0.01 for CBV, p < 0.001 for OWC). We observed strongest model performance for CBV using the maximum mtry values of 5 and 6 (median NSE = 0.585 and 0.586, respectively), while strongest OWC model performance was observed for mtry values of 4 (median NSE = 0.532) followed closely by mtry = 6 (median NSE = 0.531). For number of trees (ntree), we observed strongest model performance using ntree = 100 for both datasets (median NSE = 0.575 and 0.526 for CBV and OWC, respectively). However, we note an apparent plateau in model performance based on increasing ntree values, where all models with ntree >= 100 had median NSE values > 0.57 and 0.52 for CBV and OWC, respectively.
The other metric we used to assess goodness-of-fit (R2) agreed with NSE, with significant differences between predictor variable sets for both datasets, and significant differences between training : testing ratios for the CBV dataset but not the OWC dataset (Supplementary Fig. S3). The two metrics used to assess model error (MAE and RMSE) also agreed with NSE results, where predictor variable choices yielded significantly different error for both datasets. Error metrics did deviate from patterns observed for both goodness-of-fit metrics for training : testing ratios, where both error metrics found significant differences between training : testing ratios for the OWC dataset, but not the CBV dataset (Supplementary Fig. S3).
The importance of model parameter choices on feature importance
One of the primary advantages of RFs relative to other machine learning algorithms, which often function essentially as “black-box” models, is the relative transparency of these algorithms and their output (Visser et al. 2022), which allows us to estimate the relative importance of each predictor to the model outcomes (“feature importance”). We calculated feature importance for the 864 model parameterizations using water quality predictors for each dataset to assess how similar feature importance was across two equivalent datasets collected in contrasting environments, and how much importance overlapped between predictor variables (features).
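Feature importance can be extracted directly from a fitted forest; the sketch below uses ranger's impurity-based importance and expresses each predictor's importance as a percentage of the total, with object and column names assumed for illustration.

```r
# Sketch of extracting feature importance from a ranger fit and converting it
# to relative (%) importance for comparison across models.
library(ranger)

rf <- ranger(no3 ~ ., data = train, num.trees = 500,
             importance = "impurity")       # record split-impurity importance

imp <- rf$variable.importance
imp_pct <- 100 * imp / sum(imp)             # relative importance (%)
sort(imp_pct, decreasing = TRUE)
```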
We found different orders of variable importance between CBV and OWC dataset models, as expected based on their different estuarine dynamics. In the models constructed for the CBV dataset, SpCond was consistently the most important variable (median: 34%), consistent with the importance of tidal dynamics in the system, followed by Temp (median: 19%) and pH (median: 14%). For models constructed with the OWC dataset, Temp was the most important variable (median: 19%), followed closely by Turbidity (median: 18%) and DOY (median: 16%). We note a contrast between datasets, where the difference in median importance between the top three OWC variables is 3% compared to 20% for CBV (Fig. 4).
For both datasets, we observed overlap in feature importance between predictors, as represented by error bars. However, predictors from the OWC dataset models exhibited grouping of parameters (e.g., similar average importance for Temp and Turbidity, Depth and DO, and SpCond and pH), while predictors from the CBV dataset models decreased consistently in median importance. We note that these patterns are interpreted visually, because feature importances differed significantly between all pairs of variables (pairwise Wilcoxon tests, p < 0.0001) for both datasets. Relatively larger differences in median importance between adjacent predictors in Fig. 4A suggest that, although error bars overlap, the order of variables is fairly consistent across CBV dataset models. In contrast, the grouping behavior for OWC dataset models (Fig. 4B), with several coupled parameters, indicates that different models resulted in different orderings of predictor feature importance. Because the importance of predictor variables provides a quantitative assessment of the dominant drivers of the dependent variable, changes in the order of feature importance influence biogeochemical interpretation of RF models.
The importance of model parameter choices on partial dependency
We also explored how model parameter choices influenced model-derived relationships between nitrate and select predictor variables using partial dependency plots constructed from all model parameterizations using water quality variables (as explained in the previous section). We selected the most important predictors for each dataset (SpCond and Temp for CBV and OWC datasets, respectively), as well as DOY to explore how consistent predictor-dependent relationships were based on RF model parameter choices.
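Partial dependency can be computed directly from any fitted forest by varying one predictor over a grid while holding the remaining data at observed values and averaging the predictions; a minimal sketch for SpCond is shown below, assuming a fitted ranger model ("rf") and the placeholder training data used in earlier examples.

```r
# Sketch of a manual partial-dependence calculation for one predictor (SpCond).
pd_grid <- seq(min(train$SpCond), max(train$SpCond), length.out = 50)

partial_dependence <- sapply(pd_grid, function(x) {
  newdata <- train
  newdata$SpCond <- x                               # fix SpCond at grid value x
  mean(predict(rf, data = newdata)$predictions)     # average prediction over all rows
})

plot(pd_grid, partial_dependence, type = "l",
     xlab = "SpCond (normalized)", ylab = "Predicted nitrate (normalized)")
```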
Models for the CBV dataset predicted a very consistent decreasing relationship between nitrate and SpCond, with the largest variation between models at lower SpCond values (Fig. 5A). In contrast, models for the OWC site predicted a wide range of different nitrate values (Fig. 5A), which is consistent with the low importance of SpCond in OWC models (Fig. 4). For Temp, consistent decreases in nitrate were observed with increasing temperature, where CBV exhibited a similar pattern to SpCond, while OWC nitrate decreased more rapidly at temperatures > ~17°C (Fig. 5B). For DOY, CBV models predicted a fairly consistent seasonal pattern with lower nitrate in the summer (DOY values near 1) and higher nitrate values in the winter (DOY values near 0). In contrast, OWC models predicted higher nitrate values in the winter than in the shoulder seasons, but a marked increase in nitrate during the summer. The consistency (lack of variability between models) of the relationships between SpCond and nitrate for CBV and between Temp and nitrate for OWC supports the patterns of median predictor importance in Fig. 4.
Discussion
Temporal dependence and RFs
Temporal dependence is a common feature of water quality datasets, and an important consideration when constructing RF models. By investigating two data splitting strategies, one of which accounted for the temporal structure of our time series (non-random sampling) and one which did not (random sampling), we showed that temporal dependence can substantially impact model performance and that the impact of temporal dependence varied across different types of sites/ecosystems (Fig. 2). Several approaches for incorporating temporal dependence into RF models exist, including addition of lagged variables (e.g., Kane et al. 2014), blocked bootstrapping (e.g., Hornung and Wright 2019; Goehry 2020), and time-aware dataset splitting (e.g., Kakouei et al. 2022). Because of large gaps in our datasets and irregular temporal spacing between nitrate samples, which are both common features of environmental time series, we could not easily apply lagged variables without gap filling, and blocked bootstrapping could be misleading (i.e., blocking across gaps). Our results indicate that models constructed from our time series using random splitting of training and testing datasets over-estimated the ability of RFs to predict nitrate concentrations (Fig. 2C). While this may not be an issue if the purpose of an RF is to develop high-frequency estimations for historical data, any attempt to forecast using random sampling is likely problematic. For the CBV site, which has consistent and simple seasonal patterns in nitrate (Supplementary Fig. S2), the drop in NSE between random and non-random splitting is smaller (Fig. 2). We attribute this to the ability of random samples to accurately characterize the seasonal patterns present in nitrate (Figs. 5B, S2), and thus more accurately predict them in the testing dataset. In contrast, nitrate variability at the OWC site has a more complex seasonal pattern (Figs. 5B, S2), which may help explain why the drop in NSE between random and non-random splitting is much larger for OWC models than CBV models (Fig. 2C). These findings suggest that as the complexity of temporal patterns in system behavior increases, it becomes increasingly important to adequately address temporal dependence when evaluating model performance. We note that these findings are not only relevant to temporally dependent data, but are also important to consider for other types of data with autocorrelation (e.g., spatially correlated data).
Variable selection is key for optimizing model fits
Of the five parameters we manipulated when constructing our RF models, we found predictor variables to be the most significant (Fig. 3). Interestingly, water quality predictors alone outperformed water quality and meteorological predictors combined. This becomes intuitive when thinking about how RFs are constructed, where each tree is built on a subset of variables (controlled by mtry). It is clear that meteorological variables are weaker predictors of nitrate compared to water quality variables (Fig. 3A), and it follows that, in essence, the dataset with all predictors is “diluted” with these less important variables, which are used to construct some of the trees. It is also important to note that some meteorological variables (e.g., rainfall, air temperature, and solar radiation) likely drive nitrate on larger spatial or temporal scales than what is represented in our modeling framework.
Because our initial model fits only used specific groups of variables (Table 1), we explored whether all water quality variables were useful by rerunning models with a common parameter set (package = ranger, mtry = 3, ntree = 500, train : test split proportion = 0.9), but for all possible combinations of 1–8 predictor variables. We present the results in Fig. 6, grouped based on whether a given model included the most important predictor for that dataset, as determined by Fig. 4 (SpCond and Temp for CBV and OWC, respectively). The results provide three important considerations for determining the optimal number of predictors for an RF model. First, based on median goodness-of-fit values, more variables means stronger average model performance. Second, based on maximum goodness-of-fit values, individual models with 3–7 predictors are capable of out-performing the 8-predictor model across both datasets (Fig. 6). This agrees with results presented in Fig. 3A, which indicated that adding more predictors (i.e., meteorological variables) decreased model goodness-of-fit, and reiterates the importance of carefully selecting input variables a priori.
Third, the relationship between models including and excluding the most important predictors differed somewhat between datasets. For both datasets, models including the most important predictor variable generally improve model performance, as would be expected. However, including SpCond dramatically improves performance for CBV models (e.g., for models with 3–7 predictors, average NSE values for models including SpCond are almost double the average NSE values for models excluding SpCond in Fig. 6A), while including Temp in OWC models increases average NSE values more modestly (Fig. 6B). We attribute this difference between datasets to the difference in importance between the most important and second-most important predictors, which is smaller for OWC models than for CBV models (Fig. 4), and which may be linked to the different mechanisms that SpCond and Temp represent in these systems. In CBV, SpCond likely represents both the seasonal patterns of freshwater inputs and tidal (sub-diurnal to monthly) variations. In contrast, Temp in OWC primarily represents broad seasonal trends, which may also be partially captured by other variables in the dataset.
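The exhaustive screening described above can be implemented with base R's combn(); the sketch below assumes the eight water quality predictors and the placeholder data frames used in earlier examples, and caps mtry at the number of predictors for the smallest sets.

```r
# Sketch of screening every combination of 1-8 predictors with a fixed
# parameter set and recording test-set NSE for each model.
library(ranger)
library(hydroGOF)

predictors <- c("Temp", "SpCond", "DO", "Depth", "pH", "Turbidity", "DOY", "Station")

combo_nse <- lapply(seq_along(predictors), function(k) {
  combos <- combn(predictors, k, simplify = FALSE)
  sapply(combos, function(vars) {
    f  <- reformulate(vars, response = "no3")          # no3 ~ var1 + var2 + ...
    rf <- ranger(f, data = train,
                 mtry = min(3, length(vars)),          # mtry cannot exceed predictor count
                 num.trees = 500)
    NSE(sim = predict(rf, data = test)$predictions, obs = test$no3)
  })
})
```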
Variable selection for RFs remains a topic of discussion (Genuer et al. 2010; Degenhardt et al. 2019; Probst et al. 2019), and there are several programmatic options in R to dynamically select the optimal set of predictor variables. We used two popular R packages (VSURF, Genuer et al. 2015; and Boruta, Kursa and Rudnicki 2010) to select important variables, with Boruta indicating all eight predictors were important for both datasets, while VSURF selected six variables for the CBV site and eight variables for the OWC site. This is relatively consistent with the median NSE values in Fig. 6, which indicate that using eight predictors gives us the highest average goodness-of-fit. However, if the goal is maximum goodness-of-fit for a single model, the more computationally intensive approach of constructing models with all possible combinations of predictors (e.g., 3–7 predictors outperformed all eight predictors for individual models in Fig. 6) appears to be a better choice.
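Both packages can be called with a single function, as sketched below; the predictor matrix, response vector, and element names are assumptions based on each package's documented interface.

```r
# Sketch of automated variable selection with VSURF and Boruta.
# "x" is a data frame of candidate predictors and "y" the nitrate vector.
library(VSURF)
library(Boruta)

vsurf_fit  <- VSURF(x = x, y = y)                 # elimination, interpretation, prediction steps
vsurf_vars <- colnames(x)[vsurf_fit$varselect.pred]

boruta_fit <- Boruta(no3 ~ ., data = train)       # compares predictors to shuffled "shadow" copies
boruta_vars <- getSelectedAttributes(boruta_fit)
```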
Parameter decisions and model interpretations
Parameter decisions (see Fig. 2; Table 1) influenced both feature importance and partial dependency plots, with dataset-specific impacts (Figs. 4, 5). For the CBV dataset, a clear order of variable importance is observed, indicating that variables generally maintain their order of importance independent of model parameter decisions. In addition, the minimum importance of SpCond across all CBV models (16%) exceeded the maximum importance of DOY, Depth, Turbidity, and Station (Fig. 4A). For OWC, while Temp is the most important predictor on average, the minimum importance of Temp in any OWC model (12%) is lower than the maximum importance of all variables except SpCond and Station (Fig. 4B). Thus, our results suggest that the structure of a dataset influences the stability of the order of variable importance across the spectrum of parameter decisions examined in this study.
Partial dependency plots in Fig. 5 support feature importance results in Fig. 4, where the variability in responses matches median feature importance patterns. For example, the clear and consistent SpCond-nitrate relationships across all CBV models match its role as the most important variable for the dataset, while the variability in SpCond-nitrate relationships for OWC matches its relatively low importance for modeling OWC nitrate. Likewise, we observe similar variability in Temp-nitrate relationships between CBV and OWC (Fig. 5B), matching the similarity in median predictor importance of Temp for CBV (19.4%) and OWC (18.8%).
Partial dependency patterns also highlight the differences in ecological drivers between these datasets, where SpCond-nitrate relationships match primary ecosystem drivers: tidal dilution of freshwater nitrate sources for the CBV dataset, compared with higher conductivity in nitrate-enriched runoff at the OWC site relative to the lower conductivity waters of Lake Erie (Fig. 5A). Similarly, seasonal nitrate patterns by DOY show summer depletion of nitrate for CBV, potentially attributed to increased biological uptake (Fisher et al. 1992, 1999), while nitrate spikes during spring and early summer for OWC (Fig. 5B), potentially linked to seasonal agricultural runoff dynamics or in situ denitrification and hydrologic dynamics (McCarthy et al. 2007).
In general, we observe that parameter decisions do impact how models are interpreted, both in terms of how important individual predictors are to the model (Fig. 4) and in terms of the relationships between the dependent variable and individual predictors (Fig. 5). However, using the range of parameter decisions in Table 1, models for both datasets generally converged on similar variable importance and partial dependency relationships.
Limitations
While the models we constructed illustrate how parameter decisions impacted the importance of individual predictors and the relationships between predictors and the dependent variable, they also have a number of limitations that warrant further exploration. First, while the ecosystems explored represent contrasting environments and data structures (e.g., gap frequency/size and ranges of values), they come from a common monitoring strategy and represent only two complete time-series datasets. Applying the same exploration of parameter importance to a wider variety of datasets across different monitoring programs, particularly with a wider range of potential predictors, would provide a more robust generalization of “best practices” when using RF models for water quality and biogeochemical modeling across systems and scales. Second, we focused on two RF packages that work “out-of-the-box” with a tidymodels workflow. Alternative RF adaptations, including rangerts (Goehry et al. 2021), a package that allows block bootstraps, and iRF (Basu et al. 2017), a package that constructs iterative random forests that weight trees and can detect variable interactions, hold promise for improving model performance for data with complex structural features like temporal dependence, gaps, and irregular intervals.
There are also additional steps that could better represent the characteristics of our datasets but were beyond the scope of this study. Using the ranger package, it is possible to control the weight of a variable when variables are randomly selected, which could be used to force time (e.g., DOY) into all trees of a model. While this may better incorporate the temporal dependency in a dataset, it also biases the random selection of variables, and may interfere with model performance/interpretation. In addition, we treated station as an ordered factor, although incorporating it as a random effect may be more appropriate, and we note that there are extant adaptations of RF that may be useful for expanding our approach (e.g., Capitaine et al. 2021).
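As a sketch of the variable-weighting option mentioned above (assuming ranger's split.select.weights and always.split.variables arguments behave as documented, and using the placeholder objects from earlier examples):

```r
# Sketch of biasing or forcing variable selection in ranger.
library(ranger)

# Force DOY to be a candidate at every split of every tree
rf_forced <- ranger(no3 ~ ., data = train, num.trees = 500,
                    always.split.variables = "DOY")

# Alternatively, up-weight the probability that DOY is selected as a candidate
# (one weight per predictor, in the order the predictors appear in the data)
wts <- ifelse(colnames(train)[colnames(train) != "no3"] == "DOY", 1, 0.5)
rf_weighted <- ranger(no3 ~ ., data = train, num.trees = 500,
                      split.select.weights = wts)
```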
Recommendations
Based on the datasets and parameterization choices we explored, the single most important decision was splitting strategy, particularly when working with temporally dependent data. We recommend comparing random and non-random splitting strategies to understand temporal dependence (sensu Kakouei et al. 2022) prior to making other model decisions. Second, we found that predictor variables had significant impacts on model goodness-of-fit, and that “less is more” in the case of our datasets, where water quality predictors alone outperformed a combination of water quality and meteorological predictors (Fig. 3A), and that specific subsets of water quality predictors outperformed the full predictor set (Fig. 6). Automated approaches to variable selection like VSURF and Boruta are valuable tools to remove variables that do not significantly contribute to models, but our results suggest that following automated steps with a comprehensive screening of all variable combinations (which could also be easily automated) can further improve model performance. With the limitations mentioned above in mind, we suggest that the overall approach described here, and summarized in Fig. 1, is a valuable starting point for making decisions on how to parameterize RF models for predicting time series or other time- or space-dependent datasets.
Open Research
Data availability statement
Data and metadata are publicly available at https://cdmo.baruch.sc.edu/, and all scripts for models and figures are publicly available at https://github.com/COMPASS-DOE/rf-synthesis.
References
Acknowledgments
The authors would like to thank Ben Bond-Lamberty for helpful feedback on an earlier version of this manuscript and the NOAA-NERR program for producing the datasets used for our analyses. Data were collected via the NERR System-wide Monitoring Program (https://cdmo.baruch.sc.edu/) for OWC and CBV reserves. This research is based on work supported by COMPASS-FME, a multi-institutional project supported by the U.S. Department of Energy, Office of Science, Biological and Environmental Research as part of the Environmental System Science Program. The Pacific Northwest National Laboratory is operated for DOE by Battelle Memorial Institute under contract DE-AC05-76RL01830.
Conflict of Interest
The authors declare no conflict of interest.