Volume 8, Issue 3 p. 419-452
Data Article
Open Access

MacroSheds: A synthesis of long-term biogeochemical, hydroclimatic, and geospatial data from small watershed ecosystem studies

Michael J. Vlah
Duke University, Durham, North Carolina, USA

Spencer Rhea
Duke University, Durham, North Carolina, USA

Emily S. Bernhardt
Duke University, Durham, North Carolina, USA

Weston Slaughter
Duke University, Durham, North Carolina, USA

Nick Gubbins
Colorado State University, Fort Collins, Colorado, USA

Amanda G. DelVecchia
Duke University, Durham, North Carolina, USA

Audrey Thellman
Duke University, Durham, North Carolina, USA

Matthew R. V. Ross (Corresponding Author)
Colorado State University, Fort Collins, Colorado, USA
Correspondence: [email protected]

First published: 26 April 2023
Citations: 2
Associate editor: Jordan Stuart Read

Author Contribution Statement: ESB and MRVR originated the project and defined its scope and goals. MJV, MRVR, and SR designed the data processing system architecture. MJV, SR, WS, and NG developed the data processing system, with routine feedback from MRVR, ESB, and all other authors. Visualizations associated with this paper and the MacroSheds portal were also designed by the full team. SR, MJV, WS, and NG implemented the macrosheds R package. MJV, ESB, MRVR, and SR wrote the paper with edits from the team. MJV and SR generated the figures.

Data availability statement: Data and metadata are available on the Environmental Data Initiative repository at https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=1262


The US Federal Government supports hundreds of watershed monitoring efforts from which solute fluxes can be calculated. Although instrumentation and methods vary between studies, the data collected and their motivating questions are remarkably similar. Nevertheless, little effort toward their compilation has previously been made. The MacroSheds project has developed a future-friendly system for harmonizing daily time series of streamflow, precipitation, and solute chemistry from 169+ watersheds, and supplementing each with watershed attributes. Here, we describe the breadth of MacroSheds data, and detail the steps involved in rendering each data product. We provide recommendations for usage and discuss when other datasets might be more suitable. The MacroSheds dataset is an unprecedented resource for watershed science, and for hydrology, as a small-watershed supplement to existing collections of streamflow predictors, like CAMELS and GAGES-II. The MacroSheds platform includes a web dashboard for visualization and an R package for data access and analysis.

Scientific Significance Statement

Watershed ecosystem monitoring has been underway for more than five decades, producing hundreds of long-term records of streamflow, water chemistry, and their environmental controls. Within the last decade, data synthesis efforts have provided a basis for continental-scale hydrologic analysis of watersheds larger than 10 km². However, to date there has been no synthesis of small-watershed hydrology and water chemistry that would allow for comparison of chemical concentration and flux on a similar scale. MacroSheds is an ongoing synthesis of small-watershed datasets that enables the search for general principles describing functional capacity across watersheds, including relative rates of weathering and chemical processing, and responses to climate change.

URL of the dataset and metadata with permanent identifier: https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=1262

Code URL with permanent identifier: https://zenodo.org/record/7633926

Measurement(s): 185 distinct stream chemistry variables and 63 distinct watershed attributes (climate, hydrology, geology, terrain, vegetation, soil, landcover).

Technology type(s): remote sensing, long-term dataset synthesis.

Temporal range: 1963–2022.

Frequency or sampling interval: daily.

Spatial scale: Watershed site-based data synthesized for 169 gauged stream sites; water chemistry and/or streamflow for 495 sites primarily across North America at the time of this publication.

Background and motivation

Watershed ecosystem science began in the late 1960s, when Herb Bormann and Gene Likens began estimating precipitation inputs and stream water exports for small gauged watersheds in the Hubbard Brook Experimental Forest (Bormann et al. 1968, 1969). These input and output fluxes and their differences were used to detect trends in air pollution, climate, rates of chemical weathering, nutrient limitation, and nutrient saturation, and to quantify the magnitude, duration, and severity of disturbance effects on ecosystem element retention and loss (Likens 2013). All of these insights were gained from the consistent comparison of precipitation and streamflow volumes and chemistry conducted over long time scales. The simplicity of the watershed ecosystem approach and the magnitude of its scientific impact have led to similar watershed ecosystem studies being conducted in thousands of watersheds around the globe.

Altogether, hydrology labs and experimental forests operated by the US Forest Service, Department of Energy, and the National Science Foundation's Long Term Ecological Research, National Ecological Observatory Network (NEON), and Critical Zone Collaborative Network (CZNet, formerly CZO) programs, support hundreds of small watershed studies around the United States (Fig. 1). Each of these programs collects nearly identical types of data. Yet to date, there has been no attempt to collate these datasets into a synthetic data platform that would facilitate comparison across sites. The notable examples where cross-site analyses have been performed (Williard et al. 1997; Kaushal et al. 2014; Zhang et al. 2017) have been limited in spatial scope or applied to only one element (like N) or general water balance. Each of these individual efforts required significant supplemental funding and data expertise to enable synthesis. Kaushal et al. (2014) found that processing and retention of carbon and nitrogen varied significantly on a scale of kilometers, stressing the need for more studies across spatial scales. Synthesis work by Zhang et al. (2017) yielded important insights on cross-scale hydrologic response to forest changes using routine statistical tests, but synthesizing data used in those tests required a much larger effort. Differences in data structure, access method, time and location representation, and other challenges inherent to merging even relatively consistent datasets have ultimately limited the scale of inference in watershed ecosystem science.

Fig. 1. Locations of watershed biogeochemical records included in version 1 of the MacroSheds dataset. Colors represent EPA ecoregions. Additional sites in Sweden and Antarctica are not associated with EPA ecoregions and are not shown. Please visit macrosheds.org for an interactive map of sites.

Indeed, watershed scientists have become increasingly self-critical, recognizing the failure of our community to develop generalities and theories that apply across scales (McDonnell et al. 2007; Kirchner 2009; Lohse et al. 2009). Much of recent watershed science has focused on gaining ever finer detail on the spatial and temporal heterogeneity of flow paths, water residence times, and biogeochemical processes (McClain et al. 2003; Bernhardt et al. 2017). This fine-scale focus has identified many unique idiosyncrasies of individual watersheds but has not helped us develop general theories about watershed dynamics that can be applied at regional to global scales. It is a fair critique to suggest that most watershed ecosystem studies remain rather parochial, involving detailed studies of individual or paired watersheds, or surveys of a small set of attributes across multiple watersheds. Macroscale watershed science, or the search for general principles that describe functional capacity and behavior across watersheds, has been limited. A major reason for this lack of large-scale focus is the challenge of data access and integration across sites. New requirements for data sharing have made it possible to access most National Science Foundation (NSF)-funded watershed science data, yet individual datasets are rarely interoperable across research sites, even when stored in the same repositories.

We find inspiration for harmonizing large datasets in the hydrology community, where there are two major modern efforts to synthesize records of discharge, precipitation, and watershed/catchment attributes: GAGES-II and CAMELS (Falcone 2011; Newman et al. 2014; Addor et al. 2017). GAGES-II provides geospatial data and classifications (reference vs. nonreference) for the watersheds of 9322 US Geological Survey (USGS) stream gages. The CAMELS dataset builds on progress from GAGES-II by identifying 671 minimally disturbed watersheds, compiling their precipitation and runoff time series, and generating watershed attributes for each. Though preeminent examples of data aggregation and distribution, these datasets are limited in their scope to physical hydrology, mostly in watersheds too large to meet the assumptions of the watershed ecosystem concept, that is, uniform geology and a minimally permeable base of rock or permafrost (Fig. 2; Bormann and Likens 1969). When those assumptions are satisfied, by contrast, it is possible to construct budgets of inputs, outputs, and net loss or gain for countless solutes of ecological importance. Still, CAMELS and GAGES-II provide a roadmap for synthesizing analysis-ready data for macroscale watershed ecosystem work. With 500 combined citations, they also demonstrate the value of such syntheses to the hydrology community. These datasets have enabled foundational shifts in the ways we make predictions at scale, especially through recent machine-learning advances in rainfall-runoff modeling (Kratzert et al. 2018, 2022). MacroSheds opens this landscape of opportunity to the biogeochemistry community.

Fig. 2. Comparison of watershed areas as represented in the MacroSheds, CAMELS, and GAGES-II datasets. Each vertical bar represents a single watershed, but note that pink and blue bars have been widened for visibility. The tail of the pink arrow marks the upper limit of MacroSheds watershed areas. The MacroSheds dataset fills out two orders of magnitude at the small end, with 122 watersheds under 10 km² and 68 under 1 km². For CAMELS, these numbers are 8 and 0, respectively. For GAGES-II, they are 207 and 2. Only those MacroSheds sites for which discharge data are publicly available are included in this figure.

Our primary goal in developing the MacroSheds dataset is to merge all US federally funded watershed ecosystem studies into a common platform, and to use that platform to develop a classification of watershed ecosystems that identifies differences in watershed functional traits (sensu McDonnell et al. 2007). Understanding these functional traits will allow us to predict how watershed biogeochemical cycles will respond to changing patterns of climate and element deposition. Ultimately, we hope that macroscale watershed science can build a mechanistic understanding of how variation in soil chemistry and biological demand for elements will alter the stoichiometric ratios of watershed outputs relative to inputs (in deposition and weathering). Merging records from hundreds of watershed ecosystem studies into a common format is the first step in developing macroscale watershed science. With this feat accomplished, a nearly limitless number of questions can be asked by researchers across the disciplines of hydrology, climate science, and ecology. We aim to facilitate these analyses through the MacroSheds dataset, R package, and web portal, which together constitute an open data platform.

In the MacroSheds dataset, we have unified publicly available data records of precipitation, streamflow, precipitation chemistry, and stream chemistry from watershed ecosystem studies that meet a requirement of at least monthly stream chemistry sampling. We used a common procedure to delineate the watersheds of any gauged stream sites without published boundaries, and daily, gridded climate data from PRISM (Daly et al. 2008) and Daymet (Thornton et al. 2020) to provide standardized estimates of precipitation, air temperature, and other climatological parameters within each watershed boundary. For each delineated watershed we summarized publicly available, gridded products encompassing topography, geology, soil, vegetation, and landcover attributes. A subset of watershed summary statistics and climate forcings included with the MacroSheds dataset are immediately commensurable with those of the published CAMELS dataset. MacroSheds therefore functions secondarily as a supplement to CAMELS, enhancing the predictive power of the combined set, especially for small watersheds.

Data description

Access methods and dataset contents

The MacroSheds dataset and all associated documentation can be found on the Environmental Data Initiative (EDI) data portal, at https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=1262. This URL will always point to the most recent dataset version, and at the time of this writing is synonymous with https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=1262&revision=1 (Version 1). When new versions are published, the old versions will still be accessible by appending a version number to the end of the base URL in the above fashion. Throughout our current funding cycle, we intend to update this dataset annually with newly available data.
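
The versioning rule described above amounts to appending a revision parameter to the base URL. A minimal Python sketch follows; the helper name `edi_url` is our own invention for illustration and is not part of any MacroSheds or EDI tool.

```python
BASE_URL = "https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=1262"


def edi_url(revision=None):
    """Return the EDI landing-page URL, optionally pinned to a revision.

    Hypothetical helper: only the URL pattern comes from the text above.
    """
    if revision is None:
        # The bare URL always points to the most recent dataset version.
        return BASE_URL
    if not isinstance(revision, int) or revision < 1:
        raise ValueError("revision must be a positive integer")
    return f"{BASE_URL}&revision={revision}"


print(edi_url())   # latest version
print(edi_url(1))  # pinned to Version 1
```

Pinning the revision in analysis scripts keeps results reproducible after newer dataset versions are published.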

The dataset can also be downloaded through the “macrosheds” package for R (https://github.com/MacroSHEDS/macrosheds; Rhea et al. 2023a), or explored without downloading, through the visualization platform at macrosheds.org. An interactive data catalog is available under the Data tab on macrosheds.org. See Table 1 for terms used throughout the following sections.

Table 1. Common terms as used within the MacroSheds dataset and this paper.
Term | Definition
Watershed | All land area contributing runoff to a point of interest along a stream, regardless of contributing area. Does not necessarily account for inputs from subsurface flow or human-constructed diversions. The terms “catchment” and “basin” are sometimes used in this way.
Site | An individual gauging station or stream sampling location and its watershed.
Domain | One or more sites under common management.
Network | One or more domains under common funding/leadership.
Product | A collection of data, possibly including multiple datasets/tables. Primary sources may separate products by temporal extent/interval, scientific category, detection method, and/or sampling location. MacroSheds products are detailed in Table 2.
Site-product | The collection of all data for a single MacroSheds product, available at a single site.

This dataset is derived from data already published in public repositories, primarily from US federally funded watershed studies, and in compliance with existing grant requirements. We report combined discharge and chemistry for 169 watershed studies (Fig. 1). The core dataset consists of seven data products (Table 2) grouped into two components, referred to below as “time series” and “watershed attributes.” Each of these components of the core dataset has a supplementary counterpart in which data structure, variables, and methods parallel the CAMELS dataset, to maximize interoperability between MacroSheds and CAMELS.

Table 2. MacroSheds data products. All but watershed attributes constitute time-series products.
Product | Definition
Discharge | Streamflow; water volume over time; reported in L s−1.
Stream chemistry | Concentration of chemical constituents in stream water; reported in mg L−1 or mEq L−1.
Stream flux | Mass of chemical constituents in stream water, per watershed area, over time; reported in kg ha−1 d−1.
Precipitation | Rainfall, snowfall, or both combined; reported per watershed in mm.
Precipitation chemistry | Concentration of chemical constituents in precipitation; reported in mg L−1 or mEq L−1; averaged across watershed area.
Precipitation flux | Mass of chemical constituents in precipitation, per watershed area, over time; reported in kg ha−1 d−1.
Watershed attributes | Areal watershed summary statistics, describing climate, hydrology, geology, terrain, vegetation, soil, and landcover.
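
As a worked example of how the units in Table 2 relate to one another, the sketch below converts a concentration (mg L−1) and a discharge (L s−1) into an areal daily flux (kg ha−1 d−1). It assumes that flux is simply concentration multiplied by mean daily discharge and divided by watershed area; this illustrates the unit arithmetic only and may differ from the actual MacroSheds flux procedure. The function and parameter names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
KG_PER_MG = 1e-6


def daily_flux_kg_ha_d(conc_mg_per_l, discharge_l_per_s, area_ha):
    """Convert concentration and discharge to an areal daily flux.

    Assumption (not from the source): flux = concentration * discharge / area,
    with discharge treated as constant over the day.
    """
    if area_ha <= 0:
        raise ValueError("area_ha must be positive")
    # mg/L * L/s -> mg/s; scale to kg per day.
    mass_kg_per_day = conc_mg_per_l * discharge_l_per_s * SECONDS_PER_DAY * KG_PER_MG
    return mass_kg_per_day / area_ha


# 1 mg/L at 10 L/s exports 0.864 kg per day; over 100 ha that is
# roughly 0.00864 kg per hectare per day.
flux = daily_flux_kg_ha_d(conc_mg_per_l=1.0, discharge_l_per_s=10.0, area_ha=100.0)
print(round(flux, 5))
```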

Time-series data

For the time-series component, we harmonized both physical hydrology and stream chemistry variables, capturing tremendous variation in hydrologic regimes and solute concentrations. MacroSheds data span a wide range of mean annual runoff (three orders of magnitude; Fig. 3). The distribution of flows is quite variable, with a high frequency of high-flow events, as is typical of the small, steep catchments that dominate this dataset. A substantial fraction of streams go completely dry in an average year, and baseflow index ranges from 0 to 0.9. Water quality varies greatly among streams in the MacroSheds dataset, with pH covering almost the full range of that reported for natural waters (3–8; Wetzel 2001). Dissolved phosphorus and nitrogen concentrations are generally low compared with previously published data compilations (Falcone 2011; Newman et al. 2014), largely because this dataset is dominated by undisturbed, smaller watersheds. In contrast, dissolved organic carbon (DOC) ranges from near detection to > 30 mg L−1 (nearly black water), reflecting a wide variation in wetland habitat. The MacroSheds dataset does include some site-specific sample collection biases, in that fewer than half of sites routinely collect total suspended solids (TSS), dissolved inorganic carbon (DIC), and alkalinity data (Fig. 4). These water quality patterns arise because of geologic variation, incoming precipitation chemistry, vegetation cover, and patterns of ecosystem productivity (Fig. 5). The specific variables available within each watershed study vary widely, but the MacroSheds dataset includes at a minimum stream discharge or major stream ion concentrations (Ca, Mg, K, SO4, etc.) for each site. In all, the MacroSheds dataset contains 185 stream and precipitation variables, including concentrations of nutrients, metals, photosynthetic pigments, and dissolved gases, as well as temperature, turbidity, and other common water quality metrics where available. The total numbers of sites with discharge and chemistry data are 181 and 484, respectively. The total number with both is 169. A breakdown of data availability and data sources by domain is given in Table 4, but for a complete list of variables by site and temporal range, consult variables_timeseries.csv on EDI or visit the interactive data catalogs under the Data tab at macrosheds.org.

Fig. 3. Distributions of hydrologic conditions across MacroSheds sites, computed on site-years with at least 85% temporal coverage, or on ≥ 50% maximum coverage for polar or arid sites where a full year of flow is never measured. Each vertical bar represents a single site. Krycklan (Sweden) and McMurdo (Antarctica) domains appear as black bars.

Fig. 4. Distributions of chemical properties across MacroSheds sites. Each vertical bar represents a single site. For every panel except “pH,” values are log10 transformed to increase the visibility of the bar colors. Krycklan (Sweden) and McMurdo (Antarctica) domains appear as black bars.

Fig. 5. Distributions of 24 watershed attributes across MacroSheds sites. Each vertical bar represents a single site. Inset letter codes stand for attribute categories: Climate, Landcover, Vegetation, Parent material, Terrain. In all, 83 summary attributes and 185 temporally explicit attributes are available. Dep., deposition; Dur., duration; Frac., fraction; GPP, gross primary productivity; NPP, net primary productivity; Perm., permeability. Krycklan (Sweden) and McMurdo (Antarctica) domains appear as black bars.
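
The site-year screening mentioned in the caption for the hydrologic-conditions figure (at least 85% temporal coverage) can be sketched as follows. This is an illustration under our own assumptions: coverage is taken to be the fraction of calendar days in a calendar year with a valid daily value, and the function names are hypothetical. The relaxed rule for polar or arid sites is not reproduced here; a lower threshold can be passed through `min_coverage` where appropriate.

```python
from collections import defaultdict
from datetime import date, timedelta


def days_in_year(year):
    """Number of calendar days in the given year (365 or 366)."""
    return (date(year + 1, 1, 1) - date(year, 1, 1)).days


def covered_site_years(observations, min_coverage=0.85):
    """Keep site-years whose daily coverage meets the threshold.

    observations: iterable of (site_code, date) pairs, one per valid daily value.
    Returns {(site_code, year): coverage_fraction}.
    Assumption: calendar years, not water years.
    """
    seen = defaultdict(set)
    for site, day in observations:
        seen[(site, day.year)].add(day)  # a set ignores duplicate days
    kept = {}
    for (site, year), days in seen.items():
        fraction = len(days) / days_in_year(year)
        if fraction >= min_coverage:
            kept[(site, year)] = fraction
    return kept


# Toy record: 320 days observed in 2001, 200 days in 2002.
obs = [("site_a", date(2001, 1, 1) + timedelta(days=i)) for i in range(320)]
obs += [("site_a", date(2002, 1, 1) + timedelta(days=i)) for i in range(200)]
print(sorted(covered_site_years(obs)))  # only the 2001 site-year passes
```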

MacroSheds time-series data are tiered by domain according to the restrictiveness of licensing and intellectual rights (IR) terms associated with their primary sources. Tier 1 domains have minimal restrictions, requiring at most standard attribution, while Tier 2 domains require some additional action on the part of data users. Data tiers and license/IR information are detailed in our Data Use Agreements (data_use_agreements.docx on EDI), and citations for all MacroSheds primary sources are included in Tables 4 and 6 of this document. A full compendium of attribution, contact, and legal information can be found in our documentation on EDI (attribution_and_intellectual_rights CSV files). The “Data Use and Recommendations for Reuse” section of this document contains instructions on efficiently achieving license/IR compliance as a user of MacroSheds data.

MacroSheds time-series data are provided as CSV files, separated by domain and indexed by date, site code, and variable. The column structure is laid out in Table 3 and later referred to as “MacroSheds format.”

Table 3. Structure of MacroSheds core time-series data CSV files. Referred to throughout as “MacroSheds format.” Within the macrosheds package for R, the val_err column may be omitted if uncertainty is included with the val column (see section on “Detection Limits and Propagation of Uncertainty”).
Header value | Column definition
datetime | Date and time in UTC. Time is specified in order to accommodate subdaily time-series data in future updates, though at present all time components are 00:00:00.
site_code | A unique identifier for each MacroSheds site. Identical to primary source site code where possible. See sites.csv metadata on EDI or run ms_load_sites() from the macrosheds package for more information.
var | Variable code, including sample type prefix (described in “Tracking of Sampling Methods for Each Record” section). See variables_timeseries.csv on EDI or run ms_load_variables() from the macrosheds package for more information.
val | The data value.
ms_status | QC flag. See “Technical Validation” section. Lowercase “ms” here stands for “MacroSheds.”
ms_interp | Imputation flag, described in “Temporal Imputation and Aggregation” section.
val_err | The combined standard uncertainty associated with the corresponding data point, if estimable. See “Detection Limits and Propagation of Uncertainty” section for details.

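
To make the “MacroSheds format” concrete, here is a minimal Python sketch that parses rows with the Table 3 columns and optionally drops flagged or imputed records. The sample rows, site codes, and variable codes are invented placeholders, and the sketch assumes that both flags are coded as integers with 0 meaning “not flagged”; consult the dataset documentation for the actual coding.

```python
import csv
import io
from datetime import datetime, timezone

# Made-up rows in the column order of Table 3 (not real MacroSheds data).
SAMPLE = """datetime,site_code,var,val,ms_status,ms_interp,val_err
2001-01-01 00:00:00,site_a,var_x,1.20,0,0,0.05
2001-01-02 00:00:00,site_a,var_x,1.35,0,1,0.05
2001-01-01 00:00:00,site_a,var_y,0.40,1,0,
"""


def read_macrosheds_rows(text, drop_flagged=False, drop_imputed=False):
    """Parse MacroSheds-format CSV text into a list of typed dicts.

    Assumption: ms_status and ms_interp are integers, 0 = not flagged.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        status = int(rec["ms_status"])
        interp = int(rec["ms_interp"])
        if drop_flagged and status != 0:
            continue
        if drop_imputed and interp != 0:
            continue
        stamp = datetime.strptime(rec["datetime"], "%Y-%m-%d %H:%M:%S")
        rows.append({
            "datetime": stamp.replace(tzinfo=timezone.utc),  # datetimes are UTC
            "site_code": rec["site_code"],
            "var": rec["var"],
            "val": float(rec["val"]),
            "ms_status": status,
            "ms_interp": interp,
            # Uncertainty is only present where it could be estimated.
            "val_err": float(rec["val_err"]) if rec["val_err"] else None,
        })
    return rows


all_rows = read_macrosheds_rows(SAMPLE)
clean_rows = read_macrosheds_rows(SAMPLE, drop_flagged=True, drop_imputed=True)
print(len(all_rows), len(clean_rows))
```

Because the files are in long format (one row per date, site, and variable), filtering on the flag columns before any reshaping keeps the provenance of each value intact.
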
Table 4. MacroSheds time-series data breakdown by domain. End-dates of hydrologic and chemical records (Columns 3 and 4) vary according to primary source publication schedules. Water chemistry sample frequencies (Column 6) are occasionally irregular; up to three of the most common (mode) sample frequencies are shown for each domain.
Domain code | Sites | Dur. of hydro record | Dur. of chem record | Solutes | Chem sample freqs. | Citations
arctic | 5 | 1978–2019 | 1978–2019 | 46 | Daily, weekly | Kling 2016a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s, 2019; Shaver 2019; Zarnetske 2020; Zarnetske et al. 2020a,b; Bowden 2021a,b,c,d
baltimore | 9 | 1957–2022 | 1998–2019 | 17 | Weekly, 2 monthly | Cary Institute of Ecosystem Studies et al. 2017; Groffman and Martel 2020; Groffman et al. 2020a,b; Welty and Lagrosa 2020
bear | 2 | 1988–2016 | 1986–2016 | 19 | Daily | Patel et al. 2020a,b
bonanza | 3 | 1969–2020 | 1994–2018 | 17 | Daily, weekly, 2 weekly | Chapin et al. 2014, 2018; Jones et al. 2016; Van Cleve et al. 2018; Jones, Chapin, et al. 2020
boulder | 4 | 1996–2022 | 2008–2020 | 31 | Daily, weekly, 2 weekly | Rock and Anderson 2020; Anderson 2021; Anderson and Jensen 2021a,b,c; Anderson and Ragar 2021a,b,c,d; Anderson et al. 2021
calhoun | 1 | 2014–2017 | 2014–2018 | 23 | ~ monthly | Foroughi et al. 2019; Mallard 2020, 2021; Wang et al. 2021
catalina_Jemez | 12 | 2006–2021 | 2005–2020 | 66 | Daily, weekly, 2 weekly | Troch and Abramson 2019, 2020, 2021; Troch et al. 2019, 2020a,b, 2021; Litvak and Brooks 2020a,b; Chorover et al. 2021a,b; McIntosh et al. 2021a,b; Papuga et al. 2021a,b,c
east_river | 11 | 2014–2020 | 2014–2020 | 47 | Daily, weekly | Carroll et al. 2019, 2021; Carroll and Williams 2019; Dong et al. 2020a,b,c; Newcomer and Rogers 2020; Williams et al. 2020a,b
fernow | 9 | 1951–2019 | 1983–2019 | 11 | Weekly, 2 weekly | Edwards and Wood 2011a,b,c,d
hbef | 9 | 1956–2022 | 1963–2021 | 31 | Weekly | USDA Forest Service 2020, 2021; Hubbard Brook Watershed Ecosystem Record (HBWatER) 2021
hjandrews | 10 | 1949–2019 | 1968–2019 | 26 | Daily | Rothacher 2017; Fredriksen 2019a,b; Johnson et al. 2020
konza | 4 | 1985–2020 | 1983–2021 | 10 | Daily, 2 daily | Dodds 2019, 2020a,b,c, 2021a,b,c,d,e; Blackmore 2020; Blair 2021; Nippert 2021
krew | 8 | 2003–2015 | 2003–2021 | 12 | Daily, 2 weekly | Hunsaker and Safeeq 2017, 2018; Hunsaker and Padgett 2019
krycklan | 15 | 1981–2021 | 1985–2021 | 91 | Daily, 2 weekly | Laudon et al. 2013
luquillo | 10 | 1945–2022 | 1983–2018 | 19 | Weekly | Gonzalez 2015, 2017; Ramirez 2020, 2021; McDowell 2021a,b
mcmurdo | 18 | 1969–2020 | 1990–2020 | 18 | Daily, weekly | Gooseff and McKnight 2021a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q; Gooseff and Lyons 2022a,b,c; McKnight and Gooseff 2022a,b,c
niwot | 7 | 1981–2021 | 1982–2021 | 31 | Daily, weekly | Niwot Ridge LTER and Caine 2018; Caine 2019a,b,c, 2021a,b,c,d,e,f,g; Williams 2019, 2021a,b; Caine et al. 2020a,b,c, 2021; Caine and Niwot Ridge LTER 2021a,b; Morse et al. 2021a,b,c; Williams et al. 2021
plum | 4 | 2001–2015 | 1993–2019 | 26 | Daily, monthly | Giblin 2013a,b,c,d, 2015a,b, 2016, 2017, 2018, 2019, 2020; Hopkinson 2013a,b,c,d,e,f,g,h,i; Wollheim 2013a,b,c,d,e,f,g,h,i,j,k,l,m,n, 2014a,b,c,d,e,f,g,h,i,j,k,l,m,n,o, 2016a,b,c,d,e,f,g,h,i,j,k,l, 2018a,b,c,d, 2019a,b,c,d,e,f; Wollheim and Vorosmarty 2014a,b,c,d,e,f; Wollheim and Green 2018a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p, 2019a,b,c,d,e,f,g,h,i; Wollheim et al. 2019; Wollheim and Plum Island Ecosystems LTER 2019, 2021
santa_Barbara | 12 | 1970–2022 | 2000–2018 | 10 | Daily | Santa Barbara Coastal LTER and Melack 2014a,b,c,d,e,f,g,h,i,j, 2019a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag,ah,ai,aj,ak,al,am,an,ao,ap,aq,ar,as,at, 2020
santee | 4 | 1964–2018 | 1976–2017 | 24 | Daily | USDA Forest Service 2011, 2017; Amatya and Trettin 2012a,b, 2018
shale_hills | 4 | 2006–2021 | 2006–2015 | 25 | Daily, diverse | Li 2018; Brantley 2019
suef | 4 | 1963–2018 | 1969–1981 | 21 | 3 weekly | Fredriksen and Johnson 2017a,b; Jones and Rothacher 2019
usgs | 1 | 2009–2022 | 2009–2022 | 12 | Daily | Courtesy of the US Geological Survey
walker_Branch | 2 | 1969–2014 | 1989–2013 | 28 | Weekly | Mulholland and Griffiths 2016a,b,c

Table 5. Structure of MacroSheds temporally explicit watershed attribute data.
Header value | Column definition
network | MacroSheds network.
domain | MacroSheds domain.
site_code | A unique identifier for each MacroSheds site. Identical to primary source site code where possible. See sites.csv metadata on EDI or run ms_load_sites() from the macrosheds package for more information.
var | Variable code, including prefix with data source and category codes. See variables_ws_attr_timeseries.csv on EDI or run ms_load_variables(var_set = "ws_attr") from the macrosheds package for more information.
date | Calendar date.
val | The data value.
pctCellErr | Percent of watershed raster cells with missing values. Not currently retrieved for Google Earth Engine products.

Table 6. Watershed attribute datasets included in MacroSheds, and their primary sources. Datasets retrieved from Google Earth Engine, rather than the primary source, are indicated by “GEE.”
Attribute(s) | Source | Citation
Evapotranspiration reference | Gridmet (GEE) | Abatzoglou 2012
LAI, fPAR | MODIS (GEE) | Myneni et al. 2015
NDVI | MODIS (GEE) | Didan 2015
Vegetation cover | MODIS (GEE) | Townshend 2016
Atmospheric chemical fluxes | NADP | NADP Program Office 2022
Landcover classes | NLCD (GEE) | Dewitz 2021
Soil composition and properties | NRCS-gSSURGO | Soil Survey Staff 2022
SWE, snow depth | NSIDC | Broxton et al. 2019
NPP and GPP | NTSG (GEE) | Robinson et al. 2018
Soil thickness | ORNL DAAC | Pelletier et al. 2016
Wetness | Oxford MAP (GEE) | Weiss et al. 2014
Temperature and precipitation | PRISM (GEE) | Daly et al. 2008
Base flow index | USGS | Wolock 2003
Bedrock composition and properties | USGS | Olson and Hawkins 2014
Climate* | Daymet (GEE) | Thornton et al. 2020
Subsurface permeability, porosity* | GLHYMPS | Gleeson 2018
Geologic classes* | GLiM | Hartmann and Moosdorf 2012
Landcover classes* | MODIS (GEE) | Friedl and Sulla-Menashe 2019
* The corresponding attributes are included in the CAMELS-compliant supplement to the core MacroSheds dataset, but not necessarily in the core dataset itself.

In addition to our core time-series dataset, we provide a separate, supplementary collection of “CAMELS-compliant Daymet forcings” that conforms to the Daymet variables and methods used in the CAMELS dataset (Newman et al. 2014; Addor et al. 2017; Thornton et al. 2020).

Watershed attribute data

The core watershed attributes component of the MacroSheds dataset is an extensive spatial summary product, compiled from published, gridded products (Table 6). It describes climate, geology, terrain, vegetation, and land cover (Fig. 5). We also provide a separate, supplementary collection of “CAMELS-compliant watershed attributes” that conforms to the variables, data sources, and methods used in the CAMELS dataset. Importantly, the MacroSheds dataset covers much smaller watersheds than those included in the CAMELS dataset (Fig. 2). Due to the time cost of delineating watersheds, we elected to summarize attributes only for watersheds with discharge data, as they have substantially higher analytical potential. As an example of the patterns these attributes reveal, MacroSheds sites in Mediterranean California tend to have porous bedrock with high sulfur content, and to receive little nitrogen deposition, while eastern temperate forest sites have the highest geologic nitrogen content and receive the most nitrogen and sulfate deposition (Fig. 5). Concentrations of nitrogen and sulfur species in the stream water of each ecoregion will depend on these and other factors such as mineralization, plant uptake, and erosion rates, and cumulative fluxes will further depend on the long-term hydraulic output of each stream. Moreover, the relationship between concentration and flux may change with hydraulic regime over the course of seasons or decades.

Watershed attribute data are provided as CSV files in two formats, representing different levels of aggregation. At the coarsest level, gridded spatial data are summarized to a single value per variable per watershed, and provided in wide format. However, some watershed attributes are temporally explicit, and our second format preserves the dates associated with each model estimation or satellite pass. Column structure for this format is given in Table 5.
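
The relationship between the two formats can be illustrated with a short Python sketch that collapses temporally explicit rows (Table 5 columns) into one value per variable per watershed. The choice of a temporal mean as the summary statistic is our assumption for illustration; the source does not state which statistic is used for each attribute. Site and variable codes are invented placeholders.

```python
from collections import defaultdict
from statistics import mean


def summarize_to_wide(rows):
    """Collapse long, dated attribute rows into {site_code: {var: value}}.

    Assumption (not from the source): each summary value is the mean of
    the dated values; missing values (None) are skipped.
    """
    grouped = defaultdict(list)
    for row in rows:
        if row["val"] is not None:
            grouped[(row["site_code"], row["var"])].append(row["val"])
    wide = defaultdict(dict)
    for (site, var), values in grouped.items():
        wide[site][var] = mean(values)
    return dict(wide)


# Hypothetical temporally explicit records.
rows = [
    {"site_code": "site_a", "var": "attr_x", "date": "2001-01-01", "val": 2.0},
    {"site_code": "site_a", "var": "attr_x", "date": "2002-01-01", "val": 4.0},
    {"site_code": "site_a", "var": "attr_y", "date": "2001-01-01", "val": None},
    {"site_code": "site_b", "var": "attr_x", "date": "2001-01-01", "val": 1.0},
]
print(summarize_to_wide(rows))
```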


Criteria for dataset discovery and inclusion in the MacroSheds dataset

Sites included in the MacroSheds dataset were primarily identified through the NSF-funded LTER, LTREB, and CZNet (formerly CZO) programs (113 of 169 sites, as of MacroSheds v1.0). Additional sites funded or managed by the US Geological Survey, Department of Energy, and Forest Service were identified through personal communication, literature search (long*term AND [watershed* OR basin* OR catchment*]), or by perusing government websites. The Krycklan Catchment Study in Sweden is currently the only domain within MacroSheds that is not associated with the federal government of the United States, but it will be joined by other US and international watershed studies as the MacroSheds project expands. NEON provides data products that will be integral to a future version of the MacroSheds dataset. Currently, NEON remains in its early operational phase, and its data products will be included in MacroSheds pending resolution of water quality and continuous discharge data anomalies that require further attention (Rhea et al. 2023b).

To be considered for inclusion in MacroSheds, a site requires either automated monitoring of stream discharge or routine sampling of stream chemistry, for at least a full year (minus periods of freezing or drying), as well as public data hosting. Additional data describing the quantity and chemistry of precipitation are highly valuable, but not required. Watershed boundaries can be delineated and geospatial summaries generated via MacroSheds tools, so these are not required.

Data processing system: Design and overview

The data acquisition and processing routines used to build the MacroSheds dataset comprise a system of cyclical ingestion pipelines (Fig. 6), written entirely in R (R Core Team 2022). Source code is designed functionally and organized hierarchically, mirroring the hierarchy of network-domain-site organization across institutions that manage watershed studies. This allows routines specific to a domain, or shared across a network, to be loaded as modules, minimizing code redundancy and simplifying inclusion of new sites. Improvement of this design is ongoing, and will enable user data contributions, in exchange for watershed boundaries, summary statistics, and derived time-series products, in the near future.

Visualization of the four phases of MacroSheds data processing for a single domain: retrieval, harmonization (munging), derivation, and postprocessing, focusing on the evolution of precipitation data (P) from raw to final form as part of a domain dataset. Gold circles represent processing “kernels”—modular and customizable sets of routines that carry out the core steps of the first three phases. Within each phase, zero or more kernels are called in sequence, depending on which products need to be updated, as determined by the progress tracker. In Phase 1, retrieval kernels download primary source data. During Phase 2, kernels are called by one of four “munge engines” (pentagons) depending on whether primary source files are separated by site, by time, by product, or some combination. After Phase 3, time-series and geospatial data are organized into one file for each of the core MacroSheds products (discharge, stream chemistry, precipitation, precipitation chemistry, gauge locations, watershed boundaries). After Phase 4, a complete dataset has been generated for a single domain, and the process repeats for the next domain.

For each domain, time series of discharge, precipitation, and chemistry are first downloaded and saved locally in whatever form and format they are provided. They are then processed by site-product into MacroSheds format. If a watershed boundary is not provided, it is delineated. Additional products are then derived, namely watershed-mean precipitation depth and chemistry (and daily solute flux may be generated via the “macrosheds” R package if desired; see the “Flux Calculation” section). Finally, we generate spatial summary statistics for each watershed.

The processing system is designed insofar as possible to accommodate future deviations from the ways primary sources currently structure and serve their products. Each pipeline is fault-tolerant, so if provider-side changes introduce errors at any stage of data access or processing, the errors are logged, the developers are notified by email, and the system moves on. Any change involving file headers, URL paths, or splitting/combining of datasets requires careful accommodation by the MacroSheds team (and anyone else who directly reuses primary data), so we encourage data providers to maintain structural consistency across dataset versions whenever feasible.
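The fault-tolerant behavior described above can be illustrated with a minimal Python sketch. This is not the MacroSheds implementation, which is written in R; the function name `run_kernels` and the `notify` hook (standing in for an email alert) are hypothetical.

```python
import logging

logger = logging.getLogger("pipeline_sketch")


def run_kernels(kernels, notify=None):
    """Run each (name, callable) pair in order; log failures and keep going.

    Returns (results, failures): results maps kernel name to its return
    value, failures maps kernel name to the exception it raised.
    """
    results, failures = {}, {}
    for name, kernel in kernels:
        try:
            results[name] = kernel()
        except Exception as exc:  # broad on purpose: one bad source must not halt the run
            logger.error("kernel %s failed: %s", name, exc)
            failures[name] = exc
            if notify is not None:
                notify(name, exc)  # hypothetical hook, e.g. an email sender
    return results, failures
```

A kernel that fails because a provider renamed a column would appear in `failures`, while the remaining kernels still run.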

Time-series data access and amenity to harmonization

Among the 25 domains currently included in the MacroSheds dataset, we have identified five distinct tiers of “harmonization amenity,” or the convenience with which we were able to access discharge, precipitation, and chemistry data and unify their idiosyncratic differences within a domain. Harmonization amenity encompasses the core elements of FAIR principles: Findability, Accessibility, Interoperability, and Reusability of data and metadata (Wilkinson et al. 2016), but also whether conceptually adjacent datasets share internal structure, and whether and how revisions are designated. Together, these elements determine the usability of public data, and the long-term practicality of including a source dataset in an ongoing synthesis effort like MacroSheds. Importantly, harmonization amenity tiers say nothing of the quality of a domain's data–only of its data structure and infrastructure. Licensing and IR restrictions are also a separate issue, with a separate tiering system (see data_use_agreements.docx). Our harmonization amenity tiers range from A, the most amenable, to E, the least amenable.

At Tier-E, data access is through personal correspondence only. As such, internal file structure is unpredictable and programmatic version-checking is impractical. We have generally avoided Tier-E domains and make no guarantees about their continued inclusion in MacroSheds, as they require an ongoing time commitment from our developers. We encourage watershed data managers to contribute routinely to public repositories like EDI, DataONE, HydroShare, or ESS-DIVE, so that we can build automated connections to MacroSheds.

Many datasets are hosted as hyperlinked, static files (Tier-D). This way of serving data is standardized only by the rules of transfer protocols (HTTP, FTP, etc.), which do not facilitate reliable file versioning (Postel and Reynolds; Belshe et al. 2015); however, it is possible to use the “last-modified” date in the header of a static file as a proxy for file version, as MacroSheds does. Many USFS and DOE domains, and even some CZNet domains, are Tier-D.
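As a rough illustration of using the “last-modified” header as a version proxy, the comparison might look like the following Python sketch. The function name `is_newer` is hypothetical, and the sketch assumes both values are well-formed HTTP-date strings that carry the same time zone designation (e.g., GMT).

```python
from email.utils import parsedate_to_datetime


def is_newer(last_modified_header, previously_seen_header):
    """Return True if the served file appears newer than the copy seen before.

    Both arguments are HTTP-date strings such as
    "Wed, 21 Oct 2015 07:28:00 GMT". A previous value of None means the
    file has never been retrieved, so it is treated as new.
    """
    if previously_seen_header is None:
        return True
    current = parsedate_to_datetime(last_modified_header)
    previous = parsedate_to_datetime(previously_seen_header)
    return current > previous
```

In practice the header value could be read from a HEAD request, for example with `urllib.request.Request(url, method="HEAD")` and `response.headers.get("Last-Modified")`. Servers are not obliged to send this header, so a missing value needs separate handling.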

By hosting data in any public data repository that follows FAIR data standards, a domain can easily achieve Tier-C harmonization amenity or higher, meaning related files are naturally grouped or linked in a way that aids discovery. Most repositories permit straightforward versioning of files and file collections; however, in Tier-C the onus is on data managers to establish that an uploaded resource is a new version of some existing resource. Most CZNet domains are housed on CUAHSI's HydroShare, a premier environmental data and code repository that allows for easy creation of new versions of “formally published” resources. However, some CZNet domains have not published their data formally and edit their existing resources rather than creating official new versions. This makes programmatic identification of new file versions at least as difficult as with Tier-D harmonization amenity.

Datasets associated with Tier-B domains are easily found and fully versioned. Within MacroSheds, most domains associated with the LTER network are Tier-B, owing in part to the strict metadata and publishing requirements of the EDI data portal and underlying PASTA+ repository, which all but ensure proper versioning and within-domain findability of related files. Still, for Tier-B domains, neither data hosting architecture nor management dictate the internal structure or naming of files; however, the EDI repository does provide an effective set of recommendations to help contributors adhere to best practices: https://edirepository.org/resources/cleaning-data-and-quality-control.

At the forefront (Tier-A) of harmonization amenity are the USGS and NEON domains (each also a network per se), which provide systematic access and consistent data structure across all the sites they manage. This means, for example, that the URL for water quality time series at site X is intuitively related to that for site Y, and that once downloaded, the two datasets are structured and formatted identically. Moreover, NEON and the USGS provide web services through which to explore, retrieve, and even manipulate their collections programmatically. In R, we conveniently queried these endpoints through official client packages (Lunch et al. 2021; Cicco et al. 2022). Because Tier-A institutions control data collection, storage, and hosting, they are able to establish a consistency of access and internal structure that is much more difficult to achieve post hoc.

Time-series data processing

This section details major steps taken to harmonize disparate chemistry, discharge, and precipitation data into MacroSheds format (see the “Data Description” section) and extract useful metadata. In any harmonization effort, there is a tradeoff between fidelity to the original datasets as they are, and cohesion of the aggregate set. We have endeavored for a MacroSheds dataset that is parsimonious but high in analytical potential, and that assimilates provided metadata where practical.

Each MacroSheds data ingestion pipeline performs a wide variety of basic processing routines. For a technical account of the steps involved in (1) conforming site and variable names, (2) resolving datetime formats and time zones, (3) converting units, and (4) reshaping data tables, consult code_autodocumentation.zip on EDI, and our complete codebase at https://github.com/MacroSHEDS/data_processing. The rest of this section covers assimilation of metadata on sampling methods and detection limits, propagation of uncertainty, and temporal imputation/aggregation.

Tracking of sampling methods for each record

The MacroSheds dataset includes measurements recorded by installed equipment and by hand (grab sample), and end users may wish to filter it accordingly. We further distinguish between measurements made via sensors vs. analytical or visual means. The former distinction is made programmatically with simple heuristics (e.g., inconsistent sample interval precludes autosampling), and the latter by consulting primary metadata. These distinctions are summarized as two-letter “sample regimen” codes prefixed to each MacroSheds variable code: “I” or “G” for “installed” vs. “grab,” and “S” or “N” for “sensor” vs. “non-sensor.” For example, “IS_discharge.”
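End users who wish to filter by sample regimen could split the prefix from the variable code along these lines. This is a Python sketch with hypothetical names, assuming the prefix is always two letters followed by an underscore, as in “IS_discharge.”

```python
INSTALL_CODES = {"I": "installed", "G": "grab"}
SENSOR_CODES = {"S": "sensor", "N": "non-sensor"}


def parse_variable_code(code):
    """Split a prefixed variable code into (collection, measurement, variable)."""
    prefix, sep, variable = code.partition("_")
    valid = (
        sep == "_"
        and len(prefix) == 2
        and prefix[0] in INSTALL_CODES
        and prefix[1] in SENSOR_CODES
    )
    if not valid:
        raise ValueError(f"unrecognized sample regimen prefix in {code!r}")
    return INSTALL_CODES[prefix[0]], SENSOR_CODES[prefix[1]], variable
```

For example, `parse_variable_code("IS_discharge")` yields `("installed", "sensor", "discharge")`.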

At present, we do not report specific analytical methods for time-series variables, effectively assuming that commensurate units imply commensurability. We know this to be misleading for some variables—in particular those measured via fluorescence or absorbance—and intend to include more detailed methods for at least these variables (e.g., FDOM, turbidity) in a future release.

Detection limits and propagation of uncertainty

We were able to locate published limits of detection (LODs) for solute concentrations of only 10 of the 24 domains included in Version 1 of the MacroSheds dataset. For the rest, we assumed each variable's LOD to be the minimum LOD for that variable across the 10 domains with reported values. We do not attempt to infer LODs from the data, for example, by assuming they are approximated by the minimum reported absolute value. This risks egregious overestimation wherever measured values never approach the LOD, or underestimation wherever reported values have been transformed or determined via a calibration or rating curve.

Accurate cumulative flux calculations depend on relatively complete data records. It is thus critical that below-detection-limit (BDL) samples be given a numeric value, so they are not confused with records for which a measurement is truly missing, and must be naively imputed. BDL measurements are variously reported by primary sources as ½ LOD, ¼ LOD, LOD, 0, missing, and so on. Some domains do not report BDL measurements. For consistency, we replace any value flagged as BDL with ½ of the reported/estimated LOD and set the corresponding ms_status to 1 (“questionable” vs. 0 for “clean”; see the “Technical Validation” section). Only values explicitly flagged as BDL are replaced in this way. For the rare case in which a value is flagged as BDL, and no LOD is reported for the corresponding variable at the reporting domain or any other domain, we set the value to 0 and the ms_status to 1. Within the MacroSheds dataset, BDL values are not flagged as such, but BDL flags can be reconstructed if necessary by cross referencing any time-series dataset with detection_limits.csv on EDI.
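The BDL replacement rule can be summarized in a short Python sketch. The function and argument names are hypothetical; `lod` is the reported or estimated LOD for the variable, or `None` when no LOD is available from any domain.

```python
def replace_bdl(value, is_bdl, lod, status=0):
    """Apply the BDL rule to one record and return (value, ms_status).

    Only records explicitly flagged as BDL are modified.
    """
    if not is_bdl:
        return value, status  # untouched: value and incoming status pass through
    if lod is None:
        return 0.0, 1         # rare case: flagged BDL, no LOD known anywhere
    return lod / 2, 1         # half the LOD, marked "questionable"
```

With a LOD of 0.008 mg L−1, a BDL-flagged record becomes 0.004 mg L−1 with an ms_status of 1.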

Before the MacroSheds processing system performs any mathematical transformation on raw data, uncertainty is attached to each record. Due to the scarcity of reported measurement or analytical precision/uncertainty, we have chosen not to propagate reported values. Instead, initial uncertainty for each domain-variable is determined by $u = 10^{-p}$, where p is the precision of the variable's reported LOD, after conversion to MacroSheds standard units. For example, a LOD of 0.008 mg L−1 has a precision of 3 (digits after the decimal), resulting in initial uncertainty of 0.001 mg L−1. For domains that do not report LODs, we set the initial uncertainty for each variable according to the minimum (coarsest) reported p across all domains that do report LODs. For some variables, we have no basis by which to infer initial uncertainty, so we report it as missing. The two exceptions are discharge and precipitation, both required for computing solute flux. For these, we set initial uncertainty to zero. Uncertainty is then propagated through all MacroSheds mathematical transformations via the errors package (Ucar et al. 2018). A table of all known detection limits can be found in our documentation on EDI (detection_limits.csv).
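The precision rule, in which initial uncertainty is 10 raised to the power of minus p, can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes the LOD is supplied as text already expressed in standard units, so that trailing zeros (which count toward p) are preserved.

```python
from decimal import Decimal


def initial_uncertainty(lod_text):
    """Return 10**(-p), where p is the number of digits after the decimal point."""
    exponent = Decimal(lod_text).as_tuple().exponent
    p = max(0, -exponent)  # whole-number LODs have p = 0
    return Decimal(10) ** -p
```

For the example in the text, `initial_uncertainty("0.008")` gives `Decimal("0.001")`. Passing the LOD as a binary float instead of text would lose the distinction between, say, 0.1 and 0.10.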

Temporal imputation and aggregation

We currently report all time-series data (not including temporally explicit spatial summary data) at a daily interval. The timestamp associated with each incoming record is floored to midnight (0 h, 0 min, 0 s), and series with a subdaily interval are aggregated across each 24-h span. Precipitation, which is reported in mm, is aggregated by sum, while discharge and chemistry are aggregated by mean. After aggregation, any implicit missing values are made explicit, so that there are no missing timestamps within a series. Linear interpolation is then used to fill gaps of no more than 3 d in each discharge series, and no more than 15 d in each stream chemistry series. Next-observation-carried-backward interpolation is used for precipitation chemistry series. Precipitation volume/depth series are rarely published with missing values during periods of gauge deployment, but when these are encountered, we use source metadata or direct contact to determine whether measured values represent multiday accumulation. If not, we fill gaps with zeros, indicating no precipitation; if so (we have not yet encountered this), we distribute measured precipitation values evenly across preceding missing values. For precipitation and precipitation chemistry, gaps of up to 45 d are interpolated. In the case of solute flux series provided by primary sources, the maximum gap length we interpolate is 15 d. Gaps larger than the aforementioned maximum lengths retain their missing values, and no extrapolation is performed. Records interpolated by the MacroSheds processing system are given an ms_interp value of 1; otherwise 0. A future version of the MacroSheds dataset may include subdaily records where available.
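Gap-limited linear interpolation of a daily series might be sketched in Python as below. This is an illustration under stated assumptions, not the MacroSheds code: the series is a list with one entry per day, missing days are `None`, and the function name is hypothetical. Gaps at the start or end of a series are left alone, consistent with performing no extrapolation.

```python
def fill_gaps_linear(values, max_gap):
    """Linearly fill interior runs of None that are at most max_gap days long.

    Returns (filled, flags), where flags[k] is 1 for interpolated days
    (analogous to ms_interp) and 0 otherwise.
    """
    filled = list(values)
    flags = [0] * len(values)
    n = len(values)
    i = 0
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        start = i
        while i < n and values[i] is None:
            i += 1
        gap = i - start  # i now indexes the first observed day after the gap, or n
        if start == 0 or i == n or gap > max_gap:
            continue     # unbounded on one side, or too long: keep missing values
        left, right = values[start - 1], values[i]
        for k in range(start, i):
            frac = (k - start + 1) / (gap + 1)
            filled[k] = left + frac * (right - left)
            flags[k] = 1
    return filled, flags
```

Under the limits given above, `max_gap` would be 3 for discharge and 15 for stream chemistry.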

Watershed attributes retrieval and processing

The MacroSheds dataset includes 185 watershed attributes–spatial summary statistics that may act as drivers of ecohydrological processes. These attributes are derived from modeled and remotely sensed gridded data products from various platforms. Attributes were chosen to capture the range of physical and biological variation seen in natural watersheds, and to allow comparison with other large-scale watershed/catchment descriptor datasets such as StreamCat (Hill et al. 2016) and CAMELS. Note that most of the watersheds in the MacroSheds dataset are too small to appear in the National Hydrography Dataset Plus Version 2 (McKay et al. 2012), and therefore cannot be directly linked to StreamCat metrics.

Attributes are organized into six categories: vegetation, climate, terrain, parent material, landcover, and hydrology. Every spatial variable in the MacroSheds dataset has a two-letter prefix to indicate first the variable category, and second the data source. For example, Leaf Area Index (LAI) variables from the MODIS satellite have a prefix of “v” to indicate the vegetation category and “b” for MODIS, so the median LAI for a watershed in the MacroSheds dataset has the name “vb_lai_median.” Watershed attribute prefix codes are catalogued in variable_category_codes_ws_attr.csv and variable_data_source_codes_ws_attr.csv on EDI.

Gridded products are summarized to watershed boundaries using one of two methods, based on where the source data product is held. For data accessible through Google Earth Engine (GEE), we used the R package “rgee” (Gorelick et al. 2017; Aybar 2021). First watershed boundaries are uploaded to GEE and stored as an asset. Then median and standard deviation values for each watershed at each reported time-step are summarized using the rgee function “reduceRegions.” For products not housed on GEE, gridded data are locally processed using the “terra” package for R (Hijmans 2021). A list of gridded data products and their sources is in Table 6.

Most watershed attributes included in the MacroSheds dataset are temporally explicit, with sampling/modeling intervals varying from daily to decadal. We provide all watershed attributes in their native (as reported by primary source) temporal intervals, and a subset of attributes as averages by site. We do not provide all watershed attributes for all sites, as some gridded products are only available for the contiguous United States.

Derivation of additional products

One of the core aims of the MacroSheds project is to enable engagement with continental-scale questions about whole-watershed solute and hydrologic flux. We do not yet publish stream or precipitation flux estimates, except for a few daily solute flux series that are provided by primary sources, but the next release of this dataset will include cumulative monthly and annual flux estimates for each site. For now, daily flux can be easily computed via the “macrosheds” R package.

Estimation of watershed solute influx and outflux requires information not consistently provided alongside the time-series data described above, namely watershed-mean precipitation and precipitation chemistry, and the watershed boundaries needed to compute them. Below we describe the derivation of these products.

Watershed delineation

For any watershed boundary not already published as a georeferenced spatial file, the MacroSheds processing system performs a delineation from the point of the stream gauge or sampling site (pour point). This process cannot be reliably automated for all pour points, due in part to imperfections in digital elevation models (DEMs), and in part to the fact that stream site locations are usually recorded from the banks nearby. Sometimes the watershed “found” by a delineation algorithm is actually a subset of, or adjacent to, the target watershed, and only visual inspection reveals the error. We rely on a semi-automated, interactive approach that delineates one or more candidate watersheds for each site, starting from one or more unique pour points. DEMs are retrieved using the “elevatr” package (Hollister et al. 2020) for R, and iteratively expanded any time a proceeding delineation meets the DEM edge. Candidate watersheds are presented for visual inspection and topographic comparison via package “mapview” (Appelhans et al. 2021). Hydrologic conditioning, pour point snapping, and delineation leverage the “whitebox” package (Wu 2021). If none of the candidates appears to represent the target watershed, the process can be conveniently repeated using updated parameters. For a detailed discussion of delineation parameters, see the “macrosheds” R package documentation.

Spatial interpolation of precipitation data

Each MacroSheds watershed is rasterized, or gridded, from the DEM used during delineation, or from one so retrieved. Precipitation chemistry is then imputed to each cell of the watershed raster by inverse squared-distance weighted interpolation, or IDW (Shepard 1968), using information from all precipitation gauges associated with the domain. Watershed-mean precipitation chemistry is then computed as the mean across all raster cells, separately for each solute and each day with data.
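A minimal Python sketch of inverse squared-distance weighting follows. It assumes planar (projected) coordinates and uses hypothetical function names; the actual system operates on rasters in R.

```python
import math


def idw(target, gauges, power=2):
    """Estimate a value at target=(x, y) from gauges=[((x, y), value), ...]."""
    numerator = 0.0
    denominator = 0.0
    for (gx, gy), value in gauges:
        distance = math.hypot(target[0] - gx, target[1] - gy)
        if distance == 0:
            return value  # cell coincides with a gauge: use its value directly
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


def watershed_mean(cells, gauges):
    """Mean of IDW estimates across the (x, y) centers of all raster cells."""
    estimates = [idw(cell, gauges) for cell in cells]
    return sum(estimates) / len(estimates)
```

A cell at distance 1 from a gauge reading 0 and distance 2 from a gauge reading 10 receives weights of 1 and 0.25, giving an estimate of 2.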

Due to the orographic effect in mountainous regions, precipitation depth at a given elevation can be estimated from a local, linear relationship (Hevesi et al. 1992). Daily precipitation depth in the MacroSheds dataset is computed as a weighted ensemble of two predictions, one generated by IDW (weight = 1) and the other from the empirical elevation-precipitation relationship among all domain-associated gauges (weight = coefficient of determination). On days for which fewer than three precipitation gauges are in operation, only the IDW prediction is used.
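The ensemble might be sketched in Python as follows. Two details are assumptions not stated in the text: that the two predictions are combined as a weighted mean normalized by the sum of the weights, and that the elevation relationship is fit by ordinary least squares. Function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (intercept, slope, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return mean_y, 0.0, 0.0  # no usable relationship: weight becomes zero
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)


def ensemble_precip(idw_estimate, cell_elevation, gauge_elevations, gauge_precip):
    """Combine IDW (weight 1) with an elevation-based prediction (weight R^2)."""
    if len(gauge_elevations) < 3:
        return idw_estimate  # too few gauges in operation: IDW only
    intercept, slope, r_squared = fit_line(gauge_elevations, gauge_precip)
    elevation_estimate = intercept + slope * cell_elevation
    # Assumed normalization: divide by the sum of the two weights.
    return (idw_estimate + r_squared * elevation_estimate) / (1.0 + r_squared)
```

When elevation explains none of the variance among gauges, the weight on the second prediction is zero and the result reduces to the IDW estimate.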

Flux calculation

In Version 2 of the MacroSheds dataset, we will include cumulative monthly and annual solute flux estimates for each site. For now, we provide discharge, precipitation, and concentration data, and allow users to compute daily solute flux or volume-weighted concentration (VWC) via the “macrosheds” R package, using the ms_calc_flux function. Solute flux is computed according to Eqs. 1 and 2,
$$F_s = \frac{Q\,C_s}{A} \tag{1}$$
$$F_p = P\,C_p \tag{2}$$
where Fs and Fp are solute flux in stream water and precipitation, Q is discharge, P is mean precipitation depth over the watershed, C is solute concentration, and A is watershed area. F is reported in kg ha−1 d−1, and is calculated on each day for which Q or P, and corresponding C, are measured or interpolated. If ms_status or ms_interp are equal to 1 for either factor (i.e., if either record has been flagged as “questionable” or has been interpolated by the MacroSheds processing system), resulting F inherits the same.
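For readers who want to see the unit handling behind daily flux (stream flux as discharge times concentration divided by area, precipitation flux as depth times concentration), a Python sketch is given below. It is not the ms_calc_flux function from the “macrosheds” R package. The input units (discharge in L s−1, concentration in mg L−1, precipitation in mm, area in ha) and all function names are assumptions made for illustration.

```python
SECONDS_PER_DAY = 86_400
MG_PER_KG = 1_000_000
LITERS_PER_MM_HA = 10_000  # 1 mm of water over 1 ha is 10 cubic meters


def stream_flux(q_l_per_s, c_mg_per_l, area_ha):
    """Daily stream solute flux in kg ha^-1 d^-1: Q * C / A."""
    kg_per_day = q_l_per_s * SECONDS_PER_DAY * c_mg_per_l / MG_PER_KG
    return kg_per_day / area_ha


def precip_flux(p_mm, c_mg_per_l):
    """Daily precipitation solute flux in kg ha^-1 d^-1: P * C."""
    return p_mm * LITERS_PER_MM_HA * c_mg_per_l / MG_PER_KG


def inherit_flag(flag_a, flag_b):
    """A flux record is flagged (1) if either contributing record is flagged."""
    return 1 if flag_a == 1 or flag_b == 1 else 0
```

`inherit_flag` applies equally to ms_status and ms_interp, called once for each.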
VWC is computed according to Eq. 3,
$$\mathrm{VWC} = \frac{\sum_{i=1}^{N} C_i V_i}{\sum_{i=1}^{N} V_i} \tag{3}$$
where N is the number of days in the aggregation period (e.g., a month or a year), C is solute concentration, and V is daily volume of streamflow or precipitation.
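Volume-weighted concentration, the sum of daily concentration times volume divided by the sum of daily volume, can be sketched in Python as below. Skipping days on which either quantity is missing is an assumption of this sketch, as is the function name.

```python
def volume_weighted_concentration(concentrations, volumes):
    """Return sum(C_i * V_i) / sum(V_i) over days where both are present.

    Returns None when the total volume is zero or no complete days exist.
    """
    pairs = [
        (c, v)
        for c, v in zip(concentrations, volumes)
        if c is not None and v is not None
    ]
    total_volume = sum(v for _, v in pairs)
    if total_volume == 0:
        return None
    return sum(c * v for c, v in pairs) / total_volume
```

High-flow days dominate the result: concentrations of 1 and 3 with volumes of 1 and 3 give 2.5 rather than the unweighted mean of 2.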

Technical validation

Quality control (QC) practices in watershed ecosystem science are almost as diverse as watersheds themselves; however, there are common currents that run through every QC flag and comment. For example, if a sensor is buried in sediment for a week, that week's data should be omitted from analyses. Likewise with a sensor that is wildly malfunctioning or a water sample that is severely contaminated. Ultimately, when data are analyzed, each record is included, omitted, or included with caution. Thus, we have distilled each domain's QC flags and comments down to either “bad data,” which is excised during processing, “questionable,” or “clean.” If a flag definition or comment makes any mention of insufficient sample volume, minor contamination, sensor drift, or some other condition that could, but does not necessarily, invalidate the corresponding record, we designate it “questionable,” and set its ms_status value to 1. Only if flags and comments are absent, or specify no issues of potential concern, do we designate a record “clean,” and set its ms_status to 0.

Almost every domain reports per-observation QC flags or comments of some kind. When these are restricted to a predetermined set that is well documented, parsing their meanings is straightforward. In some cases, flags and/or comments are free-form and quite difficult to catalog. Like other obstacles to data harmonization, QC flag proliferation can be resolved by using professionally managed data repositories, where metadata standards control flag values and definitions by design. In attribution_and_intellectual_rights_timeseries.csv, MacroSheds data users can find DOIs and source URLs of primary time-series data and metadata, where fully detailed flag information can be found.

The MacroSheds processing system currently performs minimal QC beyond assimilating primary source flags and comments; however, we do filter each time-series record through a very loose “range check,” intended to ensure that physically impossible values that happen to have evaded primary source QC are omitted from our aggregate dataset. Minimum and maximum reasonable values have been chosen so as not to risk any encroachment on the true natural range for each variable. A full list of these filter ranges can be found in range_check_limits.csv on EDI. Beyond range checking, we currently rely on the expertise of primary data providers to publish data that have been vetted. We intend to implement more sophisticated anomaly detection in a subsequent release of the MacroSheds dataset and portal.
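A loose range check of this kind might look like the following Python sketch. The limits in the usage example are placeholders, not the values in range_check_limits.csv, and variables without a listed range pass through unfiltered by assumption.

```python
def range_check(records, limits):
    """Keep (variable, value) records that fall within the variable's limits.

    limits maps variable name to (minimum, maximum), inclusive.
    """
    unbounded = (float("-inf"), float("inf"))
    kept = []
    for variable, value in records:
        low, high = limits.get(variable, unbounded)
        if low <= value <= high:
            kept.append((variable, value))
    return kept


# Placeholder limit for illustration only.
example_limits = {"discharge": (0.0, 1e9)}
example = range_check([("discharge", 12.5), ("discharge", -3.0)], example_limits)
```

Here the negative discharge record is dropped and the other is retained.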

Data use and recommendations for reuse

The MacroSheds dataset is intended to provide analytical material for diverse investigations of watershed form and function. It is especially suited to comparing watersheds in terms of inputs and outputs of energy and material. In addition to precipitation, solute chemistry, and streamflow time-series data, it contains a comprehensive set of potentially predictive watershed attributes for each of 177 stream monitoring sites. A visual summary of relationships between watershed attributes and stream solute concentrations reveals strong correlations between land development and major anion concentration in streams, and between bedrock chemistry and inorganic ion concentration, possibly mediated by weathering (Fig. 7). These and other relationships may be used to classify watersheds. They may also be leveraged in the fitting of statistical models, or the training of machine learning algorithms to predict watershed solute outflows from watershed features. To our knowledge, the MacroSheds dataset is the most comprehensive analysis-ready collection of watershed biogeochemical data for North America. As of this writing, there is also a soon-to-be-published CAMELS-Chem dataset, which supplements 506 of the original CAMELS sites with measurements of 18 common stream chemistry constituents (Sterle et al. 2022).

Pearson correlations between a subset of MacroSheds watershed attributes and concentrations of major solutes by category. Concentrations were computed as mean annual volume-weighted concentration, in equivalents where applicable. Numbers inside each circle represent the number of sites included in the correlation. Records prior to 01 January 2000 were omitted before computing correlations.

The MacroSheds dataset can also be used as a small-watershed supplement to hydrological datasets like CAMELS and GAGES-II. Note that in addition to the original US-based CAMELS dataset, there are now equivalent products for Chile (Alvarez-Garreton et al. 2018), Great Britain (Coxon et al. 2020), Brazil (Chagas et al. 2020), and Australia (Fowler et al. 2021). These and others have been merged into a single resource called Caravan (Kratzert et al. 2023).

Because MacroSheds time-series data are currently represented at daily intervals, this dataset is not well suited to sub-daily analyses, such as those focused on stormflow dynamics. A future version may include time-series data at 15-min resolution.

To meet acceptable use requirements of the MacroSheds dataset, one must comply with the licensing and IR stipulations of all applicable primary sources. At minimum, this entails citing the MacroSheds dataset (Vlah et al. 2022), which is linked to source datasets through Ecological Metadata Language provenance. However, users must first check section 4.1 of data_use_agreements.docx, where our datasets are tiered according to the restrictiveness of source data licenses, as some sources require additional compliance. In any case, we provide tools that make citation/acknowledgement of all or a subset of MacroSheds data sources trivial, and we recommend acknowledgement/citation of source datasets even where attribution is not required. The first tool, for users of the “macrosheds” R package, is the ms_generate_attribution function, which produces a list of acknowledgements, citations, contact emails, and IR notifications based on a given data.frame in MacroSheds format. We also provide attribution_and_intellectual_rights_timeseries.csv and attribution_and_intellectual_rights_ws_attr.csv, which contain essentially the output of the ms_generate_attribution function, assuming the entire MacroSheds dataset is being used. The content of these documents can be copied and pasted, in whole or in part, depending on how much of the overall dataset is actually used.

Future directions for the MacroSheds project

Future developments will focus on the longevity of the MacroSheds project through targeted outreach and by better enabling community contribution. Outreach efforts will focus on encouraging data managers to leverage the FAIR-by-design standards of professionally managed data repositories like EDI and DataONE and to adopt open data licenses where possible. The long-term success of living, synthetic datasets like MacroSheds depends on consistency of source data and metadata from version to version, or at least predictability of changes (e.g., to file names). The long-term continued growth of MacroSheds will be aided by community contribution, inspired by the success of StreamPULSE (streampulse.org) and other projects that add value to user-uploaded datasets, incentivizing contributions that eventually become public. Toward this end, we plan to adapt the MacroSheds data processing system into an interactive web application complete with QC, which will allow anyone with stream data to delineate and summarize watersheds, estimate flux, and so on, and contribute to the MacroSheds dataset after an optional embargo period.

In the near term, the MacroSheds team will continue to identify and assimilate data from established watershed ecosystem studies. Globally, there are many networks of watershed observatories that we hope to coalesce into a more international MacroSheds dataset. These include ECN (United Kingdom; Lane 1997), SAEON (South Africa; Van Jaarsveld et al. 2007), CERN (China; Fu et al. 2010), TERENO (Germany; Zacharias et al. 2011), TERN (Australia; Karan et al. 2016), OZCAR (France; Gaillardet et al. 2018), eLTER (Europe; Mollenhauer et al. 2018), and ILTER (global; Mirtl et al. 2018).


This work is funded by the National Science Foundation's Macrosystems Biology and NEON-Enabled Science (MSB-NES) program. The steering committee includes Drs. Jill Baron, Nandita Basu, Emma Rosi, Megan Jones, and Kaelin Cawley. The authors are grateful to the countless technicians, data managers, students, scientists, and funders of the watershed studies that make up MacroSheds, without whose efforts this synthesis would not be possible: Bear Brook (Maine); CZNet: Boulder Creek, Calhoun, Catalina-Jemez, Susquehanna Shale Hills; USDOE: East River, Walker Branch; Krycklan Catchment Study; LTER: Arctic, Baltimore Ecosystem Study, Bonanza Creek, Hubbard Brook Experimental Forest, Andrews Experimental Forest, Konza Prairie, Luquillo, McMurdo Dry Valleys, Niwot Ridge, Plum Island Ecosystems, Santa Barbara Coastal; USFS: Fernow, King's River, Santee, South Umpqua Experimental Watersheds; USGS. The authors would also like to thank Dr. Anthony Castronova and CUAHSI for providing server space, Kevin Worthington (CSU) for help setting up a tiling service for interactive web-maps, Drs. Corinna Gries and Mark Servilla of EDI for their assistance with metadata generation and dataset archiving, and Dr. Nick Marzolf and Cody Flagg for feedback on the MacroSheds portal and dataset.