An international laboratory comparison of dissolved organic matter composition by high resolution mass spectrometry: Are we getting the same answer?

High‐resolution mass spectrometry (HRMS) has become a vital tool for dissolved organic matter (DOM) characterization. The upward trend in HRMS analysis of DOM presents challenges in data comparison and interpretation among laboratories operating instruments with differing performance and user operating conditions. It is therefore essential that the community establishes metric ranges and compositional trends for data comparison with reference samples so that data can be robustly compared among research groups. To this end, four identically prepared DOM samples were each measured by 16 laboratories, using 17 commercially purchased instruments, using positive‐ion and negative‐ion mode electrospray ionization (ESI) HRMS analyses. The instruments identified ~1000 common ions in both negative‐ and positive‐ion modes over a wide range of m/z values and chemical space, as determined by van Krevelen diagrams. Calculated metrics of abundance‐weighted average indices (H/C, O/C, aromaticity, and m/z) of the commonly detected ions showed that hydrogen saturation and aromaticity were consistent for each reference sample across the instruments, while average mass and oxygenation were more affected by differences in instrument type and settings. In this paper we present 32 metric values for future benchmarking. The metric values were obtained for the four different parameters from four samples in two ionization modes and can be used in future work to evaluate the performance of HRMS instruments.

High-resolution mass spectrometry (HRMS) has become a central tool in the analysis of dissolved organic matter (DOM), due to seminal work reported over the last three decades (Fievre et al. 1997;Kujawinski et al. 2002;Stenson et al. 2002;Stenson et al. 2003;Dittmar and Koch 2006;Sleighter and Hatcher 2007;Reemtsma et al. 2008;Gonsior et al. 2009) and the detailed biogeochemical insight afforded by molecular compositional patterns in large and diverse sample sets (Flerus et al. 2012;Jaffé et al. 2012;Kellerman et al. 2014;Lechtenfeld et al. 2014;Hertkorn et al. 2016;Drake et al. 2019). Since Fievre et al. collected the first HRMS spectrum of DOM in 1997 (Fievre et al. 1997), there have been large increases in both the number of researchers using HRMS for DOM analysis and the variety of instrument types being employed. This upward trend has been facilitated by the propagation of commercial Fourier transform ion cyclotron resonance (FT-ICR), Orbitrap, and high-resolution quadrupole time of flight (q-TOF) mass spectrometers (Hawkes et al. 2016;Lu et al. 2018;Pan et al. 2020). While the application of HRMS for the molecular-level assessment of DOM continues to grow, leading to improved methods that are adopted by the research community, so do the challenges with data comparison and interpretation that arise from the use of instruments with different resolving powers (mass/[full width at half maximum] at m/z centroid; m/Δm), source conditions, ion optics, and users.
The wide range of potential conditions for analysis, resulting interpretation of the spectra, and the conclusions based on those interpretations poses potential problems with common interpretation across the community using different instruments, laboratories, and practices. As both climate and ecosystems change (Williamson et al. 2015;Drake et al. 2019), it is critical for the community to understand the degree to which archived data can be compared to newer measurements, and if data generated among various research groups can be robustly compared and related to emerging global trends.
In the vast majority of recent DOM research, HRMS has been coupled to electrospray ionization (ESI), as this 'soft' ionization technique allows intact molecular ions to enter the mass spectrometer without fragmentation (Fenn et al. 1989;Henry et al. 1989;Novotny et al. 2014). With ESI-HRMS, signal response is not necessarily linearly related to the concentration of analyte (Kujawinski et al. 2002), or the linear response range may be very narrow. Signal magnitude is rather a function of concentration and ionization efficiency, but ionization efficiency can fluctuate as a function of the sample matrix (for example, the ionic strength, pH, and analytical complexity of the sample) (Tang and Kebarle 1993;Brown and Rice 2000;Oss et al. 2010) and the instrument tune settings. Owing especially to matrix signal suppression, direct infusion ESI-HRMS is considered to be a "qualitative," or at best (under controlled conditions), a "semiquantitative" technique for untargeted analysis of complex mixtures Mopper et al. 2007;Dubinenkov et al. 2015;Liu and Kujawinski 2015;Luek et al. 2018). This implies that, when constructing response vs. concentration relationships, standard reference solutions need to be highly matrixmatched, but absolute analyte concentration cannot be quantified without standards. Even within the data acquisition, the signal response is not uniform across the m/z range. The combination of the user-defined ionization source conditions and HRMS instrument settings for ion optics and detection can often favor low m/z values vs. high m/z values or vice versa-this translates to biases in detection and the potential for variability in relative ion abundance results among instruments (Cao et al. 2016;Kew et al. 2018a). Such biases may confound molecular-level assessments of DOM chemical assignments and interpretations and limit the ability to utilize datasets generated by different research groups, for example, for meta-analyses of ecosystem trends.
A further concern is the lack of "standards" for complex DOM mixtures that limits the possibilities for standardizing reproducible DOM HRMS results and enabling uniform assessment of instrument performance. Currently, the research community relies on 'standard' reference materials provided in large quantities by the International Humic Substances Society (IHSS), like Suwannee River fulvic acid (SRFA). These reference materials, described as a collection of humic acids, fulvic acids, and other isolates, are routinely collected and homogenized, and have been used throughout the history of HRMS, setting the foundation to understand complex DOM mixtures (Fievre et al. 1997;Stenson 2008;Witt et al. 2009;Gaspar et al. 2010;Herzsprung et al. 2015;Kew et al. 2017;Simon et al. 2018). These reference materials are often used as measures of instrument performance in studies that analyze newly collected samples, either as a DOM comparison or to control HRMS settings (Koch et al. 2005;Mangal et al. 2016;Li et al. 2017;Hawkes et al. 2018b;Solihat et al. 2019). Even though the reference samples are highly processed to ensure an appropriate degree of batch-to-batch uniformity, they still contain a high level of chemical complexity, and are considerably more complex than other commercially available laboratory chemicals.
Previous HRMS studies have defined important metrics with respect to instrument performance and capabilities, such as signal to noise ratio, mass measurement accuracy, and resolving power (Marshall et al. 1998;Hawkes et al. 2016;Simon et al. 2018;Smith et al. 2018), have extended optimizations based on IHSS reference materials such as Suwannee River and/or Pony Lake Fulvic Acids (SRFA and PLFA) (Stenson et al. 2003;Koch et al. 2005;D'Andrilli et al. 2013D'Andrilli et al. , 2015Mangal et al. 2016). Currently, however, laboratories have little access to raw data from other research groups, limiting detailed comparison of instrument performance and results.
The objective of this study was to investigate the differences in DOM assessments from the same samples across multiple instruments due to variations in the employed HRMS instrumentation and the optimized settings developed by different laboratories. This study is a first step toward understanding the interpretational variability caused by different laboratories' established protocols, routines, and instrumental differences. The two goals of this study were (1) to evaluate whether different HRMS instruments can identify the same molecular-level trends in chemical composition from a range of samples processed by an identical formula assignment protocol and (2) to provide metrics (identified variables and calculated data ranges) for the HRMS DOM community that can be used for reference when developing new methods or validating new datasets. For this work, a set of IHSS reference materials was used that represents a compositional gradient of DOM commonly found along freshwater-associated environmental continua. The current lack of a seawater reference material prohibited the inclusion of a sample that represented marine systems. Nonetheless, altogether, these IHSS reference materials represent important DOM endmembers that may relate to similar complexities found in marine environments. For example, PLFA and marine environments both represent microbially sourced end-members and therefore may share some DOM production pathways and measurable compositional similarities (Kellerman et al. 2018;Zark and Dittmar 2018).
Four identically prepared DOM samples were sent to 16 laboratories operating 17 commercially available HRMS instruments, including various magnetic field strengths of FT-ICR mass spectrometers and various models of Orbitrap mass spectrometers (Table 1). The samples were SRFA, PLFA, Elliot soil fulvic acid (ESFA), and Suwannee River Natural Organic Matter (SRNOM). Laboratories were instructed to tune their instrument to maximize the total signal for SRFA and then measure the samples in positive-and negative-ion mode ESI using their independently developed optimized settings. The data interpretation was conducted by one centralized method to exclude all molecular formula assignment variability that might otherwise originate from different data processing algorithms (detailed descriptions in Section 1.2).
After calibration and molecular formula assignment, the data (sum-normalized ion abundances) for all instruments were aligned based on formula assignments. The instruments were compared based on the number of ions with an assigned formula and by comparing ion abundance patterns among the four samples using statistical techniques. Further, the average values for the metrics were calculated for each sample in each ESI mode using common ions that were found by the majority of instruments (i.e., relative abundance weighted averages of oxygenation, O/C wa ; hydrogen saturation, H/C wa ; mass to charge ratio, m/z wa , and modified aromaticity index; AI mod wa [Koch and Dittmar 2016]). The calculated ranges of each metric are presented, along with a discussion of outliers relative to average results found in the study. The results can be used by the community to facilitate comparison of HRMS DOM data from both past and future data sets that use similar instruments and any of the same reference materials.

Materials
Sample selection Four DOM reference materials were obtained from the IHSS (http://humic-substances.org/). Suwannee River fulvic acid (SRFA; 2S101F) and Suwannee River Natural Organic Matter (SRNOM; 2E101N) were collected from the blackwater Suwannee River, Georgia, which originates from the Okefenokee peat swamp. The fulvic acid sample was prepared by isolating DOM onto XAD-8 resin from filtered water that was acidified to pH < 2, followed by elution with NaOH, and precipitation of "humic acids" at pH = 1.0. The fulvic acids remain in solution at this low pH. The SRNOM sample was prepared by reverse osmosis from the same site (Green et al. 2015). Pony Lake Fulvic Acid (PLFA; 1R109F) was prepared by the same procedure as SRFA at pH = 2.0 (AE0.1) and originates from Pony Lake, Antarctica, which is characterized by the absence of higher order plants. The site is dominated by autochthonous, microbially derived carbon (Brown et al. 2004). Elliot soil fulvic acid (ESFA; 5S102F) was prepared by the same protocol as SRFA after soil homogenization of a fertile terrestrial soil commonly found in Illinois, Indiana, and Iowa. Detailed information regarding IHSS standard (SRFA, SRNOM, and ESFA) and reference (PLFA) material sample locations, preparation, and chemical data can be found on the IHSS website: http://humic-substances.org/.

Sample preparation
Each powdered sample (2 mg) was added to separate 2 mL combusted amber glass vials and dissolved in 2 mL of HPLC Table 1 List of participating institutions, country of institution, and mass spectrometer instrument type including magnetic strength or model. All FT-ICRs are manufactured by Bruker Daltonics and use an Apollo II ESI source (Bremen, Germany), all Orbitraps are manufactured by Thermo Fisher and use an Ion Max API ESI source (Bremen, Germany).
grade methanol by vigorous shaking, creating stock solutions of individual reference materials. Using sterile pipette tips, 40 μL of each "standard" was pipetted into separate 2 mL combusted amber glass vials, dried under nitrogen gas, and capped (64 samples total). Sample volumes were calculated to achieve~20 mg L −1 of carbon (C) once redissolved in 1 mL solvent (assuming 50% C). One sample set, containing SRFA, ESFA, PLFA, SRNOM, and an empty, sealed 2 mL combusted amber glass vial, was shipped to each of the 16 participating laboratories (80 vials total). Upon sample arrival, each participating laboratory prepared 50-100 mL ultraclean 50% methanol (LCMS grade methanol in 18.2 MΩ ultraclean water, sourced in each laboratory individually) in combusted glassware. One milliliter of clean 50% methanol was added to each sample vial using methanol-washed glass syringes. One milliliter of clean 50% methanol was also added to the empty vial and served as each laboratory's methodological blank for the experiment. DOM samples were sonicated for 5 min to ensure complete dissolution prior to analyses. One Laboratory, M, diluted their samples to 5 mg L −1 C before analysis, while the others analyzed at 20 mg L −1 C. Samples were stored in the dark at 4 C pending MS analysis.

Procedures
Participating laboratory instructions This study included data from 16 HRMS DOM community members in laboratories spanning eight countries of the Northern Hemisphere (Table 1). Each laboratory received a sample set prepared at Montana State University (Bozeman, Montana) and an identical set of instructions (established by Hawkes and D'Andrilli) specific to sample storage and reconstitution, method blank preparation, acquisition in positiveand negative-ion mode, and data exportation.
Each instrument was assigned a code letter from A-G (Orbitraps) and H-Q (FT-ICRs) for laboratory and instrument anonymity. This minimized focus on each instrument's perceived performance, and instead placed emphasis upon the variability or similarity of the data.

Electrospray ionization and HRMS tune settings
The ESI sources were thoroughly cleaned with LCMS grade methanol or ultraclean 1 : 1 methanol : water and tuned to maximize the signal in negative-and positive-ion modes using the SRFA sample and according to each laboratory's own optimization routines. All five samples were analyzed in the m/z range of 150-1000, and the resulting data were exported as mass lists (signal to noise ratio [S/N] ≥ 4 for FT-ICR instruments and all data on Orbitraps) in comma delimited (.csv) format with three to four columns: "m/z," "Intensity," and "Resolving Power," plus "Noise" for the Orbitrap analyzers. Five further data acquisitions of SRFA were carried out on instrument A (Orbitrap) and L (FT-ICR) in order to assess intra-instrument variability.

Data processing
All data processing (mass spectral calibration and molecular formula assignments) was conducted in MATLAB version R2017b with the Statistics and Machine Learning Toolbox (v11.2), Bioinformatics Toolbox (v.4.9) and R (R v.3.6.1) with vegan package at Uppsala University, Sweden. The raw mass lists and data processing code are available in Data S1. Each sample was processed separately to recalibrate the mass axis, remove noise, and then assign molecular formulas to individual peaks (Fig. 1). This post-acquisition processing was executed on all data in an identical manner to minimize differences arising from differing software programs and algorithms used by various laboratories, and also eliminated potential user error, thereby ensuring consistency among the data sets.

Noise removal
Noise was assessed using the concepts from the "KMD slice" method from the R package MFAssignR (Schum et al. 2019), which was modified from Riedel and Dittmar 2014. Briefly, noise level is calculated based on signals with highly improbable mass defects that are likely to arise from electronic noise. In this case, the mass defects selected were calculated as a window of Kendrick mass defects (KMDs). Masses were converted to Kendrick mass (KM) based on CH 2 as KM = mass × 14/14.01565, and the KMD was then computed by subtraction of KM from nominal mass. The KMD window for noise was taken as 0.0011232(KM) + 0.05 to 0.0011232(KM) + 0.2. All peaks in this noise window were collected, and the 99 th percentile of their abundances was taken as the upper limit of noise, in order to allow the most intense 1% of these peaks to be considered as potential analytes. Mass spectra from samples that exhibited fewer than 100 peaks in the KMD window were not subjected to noise reduction treatment. In the rest of the samples, peaks throughout the spectrum below the upper limit of noise were removed ( Fig. 1).

Theoretical formula list
To assign formulas, a theoretical neutral molecule formula list was generated based on the following constraints: C 4-50 , H 4-100 , O 2-40 , N 0-2 , S 0-1 , 13 C 0-1 , 150 < m/z < 1000, 0.3 ≤ H/ C ≤ 2.2, 0 < O/C ≤ 1.2, KMD ≤0.4 or ≥ 0.9, valence neutral (nitrogen rule), and double bond equivalents minus oxygen (DBE-O) ≤ 10 (Herzsprung et al. 2014). Beyond C c H h O o containing molecular formulas, heteroatomic or isotopic formulas were allowed to contain one of the following: N 1-2 , S 1 , or 13 C 1 . Formulas above m/z 500 were restricted from N 2 assignments. In positive-ion mode, the S-containing formulas were removed from consideration to reduce mis-assignments due to variable resolving power among instruments. In positive mode all formula masses were calculated as sodium cation adducts (C c H h O o N n Na 1 ). In negative-ion mode, theoretical formula masses were calculated as deprotonated analytes (M − H) − . This created theoretical mass lists with 75,059 entries in negative-ion mode and 54,847 entries in positive-ion mode. These formula lists were chosen after close inspection of the data-Na adducts dominated over protonated species in positive-ion mode, and due to the narrow peak split between NaH and C 2 (2.4 mDa), could not be fully resolved over the full mass range on the lower resolving power instruments (i.e., Orbitraps), as this requires m/Δm > 3 × 10 5 at m/z 400 (Δm is mass split between the two peaks to be resolved). This often led to a single centroid peak in the individual sample peak lists, rather than two, and for this reason, only the more abundant Na ions were considered for molecular formula assignment. The almost doubled number of peaks in each spectrum due to Na adduct formation in positive-ion mode also obscured the S-containing ions at higher masses and lower resolution. Thus, S assignments could not be made by the lower resolving power instruments in positive-ion mode. The objective of this study was comparison of ion abundances for confident assignments and not full sample coverage; therefore, this conservative approach was appropriate. Clearly, more complete sample coverage can be achieved with higher resolving power instruments and more complete formula assignment routines.
Formula assignment continues to be debated in the ESI-HRMS community, and here the conservative approach was taken to more severely constrain assigned formulas as compared to many other studies, bearing in mind that the samples chosen are not among the most complex examples, such as petroleum or mixtures containing fresh metabolites (Gonsior et al. 2019;Palacio Lozano et al. 2019). Rates of formula assignment heavily depend on the resolving power of the chosen instrument. The limitations of our study reflect the fact that multiple instruments with very different resolving powers were included, while only one formula assignment routine was applied throughout. Furthermore, we focused our data analysis to the subset of signals that were reliably detected by most instruments to provide benchmarking data for other users. This approach obviously improves the ability to compare the resulting data, but concomitantly means that a major part of information obtained by high-resolution instruments were left unassigned. Since our goal was to evaluate common peaks that allow comparison of the instruments and data in both this study and future applications, this conservative approach was applied to minimize false positive assignments and provide the most robust metrics and data possible.

Internal calibration
The spectra were first internally calibrated using molecular formulas with DBE-O = −1, 0, or 1. These formulas have previously been shown to be among the most abundant in DOM samples (Herzsprung et al. 2014) and are therefore extremely likely assignments for some high abundance ions. These highly probable formula assignments were thus used to perform a fine internal mass calibration (5th order polynomial) over the full mass range, using a similar approach to Sleighter et al. (2008) (Fig. 1). Three instruments required an initial mass adjustment before the internal calibration due to preacquisition calibrations that were beyond acceptable tolerance (common for initial instrument use)these were Lab M positive mode (42.67 ppm), Lab G positive mode (101.5 ppm), and Lab G negative mode (−98 ppm).

Formula assignment
After calibration with the reduced formula list, detected peaks in the full noise-filtered and calibrated peak lists were assigned to the closest molecular formulas from the full theoretical mass list within a mass tolerance of AE1 ppm (thereby allowing evaluation of the best and worst performance characteristics among the instruments). The large majority of assignments had a mass error < 0.5 ppm across the full dataset (85% negative mode, 87% positive mode), with better mass accuracy being exhibited by higher resolution instruments. In all cases, the next nearest formula from the theoretical peak list was also noted in a separate matrix of potential interferences. These generally had mass errors >2 ppm. Each assigned peak was thereby attributed to only a single formula. The ion abundance, mass error, and closest alternative formula's mass distance were recorded for subsequent analyses and error monitoring. The formula assignments at a single nominal mass were compared and confirmed with previously published values for SRFA and PLFA in negative-ion mode at m/z 311 (D'Andrilli et al. 2013) and positive-ion mode at m/z 411 . The elemental combinations used in this study were sufficiently diverse to assign formulas to the majority of detected ions in the datasets (usually > 70% of total signal abundance; Fig. S1) but may require modification when analyzing different samples or when using different ionization methods. Some low abundance peaks were not assigned in most datasets, either due to the elemental and isotopic constraints or due to the requirement for single charge. This conservative approach reduced false positive assignments.
Contaminant and rare peak removal Contaminant ions for each instrument were defined as those detected in the blank sample at > 20% of the maximum abundance (base peak) ion of the blank mass spectrum, and such contaminant ion abundances were adjusted to zero in all datasets from the respective instrument. This high threshold allows low abundance carryover to exist in the blank and prevents erroneous "blanket subtraction" of low abundance ions that may represent real DOM signals in the samples. Finally, ion formulas from all samples and instruments were aligned separately in negative-and positive-ion mode, and this list was cropped to only include formulas that were found in > 5 analyses throughout the entire data set and to not include 13 C isotopologues. The value of 5 was chosen so that each ion had to be found in at least two HRMS instruments. The resulting matrices contained 9291 rows (distinct monoisotopic formulas) in negative-ion mode and 7945 rows in positive-ion mode.
Labs G, H, and N did not provide positive-ion mode data, so positive-ion mode data was generated from three fewer instruments.
Classification of commonly assigned ions, metrics, composition dissimilarity, and compound classes Assigned ions that were common for a given sample across all instruments in each ionization mode or in allbut-one instrument were labeled as "common," and these ions were used for the metric calculations (separated for positive-and negative-ion modes). Ions detected in fewer than n -1 instruments were categorized as "50%-n -1" (at least half of instruments, but not common), "3-50%" (at least three instruments but fewer than half ), and "1-2" (fewer than three instruments). Ions that were confidently detected and assigned by at least three instruments (i.e., the sum of the first three categories) were categorized as "detected ≥ 3" ions.
Each metric value was calculated using the common ions for each sample for all instruments. The common ion signals were normalized so that abundances summed to 10,000, and the weighted average (wa) metrics (O/C, H/C, m/z, and AI mod ) were calculated from the common ion formulas k = 1 : n as in Eq. 1, where X is the metric, n is the number of common ions, and I is the relative ion abundance.
To assess compositional dissimilarity between paired samples, the percent Bray Curtis dissimilarity (%BC dissimilarity) was calculated for the common formulas k = 1 : n between samples p and q as in Eq. 2. Because the common ion list was different for each reference material, the list of ions included for consideration included all ions that were "common" in at least one of the reference samples.
%BC dissimilarity is a useful tool for HRMS datasets because it allows zeros and takes into account abundance information. %BC dissimilarity approaches 100% when samples have no assigned formulas in common, thus signifying completely different DOM molecular composition. The full dissimilarity matrix was used for principal coordinate analyses (PCoA) to obtain Eigenvector scores using classical multidimensional scaling with the cmdscale function of MATLAB. Cluster center positions were calculated for each sample in the multidimensional principal coordinate space, and distances to cluster centers were calculated for every dataset with the betadisper function in R (vegan package). This was done to calculate dispersion of the instrumental data from the average value for each sample.
Molecular formulas were assigned to compound classes based on atomic ratios. Relying upon the characteristic clustering of DOM data on van Krevelen diagrams, diverse chemical class divisions have been suggested Kellerman et al. 2014;Rivas-Ubach et al. 2018). For simplicity, in this study, we used five broad chemical compound classes, based on O/C, H/C, and AI mod (Koch and Dittmar 2016). The classes were "aliphatic" (H/C ≥ 1.5), "low O unsaturated" (H/C < 1.5, AI mod < 0.5, O/C < 0.5), "high O unsaturated" (H/C < 1.5, AI mod < 0.5, O/C ≥ 0.5), "aromatics" (0.5 < AI mod < 0.67), and "condensed aromatics" (AI mod ≥ 0.67). These classes represent broad compositional groups from HRMS data projected on van Krevelen diagrams and they are presented here to highlight key chemical differences between the samples and to demonstrate the inter-instrument variability in interpreting these results. At the same time, the broad grouping prevents us from overinterpreting molecular formula data, which allows only limited structural insight.

Assessment
Common, detected ≥ 3, and total assigned ions In total, data for the samples and blanks was compared from 17 and 14 different instruments in negative-and positive-ion modes, respectively. The number of detected ≥ 3 ions detected per sample varied widely across instruments from~1500 to~6000 in negative-ion mode and from~1000 to~5000 in positive-ion mode (Fig. 2). The total number of assigned ions did not vary significantly among samples but did vary across instruments (ANOVA: F sample 1.86, F instrument 11.2; Table 2). This variability in peak assignment number was not solely due to the differing resolving powers of the instruments-in fact, resolving power had only a small influence on the number of assigned peaks (Fig. S2).
Instruments with fewer assigned formulas generally had fewer total ions (assigned and unassigned) in each mass spectrum, indicating poorer detection limits rather than lower mass accuracy and assignment ratio (Figs. S1, S3, S4). Unassigned ions, i.e., resolved peaks that did not fit within the elemental constraints applied, were often more prominent at lower m/z where the resolution was better. Orbitraps had consistently greater numbers of unassigned ions, which may be a result of variables such as instrument sensitivity, noise reduction, and variability in mass accuracy. Regardless of the exact cause(s), this finding highlights the need for careful evaluation of formula assignment routines.
The variance in the number of assigned formulas was therefore partially attributable to instrument sensitivity as well as resolving power, ESI settings, the dynamic range and mass range of the instrument, and the calculated level of noise. The number of common ions varied between 622 and 1171 per sample among the four samples in positive-and negative-ion modes (Table 3). These ions were found in the mass range (m/z 223-661 negative, 221-713 positive) where all instruments' analytical windows overlapped (Fig. 3) and are therefore not a complete representation of the full chemical diversity of ionizable organic species contained within the samples. The common peaks made up the majority of abundance (Fig. 4, S5), specifically accounting for the following  for each formula assigned) in negative-ion (left) and positive-ion (right) modes. The median relative abundance (summed to 10,000 for each sample) is shown as line height (cropped at 50) in the mass spectra. Detection rate categories from Table 2 are shown as different colors, with the more common categories shown in the background for mass spectra and in the foreground for van Krevelen diagrams. The overlap in the assigned molecular ions occurred in the central part of the mass spectral distribution (between m/z 300-550). Samples included Elliot Soil, Pony Lake, and Suwannee River Fulvic Acids (ESFA, PLFA, and SRFA) and Suwannee River Natural Organic Matter (SRNOM). The ions that were detected in at least three instruments (detected ≥ 3; Table 3) are probably more accurate measures of the "true" diversity of species that can be ionized and resolved by direct infusion ESI employing "broadband" (full mass range) acquisitions on instruments with resolving powers on the order of 10 5 -10 6 (at m/z 401). Notably, broadband HRMS does not represent the full chemical diversity of the reference samples, just the portion that is ionizable by ESI. Furthermore, isomeric diversity will not be revealed by this approach (Hertkorn et al. 2008;Zark et al. 2017;Hawkes et al. 2018a). Formula lists (common and detected ≥ 3) for each sample in each ionization mode are available in Data S1 and online at https://go.warwick.ac.uk/InterLabStudy.

HRMS metrics for DOM comparison
Calculated weight-averaged values for each metric using the common ions are shown in Table 4. Across the HRMS instruments, for each sample, H/C wa and AI mod,wa values had lower variability in negative-and positive-ion modes than O/C wa and m/z wa values ( Fig. 4; see F instrument values from ANOVA analysis in Table 2). In this diverse IHSS sample set, the DOM sources had a larger effect on composition variability for all metric values than the different instruments (F sample > F instrument , Table 2). The only evaluated variable for which instruments led to higher variability than the DOM source was the total number of assigned peaks ( Table 2). The Orbitrap instruments produced data with lower overall m/z wa values than the FT-ICR instruments in both ESI modes (Fig. 4). In negative-ion mode, the two instrument types were significantly different (Student's t-test, 95% confidence level) for m/z wa for all samples, significantly different for O/C wa for ESFA and PLFA, and significantly different for H/C wa and AI mod,wa only for ESFA. In positive-ion mode, the two instrument types were significantly different for m/z wa for all samples and significantly different for O/C wa only for PLFA. Generally, the two instrument types (Orbitrap and FT-ICR) significantly overlapped for the metrics chosen (except m/z). However, in some cases it may be worth comparing newly obtained metric values in the context of instrument type.
The deviation of each instrument from the median instrument result with regard to O/C wa and m/z wa metric values is shown for negative-ion mode data in Figs. 5 and S6. The trends in ranking order of instruments are similar for each sample in each ionization mode. Clearly, some instruments have consistent metric value offsets across the samples (e.g., instrument B for O/C wa ), while others had highly outlying values, such as instrument I for O/C wa in ESFA. As stated previously, the Orbitrap instruments gave lower metric values for m/z and higher values for O/C, which indicated consistent differences in sensitivity among instrument types.

Reference sample and instrument dissimilarity
Bray Curtis dissimilarities (%BC dissimilarity) were computed among all pairwise combinations of samples for each instrument (Fig. 6). Assessment of pairwise dissimilarities allows evaluation of how DOM composition differences between each sample pair were perceived by the various instruments. Dissimilarities between PLFA, SRFA, and SRNOM were consistent within~20%, while pairwise dissimilarities to ESFA were more variable (40%), but had similar ranges (~20%) when considering only one instrument type (Orbitrap or FT-ICR). These results were consistent between both ionization modes, and the larger dissimilarity with ESFA is likely related to the significant difference in obtained average m/z and O/C between these two instrument types (Fig. 5). For reference, the dissimilarity for five replicate analyses of SRFA on instruments A and L were 5.7% and 5.9%, respectively, using the common ion abundances.
The dispersion homogeneity analysis determined that average distances to cluster centers in principal coordinate space was significantly higher for ESFA in negative-ion mode (Tukey's HSD, p < 0.05), while dispersion was not significantly different for the other reference samples and all four samples in positiveion mode. This method allowed the classification of outliers, as indicated in Fig. 7 (i.e., instrument I for ESFA in negative mode, instrument P for three samples in positive mode). Note that Table 3 Number of peaks detected as ≥ n -1 (common), > 50%, < n -1, ≥ 3-50%, and < 3 instruments for the Elliot Soil, Pony Lake, and Suwannee River fulvic acids (ESFA, PLFA, and SRFA) and Suwannee River Natural Organic Matter (SRNOM) samples after the molecular formula assignments and blank subtraction.  ESFA for instrument I was separated from the other instruments on the third PCoA dimension (not shown). The %BC dissimilarity measured for each sample across instruments was also assessed in individual PCoAs. The resulting PCoA diagrams (first two coordinates) are shown in Fig. 8. In each diagram, the correlation coefficients (Pearson) for the four calculated metrics and total number of assigned peaks with PCoA 1 and 2 are overlaid (significant ones, p < 0.05, are shown in bold purple font). The %BC dissimilarity among instruments generally averaged~24%, except for the more Table 4 Mean and standard deviation (SD) for four metrics (oxygenation, hydrogen saturation, mass to charge ratio, and modified aromaticity index: O/C wa , H/C wa , m/z wa , and AI mod,wa ) for the four IHSS samples: Elliot Soil, Pony Lake, and Suwannee River fulvic acids (ESFA, PLFA, and SRFA) and Suwannee River Natural Organic Matter (SRNOM) in negative-ion and positive-ion mode. These values were calculated for the common ions without outliers, as in Fig. 4, which were defined as points greater than 1.5X the IQR above or below the 75 th and 25 th percentile values (Q3 and Q1), mathematically: > Q3 + 1.5(Q3 -Q1) or < Q1 -1.5 (Q3 -Q1).  disperse ESFA, which averaged 34-35%. Inter-instrument %BC Dissimilarity values were observed to be as low as 7%, close to intra-instrument variability (5-6%). The correlations with obtained metrics indicate that mass tuning was the principal cause of %BC dissimilarity, as m/z wa correlated significantly with the first principal coordinate in every case. In negative-ion mode, O/C wa usually correlated significantly with the second coordinate, and the total number of peak assignments and H/C wa also frequently correlated significantly. In positive-ion mode, O/C wa was typically not significantly correlated, while the number of assignments and saturation (H/C wa or AI mod,wa ) were.

Compound class proportions
Using the full set of assigned formulas, the relative proportions of five compound classes were calculated (Figs. 9, 10, Table 5). In negative-ion mode (Fig. 9), the two most abundant classes were "high O unsaturated" and "low O unsaturated," which signify formulas with H/C < 1.5 and AI mod < 0.5. The ESFA sample had the highest proportion of "aromatic" or "condensed aromatic" formulas (AI mod ≥ 0.5 and ≥ 0.67, respectively), and PLFA had the highest proportion of "aliphatic" formulas (H/C ≥ 1.5). DOM compositional differences found within each reference sample were far greater than differences across instruments (see Table 2 for ANOVA statistics). In line with differences in O/C wa , highest instrumental variability was linked to the molecular groups of "high O unsaturated" and "low O unsaturated" formulas. In the case of ESFA, the proportion of "aromatics" and "condensed aromatics" was particularly variable among instruments (Simon et al. 2018). In positive-ion mode (Fig. 10) the trends were similar, but ESFA had a far greater proportion of 'aliphatic' formulas in this ionization mode (59 AE 11% in positive-ion, 8 AE 3% negative-ion mode; mean AE SD). "Aromatics" and "condensed aromatics," conversely, were poorly ionized in positive-ion mode.

Discussion
Number of assigned peaks, mass ranges, and ion distributions across mass spectra The number of assigned molecular formulas varied greatly among instruments (Fig. 2, Table 2) and did not solely correlate with instrument resolving power (Fig. S2). This is likely a consequence of variable quality of instrumental tuning related to the ESI source, the ICR or Orbitrap cell, and the ion transfer optics. Optimized instrument tuning will improve the detection limit and widen the m/z range, thereby generating a larger list of assigned formulas. Some of the instruments with higher resolving powers would be capable of assigning a wider range of compound classes than were allowed in the formula assignment routine, and so the total assigned peak number was constrained by the conservative approach taken. The minimum instrument resolving power required for any study depends on both the research question and the chemical diversity of the sample. Research that focusses on broad geochemical trends can often be achieved with lower resolving power, e.g., when determining presence and absence of specific chemical compositions or easily resolved ions, and when determining the abundance changes of the most prominent ions is sufficient (Hawkes et al. 2016;Simon et al. 2018). However, some research applications may benefit from maximizing resolving power and extending dynamic range to enable a more thorough assignment of compound classes (Pohlabeln and Dittmar 2015;Smith et al. 2018;Palacio Lozano et al. 2019).
The common ions for a given reference sample covered a broad mass range, and their assignments tended to occupy the central region of the van Krevelen diagram (Fig. 3). This indicates that these chemical species were highly abundant and/or easily ionizable in the samples, and that all instruments obtained a large overlap in the molecular composition of the samples. While the common ions represented the majority of abundance in three reference materials (PLFA, SRFA, and SRNOM), ESFA showed stronger variability due to the major differences obtained between Orbitrap and FT-ICR instruments. Even so, the common ions of a sample were detected consistently across all instruments, so can be considered as a reasonable and robust benchmark for evaluation of instrument performance in DOM-related applications.

Variability in the detection of ions and metric values
Metric values were calculated using commonly detected ions (see definition in Section 1.2.3). The largest instrument variability in these metrics was found in average m/z and O/C ratios ( Table 2). The detection of these common ions may be considered as an anchor point to evaluate instrument performance when assessing a new instrument or when troubleshooting with an established instrument. We suggest that a detection rate of > 95% of common ions is a sensible level for reasonable performance. There were instruments in our dataset that did not achieve this level (e.g., instrument I for ESFA in negative-ion mode), indicating that some tuning of instrument I may be required to analyze ESFA and other soil DOM samples at a level similar to that of other laboratories (Fig. S7). Improvements in tuning or calibration may be made to instrument O in positive-ion mode and instrument D in negative-ion mode. Due to the overall higher variability across instruments for sample ESFA, we suggest not to choose ESFA as a routine standard material for instrument evaluation.
The weight-averaged metric values of commonly detected ions can be used to assess instrument tuning bias, particularly with regards to average mass and oxygenation. Saturation levels (H/C or AI mod ) were more consistent across instruments and can be used as effective guides for instrument comparability, or may be used as benchmarks to gauge tuning acceptability for new instruments to give results in a context comparable to those of the international community (Pan et al. 2020). (top) and positive-ion mode (bottom), using common ions. Each instrument is indicated by its designated letter and Orbitraps are orange and FT-ICRs are blue. The PCoA scores are normalized in both dimensions so that the highest value is 1 (scale not shown). Overlaid are Pearson's correlation scores showing covariance of the metric values and the PCoA score. A higher correlation is therefore manifested as a longer arrow, and the direction of the arrow indicates which axis the metric correlates with. Metric values with a significant correlation at p < 0.05 are indicated in bold and purple. The minimum, median, and maximum %BC dissimilarity among instruments is annotated at the bottom of each plot. Samples included Elliot Soil, Pony Lake, and Suwannee River fulvic acids (ESFA, PLFA, and SRFA) and Suwannee River Natural Organic Matter (SRNOM).

Bias due to ionization
The choice of ionization technique and polarity has a well-known and large effect on the ions produced and detected from any complex mixture (Hertkorn et al. 2008;Hockaday et al. 2009;Barrow et al. 2010;Hertzog et al. 2017;Kew et al. 2018b). ESI is a popular choice, and negative-ion mode is the most commonly selected mode in aquatic biogeochemistry because DOM is acidic by nature and bears many O-containing functional groups (Perdue and Ritchie 2013). However, these technical choices lead to specific perspectives on analytical mixtures (D'Andrilli et al. 2010;Gross 2010). The selectivity of the ESI mode can be shown in our data simply by comparing negative-ion and positive-ion mode data. The O/C wa and H/C wa metric values in negativeion mode were similar to the published bulk elemental ratios of these mixtures (Fig. 4). The positive-ion mode O/C wa and H/C wa metric values were considerably different from the published bulk elemental ratios. It is unknown how much of each sample mixture is ionized in representative abundances, but these findings suggest that negative-ion mode ESI recovers a more representative portion of the mixture, albeit at lower oxygenation. Indications are that positive-ion mode is better suited to sensitively investigate aliphatic compounds, with higher average H/C ratios and lower aromaticity (Hertkorn et al. 2008;Hockaday et al. 2009).
Each HRMS instrument was tuned to analyze a specific ionizable portion of the material present in the samples (Fig. S8). It is unlikely that any instrument at any particular tune setting can capture the full representative distribution of ionizable compounds in these complex DOM mixtures in a single broadband analysis (Southam et al. 2007;Hawkes et al. 2019;Palacio Lozano et al. 2019). Indeed, the various instruments produced a relatively broad range in O/C wa in both ionization modes (Fig. 4). Although some values were similar to the bulk elemental ratios, we acknowledge that fractions of the DOM material remained non-ionized and thus undetected. Obtaining the published elemental ratio using HRMS should thus not necessarily be the goal. Although outside the scope of this paper, aspects of differential ionization will need to be further assessed by the community (Hertkorn et al. 2008).

Inter-sample DOM composition and inter-instrument differences
The reference material samples that were expected to exert the largest differences in DOM molecular composition (e.g., PLFA vs. SRNOM) indeed showed large and consistent % BC dissimilarity values across instruments. In contrast, samples from similar sources (e.g., SRFA vs. SRNOM), yielded consistently lower %BC Dissimilarity values for every instrument (Fig. 6). The ESFA to SRNOM dissimilarity had the widest range, depending strongly on the HRMS instrument type, and particularly m/z and O/C biases that greatly influenced ESFA data for Orbitrap instruments. With the exception of the number of assigned peaks, all metric values and proportions of compound classes had higher variability arising from intersample compositional differences than from instrumental differences (Table 2). However, our study covers a more diverse sample set compared to studies that focus on DOM temporal or spatial trends (Kellerman et al. 2014;Hertkorn et al. 2016;Hawkes et al. 2018b;Drake et al. 2019;Roth et al. 2019). For sample sets with higher DOM compositional similarity, instrument bias may become more important and trends in features such as compound class proportion may not be reproducibly determined among research groups. For this reason, inclusion of a known reference sample such as SRFA, PLFA, ESFA, or SRNOM in future HRMS DOM research is of high importance in order to give technical/instrumental context to the results. As mentioned above, ESFA is less likely to be a good reference sample for this purpose, since it exhibits greater variability of results among instruments.
As an example comparison with data collected outside of the present study, we evaluated previously published data (assigned mass lists for S/N > 6) for PLFA and SRFA (D'Andrilli et al. 2013) from a 9.4 T custom built FT-ICR MS instrument. Eighty-six percent of our "common PLFA peaks" (negative-ion mode) were found in the previous PLFA dataset, and the metric values were different from our published means by the following number of standard deviations: O/C wa + 1.17, H/C wa − 2.25, m/z wa + 0.29, and AI wa,mod − 0.6. Seventy-five percent of the common SRFA peaks were found in the previous SRFA dataset, and metric values were different from our published means by the following number of standard deviations: O/C + 1.44, H/C + 0.32, m/z − 1.9, AI mod − 1.0. This suggests that D' Andrilli et al. (2013) had fewer peak assignments in common with our study (possibly due to their more conservative detection limit) and a bias in O/C that led to higher values than the averages in our study. However, most of these metric values are within 2 SDs of these average results (with the exception of H/C for PLFA), meaning that they are not statistically different to the sample of instruments studied here at the 95% confidence level. The results from that study should be considered in the context that the detected ions were more oxygenated and less saturated (in the case of PLFA) than the average of the international community.
DOM chemical characterization by each instrument HRMS data of DOM samples is often used to characterize biomolecular compound classes as well as to quantify the relative abundances of each, as a way to reduce complex data into a smaller number of variables. Consistent with the %BC dissimilarity results (Fig. 6), the PLFA sample was the most compositionally distinct in negative-ion mode, being rich in aliphatic and low oxygen molecules and having a considerably lower abundance of unsaturated and aromatic molecules (Figs. 4,9). As expected, the SRFA and SRNOM samples were the most similar to each other, although SRNOM had higher heteroatom content (i.e., a greater number of formulas containing N or S; Fig. S9). ESFA had the highest abundance of unequivocally aromatic and condensed aromatic compounds, as defined by AI mod (Figs. 4,9), but calculated aromaticity was highly variable among instruments according to the instrument sensitivity to low mass defects (i.e., high oxygen and/or low hydrogen) species. Generally, the Orbitraps detected more "aromatic" and fewer "low O unsaturated" ions in ESFA. Instrument I, which was clearly different for ESFA for nearly every calculated metric value, also exhibited a high abundance of "aromatic" and "condensed aromatic" ions. In positive-ion mode, ESFA was the most compositionally distinct, with the majority of ions (59 AE 11%) being assigned to the "aliphatic" region ( Fig. 10). Notably, "aromatics" and "condensed aromatics" were proportionally lower for every sample, including ESFA, in positive-ion mode.
The results for condensed aromatics (AI mod ≥ 0.67) for ESFA in negative-ion mode highlight a key problem for comparison of data between studies, with the large range in its percent compound class being attributable to the differences among instruments (Fig. 9, Table 2). This finding emphasizes the need for one or more standard mixture validations of instrument performance for future studies. Since it cannot be stated in an absolute sense that a sample has x% condensed aromatics from HRMS data alone, we suggest that only relative comparisons are used when applying a metric value comparison, e.g., one sample has more condensed aromatics than a known reference mixture.
It should also be emphasized that no ESI-HRMS analysis of a highly complex mixture is thoroughly comprehensive, especially in direct infusion mode, where analytes in the mixture compete for available charge during ionization. Very few aliphatic compounds were detected in the ESFA sample in negative-ion mode, but these compounds were highly ionized in positive-ion mode. In negative-ion mode, it is likely that ionization of aliphatics was suppressed by the large abundance of oxygen-rich compounds that can readily accommodate a negative charge.  Table 5 Mean and SD for the five compound classes of the four IHSS samples: Elliot Soil, Pony Lake, and Suwannee River fulvic acids (ESFA, PLFA, and SRFA) and Suwannee River Natural Organic Matter (SRNOM) in negative-ion and positive-ion modes. These values were calculated for the common ions without outliers, as in Figs. 9, 10, which were defined as points greater than 1.5X the IQR above or below the 75 th and 25 th percentile values (Q3 and Q1), mathematically: >Q3 + 1.5(Q3 -Q1) or < Q1 -1.5(Q3 -Q1).

Comments and recommendations
It is clear that a great effort would be required to obtain virtually identical mass spectra for complex DOM mixtures among laboratories. Instruments can be initially tuned to modify the "analytical window" that is covered (Kew et al. 2018a;Pan et al. 2020), but it is conceptually difficult to decide a priori what the desired spectrum should look like.
Here we begin this effort by establishing the averages and ranges obtained at the status quo, including the presence of commonly detected ions and weight-averaged metric values of these ions. The use of reference samples to evaluate HRMS instrument bias will be an important feature of future studies in order to generate robust and comparable data, and will thereby help to minimize irreproducibility. The results presented here enhance our ability and awareness to reveal differential instrument responses in HRMS studies focusing on DOM. This will also enable more focus in discussions about HRMS analytical challenges. In a follow-up paper, HRMS parameter tuning effects will be investigated to gain better understanding of observed data variability, so that other laboratories may set up their instruments in a more uniform manner. It is our hope that this will help the DOM community to disentangle instrumental from interpretational variability in sample sets from identical (or comparable) environmental contexts.
We recommend that at least one of the four IHSS reference mixtures used in this study be included in future datasets, and that the % common ions detected and their metrics be reported, so that the instrument bias and extent of analytical window can be understood. For freshwater studies, SRFA is the preferred option to include (if only one is to be selected) because more studies have used this reference sample in the past. While marine studies would benefit from the provision of a marine reference material (Green et al. 2014), the current preference may be to select PLFA. However, in all cases SRFA or SRNOM would also be suitable to provide overall context to the results. Large deviations from the detection of common ions (< 95%) or metric values (> AE 2 SD offset from mean) should be investigated and explained. To facilitate this, we have set up an online data comparison toolsee https://go. warwick.ac.uk/InterLabStudy. We also recommend that when analyzing a large number of samples for comparison that 10-20% of the samples be analyzed by replicate injections, to ensure that variability observed between samples is larger than that between replicates.
Care should be taken to describe results in a relative sense, because the values obtained (abundances and averaged calculated metric values) are not absolute. Even though more work is needed to understand ion abundance pattern reproducibility, its determinants and what fraction of DOM is ionized by ESI, this work initiates a unification among ESI-HRMS users that can continue to advance DOM research at the molecular level.
As open communication among HRMS laboratories for optimal data collection continues, more reliable comparisons across the community will be made. Including a commercially available (and continuously collected) reference material in a data set, either made in-house or ideally with internationally certified mixtures, adds substantial confidence that the data has been collected and assigned in a reproducible manner. Moreover, it allows the new data to be compared with previous findings from the community in a less isolated way, enabling a more direct integration of new results in the broader context of existing data. It remains to be investigated whether and how a complex reference material could be used to standardize a dataset in order to increase comparability of results among instruments. Furthermore, the community should aim to work together such that raw and processed peak lists along with data processing methods are available in a central, accessible repository as part of routine data publishing and archiving. In this manner, future work in environmental sciences can reproduce and reliably build upon already published efforts.