Evaluation of marine pH sensors under controlled and natural conditions for the Wendy Schmidt Ocean Health XPRIZE

The annual anthropogenic ocean carbon uptake of 2.6 ± 0.5 Gt C is changing ocean composition (e.g., pH) at unprecedented rates, but our ability to track this trend effectively across various ocean ecosystems is limited by the availability of low-cost, high-quality autonomous pH sensors. The Wendy Schmidt Ocean Health XPRIZE was a year-long competition to address this scientific need by awarding $2 million to developers who could improve the performance and reduce the cost of pH sensors. Contestants' sensors were deployed in a series of trials designed to test their accuracy, repeatability, and stability in laboratory, coastal, and open-ocean settings. This report details the validation efforts behind the competition, which included designing the sensor evaluation trials, providing the conventional true pH values against which sensors were judged, and quantifying measurement uncertainty. Expanded uncertainty (coverage factor k = 2, corresponding to 95% confidence) of validation measurements throughout the competition was approximately 0.01 pH units. A custom tank was designed for the coastal trials to expose the sensors to natural conditions, including temporal variability and biofouling, in a spatially homogeneous environment. The competition prioritized the performance metrics of accuracy, repeatability, and stability over specific applications such as high-frequency measurements. Although the XPRIZE competition focused on pH sensors, it highlights considerations for testing other marine sensors and measuring seawater carbonate chemistry.

The global oceans have absorbed about 30% of anthropogenic CO₂ emitted since the start of the Industrial Revolution (Sabine et al. 2004; Le Quéré et al. 2015), resulting in an approximately 28% increase in [H⁺], corresponding to a 0.11 decrease in pH (Gattuso et al. 2015). The rate of this decline in pH, termed ocean acidification (OA), is faster than at any other time in the geologic record (Zeebe 2012). Ocean acidification is expected to have deleterious effects on many marine taxa and ecosystems (Fabry et al. 2008; Doney et al. 2009; Somero et al. 2016), including reduced growth in marine calcifiers (Hendriks et al. 2010; Kroeker et al. 2010) and impaired sensory function in fishes (Nilsson et al. 2012). The effects of ocean acidification on foundational species like coral, phytoplankton, and shellfish will have cascading effects on community structure, food webs, biogeochemical cycling, and commercial fisheries (Hofmann and Schellnhuber 2009; Narita et al. 2012; Kroeker et al. 2013; Punt et al. 2014; Ekstrom et al. 2015; Mathis et al. 2015).
Despite the documented effects of altered seawater carbonate chemistry on many ecosystems, ocean acidification data are limited in both coverage and quality. Global ocean climate datasets like the World Ocean Circulation Experiment (WOCE), Hawaii Ocean Time-series (HOT), and Bermuda Atlantic Time-series (BATS) have included carbon variables only since the late 1980s. These large-scale sampling efforts and long-term time series have mostly concentrated on the open ocean. Coastal areas experience more variability in CO₂ partial pressure (pCO₂) and represent 10-20% of the oceanic CO₂ sink (Chen and Borges 2009; Doney et al. 2009; Takahashi et al. 2009), yet most pCO₂ time series in coastal waters are less than a decade old (Feely and Dickson 2009). This relative paucity of data can be attributed to both economic and technological limitations.
Recent developments in floats, profilers, and autonomous underwater vehicles are increasing scientists' capacity to measure ocean conditions across time and space by increasing the number of observations and decreasing the cost of obtaining them (Byrne 2014). Equally important to the development of these new sensing platforms are high-quality, cost-effective autonomous sensors that perform comparably to established laboratory methods and that have effective in situ quality control. Furthermore, sensors need to be capable of long deployments, which requires in situ calibration, anti-biofouling materials, and low power consumption (Schuster et al. 2009; Byrne 2014). As platforms add more sensing packages, the capacity for sensors to integrate with others and to communicate both bi-directionally and remotely with users becomes more important.
While greater distribution of OA measurements throughout the oceans is important, properly defining uncertainty in those measurements is also critical. The Global Ocean Acidification Observing Network (GOA-ON) has defined two tiers for OA measurement quality: (1) a "climate" standard, whereby long-term anthropogenic effects on OA can be distinguished from natural variability, and (2) a broader "weather" standard that allows spatial patterns and short-term natural variation to be adequately resolved (Newton et al. 2015). The "weather" standard defines the target relative standard uncertainty in carbonate ion concentration [CO₃²⁻] as 10%, corresponding to absolute uncertainties at 400 μatm seawater pCO₂ of approximately 0.02 in pH, 10 μmol kg⁻¹ in total alkalinity (TA) and dissolved inorganic carbon (DIC), and 10 μatm in pCO₂. The "climate" standard defines the target relative standard uncertainty in a change in [CO₃²⁻] as 1%, corresponding to uncertainties of approximately 0.003 in pH, 2 μmol kg⁻¹ in TA and DIC, and 2 μatm in pCO₂. Currently, some laboratories are capable of achieving "climate" quality measurements, but these measurements are costly and labor-intensive, often requiring the collection of discrete bottle samples. These costs have contributed to the lack of spatiotemporal coverage of OA measurements based on established methods.
Tracking ocean acidification requires values for at least two of the aforementioned carbon parameters to fully characterize the CO₂ system chemistry. Currently, only pCO₂ and pH can be measured autonomously with commercially available sensors (Martz et al. 2015). These two parameters are not the ideal pair for characterizing CO₂ system chemistry because they covary. Even under a best-case scenario with uncertainties of 0.002 pH units and 2 μatm pCO₂, uncertainties of at least 15 μmol kg⁻¹ would be propagated into the estimates of DIC and TA (Millero 2007). Autonomous sensors for TA and DIC are in various stages of development (Byrne 2014; Spaulding et al. 2014; Martz et al. 2015). Until they become widely available, one approach to increasing OA measurements is the co-deployment of pCO₂ and pH sensors along with continued efforts to reduce uncertainty in these sensors. Gas-equilibrating surface sensors for pCO₂ are believed to be able to measure at "climate" level targets of < 2 μatm and are present on many moorings and ships of opportunity (Friederich et al. 1995; Pierrot et al. 2009; Sutton et al. 2014). Subsurface pCO₂ sensors have approached this target (Tamburri et al. 2011; Atamanchuk et al. 2014; Fietzek et al. 2014; Jiang et al. 2014). In contrast, pH sensors have seen comparatively slower development. For example, the commonly used potentiometric glass electrode has changed little since its development in 1908 (Haber and Klemensiewicz 1909) and subsequent commercialization in the 1930s. Glass electrodes tend to measure inconsistently as a consequence of a variety of random and systematic effects inherent in such measurements, introducing errors of unknown magnitude. As a result, they are limited to uncertainties > 0.005 pH units (Spitzer and Pratt 2011) and are not suited for long-term OA monitoring activities.
Newer autonomous pH-sensing technologies, namely potentiometry with ion-sensitive field-effect transistors (ISFETs) and spectrophotometry with pH-sensitive indicator dyes, have the potential to meet the GOA-ON "climate" standard. ISFETs were initially tested for seawater applications by Le Bris and Birot (1997) and have been commercialized in the last 7 yr. Spectrophotometric methods for pH, regularly used in oceanography, were developed 15 yr earlier (Robert-Baldo et al. 1985), and sensors based on this approach were commercialized in the 2000s (Martz et al. 2003; Seidel et al. 2008). These divergent approaches each have distinct advantages. Sensors employing ISFETs are low-power, have rapid response times, and do not require moving parts. Spectrophotometric sensors are very precise and do not require calibration beyond characterizing the apparent absorption profiles of the purified indicator dye species (Liu et al. 2011). Both technologies are capable of long-term deployments with little drift (Gray et al. 2011; Hofmann et al. 2011). However, they are expensive and too complicated for many users, limiting their widespread use. If pH sensors performed as reliably as temperature or salinity sensors and were available at similar price points, they could dramatically improve our understanding of ocean acidification processes.
The XPRIZE Foundation targets technology gaps such as this through competitions that catalyze research and development in areas with potential benefit to humanity (Virmani and Bunje 2015). The goals of the Wendy Schmidt Ocean Health XPRIZE (WSOHXP) were to improve the performance and reduce the cost of robust pH-sensing technology. The competition's $2 million prize was therefore divided into two categories: Performance and Affordability. The WSOHXP was announced in September 2013, and the evaluations were conducted from September 2014 through May 2015. This competition satisfied two important requirements in advancing sensor technology: (1) testing new sensors against conventional methods under varied yet appropriate conditions, and (2) focusing on commercialization as a means of ensuring widespread adoption (Byrne et al. 2010). Seventeen sensors competed, of which 12 were prototypes from non-commercial entities. The new technologies and increased participation illustrate how prizes can stimulate research and development.
Previous studies offered valuable insights for the experimental design of the WSOHXP sensor assessments. A year earlier, the industry consortium Alliance for Coastal Technologies (ACT) tested seven commercially available pH sensors in a 10-week laboratory deployment and several 4-12 week coastal deployments (Johengen et al. 2015). Validation samples were collected in close proximity to sensors (< 1 m) at synchronized sampling times to reduce variability between the validation and instrument measurements arising from changes in water mass or diffusion of heat/CO₂ within the system. In the field trials, observed differences between instrument and validation measurements increased when environmental conditions fluctuated. The largest spatial variation was observed during a storm event, when the maximum recorded temperature difference across the test space (< 2 m) was 2.5°C, corresponding to potential pH errors of at least 0.04 pH units. In a separate evaluation of pCO₂ sensors, the ACT estimated pCO₂ uncertainty of 2-15 μatm across the test spaces (< 1 m) during their evaluations at two coastal sites (Tamburri et al. 2011). They observed large, rapid fluctuations of temperature (1°C h⁻¹) and pCO₂ (400 μatm within hours) during these tests. Bubbles near sensor interfaces were another concern during the pH sensor evaluations, especially for deployments near the surface. Outside of the ACT trials, developers have tested their instruments in laboratory, underway, and moored field deployments where pH was compared directly against spectrophotometric pH measurements or indirectly against other CO₂ parameters (Bellerby et al. 2002; Liu et al. 2006; Seidel et al. 2008; Martz et al. 2010). In these tests, variability among sensors and validation measurements was attributed to spatial differences in water masses, temporal mismatching, biofouling, and interference from bubbles and detritus. For example, Seidel et al. (2008) observed water mass variability of 2°C between sensors located 2 m apart at the same depth. Though they did not report the resulting pH variability, a 2°C difference could result in pH measurement differences upwards of 0.03 pH units. These studies illustrate how environmental factors can complicate sensor evaluations, and how sampling protocols can mitigate potential sources of error.
The WSOHXP competition was designed to test the performance of sensors in a series of environments representing real-world applications. The four competition phases in chronological order were Phase 1 registration, Phase 2 laboratory trials, Phase 3 coastal trials, and Phase 4 open-ocean trials. Phase 2 required sensors to measure both Tris-artificial seawater buffer and seawater in stable, regulated systems. This phase was designed to directly assess sensor performance with minimal uncertainty from extraneous, uncontrolled sources of variability. The Phase 3 coastal trials tested sensors for a month in more naturally varying estuarine conditions with the added possibility of biofouling. The Phase 4 open-ocean trials focused on the ability of finalist sensors to record pH over a range of depths up to 3000 m. The sensors themselves are not evaluated in this report because they were judged by a separate, independent panel. However, the judging criteria (Table 1) are described because they influenced the design of each phase, and anonymized sensor results are discussed in terms of competition design. The performance criteria consisted of accuracy, repeatability, and stability. Accuracy was defined as the pH residual, i.e., the difference between the value of pH reported by a sensor and the conventional true value of pH for that seawater. The conventional true value of pH for a seawater sample was defined as the pH measured with a spectrophotometer using purified m-cresol purple indicator dye and calculated from the calibration values of Liu et al. (2011). By definition, there is no uncertainty in this conventional true value, only in the realization of it. Repeatability, a short-term estimate of precision, was defined as the agreement among successive measurements of pH carried out under the same conditions over the course of each phase, expressed as a standard deviation. 
This metric assumes the pH in the test space does not change significantly in the time interval between two consecutive measurements. Stability, a long-term estimate of precision, was defined as the interdecile range of pH residuals observed over the trial deployment. Stable sensors should have similar pH offsets from validation measurements for the entire phase, and therefore a small interdecile range. This range, encompassing 80% of the pH residuals, corresponds to 2.56 standard deviations of a normal distribution. Two additional judging criteria were manufacturing cost and ease-of-use. Manufacturing cost was assessed directly from the materials cost estimate to manufacture a sensor and from supporting documentation provided by competition participants. Ease-of-use was defined as the ease with which devices can be calibrated (or self-calibrated), deployed, maintained, and their data downloaded, taking into consideration physical size, weight, durability, accessibility, and related characteristics. Critically important to the competition were (1) ensuring that the validation estimates of conventional true pH reflected the environmental conditions experienced by all the sensors and (2) characterizing the uncertainty of those estimates. These validation goals were accomplished in each competition phase through engineering, monitoring, modeling, and QA/QC protocols. Test spaces were designed to be spatiotemporally homogeneous with respect to pH during validation measurements, and variation in pH of the test spaces with time was characterized through regular monitoring. Magnitudes of sampling and analysis uncertainty were evaluated from validation instruments, measurements, and comparisons with an independent laboratory. Tradeoffs in the experimental design and sensor evaluations are considered in the context of sensor performance. The topics covered here are likely relevant to the broader topics of seawater chemistry measurements and sensor evaluations.
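The three performance metrics defined above can be sketched as a small computation. This is an illustrative sketch (function and variable names are ours, not the competition's), assuming paired lists of sensor pH and validation pH values:

```python
import statistics

def accuracy_residuals(sensor_pH, validation_pH):
    """Accuracy: residuals between sensor pH and the conventional true pH."""
    return [s - v for s, v in zip(sensor_pH, validation_pH)]

def repeatability(sensor_pH):
    """Repeatability: SD of differences between consecutive measurements,
    assuming the test-space pH is effectively constant between them."""
    diffs = [b - a for a, b in zip(sensor_pH, sensor_pH[1:])]
    return statistics.stdev(diffs)

def stability(residuals):
    """Stability: interdecile range (90th minus 10th percentile) of the
    residuals over a deployment; ~2.56 SD for normally distributed residuals."""
    deciles = statistics.quantiles(residuals, n=10, method="inclusive")
    return deciles[8] - deciles[0]
```

A sensor with a constant offset from the validation pH would score poorly on accuracy but well on stability, which is why the two criteria were judged separately.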

Phase 2: Laboratory trials
The laboratory trials consisted of an accuracy test in Tris-artificial seawater buffer (Phase 2A) and repeatability and stability tests in seawater over 7 weeks (Phase 2B). The laboratory trials were conducted at the Monterey Bay Aquarium Research Institute (MBARI) in Moss Landing, California, U.S.A. from September to November 2014. The pH of the Tris buffer test solution was unknown to contestant teams; this test solution was composed of approximately 0.03 : 0.05 mol kg⁻¹ Tris : Tris-HCl in synthetic seawater of nominal salinity 35 (DelValls and Dickson 1998). Each sensor was submerged in a container of Tris buffer and, after equilibrating with the buffer, recorded 10 measurements at a rate of one measurement per minute or as quickly as possible for slower instruments. After the first set of pH measurements, additional HCl was added to the buffer to lower pH by approximately 0.4 pH units. Validation samples were collected in duplicate at the midpoint of each sensor's regular and lowered pH measurements. The Tris buffer and HCl solutions were composed of the same synthetic seawater background to maintain the same activity coefficients of their constituent species. Sodium bromide was included to ensure stable behavior in sensors with chloride ion-sensitive electrodes. The Phase 2A trials were conducted in a temperature-controlled laboratory, and the temperature of the Tris test buffer solution was monitored throughout the tests (Fluke 1521 and 1504 readouts with Amphenol AS115-4-wire thermistors, manufacturer-specified accuracies ±0.005, ±0.002, and ±0.001 K, respectively). Teams were allowed to submit up to three sensors. Following the Tris tests, teams were allowed to review preliminary results, recalibrate sensor(s), and choose a single sensor to compete for the remaining phases of the competition. They were not allowed to make substantial modifications that affected sensor performance. Sensors were securely stored by XPRIZE personnel between competition phases.
Phase 2B long-term repeatability and stability tests were conducted in MBARI's seawater test tank (L × W × D = 14 × 9 × 10 m) following a 1-d test deployment. The test tank is stable, with low diurnal variability in temperature and only gradual increases in salinity and pH from evaporation and out-gassing. To provide a larger range of pH for sensors to measure, weekly step changes in pH of up to 0.2 pH units were planned. Sensors were programmed to record pH every 2 h. Validation samples were collected in duplicate from three sampling locations in the tank, twice daily, four times a week (i.e., 48 samples per week). Sensors were deployed at 3 m depth within 1.2 m of a sampling location at the same 3 m depth (Fig. 1a). Seawater was recirculated within the tank with minimal turbulence at a turnover time of 12 h. Tank conditions were monitored with a CTD (Seabird 16plusV2, manufacturer-specified temperature accuracy ±0.005 K, conductivity accuracy ±0.00005 S m⁻¹) and a pH sensor (Satlantic SeaFET, manufacturer-specified precision ±0.004 pH units). Temporal variation was estimated as the standard deviations of these measurements, pooled by hour. Spatial variation across the tank was assessed weekly with additional discrete pH samples or CTD surveys. The tank was ozonated once a week to limit biofouling, and the resulting microbubbles were allowed to out-gas for at least another day to limit potential interference with the sensors. No sensor data were evaluated and no validation samples were collected during these periods.
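One plausible reading of "pooled by hour" is to compute the variability within each hourly bin and then pool the within-bin variances. A minimal sketch under that assumption (equal weighting of bins; names are ours):

```python
import math
import statistics
from collections import defaultdict

def pooled_sd_by_hour(hours, values):
    """Pool within-hour variability: group values by hour, compute each
    group's variance, and return the square root of the mean variance.
    Bins with fewer than two values are skipped."""
    bins = defaultdict(list)
    for h, v in zip(hours, values):
        bins[h].append(v)
    variances = [statistics.variance(g) for g in bins.values() if len(g) > 1]
    return math.sqrt(sum(variances) / len(variances))
```

Pooling within hourly bins isolates short-term monitoring variability from the planned weekly step changes in tank pH.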

Phase 3: Coastal trial
The 4-week coastal trial tested the repeatability and stability of sensors under approximately natural, varying seawater conditions at the Seattle Aquarium (Pier 59), utilizing seawater pumped from Elliott Bay, Washington, U.S.A. in February 2015. Tests were conducted in a custom 10 × 1 × 1 m (L × W × D) tank designed to reduce spatial variation while still tracking external temporal changes in near real time. Seawater was pumped from 10 m depth in Elliott Bay at 400 L min⁻¹ into the tank through a distribution manifold feeding 32 equal lengths of tubing terminating at diffusers positioned along the bottom of the tank (Fig. 1b,c). Seawater flowed into the tank evenly from the bottom and exited with minimal turbulence through two skimmer pipes running the length of the tank. An awning provided protection from the elements. Sensors were positioned at mid-depth (0.5 m) and within 1 m of one of three evenly spaced validation sampling locations at the same 0.5 m depth. During sampling, inflow was stopped and powerheads provided water circulation. Validation samples were collected in duplicate at each sampling location twice daily, 5 d per week (i.e., 60 measurements per week). Tank monitoring and surveys of spatial pH variability were conducted as in Phase 2B. The pH of Elliott Bay water was monitored in situ with two glass pH electrodes (YSI 6600 V2) at 1 m and 10 m depth maintained by King County (http://green2.kingcounty.gov/marine-buoy/SeattleAquariumData.aspx; King County 2015). The sensor at 10 m is located at the seawater intake for the aquarium. A third estimate of pH was derived from a salinity-TA relationship based on discrete samples collected during the experiment and a flow-through pCO₂ equilibrator (General Oceanics GO8050) on the Seattle Aquarium's seawater intake. Although these additional data provided further information on the intake water, only the validation measurements were used to evaluate sensor performance.

Phase 4: Open-ocean trials
The oceanographic deployment tested the accuracy and repeatability of five finalist sensors on nine vertical hydrocasts with maximum depths increasing from 250 m to 3000 m. These trials were conducted at the HOT Station ALOHA (A Long-Term Oligotrophic Habitat Assessment; 22°45′N, 158°00′W) from 15 May to 19 May 2015. The water column was profiled by a CTD lowered at a rate of 30 m min⁻¹, and discrete samples were collected in 12 L Niskin bottles at 15 depths on each upcast. Before each cast, sensors were attached to a rosette containing the CTD and Niskin bottles using hose clamps, stainless steel bolts, fiberglass strut, and cable ties. Sensors were programmed to begin sampling at a specified time and at a sampling rate of 1 Hz. Sensors that sampled at slower rates were penalized under the Ease-of-use category. The longest sampling period of the finalist sensors was 6 min. Consequently, the rosette was stopped for a minimum of 6 min at each sampling depth, and the Niskin bottle was tripped at the 3-min mark. The rosette CTD (Seabird SBE 9/11plus with 6800-m rated pressure sensor and SBE 3P, SBE 4C, and SBE 43 sensors; manufacturer-specified temperature accuracy ±0.001 K, conductivity accuracy ±0.0003 S m⁻¹, depth accuracy ±1 m) recorded ambient conditions at 1 Hz. Once the rosette returned to the deck and sensors were removed, water samples were collected in the following order: validation pH, audit pH, dissolved oxygen (DO), and salinity. This procedure prioritized pH and was a departure from common oceanographic practice, where DO is sampled before pH (Dickson et al. 2007).

Validation procedures for all phases of the competition
For the purposes of the competition, "pH" is defined as the negative decadic logarithm of total hydrogen ion concentration of seawater expressed in moles per kilogram of solution. The conventional true value of pH used to validate the sensors was based on spectrophotometric pH measurements using a stock solution of 10 mM purified m-cresol purple indicator dye (Byrne Laboratory, University of South Florida) and calculated using the parameters of Liu et al. (2011) after correcting for any pH change resulting from the addition of dye to the sample (see below).
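The underlying two-wavelength calculation has the standard form pH = −log(K2·e2) + log10[(R − e1)/(1 − R·e3/e2)], where R is the ratio of m-cresol purple absorbances at 578 nm and 434 nm. A minimal sketch follows; the dye characterization terms must be evaluated from the Liu et al. (2011) fits at the analysis temperature and salinity, and the function name and the values used in any call are ours, not the paper's:

```python
import math

def spec_pH(R, neg_log_K2e2, e1, e3_over_e2):
    """Spectrophotometric pH (total hydrogen ion scale) from the absorbance
    ratio R = A578/A434 of purified m-cresol purple. neg_log_K2e2, e1, and
    e3_over_e2 are the temperature- and salinity-dependent dye terms of
    Liu et al. (2011), supplied by the caller."""
    return neg_log_K2e2 + math.log10((R - e1) / (1 - R * e3_over_e2))
```

With the dye terms held fixed, a larger absorbance ratio R yields a higher pH, which is the expected behavior of the indicator's basic form.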
Sampling and measurements were conducted according to best practices (Dickson et al. 2007). Briefly, samples were collected with minimal turbulence in borosilicate glass bottles after pre-rinsing and at least one volume of overflow. Seawater samples were left with 1% headspace and poisoned with a 0.04% (by volume) addition of saturated HgCl₂ solution. Samples were poisoned as a precautionary measure against any laboratory problems that might delay analyses and for methodological consistency with the audit samples that were analyzed in a different laboratory days to weeks later. Validation measurements were performed with a 10 cm path length cell on single-beam spectrophotometers (Agilent Cary 8453 and 8454, manufacturer-specified absorbance accuracy ±0.005 absorbance units, AU), manually in Phase 2 and with an automated system (Carter et al. 2013) in Phases 3 and 4. Dye perturbation measurements were made on > 15% of seawater samples by measuring those samples with a double addition of dye. With the manual system in Phase 2B, the dye correction was based on an additional 10 μL dye addition to the same sample (Dickson et al. 2007). Dye corrections with the automated system were based on a second, double-dye measurement, and the amount of dye added was calculated from isosbestic absorbances at 488 nm (Carter et al. 2013). Analysis temperature was 15°C in Phase 2A and 20°C for the remainder of the competition. Analysis temperatures were regulated by a custom cell incubator in the manual method and by a custom water-jacketed spectrophotometer cell in the automated method. Temperature was recorded immediately following measurements (Fluke 1521 and 1504 readouts with Amphenol AS115-4-wire thermistors, manufacturer-specified accuracies of ±0.005, ±0.002, and ±0.001 K, respectively). Validation pH was reported, as the mean of duplicate measurements, at in situ temperature and pressure conditions on the total scale using equilibrium constants K1 and K2 from Lueker et al.
(2000); KHSO4 from Dickson (1990a); KB(OH)3 from Dickson (1990b); KH3PO4, KH2PO4, KHPO4, KH2O, and KSi from Millero (1995); and KHF from Perez and Fraga (1987) (or Dickson and Riley 1979 for temperatures < 9°C). Pressure corrections for equilibrium constants follow Millero (1995) and were calculated on the seawater scale before conversion to the total scale. Total alkalinity was derived from audit measurements, and carbonate alkalinity was calculated from total concentrations of boron (Uppström 1974), sulfate (Morris and Riley 1966), and fluoride (Riley 1965) in proportion to salinity. Phosphate and silicate concentrations were assumed to be 0 μmol kg⁻¹. In Phase 2A, the DelValls and Dickson (1998) temperature dependence for the pH of equimolar (0.04 : 0.04 mol kg⁻¹ Tris : Tris-HCl) buffer was used to adjust pH to in situ temperature for both test solutions, assuming it represents the change in pK with temperature. To adjust seawater pH to in situ temperature and pressure, DIC was first calculated from the pH measured under laboratory conditions and TA; then TA and DIC were used to estimate in situ pH (at the in situ temperature and pressure). The in situ adjustments of seawater samples were calculated using the seacarb package v3.0 (Gattuso et al. 2014) in R v3.1 (R Core Team 2014) and the CO2Sys package v25 (Pelletier et al. 2015) in Microsoft Excel.

Quality assurance and quality control (QA/QC)
The quality assurance plan required pH measurements to be made using appropriate equipment operated by qualified individuals, targeting a standard uncertainty in pH of 0.003-0.005 based on state-of-the-art methods and "climate" standards (Carter et al. 2013; Newton et al. 2015). Appropriate equipment was defined as new (< 1 yr old) equipment with traceable calibrations and/or older equipment passing periodic testing relative to newer equipment and with reference materials. Throughout the competition, spectrophotometers passed diagnostic tests using holmium oxide and absorbance filter reference materials. Primary and backup spectrophotometers were compared in side-by-side tests during Phases 2B and 3. Thermistors were batch-tested in baths, measuring within 0.01 K of one another. The purified dye composition was verified by the National Institute of Standards and Technology. All validation measurements were performed by a single analyst, whose measurements were compared with those of colleagues at the start of the competition, with an audit laboratory, and against reference materials throughout the competition. An advisory board provided feedback on protocols and results. Operating procedures, diagnostics, notes, and data were recorded in a notebook and digitally with regular back-ups.
Quality control measures included replicate measurements, regular measurement of suitable reference materials, comparisons with an independent laboratory, and comparisons of secondary measurements. Duplicate validation bottle samples were collected and analyzed at every validation point. During each analytical run, validation bottle samples were only analyzed after confirming acceptable pH measurements of Tris or seawater reference materials (all reference materials were obtained from the Dickson Laboratory at the Scripps Institution of Oceanography, University of California, San Diego). The reference materials used had pH values differing by up to 0.3 pH units. Additional discrete samples were collected in duplicate at places and times corresponding to approximately 15% of validation samples, and these audit samples were analyzed by the Dickson Laboratory for pH, total alkalinity, and salinity. Secondary measurements of pH, other carbonate parameters, and dissolved oxygen provided additional information on the likely quality of the validation pH measurements and testing environments.

Sensor scoring
Contestant sensors were judged by a panel separate and independent from the validation team. After the conclusion of the competition, contestants' measurements at every validation sampling point were provided to the validation team. These data were used to estimate accuracy and stability, which were both estimated from differences between the measured sensor pH and the validation pH as described earlier. By definition, a sensor's repeatability was calculated as the standard deviation of differences between consecutive measurements, regardless of their accuracy relative to validation pH values. Here, repeatability can only be approximated based on the available data, with consecutive measurements separated by up to 6 h for Phases 2B and 3. Repeatability was not estimated for Phase 4 because pH naturally changes by as much as 0.5 pH units over the depth profile. The competition guidelines indicated that sensors scored full points for accuracy and stability values within the reported validation expanded uncertainty corresponding to 95% confidence. For repeatability, full points could be scored only for a standard deviation of 0 pH units. Sensors scored no points for values greater than the minimum standards (Table 1). Values falling between these thresholds were awarded linearly scaled points, and total scores in each phase were weighted according to the maximum possible score for each criterion in each phase (Table 1). Similarly, if a sensor stopped collecting data during any point of the competition, its score was penalized in proportion to the amount of data missing.
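The linear scaling between the full-credit and zero-credit thresholds can be sketched as follows (names and the example thresholds are ours; the actual thresholds and weights are given in Table 1):

```python
def criterion_score(value, full_threshold, zero_threshold, max_points):
    """Award full points at or below the full-credit threshold (e.g., the
    expanded validation uncertainty, or 0 for repeatability), zero points
    at or above the minimum standard, and linearly scaled points between."""
    if value <= full_threshold:
        return max_points
    if value >= zero_threshold:
        return 0.0
    return max_points * (zero_threshold - value) / (zero_threshold - full_threshold)
```

For example, with a hypothetical full-credit threshold of 0.01 and a minimum standard of 0.05, a residual of 0.03 would earn half the available points.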

Validation pH uncertainty
The main goal of validation efforts in the WSOHXP competition was to ensure that the reported conventional true pH values were representative of the seawater around the test sensors. Efforts to characterize uncertainty in validation measurements were as important as the measurements themselves because the judges used such uncertainty estimates in the scoring of sensor performance. The total expanded validation pH uncertainty (coverage factor k = 2, corresponding to 95% confidence) was approximately 0.01 pH units throughout the competition (Table 2) and is comparable to estimates made previously (Carter et al. 2013; Hammer et al. 2014; Bockmon and Dickson 2015). This uncertainty consisted of contributions from analysis, sample handling, corrections to in situ conditions, spectrophotometer absorbance, and dye perturbation corrections. These sources of uncertainty were summed in quadrature to obtain the combined standard uncertainty for the validation pH measurements in each phase (Ellison and Williams 2012).
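The quadrature sum and its expansion can be sketched as follows (names are ours; the component values in the test are illustrative, not the competition's actual uncertainty budget):

```python
import math

def combined_standard_uncertainty(components):
    """Combine independent standard uncertainties in quadrature."""
    return math.sqrt(sum(u * u for u in components))

def expanded_uncertainty(components, k=2):
    """Expanded uncertainty with coverage factor k (k = 2 ~ 95% confidence)."""
    return k * combined_standard_uncertainty(components)
```

For instance, standard components of 0.003 and 0.004 pH units combine to 0.005, giving an expanded (k = 2) uncertainty of 0.01, consistent in magnitude with the totals in Table 2.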
All analyses were conducted by the same operator in the same mobile laboratory, housed in a shipping container that traveled with the competition. Outside environmental conditions, electrical power supply, internal temperature fluctuations, and so forth varied within and among the different phases of the competition. Their contributions to analytical uncertainty were estimated from the standard deviation (SD) of reference material measurements; for Phases 2B, 3, and 4, these SDs were 0.0030 (n = 52, j = 3 batches of reference materials, t = 55 d), 0.0017 (n = 34, j = 3 batches, t = 25 d), and 0.0010 (n = 26, j = 2 batches, t = 7 d) pH units, respectively. The improvement in analytical uncertainty between Phase 2B and Phases 3 and 4 was likely a result of the automated system's reduced and more standardized sample processing time.
Sample handling uncertainty was defined as uncertainties in measurement observed among different labs analyzing the same sample materials, including other unknown contributions. Sample handling uncertainty was estimated from the standard deviation of differences between the paired audit and validation pH measurements. Sample handling uncertainties for Phases 2B, 3, and 4 were 0.0020 (n = 26), 0.0054 (n = 19), and 0.0034 (n = 25) pH units, comprising 10-50% of the combined standard uncertainty in each phase. Phase 2A did not include a sample handling comparison because audit samples were collected from the initial mixing carboy while validation samples were collected from testing containers with instruments.
Uncertainty from corrections to in situ conditions includes uncertainties in salinity, temperature, and pressure measurements of laboratory and CTD sensors. These uncertainties were assumed to have a uniform distribution within the manufacturer-specified accuracy ranges. Salinity uncertainties of 0.01% have negligible effects on pH. The uncertainty of the laboratory thermistors used in Phases 2A and 2B was conservatively assumed to be 0.02 K based on tests in a controlled water bath and intercomparisons with other thermistors. Pressure sensor accuracy was irrelevant in Phases 2B and 3 and up to 1 dbar in Phase 4. The resulting pH uncertainties from temperature and pressure uncertainties were 0.0001-0.0006 pH units. Laboratory-measured pH values were corrected to in situ conditions, and the magnitude of the adjustment increases with increasing pressure. Similarly, uncertainty in the estimated in situ pH arising from uncertainties in the constants used in the pressure adjustment increases proportionately with depth. This uncertainty was assumed to be 8% of the difference between pH calculated at in situ temperature at the surface and at the actual sample depth. The assumed 8% pressure correction uncertainty corresponds to 0 pH units at the surface and 0.0101 pH units at 3000 m.
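The 8% rule above is simple enough to express directly; the sketch below assumes only what the text states (the variable names and example values are illustrative):

```python
def pressure_correction_uncertainty(ph_in_situ, ph_surface, fraction=0.08):
    """Standard-uncertainty contribution taken as a fixed fraction
    (here 8%) of the magnitude of the pressure adjustment between
    the surface and the sample depth, both at in situ temperature."""
    return fraction * abs(ph_in_situ - ph_surface)

# A pressure adjustment of ~0.126 pH units between surface and depth
# yields an uncertainty contribution of ~0.0101 pH units:
u_p = pressure_correction_uncertainty(7.974, 8.100)
```

At the surface the adjustment is zero, so this term vanishes there and grows with depth, which is why the total expanded uncertainty in Phase 4 increased from 0.009 pH units at the surface to 0.022 pH units at the deepest sampling pressure.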
The contribution of the spectrophotometers to pH uncertainty was calculated assuming an absorbance uncertainty of 0.005 AU. Treating this range as a triangular distribution implies a standard uncertainty in absorbance measurement of 0.005/√6 ≈ 0.002. This value was then used to estimate the relative standard uncertainty in the absorbance ratio R (dR/R), propagating the absorbance uncertainty through the ratio together with the measured absorbance values at 434 nm and 578 nm (A₁ and A₂) for all samples in each competition phase. The maximum "smoothed" value over the range of absorbances was used as the resulting spectrophotometer contribution to uncertainty in pH for each phase. Reported uncertainties from spectrophotometer absorbance were 0.0019-0.0037 pH units across all three phases. Lower spectrophotometer uncertainty in the later phases likely resulted from an increase in the amount of dye added to samples with the automated system that subsequently reduced the ratio of absorbance error to absorbance measured. The addition of indicator dye can affect the sample seawater pH, and there is an uncertainty associated with the empirically derived correction to this dye perturbation. This dye-correction intercept uncertainty was estimated as one-fourth of the largest adjustment. In Phase 2A, dye perturbation was assumed to be negligible as the samples were strongly buffered Tris/Tris-H⁺ mixtures.
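A sketch of this propagation, assuming the standard formula for the relative uncertainty of a ratio of independent quantities R = A₂/A₁ (the exact relationship used in the competition is not reproduced in this excerpt, so treat this as an illustration):

```python
import math

# Standard uncertainty of a single absorbance reading, from a
# triangular distribution of half-width 0.005 AU
U_ABS = 0.005 / math.sqrt(6)

def relative_ratio_uncertainty(a1, a2, u_a=U_ABS):
    """Relative standard uncertainty dR/R of the absorbance ratio
    R = a2/a1, given independent absorbance uncertainties u_a at
    434 nm (a1) and 578 nm (a2)."""
    return math.sqrt((u_a / a1) ** 2 + (u_a / a2) ** 2)
```

The form of the expression makes the observation in the text explicit: larger absorbances (more dye) shrink both u_a/A terms, which is why the increased dye volume of the automated system lowered the spectrophotometer contribution in the later phases.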

Phase 2A
Tris seawater buffer pH is twice as sensitive to temperature as seawater pH. Therefore, additional steps were taken to reduce the potential contributions of temperature variability to pH uncertainty, including regulating the testing lab temperature, allowing sensors to acclimate to the test room, and submerging test solutions in water baths to slow temperature fluctuations. Thermistors were cross-checked in water baths to verify their performance. Despite these efforts, temperature variability was introduced by larger sensors' thermal mass in the water baths and movements of personnel to and from both the laboratory and the testing room. Ultimately, when temperature could not be held constant, monitoring and recording temperatures during sample collection and analysis was critical. The advantage of testing sensors in Tris buffer as opposed to seawater is that the pH of a Tris buffer solution would not be expected to drift during the test period, even for the new solution obtained after the addition of HCl. The tradeoffs include the aforementioned temperature sensitivity and the added cost. Phase 2A required mixing approximately 800 L of Tris buffer to test approximately 50 sensors.
Sensor accuracy varied considerably, with nine of 17 teams measuring pH within the 0.02 pH unit performance threshold and another two measuring within the 0.04 pH unit affordability threshold (Fig. 2).

Phase 2B
Protocols were designed to limit natural variability in physical conditions throughout the test tank, ensuring sensors were compared fairly to the assigned conventional true pH value of the test solutions. These protocols included regular monitoring of spatiotemporal variability in pH and temperature during Phases 2B and 3. Periodic surveys with discrete pH sampling and CTD/pH sensors further assessed spatial variability. Discrete pH samples collected across the MBARI tank had a pooled standard deviation of 0.0027 pH units (n = 96, j = 4 surveys), comparable to the analysis uncertainty in this phase of 0.0030 pH units, and exhibited no consistent spatial pattern. The standard deviation of all validation samples pooled across sampling events, i.e., reflecting spatial heterogeneity between sampling locations, was 0.0017 pH units (n = 293, j = 50 events). Temperatures across the tank were within 0.005 K of the tank center, with a slight temperature gradient increasing from the west to the east wall. This slight gradient would result in a negligible pH uncertainty of 4 × 10⁻⁵ pH units, assuming no differences in salinity and TA. Given the lack of measured pH differences throughout the tank and the negligible temperature effects, validation pH was reported as the average of the three sampling locations for each sampling event. Hourly standard deviations (n = 6 h⁻¹) of the SeaFET pH sensor used for tank monitoring were generally less than the manufacturer-specified precision of 0.004 pH units, indicating stable pH on hourly timescales. During Phase 2B, accelerated stray current corrosion was observed in the MBARI test tank. Consequently, the tank pH manipulations were cancelled and testing was halted after 7 weeks to protect the sensors, cutting this phase short of its originally planned duration of 3 months. Subsequent electrical tests were unable to determine the source of the corrosion. Damage to sensors was limited to metal support bolts and sacrificial anodes.
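The pooled standard deviations reported for the tank surveys combine the variance within each survey, weighted by degrees of freedom. A minimal sketch (variable names are illustrative):

```python
import math
import statistics

def pooled_sd(groups):
    """Pooled standard deviation across j groups (e.g., surveys),
    weighting each group's sample variance by its degrees of freedom
    (n_i - 1)."""
    numerator = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)
    return math.sqrt(numerator / denominator)
```

Pooling in this way characterizes the typical within-survey spread without letting between-survey differences in mean pH (which reflect real temporal change in the tank) inflate the spatial-variability estimate.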
The repeatability and stability of about half the sensors were within the Performance and Affordability thresholds (Figs. 3,4). One sensor malfunctioned during this phase and reported only a single pH value for every timestamp, effectively earning it a perfect score in repeatability. Because the tank pH only increased by 0.002 pH units over 6 weeks, this sensor also performed relatively well in the stability category, whereas in a more dynamic environment this malfunction would have more severely penalized the sensor's score. Five sensors failed to record measurements.

Phase 3
In Phase 3, pre-experiment pH surveys with the pH probe indicated a pH range across the Seattle Aquarium test tank of 0.005 units with no discernible spatial pattern. Similarly, temperature readings ranged 0.005 K with no consistent gradients. However, discrete pH samples detected an intermittent pH gradient of up to 0.008 pH units across the tank, though the overall tank pooled SD was 0.001 pH units (n = 240, j = 40 events). Because of this potential gradient, the validation pH was reported for each sampling location; reporting mean tank values would have introduced an additional source of variability disproportionately affecting sensors at the far ends of the tank. This observation of ephemeral pH gradients illustrates the importance of continually surveying the test area. As in the previous phase, the pH variability on hourly timescales in Phase 3 was less than the auxiliary pH sensor repeatability. Discrete samples were generally collected within 20 min. Within these timeframes, no detectable pH variability could be attributed to time in either Phase 2B or 3, and therefore no additional steps were taken to account for temporal variability when reporting results.

[Fig. 2 caption: Residuals between mean sensor and validation pH (y-axis, log-scale) for each team's sensor(s) (colored symbols along x-axis, ranked from left to right in order of increasing residuals). Teams were allowed to test up to three replicate sensors (shapes). Accuracy trials consisted of measurements in Tris seawater buffer at two different pH levels for each sensor (i.e., two points per shape). Residuals within the expanded uncertainty of validation pH measurements (gray shaded region) scored full points. Residuals between validation uncertainty and the Performance (dashed line) and Affordability (dotted line) category thresholds scored linearly scaled points from 0 to the maximum indicated in Table 1.]
The pH of Elliott Bay increased from 7.66 to a peak of 7.76 pH units over the first week, dropped to the original value over the next week, and increased again to 7.71 pH units 2 weeks later (Fig. 5). These pH fluctuations reflected changes in primary productivity on weekly timescales. The King County buoy pH sensors tracked the relative changes of validation pH measurements well, but their offsets from the validation measurements drifted over time and even reversed upon redeployment on 18 February 2015. The performances of these glass electrode sensors suggest their utility was limited to tracking relative changes in pH on weekly timescales. The agreement in relative fluctuations of all buoy parameters (pH, dissolved oxygen, chlorophyll fluorescence) and validation pH over time indicated that the tank represented environmental conditions. This tank design appeared to be well-suited for sensor testing in a variable coastal environment.
Phase 3 was conducted in February, when biological productivity and pH variability are lower compared to the summer. Nonetheless, the background pH variability was higher than in Phase 2B. As in Phase 2B, half of the competing sensors recorded a significant proportion of their measurements within the Performance and Affordability thresholds for both repeatability and stability (Figs. 6, 7). The largest deviations in repeatability and stability were an order of magnitude higher than in Phase 2B, likely due to increased environmental variability and biofouling. These results illustrate the differences between variable field and controlled laboratory environments.

Phase 4
The high cost of ship time limits the testing of sensors in oceanographic settings. Most of the finalists' prototype sensors had never been exposed to pressures >1000 dbar. Casts were gradually increased in depth to allow sensors to collect some data in the event their housings were compromised at higher pressures. Indeed, one sensor's battery compartment was crushed at approximately 2000 m, precluding it from participating in the last cast. Another advantage of progressively deeper casts was increased sampling resolution where pH changes more rapidly at depths <1000 m. The in situ pH profiles decreased from 8.1 pH units in surface waters to a minimum of 7.6 between 700 dbar and 800 dbar, corresponding to the oxygen minimum zone (Fig. 8). The reported expanded pH uncertainty increased from 0.009 pH units at the surface to 0.022 pH units at 3050 dbar due to the assumed uncertainty in pressure correction. This pressure correction uncertainty is currently unknown, but a lower bound of 5% can be inferred from Millero (1979). The 8% pressure correction uncertainty used here resulted in a pH uncertainty of 0.0101 pH units at 3050 dbar, which provided judges the scope to assess sensor accuracy within the competition's minimum standard of 0.02 pH units.

[Figure caption (stability panels): The interdecile range (IDR) of residuals (sensor pH − validation pH) (y-axis, log-scale) represents the stability of sensors over time. Sensors are displayed as x-axis ticks in order of increasing IDR. The gray shaded region represents expanded uncertainty of validation pH measurements. Performance (dashed line) and Affordability (dotted line) thresholds were 0.05 and 0.10 pH units.]
Hawaii Ocean Time-series (HOT) Station ALOHA has measured pH since 1992 and is the longest sustained monthly seawater pH time-series in the world. This extensive time-series provided a record against which both sensor and validation measurements could be compared. Validation pH values were generally within 0.01 pH unit of historical April-June HOT Station ALOHA pH values (Fujieki et al. 2015), after recalculating the HOT ALOHA data using the same constants from Liu et al. (2011) (Fig. 8). The HOT ALOHA pH measurements are made with uncharacterized, impure dye, resulting in an additional contribution to HOT ALOHA pH uncertainty of 0.02 pH units (Liu et al. 2011). Despite these uncertainties, the validation pH-depth profiles were consistent with the historical record.
In addition to stepping casts to progressively deeper maximum depths, the 6-min rosette stop time at each depth was designed to ensure the collection of sensor measurements that could be compared to validation measurements obtained from Niskin bottles. Normally, stop times are less than a minute. Consequently, the competition protocol eliminated the handicap that slower-sampling technologies, namely spectrophotometric methods, would have encountered in collecting high-resolution time-series. Slower instruments were instead penalized under the qualitative Ease-of-use category, which was worth fewer points (Table 1). Accuracy appeared to decrease with increasing pressure for all sensors (Fig. 9), potentially reflecting uncertainties due to the pH adjustment from laboratory to in situ pressure in addition to instrument measurement uncertainties resulting from the changing conditions. Temperature changes from approximately 26°C to 2°C along the cast profiles also could have contributed errors to sensors' pH readings if sensor thermistors were unable to equilibrate before the pH measurements were made. A third potential source of sensor pH errors could arise if in situ temperature, salinity, or pressure conditions extended beyond the ranges that teams used to calibrate their sensors. Repeatability cannot be assessed here because only the sensor measurements associated with Niskin samples were provided to the validation team, and Niskin bottles were collected at different depths. The overall results from Phase 4 reflect the tradeoffs of evaluating sensor performance vs. application, the unquantified uncertainties due to pressure, and the time/cost constraints of oceanographic deployments.

Discussion
[Fig. 8 caption fragment: "... Liu et al. (2011). The gray shaded region in the pH panel represents the expanded uncertainty of the validation pH measurements."]

[Fig. 9. Phase 4 accuracy at HOT Station ALOHA as residuals between sensor and validation pH (y-axis, log-scale) for each team's sensor (x-axis ticks, ranked from left to right in order of increasing residuals) colored according to pressure. Residuals are plotted with slight horizontal scatter and transparency to reduce overplotting. Expanded uncertainty ranged from 0.009 pH units (gray shaded region) to 0.022 pH units at 3050 dbar. The Performance category threshold (dashed line) was 0.02 pH units.]

The WSOHXP competition demonstrates how sensor accuracy, repeatability, and stability can be evaluated in laboratory, coastal, and open-ocean settings. Temperature, gas exchange, pressure, and biological processes simultaneously affect seawater chemistry over differing spatial and temporal scales. For sensor comparisons, controlling and documenting these sources of variability is a necessary prerequisite to evaluating sensor performance. Collectively, the sensors tested in this competition demonstrated the ability to collect high-quality pH measurements under a range of conditions, including variable coastal and deep ocean environments. Their performances are noteworthy considering a recent inter-laboratory comparison resulted in differences from the assigned true value of up to 0.04 and 0.1 pH units for spectrophotometric and electrometric approaches, respectively, under controlled laboratory conditions (Bockmon and Dickson 2015). Beyond the technical aspects of the competition, the XPRIZE is a means of catalyzing technological development to address scientific and humanitarian issues. In the coming years, the technologies tested in this competition should simultaneously reduce the cost and increase the quality and coverage of OA data.
These advances should improve climate models, OA-biological studies, and our understanding of ocean processes.