Optical types of inland and coastal waters

Inland and coastal waterbodies are critical components of the global biosphere. Timely monitoring is necessary to enhance our understanding of their functions and of the drivers affecting these functions, and to deliver more effective management. The ability to observe waterbodies from space has led to Earth observation (EO) becoming established as an important source of information on water quality and ecosystem condition. However, progress toward a globally valid EO approach is still largely hampered by inconsistencies over temporally and spatially variable in-water optical conditions. In this study, a comprehensive dataset from more than 250 aquatic systems, representing a wide range of conditions, was analyzed in order to develop a typology of optical water types (OWTs) for inland and coastal waters. We introduce a novel approach for clustering in situ hyperspectral water reflectance measurements (n = 4045) from multiple sources based on a functional data analysis. The resulting classification algorithm identified 13 spectrally distinct clusters of measurements in inland waters, and a further nine clusters from the marine environment. The distinction and characterization of OWTs were supported by the availability of a wide range of coincident data on biogeochemical and inherent optical properties from inland waters. Phylogenetic trees based on the shapes of cluster means were constructed to identify similarities among the derived clusters with respect to spectral diversity. This typification provides a valuable framework for a globally applicable EO scheme and the design of future EO missions.

the UN Sustainable Development Goals. Nevertheless, several aspects of their role in these processes remain unclear (Raymond et al. 2013), while their resilience to changing environmental conditions and anthropogenic disturbance is still poorly understood (Fabry et al. 2008; Petrescu et al. 2015). Globally valid approaches for the study of these processes based only on field data are typically hindered by their high variability at both temporal and spatial scales (Dickey 2003; Peters et al. 2007). Furthermore, the sheer number of waterbodies and their geographic remoteness hamper their systematic study (Karl 1999; Verpoorter et al. 2014).
Satellite remote sensing offers a means to quantify physical and biogeochemical processes in aquatic systems at large scales, providing valuable insights into mechanisms associated with biogeochemical cycles, the climate system and its changes (Guo et al. 2015; Hestir et al. 2015). The rapidly increasing rate of data collection from Earth observation (EO) missions suitable for observing waterbodies (e.g., European Space Agency [ESA] Envisat and Sentinel, National Aeronautics and Space Administration [NASA] Landsat and Aqua missions) offers long-term archives of our aquatic environments, while advances in optical sensors support new and more detailed characterization of the Earth surface. Of particular interest is the remote sensing signal in the visible and infrared part of the spectrum, since it comprises information on key color-forming substances such as phytoplankton pigments, suspended minerals, and dissolved compounds. Nonetheless, the wide range of possible combinations and composition of these substances found within and between aquatic systems challenges the applicability of EO techniques (Bukata 1995; Morel and Maritorena 2001; Mélin and Vantrepotte 2015). Numerous approaches have been developed for the retrieval of biogeochemical properties from remote sensing data (reviews in Acker et al. 2005; Matthews 2011; Odermatt et al. 2012; Blondeau-Patissier et al. 2014; Tyler et al. 2016), but quantifying the associated uncertainties when these are applied over different conditions has hitherto proved difficult.
Water optical typologies have been suggested as a mechanism to delineate water masses on the basis of their optical properties (Jerlov 1977; Prieur and Sathyendranath 1981; Baker and Smith 1982) and thereby schematize the application of EO methods (Arnone et al. 2004). As a result, a range of parameters linked to the observed variability in water color has been encompassed in classification schemes. These include water column parameters such as Secchi disk depth (Z_SD; see Table 1 for a list of symbols and acronyms) (e.g., Arnone 1985), inherent optical properties (IOPs; mainly absorption: e.g., Babin et al. 2003; Shi et al. 2014) as well as radiometric quantities measured below or above the water surface (e.g., Le et al. 2011; Moore et al. 2014).
Traditionally, the partitioning of water properties into optical types has been driven by the failure of retrieval algorithms, often developed for oceanic waters, to provide accurate data in coastal and inland systems. In this context, Morel and Prieur (1977) distinguished two water types, depending on the predominance of phytoplankton and autochthonous production of dissolved and particulate detrital material (Case-1), or the input of external particulate and dissolved material into the system causing an uncoupling of phytoplankton from bulk optical properties (Case-2). More recent studies have moved toward the differentiation of water types in optically complex environments using in situ and/or satellite-derived reflectance data. Most of these studies have considered the range of optical classes in marine systems (English Channel and North Sea: Lubac and Loisel 2007; Tilstone et al. 2012; Vantrepotte et al. 2012; Iberian coastal waters: Spyrakos et al. 2011; Adriatic Sea: Mélin et al. 2011; Yellow Sea: Ye et al. 2016; Northwest Atlantic shelf: Moore et al. 2001; global ocean: Moore et al. 2009, 2014; global coastal waters: Mélin and Vantrepotte 2015), with only a few studies focussed on inland systems (lakes and reservoirs in China: Le et al. 2011; Shen et al. 2015; Estonian and Finnish lakes: Reinart et al. 2003). Overall, these classification schemes can substantially improve the remote sensing products associated with individual optical water types (OWTs), and have demonstrated the need for a better understanding of the underlying variability, especially in nearshore and inland waterbodies (Moore et al. 2014). In parallel, optical water typologies based on remote sensing data have found further applications in ecological studies (Martin Traykovski and Sosik 2003), the detection of blooms (comprehensive list in Blondeau-Patissier et al. 2014) and in the more detailed study of the relationships between absorption parameters and water constituents, especially when these can be determined in large datasets from different aquatic systems (Torrecilla et al. 2011).
Several hierarchical, partitional, and hybrid (Jain et al. 1999) clustering techniques have been implemented for the classification of remote sensing reflectance (R_rs) into groups based upon differences in magnitude and shape. Consequently, techniques including agglomerative hierarchical clustering (Shi et al. 2014), k-means clustering (Palacios et al. 2012), fuzzy clustering (Moore et al. 2014), and artificial neural networks (Canziani et al. 2008) have been used to uncover clusters present in these datasets using different degrees of implicit or explicit knowledge. While these approaches provide useful insights into the differentiation of water masses based on their optical properties, they have often lacked a comprehensive analysis of the physical basis for the definition of the clusters in terms of their variability in IOPs and biogeochemical significance. Moreover, few studies have considered the relations between OWTs found in coastal and inland aquatic systems. In spite of the progress made in the development of these methodologies, a solid foundation for dealing with high data dimensionality, uncertainty due to the use of different sensors, and variability in the relevant spectral features is still lacking.

Spyrakos et al. Optical types of inland and coastal waters
The aim of the present study is to extend our knowledge of the optical diversity of aquatic systems, and in particular inland waters. To this end, a large database of observations from a range of different systems and a wide range of water conditions is used to: (1) obtain distinct OWTs; (2) develop a methodological approach for capturing key features found in the spectra based on functional data analysis; and (3) assess similarities and differences between inland waters and coastal marine systems. It is expected that the optical diversity of inland waterbodies exceeds that of marine systems, reflecting the wide diversity in morphology and surrounding land use of inland waters. Nevertheless, we expect that within and between regions recurrent OWTs can be detected, such as systems dominated by phytoplankton or by high light absorption due to dissolved matter. We subsequently investigate the extent to which OWTs can be approximated by a limited set of wavebands available from current and future remote sensors ("Implications for implementation to satellite imagery" section).

Datasets
A large dataset (hereafter denoted Dataset-N) of 4035 in situ hyperspectral R_rs spectra from inland and coastal marine waters was used in the clustering analysis. The dataset consisted of data from more than 250 inland lakes, reservoirs and large rivers (Dataset-I, inland waters) and data from 14 campaigns in marine waters (Dataset-C, coastal waters). For this study, data were sourced from the in situ bio-optical data repositories LIMNADES (Lake Bio-optical Measurements and Matchup Data for Remote Sensing: http://www.limnades.org) and SeaBaSS (SeaWiFS Bio-optical Archive and Storage System: http://seabass.gsfc.nasa.gov).

Inland aquatic systems
The LIMNADES data (Dataset-I) used here were compiled from 16 individual datasets of bio-optical and biogeochemical measurements from a variety of natural and artificial inland aquatic systems including mainly lakes and reservoirs but also rivers and floodplains. Table 2 lists these datasets, providing references to detailed information including sources and spatial coverage. A total of 3025 R_rs(λ) spectra across a wide range of system characteristics, conditions, and geographical locations were used in the clustering analysis. Cluster analysis was initially performed only on Dataset-I R_rs(λ) to facilitate the determination of distinct OWTs solely in inland water systems. Paired measurements of IOPs and biogeochemical parameters were then used to support the characterization of the resulting clusters.

Coastal systems
The well-documented SeaBaSS dataset (Dataset-C, Table 3) (Werdell and Bailey 2002; Werdell et al. 2003) was used for comparison to inland optical clusters as defined by the classification analysis. Data extracted from SeaBaSS were restricted to hyperspectral R_rs(λ) spectra (n = 1010), mainly from coastal but also from some open ocean environments, originally measured above water. Dataset-C included few spectra (n = 68) from open ocean environments, but due to the dominance of data from coastal waters, it is considered here to represent coastal environments. Only a limited number of these datasets also included coincident measurements of IOPs and water quality parameters. As a result, IOPs and water constituents from the marine environment were not considered in this study. Nevertheless, clustering algorithms were applied to R_rs(λ) spectra from both inland and coastal systems in order to broaden the application of the classification scheme and study commonalities in spectral patterns across inland and coastal waters.

Definition of reflectance
Clustering analysis was based on hyperspectral R_rs, with a minimum resolution of 1 nm and spectral range of 400-800 nm. R_rs(λ) (in sr⁻¹) is defined here as the upwelling radiance emerging from the water column divided by the downwelling irradiance reaching the water surface. For those cases when in situ measurements were carried out just below surface, R_rs(0⁻) was converted to R_rs(0⁺) using the air-sea interface transfer coefficients of Eq. 1 (Lee et al. 1999

Functional data analysis
Clustering was employed in order to identify statistically robust groups of spectra, which can be used to assist the definition of distinct OWTs found in aquatic systems. In the clustering process, the approach used for preprocessing of the data can play a crucial role in determining the influence of spectral features on the clusters obtained. In previous studies, classification of radiometric quantities has mainly considered unscaled data (e.g., Moore et al. 2001; Mélin et al. 2011); however, scaling of spectra has been suggested by multiple authors in order to moderate the effect of variation in amplitude attributed to changes in the concentrations of optically active constituents (Mobley 1994; Schalles 2006; Ficek et al. 2012). In the analysis presented here, the R_rs(λ) were standardized prior to clustering in order to reduce the effect of the mean spectral reflectance on the separation of clusters. It is further thought that uncertainties in R_rs(λ) are more likely to have an effect on the amplitude of the spectra than on their shape (Craig et al. 2006). The standardization used in this study entailed division by the area between each spectrum and a zero baseline, calculated using numerical integration. This standardization approach was chosen because it preserves the shape of the R_rs across the different parts of the spectrum (Vantrepotte et al. 2012).
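The area-based standardization described above can be sketched as follows. This is a minimal illustration, assuming the trapezoidal rule (the text specifies numerical integration but not which rule); the function names and the synthetic spectra are hypothetical and are not taken from the study.

```python
def trapezoid_area(wavelengths, values):
    """Area between a spectrum and the zero baseline (trapezoidal rule)."""
    area = 0.0
    for i in range(1, len(wavelengths)):
        width = wavelengths[i] - wavelengths[i - 1]
        area += 0.5 * (values[i] + values[i - 1]) * width
    return area


def standardize_spectrum(wavelengths, rrs):
    """Divide a spectrum by its integral so that the result has unit area."""
    area = trapezoid_area(wavelengths, rrs)
    if area <= 0:
        raise ValueError("spectrum must have a positive integral")
    return [value / area for value in rrs]


# Two synthetic spectra with the same shape but a tenfold amplitude difference.
wavelengths = [400, 500, 600, 700, 800]            # nm
dim = [0.001, 0.004, 0.003, 0.002, 0.0005]         # made-up values
bright = [10 * v for v in dim]

dim_std = standardize_spectrum(wavelengths, dim)
bright_std = standardize_spectrum(wavelengths, bright)
```

After standardization the two spectra coincide, which is the intended effect: amplitude differences are removed while the spectral shape is kept.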
Subsequently, a functional data analysis approach was used to cluster the spectra. This approach approximates each R_rs(λ) using a smooth function which is estimated via a linear combination of B-spline basis functions (full details are provided in Ramsay 2006). Rather than treating the reflectance values measured at each wavelength as single, correlated observations, they are viewed as realizations of an unobservable continuous variable. Viewing the R_rs spectra in this way and clustering the smooth curves allows features within the groups (i.e., commonalities in shape and mean level) to be captured, which may be neglected if clustering were applied only to a single summary value (Tarpey and Kinateder 2003; Tarpey 2007). An attractive feature of the smoothing methods used within functional data analysis is that the underlying functions can be estimated such that excessive local variability which is not of interest is removed. In addition to reducing the noise in the data, by treating the basis coefficients which estimate each curve as the quantities to be clustered, we can justify the assumption of independence amongst variables. This is a fundamental assumption of clustering which is often overlooked (Fraley and Raftery 1998) and can be violated in hyperspectral data due to the presence of strong autocorrelation between observations at neighboring wavelengths.
The number of basis functions used to estimate each smooth R_rs function controls the degree of flexibility, with more basis functions resulting in more flexibility. Adaptive smoothing can also be applied via the use of a non-constant basis to enable more flexibility in regions where there is greatest variability among the R_rs(λ) spectra. In general, far fewer basis functions are used to represent each smooth function than there are original measurements, leading to a large reduction in dimensionality.
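To illustrate how a non-uniform knot sequence determines the number and local support of the basis functions, the sketch below evaluates a cubic B-spline basis with the Cox-de Boor recursion. The breakpoints are hypothetical and chosen only for illustration; they are not the knot sequence used in this study, and the function names are assumptions.

```python
def bspline_basis(i, degree, knots, x):
    """Value at x of the i-th B-spline basis function (Cox-de Boor recursion)."""
    if degree == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    span_left = knots[i + degree] - knots[i]
    if span_left > 0:                       # skip terms with zero-width spans
        left = (x - knots[i]) / span_left * bspline_basis(i, degree - 1, knots, x)
    span_right = knots[i + degree + 1] - knots[i + 1]
    if span_right > 0:
        right = ((knots[i + degree + 1] - x) / span_right
                 * bspline_basis(i + 1, degree - 1, knots, x))
    return left + right


def design_row(x, knots, degree=3):
    """One row of the design matrix: every basis function evaluated at x."""
    n_basis = len(knots) - degree - 1
    return [bspline_basis(i, degree, knots, x) for i in range(n_basis)]


# Hypothetical breakpoints (nm): coarser at the blue end, denser in the middle.
breakpoints = [400, 450, 500, 525, 550, 575, 600, 650, 700, 750, 800]
degree = 3
# Repeat the end points so that the basis is clamped at 400 nm and 800 nm.
knots = [breakpoints[0]] * degree + breakpoints + [breakpoints[-1]] * degree
n_basis = len(knots) - degree - 1

row = design_row(560.0, knots, degree)      # basis values at 560 nm
```

Each spectrum sampled at hundreds of wavelengths would then be summarized by `n_basis` coefficients, obtained for example by least squares against the design matrix. The half-open intervals in the degree-0 case mean the right end point (800 nm here) needs separate handling in practice.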

Table 3. Column headings: Dataset; Principal institute responsible for sample and data collection and analysis/experiment; Marine system(s).

k-means clustering
The k-means approach (MacQueen 1967; Lloyd 1982) was used to generate spectrally distinct water classes from the R_rs(λ) datasets. The k-means algorithm is a partitional approach (Jain 2010), well known for its efficiency in the classification of large datasets (Huang 1998). For functional data, each individual curve and each cluster center can be defined in terms of the set of basis coefficients which defines the curve. Multiple starting points (50) were specified for the cluster centers in order to ensure that the partition identified is not sensitive to the initial selection.
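A minimal sketch of k-means with multiple random starts is given below, assuming plain Lloyd iterations and keeping the run with the lowest within-cluster sum of squares. The function names and the toy two-dimensional vectors (standing in for basis-coefficient vectors) are hypothetical; the study's own implementation is not described at this level of detail.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    return [sum(column) / len(points) for column in zip(*points)]


def lloyd(points, k, rng, max_iter=100):
    """One k-means run (Lloyd's algorithm) from randomly chosen initial centres."""
    centres = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centres[j]))
                      for p in points]
        if new_labels == labels:            # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                     # keep the old centre if a cluster empties
                centres[j] = mean_vector(members)
    inertia = sum(squared_distance(p, centres[lab]) for p, lab in zip(points, labels))
    return inertia, labels, centres


def kmeans(points, k, n_starts=50, seed=0):
    """Best of n_starts runs, judged by within-cluster sum of squares."""
    rng = random.Random(seed)
    return min((lloyd(points, k, rng) for _ in range(n_starts)), key=lambda r: r[0])


# Toy coefficient vectors forming two well-separated groups.
points = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
          [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]]
inertia, labels, centres = kmeans(points, k=2)
```

Running several starts guards against a single poor initialization, since Lloyd's algorithm only converges to a local optimum.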
As with all clustering approaches, k-means requires a choice of the appropriate number of clusters. In this case, we used a gap statistic (Tibshirani et al. 2001), which selects the statistically optimal number of clusters by comparing the change in within-cluster dispersion between the observed data and a null reference distribution that is generated using the observed data. The reference distribution assumes there is no cluster structure in the data. Fuzzy c-means (FCM) clustering was also explored. While the estimated membership function in FCM may be attractive, the drawback of this approach is the required specification of an additional parameter, namely a weighting exponent which determines the degree of fuzziness in the clusters. k-means corresponds to the limiting case of FCM in which the weighting exponent approaches 1, resulting in all data points being assigned to one and only one cluster.
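The role of the weighting exponent can be illustrated with the standard FCM membership formula, in which the membership of a point in cluster i is 1 / Σ_k (d_i / d_k)^(2/(m-1)). The sketch below is an assumption-labeled illustration of that formula only (the function name and distances are hypothetical); it shows memberships hardening toward a single cluster as the exponent m approaches 1.

```python
def fcm_memberships(distances, m):
    """Fuzzy c-means memberships of one point, given its distance to each centre."""
    if m <= 1:
        raise ValueError("the weighting exponent m must be greater than 1")
    if any(d == 0 for d in distances):      # the point coincides with a centre
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]


distances = [1.0, 2.0, 4.0]                 # made-up distances to three centres
soft = fcm_memberships(distances, m=2.0)    # memberships spread over clusters
nearly_hard = fcm_memberships(distances, m=1.05)   # almost all weight on the nearest
```

With m = 2 the nearest centre receives roughly three quarters of the membership; with m close to 1 it receives essentially all of it, which mirrors the hard assignment made by k-means.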
As a measure of proximity to cluster mean, the L2 norm distance was calculated between each individual and each cluster mean. These curve-mean distances were scaled between 0 and 1 and were used to quantify how close the curve was to each cluster mean.
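A sketch of the curve-to-mean distance calculation follows. The L2 norm is approximated on a regular wavelength grid, and the 0-1 scaling is assumed here to be a min-max rescaling across the cluster means for each curve; the text does not specify the scaling, so this choice and all names are assumptions.

```python
import math


def l2_distance(curve_a, curve_b, step=1.0):
    """Approximate L2 norm of the difference of two curves on a regular grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)) * step)


def scaled_distances(curve, cluster_means, step=1.0):
    """Distances from one curve to every cluster mean, rescaled to [0, 1]."""
    raw = [l2_distance(curve, mean, step) for mean in cluster_means]
    lowest, highest = min(raw), max(raw)
    if highest == lowest:                   # all means equally distant
        return [0.0 for _ in raw]
    return [(d - lowest) / (highest - lowest) for d in raw]


# Made-up curve and cluster means on a three-point grid.
curve = [1.0, 2.0, 3.0]
cluster_means = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
scaled = scaled_distances(curve, cluster_means)
```

A value of 0 marks the closest cluster mean and 1 the most distant, so the scaled distances can be read as a relative measure of proximity.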
In this study, we use the following terms: "Cluster" refers to the end-member resulting from cluster analysis, i.e., a set of distinct spectra as separated by the k-means algorithm; "Group" refers to clusters whose means show high within-group similarity in their second derivatives, based on the L2 norm distances; and "Type" refers to the representative spectrum (here, the mean spectrum) and the in-water optically active compounds of a cluster.

Results
Spectral variability, rescaling and adaptive smoothing
Figures 1a, 2a show the in situ R_rs(λ) spectra from Dataset-I and Dataset-C on their original scale (dataset details are provided in Tables 2, 3). Spectra from inland waters generally had higher mean reflectance than those from coastal waters but both sets demonstrated considerable variation in magnitude, even when they exhibited similar shapes. The reflectance peak in the green part of the spectra (500-600 nm) ranged from 0.0003 sr⁻¹ to 0.2031 sr⁻¹ in Dataset-I and from 0.001 sr⁻¹ to 0.051 sr⁻¹ in Dataset-C. In the near-infrared (NIR) spectral region (680-720 nm), maximum values of the R_rs peak were 0.2137 sr⁻¹ and 0.0359 sr⁻¹, respectively, for the inland and coastal data. Spectral features appearing around R_rs(760) could be indicative of an abnormal signal, pertaining to flaws in the measurement and processing protocols (e.g., suboptimal sensor calibration, incompatible viewing angles, or lack of synchronicity in the measurement). As shown in Figs. 1b, 2b, standardized in situ R_rs(λ) spectra are accompanied by lower variability in the overall magnitude of reflectance. The coefficient of variation varied from 98% to 236% in Dataset-I with a local maximum at 675 nm and an overall minimum at 550 nm. Similarly, for Dataset-C, this varied from 98% at 550 nm to 236% at 675 nm.
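The per-wavelength coefficient of variation quoted above can be computed as in the following sketch. It assumes the sample standard deviation divided by the mean, expressed as a percentage; the text does not state which estimator was used, and the spectra below are made up.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 100.0 * statistics.stdev(values) / mean


def cv_by_wavelength(spectra):
    """One coefficient of variation per wavelength (spectra: equal-length lists)."""
    return [coefficient_of_variation(column) for column in zip(*spectra)]


# Three made-up spectra sampled at the same three wavelengths.
spectra = [
    [1.0, 4.0, 2.0],
    [2.0, 4.0, 4.0],
    [3.0, 4.0, 6.0],
]
cv = cv_by_wavelength(spectra)
```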
The resulting design matrix for the B-spline basis used in this study was based on 25 cubic basis functions (Fig. 3). This number provided the best achievable fit to the data and captured all key features of the standardized R_rs(λ). Figure 3 also illustrates the unequally spaced basis of B-spline functions. The B-spline representation used approximately one knot every 30 nm between 400 nm and 500 nm, one knot every 15 nm between 500 nm and 750 nm, with the 750-800 nm part of the spectrum being covered by a single interval. Although the large basis function used between 750 nm and 800 nm ignores a large part of the variability in this range, it helps resolve issues of instrument noise and poor instrument calibration that often affect this part of the spectrum (Fargion and Mueller 2002). The sparsity of the basis here prevents features that are not of interest, or are subject to a high degree of uncertainty, from having a disproportionate influence on the definition of clusters. Unusual spectra revealed by functional boxplots (not shown) were considered to correspond either to "extreme" cases (13 spectra) or to erroneous measurements (27 spectra) in which successive peaks appeared. The former cases referred to very clear waters (cluster I13), while the latter were removed from the dataset.

Clustering of reflectance spectra
The k-means algorithm was applied to the basis coefficients which defined the smooth R_rs(λ) spectra for three datasets: inland (Dataset-I), coastal (Dataset-C), and all waters (Dataset-N). For k-means clustering, the statistically optimal number of clusters determined by using the gap statistic (with 500 reference distributions) was 12 for Dataset-I and 9 for Dataset-C. An additional group of curves (I13) that were identified as being unusual by the functional boxplots was added to the 12 inland clusters identified using the k-means approach. All pairs of cluster means were found to be significantly different using a permutation t-test (Ramsay 2007), suggesting unique structural groups.
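For reference, the gap statistic of Tibshirani et al. (2001) can be written as below. The notation (W_k, D_r, n_r, B, sd_k, s_k) follows the usual presentation of that method rather than this paper; B denotes the number of reference datasets, which would correspond to the 500 reference distributions mentioned above, and d is typically the squared Euclidean distance.

```latex
\begin{align}
W_k &= \sum_{r=1}^{k} \frac{1}{2 n_r} D_r,
  \qquad D_r = \sum_{i,\, i' \in C_r} d_{ii'} \\
\mathrm{Gap}(k) &= \frac{1}{B} \sum_{b=1}^{B} \log W_{kb}^{*} - \log W_k \\
s_k &= \mathrm{sd}_k \sqrt{1 + 1/B} \\
\hat{k} &= \min \left\{\, k : \mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1} \,\right\}
\end{align}
```

Here W_k is the pooled within-cluster dispersion for k clusters, W*_kb is the same quantity computed on the b-th reference dataset, and sd_k is the standard deviation of log W*_kb over the B reference datasets.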
The optimal number of clusters for Dataset-N was identified using an approach based on analysis of similarity between a fixed number of clusters, because a large number of spectra were positioned on the boundaries between several clusters. The number of clusters was initially set to 21 based on the assumption that this would represent the upper bound (the sum of the clusters resolved separately in Dataset-I [n = 12] and Dataset-C [n = 9]). In order to identify clusters that could subsequently be merged, the difference between clusters was explored in terms of the L2 norm distance between the mean curves for each cluster.
Differences between the shapes of the cluster mean curves, following a second derivative transformation, were also considered. For Dataset-I, the means of the 13 distinct standardized and non-standardized R_rs(λ) spectral clusters, as identified by the k-means algorithm, are presented in Fig. 4. The largest numbers of spectra were assigned to clusters I2 (15.3%) and I6 (14.3%). Clusters I1 and I13 collectively contained 1.1% of the data. We noted that clusters were not strongly driven by waterbody or season but were distributed across space and time. Figure 5 presents the mean in situ R_rs(λ) spectra before and after standardization, for the nine clusters obtained by applying the k-means algorithm to the functional data from Dataset-C. The spectra were nearly equally partitioned (12.1-15.3%) between clusters C1, C3, C4, C6, C7, and C8. Conversely, 22 (2.2%), 46 (4.9%), and 83 (8.4%) spectra were grouped in clusters C2, C5, and C9, respectively. Figure 6 shows R_rs and standard deviation for each cluster identified in inland waters. The k-means classification of all data combined (Dataset-N) resulted in the 21 sets of reflectance spectra shown in Fig. 7. Figures 8, 9 summarize water constituents and the optical properties corresponding to each cluster of Dataset-I (I-clusters). Several parameters measured coincident to the reflectance measurements are considered here. The water constituent concentrations that were most commonly measured in parallel with the radiometric measurements were chlorophyll a (Chl a) (n = 2835), total suspended matter (TSM) (n = 1836), and absorption of colored dissolved organic matter (CDOM) at 442 nm (a_CDOM(442)) (n = 1720), while 622 R_rs(λ) measurements were also accompanied by absorption coefficients of phytoplankton pigments a_ph(λ) and non-algal particles (NAP) a_NAP.
Despite the high variability of these in-water parameters and the often complex relationships between apparent optical properties and the particulate and dissolved material found in inland waters, there are some notable differences among the groups of in situ water properties for each partition of R_rs(λ). As expected, the optical properties and concentrations of optically active substances underpin the clustering of R_rs(λ). The 13 I-clusters exhibited marked differences in terms of their water constituent concentrations and IOPs. For example, clusters I1, I7, and I8 exhibited very high concentrations of Chl a (mean values well above 100 mg m⁻³) and of the accessory pigment phycocyanin (PC) (mean values greater than 200 mg m⁻³). In contrast, Chl a was remarkably low in clusters I3 (1.60 ± 1.02 mg m⁻³, n = 214) and I13 (0.27 ± 0.57 mg m⁻³, n = 8). These clusters also showed higher values of Secchi disk depth (I3: 6.17 ± 2.52 m, n = 173; I13: 18.45 ± 4.17 m, n = 2) and the lowest mean concentration of TSM (I3: 1.57 ± 1.64 mg L⁻¹, n = 87; I13: 1.00 ± 0.88 mg L⁻¹, n = 8). We noted that the highest mean inorganic suspended matter (ISM) concentration (94.41 ± 64.45 mg L⁻¹, n = 200) was found in the samples grouped in cluster I5. In addition, cluster I5 was characterized by the highest a_NAP(442) mean (5.76 ± 2.90 m⁻¹, n = 112), while clusters I10 and I1 had higher a_CDOM(442) (9.00 ± 7.35 m⁻¹, n = 50) and a_ph(442) (106.49 ± 10.28 m⁻¹, n = 11), respectively. Clusters with the highest a_ph(442) and a_NAP(442) values were principally found among the groups with the lowest mass-specific absorption coefficients (a_ph(442):[Chl a] or a*_ph(442), and a_NAP(442):[TSM] or a*_NAP(442)), corresponding to a higher degree of "pigment packaging" (e.g., Bricaud et al. 1995) or cell shading and a more minerogenic NAP.
In cases where clusters had similar mean concentrations of one or more biogeochemical parameters, we generally observed differences in other variables which facilitated their distinctive characterization. For example, cluster I4 showed Chl a comparable to that of I5 but contrasting ISM concentrations. Figures 10a-c illustrate absorption spectra of CDOM and specific absorption of phytoplankton and NAP for each cluster identified by the classification analysis. In the analysis, we considered a spectral range from 400 nm to 700 nm, which corresponds to the range available for most data points. Both a_CDOM and its spectral slope (S_CDOM) varied between the different clusters. Cluster I3 showed the lowest S_CDOM (0.0114 ± 0.0068 nm⁻¹, n = 6). Higher S_CDOM values were observed in clusters I9 (0.0173 ± 0.0050 nm⁻¹, n = 39), I2 (0.0161 ± 0.0037 nm⁻¹, n = 57), and I11 (0.0150 ± 0.0017 nm⁻¹, n = 30). S_CDOM showed relatively low variability within the remaining clusters, with mean values ranging from 0.0139 nm⁻¹ to 0.0147 nm⁻¹. Figure 10b shows high variability of mean a*_ph(λ) in both magnitude and spectral shape among the clusters where both a_ph(λ) and Chl a were measured. The differences in spectral amplitude were mainly observed in the blue and red regions of the spectra; cluster I3 exhibited the lowest blue to red peak ratio while that ratio was higher in clusters I5, I11, and I12. These clusters were also characterized by the lowest mean value (0.0080 ± 0.0017 nm⁻¹, n = 45) of slope for NAP absorption, S_NAP. Figure 11 summarizes the relative contribution of optically active substances a_CDOM, a_ph, and a_NAP to total absorption (minus pure water absorption) at 442 nm for each optical cluster. Phytoplankton absorption was consistently the dominant absorption component of samples grouped in clusters I1 and I7 and regularly the weakest component in I4 and I5.
Spectra belonging to cluster I5 were predominantly characterized by a strong relative influence of a_NAP, whereas a_CDOM was the dominant light-absorbing component at 442 nm for clusters I2 and I3. Data points grouped in clusters I8 and I6 were mainly found toward the upper half of the ternary plot, whereas samples collected from clusters I11 and I12 mostly appeared at the lower half of the plot.
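The relative contributions plotted on a ternary diagram are simply each absorption term divided by the non-water total. A minimal sketch, with hypothetical function names and made-up absorption values (in m⁻¹ at 442 nm):

```python
def absorption_fractions(a_cdom, a_ph, a_nap):
    """Fractional contribution of each component to non-water absorption."""
    total = a_cdom + a_ph + a_nap
    if total <= 0:
        raise ValueError("total non-water absorption must be positive")
    return {
        "CDOM": a_cdom / total,
        "phytoplankton": a_ph / total,
        "NAP": a_nap / total,
    }


def dominant_component(fractions):
    """Name of the component with the largest fractional contribution."""
    return max(fractions, key=fractions.get)


# Made-up sample in which phytoplankton absorption dominates.
fractions = absorption_fractions(a_cdom=0.5, a_ph=1.0, a_nap=0.5)
dominant = dominant_component(fractions)
```

The three fractions sum to one, which is what allows each sample to be placed as a single point on a ternary plot.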

Relationships among optical clusters in inland and coastal waters
R_rs(λ) spectra from coastal systems were predominant in clusters N2, N5, N10, N12, N18, and N20 (all with relative contributions above 79.3%), while the remaining clusters were largely composed of spectra from inland waters. Sixteen clusters contained spectra derived from both inland and coastal systems (Fig. 7). A phylogenetic tree was constructed to explore relationships among the 21 cluster means (Fig. 12). This tree represents the similarity of the second derivatives of cluster means based on the L2 norm distances. Clusters N2 (n = 20), N3 (n = 36), N7 (n = 57), N9 (n = 21), and N15 (n = 59) can be seen to be most distinct from other clusters, with N3, N15, and N2 showing most difference from all other clusters in terms of their second derivatives. Two of these clusters (N2 and N9) also contained the lowest number of R_rs(λ) spectra, indicating they may be composed of uncommon spectral properties. Cluster N2 displayed spectral features in the blue and red region of the spectrum that suggest residual glint contribution in the measured signal. This group of measurements was therefore excluded from further analysis.
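The distance matrix underlying such a tree can be sketched as follows, assuming central finite differences for the second derivative and a regular wavelength grid. The tree-building step itself is not shown, since the text does not specify the linkage method; the function names and the toy curves are hypothetical.

```python
import math


def second_derivative(values, step):
    """Central finite-difference second derivative at the interior grid points."""
    return [(values[i - 1] - 2 * values[i] + values[i + 1]) / step ** 2
            for i in range(1, len(values) - 1)]


def l2_distance(a, b, step):
    """Approximate L2 norm of the difference of two curves on a regular grid."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) * step)


def derivative_distance_matrix(cluster_means, step=1.0):
    """Pairwise L2 distances between second derivatives of the cluster means."""
    derivs = [second_derivative(mean, step) for mean in cluster_means]
    n = len(derivs)
    return [[l2_distance(derivs[i], derivs[j], step) for j in range(n)]
            for i in range(n)]


# Toy means: the second is the first plus a constant offset; the third is flat.
cluster_means = [
    [0.0, 1.0, 4.0, 9.0, 16.0],
    [5.0, 6.0, 9.0, 14.0, 21.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
distances = derivative_distance_matrix(cluster_means)
```

Because differentiation removes constant offsets and linear trends, the first two toy curves are at zero distance from each other, which illustrates why second derivatives emphasize spectral shape over level.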
Using the phylogenetic tree, seven major Groups with high within-group similarity were identified (Group A: N18, N12; B: N11, N13, N14; C: N4, N17; D: N10, N21; E: N8, N20; F: N1, N6, N16; G: N5, N19). Group A (n = 310) mainly included reflectance spectra from Dataset-C with relatively high R_rs(λ) in the blue. Groups E and F both had three R_rs(λ) peaks between 500 nm and 750 nm and were mainly found in Dataset-I. The reflectance peak around 700 nm in Group F appeared associated with particulate scattering and occurred at longer wavelengths than in Group E, where cluster N5 suggests the presence of Chl a fluorescence at around 685 nm and cluster N19 suggests highly turbid water with a minor contribution of phytoplankton absorption. Groups B, D, and G were assembled closely (Fig. 13). These three Groups contained data from both Dataset-I and Dataset-C. Relatively clear waters (no prominent peak near 700 nm) and a strong influence of a_CDOM in the blue characterize the clusters in Group B. Clusters N10 and N21 (Group D) shared a sharp R_rs(λ) decrease near 600 nm and a high blue-to-green R_rs(λ) ratio suggesting clear waters, but with a lower blue-to-green ratio compared to the clusters of Group A. Last, Clusters N5 and N19 showed high similarities of the second derivatives; neither cluster shows clearly defined features beyond the attenuation of light by a_CDOM in the blue and absorption by water in the red to NIR domain. Interestingly, N5 contained primarily data collected in coastal systems (79.27%) whereas N19 was composed of spectra found in inland waters (98.21%).

Methodological considerations
R_rs(λ) holds valuable information on the concentration and composition of in-water constituents (Gordon et al. 1988; Gordon and Franz 2008) and is now readily available from multispectral ocean color satellite sensors. We present a novel approach for classification of in situ hyperspectral R_rs(λ) to help optimize the interpretation of proximal or remotely sensed R_rs(λ) in terms of biogeochemically relevant quantities. While k-means is a classical statistical method, its application in a functional setting is not routine, particularly when the irregularly spaced B-spline basis has been selected so that the clusters are based on the areas of the spectra that are of most interest. The robustness of this approach is potentially dependent on the smooth functions which are used to estimate the underlying smooth processes from which the observed data have arisen. The 25 cubic basis functions, with different resolution along the wavelength, employed here provided an excellent fit to the data (Fig. 3), capturing all key features of the spectra while removing local variability. This approach also proves to be an efficient way to reduce dimensionality and noise of the spectra while preserving distinctive features. FCM clustering was also explored and the adjusted Rand index (ARI) (Hubert and Arabie 1985) used for the comparison with the k-means approach suggested strong agreement (ARI greater than 0.76) between the two clustering methods. While both shape and amplitude of R_rs(λ) contain information about optically active constituents, we standardized the in situ spectra (Vantrepotte et al. 2012) in order to reduce the influence of spectral amplitude on clustering. This is considered a suitable approach when considering R_rs(λ) from such diverse origins.
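For readers unfamiliar with the adjusted Rand index, the following sketch computes it from two label vectors using the pair-counting form of Hubert and Arabie (1985). The function name and the toy labelings are illustrative; for FCM, the labels would presumably be taken as the cluster of highest membership, which is an assumption here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))           # contingency table
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:          # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions with permuted label names agree perfectly.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that split every pair differently score below zero.
opposed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index equals 1 for identical partitions regardless of how clusters are numbered and is close to 0 for agreement expected by chance, which makes it suitable for comparing the outputs of two different clustering methods.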
Focussing on the shape rather than amplitude of the spectra implies primary sensitivity to spectral variation in absorption coefficients in the clustering (Loisel and Morel 2001). However, since the absorption by water itself also displays a spectral dependence, the attenuation depth of the recorded signal, and therefore the light path and intensity of light scattering which primarily affects R_rs(λ) amplitude, does bear influence on the clustering results. General observations of the obtained clusters (Figs. 4, 5) show amplitude variability in dominant parts of spectra but with distinctive spectral features. However, data standardization prior to clustering has also been suggested to reduce spurious effects of unequal variances and clustering of non-standardized R_rs(λ) is still most common in the literature (Moore et al. 2014; Shen et al. 2015). Nevertheless, when Mahalanobis distance is used in the cluster analysis, data preprocessing is considered redundant unless rounding errors in the covariance matrix have not been restrained (Besset 2001; Eyob 2009).
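The effect of standardization on shape versus amplitude can be illustrated with a short sketch. It assumes that standardization means dividing each spectrum by its integral over wavelength, with a trapezoidal rule used for the integral. The function names and spectral values are hypothetical.

```python
def trapezoid_integral(wavelengths, values):
    """Integrate values over wavelengths with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(wavelengths)):
        width = wavelengths[i] - wavelengths[i - 1]
        total += 0.5 * (values[i] + values[i - 1]) * width
    return total


def standardize(wavelengths, reflectance):
    """Divide a spectrum by its integral so that only its shape remains."""
    area = trapezoid_integral(wavelengths, reflectance)
    if area <= 0:
        raise ValueError("spectrum integral must be positive")
    return [value / area for value in reflectance]


# Two hypothetical spectra with identical shape but a fivefold amplitude difference.
wavelengths = [400, 500, 600, 700, 800]
dim = [0.002, 0.004, 0.003, 0.001, 0.0005]
bright = [5 * value for value in dim]

dim_std = standardize(wavelengths, dim)
bright_std = standardize(wavelengths, bright)
# After standardization the two spectra coincide (up to floating-point error).
print(max(abs(a - b) for a, b in zip(dim_std, bright_std)))
```

Spectra that differ only by a multiplicative factor become indistinguishable, so a distance-based clustering of the standardized spectra responds to spectral shape alone.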

Optical water typology
While studies of optical water typology (e.g., Jerlov 1977; Morel and Prieur 1977; Moore et al. 2009) provided useful insights on the distinctive optical types found in aquatic systems, they were challenged by the representativeness of optical conditions and/or limited understanding of factors driving the observed variability among the different optical clusters. Our ability to identify representative clusters from 4035 R_rs(λ) spectra collected in 250 inland water and several coastal systems benefits from operating over a wide range of in situ biogeochemical parameters. In this study, we were able to resolve clusters of R_rs(λ) spectra representing statistically distinct optical clusters found in inland or coastal waters and, in some cases, in both environments. The number of clusters was determined with a purely data-driven approach, with the statistically optimal number selected by the gap statistic. The OWTs suggested here are considered typical of those found in the datasets and represent snapshots of what is a natural continuum of optical conditions in aquatic systems. Moreover, as an extension to previous research, we provide a detailed physical interpretation of the derived clusters facilitated by extensive data on the IOPs and concentrations of color-forming biogeochemical constituents. IOP data allowed a more detailed characterization of the optical clusters and provided reference subsets. However, we recognize that different instruments, methods, and protocols have been utilized for the measurement of optical and biogeochemical parameters. Consequently, some of the variability observed in the R_rs(λ) spectra will have arisen from different instrumentation and from different data collection and processing methodologies. In practice, biogeo-optical data covering such a wide range of ecosystem scales are scarce, and measurement protocols have often been locally refined, modified, and optimized.
It may be expected that the continued contribution of in situ observations to community databases such as LIMNADES and SeaBASS will lead to a gradual convergence of methodologies and a reduction in the associated uncertainties on in situ radiometric measurements.

Inland waters OWTs
The classification of inland water R_rs(λ) revealed 13 different optical clusters (Figs. 4, 6). The categorization of these clusters into OWTs was subsequently based on in-water information on absorption coefficients (i.e., Figs. 10, 11) and biogeochemical properties (i.e., Figs. 8, 9). Table 4 provides a brief description of each OWT. PC and the ratio of PC to Chl a (Simis et al. 2005) indicated the presence and relative abundance of cyanobacteria in an OWT. This is of particular interest for the monitoring of cyanobacteria blooms.
OWT1 represents waters with extremely high concentrations of Chl a, PC, and high R_rs(λ) in the red to near-infrared region of the spectrum indicating high abundance of cyanobacteria near or at the water surface. High PC concentrations (6953.3 ± 9778.9 mg m^-3) and ratios of PC to Chl a above 1 are also indicative of high abundance of cyanobacteria in this OWT. It is not uncommon to find extremely high concentrations of pigments and vegetation-like R_rs(λ) spectra due to shallow light penetration (and therefore limited water absorption) in inland and coastal waters (Kutser et al. 2012). For all spectra pooled into OWT1, we observed an R_rs(λ) peak close to 655 nm. This has been suggested to be a combined effect of high Chl a and PC absorption on either side of the peak (Kudela et al. 2015) and could also be associated with sun-induced autofluorescence of phycobilipigments. a_NAP, while high, is largely masked by phytoplankton absorption, suggesting dominance of living material over detritus and mineral particles, and masking of a_CDOM influences on the spectrum due to a short light path, similar to the masking of the absorption by water.
OWT2 was the most common case in our dataset, showing diversity in reflectance shape with peaks at regions (565 nm, 645 nm, and 695 nm) where particles scatter light (Gitelson et al. 2000; Doxaran et al. 2009) and where peaks were bounded by pigment absorption maxima (Kirk 1994). In terms of the absorption budget at the blue wavelengths, OWT2 is located close to the center of the ternary plot, which indicates that a_CDOM, a_NAP, and a_ph were contributing almost equally to non-water absorption, while the high S_CDOM(400-700) suggests the dissolved fraction was dominated by terrestrial humic acids (Yacobi et al. 2003; Zhang et al. 2005; Fichot and Benner 2012).
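The position of a sample in a ternary absorption-budget plot follows from the fractional contribution of each component to total non-water absorption. The sketch below assumes the budget is evaluated at a single blue wavelength (for example 442 nm). The function name and coefficient values are hypothetical.

```python
def absorption_budget(a_cdom, a_nap, a_ph):
    """Return the fractional contributions (summing to 1) of CDOM, NAP and
    phytoplankton to non-water absorption at a single wavelength."""
    total = a_cdom + a_nap + a_ph
    if total <= 0:
        raise ValueError("total non-water absorption must be positive")
    return a_cdom / total, a_nap / total, a_ph / total


# Hypothetical absorption coefficients at a blue wavelength, in m^-1.
fractions = absorption_budget(a_cdom=1.1, a_nap=0.9, a_ph=1.0)
print([round(fraction, 3) for fraction in fractions])
```

A sample whose three fractions are all close to one third plots near the center of the ternary diagram, which is the situation described for OWT2.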
OWT3 denotes clear waters characterized by high transparency and relatively low concentrations of water constituents that do not co-vary. Remote sensing applications could be challenging in these waters due to the lack of diagnostic features, while the waters still present the optical complexity that invalidates the use of blue-green ratio ocean chlorophyll algorithms. Specific absorption of phytoplankton and NAP in this OWT was generally high and in line with values recorded in coastal areas (e.g., Tilstone et al. 2012).
OWT4 represents turbid waters with moderate concentrations of Chl a, PC, CDOM, and dominance of a_NAP combined with high a_ph variability at the shorter wavelengths. Specific absorption of NAP in OWT4 was substantially high. Using the available data and reported information of the sites categorized in this OWT (Dall'Olmo and Gitelson 2006; Matthews and Bernard 2013), it can be deduced that the increased a*_NAP(442) is related to high organic content of TSM (Ferrari and Dowell 1998; Babin et al. 2003).
OWT5 shows the brightly reflective nature of sediment-laden waters with high reflectance across a wide range of the spectrum. Similar reflectance spectra are described in highly turbid aquatic systems (Dekker 1993; Ruddick et al. 2006; Schalles 2006). Sites belonging to this optical type were mainly shallow floodplain (e.g., Amazon) and lowland lakes (e.g., Taihu) or rivers (e.g., Missouri). The ISM contribution to TSM in these waters is high (generally above 70% and on several occasions up to 100%), while a_NAP(442) is noticeably high and NAP is often the dominant component of light absorption. The dominance of particles of mineral origin is likely to be related to the observed low a_NAP(442):ISM (mean = 0.0736 m^2 g^-1, N = 109) values (Mikkelsen 2002).
OWT6 includes waters with balanced contributions of optically active constituents to the absorption budget. This OWT pooled samples with relatively high concentrations of Chl a and PC and equal contributions of CDOM, phytoplankton, and NAP to absorption at blue wavelengths. Relatively high values of PC (62.5 ± 51.21 mg m^-3) and PC to Chl a ratio (1.4 ± 0.9) reveal a significant presence of cyanobacteria in this OWT.
OWT7 delineates waters with particularly high values of Chl a concentrations and cyanobacteria abundances (PC:Chl a: 1.9 ± 0.8 and PC: 733.4 ± 394.1 mg m^-3) and high R_rs(λ) in the red/near-infrared spectral region (albeit lower than OWT1). In contrast to OWT1, OWT7 exhibits a pronounced reflectance peak around 700 nm. a_ph dominated the absorption budget at 442 nm while a_CDOM was high but very variable.
OWT8 is characterized by elevated concentrations of water constituents, especially of Chl a and the accessory pigment PC (indicating cyanobacteria presence). Nevertheless, Chl a and PC levels are lower when compared to OWT1 and OWT7, resulting in differences in R_rs(λ) amplitude and shape, particularly in the red and near-infrared parts of the spectrum. In this context, R_rs(λ) appears lower in this spectral region while the reflectance peak is closer to 700 nm.
OWT9 shows similar spectra to those of OWT2 with an absence of a well-defined peak in the red to near-infrared region and increased non-standardized and standardized R_rs(λ) between 500 nm and 600 nm. Reflectance at shorter wavelengths was generally higher in OWT9. Optically active compounds in these waters were at similar concentrations to those observed in OWT2.
OWT10 differed from any of the other optical categories in having considerably lower reflectance from 400 nm to 600 nm with no discrete peaks and troughs in this part of the spectrum. However, an R_rs(λ) peak is noticeable near 700 nm. OWT10 grouped data collected from rivers and lakes with markedly higher concentrations of CDOM, which has a strong absorption effect at the shorter wavelengths (< 500 nm) (Kirk 1994; Del Vecchio and Blough 2004). Similar spectra have been previously reported in CDOM-rich environments (e.g., Kallio et al. 2001). Strömbeck and Pierson (2001) have shown that CDOM, at high concentrations, can significantly absorb light even in the red region.
OWT11 appears typical for inland waters with presence of cyanobacteria, high a*_NAP(442) and high concentrations of CDOM. Reflectance spectra of this OWT appear with clearly observable but flattened peaks between 550 nm and 700 nm and with high red to blue ratios. The green maximum is suppressed and shifted to longer wavelengths due to strong CDOM absorption.
OWT12 represents turbid, moderately productive waters with cyanobacteria presence. R_rs(λ) spectral shapes resemble those of OWT11 but with a shorter wavelength of the green maximum while values are higher in the blue and lower from 580 nm to 720 nm.
Finally, OWT13 shows typical clear blue waters with high reflectance at shorter wavelengths and low reflectance values in the red region of the spectra, similar to clear oceanic waters (e.g., Cannizzaro and Carder 2006). This OWT was poorly represented in Dataset-I. In general, there is a scarcity of observations below 3 mg m^-3 of Chl a (14%) or below 3 mg L^-1 of TSM (7%) in Dataset-I, which reflects the recent focus of research toward eutrophic lakes and reservoirs with harmful algal blooms.

Relationships among optical clusters in inland and coastal waters
Synthesis and analysis of datasets coming from both inland and coastal waters provided a glimpse of the optical proximity between systems with a diverse range of properties. The results highlight common as well as unique spectral characteristics found in these waters, supporting a move toward an integrated optical classification framework for inland and coastal systems. This could be of great help especially in studies of multiple-component dynamic aquatic systems (Tyler et al. 2016) and global climatic trends. Classification of all available data led to 21 clusters of reflectance spectra (Fig. 7), many of which contained data from both inland and coastal systems. Importantly, this demonstrates a continuum of OWTs that extends across system boundaries. Previous related research (Moore et al. 2001, 2009, 2014; Reinart et al. 2003; Lubac and Loisel 2007; Le et al. 2011; Mélin et al. 2011; Spyrakos et al. 2011; Vantrepotte et al. 2012; Tilstone et al. 2012; Ye et al. 2016) has suggested a substantially smaller number of optical clusters, but these studies were primarily conducted at regional scales where sample sizes and the global representativeness of the waterbodies considered might have limited the resolution of OWTs. Sun et al. (2012, 2014) suggested a different approach for optical classification of aquatic systems based on the normalized trough depth at 675 nm and data from turbid and productive waterbodies. This approach could be extremely useful, especially for the retrieval of Chl a, but its applicability to other environments included here (e.g., clear waters, waters high in a_CDOM) needs to be proven. Many of the clusters described in these previous studies are represented in Figs. 4-7. Moreover, here we have considered waters with extreme scattering and/or absorbing properties, which have typically been omitted from previous optical classification schemes as outliers.
In some cases, surface waters with extreme optical properties were found to form discretely identifiable optical clusters (e.g., cluster I3 and I10). The current analyses and results show a greater number of clusters in inland than in coastal and open-sea systems. This is, at least in part, explained by the larger size and geographical and seasonal coverage of the inland water dataset. However, given the diversity in inland waters, it is not unreasonable to suggest that these systems could also comprise a larger portion of the optical diversity of natural waters. Despite these differences, the cluster analysis performed here has shown that some optical clusters are common to both inland and coastal waters. The phylogenetic tree of Fig. 12 represents the similarity of the second derivatives of all cluster means based on L2 norm distances and identified seven major groups. In parallel, it provided useful information regarding the parts of the spectra responsible for the observed similarities/dissimilarities between the clusters. These principally concern R_rs(λ) peak shifts, changes in the ratio of blue to green or red and features associated with accessory phytoplankton pigments. Such information should be considered when designing future EO missions.
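The similarity measure behind such a tree, an L2 norm distance between second derivatives of cluster means, can be sketched as follows. This is a simplified illustration: it assumes cluster means sampled on a uniform wavelength grid and uses central finite differences, whereas derivatives of fitted smooth functions could equally be used. The function names and the synthetic spectra are hypothetical.

```python
from math import exp, sqrt


def second_derivative(values, step):
    """Central finite-difference second derivative on a uniform grid."""
    return [(values[i - 1] - 2.0 * values[i] + values[i + 1]) / step ** 2
            for i in range(1, len(values) - 1)]


def l2_distance(curve_a, curve_b, step):
    """L2 norm of the difference between two curves (trapezoidal rule)."""
    squared = [(a - b) ** 2 for a, b in zip(curve_a, curve_b)]
    integral = sum(0.5 * (squared[i] + squared[i - 1]) * step
                   for i in range(1, len(squared)))
    return sqrt(integral)


def derivative_shape_distance(mean_a, mean_b, step):
    """Distance between two cluster means based on their second derivatives."""
    return l2_distance(second_derivative(mean_a, step),
                       second_derivative(mean_b, step), step)


# Synthetic cluster means on a 10 nm grid (illustrative shapes only).
step = 10.0
wavelengths = [400.0 + step * i for i in range(41)]
peak_560 = [exp(-((w - 560.0) / 40.0) ** 2) for w in wavelengths]
peak_560_offset = [v + 0.2 + 0.001 * (w - 400.0)
                   for v, w in zip(peak_560, wavelengths)]
peak_600 = [exp(-((w - 600.0) / 40.0) ** 2) for w in wavelengths]

print(derivative_shape_distance(peak_560, peak_560_offset, step))
print(derivative_shape_distance(peak_560, peak_600, step))
```

The second derivative removes constant offsets and linear trends, so the first pair is essentially at zero distance while the peak shift in the second pair yields a clearly positive distance. A pairwise distance matrix built this way could then be passed to a hierarchical clustering routine to draw the tree.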

Implications for implementation to satellite imagery
The scope of this work was to identify distinct optical clusters and suggest OWTs for natural waters based on in situ data. Clusters were defined based on hyperspectral R_rs(λ) but these can be resampled to any sensor spectral resolution to assess the capability of differentiating clusters from EO data. In order to broadly evaluate the consistency of clustering results with respect to available EO satellite sensor wavebands, we performed a preliminary analysis to test the applicability of the approach. This included a comparison between the output of a spectral matching approach applied to multispectral sensor data simulated from in situ R_rs(λ) and the above-mentioned clusters identified in the in situ datasets. Consistency was expressed as agreement between the dominant cluster identified by spectral matching to the bands of the medium resolution imaging spectrometer (MERIS) and the k-means output where hyperspectral data were used. Values of 1 indicate perfect agreement while zero indicates no agreement between identified clusters. MERIS was chosen as the optimal sensor for this investigation due to its long catalogue of ocean color images (2002-2012) with a spatial resolution of 300 m, making it useful for coastal and inland water applications. However, similar results may be attained with alternative sensors such as the ocean and land colour instrument (OLCI) on Sentinel-3 and to some extent the moderate resolution imaging spectroradiometer (MODIS). Different strategies are available to accomplish cluster assignment of satellite-derived spectra, but we followed the approach described in Moore et al. (2014) and Mélin and Vantrepotte (2015) that has already been implemented in the ESA Ocean Color-CCI project.
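Resampling a hyperspectral spectrum to sensor wavebands can be sketched with a simple boxcar average over each band. This is a strong simplification, since an operational simulation would weight by the sensor's spectral response functions. The band centres and widths listed below are nominal MERIS-like values included for illustration and should be checked against the sensor documentation. All function and variable names are hypothetical.

```python
def band_average(wavelengths, reflectance, center, width):
    """Mean reflectance of all samples falling inside one boxcar band."""
    low, high = center - width / 2.0, center + width / 2.0
    inside = [r for w, r in zip(wavelengths, reflectance) if low <= w <= high]
    if not inside:
        raise ValueError("no hyperspectral samples inside band at %s nm" % center)
    return sum(inside) / len(inside)


def resample_to_bands(wavelengths, reflectance, bands):
    """Resample a hyperspectral spectrum to a list of (center, width) bands."""
    return [band_average(wavelengths, reflectance, center, width)
            for center, width in bands]


# Nominal (center, width) pairs in nm; illustrative subset, to be verified.
MERIS_LIKE_BANDS = [(412.5, 10.0), (442.5, 10.0), (490.0, 10.0),
                    (510.0, 10.0), (560.0, 10.0), (620.0, 10.0),
                    (665.0, 10.0), (681.25, 7.5), (708.75, 10.0)]

# Hypothetical 1 nm spectrum that increases linearly with wavelength.
wavelengths = list(range(400, 751))
reflectance = [0.001 * w for w in wavelengths]
simulated = resample_to_bands(wavelengths, reflectance, MERIS_LIKE_BANDS)
print(len(simulated))
```

The band-averaged spectrum can then be matched against cluster means that have been resampled in the same way, so that the in situ clusters and the simulated sensor data are compared on a common set of wavebands.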
Spectra were standardized by dividing by the spectrum integral, which in every case led to substantial improvement in the value of cluster memberships. For the 13 clusters identified in Dataset-I, the cluster membership agreement was 0.85. However, in some cases, the differences in cluster membership between the top and second ranking cluster were negligible. When we considered shared top ranking for differences less than 0.001 in the membership between top and second ranking clusters, a perfect agreement was achieved. When considering spectra from coastal environments, membership agreement was lower (0.65). Agreement was improved with the removal of spectral bands between 700 nm and 800 nm prior to operation of the spectral matching routine. While encouraging, further refinement of the method is necessary to justify the classification scheme when all data are considered (21 clusters). Given that the application of this scheme on satellite imagery is sensitive to the performance of atmospheric correction methods, the selection of spectral bands must be made with caution. We anticipate that residual errors from incomplete atmospheric correction unrepresented in the OWT spectra and partition inefficiencies can result in spectra with zero or very low membership values. These spectra could be used to provide a better understanding of the representativeness of OWT and the limitations of atmospheric correction models and clustering methods.
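The agreement score, including the shared top ranking for near-tied memberships, can be expressed in a few lines. The sketch assumes that each spectrum has a membership value per cluster from spectral matching and a reference cluster from the hyperspectral k-means output. The function name, cluster labels, and membership values are hypothetical.

```python
def membership_agreement(memberships, reference, tolerance=0.0):
    """Fraction of spectra whose reference cluster is the top-ranked cluster,
    or the second-ranked one when it trails the top by less than tolerance."""
    hits = 0
    for member, ref in zip(memberships, reference):
        ranked = sorted(member.items(), key=lambda item: item[1], reverse=True)
        accepted = {ranked[0][0]}
        # Treat top and second cluster as sharing top rank if nearly tied.
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < tolerance:
            accepted.add(ranked[1][0])
        if ref in accepted:
            hits += 1
    return hits / len(reference)


# Hypothetical memberships for three spectra and two clusters.
memberships = [{"I1": 0.90, "I2": 0.10},
               {"I1": 0.5004, "I2": 0.4996},
               {"I1": 0.20, "I2": 0.80}]
reference = ["I1", "I2", "I2"]

print(membership_agreement(memberships, reference))
print(membership_agreement(memberships, reference, tolerance=0.001))
```

In this toy case the second spectrum counts as a mismatch under strict ranking but as a match once differences below 0.001 are treated as ties, which mirrors how negligible membership differences can account for apparent disagreement.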

Concluding remarks
With increased interest in monitoring aquatic systems across wide temporal and spatial scales using remote sensing data, reliable OWT classification approaches are essential to deal with the optically diverse nature of aquatic systems, and to optimize the selection of atmospheric correction and water constituent algorithms. Through the use of a comprehensive dataset and the development of an elegant but robust approach for the classification of the in situ hyperspectral measurements, we expect to better understand the variability of OWTs across inland and coastal waters and provide a framework to support global change research in coming years. Our methods and results can be used to identify OWT-specific technological and modeling requirements for remote sensing applications and highlight gaps in knowledge and data needs. In this regard, we note the rarity of particulate scattering and backscattering data and of standard protocols for radiometric measurements and data processing. Application of this approach to satellite imagery will require careful consideration of these confounding factors as well as the influence of uncertainties associated with atmospheric correction on the reflectance signal. Public access to cluster spectral means and covariance matrices is provided through the web page http://www.globolakes.ac.uk/.