Unsupervised biodiversity estimation using proteomic fingerprints from MALDI‐TOF MS data

Species identification using matrix assisted laser desorption/ionization time‐of‐flight mass spectrometry (MALDI‐TOF MS) data strongly relies on reference libraries to differentiate species. Because comprehensive reference libraries, especially for metazoans, are rare, we explored the accuracy of unsupervised diversity estimations of communities using MALDI‐TOF MS data in the absence of reference libraries to provide a method for future application in ecological research. To discover the best analysis strategy providing high congruence with true community structures, we carried out a simulation with more than 30,000 analyses using different combinations of data transformations, dimensionality reductions, and cluster algorithms. Species profile, Hellinger, and presence/absence transformations were applied to raw data and dimensions were reduced using principal component analysis (PCA), t‐distributed stochastic neighbor embedding, and uniform manifold approximation and projection. To estimate biodiversity, data were clustered making use of partitioning around medoids, model‐based clustering, and K‐means clustering. The analyses were carried out on published mass spectrometry data of harpacticoid copepods. Most successful combinations (Hellinger transformation + PCA or raw data + partitioning around medoids) returned good values even for difficult species distributions containing numerous singleton species. Nevertheless, errors occurred most frequently because of such singleton taxa. Hence, replicative sampling in wide sampling areas for analysis is emphasized to increase the minimum number of specimens per species, thus reducing putative sources of errors. Our results demonstrate that MALDI‐TOF MS data can be used to accurately estimate the biodiversity of unknown communities using unsupervised learning methods. The provided approach allows the biodiversity comparison of sampled regions for which no reference libraries are available. Hence, especially data on groups which demand a time‐consuming identification or are highly abundant can be analyzed within short working time, accelerating ecological studies.

Matrix assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is a tool widely used in microbiology to identify bacteria, viruses, or fungi (Majchrzykiewicz-Koehorst et al. 2015;Singhal et al. 2015;Normand et al. 2017). It can be used as a rapid, cost-effective alternative to specimen by specimen DNA barcoding (Hebert et al. 2003). That is why recently, mainly in pilot studies, this technique was successfully applied for species identification of metazoans (Dvorak et al. 2014;Yssouf et al. 2014;Mazzeo and Siciliano 2016). To identify specimens based on a proteomic fingerprint, often company supplied supervised identification software solutions such as the MALDI Biotyper by Bruker are used. These find the most similar spectra from a reference library and return a value of certainty for the resulting identification. Similar to this, an open source R-based random forest (Breimann 2001) approach was introduced using machine learning to classify species (Rossel and Martínez Arbizu 2018a) according to the available reference library. Applying a post hoc test evaluates if the identification can be considered correct or false positive. Some studies employed techniques such as hierarchical clustering (Kaiser et al. 2018) or principal component analysis (PCA; Hynek et al. 2018) to discriminate species. However, all these techniques rely on reference libraries to assess species diversity and some fail to detect false-positive classifications. Hence, identifying new species in biodiversity assessments is difficult.
However, assessing biodiversity is crucial in ecological studies to understand the interaction of a community to the surrounding ecosystem but also to infer effects of a changing environment on species' diversity. Quantifying biodiversity is necessary to allow comparisons of different ecosystems and environments. Nevertheless, assessing biodiversity using proteomic fingerprinting in areas for which no MALDI-TOF reference libraries are available is difficult. Supervised tools such as the Bruker MALDI-TOF Biotyper or the random forest approach cannot provide identifications without a library. Hierarchical clustering and PCA on the other hand will fail to provide species margins and thus depend on researchers to recognize these by themselves subjectively. In molecular analyses, tools such as the Automatic Barcode Gap Discovery (ABGD; Puillandre et al. 2012a) or the Generalized Mixed Yule Coalescent approach (Pons et al. 2006) are frequently used to automatically delimit species. However, to date there is no comparable tool for MALDI-TOF MS data to delimit species.
By having tested various combinations of data transformations, dimensionality reduction methods, and different clustering algorithms, we provide a workflow to unsupervised biodiversity estimation based on MALDI-TOF MS data without the need for reference libraries. The workflow can easily be applied by using the R function provided in the Supporting Information S1.

Data set
MALDI-TOF MS data from published mass spectra Martínez Arbizu 2018b,c, 2019a,b;Supporting Information Table S1) on harpacticoid copepods were pooled and screened for species with at least 10 specimens. Only specimens reported to be obtained from appropriately stored samples were included and combined into the data set used for this study.
By repetitive sampling from this data set, 660 different data sets were constructed containing between 2 and 34 different species. In total, 20 data sets per n species were generated (33 × 20). For each species, only 75% of the 10 specimens available per species were included in the simulated data set ( Fig. 1) by random sampling.
The data set is a matrix containing specimens as rows and m/z values (molecule mass/charge) as columns. Values in these columns indicate peak intensities after standard workflow consisting of data trimming, smoothing, normalization, noise reduction, peak detection, and peak binning (Bode et al. 2017;Kaiser et al. 2018;Holst et al. 2019). In the following, these data will be referred to as raw data.

Data analyses
To find the best method for unsupervised biodiversity estimation, different combinations of data transformation, dimensionality reduction, and cluster algorithms were tested (Fig. 1). Data were either analyzed with presence/absence (p/a), species profile (total), or Hellinger transformation (Legendre and Gallagher 2001) using "decostand" from the Rpackage vegan (Oksanen et al. 2013) or without data transformation (raw). Transformed data were tested without dimensionality reduction (none) or after applying PCA (Pearson 1901), uniform manifold approximation and projection (UMAP; McInnes and Healy 2018) from the R-package "umap" Fig. 1. Workflow for the test of unsupervised diversity estimation. From the initial 34 species, specimens were sampled with a frequency of 0.75 into new data sets containing between two and 34 species. For each of these species numbers, 20 data sets were simulated. Different data transformations, dimensionality reductions, and cluster algorithms were applied to these data sets. Finally, the resulting estimated biodiversity was compared to the correct data. (Konopka 2018) or t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton 2008) from the R-package "Rtsne" (Krijthe 2015). The resulting data sets were analyzed with model-based clustering (MC; Fraley and Raftery 2002) using the command "Mclust" from the R-package "mclust" (Scrucca et al. 2016), K-means clustering (Lloyd 1982;Fraley and Raftery 2002) using the command "cascadeKM" from the R-package "vegan" or with partitioning around medoids (PAM; Kaufman and Rousseeuw 1990) from the Rpackage "cluster" (Maechler et al. 2018). While for the majority of methods the most commonly applied R-packages were used, the "Rtsne" package was chosen because it uses a faster t-SNE implementation compared to other t-SNE packages.
When analyzing p/a data in mass spectrometry, weight is put on the presence of a mass peak and not on the intensity thus, giving importance also to less intense peaks. This may particularly be relevant in studies searching for biomarkers to differentiate between closely related species (Carrera et al. 2013). However, unlike raw data, p/a transformation disregards information provided through signal intensities. That is why, additional to raw data, species profile (total) and Hellinger data transformations (Legendre and Gallagher 2001), which were already shown to notably reduce identification errors in machine learning applications (Rossel and Martínez Arbizu 2018a), were tested. These data transformations are originally recommended for use in ecological studies because of giving low weights to rare species (Legendre and Gallagher 2001), reducing the effect of many zeros in a data set. In ecological research, many zeroes may occur especially in studies analyzing gradient data comprising sites with different species compositions (Legendre and Gallagher 2001). Because of mass spectra differences between species, MALDI-TOF MS data sets also do often contain numerous zeros. Therefore, these transformations were found suitable for this kind of data. Like with raw data, one advantage of these relative transformations over p/a transformation is the additional information obtained through species-specific peak intensities considered during analyses.
When using species profiles transformation, relative abundances are obtained row wise by division through margin totals. Hellinger transformation is based on these relative abundances and is calculated by applying a further square root transformation to total transformed data (see command decostand from R-package "vegan").
Our workflow for unsupervised biodiversity estimation demands reanalyzing the data set as many times as there are specimens in the data set. Computational effort can be very high because data sets may contain several hundred m/z values Martínez Arbizu 2018c, 2019b). Therefore, dimensionality reduction is an important step to reduce computational resources needed. However, chosen dimensionality reduction still needs to preserve the actual data structure as much as possible. Here, we chose PCA because the majority of information is kept while reducing the length of the data set to the number of specimens included and hence the demand for computational power. If the data contain fewer specimens (rows) than mass peaks (columns), the number of dimensions is automatically reduced by this PCA implementation to the number of specimens contained in the data set. t-SNE and UMAP are especially designed to deal with complex, high dimensional data, preserving local or global structures in the data (McInnes and Healy 2018), which is the key premise to find clusters. Because t-SNE is designed to reduce dimensionality for visualization purposes, we chose dimensionality reduction to three dimensions. UMAP on the other hand generally reduces the data to two dimensions. That is why, in contrast to the application of PCA in our workflow, t-SNE and UMAP reduce the number of dimensions more vastly saving additional computational power during data clustering.
To estimate the number of species in the data set, three different cluster algorithms were applied. K-means was used as one of the most widely used cluster algorithms. Because in contrast to K-means clustering K-medoids is less sensitive to noise and outliers (Park and Jun 2009), PAM clustering was chosen as the most common K-medoids implementation. Both these methods depend on distances of data points to a suspected cluster mean or medoid. MC on the other hand uses Bayesian information criterion to find the best underlying model parameters for a Gaussian Mixture Model to assign clusters (Scrucca et al. 2016).
In total, 48 combinations of data transformations, dimensionality reductions and clustering methods were applied (data transformation × dimensionality reduction × clustering: 4 × 4 × 3) resulting in 31,680 analyses of the generated data sets. The number of clusters to test was chosen between 2 and the number of specimens −1.
While "MClust" and "cascadeKM" provided functions to assign the optimal number of clusters, for PAM clustering the average silhouette width after analyses for each number of proposed clusters was saved. The result showing the highest average silhouette width was chosen as the best number of clusters (Rousseeuw 1987).
From all clustering approaches, the best number of estimated clusters was used to calculate the Shannon index and species evenness. To find the best performing combinations, the mean percentage deviation from the correct diversity was calculated for the entire range of species but also for all species numbers individually. parison to the correct diversity (black). The average percentage difference to the correct diversity is displayed for all combinations of n species between 2 and 12, for n species larger than 12 and for the entire data set.

Species dominance simulation
To test the best six approaches for "real world scenarios", data sets containing 100 specimens were generated according to three typical species distributions. From the original data set, simulated mass spectra (n = 3,400) using "smooth.data" from the R-package "RFtools" (Martínez Arbizu and Rossel 2018; Rossel and Martínez Arbizu 2018a) were generated to increase the number of spectra, from which a population can be simulated.
To create a data set with data distributed according to the Unified Neutral Theory of Biodiversity and Biogeography (UNTB), the 3400 mass spectra (100 mass spectra per species) were used as a metacommunity which was mutated using the command "untb" from the R-package "untb" (Hankin 2007). The mutation was carried out with a probability of new organisms not being a descendant of an existing individual of zero, five organisms that die in each time step and 10,000 simulated generations. The results were adjusted to a community of 100 specimens (Fig. 2a). The Negative binomial (Nb) distribution (Fig. 2b) was generated in R using "rbinom" with a probability of success of 0.5. The command "rrbs" from the R-package "sads" (Prado et al. 2018) was used to generate a Broken stick (BS) distribution (Fig. 2c). According to these distributions, communities were sampled out of the 3,400 mass spectra using the command "stratsample" from the R-package "survey" (Lumley 2004).  With n species > 12 the variance of percentage difference and the difference itself decreases. While Hellinger transformed data with UMAP dimensionality reduction and PAM clustering performs worst with n species < 12, it performs best with n species > 12.
To test the influence of the minimum number of specimens per species (n min ) on the resulting diversity estimates, n min was varied between 1 and 10. For each n min , 20 data sets were generated. For n min = 1, the distributions were generated as described above. To keep the distribution pattern for tests of n min from 2 to 10, in each step one specimen per included species was added (Fig. 2, colored distributions). Shannon index was assessed for all data sets and the percentage deviation from the correct values was calculated.

Diversity estimated individually vs. pooled estimation
Finally, a scenario was simulated in which 20 stations with varying combinations of 29 different species were generated. All stations contained 100 specimens distributed according to BS distribution of which 10 were singleton species (number of specimens per species equals one). Species clusters were estimated either individually (100 specimens per analysis) or all stations pooled together (2,000 specimens per analysis). Communities resulting from both analyses were Hellinger transformed and plotted together with control communities in an MDS plot to check for congruence. Furthermore, a Mantel test for both communities was carried out to compare it to control communities using 9,999 permutations.

Results
In total, 31,680 analyses were carried out and estimated species diversities for all 48 combinations of data transformation, dimensionality reduction and clustering methods were evaluated (Table 1; Fig. 3). The nine best estimates with a percentage difference from the correct diversity of less than 10% used PAM clustering. These were followed by seven further estimates including PAM as clustering algorithm (difference < 50%). Percentage difference of all diversity estimates including K-means or MC clustering ranged from 56.90% to 88.55%.
Notably, most combinations tend to overestimate diversity ( Fig. 3b-f) while only the combination of Hellinger transformation with UMAP and PAM underestimates diversity within the range of 2 to 12 analyzed species (Fig. 3a). The best estimations with only 5.55% difference from the correct values were made with Hellinger transformed data analyzed without dimensionality reduction using PAM clustering or with PCA for dimensionality reduction. These two  figure 6. To the right of each tree, the results of individual cluster estimations are plotted. The colored bars display the errors that occurred during cluster assignment. Gray bars indicate clusters that were estimated correctly. Ranging from only a few singletons that were assigned to the wrong clusters (a) to all singletons that were classified erroneously (b). Finally (c) depicts a case in which during individual diversity estimation a single species was split into five different clusters. In the lower left hand corner, the Shannon index and species evenness are displayed. Arrows highlight clusters for which delimitation based on expert opinion can be considered difficult.
combinations also showed the lowest overall difference from the correct diversity. However, in the range from 13 to 34 analyzed species in the data set, a combination of Hellinger transformation, UMAP, and PAM performed best with only 1.63% difference, followed by Hellinger transformation, t-SNE, and PAM with 2.35% difference. With increasing number of species included in the estimation, the average deviation to the correct diversity decreased (Fig. 4). The variance in average differences reduced when the number of included species exceeded 12 (Fig. 4). However, the best six estimates (Table 1; Fig. 3) generally showed quite similar results. Hence, these were used for further tests including dominance of species as it would be expected in actual samples.

Species dominance simulation
As species in real samples are typically not evenly distributed but follow different distribution models, communities according to three different species distributions were generated. The six best performing combinations were used to estimate diversity of communities containing a certain number of minimum specimens per species (n min ) ranging from 1 to 10 (Fig. 2). For each n min , 20 data sets were analyzed resulting in 3,600 analyses. For each analysis, the percentage difference of the estimation to the correct diversity was evaluated and plotted separately in Fig. 5. All combinations failed to consistently provide estimations that are congruent with the correct diversities when singletons were included in the analyzed communities (Fig. 5). However, even when n min = 1, it was possible to obtain some correct estimations from Hellinger transformed data using either PCA or no dimensionality reduction with PAM clustering (Fig. 5e,f). The differences of estimations to the correct diversity were generally higher for BS distributed data sets and lower in the UNTB data sets. Most approaches performed almost flawless when n min > 3 (Fig. 5b, d-f). Percentage differences were higher for analyses using PAM clustering with Hellinger transformed data after UMAP was applied (Fig. 5a) and PAM clustering of Species profile transformed data after t-SNE dimensionality reduction (Fig. 5c).

Individual estimation vs. pooled estimation
For the final test of applicability in a "real world scenario", 20 stations with BS distributed species (Fig. 6a) were simulated and either analyzed all together or individually (Fig. 6b,c, upper right hand corner). By creating 20 stations with different distributions of the same 29 species, we wanted to simulate sampling within a wider study area, where chances of reoccurring species in different samples is high. Results showed good concordance of species communities as displayed in Fig. 6d. This also resembled by high r-values of the Mantel tests. Individually estimated communities pooled together still showed an r-value of 0.72 (Fig. 6b). The pooled estimated stations showed an r-value of 1 compared to the correct communities (Fig. 6c). Here, biodiversity and species evenness were estimated 100% correctly. The estimated Shannon index for pool estimation and control communities was 3.50. Of the 20 individually estimated communities, 5 were absolutely congruent to the correct data (Fig. 6b, yellow stars). Deviating species compositions for some communities were often caused by singleton specimens included in the analysis. However, not all singletons were automatically assigned to other clusters. Sta. 1 for instance contained 10 singleton species of which only 3 were assigned to other clusters. Nevertheless, the majority of false cluster assignments were caused by singleton species and sometimes all singletons were assigned to other species (Fig. 7b). In one case, a single species was split into numerous clusters. At sta. 14, the species Mesochra pygmaea (Claus 1863) (n = 13) was split into five different clusters (Fig. 7c). This resulted in an all overestimated Shannon index slightly higher than the correct distribution (H = 3.52).

Accuracy across the different setups
The initial tests clearly showed that several combinations of data transformations and dimensionality reduction together with PAM clustering provided diversity measures highly congruent to the correct values. However, the clustering success differed strongly when only few species were included (n species < 12). That is why, even though Hellinger transformation with UMAP dimensionality reduction and PAM clustering only showed lower percentage difference from the correct diversity when n species > 12 (1.6%), we recommend using either Hellinger transformed data with PAM clustering without applying a dimensionality reduction or using PCA. These approaches showed the lowest deviation from the correct value on average for the entire examined range of species numbers (4.08%). This is also supported by the results of the species dominance simulation analyzing specimen numbers ranging from 1 to 10 for each species. Here, the aforementioned methods showed only low deviation from the correct diversity even though singletons were included in the analyses. The number of singletons also affected the differences in estimations for the different species distribution patterns. Deviations of BS distributed data were always found to be a bit higher compared to other distributions as these included more singleton species than the others. However in a test with 20 simulated stations, species distributions of 5 stations were perfectly in congruence with the correct data when estimations were carried out for each station individually. Moreover, for some stations, distribution patterns deviating only slightly from the correct distributions were estimated. This is also resembled by the high Mantel test r-value when comparing the allover result of the single estimated communities to control communities.
Finally, the test on 20 simulated stations showed that including several stations from a certain geographic area and repetitive sampling at certain stations, which show overlapping species occurrences, should be favored over analyzing single stations individually. Repetitive sampling and sampling of different stations in a geographic area will likely increase number of specimens for each species and hence result in a more robust result for this unsupervised clustering approach. This follows the general recommendations to increase sample size to increase power in statistical tests (Fairweather 1991). However, when including additional stations, care must be taken to include stations which will likely include similar species as doing otherwise may again lead to inclusion of further singleton species. Regarding Harpacticoida for instance, deep-sea studies frequently deal with lowfrequency occurrences of species and numerous singletons (Schmidt et al. 2019). In such cases, analyzing stations individually would inevitably lead to high errors that have to be taken into account when working with the results. That is why analyzing samples from repeated sampling at a station conjointly increases chances of obtaining good biodiversity measures. In contrast to deep-sea studies, shallow-water studies often deal with lower species diversities and higher specimen numbers per species (Packmor and George 2018). That is why applying the described workflow should result in receiving good biodiversity measures.

Data transformation, dimensionality reduction, and clustering methods
Throughout the tests, it was shown that an additional relative data transformation vastly improve the results. In contrast to using p/a data, data from an additional relative data transformation also take peak intensities into account. Due to Hellinger transformation, peak intensities are more similar between specimens from the same species with smaller intraspecific distances during clustering, resulting in more distinct, species-specific clusters compared to total data transformation. The additional relative data transformations give less weight to less intense mass peaks (Legendre and Gallagher 2001), thus reducing the influence of frequent zeros making it feasible for MALDI-TOF MS data sets. Moreover, applying an additional transformation including a square root transformation probably diminishes the influence of outliers compared to total data transformation. However, relative intensities of peaks as an important factor for grouping mass spectra into clusters are neglected when p/a transformation is carried out. Increased interspecific difference as a result of varying abundances of the same peptide or protein between species does not influence groupings because peaks are considered as present without providing further information about species specificity of molecule abundances.
When PCA is used for dimensionality reduction of data sets containing fewer rows than columns, the applied PCA implementation reduces the number of dimension to the number of specimens in the data set. Hence, the number of analyzed parameters equals the number of specimens in the data set. Thus, this dimensionality reduction results in the loss of only a negligible part of the information. That is why applying PCA and working without dimensionality reduction resulted in the same clusters after Hellinger transformation. We recommend using PCA only when the former number of traits is larger than the number of specimens included. Otherwise the calculation time is artificially delayed. If very large data sets in terms of included specimens have to be analyzed, using Hellinger transformation with t-SNE dimensionality reduction and PAM could be an alternative to using raw data. Because the number of dimensions is reduced to three, the computing time for clustering is also highly reduced. Deduced from the results we obtained, the estimated biodiversity should only deviate slightly more from the correct values than for the aforementioned approaches as this approach showed comparatively good results also when singletons were included.
In contrast to PAM clustering, both K-means clustering and MC largely failed to produce biodiversity measures closely resembling correct values. While PAM clustering uses a hypothetical point centroid within the data as starting point to find clusters, K-means uses an actual data point to begin with. This makes K-means clustering, for instance, more sensitive to outliers (Park and Jun 2009), resulting in differing number of cluster. The problem with MC on the other hand may be that this method assumes an underlying distribution of the data which may be different from the models included in this methods implementation.

Potential pitfalls in biodiversity assessments
When applying the workflow presented here, attention has to be paid to data origin and quality. For instance, joining data from different studies may cause problems as various factors such as storage and fixation were shown to influence mass spectra quality (Yssouf et al. 2014;Rossel and Martínez Arbizu 2018c). This may affect the species-specific fingerprint in such a way that single species could be recognized as different species due to mass spectra alteration. Also, prior to application in cross laboratory studies, it needs to be tested if in inter laboratory analyses the species-specific fingerprint is retained or if mass spectra obtained from different instruments display different complexities causing the same abovementioned outcome. Nevertheless, for bacteria, cross laboratory tests have already been carried out successfully (Mellmann et al. 2009). Hence, this may not be a problem during unsupervised biodiversity estimation. Moreover, accurate binning of homologous mass peaks is crucial to reduce artificial variability within a species that might cause splitting single species into several very similar species. Peak binning is carried out to merge homologous data into discrete bins. To date, different methods for peak binning such as using predefined bins of a certain width (Laakmann et al. 2013;Kehrmann et al. 2016), clustering of peaks to group into certain m/z values (Chen et al. 2007) or using other cross spectra algorithms (Gibb and Strimmer 2012). However, sometimes homologous peaks are binned into different m/z values resulting in artificially divergent peaks within a single species. Hence, attention must be paid to peak binning and alignment to obtain reliable results.
The same is true for different developmental stages. For calanoid copepods it was shown by different authors that, within species clusters from hierarchical clustering, different developmental stages can be found in separated, stagespecific clusters (Laakmann et al. 2013;Bode et al. 2017;Kaiser et al. 2018). Thus, it should be paid attention to developmental stages of analyzed specimens when preparing such microscopic species.

Comparison to DNA-based delimitation methods
With this study, we try to provide a method for unsupervised biodiversity estimation. This includes unsupervised species delimitation based on mass spectrometry data as congruent to real species (according to COI and morphological investigations) as possible. To date, analyzing mass spectrometry data for species identification is mainly carried out in supervised approaches (Yssouf et al. 2013). Reference libraries are searched and a species assignment is returned with a certain probability of identification correctness. However, if species are not part of such a reference library, supervised methods cannot delimit these. To differentiate between species, clustering can be applied. However, hierarchical clustering itself does not provide information on cluster margins and hence, researchers themselves have to evaluate species boundaries from cluster analyses. Arrows in Fig. 7 emphasize clusters, in which it is at least difficult to delimit species based on, for instance, branch lengths.
In DNA barcoding (Hebert et al. 2003), several methods are already available to delimit species. In contrast to our approach that tries to find cluster boundaries, ABGD searches for a barcoding gap in pairwise distances to successfully delimit species. GMYC on the other hand delimits species based on a change in evolutionary speed among branches of an ultrametric tree (Puillandre et al. 2012a). This however demands a phylogenetic tree which cannot be computed based on mass spectrometry data, as for metazoans MALDI-TOF MS data were shown only once to reflect evolutionary relationships (Feltens et al. 2010). Both techniques provide good results on species boundaries and are used frequently to assess species diversity in numerous studies (Puillandre et al. 2012b;Lin et al. 2018). This supports the demand for such delimitation methods also in mass spectrometry applications for biodiversity studies. Especially in research fields and taxonomic groups with high number of undescribed species and difficult morphological identifications, such a method can help to accelerate biodiversity assessments. Instead of demanding morphological identification of all analyzed specimens, only a few specimens for every cluster have to be examined to be able to assign clusters to morphological working species.

Conclusion
Our unsupervised biodiversity estimation workflow allows comparison of sampling sites in ecological studies without prior knowledge of the species occurring in the working area and without the need for complete reference libraries. As was shown, the average difference to the correct biodiversity measures is little and often estimates are perfectly congruent to the actual distribution of species. However, caution must be taken on data quality as this will likely affect results.

Data availability statement
No new data were collected for this study. The sources of the used data are given in Supplementary Table S1.