Only the RDP training set resulted in the classification of honey bee microbiota short reads as Orbus, and these sequences were used as queries in a BLAST search against all three training sets (RDP, SILVA, and GG). On average, these Orbus-classified sequences were 93% identical to top hits in the RDP training set. They also lacked close homologs in the SILVA training set, where the top-scoring hits were 86% identical on average.
In contrast, in the GG training set, top hits that were 98.6% identical were found, and these sequences were classified as γ-proteobacteria, without further taxonomic depth. This result suggests that training set breadth is playing a role in the incongruity observed here. In support of this hypothesis, a large number of short reads were unclassifiable using each training set (1,167 unclassified by SILVA, 1,468 by GG, 2,818 by RDP), and the RDP training set resulted in the least confident classification of the three, with a majority (62%) of the sequences unclassifiable at the 60% threshold. Bootstrap scores resulting from RDP-NBC classifications can be an indicator of sequence novelty; sequences with low scores at particular taxonomic levels may
represent new groups with regard to the training set used. The average bootstrap scores for each classification at the family level were calculated for each of the three training sets (Figure 2A). Certain sequences were classified with relatively low average bootstrap values, suggesting that these sequences do not have close representatives in the training sets. For example, a low average bootstrap score was observed for the classification of sequences as Succinivibrionaceae by SILVA or as Aerococcaceae by the RDP.

The use of custom sequences improves the stability of classification of honey bee gut pyrosequences, regardless of training set

In order to improve the classification of honey bee gut-derived 16S rRNA gene sequences, a custom database was used to classify
unique sequences. The taxonomic classifications in this custom database were generated either by close identity (95%) to a cultured isolate or by the inclusion of cultured isolates in the phylogeny. This phylogeny mirrors those published by others for these bee-associated sequences [18, 19, 30]; honey bee-specific clades were recovered with bootstrap support >90% (Figure 1). The addition of honey bee-specific sequences to each training set not only altered spurious taxonomic assignments for certain classes (notably, the δ-proteobacteria are not present in results from these datasets; Figure 2B) but also significantly improved the congruence between the classifications provided by each training set (nearly 100% of sequence classification assignments concurred at the family level; Figure 2B).
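
The family-level congruence described above can be illustrated with a minimal sketch. It is not the analysis pipeline used in this study. It assumes a hypothetical input in which each sequence ID maps to the family assigned by each training set, with `None` marking a sequence left unclassified at the chosen threshold. The function name `family_congruence` and the toy assignments are illustrative only, and treating unclassified sequences as non-congruent is an assumption of the sketch.

```python
from typing import Dict, Optional

# Hypothetical input shape: sequence ID -> {training set name -> family or None}.
Assignments = Dict[str, Dict[str, Optional[str]]]


def family_congruence(assignments: Assignments) -> float:
    """Fraction of sequences given the same family by every training set.

    A sequence counts as congruent only if all training sets assign it
    one identical, non-missing family (an assumption of this sketch).
    """
    if not assignments:
        return 0.0
    congruent = 0
    for per_training_set in assignments.values():
        families = set(per_training_set.values())
        if len(families) == 1 and None not in families:
            congruent += 1
    return congruent / len(assignments)


# Toy data for illustration only; these are not results from the study.
example: Assignments = {
    "seq1": {"RDP": "Aerococcaceae", "SILVA": "Aerococcaceae", "GG": "Aerococcaceae"},
    "seq2": {"RDP": "Aerococcaceae", "SILVA": "Succinivibrionaceae", "GG": None},
    "seq3": {"RDP": "Succinivibrionaceae", "SILVA": "Succinivibrionaceae", "GG": "Succinivibrionaceae"},
    "seq4": {"RDP": None, "SILVA": None, "GG": None},
}

print(f"Family-level congruence: {family_congruence(example):.2f}")
```

Computing this fraction before and after adding the custom sequences to each training set would give a direct comparison of the kind summarized in Figure 2B.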