||Classification Methods for 16S rRNA Based Functional Annotation
||Kulakowski, Rafal, Lausen, Adi, Low-Decarie, Etienne and Lausen, Berthold
||Microbial communities play an essential role in Earth’s ecosystems. The goal of this study was to investigate whether the functional potential of the microorganisms forming these diverse communities can be identified directly from the 16S rRNA marker gene using supervised learning methods. The recently developed FAPROTAX database was used along with the SILVA database to produce a training set in which 16S rRNA sequences are linked to a number of metabolic functions. Since most classification algorithms cannot use gene sequences directly as feature vectors, the present research investigated possible feature engineering approaches for 16S rRNA. Techniques based on Multiple Sequence Alignment (MSA) and N-grams are proposed and tested. The results showed that the feature representation based on N-grams outperformed the one based on MSA, especially for large and diverse functional groups. This suggests that a clustering-like alignment procedure results in a biased feature representation of the marker gene. Since classifiers trained using Random Forest and Support Vector Machine techniques were able to accurately detect a range of functional groups, it is concluded that the 16S rRNA gene provides substantial information for the direct identification of functional capabilities.
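The N-gram representation mentioned in the abstract can be illustrated with a minimal sketch. It assumes overlapping N-grams over the four-letter nucleotide alphabet, raw counts as feature values, and that N-grams containing ambiguous bases are simply dropped; the abstract does not state the value of N, the normalisation, or the handling of ambiguous bases used in the study, so these choices and the function name `ngram_features` are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous bases contribute features


def ngram_features(sequence, n=3):
    """Map a nucleotide sequence to a fixed-length vector of N-gram counts.

    The vector has len(ALPHABET) ** n entries, one per possible N-gram in
    lexicographic order, so sequences of different lengths become comparable.
    """
    # Enumerate every possible N-gram once to fix the feature order.
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=n)]
    # Treat RNA (U) and DNA (T) spellings of the same gene identically.
    seq = sequence.upper().replace("U", "T")
    # Count overlapping windows of length n.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # N-grams outside the vocabulary (e.g. containing N) are ignored here.
    return [counts[gram] for gram in vocabulary]


# Toy fragment, not a real 16S rRNA sequence.
toy = "ACGUACGA"
vec = ngram_features(toy, n=2)
```

A vector of this kind needs no alignment step, and it could be passed as one row of a feature matrix to a Random Forest or Support Vector Machine classifier.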