Supper & Supper GmbH - The Data Engineers
Clustering of DNA Sequences from Microbiome Sequences

Project scope

The human microbiome comprises all microorganisms living in and on the human body. It plays an important role in the induction and function of the host immune system. Observing changes in the composition of the gut microbiota can provide insight into host-microbiome interactions and may suggest new options for therapeutic intervention.
 
The goal of the project was to develop a pipeline that clusters the microbiome sequences based on the order of their nucleotides and then matches the resulting clusters against known sequences using the BLAST algorithm. The results of the project can be used to support clinical treatment.

Provided Data

The samples for microbiome analysis were collected from the colon and consisted of stool and blood. The raw sequencing data were provided as FASTQ (.FASTQ) files.
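As an illustration of what working with such files involves, the following is a minimal sketch of a FASTQ reader, not the project's actual code. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; the names FastqRecord, read_fastq and mean_quality are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Keep only the identifier, dropping any description after the first space.
        yield FastqRecord(header[1:].split(" ", 1)[0], sequence, quality)


def mean_quality(quality: str) -> float:
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not quality:
        raise ValueError("empty quality string")
    return sum(ord(char) - 33 for char in quality) / len(quality)


# Usage with an in-memory example; a real file would be opened with open(path).
example = io.StringIO("@read1 sample=A\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n")
records = list(read_fastq(example))
```

Reads with a low mean quality could be filtered out at this stage, before any clustering; whether and how the project filtered reads is not stated in the source.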

Applied Methods

We applied Self-Organizing Maps (SOM) and the Basic Local Alignment Search Tool (BLAST) in this project. A SOM is a neural-network-based unsupervised clustering method that can adjust for batch effects and control the false positive rate arising from technical error. BLAST is a tool for finding similarities between biological sequences. We used it to separate human genome sequences from bacterial genome sequences.
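To make the SOM step more concrete, here is a minimal NumPy sketch of how reads could be clustered with a small map. It is an illustration under stated assumptions rather than the pipeline used in the project: the source does not say how sequences were encoded, so the k-mer frequency encoding (kmer_profile) is an assumption, and the grid size, learning rate and neighbourhood schedule are arbitrary illustrative choices.

```python
import itertools

import numpy as np


def kmer_profile(sequence: str, k: int = 2) -> np.ndarray:
    """Encode a read as normalised k-mer frequencies (assumed encoding)."""
    index = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k].upper()
        if kmer in index:  # skip k-mers containing N or other ambiguity codes
            counts[index[kmer]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def train_som(data: np.ndarray, rows: int = 3, cols: int = 3, epochs: int = 50,
              lr0: float = 0.5, sigma0: float = 1.5, seed: int = 0) -> np.ndarray:
    """Train a rows x cols SOM online; returns one weight vector per unit."""
    rng = np.random.default_rng(seed)
    n_samples = data.shape[0]
    # Initialise unit weights from randomly chosen input vectors.
    weights = data[rng.integers(0, n_samples, size=rows * cols)].astype(float).copy()
    # Position of every unit on the 2-D grid, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    total_steps = epochs * n_samples
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(n_samples):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                   # linearly decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.3)   # shrinking neighbourhood radius
            x = data[i]
            # Best-matching unit: smallest Euclidean distance to the input.
            bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            grid_dist_sq = np.sum((grid - grid[bmu]) ** 2, axis=1)
            influence = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbours towards the input.
            weights += lr * influence[:, None] * (x - weights)
            step += 1
    return weights


def assign_clusters(data: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Assign each input vector to the index of its best-matching unit."""
    distances = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(distances, axis=1)


# Usage: encode a few toy reads, train a 2 x 2 map, and assign each read to a unit.
reads = ["ATATATAT", "ATATATAT", "GCGCGCGC", "GCGCGCGC"]
profiles = np.array([kmer_profile(read) for read in reads])
som_weights = train_som(profiles, rows=2, cols=2, epochs=20)
unit_of_read = assign_clusters(profiles, som_weights)
```

Each map unit acts as one cluster, and the unit's weight vector can be read as the typical k-mer pattern of the reads assigned to it.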

Project outcome

The SOMs produced clusters with high homogeneity. The neural networks made it possible to identify similarities in a huge number of DNA sequences based on Euclidean distance. In addition, heatmaps allow a user to identify which pattern dominates in each cluster. The average homogeneity within the clusters reached 90% after applying SOM.
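The source does not define how homogeneity was measured. One plausible reading, sketched below purely as an assumption, is cluster purity: for each cluster, the fraction of members that share the cluster's most common label (for example, the organism a read was matched to), averaged over all clusters. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def cluster_homogeneity(cluster_ids: Sequence[Hashable],
                        labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Per-cluster purity: share of members carrying the dominant label."""
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, labels):
        members[cluster_id].append(label)
    return {
        cluster_id: Counter(cluster_labels).most_common(1)[0][1] / len(cluster_labels)
        for cluster_id, cluster_labels in members.items()
    }


def average_homogeneity(cluster_ids: Sequence[Hashable],
                        labels: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-cluster purities (assumed definition)."""
    per_cluster = cluster_homogeneity(cluster_ids, labels)
    if not per_cluster:
        raise ValueError("no clusters to evaluate")
    return sum(per_cluster.values()) / len(per_cluster)


# Usage with toy data: cluster 0 has three "A" reads and one "B"; cluster 1 is pure.
toy_clusters = [0, 0, 0, 0, 1, 1]
toy_labels = ["A", "A", "A", "B", "C", "C"]
toy_average = average_homogeneity(toy_clusters, toy_labels)
```

An unweighted mean treats small and large clusters alike; weighting by cluster size would be an equally defensible choice and gives a different number when cluster sizes vary.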
 
The microorganisms identified from the BLAST results give us a deeper understanding of the microbiota composition of the patients. The results can be used further to investigate host-microbiota interactions, drug effects and environmental influences.
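As a rough sketch of how BLAST hits can be turned into a composition summary, the code below parses BLAST+ tabular output (the default 12 columns of -outfmt 6), keeps the best-scoring hit per read, and counts reads per matched subject. The E-value threshold and the subject identifiers in the example are hypothetical, and the source does not describe how the project post-processed its BLAST results.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO, Tuple

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle: TextIO, max_evalue: float = 1e-5) -> Dict[str, str]:
    """Map each query read to its highest-bitscore subject below an E-value cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) < len(BLAST_COLUMNS):
            raise ValueError(f"expected {len(BLAST_COLUMNS)} columns, got {len(row)}")
        hit = dict(zip(BLAST_COLUMNS, row))
        if float(hit["evalue"]) > max_evalue:
            continue  # hit is not significant enough (threshold is an assumption)
        bitscore = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def count_subjects(hits: Dict[str, str]) -> Counter:
    """Count how many reads were assigned to each subject sequence."""
    return Counter(hits.values())


# Usage with hypothetical hits; subject_A and subject_B are placeholder identifiers.
example_hits = io.StringIO(
    "read1\tsubject_A\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t185\n"
    "read1\tsubject_B\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t120\n"
    "read2\tsubject_B\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t170\n"
    "read3\tsubject_A\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30\n"
)
assignments = best_hits(example_hits)
composition = count_subjects(assignments)
```

Mapping subject identifiers to organism names, and from there to a human-versus-bacterial split, needs taxonomy information from the reference database and is left out of this sketch.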