Detection of Genome Sequence Outliers Across Pan-Genomes
Version: 1
Uploaded by: Administrator
Date Uploaded: 26 November 2022
Thousands of genome sequences from many microbial species have already been deciphered, providing extensive views of variation at the micro- and macro-evolutionary levels. A pan-genome denotes the set of all unique gene families found across multiple related genomes in a given taxon (for instance, related strains of a bacterial species) and thus represents the entire gene pool of that taxon. We demonstrate here that characterizing the pan-genome of a taxon using sequences generated by different genome projects can mislead subsequent genome comparison studies when incorrect strains are selected as input. Using genomic resources and tools, we report that seven bacterial species datasets, representing a total of 249 strains, contained "contaminating" data, and that 11 genomic sequences were identified as outliers. In the example of Streptococcus sanguinis used in this study, strain ATCC 49296, detected as an outlier among a dataset of 23 Streptococcus sanguinis genome sequences, showed a much closer relationship with Streptococcus oralis 35037T than with the other Streptococcus sanguinis strains, confirming its outlier status. The results are supported by pan-genome trees and gene sequence-based phylogeny. This approach provides rapid, efficient and scalable quality control for pan-genome analysis and can be applied to other taxa.
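The general idea of flagging a strain whose gene content sits far from the rest of its dataset can be illustrated with a minimal sketch. This is not the method used in the study, which relies on pan-genome trees and gene sequence-based phylogeny; it is a simplified stand-in that compares gene-family presence/absence sets with the Jaccard distance. The strain names, gene-family identifiers, the function names and the 0.5 cutoff are all hypothetical and chosen only for illustration.

```python
from statistics import median


def jaccard_distance(a, b):
    """Jaccard distance between two sets of gene-family identifiers."""
    union = a | b
    if not union:
        # Two empty gene sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def flag_outliers(gene_families, cutoff=0.5):
    """Return strains whose median distance to all other strains exceeds cutoff.

    gene_families maps a strain name to the set of gene families it carries.
    The cutoff is an assumed, illustrative value, not one taken from the study.
    """
    flagged = []
    for name, families in gene_families.items():
        distances = [
            jaccard_distance(families, other)
            for other_name, other in gene_families.items()
            if other_name != name
        ]
        if distances and median(distances) > cutoff:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical toy dataset: four strains share most gene families,
# while strain_X shares only one family with the others.
toy = {
    "strain_A": {"g1", "g2", "g3", "g4", "g5"},
    "strain_B": {"g1", "g2", "g3", "g4", "g6"},
    "strain_C": {"g1", "g2", "g3", "g5", "g6"},
    "strain_D": {"g1", "g2", "g4", "g5", "g6"},
    "strain_X": {"g1", "g7", "g8", "g9", "g10"},
}

outliers = flag_outliers(toy)
print(outliers)
```

Using the median rather than the mean keeps a single unusual strain from inflating the scores of the well-behaved strains it is compared against. On real data, a flagged strain would still need confirmation against a reference from a related species, as the abstract describes for ATCC 49296.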