Tackling annotation errors to find the pangenome with Panaroo

Bacterial Genetics and Genomics book Discussion Topic: Chapter 12, question 15

I saw a Tweet about a paper on pangenome analysis and decided to read it. It is on bioRxiv, the preprint server for biology (https://www.biorxiv.org/). The papers here aren't peer reviewed, unlike those you would find in typical journals, so when you read something here, you have to rely on your own critical analysis and assessment of what the authors present even more than usual. No reviewers or editors have looked at these papers. They haven't been revised in response to reviewers' comments or approved by an editor for general reading. They are put on the server as the authors wrote them, and this is done because the authors feel they have an important message that needs to be out there. It is also a forum for the authors to get feedback on their manuscript from a wider pool of scientists than just 2 to 4 reviewers and an editor, which they can use to revise it and then perhaps submit it to a peer-reviewed journal.

Looking at this pangenome paper, I recognized some of the authors as past collaborators and contributors to the field of bacterial genomics, so beyond the topic being interesting, I wanted to read what they had to say. I was delighted to see, right from the abstract, agreement with points made in my book about some of the pitfalls of automated annotation. Even better was the claim that they have devised a system to overcome some of those problems: a graph-based pangenome clustering tool called Panaroo.

As more and more bacterial genomes are being generated, more and more sequence data is being annotated. Since this uses automated annotation systems and is rarely curated manually on a gene-by-gene basis by someone with intimate knowledge of the genomics of the bacterial species (who has time to do that for a thousand genomes?), there are inevitable errors. Errors in one genome are replicated in others and so on. These errors impact studies that look at pangenomes, which start by identifying all of the orthologous CDSs. Also problematic are fragmented assemblies of the genomic data, mis-assemblies (where the data is put together in the wrong order), and contamination (where sequence data from another source is present in the reported data, something that happens far more than we like to admit!). If a core gene is fragmented, so that half of it is present in one contig and half of it is present in another contig (see Bacterial Genetics and Genomics Figure 17.15), then that core gene isn't identified in that isolate. It can't be included in the pangenome analysis and will be reported as absent there, which might mean it is removed from the core genes list. In the Figure 17.15 example, that gene is dnaA; it is core and important for bacteria. So, there is a problem with fragmented genome sequence data, as well as with errors in annotations.
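To make the dnaA problem concrete, here is a minimal Python sketch of how one missed gene call changes a core gene list. It is a toy illustration, not how any particular pangenome tool works: the isolate names, the gene sets, the `core_genes` function and its `threshold` parameter are all made up for this example.

```python
def core_genes(presence, threshold=1.0):
    """Return the genes found in at least `threshold` (a fraction) of isolates.

    `presence` maps an isolate name to the set of genes called in it.
    Hypothetical helper for illustration only.
    """
    n_isolates = len(presence)
    all_genes = set().union(*presence.values())
    core = set()
    for gene in all_genes:
        count = sum(gene in genes for genes in presence.values())
        if count / n_isolates >= threshold:
            core.add(gene)
    return core


# Toy data: in isolate_C, dnaA is split across two contigs,
# so the annotation never calls it and it looks absent.
presence = {
    "isolate_A": {"dnaA", "gyrB", "recA"},
    "isolate_B": {"dnaA", "gyrB", "recA"},
    "isolate_C": {"gyrB", "recA"},
}

strict_core = core_genes(presence)                  # gene must be in every isolate
relaxed_core = core_genes(presence, threshold=0.6)  # gene in at least 60% of isolates

print(sorted(strict_core))
print(sorted(relaxed_core))
```

With a strict definition of core, dnaA drops off the list because of a single fragmented assembly. Relaxing the threshold brings it back, but only by tolerating the error rather than fixing it.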

Panaroo shares information between genomes to improve annotation calls, which improves the clustering of the orthologues and paralogues. The authors tested Panaroo against genome sequence data and compared its results to those from other pangenome analysis tools. They used simulated genome sequence data sets that included contamination and genome fragmentation. One was for Mycobacterium tuberculosis, which is a highly conserved species that has little variation in its genome. The other was made of Klebsiella pneumoniae genome sequence data, which is highly diverse. The authors report that Panaroo was better than the other pangenome tools at identifying the core and accessory genomes in both datasets. In addition, there are tools within Panaroo that allowed the researchers to identify 9 samples within the K. pneumoniae dataset that were outliers during the quality control stage, before pangenome analysis. Other tools within Panaroo allow users to conduct pangenome genome-wide association studies (pan-GWAS), which look for genetic features in a collection of genome sequences that might be associated with phenotypes, and to investigate pangenome evolution. After testing Panaroo on the M. tuberculosis and K. pneumoniae datasets, the authors tried it out on actual genomic data, not simulated test data, from a few other species discussed in the paper to see what they would find. The results are quite interesting, and they found some novel and in some cases unexpected features.
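One way to picture how sharing information between genomes can help is to look at gene neighbourhoods. The Python sketch below is my own simplified illustration of that idea, not Panaroo's actual algorithm or code: it builds a small gene adjacency graph from the gene order in each isolate and flags genes that an isolate lacks even though at least two of that gene's usual neighbours are present. The function names, isolate names and gene orders are all assumptions made up for the example.

```python
from collections import defaultdict


def build_adjacency(gene_orders):
    """Link every pair of genes that sit next to each other in any isolate."""
    graph = defaultdict(set)
    for order in gene_orders.values():
        for left, right in zip(order, order[1:]):
            graph[left].add(right)
            graph[right].add(left)
    return graph


def candidate_missing_genes(gene_orders):
    """For each isolate, list absent genes whose neighbours are present.

    A gene is a candidate if the isolate lacks it but carries at least two
    genes that neighbour it in other isolates. Such a gene may be a missed
    or fragmented annotation rather than a true absence.
    """
    graph = build_adjacency(gene_orders)
    candidates = {}
    for isolate, order in gene_orders.items():
        present = set(order)
        candidates[isolate] = {
            gene
            for gene, neighbours in graph.items()
            if gene not in present and len(neighbours & present) >= 2
        }
    return candidates


# Toy gene orders: isolate_C has no dnaA call between rpmH and dnaN.
gene_orders = {
    "isolate_A": ["rpmH", "dnaA", "dnaN", "recF"],
    "isolate_B": ["rpmH", "dnaA", "dnaN", "recF"],
    "isolate_C": ["rpmH", "dnaN", "recF"],
}

candidates = candidate_missing_genes(gene_orders)
print(candidates)
```

In this toy data, dnaA is flagged for isolate_C only, because the genes on either side of it are there. A real tool would then go back to the sequence itself to check whether the gene can be found at that position.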

Panaroo is freely available and uses Python. The paper describing it is available here (https://www.biorxiv.org/content/10.1101/2020.01.28.922989v1) and it can be downloaded here (https://github.com/gtonkinhill/panaroo).
