Harnessing aptamers for rapid detection of bacteria and antibiotics in our food.

Bacterial Genetics and Genomics book Discussion Topic: Chapter 4, question 14

The specificity of antibody binding has been exploited for many years in a variety of technologies. Although perhaps less famous, aptamers also have high binding specificity for their targets and, being made of RNA or DNA rather than protein, are much smaller in size, can be modified more easily, can be more easily reproduced, and can be stored and delivered more easily than antibodies.

Figure 4.14 from Bacterial Genetics and Genomics. In the top panel, an Aptamer is depicted as a bent green line. In the bottom panel, the Aptamer has bound a target and changed conformation; the green line has changed in shape around a purple sphere representing a Ligand.

It is by virtue of these properties that aptamers have been explored for a range of biotechnology applications, including as tools for detection of bacterial contamination of foods for human consumption.

The food-borne pathogen Vibrio parahaemolyticus is the leading cause of seafood-associated bacterial gastroenteritis. All of the traditional methods of detection rely on bulky laboratory equipment and are time-consuming. In a paper by Jiang et al., 2021, the goal was to create an electrochemical aptasensor. This combined microfluidic technology with the binding specificity of an aptamer to enable detection of V. parahaemolyticus outside of a laboratory within 30 minutes. The paper demonstrates the sensitivity and specificity of the device and suggests that it could be adapted to detect other bacterial pathogens.

Photograph of raw seafood, a common source of transmission of V. parahaemolyticus to humans. Photograph from Photographer Sergy from Thailand.

Although we tend to think of Staphylococcus aureus perhaps in other contexts, it is also an important food-borne pathogen, particularly in the USA. To overcome time delays associated with traditional methods of detection of S. aureus using culturing, Yang and colleagues developed an aptamer-based technology that makes use of established portable detection platforms available in personal glucose meters. These hand-held devices are already readily available and reliable, so they made an ideal starting point for development of a portable bacterial detection device. The aptamer biosensors were able to sensitively and selectively detect S. aureus in food samples, demonstrating the usefulness of the portable technology.

Portable glucose meter icon image by Laymik, UA. An icon drawing of a glucose meter, which has been used as part of an aptamer-based detection technology by Yang et al., 2021.

There is a study published in Analyst, a journal from the Royal Society of Chemistry, which investigated the contamination of powdered infant formula by the food-borne pathogen Cronobacter sakazakii. These bacteria cause meningitis, sepsis, and necrotizing enterocolitis in premature and immune-compromised infants. It is therefore vitally important that these bacteria do not end up in products for babies, like powdered infant formula. This particular study by Hye Ri Kim and colleagues, published in 2021, did not use the standard SELEX technique to develop its aptamers for detection of these pathogens. SELEX is ‘systematic evolution of ligands by exponential enrichment’ and has been used since 1990 to produce aptamers. Instead of SELEX, Kim et al. used a centrifugation-based partitioning method (CBPM), which produces target-specific aptamers in a shorter time-frame. Using this CBPM method, these researchers were able to isolate two aptamers against C. sakazakii and demonstrated that these could be used to efficiently detect the pathogen in powdered infant formula.

An icon drawing of a baby bottle beside a can of powdered infant formula with a scoop measure above it. Infant formula icon image by Chiara Rossi, IT.

Sometimes the issue isn’t the bacteria, but rather the antibiotics we have been using to combat the bacteria. For example, an aptamer was developed by Komal Birader and colleagues that specifically detects the antibiotic oxytetracycline in milk. This antibiotic was used in veterinary practice; however, it was banned due to potential side effects. This meant that it was necessary to develop a way to detect oxytetracycline in milk that was destined for human consumption. The detection method needed to be both affordable and usable in the field. This team was able to modify the aptamer that they developed so that the presence of the antibiotic would produce a visual signal, making it ideally suited for use in the field.

A glass of milk. Photograph by H. Zell.

Much like antibodies have been used in a variety of ways that are well outside of their original purpose inside our bodies to fight off foreign invaders, aptamers are being used in a range of ways that are beyond their original purpose. As explored in Chapter 4 of Bacterial Genetics and Genomics, aptamers can have a role in bacterial gene expression, but outside of the bacterial cell, scientists have found a whole range of uses for them, including detecting bacteria and antibiotics in our food.

Merging two concepts: finding genes that are both essential and core

Bacterial Genetics and Genomics book Discussion Topic: Chapter 3, question 13

One of the concepts that is discussed in Bacterial Genetics and Genomics is the difference between essential genes and accessory genes. The essential genes are those that are required for the bacteria to live and the accessory genes allow them to do something else that might be beneficial, but not essential. This is the difference between having all of the genes needed for cellular division and having the genes needed for a bacterial capsule. Being able to divide is an essential function and without it, the bacteria are not viable. The bacterial capsule may enable the bacteria to avoid the human immune system, thus boosting their survival, but it is not essential. In this comparison, the genes that are essential and those that are accessory are determined within the bacterium itself, not by comparing between bacteria – even those of the same species.

An image with a blue background on which is a fuzzy green diplococcus bacterium surrounded by a fuzzy dark blue halo, which is the bacterial capsule.
This is an image from a microscope of bacteria expressing accessory genes for the bacterial capsule. Some bacterial cells are surrounded by a layer of extracellular polysaccharides that form the bacterial capsule. The capsule is visible here as the dark blue layer surrounding the bacterial cells. The capsule can help the bacteria avoid the immune system and resist drying. (This is from Bacterial Genetics and Genomics Figure 9.14, image courtesy of Dr Kari Lounatmaa/Science Photo Library.)

Another concept discussed is the difference between core genes and accessory genes. In this case, we are comparing the genes present between two different bacteria of the same species. The genes that are in common between the bacteria are core. Those that are not are accessory. This is often represented with a Venn diagram, showing all of the genes that are shared between two or more bacterial strains – the core genes – and those that are not and are accessory genes. Everything together is the pan-genome.

A Venn diagram, where overlapping circles represent where concepts interconnect. At the center is the word 'core'. In the circles that do not overlap are the words 'accessory'. A circle encompasses everything and says 'pan-genome'.
The core genome, accessory genome, and pan-genome. The core genome includes all of the genetic features that are included in all of the examples of a particular species. Each genome will, in addition, have other genetic features that contribute to the accessory genome. When all of the potential genes in a species are considered, everything from the core genome and all accessory genomes, even when not seen together in the same cell, this is the pan-genome. (This is from Bacterial Genetics and Genomics Figure 3.4.)
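The Venn diagram relationship can also be written as simple set operations. The sketch below is only an illustration: the strain names and gene lists are made up, and real pan-genome tools cluster genes by sequence similarity rather than matching gene names.

```python
# Hypothetical gene content for three strains of one species.
# Strain names and gene lists are illustrative assumptions, not real data.
strain_genes = {
    "strain_A": {"dnaA", "ftsZ", "gyrB", "capA"},
    "strain_B": {"dnaA", "ftsZ", "gyrB", "pilE"},
    "strain_C": {"dnaA", "ftsZ", "gyrB", "capA", "tetM"},
}


def pan_core_accessory(genomes):
    """Return (pan, core, accessory) gene sets for a dict of strain -> gene set."""
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)            # every gene seen in any strain
    core = set.intersection(*gene_sets)      # genes present in every strain
    accessory = pan - core                   # genes missing from at least one strain
    return pan, core, accessory


pan, core, accessory = pan_core_accessory(strain_genes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
print("pan-genome size:", len(pan))
```

Adding a fourth strain can only keep the core the same size or shrink it, while the pan-genome can only stay the same or grow.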

It is interesting that both of these concepts use the term accessory gene – something that could easily confuse students and researchers when discussing the nature of these genes. In both cases, the point is that the genes in question are not essential to the function of the bacterial cells and quite often are not present in the genomes of all examples of that species.

In looking into this topic in the research literature, I came across a paper in PNAS (Proceedings of the National Academy of Sciences of the United States of America) by Bradley Poulsen and colleagues in Massachusetts, published in 2019. This article begins with a quite frank overview of what was believed to have been the promise of bacterial genome sequencing when it comes to the development of new antimicrobials. As they describe it, the revolution that was set to transform antibiotic discovery through the ability to identify essential genes and target them for new drugs, made possible from the first bacterial genome sequence in 1995, has just simply failed to materialize.

There are, of course, a number of problems that have arisen for which genomics is not entirely to blame. As noted in the PNAS paper, generally the bacterial membrane and efflux pump systems need to be overcome for any new drugs to be able to reach their targets. This is a strength of our bacterial foes, not a weakness of our genomic methods. Finding antimicrobials in some cases starts with screens of chemical libraries and improvements on these are often needed, which is again not a shortcoming of genomics. Lastly, there have been some successes in identification of new antibiotics, but often these are not broad-spectrum agents and therefore their development is abandoned, or they are not picked up by pharmaceutical companies for clinical trials, due to the lack of profitability. This is, again, not the fault of genomics.

However, the definition of essential genes does fall within genomics. In some cases, targets have been pursued that have not actually been essential, at least not in all species or in all strains of a species. This results in inhibitors having been developed to bacterial targets that are only effective against a sub-population of the pathogen, which is pretty much useless in a clinical setting.

In addition, genes can be essential only for a particular set of circumstances. They can be essential depending on how the bacteria are being grown, for example, but if the bacteria are grown in different conditions those genes are no longer needed and would otherwise have been classified as accessory genes. These are ‘conditional essential genes’ – ones that are essential only under certain conditions. Within the patient, the bacteria may not encounter that condition or may be able to avoid it by staying somewhere else in the body. Because they then do not need the gene, they could avoid being cleared by the antibiotic developed to target it. It is quite common for genes to be essential in laboratory media, which is an artificial environment, but for entirely different sets of genes to be needed for the various environments encountered in the body, like the gut, the blood, the mucosal surfaces, and inside cells.

Poulsen et al. proposed in their 2019 PNAS paper that antibiotic drug discovery could be improved through focusing on core essential genes – those that are shared amongst all of the sequenced genomes of a species AND that could be demonstrated to be essential in a range of different conditions that mimic the human body, not just one. In their proof-of-concept investigation, they chose to look at Pseudomonas aeruginosa due to the urgent need for novel antibiotics.

To find essential genes, they used transposon insertion sequencing. This approach is variously known as Tn-Seq, TIS, INseq, HITS, and TraDIS. Two P. aeruginosa laboratory strains were mutated in this way, strains PA14 and PAO1. Eight other strains were also investigated under five growth conditions and a statistical method called Finding Tn-Seq Essential genes (FiTnEss) was used to identify the core essential genes from the data collected.
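The idea of a core essential gene can be sketched as an intersection across both strains and growth conditions. This is not the FiTnEss method, which is a statistical analysis of transposon insertion data; it only shows the final logic once each strain and condition has its own list of essential genes. The strain names come from the paper, but the condition names and gene lists below are invented for illustration.

```python
# Hypothetical essential-gene calls: strain -> growth condition -> essential genes.
# Condition names and gene lists are illustrative assumptions, not data from the paper.
essential_calls = {
    "PA14": {
        "rich_medium": {"dnaA", "ftsZ", "gyrB", "purA"},
        "serum": {"dnaA", "ftsZ", "gyrB"},
    },
    "PAO1": {
        "rich_medium": {"dnaA", "ftsZ", "gyrB", "pyrF"},
        "serum": {"dnaA", "ftsZ", "gyrB", "pyrF"},
    },
}


def all_call_sets(calls):
    """Flatten strain -> condition -> genes into a list of gene sets."""
    return [genes for conditions in calls.values() for genes in conditions.values()]


def core_essential(calls):
    """Genes called essential in every strain under every condition."""
    return set.intersection(*all_call_sets(calls))


def sometimes_essential(calls):
    """Genes called essential somewhere, but not in every strain and condition."""
    sets = all_call_sets(calls)
    return set().union(*sets) - set.intersection(*sets)


print("core essential:", sorted(core_essential(essential_calls)))
print("strain- or condition-dependent:", sorted(sometimes_essential(essential_calls)))
```

In this toy data, purA and pyrF would look like drug targets if only one strain or one medium had been tested, which is the trap the paper describes.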

In the end, the conclusion of the authors is that it is not genomics that has failed us in providing the answers that we need for developing new antibiotics to combat the growing threat of antimicrobial resistant infections. It is rather our own inability to use genomics to its best advantage and to design experiments to best identify the genes that need to be targeted by novel antimicrobials. Through the advances in genomic technologies, it is now possible to rapidly and inexpensively sequence and re-sequence bacterial genomes, which was not possible in the 1990s, 2000s, and even 2010s. At the start of the 2020s, we are able to use the advances of genomics to do the types of experiments that we have wanted to do, but that were previously out of reach due to their expense and scope.

Therefore, the quest for core essential genes has only just begun.

What can we learn from bacterial genome sequences that are never ‘finished’ and are left in pieces?

Bacterial Genetics and Genomics book Discussion Topic: Chapter 17, question 14

When we first started sequencing bacterial genomes, back in the 1990s, the goal was to ‘finish’, to generate a complete, closed, circularized chromosomal sequence that as accurately as possible reflected what would be seen in the bacterial cell. This process often took months, if not years, and required extensive laboratory work. It generated sequences such as the first genome sequence of a free-living organism, the sequences of model organisms E. coli and Bacillus subtilis, and the first comparative genome sequence analysis.

Graphic design of the DNA double helix as a circular chromosome. Not to true proportions. Cropped from Figure 2.3 of Bacterial Genetics and Genomics (https://www.routledge.com/Bacterial-Genetics-and-Genomics/Snyder/p/book/9780815345695), Chapter 2: Genes.

As time went on and sequencing technology changed, it became faster and less expensive to do whole genome sequencing, yet the challenge of finishing the genome sequence into a closed chromosome remained. It became standard to analyze incomplete sequence data as contigs, contiguous strings of sequence data that had been assembled based on localized homology from raw sequence reads.

Figure illustrating the assembly of individual sequencing reads generated by sequencing technologies into an assembled contig, ready for analysis. Figure 17.2 of Bacterial Genetics and Genomics (https://www.routledge.com/Bacterial-Genetics-and-Genomics/Snyder/p/book/9780815345695), Chapter 17: Genome Analysis Techniques.
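To make the idea of contigs concrete, here is a toy greedy overlap assembler. It is a teaching sketch only: real assemblers use far more sophisticated graph-based methods and handle sequencing errors, and the reads, function names, and minimum overlap below are my own assumptions. The point to notice is that reads from two regions with no overlapping reads between them stay as two separate contigs.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that matches a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to join: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Made-up reads: three from one region and two from another, with a gap between.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "GGGTTTAAA", "TTAAACCC"]
contigs = greedy_assemble(reads)
print(len(contigs), "contigs:", sorted(contigs))
```

Closing the gap between the two contigs would need additional reads that span it, which is the extra laboratory work that 'finishing' a genome used to require.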

Although these contigs ranged in length and the number of contigs per sequenced genome also varied, assemblies did not generate closed sequences, due to features within the chromosome such as repeats and regions of within-genome homology. Therefore, as time went on in the 2000s and 2010s, the number of genome sequences in public databases such as GOLD (Genomes Online Database) referred to as ‘permanent drafts’ continued to grow exponentially.

Graph generated by the Genomes Online Database from available sequence data in GOLD (https://gold.jgi.doe.gov/statistics). Complete are finished to finalized chromosomes, while permanent drafts will remain in contigs because no further sequencing is planned for that sample to improve the sequence data or close the gaps.

Since the data in the ‘permanent drafts’ is not assembled into the final chromosome and is potentially incomplete or contains sequencing errors, someone new to the field or unfamiliar with genomics may initially assume that the draft nature of the data means that it is still in progress and will eventually be completed. For the vast majority of projects, this is not the case. This is simply because genome sequencing projects have learned to adapt to the data that is being generated and are able to use this draft data, assembled into contigs, to answer the scientific questions that are being posed.

There are many examples of this in the literature dating back just shy of 20 years, to the earliest publications using next-generation sequencing technologies. In this blog, I am going to highlight a paper by Joon Liang Tan et al. from 2020, which has investigated 15 newly generated genome sequences of Mycobacterium tuberculosis from Malaysia, using MiSeq by Illumina. This paper is a good example of the type of analysis that can be done with genome sequences that are in contigs.

Scientific Data journal article by Joon Liang Tan, et al., 2020, 'Genome sequence analysis of multidrug-resistant Mycobacterium tuberculosis from Malaysia'. Screenshot of the title with author list and Abstract.

As explained in the paper, multidrug resistant M. tuberculosis (MDRTB) is still relatively rare in Malaysia (1.5% of cases) and M. tuberculosis (TB) itself is present at an incidence of 92 per 100,000 population as of 2019. In this study, 15 MDRTB archive isolates from patients from 2009 to 2012 from the University of Malaya Medical Center were whole genome sequenced using MiSeq.

The raw sequence data underwent quality control, with poor quality reads being removed before assembly. The quality reads were then assembled into contigs, which were polished to improve the accuracy of the sequence data. This draft genome sequence data, in contigs, was then annotated using Prokka. The Prokka annotation data was used to identify core and accessory genomes with Roary. The Prokka and Roary data together fed into the Piggy analysis for intergenic region prediction. Lineages were predicted using the TB Profiler Web server. Based on the Prokka annotation of predicted protein encoding genes, protein sequences were submitted to the Comprehensive Antibiotic Resistance Database. The 15 MDRTB draft genome sequences were also compared against the collection of MDRTB genome sequences at the Broad Institute.

Often, such projects compare draft genome sequence data to an established reference genome. This relies upon there being a complete, closed, circularized genome sequence that is representative for the species. This means that at some point in the past someone had to do the hard work of finishing the genome sequence. In this case, the reference genome sequence used was M. tuberculosis H37Rv. Each of the 15 MDRTB draft genome sequences was compared against this reference genome. This draft sequence data is in fragments, assembled into contigs, with between 149 and 439 contigs per genome. It is not unusual for draft genome sequences to have over 100 contigs. It is quite good for a draft bacterial sequence to have under 100 contigs or better yet under 50. The contigs are aligned against the reference genome sequence to determine how much similarity there is between the newly sequenced bacterial genome and the reference bacterial genome. In this case, an average of 99.26% of the M. tuberculosis H37Rv reference genome sequence was covered by the 15 Malaysian MDRTB sequences. This means that almost everything found in H37Rv could be found in these Malaysian isolates. Focusing just on the core gene families that were identified in the 15 MDRTB isolate sequences by Roary, about 97% were also present in the reference genome sequence TB H37Rv. This means that most of the identified genes that are common between the Malaysian isolates are also found in the reference genome, but a few are not. Of those that are not, most are hypothetical (predicted by the software, but where function is not known).
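A percentage like 'how much of the reference is covered' can be worked out from the positions where contigs align to the reference. The sketch below shows one plausible way to do that arithmetic; I am not claiming it is how the paper calculated its figure. The coordinates and the 1,000 bp reference length are invented, and intervals are half-open (start included, end excluded).

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it rather than double count.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def reference_coverage(alignments, reference_length):
    """Percentage of reference positions covered by at least one aligned contig."""
    covered = sum(end - start for start, end in merge_intervals(alignments))
    return 100.0 * covered / reference_length


# Toy example: a 1,000 bp 'reference' and three aligned contig segments.
alignments = [(0, 400), (350, 700), (720, 990)]
coverage = reference_coverage(alignments, 1000)
print(f"{coverage:.2f}% of the reference is covered")
```

Merging first matters because two contigs that align to the same stretch of the reference should only count once.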

Shifting to analysis of the intergenic regions, there were 2,172 to 2,288 regions between the CDSs in the 15 MDRTB isolate sequences. Of these, 1,365 were found in all of the 15 MDRTB isolate sequences and 1,453 were present in at least two of the isolates. Of the intergenic regions, 974 were found in only one of the 15 MDRTB isolate sequences. These sequences between the CDSs were one of the main contributing factors to differences between the isolates and between the isolates and the reference strain M. tuberculosis strain H37Rv.
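Counts like these come from tallying how many isolates each intergenic region appears in. Below is a small sketch of that tally with invented region identifiers and only three isolates. The category boundaries are my own assumption (here 'at least two' includes the regions found in all isolates), and this is not the Piggy algorithm itself, which also has to decide when two sequences count as the same region.

```python
from collections import Counter

# Hypothetical presence data: isolate -> set of intergenic region identifiers.
regions_by_isolate = {
    "isolate_1": {"igr_001", "igr_002", "igr_003"},
    "isolate_2": {"igr_001", "igr_002", "igr_004"},
    "isolate_3": {"igr_001", "igr_005"},
}


def presence_categories(regions_by_isolate):
    """Count regions found in all isolates, in at least two, and in only one."""
    n_isolates = len(regions_by_isolate)
    # Each isolate holds a set, so a region is counted at most once per isolate.
    counts = Counter(
        region for regions in regions_by_isolate.values() for region in regions
    )
    return {
        "in_all": sum(1 for c in counts.values() if c == n_isolates),
        "in_at_least_two": sum(1 for c in counts.values() if c >= 2),
        "in_only_one": sum(1 for c in counts.values() if c == 1),
    }


categories = presence_categories(regions_by_isolate)
print(categories)
```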

Seven of the 15 MDRTB isolate sequences were classified as part of the East Asian Lineage 2.2.1 and four as part of East Asian Lineage 2.1. Two isolates were determined to be part of the Indo-Oceanic Lineage 1.1.3 and one was included in the Euro-American Lineage. The final isolate was of Lineage 1.2.2. The SNP differences between each isolate indicated that they were epidemiologically distinct and not related to one another. I think it would have been nice to see a phylogenetic tree with the data from these isolates, to put the Malaysian TB into visual context with sequenced TB from the rest of the world, but this publication does not include any figures.

The investigation also conducted a detailed analysis of the antimicrobial resistance determinants in the genetic data. Importantly, these analyses showed that there is no evidence of XDRTB (extensively drug-resistant tuberculosis) because there wasn’t resistance to: (1) first-line anti-TB drugs (isoniazid, pyrazinamide, streptomycin, rifampin, and ethambutol); (2) a fluoroquinolone; and (3) at least one second-line drug (amikacin, kanamycin, or capreomycin). To be classed as XDRTB, all three criteria need to be met. The isolates are not only phenotypically not XDRTB (they did not display resistance in lab tests), but they also are not genotypically XDRTB (they don’t have the genetic features needed to satisfy all three criteria). Due to regulation and mutation, it is possible for bacteria to carry a gene, but not express its phenotype. However, if evidence of the genetic markers for XDRTB were identified, this could be concerning. Fortunately, there was none.
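The 'all three criteria' rule can be written as a short check. Some of this is my own interpretation and is flagged in the comments: criterion 1 is read as resistance to at least isoniazid and rifampin (that is, multidrug resistance), and the fluoroquinolone list is just a few example drugs. The second-line drugs are the three named above. This is a sketch of the logic, not a clinical classification tool.

```python
# Assumption: criterion 1 is interpreted as resistance to at least these two drugs.
FIRST_LINE_REQUIRED = {"isoniazid", "rifampin"}
# Assumption: an example list of fluoroquinolones, not an exhaustive one.
FLUOROQUINOLONES = {"ofloxacin", "levofloxacin", "moxifloxacin"}
# The three second-line drugs named in the text.
SECOND_LINE_DRUGS = {"amikacin", "kanamycin", "capreomycin"}


def is_xdr(resistant_to):
    """Return True only if all three criteria are met for a set of drug names."""
    resistant = {drug.lower() for drug in resistant_to}
    meets_first_line = FIRST_LINE_REQUIRED <= resistant           # criterion 1
    meets_fluoroquinolone = bool(FLUOROQUINOLONES & resistant)    # criterion 2
    meets_second_line = bool(SECOND_LINE_DRUGS & resistant)       # criterion 3
    return meets_first_line and meets_fluoroquinolone and meets_second_line


# Hypothetical resistance profiles for illustration.
mdr_profile = {"isoniazid", "rifampin", "streptomycin"}
xdr_profile = {"isoniazid", "rifampin", "moxifloxacin", "amikacin"}
print("MDR-only profile is XDR:", is_xdr(mdr_profile))
print("XDR profile is XDR:", is_xdr(xdr_profile))
```

The same function could be run on either phenotypic results (lab tests) or genotypic predictions (resistance markers found in the contigs), which is the distinction drawn above.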

The authors highlight the importance not only of monitoring antibiograms of isolates for epidemiology, but also regular sequencing to evaluate the distribution of lineages and genetic basis for observed resistance.

This information was all gained, vitally, from genome sequence data that was not ‘finished’. Indeed, one of the genomes was in quite a lot of contigs (439!), yet meaningful information was extracted from these pieces of the whole chromosomal puzzle. Whilst a complete chromosome can be ideal for answering a range of research questions, there is a great deal that can be done with the vast quantities of incomplete, draft data that is available in the public databases and is continuing to be generated.

Contributions of women to bacteriology: Blog for International Women’s Day 2021

Bacterial Genetics and Genomics book Discussion Topic: Chapter 21, question 16

For these blogs I have not been including the wording of the end of chapter questions from Bacterial Genetics and Genomics. Instead, I have blogged about the general theme of these questions, often highlighting a research article on the topic.

However, today (8th March 2021) is International Women’s Day and the very last self-study end of chapter question in the book is very relevant:

“Esther Lederberg discovered lambda (λ) bacteriophages and described lysogeny. She also made other contributions to microbiology and microbial techniques that contributed to bacterial genetics and genomics. The study of λ provided insight that was extrapolated across the field of genetics. Explore and discuss the contributions made by Esther Lederberg and at least one other scientist who made important contributions, but may not be well known, perhaps due to gender or race.”

Photograph of Prof. Esther M. Zimmer Lederberg wearing a lab coat and standing in a laboratory. http://www.estherlederberg.com/ColleaguesIndex.html

Prof. Lederberg is mentioned more than once in the book, having been instrumental not only in the discovery of lambda (λ) bacteriophages, but also in the description of the F factor in bacteria and the development of the replica plating technique. I encourage you to look for more information about Prof. Lederberg and her various contributions to microbiology.

Another woman I would like to discuss in this blog is Jane Hinton. There are some bacterial growth media with interesting names, including Mueller-Hinton agar as well as a range of other broths and agars that are all obviously named after someone. We don’t often think about the people who took their time and effort to develop these valuable and essential resources that enable us to do the fundamental aspect of our work – culturing bacteria. Most of my research is on Neisseria gonorrhoeae and Jane Hinton was involved in creating the Mueller-Hinton media that was instrumental in making culturing of N. gonorrhoeae practical.

Portrait photograph of Dr. Jane Hinton from The 1949 Scalpel, yearbook of the Senior Class, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, Pennsylvania

Dr. Hinton was the daughter of Prof. William Augustus Hinton, the first African-American professor at Harvard University and the first African-American author of a textbook. In 1931, he developed a Medical Laboratory Techniques course that was open to women, which led to Jane Hinton working with John Howard Mueller at Harvard on the creation of a medium for culturing N. gonorrhoeae and Neisseria meningitidis. Mueller-Hinton agar became the standard for antibiotic susceptibility testing, due to the incorporation of starch in the medium, which enhanced growth and produced reliable antimicrobial testing results, and the transparent nature of the medium, which made the plates easier to read than opaque chocolate agar, as well as easier to make.

In 1949, Dr. Jane Hinton graduated from the University of Pennsylvania as a Doctor of Veterinary Medicine. She and Alfreda Johnson Webb, graduating from the Tuskegee Institute, were the first two African-American women veterinarians.

When you are next in the lab or writing up a method and come across a name for growth media or a technique, perhaps take a minute to look into the history behind the name and the microbiology discoveries that went into what we think of as commonplace today. And remember that the discoveries and inventions made by women working in microbiology today may be historic events to someone in the future.

Bacteria breaking the rules – again. This time, it’s coupled transcription – translation.

Bacterial Genetics and Genomics book Discussion Topic: Chapter 2, question 14

It has long been a defining difference that bacterial cells, like E. coli, have coupled transcription-translation and eukaryotic cells, like those of animals and plants, make their mRNA in the nucleus and their proteins in the cytoplasm. This is mostly the case. Mammals did break the rules and got found out a few years ago (see Coupled Transcription and Translation within Nuclei of Mammalian Cells). But it seemed clear that making proteins in the nucleus wasn’t the main process for these cells and that bacterial cells were the clear leaders in doing coupled transcription-translation.

Bacterial coupled transcription-translation. RNA polymerase generates mRNA in transcription from the DNA sequence template. This mRNA is translated by ribosomes, which follow closely behind the RNA polymerase.

The transcription-translation complex (TTC) is the physical association of the RNA polymerase (RNAP), which is transcribing the DNA into mRNA, and the ribosome (containing rRNA and ribosomal proteins), which is translating mRNA into protein with the help of tRNAs carrying amino acids. The TTC is also known as the expressome.

It’s been discovered that bacteria have been breaking the rules as well. As we often find, when we try to define our world, we will come across exceptions to our rules and classifications, and we need to be able to reappraise our understanding as we learn more and do more research. This is one such example. Coupled transcription-translation is an ideal example from microbiology. A lot of what we know in microbiology, especially insights gained in early microbiology, is based on research in E. coli that was extrapolated across to other bacterial species. The more we learn about other bacterial species, the more we learn about how diverse they are and how much they do not do things the same as E. coli, like coupled transcription-translation.

In August of 2020, Grace E. Johnson, Jean-Benoît Lalanne, Michelle L. Peters, and Gene-Wei Li published research focused on the Gram-positive model species Bacillus subtilis that challenged the coupled transcription-translation paradigm that had been established 50 years earlier in E. coli, a model Gram-negative bacterial species.

Johnson et al., Nature 2020 Functionally uncoupled transcription-translation in Bacillus subtilis.

The B. subtilis RNAP was shown to transcribe DNA into mRNA much faster than the ribosomes can progress along the mRNA to conduct translation. This ‘runaway transcription’ means that the RNAP and ribosome are not coupled in transcription-translation, with RNAP moving at nearly twice the speed and leaving the translation machinery in its wake.

The differences between E. coli and B. subtilis with regards to the association between RNAP and the first ribosome conducting translation mean that there are also other differences. B. subtilis tends to rely less on Rho-dependent transcriptional termination and to have a greater prevalence of riboswitches in the mRNA leader sequences. Looking at these features, the researchers were able to suggest that ‘runaway transcription’ is not limited to B. subtilis and that there should now be two models of bacterial transcription and translation: translation-coupled transcription and runaway transcription.

From Johnson et al., Nature 2020, Figure 4b&c. Two models for bacterial transcription and translation, depending on species. On the left (b) is translation-coupled transcription, also known as coupled transcription-translation, in which RNAP is followed closely by ribosomes, as described first in E. coli. Characteristic here is intrinsic termination, Rho-dependent termination, and leader peptide attenuation. On the right (c) is runaway transcription in which RNAP runs much faster than the ribosome that follows, as described first in B. subtilis. Characteristic here is protein and riboswitch-based regulation, and termination-based non-stop mRNAs.

Further reading on transcription-translation in bacteria, including the coupled, uncoupled, and collided E. coli expressome states, is available in a Nature Reviews Microbiology Research Highlight article.

The contribution of Johnson et al., 2020 to our wider understanding of transcription and translation in bacteria is a second model, ‘runaway transcription’, which in some bacterial species takes the place of coupled transcription-translation. This is an important insight into bacterial genetics and genomics and a reminder of the diversity of bacterial species.

Studying fascinating microbiomes, working in cooperation with communities, and illuminating study design.

Bacterial Genetics and Genomics book Discussion Topic: Chapter 1, question 14

I recently attended, and was an invited on-line speaker at, a conference with a topic on short- and long-read sequencing technologies. The various talks and the panel discussion that I participated in looked at the advantages of both short-read sequencing and long-read sequencing for what they can bring to research and to the clinical setting. The conference was Oxford Global NextGen Omics UK, which runs yearly, usually in London.

Following the conference, I decided that I wanted to blog about a study that had used sequencing to do something interesting. I found this paper, by Ovokeraye Achinike-Oduaran and colleagues at the University of the Witwatersrand, which describes a microbiome study from South Africa. Although the paper does not do whole genome sequencing of the microbiome (it is using 16S rRNA microbiome sequencing), it does overturn longstanding assumptions to develop its experimental design to investigate the gut microbiome.

Graphic in pink of the human small and large intestine overlaid with graphics of larger-than-life microbes. CC

Firstly, there is an issue in a lot of our human genomic data, microbial genome data, and metagenomic data. The vast majority of it has come from Western samples and from white patients. Although this may be changing, as with many things, change is slow. We cannot extrapolate information from the data we have to the whole, if we are lacking information from segments of that whole. In bacteriology, we have relied for a long time on extrapolating the knowledge gained by studying E. coli to other bacterial species, only to discover that many of these other species do things differently. We need more data about other bacterial species and about our own.

Secondly, it is important to remember that Africa is vast and varied. It is important to stop thinking of Africa as one homogeneous population and to remember that the people, the cultures, and the ecosystems and environments vary considerably across the continent. People living in different States in the USA have very different cultures, different weather conditions, and often different accents. In addition to that, they share the same continent with people in Canada and Mexico, who also have their own characteristic States and Provinces. The North American continent is diverse and so too is the continent of Africa.

Thirdly, many studies that have previously looked at the microbiome of humans in Africa have focused on those from populations that live in what Western cultures would consider to be “extreme” conditions, living in rural hunter-gatherer societies or agricultural populations. These studies have been vital in understanding traditional African population microbiomes and, in particular, the microbiomes of the populations that were studied. However, there has been a shift in the last 50 years or so to a more industrialized and sedentary lifestyle in sub-Saharan Africa. Therefore, these transition populations warrant investigation as well. There are a growing number of supermarkets and fast food outlets with trends toward Westernized processed and animal-based food products. As seen in other cultures, there are also associated reports of increases in obesity and decreases in physical activity across 24 African countries.

In this investigation, two South African populations are investigated, one urban and one transitioning rural. No assumptions are made about the ability to extrapolate findings from non-South African populations onto this community. The gut microbiomes here may be very different to elsewhere in the world, therefore necessitating investigation. Here there is an ongoing transition epidemiologically, which makes it a fascinating community to study, as the gut microbiome may be in transition as the dietary lifestyles of the people change.

The microbiota landscapes of obese and lean female individuals from South Africa, drawn from diverse ethnolinguistic groups, were examined in this investigation. It is noteworthy that even within the study, it is recognized that there is a wealth of diversity within South Africa, both in terms of the dietary transitions and also in terms of ethnolinguistic groups. This is a pilot study that is part of the Human Heredity and Health in Africa (H3Africa) initiative.

Although both cohorts in this study are from South Africa, they are 300 miles (483 km) apart and represent different lifestyles and diets. One group of females, ranging in age from 43 to 72 years, was from Bushbuckridge, a rural community. Here the researchers worked extensively with a community advisory group to ensure their research was performed in a sensitive and respectful manner that was clear and engaging with the community it was seeking to understand.

Photograph of Bushbuckridge in South Africa, a rural community. CC

The other group of females, ranging from age 43 to 64 years, was from the urban Soweto area, where the median BMI was 36.52 compared to 31.17 in the Bushbuckridge cohort.

Photograph of Soweto, South Africa, an urban community. CC

Benefitting from the data of previous microbiome studies, this investigation was able to reveal that the cohorts’ gut microbiomes contained taxa that tend to be found in Western microbiomes as well as taxa that tend to be found in non-Western microbiomes, capturing the transitional state of the human gut as the diets of this population change. Some of the genera typical of hunter-gatherer societies are present, but so too are genera typical of people who consume a Western diet.

Given the issues of obesity in many Western societies and the growing issues of obesity in parts of Africa, particularly where Western culinary cultures are being adopted, Oduaran et al. (2020) specifically compared the obese and lean females within their study. They determined that there were no statistically significant within-site differences for the Soweto cohort. Some differentially abundant taxa were observed in obese samples from Bushbuckridge, including Oscillibacter, which has been reported to be associated with obesity in a European cohort.

Photograph of Ovokeraye Achinike-Oduaran, lead author of the research study being discussed in this blog.

This study is an important pilot investigation. It shows that these intermediate states of human gut microbiomes are present in the current populations in South Africa, both in the urban setting and rural communities. The consequences of this mixed gut microbiome for human health are unknown; however, preliminary sequencing investigations such as this lay the groundwork for doing more with larger cohort groups, in cooperation with communities that are under-represented in our current research datasets. Future research will also be able to capture more sequence data using enhanced sequencing technologies, perhaps in the not too distant future using robust long-read sequencing on its own to generate high quality metagenomic data routinely.

This and other studies being conducted by researchers that are looking at under-represented populations of humans and under-studied bacterial species are important. We must not lose sight of the wondrous diversity of our planet that provides us with ever more to learn as biologists.

Biology Week Topic: Genomics, Earthquakes, and Cholera

Bacterial Genetics and Genomics book Discussion Topic: Chapter 11, question 14

Although I had planned to continue to work through the Discussion Topics from Bacterial Genetics and Genomics that are all related to investigations in bacterial genetics and genomics that can be conducted outside of the lab using bioinformatics tools, October here in the UK includes Biology Week. In addition to my research and teaching at the University, I also like to do outreach activities and public engagement in science, which includes going to schools to talk about biology, my career as a woman in science, and the cool things that we can do with genomics. One of the stories that the students and teachers find particularly fascinating is one that I was asked to deliver for Biology Week this year, but remotely, due to COVID-19. So, for this month’s blog, I have combined my regular blog with a video that I have produced for Biology Week about the outbreak of cholera in Haiti and how genomics helped to solve the mystery of its origin.

Bacterial Genomics – Cholera in Haiti

In January 2010, an earthquake struck Haiti causing widespread devastation. The international aid community responded, with the UN coordinating the sending of troops to help rebuild needed infrastructure and help the population. However, in October 2010, there was an outbreak of cholera that was traced back to the UN aid worker station and unsanitary conditions of waste disposal there. Ultimately, genome sequencing of the Vibrio cholerae bacteria showed that it was nearly identical to bacteria from a recent outbreak in Nepal (doi: 10.1128/mBio.00157-11), as shown in the phylogenetic tree figure below. This genomic information links the outbreak to the arrival in October 2010 of UN aid workers from Nepal. Although none of these people who had come to help had symptoms, this emphasizes the role that asymptomatic carriage has in the spread of infections during an outbreak.

Phylogenetic tree showing cholera bacteria from Haiti are nearly identical to cholera from Nepal. Hendriksen et al., mBio, 2011.

Using metagenomics, the sequencing of environmental samples of DNA, researchers investigated the prevalence of cholera in Haiti in the years following the initial outbreak. Monika A. Roy and co-authors published in 2018 their metagenomic approach to evaluating surface water quality in Haiti (doi: 10.3390/ijerph15102211). In their publication, they note that since the earthquake in January 2010 and the subsequent cholera outbreak in October 2010, there had been over 9,000 deaths from cholera in Haiti.

During the term of their investigation, cholera cases had declined but were still occurring, and one of the issues was the inability to predict where new cases might occur due to gaps in surveillance. It was hoped that using metagenomics to monitor groundwater might identify sources of potential cholera contamination.

The cost of using genomics for metagenomics monitoring of environmental samples is mentioned in the Introduction of the paper by Roy et al., 2018. They mention the availability of handheld MinION devices, such as those discussed in my previous blog entry, but cite that the reagents to run one flowcell cost $1000. They concede that if samples are multiplexed, mixed together with molecular barcodes to differentiate them so more samples can be run at once, then the cost can be reduced to $80 per sample.
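The arithmetic behind multiplexing can be sketched in a few lines of Python. This is only an illustration under a loud assumption: that the $1000 reagent cost is the only cost and is split evenly between the barcoded samples. Library preparation, barcoding kits, and staff time are ignored, and the function names are my own.

```python
import math

def per_sample_cost(run_cost: float, n_samples: int) -> float:
    """Reagent cost per sample when one flow cell is shared evenly."""
    if n_samples < 1:
        raise ValueError("need at least one sample")
    return run_cost / n_samples

def samples_needed(run_cost: float, target_cost: float) -> int:
    """Smallest number of multiplexed samples that meets the target cost."""
    return math.ceil(run_cost / target_cost)

flowcell_cost = 1000  # USD per flow cell, the figure cited from Roy et al., 2018

print(samples_needed(flowcell_cost, 80))
for n in (1, 6, 12, 13):
    print(n, round(per_sample_cost(flowcell_cost, n), 2))
```

Under this simplified model, a cost of about $80 per sample corresponds to sharing one flow cell between roughly a dozen samples.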

Water samples were collected in 2017 as a pilot investigation and then again from specific locations in triplicate in 2018. The water was filtered using 0.22 μm filters. This is the same size filter we would use in the lab to make solutions that cannot be autoclaved, but which we need to be sterile. Therefore, bacteria in the water would be trapped onto the filters and the DNA from these was later extracted in the lab using a QIAGEN kit. This DNA is a mixture of DNA from the environment, thus being a metagenomic sample, which was sequenced using an Ion Torrent sequencer. The data was processed using CosmosID to gain the results that were interpreted by the team.

Map of regions of Haiti where water samples were taken for metagenomic sequencing. Roy et al., Int J Environ Res Public Health, 2018.

Another issue is the bioinformatics, but again the authors acknowledge that there are improvements here in making workflows easier to use and better optimized. Using the popular Krona visualization tool for metagenomics analysis, the diversity of organisms revealed is presented. They noted that V. cholerae was present in most of the replicates of water samples, although at varying abundance. The researchers are careful in their experimental design and note that V. cholerae is not always pathogenic; this is also an environmental bacterial species and when investigating metagenomics, markers must be selected to identify pathogenic versus non-toxigenic strains. The Haitian strain HE-45 was detected and samples with the virulence gene V. cholerae intI1 were identified. Cholera toxin converting phage was detected in one sample from a source not known to be used as drinking water; however, these places are used for bathing, washing clothes, and for other household activities that could result in accidental ingestion of contaminated water. In addition to the cholera toxin gene, sequences for Shiga toxin were also detected. A distinct advantage of a metagenomics approach is the identification of sequences that were perhaps not part of an original hypothesis; if the initial concern had been about cholera and the investigation had just looked for cholera, these other sequences of concern would have been missed. Shiga toxin is made by EHEC E. coli such as O157:H7 and can cause life-threatening disease.

Krona visualization of Haiti water source metagenomic data showing diversity of organisms present. Roy et al., Int J Environ Res Public Health, 2018.

The conclusion of the metagenomics study was that no V. cholerae O1 or O139 strains were detected, which was consistent with the decline in cholera cases; however, the presence of cholera toxin genes and Shiga toxin genes in the water was concerning. These could indicate potential risks for the populations using this water.

The last case of cholera in Haiti was seen in January 2019. Control of the outbreak has taken a tremendous effort in re-establishing the clean water and sanitation that were destroyed by the earthquake and improving upon the infrastructures of Haiti, as well as implementing other control measures such as vaccination and rapid tracing and treating systems. In April 2019, there were concerns that the apparent eradication of cholera in Haiti should not be taken as a sign that the nation could cease being vigilant. High standards in potable water, sanitation, and healthcare must still be emphasized as of fundamental importance for preventing outbreaks across the globe.

Bacterial genomes, then and now

Bacterial Genetics and Genomics book Discussion Topic: Chapter 17, question 13

The first bacterial genome sequences were published 25 years ago. The technology has come a long way since then, both in the lab and computationally. One of the first bacterial genome sequencing projects started was one undertaken to sequence the complete Escherichia coli genome. Ultimately, it was completed and published in 1997. To date, this publication has been cited by 2,469 other articles, including a recent investigation into E. coli that are present on the surface of the human eye (Ranjith et al., 2020).

E. coli are predominantly found in the intestinal tract of humans and animals, but they are also present elsewhere, including on the ocular surface. There are, in fact, many bacterial species that reside as commensal, non-pathogenic, bacteria on the surface of the eye. These bacteria can cause opportunistic ocular infections when there has been trauma to the eye or due to other issues that compromise the immune system. To understand more about this type of E. coli, 10 eye isolates were genome sequenced. The genome sequences were analyzed for single nucleotide polymorphism (SNP) variation between them. The isolates were sorted into the nearest E. coli phylogenetic group and pathotype, as well as being assessed for antimicrobial resistance genes, prophages, and other factors that might be involved in pathogenicity.

The E. coli isolates came from two cases of conjunctivitis, two cases of bacterial keratitis, five cases of endophthalmitis, and one from orbital cellulitis. DNA was extracted from overnight cultures of the E. coli using a QIAGEN DNA isolation kit and genome sequenced using an Illumina HiSeq. The data was mapped against the reference genome sequence E. coli strain K-12 substrain MG1655 using BWA. The de novo assemblies used Velvet.

From this study, antimicrobial resistance genes were identified, which correlated with antimicrobial investigations in the laboratory. Five out of the 10 isolates were resistant to more than three classes of antibiotics. Plasmids, prophages, and virulence genes were also identified, including some that may be characteristic of ocular isolates.

The different presence and absence of virulence genes, resistance genes, prophages, and plasmids were examined using BRIG, the BLAST Ring Image Generator, an example of which is described in Chapter 17 of Bacterial Genetics and Genomics and shown in Chapter 17 Figure 12 (shown below).

BRIG, BLAST Ring Image Generator, example image from Bacterial Genetics and Genomics, Chapter 17, Figure 12. Comparative visualization of E. coli genome sequences.

The 10 isolates could also be categorized into three of the seven known phylogenetic groups: A (2 isolates); B2 (7 isolates); and C (1 isolate). SNP analysis agreed with these relationships. Additional analysis associated the ocular isolates with four of the eight pathotypes, grouping six isolates with ExPEC (extra-intestinal pathogenic E. coli), two with EPEC (enteropathogenic E. coli), one with ETEC (enterotoxigenic E. coli), and one with UPEC (uropathogenic E. coli) strains. There was therefore a lack of concordance between the phylogenetic groups (A, B2, and C) and the pathotypes (ExPEC, EPEC, ETEC, and UPEC).

This study, and the over 2,000 others that have cited the original E. coli genome sequence paper, as well as many others, have shown the power of genome sequence data in revealing both differences and similarities between bacterial strains. There is tremendous diversity in the microbial world. The more we sequence, the more that is apparent. We have come a long way in 25 years. Not only are many genome sequence papers published currently that include more than one genome, but such studies are able to comparatively analyze the data, identify features that we did not know were present when investigations first started, generate the data and analyze it in a fraction of the time, and do so using much less starting material than previously required, including generating sequence data without first culturing bacteria in the lab.

Research and planning outside of the lab: working with restriction enzymes.

Bacterial Genetics and Genomics book Discussion Topic: Chapter 16, question 15

For this blog, I have decided to look at the Discussion Topic from Bacterial Genetics and Genomics, Chapter 16, question 15, which discusses restriction enzymes and encourages us to try finding digest sites for these enzymes ourselves in a gene of interest, using on-line tools. In the spirit of the last few blogs (see Doing research and making discoveries outside of the lab and More research outside of the lab, protein structure predictions), since many of us are working at home during the pandemic, either exclusively or with limited access to our labs, I have decided to take an approach that uses a minimum of tools to achieve the final goal. This means that regardless of your set up on your computer, you should be able to complete the task of identifying two restriction enzyme recognition sites, the places where the enzyme would digest the DNA. We want enzymes that would generate sticky ends and that will cut only once in the gene sequence, with one enzyme cutting at the beginning and one at the end of our gene of interest.

I’m going to use the same hypothetical gene that I investigated in the previous blogs. So far, I have been using the predicted protein sequence. This is easily extracted out of an annotation. Annotation files are plain text files and can be opened with a text editor. I recommend Notepad for those using Windows because Notepad doesn’t add any formatting; a word processing programme like Word does add formatting, even to plain text files, so keep it simple with something like Notepad that opens .txt type files when you want to open FASTA and annotation files such as those downloaded from GenBank. Google tells me that Mac users can use TextEdit, but you need to be sure the format is plain text. I’m not a Mac person, so I don’t know for sure, but maybe someone can verify in the comments.

Since the annotation lists the amino acid sequence, it is simple to copy and paste out the predicted protein sequence. The location of the annotated coding region, the CDS, which is the prediction for what might be a gene, is indicated by the base positions. The actual sequence of the CDS is part of the DNA sequence information as a whole. So, to get the specific sequence of the CDS or a specific gene, we need to extract it.

If we are working in a public database like NCBI GenBank, we can retrieve the sequence data or restrict it to just show the specific base range we need for the CDS. However, I am working with my own sequence information, which I have as the plain text, flat file of the annotation, in GenBank format. I could use a specialist program to open it and extract out the DNA sequence for the gene I need. But, I don’t have to do that. And, the purpose of this blog is to show you that you don’t need that and it can be done on any computer, with a bit of internet and some copying and pasting. No special software required.

In my case, I already know what sequence I want. I know from the annotation that it is present at “complement (226614..228458)”.

Screenshot of a portion of a GenBank format annotation of sequence for a CDS designated as a hypothetical protein.

Interesting. The designation “complement” and the numbers in parentheses mean that the sequence of this predicted gene is going to be in the DNA in reverse complement. That means that the 226614 is going to be the end of the CDS and the 228458 is going to be the beginning. This is not what we might have expected, so something to watch out for when working with DNA sequences. Remember, the genes can be on either strand of the double helix. If the CDS was on the forward strand it would have just had the numbers to the right of the designation CDS without “complement” and the “()”.
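For readers who like to double check the coordinates with a script, here is a minimal Python sketch that reads a location string like the one above and reports the strand and length. The function name is my own, and it only handles the two simple forms discussed here; joined or partial locations are deliberately not supported.

```python
import re

def parse_location(location: str):
    """Return (start, end, strand) for 'a..b' or 'complement(a..b)'.

    Coordinates are 1-based and inclusive, as in a GenBank flat file.
    """
    text = location.replace(" ", "")
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    match = re.fullmatch(r"(\d+)\.\.(\d+)", text)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return int(match.group(1)), int(match.group(2)), strand

start, end, strand = parse_location("complement (226614..228458)")
length = end - start + 1  # inclusive coordinates, so add one
print(start, end, strand)
print(length, "bases;", length // 3, "codons; remainder", length % 3)
```

For this CDS the length works out to 1845 bases, a whole number of codons, which is a quick sanity check on the coordinates.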

I scroll through my file in Notepad and find the DNA and then the location of the CDS. It is possible then to highlight the CDS and copy the needed DNA sequence. Yes, this is the very low tech solution for doing this. There are other ways and programs that will achieve the same thing for you, but I wanted to show what can be done, by virtue of sequence files being plain text and therefore easy to manipulate on any computer.
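The copy and paste approach works on any computer, but the same step can also be scripted. The sketch below is one hedged way to do it in plain Python with no extra libraries: it assumes the DNA sits between an ORIGIN line and a closing // line, as in a standard GenBank flat file, and the toy record and function names are invented for illustration.

```python
def read_origin_sequence(genbank_text: str) -> str:
    """Collect the DNA listed under ORIGIN, dropping line numbers and spaces."""
    bases = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break  # end of the record
        if in_origin:
            bases.extend(ch for ch in line.lower() if ch.isalpha())
    return "".join(bases)

def slice_region(sequence: str, start: int, end: int) -> str:
    """Extract a region using 1-based, inclusive GenBank coordinates."""
    return sequence[start - 1:end]

# Toy record (hypothetical), with a CDS on the reverse strand at 4..12
toy_record = """LOCUS       TOY
ORIGIN
        1 aaatcaggcc atttt
//
"""

dna = read_origin_sequence(toy_record)
region = slice_region(dna, 4, 12)
print(region)
```

The extracted region is still in the orientation of the file, so a CDS annotated as "complement" will begin with the reverse complement of its stop codon, just as in the highlighted Notepad text.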

Note that the first bases highlighted are TCA, which in reverse complement are TGA, a termination codon or stop codon. The last three bases are CAT, which would be the ATG reverse complemented initiation codon of the CDS. This is a good way to double check that you have the right area highlighted; check that you start and stop with an initiation and termination codon and that these are on the strand you expected based on the annotation.

Screenshot of a portion of a GenBank format of sequence data showing the DNA, with a portion highlighted in blue representing the CDS under investigation.

If this was on the forward strand, I could paste it straight into a new Notepad file and make it a FASTA file of my own. But, it is in reverse complement, so I need to fix that first. There is a handy site to do this, which has not changed in many years: https://www.bioinformatics.org/sms/rev_comp.html. This does the basic operation needed.

Screenshot of the web interface for Reverse Complement on-line program to convert DNA to its reverse complement. Shown is a window to paste in the sequence of interest, a button to Submit for conversion, and a button to Clear the entry.

All of the non-DNA characters (the numbers from the annotated file) are deleted during the process. This is good. The output is the DNA in the right orientation, from the ATG start to the TGA stop.
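The same operation is easy to sketch in Python, for anyone who wants to see what the web tool is doing. This is my own minimal version, not the code behind the Sequence Manipulation Suite. It assumes the sequence contains only A, C, G, and T (ambiguity codes such as N would be silently dropped), and the start codon check assumes ATG, as in the CDS discussed here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def clean_dna(raw: str) -> str:
    """Keep only A, C, G, T; this drops line numbers, spaces, and newlines."""
    return "".join(ch for ch in raw.upper() if ch in "ACGT")

def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]

def looks_like_cds(dna: str) -> bool:
    """Check for an ATG start, a stop codon at the end, and whole codons."""
    return (
        len(dna) % 3 == 0
        and dna.startswith("ATG")
        and dna[-3:] in STOP_CODONS
    )

# Toy input (hypothetical), pasted with a line number as in a flat file
pasted = "       61 tcaggccat"
cds = reverse_complement(clean_dna(pasted))
print(cds)
print(looks_like_cds(cds))
```

The toy input starts with TCA and ends with CAT, so after conversion it runs from ATG to TGA, the same pattern used above to confirm that the right region was highlighted.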

Screenshot of the output from Reverse Complement on-line DNA conversion.

The sequence is copied out of the Sequence Manipulation Suite output into a new Notepad file and made into a FASTA file through my addition of a line at the top that starts:

>Blog CDS for restriction digestion

Now that I am all set with the sequence of a gene – or in this case a sequence that has been predicted to be a gene – I can see which restriction enzymes might cut it.
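Making the FASTA file by hand in Notepad is all that is needed, but the format is simple enough to sketch in code: one header line starting with ">" and then the sequence. The function name, the 60-character line width, and the short stand-in sequence below are my own choices for illustration.

```python
def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as FASTA text, wrapping the sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"

# Short stand-in sequence; the real CDS would be pasted in here instead
record = to_fasta("Blog CDS for restriction digestion", "ATGGCCTGA", width=6)
print(record, end="")
```

The resulting text can be saved as a plain text file and pasted or uploaded wherever a FASTA sequence is requested.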

The recommendation from Chapter 16 is to try NEBcutter:

Screenshot of NEBCutter v2.0 on-line web-based interface. Input options include selecting a file, accession number, or pasting a sequence into the space provided, before submitting the sequence to identify restriction enzyme cut sites.

I paste my sequence into the box to see which NEB enzymes will cut and get this output:

Screenshot of the NEBCutter output for the gene of interest, showing a graphical representation of the locations of restriction enzyme cut sites along the length of the sequence. There are links for other options at the bottom of the page.

So, that’s the first part of Discussion topic 16.15 done. I have identified the enzymes that would cut the sequence. This on-line tool gives me a graphical output of the length of the CDS and shows where along it the various enzymes would cut. There’s more investigation that can be done from here, including zooming in and refining what is shown.

The Discussion topic wants me to identify those restriction enzymes that cut the CDS only once. This is easily done here on the NEBcutter tool. There is a link under the List heading that says “1 cutters”. I press that and all of the single cutters are listed, either alphabetically or by cut position.
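To show what a "1 cutters" list involves, here is a small Python sketch that counts recognition sites in a sequence and keeps the enzymes that occur exactly once. It is not how NEBcutter itself is implemented. The four enzymes are a tiny hand-picked set, the toy sequence is invented, and the positions reported are where each recognition site starts rather than the cut positions NEBcutter lists.

```python
import re

# Recognition sites written 5' to 3'. All four are palindromic, so
# searching one strand is enough to find sites on both strands.
SITES = {
    "XbaI": "TCTAGA",
    "HindIII": "AAGCTT",
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
}

def site_positions(sequence: str, site: str) -> list:
    """1-based start positions of every occurrence, overlaps included."""
    pattern = "(?=" + re.escape(site) + ")"
    return [m.start() + 1 for m in re.finditer(pattern, sequence.upper())]

def single_cutters(sequence: str, sites: dict = SITES) -> dict:
    """Map enzyme name to site position for enzymes that occur exactly once."""
    found = {}
    for name, site in sites.items():
        positions = site_positions(sequence, site)
        if len(positions) == 1:
            found[name] = positions[0]
    return found

# Toy sequence (hypothetical): one XbaI site, two EcoRI sites, one HindIII site
toy = "ATGTCTAGAGGGAATTCCCGAATTCAAGCTTTGA"
print(single_cutters(toy))
```

Sorting the result by position gives the same view as changing the NEBcutter list from alphabetical order to cut position.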

The next task that the Discussion topic sets is to find enzymes that meet the following criteria:

  • One enzyme that cuts at the start of the gene and generates sticky ends
  • One enzyme that cuts at the end of the gene and generates sticky ends
  • Both of the identified enzymes need to work at the same temperature
  • Both of the identified enzymes need to work in the same buffer

To achieve the end goal I will change the sort order from alphabetical to cut position. Looking through the information on the first few, I am going to look first at XbaI. Why? This is an enzyme that I recognize and I think we have some in the freezer, so that will save on time and cost to use something we already have. I could pick anything and order something new, but I might as well use an enzyme from the freezer if it works for the experiment. Here the XbaI cuts T CTAGA with a 5’ overhang at 37°C in CutSmart Buffer at position 232/236.

Now I need one at the other end. There HindIII catches my eye, again as something we likely have in the freezer. It cuts A AGCTT with a 5’ overhang at 37°C in NEBuffer 2.1 at position 1580/1584. That doesn’t look great, being in a different buffer from XbaI, and at first I think I might need to look for a different enzyme on the list. However, there is more information on each enzyme on the product list page and I’ve just been looking at a summary.

Delving deeper, I see that HindIII has 100% activity in NEBuffer 2.1, but only 50% in CutSmart Buffer, so no help there. However, checking the product page for XbaI, I find out that it has 100% activity in CutSmart Buffer, but it also works at 100% in NEBuffer 2.1. So in NEBuffer 2.1 at 37°C I can do a double digest of the CDS under investigation with XbaI and HindIII, which will generate incompatible sticky ends and delete out 1348 bases of the coding region. I could then also digest an antibiotic resistance cassette marker with XbaI and HindIII to join with the cut ends and make selection of deletion mutants possible. Homologous recombination between the flanking regions brings the resistance marker into the chromosome, deleting the gene in the process, and the bacterial cells that are resistant to the antibiotic are those that have the gene deleted.
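The buffer reasoning above can be written down as a short Python sketch. The activity percentages are only the ones quoted in this blog, not values pulled from a live catalogue, so check the current product pages before planning a real digest. The function name and the 100% threshold are my own choices.

```python
# Percent activity per buffer, as quoted in the text above
ACTIVITY = {
    "XbaI": {"CutSmart": 100, "NEBuffer 2.1": 100},
    "HindIII": {"CutSmart": 50, "NEBuffer 2.1": 100},
}

def shared_buffers(enzymes, minimum=100):
    """Buffers in which every listed enzyme reaches the minimum activity."""
    candidates = set.intersection(*(set(ACTIVITY[e]) for e in enzymes))
    return sorted(
        buffer for buffer in candidates
        if all(ACTIVITY[e][buffer] >= minimum for e in enzymes)
    )

print(shared_buffers(["XbaI", "HindIII"]))

# Size of the region removed by the double digest, from the cut positions
xbai_cut, hindiii_cut = 232, 1580
print(hindiii_cut - xbai_cut)
```

With these numbers the only buffer giving both enzymes full activity is NEBuffer 2.1, and the distance between the two cut positions matches the 1348 bases mentioned above.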

All of the digestion, ligation, and generation of the mutants would, of course, happen in a lab, but the planning of the experiments, such as investigating the presence of the restriction digest sites, can be done outside of the lab. Good planning of experiments is essential to making sure that experiments work well. Take the time now to plan experiments carefully. Think through protocols, write them out, and ensure they are really robustly planned, so that when you do go into the lab you have done everything you can to ensure success.

Figure extracted from Bacterial Genetics and Genomics showing insertion of a resistance marker into a gene of interest to generate a construct, which is then transferred into the chromosome to generate a knockout mutant.

There are several other ways to achieve knock-outs and generate other mutations that are described in Bacterial Genetics and Genomics, as well as other uses for restriction enzymes. There are also great resources available on-line to support your use of the book, including slides with the figures from the book, like the one above, and flashcards to assist with learning terms.

That’s it for my blogs for Chapter 16, Gene Analysis Techniques. Next month I will get started on Chapter 17, Genome Analysis Techniques in my continuing theme on supporting research outside of the lab.

More research outside of the lab, protein structure predictions

Bacterial Genetics and Genomics book Discussion Topic: Chapter 16, question 14

Continuing on from the blog post last month, I am keeping on the topic of research that can be done at home, or at the computer, without needing to do experiments in the lab. Quite a lot of genetics and genomics research today involves the investigation of data and analysis of that data using computational approaches. This requires a lot of care, time, and attention at the computer, so there is plenty that can be done outside of the lab to advance our research.

This week, I have drawn from a Discussion Topic question at the end of Chapter 16 in Bacterial Genetics and Genomics. This chapter focuses on gene analysis techniques. This second Discussion Topic asks us to look at what we can learn about the structure of a bacterial protein from just its amino acid sequence.

As we know from the rest of the book, the amino acid sequence itself is based on the codons encoded in the DNA sequence. These form the string of amino acids that we can read on a page, but this is not how the amino acids are present in the protein. Those amino acids, joined together by peptide bonds, are folded and twisted in upon each other, to form a three-dimensional structure, maybe on its own, maybe with other copies of the same protein, or maybe with other proteins.

Using the same amino acid sequence that I used in last month’s blog, I am going to see if I can find a structure for the hypothetical protein that I investigated. Since it is hypothetical, it is highly unlikely that anyone has crystallised and experimentally determined the structure of this specific protein; it is not likely to have been previously investigated, since the function hasn’t been determined. But there may be another protein that is similar to it for which a structure is known.

As for last month, I have the protein sequence in FastA format, which has a first line starting with “>” followed by some information about the sequence, and then the sequence data starting on the second line:

>Hypothetical protein for blog post analysis


One strategy is to do a BLASTP search. You may be thinking – but Dr. Snyder, you did that last month. Yes, I did, but this time, I will alter the settings somewhat.

On the BlastP screen, I have pasted my FastA format sequence into the query field. In the Choose Search Set, rather than searching the Non-redundant protein sequences (nr) as I did last month, today I am searching the Protein Data Bank proteins (pdb). This is a repository of 3D protein structures and other large biological molecules.

Screenshot image of BlastP landing page. Hypothetical protein sequence has been entered as the Query Sequence and Protein Data Bank has been chosen as the Search Set from a pull down menu.
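For anyone who would rather script this step than use the web form, here is a rough sketch of the same idea: read the FastA text and put together the parameters for a BlastP search against the Protein Data Bank set. The function names are my own. The parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`) are those of the NCBI BLAST URL API as I understand it, and `pdb` as the database name is my assumption for how the web form's choice is spelled there. The short sequence is a made-up placeholder, not my hypothetical protein, and nothing is actually submitted.

```python
from urllib.parse import urlencode


def read_fasta(text):
    """Split single-record FastA text into (description, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FastA text must start with a '>' description line")
    description = lines[0][1:].strip()
    # Sequence data may be wrapped over several lines, so join them.
    sequence = "".join(lines[1:])
    return description, sequence


def blast_put_query(sequence, database="pdb"):
    """Build the query string for a BlastP submission (assumed parameter names)."""
    params = {
        "CMD": "Put",
        "PROGRAM": "blastp",
        "DATABASE": database,
        "QUERY": sequence,
    }
    return urlencode(params)


# Placeholder record for illustration only.
example = """>Hypothetical protein for blog post analysis
MKTAYIAKQR
QISFVKSHFS
"""

description, sequence = read_fasta(example)
print(description, len(sequence))
print(blast_put_query(sequence))
```

Keeping the parsing separate from the query building means the same `read_fasta` step can feed other tools that expect a bare amino acid sequence.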

When I press BLAST, I get this result. Not great, since there is only one hit and it is to a protein from the alga Euglena gracilis.

BlastP result from the previous image, showing one hit to the Query sequence: Chain G, subunit b from Euglena gracilis.

The E value is terrible at 9.3. The query coverage is only 7%, as is evident in the Graphic Summary tab:

BlastP search results showing the Graphic Summary. This displays the E. gracilis hit region of similarity against the length of the Query. There is a very short black bar visible near 400 where the sequences align.

The small black bar under the 400 position is the area where there is some similarity between my hypothetical protein and the E. gracilis protein hit. This is the alignment, which shows just how few amino acids align between the two proteins.

BlastP search results showing the Alignment between the E. gracilis hit and the Query. A short region of the Query is shown from 396 to 444 amino acids that have some similarity to the E. gracilis protein sequence (35% identity, 19/54).
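The numbers in this alignment can be checked with a little arithmetic. The sketch below uses the values reported by BlastP (19 identical positions over an alignment of 54 columns, query positions 396 to 444) together with the query length of 614 that PredictProtein reports later in this post. The helper name `percent` is my own.

```python
def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


identical, alignment_length = 19, 54   # from the BlastP alignment
query_start, query_end = 396, 444      # aligned region of the query
query_length = 614                     # full length of the query protein

identity = percent(identical, alignment_length)
query_span = query_end - query_start + 1
coverage = percent(query_span, query_length)

print(f"identity: {identity:.2f}%")               # identity: 35.19%
print(f"query residues aligned: {query_span}")    # query residues aligned: 49
print(f"coverage: {coverage:.2f}%")               # coverage: 7.98%
```

The identity matches the 35.19% in the results table. The coverage comes out at just under 8%, where BlastP shows 7%; I take that to be the same figure rounded down, although I have not checked how BlastP rounds. Note also that only 49 query residues sit in an alignment of 54 columns, so the remaining columns in the query row must be gaps.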

There might possibly be enough similarity in this region to suspect that the structure of this part of the protein, maybe the folds involved there, might be similar, but I’d be much happier if I had come across the structure of something with much closer similarity.

However, for the purposes of illustration in the blog, let’s have a look. Returning to the Descriptions tab, there is information about the E. gracilis hit, including the Accession number 6TDV_G.

BlastP results shown previously with the one E. gracilis hit. Data for the hit includes Max. Score 28.9, Total Score 28.9, Query Cover 7%, E value 9.3, Per. Ident 35.19%, and Accession 6TDV_G.

Clicking on this link takes me to the entry for this sequence and structure data.

Accession 6TDV_G entry for E. gracilis Chain G, subunit b protein. The GenPept format entry is shown to the left and an image of a 3-dimensional protein structure is in the right margin under the heading Protein 3D Structure.

Clicking on the Protein 3D Structure picture at the right brings me to the 3D model, which is available in formats that mean I can set it to spin and show the full 3D rotation display in a full-featured 3D viewer.

Cryo-EM structure of E. gracilis mitochondrial ATP synthase, membrane region. Page for Accession 6TDV_G includes a detailed description of the whole protein of 29 subunits and shows the 3D protein structure.

Since there is so little similarity and since what little similarity there is matches a small portion of this larger structure, I am going to leave that bit of analysis and try something else. You might have noticed from the BlastP results that on the Graphic Summary tab there was some additional information. Note where it says: Putative conserved domains have been detected, click on the image below for detailed results. This is generated because when a BlastP is run, it also runs a Conserved Domain search (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). This result is not always present; it only shows up when the Conserved Domain search finds a conserved domain in the query protein sequence.

BlastP result Graphic Summary display, as previously. Above the previously noted graphic showing where in the Query protein the hit aligns, is the graphic indicating there have been Conserved Domains detected.

Clicking on the image for the conserved domains, I can see that there is a UvrD-like helicase C-terminal domain.

Conserved Domain search results. Listed are the protein domains identified, here UvrD_C_2 and a short description “UvrD-like helicase C-terminal domain. This domain is found at the C-terminus of a wide variety…”. Descriptions are expanded by pressing the [+] to the left of the name.

To learn more, I click on the [+] next to the name of the domain. The additional information tells me that this domain is found at the C-terminus of a wide variety of helicase enzymes and that the domain has an AAA-like structural fold. This may fit with some of the PSI-BLAST results from last month, which hit on some AAA family ATPases, and should be investigated further to find out about these types of proteins and the importance of this domain and its structure.

Out of curiosity, to see if any additional information can be yielded about the protein from other sources, I tried the PredictProtein search (open.predictprotein.org). This web-based search includes “whatever can reasonably be predicted from protein sequence with respect to the annotation of protein structure and function.” Because it does so many analyses, this took a while, but there were some results that came from it.

PredictProtein results for the query hypothetical protein. There are several layers of graphical output displayed horizontally, followed by a summary that indicates the Sequence Length is 614 and the Number of Aligned Proteins is 31. A pie chart displays the Amino Acid composition.

In the blue message bar at the top it says, “What am I seeing Here? This viewer lays out predicted features that correspond to regions within the queried sequence. Mouse over the different coloured boxes to learn more about the annotations.” Doing that across the boxes above the solid blue lines, where there is a row of red and blue boxes, I find that the red boxes indicate potential helices and blue boxes are potential strands. So, that gives us some structural information already, which will be based on the potential of the amino acid sequence and the properties of the side chains of those amino acids. The next line down has yellow and blue boxes. The blue boxes here are regions of the protein that are predicted to be ‘exposed’ as in they are surface exposed on the 3D protein, while the yellow boxes are those that are ‘buried’ within the protein once it is folded. It is interesting to compare the data from the first line with the second here.

There is a lot to explore here. Clicking on the link Secondary Structure and Solvent Accessibility in the menu on the left shows that information in more detail.

PredictProtein result for Secondary Structure and Solvent Accessibility. This focuses on some of the data shown in the previous image, where red and blue blocks in one row show predicted helices and strands and the next row’s yellow and blue blocks show predicted buried and exposed regions. Pie charts are here for these features. On the left is Secondary Structure Composition with strand, helix, and loop wedges. On the right is Solvent Accessibility with exposed, buried, and intermed.
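Pie charts like these are simply the per-residue predictions counted up. As a rough sketch of that idea, assuming the prediction is available as one letter per residue (H for helix, E for strand, L for loop), the composition can be worked out as below. The string here is invented for illustration and is not the PredictProtein output for my protein, and `composition` is a name I have chosen.

```python
from collections import Counter


def composition(states):
    """Return the percentage of residues in each predicted state."""
    counts = Counter(states)
    total = len(states)
    return {state: 100.0 * n / total for state, n in counts.items()}


# Made-up per-residue calls: H = helix, E = strand, L = loop.
secondary = "LLHHHHHHLLEEEELLHHHH"

for state, share in sorted(composition(secondary).items()):
    print(f"{state}: {share:.0f}%")
```

The same function works for a solvent accessibility string, for example with one letter each for exposed, buried, and intermediate residues.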

Clicking on Subcellular Locations from the left menu gives a prediction of where the protein might be located in the cell. I need to select the Bacteria tab here, because the default assumption is Eukarya, but the analysis has been done and the results are there waiting for me. Since the results so far suggest that this hypothetical protein might have enzymatic activity, it is not surprising that the location prediction is ‘cytoplasm’.

PredictProtein results for Subcellular Location. The Bacteria tab has been selected. A graphic of a rod-shaped bacterial cell is shown. Below this is written, “Predicted localization for the Bacteria domain: Cytoplasm (GO term ID: GO:0005737) Prediction confidence 70

In each case where a prediction is made the evidence is presented and there are references cited at the bottom of the page for the tools used to generate the predictions, so that both the PredictProtein and original tools references can be cited in any publications that might result from research using this web resource.

There are a variety of additional tools that can be used on protein sequences to analyse them and perhaps understand something more about the sequence. To see if perhaps I can understand any more about the possible structure of this protein, I decided to try Phyre2 (www.sbg.bio.ic.ac.uk/phyre2), Protein Homology/analogY Recognition Engine V 2.0. This uses remote homology detection methods combined with analysis of the primary amino acid sequence data to construct a 3D protein structure.

Home page for Phyre2 where users can enter their e-mail address, optional job description, amino acid sequence, modeling mode (normal or intensive), and tick as appropriate being ‘not for profit’, ‘for profit (commercial)’, or ‘other’. There is a Search button and Reset button.

Again the results took some time to generate. Remember, these are computationally intense processes being done at PredictProtein and Phyre2, so be patient. In fact, Phyre2 gives users the option of running the modelling in Normal or Intensive Mode. I chose Normal for the sake of time, but for research, I would likely go back and do Intensive. In the end it was worth waiting for the results, because I got a lovely image of a potential protein structure to associate with my hypothetical protein of interest. It can be viewed in 3D mode as well, so I can move it around with my mouse and have a look at the structure from all angles.

Phyre2 results from hypothetical protein query. A protein structure is shown on the left. On the right is information for the top model: Model (left) based on template c3gp8A. Top template information is also displayed, including: PDB header: hydrolase/dna and Confidence 100% and Coverage 47%. There is a link for Interactive 3D view in JSmol.

More models are presented farther down the page in a table, displayed in order of decreasing confidence scores. Beside each is a graphic indicating the portion of the input protein sequence that has been represented by the model.

Phyre2 results displaying additional models in a table. Column 1 numbers the models, column 2 gives the Template for the model, column 3 is a graphic of the Alignment Coverage, column 4 is images of 3D models, column 5 is Confidence as a percentage, column 6 is the percent i.d., and column 7 has template information.

All of the results I have been looking at in this blog are predictions. The protein, when made in the bacterial cell, may fold very differently from these predictions and it should be remembered that biologically protein structures can and do change due to a variety of factors like temperature, substrate binding, and phosphorylation. However, prediction can be used as a guide for experiments and investigations. If, for example, I was investigating a gene containing a SNP, which changed an amino acid in the encoded protein, I might want to know where that amino acid was located in the final protein structure. Predictions like these might help identify the location of the changed amino acid. Is it embedded inside a membrane? Is it buried within the folded protein? Or is it prominently on the surface of the protein where it might be important for interacting with other proteins or within what is believed to be the active site of an enzyme where it is involved in the binding of substrate?
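To make the SNP example a little more concrete, here is a minimal sketch of looking up which predicted features cover a changed amino acid. The regions and the position are entirely hypothetical, chosen only to show the lookup, and `features_at` is my own name for it; real regions would come from the prediction tools described above.

```python
def features_at(position, regions):
    """Return the labels of every region covering a 1-based residue position.

    Each region is (start, end, label) with start and end inclusive.
    """
    return [label for start, end, label in regions if start <= position <= end]


# Hypothetical predicted regions for illustration only.
regions = [
    (1, 40, "exposed"),
    (41, 75, "buried"),
    (30, 60, "helix"),
]

# A SNP that changes residue 52 would fall in a buried part of a helix.
print(features_at(52, regions))  # ['buried', 'helix']
```

Since these are predictions, the answer is a prompt for further investigation and not a conclusion about the real protein.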

I hope that this blog and the one before have been useful in demonstrating some of the tools available for doing research outside of the lab. This theme will continue next month when I tackle the last discussion topic of Chapter 16 and investigate restriction enzyme digest sites.

Top structure model from Phyre2.
