Genomics euphoria: ramblings of a scientist on genetics, genomics and the meaning of life

Tag Archives: high-throughput sequencing

Letters from the trenches of war on cancer (Part I)

As I get older, cancer ceases to be a mere scientific curiosity and morphs into a harsher reality. As our parents start to worry about every mole and lump, we accompany them through the ensuing emotional roller coaster. Working close to a hospital does not help either… while the tumor samples you see every day are assigned random numbers, it is quite impossible not to see the human suffering behind every biopsy. While I still firmly and deeply believe that it is ultimately basic research that can revolutionize health and medicine, I can also sense the urgency of now and the need to act on that front. It is this dichotomy that has shaped my research for the past few years, the fruits of which are finding their way into the annals of science.

It is not news to anyone that I study the biology and regulation of RNA (see the two previous posts on this very blog: here and here). I have specifically focused on developing computational and experimental frameworks that help reveal the identity of post-transcriptional regulatory programs and their underlying molecular mechanisms. Towards the end of my tenure as a graduate student, building upon the work of talented postdocs in the Tavazoie lab at Princeton University (namely Olivier and Noam, who published their work back in 2008) and with the help of my genius friend Hamed, we developed, benchmarked and validated a computational method named TEISER, which extends motif-finding algorithms into the world of RNA by taking into account the local secondary structure of RNA molecules as well as their sequence.

When I started out as a postdoc, my goal was to study post-transcriptional regulation using cancer metastasis as a model. In addition to its clinical impact, studying metastasis also has the added benefit of access to a large compendium of high-quality datasets as well as rigorous in vivo and in vitro models for downstream validation of interesting findings.

When it comes to tumorigenesis in general, there is a large body of work focusing on the role of transcriptional regulation, specifically transcription factors as suppressors and promoters of oncogenesis. However, other aspects of the RNA life-cycle are substantially understudied. The success of our lab and many others in revealing novel and uncharacterized regulatory networks, based on the action of various miRNAs in driving or suppressing metastasis, highlights the possibility that heretofore uncharacterized post-transcriptional regulatory programs may play instrumental roles in tumorigenesis.

Given the success of miRNA regulation and my previous work on RNA stability, performing differential transcript stability measurements between highly metastatic cells and their poorly metastatic parental populations seemed like a logical step. Using thiouridine pulse-chase labeling and capture followed by high-throughput RNA-seq, we estimated decay rates for every detectable transcript (~13,000 transcripts in total). It was around this dataset that we built an ambitious study, pushing ourselves to dig deeper at every step. We generated, analyzed, and interpreted heaps of data of various kinds: in silico, in vitro, and in vivo. The result of this study was the discovery of a novel post-transcriptional regulatory program that promotes breast cancer metastasis. Our results were recently published in Nature; however, I also gained insights that could not be included in a four-page paper. As such, in the upcoming posts, I’ll try to expand on various aspects of this study that I found fascinating. Stay tuned…
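For readers curious about the decay-rate step: with pulse-chase labeling, the labeled fraction of a transcript decays roughly exponentially over the chase, so a per-transcript decay constant can be recovered from a log-linear fit across time points. A minimal sketch of that idea (the function name and numbers are illustrative, not from the paper):

```python
import math

def estimate_decay_rate(timepoints, abundances):
    """Fit A(t) = A0 * exp(-k * t) by least squares on log-abundance.

    timepoints: chase times (e.g. hours); abundances: labeled-RNA
    levels at those times. Returns the decay constant k, so the
    half-life is ln(2) / k.
    """
    logs = [math.log(a) for a in abundances]
    n = len(timepoints)
    t_mean = sum(timepoints) / n
    l_mean = sum(logs) / n
    # slope of log-abundance vs. time is -k
    num = sum((t - t_mean) * (l - l_mean) for t, l in zip(timepoints, logs))
    den = sum((t - t_mean) ** 2 for t in timepoints)
    return -num / den

# Toy transcript losing half its label every 2 hours
times = [0, 1, 2, 4, 8]
levels = [100 * 0.5 ** (t / 2) for t in times]
k = estimate_decay_rate(times, levels)
half_life = math.log(2) / k  # recovers ~2.0 hours
```

In practice one would also weight by read depth and propagate measurement noise, but the log-linear fit is the core of the estimate.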

RNA Structurome

The weekly or monthly updates that appear in my e-mail account from the various journals I subscribe to serve as a reminder that every single day we are expanding our knowledge and adding to the repertoire of scientific conquest. Sometimes reading these papers, however, is a chore… Not every paper is well structured, not every project deserves the attention it receives, and not every study stands the test of time. Every now and then, however, I read papers that leave a profound mark on how I view biological systems. These studies are not necessarily large-scale or even complex, but the mere act of reading them changes my way of thinking. The transformation may be nuanced or not even noticeable, but the effects remain… for a while. If pressed, each scientist would come up with a unique collection of such publications–what we find exciting is ultimately a subjective matter–but I think we all, to some extent, can appreciate the underlying attraction.

The late-January issue of Nature carried a few papers of this type for me. Rouskin et al. and Ding et al. reported the use of DMS (dimethyl sulfate) modification of exposed ribonucleotide bases, coupled with high-throughput sequencing, to provide a snapshot of RNA structural preferences in vivo (in yeast, mammalian cells, and Arabidopsis). Despite the need to overcome certain technical hurdles, the methods themselves are logical extensions of methods published previously for low-throughput and in vitro RNA structure determination. What I found intriguing, however, was how Rouskin et al. turned their observations into an actionable hypothesis. Given the nature of the data they had gathered, this paper could easily have turned into a descriptive publication. But the authors took a step further and put forth a hypothesis that best explained the major trends in their data. I am confident it would have been easier for them not to do so… I am also confident that, because of this hypothesis, they had a harder time convincing the reviewers than they would have otherwise. But they clearly didn’t shy away from going where the data had taken them, and they should be applauded for it. They put this hypothesis front and center; early on in their paper they state:

“Comparison between in vivo and in vitro data reveals that in rapidly dividing cells there are vastly fewer structured mRNA regions in vivo than in vitro. Even thermo-stable RNA structures are often denatured in cells, highlighting the importance of cellular processes in regulating RNA structure. Indeed, analysis of mRNA structure under ATP-depleted conditions in yeast shows that energy-dependent processes strongly contribute to the predominantly unfolded state of mRNAs inside cells.”

For me, it all comes down to the phrase: “the importance of cellular processes in regulating RNA structure.” We have read about numerous examples where RNA structure acts as a cis-regulatory element in RNA biology; however, thinking of RNA structure itself as an intermediate target of regulatory programs on a whole-transcriptome level is very intriguing. I always suspected this much, but reading this sentence just toggled a switch in my head–in a good way.
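One way to make “structuredness” concrete from this kind of probing data is a summary statistic such as the Gini index over a window of per-base DMS reactivities, in the spirit of the analysis in Rouskin et al.: an unstructured region yields a uniform, low-Gini signal, while a structured region concentrates the signal on the few exposed bases. A toy sketch:

```python
def gini(values):
    """Gini index of non-negative per-base reactivities.

    0 = perfectly uniform signal (every base equally exposed,
    i.e. unstructured); values approaching 1 = signal concentrated
    on a handful of accessible positions (structured).
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard formula: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

uniform = [1.0] * 10        # unstructured: every base reactive
spiky = [0.0] * 9 + [10.0]  # structured: one exposed base
assert gini(uniform) < gini(spiky)
```

Computing this statistic per window for in vivo versus in vitro samples is one simple way to quantify the "vastly fewer structured mRNA regions in vivo" observation.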


DMS signal in RPL33A mRNA shows a region that is unstructured in vivo but forms a stable structure in vitro (Rouskin et al, 2014).

Based on their own DMS-based data, Ding et al. similarly report:

“…mRNAs of cold and metal ion stress-response genes folded significantly differently in vivo from their unconstrained in silico predictions (Fig. 4c, d and Extended Data Fig. 8a, b). Interestingly, these stresses are known to affect RNA structure and thermostability.”

This statement, despite being more descriptive, tells a similar story. And I think this is a very important hypothesis. Understanding RNA structure as a dynamic phenomenon in the cell, with far-reaching regulatory consequences, and not just a byproduct of the thermodynamics coded within the sequence, opens up a new field of research: studying the transcriptome-wide effects of factors that modulate RNA structure and the functional consequences that follow.

I should also mention that in the same issue, a study by Howard Chang, Eran Segal and colleagues reported:

“Comparison of native deproteinized RNA isolated from cells versus refolded purified RNA suggests that the majority of the RSS [RNA secondary structure] information is encoded within RNA sequence.”

On the surface, this statement contradicts those reported by the Weissman lab. However, this latter study used deproteinized RNA, and as Rouskin et al. have clearly stated: “analysis of mRNA structure under ATP-depleted conditions in yeast shows that energy-dependent processes strongly contribute to the predominantly unfolded state of mRNAs inside cells.” So the observation made by Wan et al. is likely a consequence of the in vitro nature of their study. If the differences between in vivo and in vitro RNA secondary structures turn out to be as pervasive as Rouskin et al. suggest, we will need to rethink how much stock we are willing to put into the descriptive studies that have reported on RNA structure using in vitro methods.


  1. Rouskin et al., 2014. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature 505, 701–705.
  2. Ding et al., 2014. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature 505, 696–700.
  3. Wan et al., 2014. Landscape and variation of RNA secondary structure across the human transcriptome. Nature 505, 706–709.

The rise of circular RNAs: a whole swath of circular Spongebobs

Recently, we’ve been bombarded by high-profile studies about a class of RNAs called circular RNAs (circRNAs). Resulting from non-canonical splicing events (see below), circRNAs seem to be more prevalent than previously thought. They have been identified in mammals, plants and even archaea.

The formation and identification of circRNAs

The recent papers in Nature (Memczak et al. and Hansen et al.) argue for a broad, even tissue-specific, functionality for these types of RNA. Memczak et al. report a comprehensive atlas of thousands of circRNAs in various organisms through a computational approach, to which they assign an impressive 75% sensitivity and a very low false-discovery rate.
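The computational core of circRNA discovery is finding reads that span a head-to-tail (back-splice) junction, where the start of the read maps downstream of its end. A toy sketch of that idea (exact string matching only; the real pipeline in Memczak et al. uses anchor alignments with mismatch tolerance and splice-signal checks):

```python
def find_backsplice(read, genome, anchor=8):
    """Toy detector for head-to-tail (back-splice) junction reads.

    In linear splicing, the 5' half of a read maps upstream of the
    3' half. A circRNA junction read shows the reverse: its start
    maps *downstream* of its end.
    """
    head, tail = read[:anchor], read[-anchor:]
    h = genome.find(head)
    t = genome.find(tail)
    if h == -1 or t == -1:
        return None
    # head maps after tail -> candidate circular junction
    if h > t + anchor:
        return (t, h + anchor)  # putative circle boundaries
    return None

# Toy circle: exon "ACGTACGTGGGGCCCCTTTT" circularized, so a read can
# run off the end ("...TTTT") straight into the start ("ACGTACGT...")
exon = "ACGTACGTGGGGCCCCTTTT"
junction_read = exon[-8:] + exon[:8]
print(find_backsplice(junction_read, exon))  # → (0, 20)
```

A read drawn from the linear transcript (e.g. `exon[:16]`) returns `None`, since its halves map in the expected order.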

circRNA statistics according to Memczak et al.

The notably high stability of these RNAs, according to these authors, puts them in a perfect position to function as post-transcriptional regulators by sponging other trans-acting regulatory factors. They focused on miRNA sites to find circRNAs that show a higher-than-expected occurrence of these elements. And they in fact find circRNAs that can bind and trap miR-7-loaded RISC, results that are corroborated in other recent papers.
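Sponge-hunting essentially boils down to counting miRNA seed-complementary sites, with the twist that on a circle a site can span the back-splice junction. A toy sketch using a hypothetical 7-mer seed (DNA alphabet for simplicity; not the actual miR-7 sequence):

```python
def count_seed_sites(circ_seq, mirna_seed):
    """Count matches to a miRNA seed in a circular RNA.

    The target site is the reverse complement of the seed (miRNA
    pairing is antiparallel). Because the molecule is a circle, we
    also scan across the back-splice junction by appending the first
    few bases of the sequence to its end.
    """
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    site = "".join(comp[b] for b in reversed(mirna_seed))
    # unroll the circle past the junction
    unrolled = circ_seq + circ_seq[: len(site) - 1]
    return sum(
        1
        for i in range(len(circ_seq))
        if unrolled[i : i + len(site)] == site
    )

# Hypothetical 7-mer seed; its reverse-complement site is GTCTTCC
seed = "GGAAGAC"
circ = "GTCTTCC" + "AAAA" + "GTCTTCC" + "AAAA"
print(count_seed_sites(circ, seed))  # → 2
```

A circRNA flagged as a sponge would carry many more such sites than expected from its length and base composition alone.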

Personally, I find sponging a very low-complexity function… meaning that it arises after the fact, with the cell taking advantage of non-coding RNAs that are already available. This means that either circRNAs first arose as aberrant splicing events, i.e. mistakes in donor–acceptor identification, or that they or their splicing partners play other, more complex roles that we should be able to identify soon.


Memczak S, et al. (2013). Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. doi:10.1038/nature11928

Hansen TB, et al. (2013). Natural RNA circles function as efficient microRNA sponges. Nature. doi:10.1038/nature11993

Synthesizing a genome like a boss

Despite recent leaps in the artificial synthesis of long custom DNA sequences and the hype surrounding large-scale projects (e.g. the first synthetic genome), the process is far from mainstream. Synthesis of genes or even minigenes is still expensive, slow and tedious. As a lazy scientist who despises cloning and would prefer to synthesize the whole construct at the touch of a button (or a click of a mouse), I am all for methods that advance this area of biotech. So, I am very excited about this paper that recently showed up in Nature Methods (and I think it’s pretty smart).

The rationale here is based on the fact that short-DNA (i.e. oligonucleotide) synthesis, which forms the basis for building longer molecules, is still very error-prone, and finding a molecule without mutations requires directly sequencing many instances of the final product (only a small fraction of which are mutation-free). What the authors have accomplished is to shift the sequencing step to the oligo level. Basically, they tag all the oligos to be synthesized with random barcodes, sequence the whole population using high-throughput sequencing, identify the correct oligos, and use their barcodes to enrich them from the initial population using specific primers and PCR.
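The enrichment logic can be sketched in a few lines: sequence the barcoded pool, keep the barcodes whose attached oligo matches the intended sequence, then use those barcodes as PCR handles. All names and sequences here are illustrative, not from the paper:

```python
def pick_perfect_barcodes(reads, reference):
    """Given sequencing reads of barcoded oligos and the intended
    oligo sequence, return the barcodes attached to error-free
    copies. Those barcodes then serve as primer landing sites to
    PCR-amplify only the perfect molecules out of the pool.

    'reads' is a list of (barcode, observed_sequence) pairs; in the
    actual workflow both come from the same sequencing read.
    """
    good = set()
    for barcode, seq in reads:
        if seq == reference:
            good.add(barcode)
    return good

reference = "ATGCCGTA"
reads = [
    ("BC01", "ATGCCGTA"),  # perfect copy
    ("BC02", "ATGACGTA"),  # synthesis error at position 4
    ("BC03", "ATGCCGTA"),  # another perfect copy
]
print(sorted(pick_perfect_barcodes(reads, reference)))  # → ['BC01', 'BC03']
```

The payoff is that the expensive verification happens once, in bulk, on the whole oligo pool rather than clone by clone.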

I assume companies that synthesize custom DNA can use a single sequencing run to find the correct oligos for multiple independent orders, thus significantly reducing the cost of sequencing. However, high-throughput sequencing is still slow, so I assume this method doesn’t significantly cut the time requirement of the production phase… but I’m not too familiar with the industrial pipeline, and maybe it does help. I think we’ll know soon enough.

Reading an Ancient Genome

Recently, a paper appeared in Science magazine describing a multinational effort to sequence the genome of an archaic individual (an ~80,000-year-old Denisovan girl). It created a fair bit of hype, with news snippets abound (e.g. this one) and a nice Wired blog post. Much of the hype, I think, was warranted, and this study offers a blueprint for how studies are shaped in the age of genomics and whole-genome sequencing. I will first talk about why I think this study tackles an important problem and then move on to the methodology and results.

Following the genetic trail: the tale of the third chimpanzee

Looking back, I think my introduction to human evolution was mainly through an outstanding book by Jared Diamond called “The Third Chimpanzee”. A lot has changed since then (although the book is still very relevant and a very good read). Many more fossils have been discovered around the world, from Lucy (Australopithecus afarensis), dated to about 3 million years ago, to Homo heidelbergensis and Homo erectus specimens dated to less than half a million years. These at times partial fossils tell a convincing, albeit incomplete, story of human evolution. However, it was the Neanderthal Genome Project, reporting the whole-genome sequence of a 38,000-year-old sample from the femur of a Neanderthal specimen, that turned a page on studying the genetics of human evolution. DNA is a vast collection of information, and comparing these collections between different species portrays a more vivid picture of their evolutionary trajectories with outstanding detail. This information goes significantly beyond what we can learn from the shape of the fossils, their dates, and the circumstances of their discovery. It is like finding the black box of a fallen plane: rife with key information that truly shapes our understanding of the events. For example, the Neanderthal Genome Project showed that there had been only limited interbreeding between humans and Neanderthals (Neanderthal-derived sequence makes up roughly 1–4% of non-African genomes), with modest effects on the evolutionary trajectory of the human genome. Our new-found ability to look into DNA information has enabled us to reconstruct evolutionary trajectories with unprecedented resolution. Why is this important? Genetically speaking, I think it is by knowing where we came from that we can learn where we are headed as a species. And we owe this knowledge to recent advances in high-throughput sequencing.


Sequencing old DNA

DNA is one of the key building blocks of life, and its use as genetic material stems, in part, from its surprising stability. Nevertheless, DNA is susceptible to erosion and degradation, which results in very poor DNA quality when extraction is attempted on fossils. Another important point is that conventional methods for preparing samples for high-throughput sequencing rely on double-stranded DNA, while degradation makes single-stranded DNA a significant portion of the population in fossils. Relying on double-stranded methods not only loses this sizable fraction of the DNA, but also enriches for exogenous contaminant DNA from bacteria or even modern humans. For example, in the Neanderthal Genome Project, a significant correlation was observed between the length of a fragment and its similarity to modern humans, implying that long fragments (which come from higher-quality DNA) were in fact contaminants. This is exactly the problem this study tackled: the authors developed a sequencing strategy that starts from single-stranded rather than double-stranded DNA. This method better captures degraded samples, and it is due to this enhancement that they succeeded in producing a rather high-quality sequence from the ancient fossil. They achieved more than 20-fold coverage of the genome on average, meaning that each position in the genome was read about 20 times independently, which significantly increases the accuracy of the sequence. In comparison, the Neanderthal project achieved 1.5-fold coverage. This jump in quality is a testament to the effectiveness of the proposed method for sequencing fossilized DNA.
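The value of 20-fold versus 1.5-fold coverage is easy to quantify with the classic Lander–Waterman model: mean depth is c = NL/G (reads × read length / genome length), and under a Poisson approximation a given base is left unread with probability e^(-c). A quick sketch:

```python
import math

def expected_coverage(n_reads, read_len, genome_len):
    """Mean sequencing depth c = N * L / G (Lander-Waterman)."""
    return n_reads * read_len / genome_len

def fraction_uncovered(mean_depth):
    """Under a Poisson model, a base gets zero reads with
    probability e^(-c)."""
    return math.exp(-mean_depth)

# Why 20x beats 1.5x: the fraction of the genome with zero reads
print(fraction_uncovered(1.5))  # ~0.22 -> roughly a fifth of bases never seen
print(fraction_uncovered(20))   # ~2e-9 -> essentially every base covered
```

At 1.5x, not only is much of the genome missing entirely, but most covered bases are seen only once or twice, making sequencing errors hard to distinguish from real variants; at 20x, errors are outvoted.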

Why does it matter?

This level of coverage and accuracy enables us to make key inferences both about the individual and about the population she came from. While the fossil was really only a partial finger bone and a couple of teeth, the researchers determined the color of her eyes, hair and skin. More importantly, with such accuracy, we can tell apart the maternal chromosomes (those coming from the mother) from the paternal ones (those coming from the father). What can we do with this information? For starters, we can determine whether the parents were close relatives (in this case, they weren’t). However, while the parents were not closely related, their genomes show long stretches of significant similarity. This observation implies that the population in which this girl lived had very low genetic diversity. This can be due to population size: small populations undergo an effect called “bottlenecking”, in which a few individuals shape the genetic makeup of the whole population, resulting in very low diversity. Another important finding is that Denisovans (of which this girl was a member) split from modern humans around 180,000 years ago.
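The “small population, low diversity” effect can be made quantitative with the classic Wright–Fisher result that expected heterozygosity decays by a factor of (1 − 1/(2N)) per generation under genetic drift. A quick sketch (population sizes and generation counts are illustrative):

```python
def expected_heterozygosity(pop_size, generations, h0=0.5):
    """E[H_t] = H_0 * (1 - 1/(2N))^t: under drift, a population of
    effective size N loses a fraction 1/(2N) of its heterozygosity
    every generation, so small (bottlenecked) populations lose
    diversity much faster."""
    return h0 * (1 - 1 / (2 * pop_size)) ** generations

# After 200 generations, a bottlenecked population of 20 has lost
# nearly all of its diversity; a population of 2000 has barely changed.
small = expected_heterozygosity(20, 200)    # ~0.003
large = expected_heterozygosity(2000, 200)  # ~0.48
```

Long runs of similarity between the girl’s maternal and paternal chromosomes are exactly what this kind of diversity loss looks like at the genome level.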

What else?

We can gain important insights by comparing this ancient genome to those of modern humans. Looking at the two genomes side by side, the researchers observed that a significant fraction of the differences affect genes involved in brain development and function. While this may not be a surprising observation (since we already associate modern humans with greater brain size and function), it underscores the potential role of brain morphology and development in the evolution of our species.

The plethora of information and knowledge that we gain from a single sequenced genome is outstanding. While we should be cautious about generalizing findings from a single individual to whole populations across Asia or even the world, there is in fact no other source from which this knowledge can be obtained. Similar studies can portray a detailed picture of our genome and its evolution through the ages. A low-hanging fruit is to use this novel sequencing methodology to sequence other available samples (including the Neanderthal ones).