Genomics euphoria: ramblings of a scientist on genetics, genomics and the meaning of life

Functional impacts of RNA modifications

My good friend and all-around awesome scientist Claudio has been working on a very cool idea, the fruits of which are now available to all. I had the pleasure of being on his team and I’m personally fascinated by the problem. The broad question that Claudio is tackling is “what do RNA modifications do in the cell?” In particular, he’s been focused on the m6A modification and its role in miRNA processing.

But the real story is actually more fascinating. For the longest time, Claudio was looking for the molecular mechanism through which the miRNA mir-126 is down-regulated in highly metastatic cells. A possible answer emerged in the form of METTL3, a gene whose product methylates RNA. Knocking down METTL3 indeed reduced mir-126 levels. But while doing the necessary controls, Claudio noticed that the reduction was not limited to mir-126 but was actually a more global effect impacting a large fraction of miRNAs. It was from this initial observation that his grand hypothesis was formed: RNA methylation (m6A) has a direct role in miRNA biogenesis. And this is where I came in… a quick look at miRNA sequences showed that m6A sites were located close to, but not on, primary miRNA sequences. This suggested that m6A marks could serve as a beacon for recruiting the miRNA processing machinery (specifically the dsRNA-binding protein DGCR8). Claudio then used a series of focused experiments to prove this hypothesis (as much as anything can be proven in science). You can read all about this in his very nice paper in Nature.

However, as is usually the case in science, solving one problem leads to even more questions. As I mentioned earlier, m6A sites are not directly recognized by DGCR8, so there was a missing link between RNA modification and the recruitment of the processing machinery. To approach this problem, Claudio did an IP-mass spec of m6A-modified RNA and found a very good candidate in the form of the ubiquitous RNA-binding protein HNRNPA2B1. What was especially important was that the RGAC motif targeted by METTL3 actually has similarities to the HNRNPA2B1 binding sequence. In fact, the RGAC motif is highly enriched among the binding sites of HNRNPA2B1. Now the question is whether methylating these sequences can impact HNRNPA2B1 binding. In other words, are there sites where modifying the A to m6A will increase affinity for HNRNPA2B1? In a series of experiments (both high- and low-throughput) we showed that this is in fact the case. In general, we observed a broad functional entanglement between METTL3 and HNRNPA2B1. These results were recently published in Cell and I invite everyone to read the paper.

HNRNPA2B1 Is a Mediator of m6A-Dependent Nuclear RNA Processing Events

My thoughts? I think this is just the beginning. There are two points to consider: (i) HNRNPA2B1 is not the only reader of m6A and (ii) m6A is not the only RNA modification. Together, I think these studies and those of other groups on m6A (and RNA modification in general) suggest the birth of a new field of research with broad functional consequences for the regulation of gene expression.

Letters from the trenches of war on cancer (Part I)

As I get older, cancer stops being merely a scientific curiosity and morphs into a harsher reality. As our parents start to worry about every mole and lump, we accompany them through the ensuing emotional roller coaster. Working close to a hospital is not helping either… while the tumor samples you see every day are assigned random numbers, it is quite impossible not to see the human suffering behind every biopsy. While I still firmly and deeply believe that it is ultimately basic research that can revolutionize health and medicine, I can also sense the urgency of now and the need to act on that front. It is this dichotomy that has shaped my research for the past few years, the fruits of which are finding their way into the annals of science.

It is not news to anyone that I study the biology and regulation of RNA (see the two previous posts on this very blog: here and here). I have specifically focused on developing computational and experimental frameworks that help reveal the identity of post-transcriptional regulatory programs and their underlying molecular mechanisms. Towards the end of my tenure as a graduate student, building upon the work of talented postdocs in the Tavazoie lab at Princeton University (namely Olivier and Noam, who published their work back in 2008) and with the help of my genius friend Hamed, we developed, benchmarked and validated a computational method named TEISER that extends motif-finding algorithms into the world of RNA by taking into account the local secondary structure of RNA molecules as well as their sequence.

When I started out as a postdoc, my goal was to study post-transcriptional regulation using cancer metastasis as a model. In addition to its clinical impact, studying metastasis also has the added benefit of access to a large compendium of high-quality datasets as well as rigorous in vivo and in vitro models for downstream validation of interesting findings.

When it comes to tumorigenesis in general, there is a large body of work focusing on the role of transcriptional regulation, specifically transcription factors as suppressors and promoters of oncogenesis. However, other aspects of the RNA life-cycle are substantially understudied. The success of our lab and many others in revealing novel regulatory networks based on the action of various miRNAs in driving or suppressing metastasis highlights the possibility that heretofore uncharacterized post-transcriptional regulatory programs may play instrumental roles in tumorigenesis.

Given the success of miRNA regulation and my previous work on RNA stability, measuring differential transcript stability between highly metastatic cells and their poorly metastatic parental populations seemed like a logical step. Using thiouridine pulse-chase labeling and capture followed by high-throughput RNA-seq, we estimated decay rates for every detectable transcript (~13,000 transcripts in total). It was around this dataset that we built an ambitious study, pushing ourselves to dig deeper at every step. We generated, analyzed, and interpreted heaps of data of various kinds: in silico, in vitro, and in vivo. The result of this study was the discovery of a novel post-transcriptional regulatory program that promotes breast cancer metastasis. Our results were recently published in Nature; however, I also gained insights that could not be included in a 4-page paper. As such, in the upcoming posts, I’ll try to expand on various aspects of this study that I found fascinating. Stay tuned…
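
For readers who want a feel for what "estimating a decay rate" means in practice, here is a minimal sketch. It assumes simple first-order decay, N(t) = N0·exp(−kt), and fits k by least squares on log-transformed abundances. The function name, time points and abundances are all made up for illustration; this is not the pipeline used in the paper.

```python
import math

def fit_decay_rate(times, abundances):
    """Fit log(abundance) = log(N0) - k * t by ordinary least squares.

    Returns (k, half_life). Assumes first-order decay, which is an
    illustrative simplification, not a claim about the published analysis.
    """
    logs = [math.log(a) for a in abundances]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    k = -sxy / sxx                 # decay rate is minus the fitted slope
    half_life = math.log(2) / k
    return k, half_life

# Hypothetical labeled-RNA abundances (arbitrary units) after 0, 2, 4 and 8 hours of chase,
# generated from a known rate so the fit can be checked.
times = [0.0, 2.0, 4.0, 8.0]
abundances = [100.0 * math.exp(-0.25 * t) for t in times]
k, half_life = fit_decay_rate(times, abundances)
```

Running the same fit per transcript in two cell populations and comparing the fitted rates is the basic idea behind a differential stability measurement; real data would of course need replicates and noise handling.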

RNA rises

These are exciting times to be an RNA biologist. Next-generation sequencing revolutionized genetics, but now the RNA methodologies have caught up. For every DNA technique, we have developed an equivalent RNA method and then some. For example, there are CLIP-seq and PAR-CLIP replacing ChIP-seq in RNA studies, but there are also recently developed high-throughput methods for probing the secondary structure of RNA in vivo (Rouskin et al. 2013, Nature). Last year the first ever large-scale binding information for a compendium of RNA-binding proteins (RBPs) was published (Ray et al, 2013, Nature). The computational methods are also gaining ground, from SeqFold (Ouyang et al, 2013, Genome Res) to our TEISER (Goodarzi et al, 2012, Nature). Did I mention these are exciting times?!

It is in light of these advances that making sense of the underlying post-transcriptional regulatory networks that control different aspects of RNA life-cycle and behavior has become ever more important. Five years ago, we embarked on a path to catalog the sequences in RNA that play substantial regulatory roles by providing linear or structural information for trans factors to recognize and act on. Given the state of technology at the time, we were limited by the diversity of the library we could generate. So, we decided to focus on 3′ UTR sequences that are conserved across vertebrates. We synthesized these sequences in short spans on a custom-designed Agilent array and cloned them downstream of mCherry in a bidirectional promoter which also drives the expression of GFP as an endogenous control. Our goal was to then use FACS to choose the sub-populations that show higher/lower relative expression of mCherry. We could then amplify the cloning site in the selected populations and re-hybridize them back to our Agilent array for quantification (Figure below). It was all good on paper, but as is always the case, we ran into myriad technical problems, ranging from generating a library with enough independent cells (high coverage) to obtaining reproducible FACS measurements. By the time we were done troubleshooting these problems, a lot had changed in the field. For example, sequencing had really become the staple of RNA biology (which we decided to use instead of array hybridization for quantification purposes), Agilent had started to provide custom oligo libraries directly to consumers (which means that this approach can easily be implemented in every lab) and, more importantly, the Flp-In system (Invitrogen) had appeared, which significantly improved the reproducibility of our measurements (since all clones in the library are inserted at a unique site in the genome).

As is always the case with method development, we needed to perform innumerable validation assays to evaluate the efficacy of our approach in finding known and novel regulatory elements. Our findings were published last week in Cell Reports (Oikonomou et al, 2014), which I encourage you to read. Interestingly, David Erle’s group also published a similar approach, which beat our paper by a few days (Zhao et al, 2014, Nature Biotechnology).
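
To make the quantification step a bit more concrete, here is a toy sketch of how one might score each library element once the high and low mCherry/GFP populations have been sorted and sequenced. It simply compares each element's share of reads in the two bins as a log2 ratio with a pseudocount. The element names and counts are invented, and the scoring scheme is an assumption for illustration, not the statistic used in the paper.

```python
import math

def element_scores(high_counts, low_counts, pseudocount=1.0):
    """log2 ratio of each element's read fraction in the high vs. low sorted bin.

    high_counts / low_counts map element name -> read count.
    Positive scores suggest the element raises relative mCherry expression,
    negative scores suggest it lowers it.
    """
    elements = sorted(set(high_counts) | set(low_counts))
    n = len(elements)
    high_total = sum(high_counts.values()) + pseudocount * n
    low_total = sum(low_counts.values()) + pseudocount * n
    scores = {}
    for element in elements:
        high_frac = (high_counts.get(element, 0) + pseudocount) / high_total
        low_frac = (low_counts.get(element, 0) + pseudocount) / low_total
        scores[element] = math.log2(high_frac / low_frac)
    return scores

# Hypothetical read counts for three 3' UTR elements
high = {"elemA": 300, "elemB": 100, "elemC": 100}
low = {"elemA": 100, "elemB": 100, "elemC": 300}
scores = element_scores(high, low)
```

In this made-up example elemA scores positive (a candidate stabilizing element), elemC scores negative (a candidate destabilizing element) and elemB sits at zero.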

These reporter-based approaches insulate each element and study its effect in isolation; however, real transcripts carry many elements and the fate of the RNA is decided as a cumulative consequence of all the interacting factors. Knowing the initial building blocks, however, enables us to then construct networks and modules of regulatory elements that likely interact and function in an overlapping space (which we tried to infer in our paper using our information-theoretic tools).


Systematic dissection of conserved 3′ UTR sequences in endogenous transcripts

In the end, I wanted to mention that the downside to all the current attention in the RNA field seems to be a fast-paced publication cycle which results in mostly descriptive papers. There is nothing wrong with descriptive studies per se, but sometimes the downstream or underlying mechanisms are so very very much missing. I think we are also guilty of this to some extent. Our goal was really to identify novel trans factors that interact with the elements we identified using our approach. This is something we are still trying to do, and hopefully we will manage to better functionally annotate the cis elements and the molecular mechanisms through which they exert their regulatory roles.

War on cancer: Notes from the frontlines

Last month, I was fortunate enough to attend the tumor heterogeneity and plasticity symposium. The conference, which was jointly organized by CNIO and Nature, was held in Madrid. I thought it would be a good idea to write some notes on what I heard there.


  1. There were two keynote speakers, Kornelia Polyak from Dana-Farber and José Baselga from across the street from us at MSKCC. José’s talk was very much clinical as he enumerated the MANY trials that they are conducting. Kornelia’s talk, on the other hand, was more basic-research-y (is that a word? if not, it should be…). I really cannot distill a whole keynote into a few sentences, but the bottom line was: (i) diversity is bad; (ii) we can develop rational experimental approaches to find cancer drivers that are sub-clonal; (iii) sub-clonality brings forth the idea of growth-promoters vs. competitors. All in all, a very good and complex talk… Maybe one of the few talks at the conference with significant mechanistic and functional observations.
  2. The field, given how young it is, is very descriptive. It largely involves researchers making cool observations… don’t get me wrong, I don’t mean it in a derogatory sense; what I am trying to say is that it would probably take years before we can make sense of many of the observations.
  3. On the sequencing side, Elaine Mardis talked about very deep whole-genome sequencing of matched primary and metastatic tumors, followed by rigorous validations of each genetic variation. She talked about tumor lineage and pressures exerted by therapeutic manipulations and how they shape the heterogeneity of the metastatic sites. One interesting thing that she mentioned was that very few single-nucleotide variations are actually expressed (I think she said something like 44%).
  4. Sean Morrison gave a very rigorous presentation on heterogeneity and metastasis. He had used genomic tools plus xenografting in NSG mice to study melanoma.
  5. Dana Pe’er talked about analyzing mass-cytometry (CyTOF) data using viSNE. CyTOF follows the same logic as FACS, with the main difference that instead of fluorophores, other elements with distinct and sharp mass-spec peaks are conjugated to antibodies. About 40 markers can be measured simultaneously for each cell (compared to 7-8 for FACS). However, making sense of a 40-dimensional dataset is not straightforward, which is where viSNE comes into play. The tool can also be used on other kinds of high-dimensional data. Think of it as non-linear PCA…
  6. Charles Swanton spoke about spatial heterogeneity in tumors where multiple biopsies from the same tumor were sequenced. The level of heterogeneity was scary… For example, we find driver mutations based on their clonality and we aim to target them therapeutically, but there were sub-populations in the tumor that had already lost the driver mutations even in the absence of therapeutic selection pressure.
  7. A number of speakers touched on this idea that the resistant population is already present in the primary tumor and does not necessarily arise after treatment.

All in all, I met some new people and listened to some amazing talks…

The rise of circular RNAs: a whole swath of circular Spongebobs

Recently, we’ve been bombarded by high-profile studies about a class of RNAs called circular RNAs (circRNAs). Resulting from non-canonical splicing events (see below), circRNAs seem to be more prevalent than previously thought. They have been identified in mammals, plants and even archaea.

The formation and identification of circRNAs

The recent papers in Nature (Memczak et al. and Hansen et al.) argue for a broad, even tissue-specific, functionality for this type of RNA. Memczak et al. report a comprehensive atlas of thousands of circRNAs in various organisms, built through a computational approach to which they assign an impressive 75% sensitivity and a very low false-discovery rate.

circRNA statistics according to Memczak et al.

The remarkably high stability of these RNAs, according to these authors, puts them in a perfect position to function as post-transcriptional regulators by sponging other regulatory trans factors. They focused on miRNA sites to find circRNAs that show a higher than expected occurrence of these elements. And they in fact find circRNAs that can bind and trap miR-7-loaded RISC, results that are corroborated in other recent papers.
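
The "count the miRNA sites" idea is easy to sketch. The snippet below derives the site a miRNA seed (nucleotides 2-8) would pair with and counts its occurrences in an RNA. Both sequences are toy examples I made up, not real miR-7 or circRNA sequences, and a real analysis would compare the count against a background expectation, which is not shown here.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_match_site(mirna, start=2, end=8):
    """Reverse complement of the miRNA seed (nt 2-8 by default, 1-based, inclusive).

    This is the sequence one expects to find on a target RNA.
    """
    seed = mirna[start - 1:end]
    return "".join(COMPLEMENT[base] for base in reversed(seed))

def count_sites(rna, site):
    """Count occurrences of `site` in `rna`, allowing overlaps."""
    width = len(site)
    return sum(1 for i in range(len(rna) - width + 1) if rna[i:i + width] == site)

# Toy sequences (hypothetical, for illustration only)
toy_mirna = "UAGCAUGCAAGGCUAGUCAGU"
toy_circrna = "AAGCAUGCUAAGCAUGCUCC"

site = seed_match_site(toy_mirna)
n_sites = count_sites(toy_circrna, site)
```

A circRNA acting as a sponge would show many more such sites than expected for its length, which is the kind of signal the authors were looking for.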

Personally, I find sponging a very low-complexity function… meaning that it arises after the fact, with the cell taking advantage of non-coding RNAs that are already available. This means either that circRNAs first arose as aberrant splicing events (i.e. mistakes in donor-acceptor identification), or that they or their splicing partners play other, more complex roles that we should be able to identify soon.


Memczak S, et al. (2013). Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. doi:10.1038/nature11928

Hansen TB, et al. (2013). Natural RNA circles function as efficient microRNA sponges. Nature. doi:10.1038/nature11993

Genome editing via the CRISPR system: the triumph of basic research

I wanted to write a quick note about the use of the CRISPR/Cas system in gene editing. Several labs in parallel have developed a CRISPR system (Clustered Regularly Interspaced Short Palindromic Repeats) for gene editing purposes in a variety of organisms from zebrafish to humans. The CRISPR/Cas systems and their function as an immunity-type response in bacteria are on their own very exciting and I encourage you to read about them, if you haven’t (e.g. see this paper in Science from 2010 or simply visit Wikipedia).

In short, this system records foreign DNA and uses an RNA intermediate (crRNA) to target later encounters of the same invasive DNA species via specific nucleases (i.e. Cas). But how unbelievably “cool” this system is aside, recently it has been adopted for gene editing. In this context, the targeting portion of the crRNA is replaced with a sequence matching the target (the site that we want to alter in the genome). The activity of the Cas/crRNA complex, if properly expressed, then results in double-stranded breaks at the site of interest. The cell then uses the non-homologous end-joining repair system to fix the break; however, the error-prone nature of this mechanism results in small insertions or deletions at the site of action. Now if the target sequence is selected from an active gene, this mechanism effectively mutates the gene into an inactive copy.
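
Picking a target sequence is mostly a pattern search. The sketch below assumes the commonly used S. pyogenes Cas9, which needs an NGG motif (the PAM) immediately 3′ of a roughly 20-nt target, and lists every candidate on the strand you give it. The function name and the example sequence are hypothetical, and the sketch ignores everything that matters in practice beyond the PAM (off-targets, the reverse strand, GC content).

```python
import re

def find_cas9_targets(dna, protospacer_len=20):
    """Return (position, protospacer, PAM) for every protospacer followed by NGG.

    Only the given strand is scanned. A lookahead is used so that
    overlapping candidates are all reported.
    """
    dna = dna.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]

# Toy sequence: two flanking bases, a 20-nt stretch, a TGG PAM, two more bases
toy_dna = "TT" + "ACGT" * 5 + "TGG" + "AA"
targets = find_cas9_targets(toy_dna)
```

Here the only candidate starts at position 2 (0-based) and is followed by the PAM TGG.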

CAS system structure

Even better, by modifying the Cas enzymes, we can limit the nuclease activity to a single nick in the DNA, as opposed to a double-stranded break. In this case, the cell employs homologous recombination to repair the nick, and if we provide a mutated homologous sequence in trans, the system may use it as a template to correct the nicked site and in effect transfer the mutation to the genome with surgical precision.

Not only is all of this very exciting, it is also a prime example of how important basic research is. Just imagine the first grant written for studying the CRISPRs, and I’m paraphrasing here, “ahem…, there are these repetitive sequences in bacteria and some obscure archaea… we have no idea what they do, but we kind of wanna know… so fund us maybe?”. Pursuing this simple curiosity, however, has resulted in a promising method for genome editing in humans and is poised to transform how we do genetics (mainly due to its low cost of implementation). This is a very good example of where targeted funding of “translational research” fails. Ultimately, leaps in life sciences (and science in general) come from systems that we don’t even know exist. The same goes for other amazing tools that have become mainstays of molecular biology. I assume the proposal for studying fluorescent proteins went something like this: “well… we have this cool organism that glows in the dark. We kind of wanna know why. Will it cure cancer? probably not…”. But all kidding aside, these are all very good reminders of how important basic research is to our collective knowledge.

Sources: “Cong et al, 2013, Multiplex Genome Engineering Using CRISPR/Cas Systems, Science DOI: 10.1126/science.1231143” among others.

Solving the directionality problem of RNA polymerase

Every now and then, a study appears that reminds us how little we know about some of the most basic subjects in molecular biology, while at the same time expanding the connotations associated with these seemingly simple mechanisms. A recent paper in Science by a multinational collaborative team was a perfect example of one such moment for me. The problem statement is relatively simple: how does RNA polymerase recognize the orientation of DNA; in other words, how does it know in which direction it should be heading? The answer, as I knew it, had two parts: (i) there are certain promoter elements that are in and of themselves directional, meaning the transcription complex specifically recognizes one strand and not the other (the world-famous lac promoter is one such example); (ii) in cases where there is no directionality coded in the DNA or the epigenome, the polymerase in fact does go the wrong way, which produces the myriad anti-sense RNAs in the cell. Granted, there might be functionalities associated with these anti-sense RNAs; however, established examples are few and far between.

The more important observation, however, is the fact that there are genetic components to when the anti-sense RNA is transcribed and when it isn’t. The aforementioned study starts from one such mutant (ssu72) and goes on to dissect the mechanism through which Ssu72 establishes directionality of the RNA polymerase complex. The results are very simple and elegant: Ssu72 is a part of a bridging complex that demarcates the start and end of the gene, and consequently the correct direction for transcription (below you can see the figure from the main paper).

Ssu72-mediated loop formation

Now one might be wondering why not all promoters are directional at the sequence level. The short answer, I think, is “regulation”. There are a variety of molecular mechanisms through which promoter directionality can be used in gene regulation, both for the downstream gene and for the upstream ones. For the immediate gene, losing half of the initiation complexes to the wrong direction ensures lower expression, a fraction that can very well be modulated (e.g. through regulating ssu72 in this example). And for the upstream genes (as well as the downstream one), the presence of anti-sense RNA spells some form of doom or desist.

Synthesizing a genome like a boss

Despite recent leaps in artificial synthesis of long custom DNA sequences and the hype surrounding similar large-scale projects (e.g. the first synthetic genome), the process is far from mainstream. Synthesis of genes or even minigenes is still expensive, slow and tedious. As a lazy scientist who despises cloning and prefers to synthesize the whole construct at the touch of a button (or the click of a mouse), I am all for methods that advance this area of biotech. So, I am very excited about this paper that recently showed up in Nature Methods (and I think it’s pretty smart).

The rationale here is based on the fact that short-DNA (i.e. oligonucleotide) synthesis, which forms the basis for longer molecules, is still very error-prone, and finding a molecule without mutations requires the direct sequencing of many instances of the final product (only a small fraction of the final products are mutation-free). What they have accomplished here is to shift the sequencing step to the oligo level. Basically, they tag all the oligos to be synthesized with random barcodes. They sequence the whole population using high-throughput sequencing, identify the correct oligos and use their barcodes to enrich them from the initial population using specific primers and PCR.
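
The selection logic can be sketched in a few lines. The snippet below takes (barcode, sequence) pairs from a sequencing run, keeps only barcodes whose reads all agree with one designed oligo, and reports which barcodes to amplify for each design. The data layout, names and sequences are my own assumptions for illustration; the paper's actual filtering criteria are not described in this post.

```python
from collections import defaultdict

def perfect_barcodes(reads, designed):
    """Map each designed oligo to the barcodes that tag an error-free copy of it.

    reads: iterable of (barcode, oligo_sequence) pairs.
    designed: set of intended oligo sequences.
    A barcode is kept only if every read carrying it shows the same
    sequence and that sequence is one of the designs.
    """
    seqs_by_barcode = defaultdict(set)
    for barcode, seq in reads:
        seqs_by_barcode[barcode].add(seq)

    good = defaultdict(list)
    for barcode, seqs in seqs_by_barcode.items():
        if len(seqs) == 1:
            (seq,) = seqs
            if seq in designed:
                good[seq].append(barcode)
    return {seq: sorted(barcodes) for seq, barcodes in good.items()}

# Hypothetical designs and reads
designed = {"ACGTTGCA", "GGATCCAA"}
reads = [
    ("AAAC", "ACGTTGCA"), ("AAAC", "ACGTTGCA"),  # consistent and correct
    ("CCGT", "ACGTTGGA"),                        # synthesis error
    ("TTAG", "GGATCCAA"),                        # correct
    ("GACA", "GGATCCAA"), ("GACA", "GGATCGAA"),  # inconsistent reads, discarded
]
selected = perfect_barcodes(reads, designed)
```

The barcodes in `selected` are the ones you would design PCR primers against to pull the correct molecules out of the pool.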

I assume companies that synthesize custom DNA can take advantage of a sequencing run to find the correct oligos for multiple independent orders, thus significantly reducing the cost of sequencing. However, high-throughput sequencing is still slow. So, I assume this method doesn’t significantly cut the time requirement of the production phase… but I’m not too familiar with the industrial pipeline, and maybe it does help. I think we’ll know soon enough.

Of horses and men: the genetic makeup of racehorses

Horse locomotion and speed are among the most complex behaviors that people seem to be interested in (for obvious reasons). There is some correlation between how a horse runs and how fast it runs. In other words, it seems that there are successful styles of running, and these styles can be treated as phenotypes (or traits) and effectively studied through genetics. In general, genetic studies of dogs or horses have been significantly more successful than those of humans, partly due to very controlled mating across the breeds and also excellent record keeping by the breeders throughout many generations. Now there is a paper out in Nature that looks at pacing in Icelandic horses, which has a high heritability in this breed, and successfully maps this phenotype to a nonsense mutation in DMRT3.

The study contains an association study between 30 horses that don’t pace and 40 that do, which resulted in the discovery of a highly significant SNP (single-nucleotide polymorphism) on chromosome 23. Genome re-sequencing in this region showed a nonsense mutation in DMRT3 as the likely candidate.
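
To give a sense of how a single SNP gets scored in a case/control design of this size, here is a small sketch using Fisher's exact test on a 2x2 table of allele counts. The allele counts are invented (I only kept the group sizes from the post: 40 pacers and 30 non-pacers, so 80 and 60 alleles), and the post does not say which test the authors actually used, so take this purely as an illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # small tolerance so that ties with the observed table are included
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# Hypothetical allele counts at one SNP:
#               variant  reference
# pacers           70       10
# non-pacers       12       48
p_value = fisher_exact_two_sided(70, 10, 12, 48)
```

In a genome-wide scan the same test is repeated for every SNP, and the significance threshold has to be corrected for the number of tests.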

What distinguishes this study from similar ones I have read over the years is the fact that they closely follow up on the functionality of DMRT3 and its mutated form. They make the case, using mouse models, that this protein functions in neural development. And this is the part that gets me excited… this study sets a bar for genetic projects. It’s not enough to just list mutations in a bunch of genes along with their contribution to the phenotype. We need more mechanistic and functional results that can actually augment our knowledge to a degree that a simple gene/mutation list cannot. I am sure this is not a perfect project either, and if we look closely there are things that could be done differently/better. Nevertheless, it signals the arrival of a new kind of genetic study, one that is more function-oriented.

The role of regulatory genome in human disease

A recent paper in Science perfectly captures the post-ENCODE mood of the community. It seems like we suddenly realized the coding genome is not actually that important. We have remarked over and over in the past couple of weeks that the majority of hits from genome-wide association studies (GWAS) actually map to non-coding DNA as opposed to coding sequences. And now, armed with the knowledge that the non-coding genome has far-reaching regulatory consequences, it is very likely that the genetic component of many complex human diseases is in fact driven by regulatory interactions. And this Science paper very clearly portrays this idea. They use DNase I hypersensitivity data as a proxy for the parts of the genome that are bound by proteins in vivo. They then look at the overlap between DNase I hypersensitive sites (DHSs) and the available phenotypic and disease data. They show that many variants at these sites have regulatory consequences and they make the case that the role of the regulatory genome in disease is ubiquitous and profound.
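
At its core, the overlap analysis asks a simple question for every disease-associated variant: does it fall inside a DHS? Below is a minimal sketch of that lookup using sorted intervals and binary search. The coordinates are invented, intervals are assumed to be half-open and non-overlapping, and the real study naturally involves far more than this (cell-type-specific DHS maps, enrichment statistics and so on).

```python
from bisect import bisect_right

def variants_in_dhs(variants, dhs_sites):
    """Return the variants that fall inside any DHS interval.

    variants: list of (chrom, pos).
    dhs_sites: list of (chrom, start, end), half-open [start, end),
               assumed not to overlap within a chromosome.
    """
    by_chrom = {}
    for chrom, start, end in dhs_sites:
        by_chrom.setdefault(chrom, []).append((start, end))
    starts = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts[chrom] = [s for s, _ in intervals]

    hits = []
    for chrom, pos in variants:
        if chrom not in by_chrom:
            continue
        # index of the last interval starting at or before pos
        idx = bisect_right(starts[chrom], pos) - 1
        if idx >= 0 and pos < by_chrom[chrom][idx][1]:
            hits.append((chrom, pos))
    return hits

# Hypothetical DHS intervals and GWAS variants
dhs = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 80)]
gwas_variants = [("chr1", 150), ("chr1", 200), ("chr1", 649), ("chr2", 10), ("chr3", 5)]
overlapping = variants_in_dhs(gwas_variants, dhs)
fraction = len(overlapping) / len(gwas_variants)
```

Comparing that fraction with what you would get for matched random variants is what turns a raw overlap into an enrichment claim.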

As I said, this study captures the consensus viewpoint of the community very well, and I think we’ll see an explosion in this type of study, putting the regulatory genome front and center as opposed to the coding DNA. With the low-hanging fruit already discovered in genetic studies, and the emergence of effective methods based on high-throughput sequencing, we are now poised to better understand regulatory networks in all their glory.