Genomics euphoria: ramblings of a scientist on genetics, genomics and the meaning of life

Category Archives: Whole-genome sequencing

War on cancer: Notes from the frontlines

Last month, I was fortunate enough to attend the tumor heterogeneity and plasticity symposium. The conference, which was jointly organized by CNIO and Nature, was held in Madrid. I thought it would be a good idea for me to write some notes on what I heard at the conference.


  1. There were two keynote speakers, Kornelia Polyak from Dana-Farber and José Baselga from across the street from us at MSKCC. José’s talk was very much clinical as he enumerated the MANY trials that they are conducting. Kornelia’s talk, on the other hand, was more basic-research-y (is that a word? if not, it should be…). I really cannot distill a whole keynote into a few sentences, but the bottom line was: (i) diversity is bad; (ii) we can develop rational experimental approaches to find cancer drivers that are sub-clonal; (iii) sub-clonality brings forth the idea of growth-promoters vs. competitors. All in all, a very good and complex talk… Maybe one of the few talks at the conference with significant mechanistic and functional observations.
  2. The field, given how young it is, is very descriptive. It largely involves researchers making cool observations… don’t get me wrong, I don’t mean it in a derogatory sense; what I am trying to say is that it would probably take years before we can make sense of many of the observations.
  3. On the sequencing side, Elaine Mardis talked about very deep whole-genome sequencing of matched primary and metastatic tumors, followed by rigorous validations of each genetic variation. She talked about tumor lineage and pressures exerted by therapeutic manipulations and how they shape the heterogeneity of the metastatic sites. One interesting thing that she mentioned was that very few single-nucleotide variations are actually expressed (I think she said something like 44%).
  4. Sean Morrison gave a very rigorous presentation on heterogeneity and metastasis. He had used genomic tools plus xenografting in NSG mice to study melanoma.
  5. Dana Pe’er talked about analyzing mass-cytometry (CyTOF) data using viSNE. CyTOF follows the same logic as FACS, with the main difference that instead of fluorophores, other elements with distinct and sharp mass-spec peaks are conjugated to antibodies. About 40 markers can be measured simultaneously for each cell (compared to 7-8 for FACS). However, making sense of a 40-dimensional dataset is not straightforward, which is where viSNE comes into play. This tool, however, can be used for other dimensionality-reduction problems as well. Think of it as non-linear PCA…
  6. Charles Swanton spoke about spatial heterogeneity in tumors where multiple biopsies from the same tumor were sequenced. The level of heterogeneity was scary… For example, we find driver mutations based on their clonality and we aim to target them therapeutically, but there were sub-populations in the tumor that had already lost the driver mutations even in the absence of therapeutic selection pressure.
  7. A number of speakers touched on this idea that the resistant population is already present in the primary tumor and does not necessarily arise after treatment.

All in all, I met some new people and listened to some amazing talks…

Synthesizing a genome like a boss

Despite recent leaps in artificial synthesis of long custom DNA sequences and the hype surrounding similar large-scale projects (e.g. the first synthetic genome), the process is far from mainstream. Synthesis of genes or even minigenes is still expensive, slow and tedious. As a lazy scientist who despises cloning and prefers to synthesize the whole construct at the touch of a button (or a click of a mouse), I am all for methods that advance this area of biotech. So, I am very excited about this paper that recently showed up in Nature Methods (and I think it’s pretty smart).

The rationale here is based on the fact that short-DNA (i.e. oligonucleotide) synthesis, which forms the basis for longer molecules, is still very error prone, and finding a molecule without mutations requires directly sequencing many instances of the final product (only a small fraction of the final products are mutation-free). What they have accomplished here is to shift the sequencing step to the oligo level. Basically, they tag all the oligos to be synthesized with random barcodes. They sequence the whole population using high-throughput sequencing, identify the correct oligos and use their barcodes to enrich them from the initial population using specific primers and PCR.
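To make the selection step concrete, here is a minimal sketch of the idea in Python. It is not the paper's pipeline: the function name, the `min_reads` threshold and the toy reads are all made up for illustration, and a real workflow would also have to handle sequencing errors in the barcodes themselves.

```python
from collections import defaultdict

def pick_correct_barcodes(reads, design, min_reads=2):
    """Return barcodes whose reads all match the designed oligo exactly.

    reads     : iterable of (barcode, oligo_sequence) pairs from sequencing
    design    : the intended oligo sequence
    min_reads : hypothetical threshold; a barcode seen fewer times is skipped
    """
    by_barcode = defaultdict(list)
    for barcode, seq in reads:
        by_barcode[barcode].append(seq)

    good = []
    for barcode, seqs in by_barcode.items():
        # Keep a barcode only if it has enough support and no read disagrees
        # with the design.
        if len(seqs) >= min_reads and all(s == design for s in seqs):
            good.append(barcode)
    return sorted(good)

# Toy data (invented): one correct molecule, one with a point mutation,
# and one correct molecule that was only seen once.
design = "ACGTTGCA"
reads = [
    ("AAAC", "ACGTTGCA"),
    ("AAAC", "ACGTTGCA"),
    ("GGTA", "ACGTAGCA"),
    ("GGTA", "ACGTAGCA"),
    ("CTTG", "ACGTTGCA"),
]
selected = pick_correct_barcodes(reads, design)
print(selected)
```

The barcodes that survive this filter are the ones you would design PCR primers against to pull the error-free molecules out of the pool.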

I assume companies that synthesize custom DNA can take advantage of a sequencing run to find the correct oligos for multiple independent orders, thus significantly reducing the cost of sequencing. However, high-throughput sequencing is still slow. So, I assume this method doesn’t significantly cut the time requirement of the production phase… but I’m not too familiar with the industrial pipeline, and maybe it does help. I think we’ll know soon enough.

Of horses and men: the genetic makeup of racehorses

Horse locomotion and speed are among the most complex behaviors that people seem to be interested in (for obvious reasons). There is some correlation between how a horse runs and how fast it runs. In other words, it seems that there are successful styles of running and these styles can be treated as phenotypes (or traits) and effectively studied through genetics. In general, genetic studies of dogs or horses have been significantly more successful than those of humans, partly due to very controlled mating across the breeds and also excellent record keeping by the breeders throughout many generations. Now there is a paper out in Nature that looks at pacing in Icelandic horses, which has a high heritability in this breed, and successfully maps this phenotype to a nonsense mutation in DMRT3.

The study contains an association study between 30 horses that don’t pace and 40 that do, which resulted in the discovery of a highly significant SNP (single-nucleotide polymorphism) on chromosome 23. Genome re-sequencing in this region showed a nonsense mutation in DMRT3 as the likely candidate.
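For a feel of how a single SNP can come out "highly significant" with only 70 animals, here is a small sketch using Fisher's exact test on a 2x2 table. Two loud caveats: the genotype counts below are invented (only the group sizes of 40 and 30 come from the study), and Fisher's test is my choice for illustration, not necessarily the statistic the authors used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability of seeing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# HYPOTHETICAL counts: carriers vs. non-carriers of the variant allele.
pacers_carrier, pacers_noncarrier = 38, 2          # 40 horses that pace
nonpacers_carrier, nonpacers_noncarrier = 5, 25    # 30 horses that don't

p_value = fisher_exact_two_sided(
    pacers_carrier, pacers_noncarrier, nonpacers_carrier, nonpacers_noncarrier
)
print(f"p = {p_value:.3g}")
```

When a trait is close to monogenic, as pacing appears to be here, even a modest sample can separate the two groups very cleanly, which is part of why breed-based studies work so well.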

What distinguishes this study from similar ones I had read over the years is the fact that they closely follow up on the functionality of DMRT3 and its mutated form. They make the case that this protein functions in neural development using mouse models. And this is the part that gets me excited… this study sets a bar for genetic projects. It’s not enough to just list mutations in a bunch of genes along with their contribution to the phenotype. We need more mechanistic and functional results that can actually augment our knowledge to a degree that a simple gene/mutation list cannot. I am sure this is not a perfect project either and if we look closely there are things that could be done differently/better. Nevertheless, it signals the arrival of a new kind of genetic study, one that is more function oriented.

The role of regulatory genome in human disease

A recent paper in Science perfectly captures the post-ENCODE mood of the community. It seems like we suddenly realized the coding genome is not actually that important. We have remarked over and over in the past couple of weeks that the majority of hits from genome-wide association studies (GWAS) actually map to non-coding DNA as opposed to coding sequences. And now, armed with the knowledge that the non-coding genome has far-reaching regulatory consequences, it is very likely that the genetic component of many complex human diseases is in fact driven by regulatory interactions. And this Science paper very clearly portrays this idea. They use DNase I hypersensitivity data as a proxy for the parts of the genome that are bound by proteins in vivo. They then look at the overlap between DNase I hypersensitive sites (DHSs) and the available phenotypic and disease data. They show that many variants at these sites have regulatory consequences and they make the case that the role of the regulatory genome in disease is ubiquitous and profound.
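The core operation in this kind of analysis is an interval overlap: which disease-associated variants fall inside a hypersensitive site? Here is a minimal sketch of that step. The coordinates are made up, the function name is mine, and I am assuming half-open intervals that are sorted and non-overlapping within each chromosome; the actual paper of course does far more (enrichment statistics, cell-type specificity and so on).

```python
from bisect import bisect_right

def variants_in_dhs(dhs_by_chrom, variants):
    """Return the variants that fall inside a DHS interval.

    dhs_by_chrom : dict mapping chromosome -> list of (start, end) intervals,
                   half-open [start, end), sorted and non-overlapping
    variants     : list of (chromosome, position) pairs
    """
    hits = []
    for chrom, pos in variants:
        intervals = dhs_by_chrom.get(chrom, [])
        starts = [start for start, _ in intervals]
        # Index of the last interval that starts at or before the variant.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            hits.append((chrom, pos))
    return hits

# Invented toy coordinates, for illustration only.
dhs = {"chr1": [(100, 200), (500, 650)], "chr2": [(50, 80)]}
gwas_variants = [("chr1", 150), ("chr1", 300), ("chr1", 649), ("chr2", 80), ("chr3", 10)]

hits = variants_in_dhs(dhs, gwas_variants)
fraction_in_dhs = len(hits) / len(gwas_variants)
print(hits, fraction_in_dhs)
```

To claim enrichment you would then compare this fraction against what you get for matched control variants, which is the part that takes real care.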

As I said, this study captures the consensus viewpoint of the community very well, and I think we’ll see an explosion in these types of studies that put the regulatory genome front and center as opposed to the coding DNA. With the low-hanging fruit already picked in genetic studies, and the emergence of effective methods based on high-throughput sequencing, we are now poised to better understand regulatory networks in all their glory.

Decoding the ENCODed DNA: You get a function, YOU get a function, EVERYBODY gets A function

It has been almost half a century… since we started drilling the concept of the “central dogma” (roughly, that DNA->RNA->protein in some sense equals life) into the psyche of the scientific community and the human population as a whole. The idea was that everything which makes us human, or a chimp a chimp, is encoded in As, Gs, Cs and Ts, efficiently packaged into the nuclei of every cell. Every cell, it went, has the capacity to reproduce the complete organism. What seemed to be missing in our daily conversations (or conveniently omitted) was how the cells in our body can have such different cellular fates if they start with the same information, which they hang on to for the entirety of their lifespan. The answer came, miraculously enough, from Jacob and Monod and their work on the lac operon in E. coli: it is not the book, but how it is read that defines the fate of every cell. Which parts of this genomic library are transcribed (into RNA) and expressed (via the protein products) is ultimately decided by the “regulatory” agents toiling away in the cell. These regulatory agents come in many forms. The first generation were themselves proteins (first repressors and then activators). Then came micro-RNAs, small RNA molecules that can locate specific target sequences on RNA molecules and affect their expression (for example through changing the life-span of an RNA molecule). And now, we have identified an arsenal of these regulatory mechanisms: chromatin structure (how DNA is packaged and marked affects its accessibility), transcription factors, miRNAs, long non-coding RNAs and… At the end of the day, it seems that the complexity of an organism largely stems from the diversity and complexity of these regulatory agents rather than the number of protein-coding genes in the genome. It’s like chemistry: the elements are there but what you do with them and how you mix them in what proportions gives you a functional and miraculous product.

Genome Project

The “Human genome project” was the product of the classic “central dogma” oriented view-point. Don’t get me wrong… this was a vital project and what we know now largely depended on it; however, this project was initially sold as the ultimate experiment. If we read the totality of the human DNA, the reasoning went, we’ll know EVERYTHING about humans and what makes them tick. But obviously, that wasn’t the case. We realized that it is not the DNA but the regulatory networks and interactions that matter (hence the birth and explosion of the whole genomics field).

The ENCODE project


The ENCODE project was born from this more modern and regulation-centric view of genomics. The recent issue of Nature published a dozen papers from ENCODE, along with accompanying papers in other journals. This was truly an accomplishment for science this year, rivaled only by the discovery of the Higgs boson (if it is in fact the Higgs boson) and the Curiosity landing on Mars. At the core, what they have done in this massive project is simple: throw whatever we have in terms of methods for mapping regulatory interactions at the problem, from DNase I footprints to chromatin structure and methylation. And what they report as their MAIN big finding is the claim that there is in fact no junk DNA in the genome, since for 80% of the genomic DNA they find at least one regulatory interaction, which they count as “functional”.

As I said, this was a great project and will be a very good resource for our community for many years to come. But there are some issues that I want to raise here:

  1. I think we’re over-hyping this. Not every observed interaction means “functionality”. We already know from ChIP-seq datasets that for example, transcription factors bind to regions other than their direct targets. Some of these sites are in fact neutral and their interactions may very well be a biochemical accident. Now one might claim that if the number of transcription factors is limited, these non-functional sites may show some functionality through competing with actual sites to decrease the effective concentration of the transcription factor in vivo.
  2. The take-home message from the ENCODE project seems to be debunking the existence of “junk DNA”. But to be honest, not many of us thought the genome had a significant amount of junk anyway. I am sure that ENCODE provided us with a great resource, but pointing to this as its major achievement does not seem logical. To be honest, I think a resource project like this doesn’t really have an immediate, obvious, ground-breaking discovery; however, the policy makers want to see something when they fund these types of projects… and this is one way of giving it to them.
  3. Funding is another issue here. This was a very expensive endeavor (200 million dollars, was it?). Now I am all for spending as much money on science as possible; however, this is not happening and funding in biosciences seems to be tight nowadays. We can legitimately ask if this amount of money might have been better spent on 200 projects in different labs as opposed to one big project. A project, let me remind you, that would have been significantly cheaper to do in the near future due to plummeting sequencing costs. I’m not saying ENCODE was a waste of money, I just think we’re at a point where things like this should be debated across the community.

Nevertheless, the ENCODE consortium should be commended for carrying out one of the most well-coordinated projects in the history of biosciences with astounding quality. I think compared to the human genome project, this was a definite success. I have never seen the community this amped up, with everyone poring over the gorgeous interactive results, going over their favorite genes and making noise on Twitter. This is a proud moment to be a biologist… I think we have officially entered the post-“central dogma” age of biology.

Reading an Ancient Genome

Recently, a paper appeared in Science magazine describing a multinational effort to sequence the genome of an archaic individual (an 80,000-year-old Denisovan girl). It actually created a fair bit of hype, with news snippets abounding (e.g. this one) and a nice Wired blog post. Much of the hype, I think, was warranted and this study offers a blueprint for how studies are shaped in the age of genomics and whole-genome sequencing. I will first talk about why I think this study tackles an important problem and then move on to the methodology and results.

Following the genetic trail: the tale of the third chimpanzee

Looking back, I think my introduction to human evolution was mainly through an outstanding book written by Jared Diamond, called “The Third Chimpanzee”. A lot has changed since then (although I think the book is still very relevant and a very good read). Many more fossils have been discovered around the world, from Lucy (Australopithecus afarensis), dated to about 3 million years ago, to Homo heidelbergensis and Homo erectus specimens dated to less than half a million years. These at times partial fossils tell a convincing, albeit incomplete, story of human evolution. However, it was the “Neanderthal Genome Project”, reporting the whole-genome sequence of a 38,000-year-old sample from the femur of a Neanderthal specimen, that turned a page on studying the genetics of human evolution. DNA is a vast collection of information, and comparing these collections between different species paints a more vivid picture of their evolutionary trajectories in outstanding detail. This information goes significantly beyond what we can learn from the shape of the fossils, their dates and the circumstances of their finding. It is like finding the black box of a fallen plane: rife with key information that truly shapes our understanding of the events. For example, the Neanderthal Genome Project showed that there had been very little if any interbreeding between humans and Neanderthals (1-4%) with insignificant effects on the evolutionary trajectory of the human genome. Our new-found ability to look into the DNA information has enabled us to reconstruct the evolutionary trajectories with unprecedented resolution. Why is this important? Genetically speaking, I think it is based on where we came from that we can learn where we are headed as a species. And we owe this knowledge to recent advances in high-throughput sequencing.


Sequencing old DNA

DNA is one of the key building blocks of life and its use as genetic material stems, in part, from its surprising stability. Nevertheless, DNA is susceptible to erosion and degradation. This degradation results in very poor DNA quality when extraction is attempted on fossils. Another important point is that conventional methods for preparing samples for high-throughput sequencing rely on double-stranded DNA, while DNA degradation results in single-stranded DNA becoming a significant portion of the population in fossils. Relying on double-stranded DNA methods not only loses this sizable fraction of DNA, but also results in an enrichment of exogenous contaminant DNA from bacteria or even humans. For example, in the Neanderthal genome project, a significant correlation was observed between the length of a fragment and its similarity to modern humans, implying that large fragments (which come from higher quality DNA) were in fact from contaminants. This is exactly the problem that this study has tackled. They have developed a sequencing strategy that works from single-stranded DNA rather than double-stranded DNA. This method better captures degraded samples, and it is due to this enhancement that they succeeded in producing a rather high-quality sequence of the ancient fossil. They achieved more than 20-fold coverage of the genome on average, meaning that each position in the genome was read about 20 times independently on average, which significantly increases the accuracy of the sequence. In comparison, the Neanderthal project reached 1.5-fold coverage of the genome. This surprising jump in quality is a testament to the effectiveness of their proposed method in sequencing fossilized DNA.
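To see why the jump from 1.5-fold to 20-fold coverage matters so much, here is a back-of-the-envelope sketch. Mean coverage is just total sequenced bases divided by genome size, and under an idealized Poisson model (reads landing uniformly at random, which ancient DNA certainly does not do) the fraction of positions never read is roughly exp(-coverage). The read counts and read length below are hypothetical round numbers, not figures from the paper.

```python
from math import exp

def mean_coverage(n_reads, read_length, genome_size):
    """Average number of times each base is read: total bases / genome size."""
    return n_reads * read_length / genome_size

def fraction_uncovered(coverage):
    """Expected fraction of positions with zero reads under a Poisson model.

    This is an idealization: real libraries are far from uniform.
    """
    return exp(-coverage)

GENOME_SIZE = 3_000_000_000  # roughly the size of a human genome, in bases

# Hypothetical run: 600 million reads of 100 bases each.
cov = mean_coverage(600_000_000, 100, GENOME_SIZE)
print(f"coverage: {cov:.1f}x")

for c in (1.5, 20.0):
    print(f"{c:>4}x -> about {fraction_uncovered(c):.2e} of positions never read")
```

Beyond simply filling gaps, deep coverage is what lets you tell a real heterozygous site from a sequencing error, since a position read once cannot distinguish the two.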

Why does it matter?

This level of coverage and accuracy in sequence enables us to make key inferences both about the individual and the population she came from. While the fossil was really only a partial bone from a finger and a couple of teeth, the researchers have determined the color of her eyes, hair and skin. But more importantly, with such accuracy, we can tell apart the maternal chromosomes (those coming from the mother) from paternal ones (those coming from the father). What can we do with this information? For starters, we can determine whether the parents were close relatives or not (which in this case they weren’t). However, while the parents were not closely related, their genomes show portions of significant similarity. This observation implies that the population in which this girl was living showed very low genetic diversity. This can be due to the population size. Small populations result in an effect called “bottlenecking”, in which a few individuals shape the makeup of the whole population, resulting in very low diversity. Another important finding is that Denisovans (which this girl is a member of) split from modern humans around 180,000 years ago.

What else?

We can gain important insights by comparing this ancient genome to those of modern humans in terms of differences. Looking at the two genomes side-by-side, these researchers observe that a significant fraction of these changes affect brain development and function. While this may not be a surprising observation (since we already associate modern humans with higher brain size and function), it underscores the potential role of brain morphology and development in the evolution of our species.

The plethora of information and knowledge that we gain from a single sequenced genome is outstanding. While we should be cautious about generalizing findings based on a single individual to whole populations across Asia or even the world, there are in fact no other sources where this knowledge can be found. Similar studies can paint a detailed picture of our genome and its evolution through the ages. A low-hanging fruit is to use this novel sequencing methodology to sequence other available samples (including the Neanderthal ones).