“Junk DNA” Debate Aside, ENCODE Papers Ride High in Citations

October 2013

ENCODE is a very spiffy acronym (for the Encyclopedia of DNA Elements), but the word itself makes searching for information about it extremely difficult. It is ubiquitous, in all sorts of different fields. Perhaps that was the point, a high-level metaphor for the difficulty of sieving specific bits of information from the gigantic and rapidly growing pool of DNA data. Unlikely; I suspect it was just too attractive a name to pass up. And searchable or not, one recent result of work by more than 400 scientists currently ranks as the most-cited biology paper published in the last two years, as the Web of Science attests, with 100 citations tallied during a recent two-month period and an overall total exceeding 400 at this writing, barely a year after publication. (See table, paper #3.)

ENCODE: Selected Highlights

(Papers reporting or utilizing ENCODE data, 2004 to 2012, listed by citations)

Rank Paper Citations
1 ENCODE Project Consortium, “Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project,” Nature, 447(7146): 799-816, 2007. [88 institutions worldwide] 1,903
2 ENCODE Project Consortium, “The ENCODE (ENCyclopedia of DNA elements) Project,” Science, 306(5696): 636-40, 2004. [NHGRI, Bethesda, MD] 561
3 ENCODE Project Consortium, “An integrated encyclopedia of DNA elements in the human genome,” Nature, 489(7414): 57-74, 2012. [85 institutions worldwide] 422
4 modENCODE Consortium, “Identification of functional elements and regulatory circuits by Drosophila modENCODE,” Science, 330(6012): 1787-97, 2010. [41 institutions worldwide] 227
5 D.W. Craig, et al., “Identification of genetic variants using bar-coded multiplexed sequencing,” Nature Methods, 5(10): 887-93, 2008. [Translat. Genom. Res., Phoenix, AZ; Illumina, San Diego, CA] 135
6 K.R. Rosenbloom, et al., “ENCODE whole-genome data in the UCSC Genome Browser,” Nucleic Acids Res., 38(1): D620-5, 2010. [U. Calif., Santa Cruz; Queensland Facil. Adv. Bioinformat, Brisbane, Australia; Washington U., St. Louis, MO] 102
7 F. Denoeud, et al., “Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions,” Genome Res., 17(6): 746-59, 2007. [7 institutions worldwide] 95
8 ENCODE Project Consortium, “A user’s guide to the Encyclopedia of DNA Elements (ENCODE),” PLOS Biology, 9(4): No. 1001046, 2011. [59 institutions worldwide] 89
9 R.D. Hernandez, et al., “Demographic histories and patterns of linkage disequilibrium in Chinese and Indian rhesus macaques,” Science, 316(5822): 240-3, 2007. [9 US and Danish institutions] 75
10 D.Y. Zheng, et al., “Pseudogenes in the ENCODE regions: Consensus annotation, analysis of transcription, and evolution,” Genome Res., 17(6): 839-51, 2007. [12 institutions worldwide] 73
SOURCE: Thomson Reuters Web of Science


Paper #3, “An integrated encyclopedia of DNA elements in the human genome,” is the tip of an iceberg produced by The ENCODE Project Consortium. The Consortium started off with a pilot project in 2003, almost before the ink was dry on the first blueprint of the human genome, and by 2007 had soundly knocked on the head the idea that the bulk of the genome is in some sense “junk” (see ScienceWatch, May/June 2008). That view, which stems from the now-ancient discovery that perhaps as much as 98% of the DNA sequence does not code directly for proteins, has long been a problem for evolutionary biologists, who struggled to work out why, if it had no function, all that extra DNA existed. Moving swiftly on from its original proof of concept, based on just 1% of the human DNA sequence, the ENCODE Project adopted a raft of the most up-to-date experimental and computational techniques, some of them not even thought of at the end of the pilot phase, “to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions.”

For example, genome-wide association studies (GWAS) have come into their own as rapid DNA sequencing and computational power combined to allow researchers to pinpoint areas of the genome associated with complex diseases, such as multiple sclerosis, systemic lupus erythematosus, autism, and schizophrenia. But while DNA markers, known as GWAS SNPs, have been identified, almost 90% of those markers fall between known genes or in regions within genes that do not contribute to the final gene product. The ENCODE Project examined almost 5,000 of these GWAS SNPs, and found that they tend to be much more common in genome regions that are under the control of one or another of the many transcription factors that modify gene expression. GWAS SNPs are also associated with so-called DNase I hypersensitive sites (DHS), areas where the tightly coiled chromatin structure that keeps stretches of DNA inactive is open for business, as it were, and where genes can be transcribed. The Consortium concludes that “an appreciable proportion of SNPs identified in initial GWAS scans are either functional or lie within the length of an ENCODE annotation (~500 bp on average) and represent plausible candidates for the functional variant.” This will encourage researchers hoping to use GWAS SNPs to look for the root causes of disease.

This is just one example of the way in which the ENCODE project has sought to show that there is no junk DNA, a claim that has resulted in acrimonious debate. One side in the debate challenges ENCODE’s definition of “functional,” saying, in essence, that just because a stretch of DNA is transcribed or regulates transcription, that does not mean it is functional. The discussion, as so often, revolves around what different people mean by “functional.” At its weakest, DNA may be functional if it is involved in some sort of biochemical activity. At its strongest, that biochemical activity is exposed to natural selection, such that variants have differential reproductive success. The kind of selection that gets rid of variants is called purifying selection, and, depending on your sources, between 3 and 10% of bases in the human genome are constrained in this way by evolution. ENCODE studied these regions, and found that many of the elements peculiar to the primate evolutionary line fall within its definition of functional regions. That is not surprising.

The whole discussion of whether ENCODE’s results “prove” that the concept of junk DNA can be consigned to the dustbin of history harks back to the distinction between junk and garbage made in 1998 by Sydney Brenner, a 2002 Nobel laureate in Medicine. To paraphrase, junk is useless but harmless, and we keep it because it may come in handy. “[J]unk that takes up too much space, or is beginning to smell, is instantly converted to garbage by one’s wife, that excellent Darwinian instrument.”

On this basis, much of the DNA could still be junk in Brenner’s sense, despite ENCODE’s claims, and that debate, even though it concerns one of the claims most associated with ENCODE, threatens to obscure ENCODE’s real importance. Any search for how DNA works, and especially what happens when it doesn’t work properly, is going to be easier with the ENCODE data than without.

Dr. Jeremy Cherfas is a science writer based in Rome, Italy.

The data and citation records included in this report are from Thomson Reuters Web of Science™. Web of Science™ is a registered trademark of Thomson Reuters. All rights reserved.