THE NIH CATALYST    MAY – JUNE 2007

 
 
 

dbGaP, ClinicalTrials.gov, CellMiner, et al.

SOFTWARE HEROES: NIH HOMEGROWN TREASURIES
ENRICHING SCIENTISTS THE WORLD OVER

 

by Christopher Wanjek

 

Every day more than two million users of the National Library of Medicine website download over 3.5 terabytes of data. That adds up to an entire Library of Congress’ worth of information delivered every three days.

The website gets over 3,200 hits a second, and none of it for vapid gossip, unless one considers yeast cell division akin to the latest Hollywood scandal.

Draw up a list of top NIH intramural achievements—vaccine and chemotherapy development, PET imaging, early HIV work—and there’s a good chance you might forget to add software and database development. Yet this is the backbone of much of the basic biomedical research that flows from NIH.

No one is likely to win a Nobel Prize for developing such tools, but the next Nobel Prize in Physiology or Medicine will surely have relied on this homegrown innovation.

Modern Stacks: Computer room in the National Library of Medicine, including servers for PubMed

Some Heavy Hitters

PubMed is perhaps the most famous of NLM’s databases, with more than 17 million citations from more than 19,000 life-science journals dating back to 1950.

The National Center for Biotechnology Information (NCBI) is the arm of the NLM tasked with managing the PubMed retrieval system and the increasing volume and complexity of raw scientific data. The gene repository GenBank and the BLAST gene-search tool are two other NCBI gems that have enabled the genomic revolution.

The NLM’s Lister Hill National Center for Biomedical Communications maintains the largest trial registry—ClinicalTrials.gov—with 36,249 studies from nearly 140 countries. As highlighted in the May 16, 2007, issue of JAMA, ClinicalTrials.gov can serve as a standardization tool to ease the widespread problem of incomplete or delayed reporting of clinical trial results.

Over at NCI, a smaller enterprise, the Genomics and Bioinformatics Group, has developed its own set of software, the Miner Suite, focused on cancer research yet tied into NCBI’s vast web of databases.

The NCI Cancer Genetic Markers of Susceptibility (CGEMS) project has developed the data architecture and analytical pipeline to conduct genome-wide association studies (GWAS), which can typically include more than 1.5 billion data points. Currently, several GWAS have incorporated components of the data analysis pipeline. CGEMS studies in breast and prostate cancer have appeared in two recent publications in Nature Genetics.

In the coming months, NCBI and NCI plan not only to make these resources richer and more integrated but also to better educate the scientific community on how to use them.

Organizing the Organizing

As popular as NCBI’s tools are, by and large their users fail to see the dazzling interconnectivity, says Jim Ostell, chief of the NCBI Information Engineering Branch. Ostell’s team often demonstrates for researchers how to narrow a literature search on a human disease and then follow the links: to GenBank for annotated text about a newly discovered gene mentioned in an article, onward to nucleotide or taxonomy records, until ultimately one might stumble upon the same gene and its function in a microorganism.

"The almost uniform response from the audience is, ‘Oh, I didn’t know you could do that,’" said Ostell, who like a maestro can produce a symphony of data within seconds with his frenzied keystrokes and mouse movements. "You can get to these things by clicking down the links, but you have to kind of know that they’re there."
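The link-following that Ostell demonstrates by hand can also be scripted through NCBI’s E-utilities interface (esearch to run a text query, elink to hop from records in one database to related records in another). The minimal sketch below only constructs the request URLs rather than fetching them; the query term and PubMed ID are illustrative, not taken from the article.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term):
    """URL for an Entrez text search, e.g. a disease query against PubMed."""
    return f"{EUTILS}/esearch.fcgi?{urlencode({'db': db, 'term': term})}"

def elink_url(dbfrom, db, uid):
    """URL that follows Entrez links from a record in one database to
    related records in another, e.g. a PubMed article -> nucleotide entries."""
    return f"{EUTILS}/elink.fcgi?{urlencode({'dbfrom': dbfrom, 'db': db, 'id': uid})}"

# Narrow a literature search on a human disease ...
search = esearch_url("pubmed", "cystic fibrosis CFTR")
# ... then hop from one resulting article (PMID illustrative) to linked sequences.
links = elink_url("pubmed", "nuccore", "2475911")
```

The same pattern extends to any pair of Entrez databases, which is what makes the "symphony of data" possible: each hop is just another elink call.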

From such audience feedback, as well as by tracing search-request patterns, the NCBI came to realize that most users don’t go beyond retrieving top-level results from their query. In other words, scientists are using the database retrieval system more like Google, typing and retyping queries. To counter this, NCBI has launched the Discovery Initiative with the goal of making the user more aware of related data.

"The data in molecular biology are growing exponentially," Ostell said. "How do we build an information resource that not only can handle exponential growth but can also use it in some kind of positive way, as opposed to being basically destroyed by it?"

One of the first elements of the Discovery Initiative to be implemented is the prominent default display of Abstract Plus, which uses an advanced matching program to find articles similar to the one the user highlights and displays their titles neatly to the right of the search results. Abstract Plus is built on keywords and biological relationships, running circles around simple popularity matches on sites such as Amazon.com that merely provide Dylan fans a link to Grateful Dead merchandise.
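PubMed’s actual related-articles matching rests on far more sophisticated weighted term statistics, but the core idea, scoring documents by how much vocabulary they share with a reference text, can be sketched in a few lines. The query and abstracts below are invented for illustration.

```python
import math
from collections import Counter

def tokens(text):
    return [w for w in text.lower().split() if w.isalpha()]

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

query = "yeast cell division cycle genes"
abstracts = {
    "A": "checkpoint genes in the yeast cell division cycle",
    "B": "seismic imaging of oceanic crust",
}
# Rank candidate abstracts by similarity to the highlighted one.
ranked = sorted(abstracts, key=lambda k: cosine(query, abstracts[k]), reverse=True)
```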

Abstract Plus has existed for years, but its new prominence has resulted in about 30 percent of the users clicking on a related link, compared with only 3 percent before. Similarly, NCBI will soon highlight another cloaked feature—sophisticated searches based on related biological function, not text.

NCBI Director David Lipman, the originator and co-developer of BLAST, describes the Discovery Initiative as key to "making the kinds of connections that underlie the discovery process," bridging the collective knowledge of the genome over billions of years of evolution and bringing together scientists who otherwise do not read the same journals, go to the same meetings, or specialize in the same organism or disease.

The Next Big Thing: dbGaP

Jim Ostell, chief of the NCBI Information Engineering Branch
John Weinstein, NCI lab chief and head of the Genomics & Bioinformatics Group

NCBI’s latest creation, dbGaP, short for the database of genotype and phenotype, will enable comparisons of the genome-wide association studies expected to dominate genetic research. This database is designed to serve as a central repository for archiving and distributing genotype and phenotype data and can provide analyses of the level of statistical association between genes and selected phenotypes.
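At their simplest, the genotype-phenotype associations such a database reports come down to standard statistical tests on contingency tables of allele counts versus phenotype groups. As a purely illustrative sketch with made-up counts (dbGaP’s actual analysis pipeline is far more elaborate), a Pearson chi-square statistic can be computed from scratch:

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table
    (rows = genotype groups, columns = phenotype counts)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = row[i] * col[j] / total  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# Hypothetical counts: carriers vs. non-carriers of a risk allele,
# split into cases and controls for some phenotype.
counts = [[90, 10],   # carriers:     90 cases, 10 controls
          [40, 60]]   # non-carriers: 40 cases, 60 controls
stat = chi_square(counts)
```

A large statistic (here about 55 on one degree of freedom) flags a genotype-phenotype association worth following up; a genome-wide study repeats such a test at hundreds of thousands of markers.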

"I think dbGaP is the single most exciting project we have, because it is connecting the promise of all the investment in the human genome to trying to come up with clinical information from it," Ostell said.

Currently dbGaP contains only two studies: the NEI Age-Related Eye Disease Study and the NINDS Parkinsonism Study. Starting in June, dbGaP takes a massive step forward with the gradual addition of several major projects: Genetic Association Information Network (GAIN); Genetics and Environment Initiative (GEI); the Framingham Heart Study; the Women’s Health Study; ongoing NINDS studies on stroke, epilepsy, and ALS; medical resequencing studies from NHGRI that pinpoint rare mutations causing rare diseases; and kidney data from NIDDK.

By year’s end, NCBI hopes to have thousands of human genomes archived for comparison, perhaps finding commonality in seemingly unrelated diseases and giving old data a new life.

"Framingham started in the ’40s," said Ostell. "By this step, Framingham moves into the molecular age . . . . That’s the molecular biology revolution. All the phenotype ideas you had—suddenly you have a paradigm shift when you go through sequence data."

Cancer Tools

Among the bioinformatic enterprises that make extensive use of NCBI’s tools is the Genomics & Bioinformatics Group, headed by John Weinstein, of the NCI-CCR Laboratory of Molecular Pharmacology. The group has developed the Miner Suite of web-based software, which, as the name implies, focuses on data mining. The Miner Suite tools are generic by design yet widely used by cancer researchers.

SpliceMiner, for example, is a web interface for working with data from NCBI’s Entrez Gene and Evidence Viewer tools to analyze splice variants, which may pop up in a microarray experiment. Cancer has been referred to as a disease of splicing. That is probably an overstatement, Weinstein says, but SpliceMiner does provide a solid, user-friendly platform for figuring out what roles splicing really does play in the disease.

MedMiner expedites PubMed searches for gene-gene and gene-drug relationships; GoMiner leverages the Gene Ontology project to provide functional interpretation for microarray experiments; and CIMminer creates the clustered heat maps that have become the ever-present visual icon of "postgenomic" research.
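The clustered heat maps CIMminer produces depend on reordering the rows and columns of a data matrix so that similar profiles sit side by side before coloring. The toy sketch below, with hypothetical expression values, shows only that reordering idea, using a greedy nearest-neighbor pass rather than the full hierarchical clustering a tool like CIMminer performs.

```python
import math

def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_order(rows):
    """Greedy nearest-neighbor ordering: start from row 0, repeatedly
    append the closest remaining row, so similar rows end up adjacent."""
    remaining = list(range(1, len(rows)))
    order = [0]
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining, key=lambda i: dist(rows[i], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy gene-expression matrix: rows = genes, columns = samples.
genes = [[1.0, 1.1, 0.9],   # gene 0: low expression
         [5.0, 5.2, 4.8],   # gene 1: high expression
         [1.1, 0.9, 1.0],   # gene 2: similar to gene 0
         [5.1, 4.9, 5.0]]   # gene 3: similar to gene 1
order = greedy_order(genes)  # low-expression genes end up next to each other
```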

Weinstein is now working with Steven Chanock, a senior investigator in the NCI-CCR Pediatric Oncology Branch, and with others to apply the Miner software and databases to the genome-wide association studies and dbGaP.

For example, if a particular chromosome region shows up as important in the association between mutations and a particular type of cancer, the question becomes, "Which gene in that region is driving the association, and which genes are just passengers?" Weinstein said.

When that question arises, CellMiner databases can find out which genes are overexpressed, duplicated, rearranged, or otherwise abnormal, suggesting "impaired-driver status," Weinstein said. CellMiner includes molecular profiles at the DNA, RNA, protein, chromosomal, and small-molecule levels, reflecting the richest, most varied molecular profiling of any set of cells in existence.

The compilation reflects what Weinstein has termed "integromics." The integromic hypothesis posits that looking at the cancer cell from many different molecular angles can yield additional insight into the biology and pharmacology. Much of CellMiner’s data are on 60 diverse cancer cell types—the NCI-60—used by the NCI’s Developmental Therapeutics Program to screen more than 100,000 chemical compounds and natural products.

Data Quality

With advances in bioinformatics, Weinstein hopes to strike a balance between the observation-driven biology of Linnaeus and Darwin and the hypothesis-driven research that dominated the latter half of the 20th century. Access to information on tens of thousands of genes, hundreds of thousands of splice variants, and millions of protein states has placed researchers once more in the role of taxonomists. Yet real value will be realized, Weinstein said, when scientists can integrate all this information smoothly and meaningfully into hypothesis generation and testing.

We are now at a crossroads, Weinstein said, where the data are given short shrift. Genomic data are collected, but perhaps half a year passes as the researchers attempt to find a hypothesis-driven story worthy of journal publication, and another half a year passes as that hypothesis is validated. In the end, the data release is delayed and "the tail ends up wagging the dog," Weinstein said, with an article focusing on downstream hypothesis testing related to one or a few of the genes.

"There are many kinds of contributions that intramural scientists can make in addition to curing a disease," Weinstein said. "Less attention is given to the kinds of research based on databases and bioinformatics. Very often, those contributions set the table for other researchers focused on particular genes or disease states."

Weinstein has faced such prejudice from editors in publishing data from the NCI-60 cancer cell lines, with submissions rejected because they lacked a hypothesis. The audience certainly exists, Weinstein said. Six out of seven of his group’s most influential papers in the last 10 years were initially rejected at least once, yet they have collectively garnered thousands of literature citations.

Clement McDonald, director, NLM Lister Hill National Center for Biomedical Communications

Can We Talk?

Acute attention must be paid to interoperability to avoid the Babel effect—locally invented and idiosyncratic codes—the main barrier to deploying fluent database-management tools, according to Clement McDonald, director of NLM’s Lister Hill National Center for Biomedical Communications.

NLM Director Donald Lindberg led the development of the Unified Medical Language System (UMLS) at NIH in the early 1980s, foreseeing the necessity to retrieve information from disparate sources, syntaxes, and vocabularies. The UMLS now maintains a list of more than five million names for more than a million concepts, as well as more than 12 million relations among these concepts.
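The Metathesaurus idea at the heart of the UMLS, many surface names resolving to a single concept, can be pictured as a synonym table. The sketch below uses invented concept identifiers in place of real UMLS CUIs, and a real system would of course handle far more than exact lowercase matches.

```python
# Toy synonym table in the spirit of the UMLS Metathesaurus:
# many surface names map to one concept identifier.
# The concept IDs and term list here are illustrative, not real CUIs.
SYNONYMS = {
    "heart attack": "C:MI",
    "myocardial infarction": "C:MI",
    "mi": "C:MI",
    "acetylsalicylic acid": "C:ASPIRIN",
    "aspirin": "C:ASPIRIN",
}

def normalize(term):
    """Map a free-text term to its concept ID, ignoring case and whitespace."""
    return SYNONYMS.get(term.strip().lower())
```

With such a mapping in place, two records coded with different local vocabularies can be compared concept by concept, which is exactly the Babel problem the UMLS was built to address.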

Two offspring of the UMLS concept are the homegrown RxNorm, helping researchers and consumers make sense of myriad pharmaceutical trade names and chemical names, and LOINC, short for Logical Observation Identifiers, Names, and Codes, to standardize the language of laboratory observations.

In this spirit, Lister Hill maintains ClinicalTrials.gov, originally created to inform the public of NIH-funded clinical trials but now serving as an even more important tool for researchers. Although not all the pieces are fully in place, ClinicalTrials.gov is positioned to meet policy initiatives calling for the development of a database of trial results that is informative and current and meets ethical concerns about consent and privacy, according to Deborah Zarin, a senior scientist at Lister Hill and lead author of the May 16 JAMA report.

The investment in database management tools will be wasted if users cannot access all that is available, a situation analogous to providing workers BlackBerrys just to tell time. The designers must continuously receive input from the end users to assess usability, an obvious but remarkably overlooked principle in systems design, McDonald said.

The NLM is constantly reinventing itself, meeting the demands of data influx but also of the user. McDonald calls the book Wicked Problems, Righteous Solutions: A Catalogue of Modern Software Engineering Paradigms by Peter DeGrace and Leslie Hulet Stahl required reading for anyone who designs such systems. Too often, he said, the user is blamed for a poorly engineered system.

The Web

As NCBI takes researchers from abstract to zygote, Ostell strives to provide content that informs rather than distracts, for much of the Internet has lost its web-ness and become a series of treacherous and tedious one-way streets.

There are "underlying principles behind the projects that we take on and the way that we put them together," Ostell said. "We try to hit that gray zone where things are in transition from research to resource. We have to be there as they move out of the research phase."

NCBI’s web is captured by an animated diagram at this website.

NCI’s Miner Suite is detailed at this website.

