
Mind the Gaps: Ben Busby on the Future of Variant Annotation

[Image: Knowledge galaxy illustration]

Despite the massive strides made in genomic science and AI over the last two decades, no single variant annotation tool offers a comprehensive solution for every research and clinical need, even though robust, reliable tools are widely available and routinely used for high-throughput annotation. Genomic data is more accessible than ever, with sequencing costs plummeting and multigene panels becoming routine in clinical care. Yet when it comes to accurately interpreting the variants those panels identify, especially in the context of complex diseases, we are, in many ways, still in the dark.

To understand why, I spoke with Dr. Ben Busby, a veteran of computational genomics with two decades of experience spanning model organism research, public genomic data stewardship at NCBI, and, more recently, industry partnerships at NVIDIA. Today, as the Genomics Alliances Manager at NVIDIA, Busby helps industry partners design and scale omics solutions.

At the start of the interview, Busby made one thing clear: current tools dramatically underserve the majority of human disease. “Some estimates say that 8% of diseases have single-locus genomic predispositions,” he said. “That means 92% of human disease is essentially untouched by resources like ClinVar or ClinGen.” While these resources are vital for well-characterized Mendelian conditions, they simply don’t capture the whole landscape of disease risk, especially when it comes to polygenic, context-dependent, or environmentally modulated phenotypes.

The challenge of variant annotation is not a lack of tools but the fact that most existing tools operate in silos. Hundreds of annotation tools exist, yet very few attempt to integrate their layers of evidence into a single interpretive ecosystem. Instead, researchers are left to piece together disparate outputs, often manually, and to reason about system-level effects in a landscape where distinguishing signal from noise remains a significant difficulty.
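To make that friction concrete, here is a minimal, purely illustrative Python sketch of the manual piecing-together researchers often end up doing: reconciling per-tool output files by variant key. The tool names, file names, and column layout are assumptions made for illustration, not a prescribed workflow.

```python
# Illustrative sketch only: reconciling outputs from siloed annotation tools by hand.
# File names, tool names, and the (chrom, pos, ref, alt) column layout are assumptions.
import csv
from collections import defaultdict


def load_annotations(path, tool_name):
    """Read one tool's TSV output, keyed by (chrom, pos, ref, alt)."""
    records = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            key = (row["chrom"], row["pos"], row["ref"], row["alt"])
            # Prefix every field with the tool name so columns from different
            # tools do not collide when merged.
            records[key] = {
                f"{tool_name}:{col}": val
                for col, val in row.items()
                if col not in ("chrom", "pos", "ref", "alt")
            }
    return records


def merge_by_variant(per_tool_records):
    """Combine every tool's fields into a single row per variant."""
    merged = defaultdict(dict)
    for records in per_tool_records:
        for key, fields in records.items():
            merged[key].update(fields)
    return merged


if __name__ == "__main__":
    # Three hypothetical tools, three separate files, one manual merge step.
    merged = merge_by_variant([
        load_annotations("clinical_tool_output.tsv", "clinical"),
        load_annotations("conservation_tool_output.tsv", "conservation"),
        load_annotations("regulatory_tool_output.tsv", "regulatory"),
    ])
    print(f"{len(merged)} variants with combined annotations")
```

Even this toy version hints at the real burden: every new tool means another parser, another join key to verify, and another set of columns whose meaning the researcher must reconcile by hand.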

This fractured approach is not only inefficient; it is also incomplete. Busby emphasized that modern annotation must go beyond genomic coordinates and variant-level predictions.

Comprehensive variant annotation must incorporate systems biology, reflecting how disease typically arises from networks of interacting variants shaped by background genetic architecture, ancestry, environmental exposures, and epigenetic influences.

“Annotation isn’t just about marking up a genome,” he said. “It’s about understanding systems, from structure to ethnicity, from environment to deep molecular context.”

He recalled preliminary experiments by students at Carnegie Mellon who uncovered distinct population-specific clusters in the TNF-alpha region, a finding invisible to traditional annotation tools. These results, though early, underscore the importance of incorporating ancestry-informed and structure-aware models into any annotation pipeline. Ignoring population structure and multigenicity is like trying to diagnose a city’s traffic problem by looking at one intersection at a time.

So why don’t more people use integrated tools? “The challenge is not just reluctance,” Busby explained, “but the real-world complexity of launching, managing, and maintaining multiple tools with diverse dependencies, documentation standards, and interoperability issues.”

He referenced a 2017 talk by Steven Brenner, which highlighted how combining multiple annotation tools leads to better interpretive outcomes. Even when researchers recognize the need for layering information, the friction of managing inputs, outputs, and software dependencies can be overwhelming.

In Busby’s view, the way forward is clear: automation, modularity, and AI-assisted orchestration. He envisions a system of distributed, intelligent agents, each tasked with a specific annotation domain, such as phenotype association, enhancer mapping, or proteomic correlation. These agents would operate semi-autonomously, launching the appropriate tools, aggregating their outputs, and feeding results into a shared structure: a knowledge graph that doesn’t just store data, but actively validates and updates relationships as new evidence emerges.
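That description suggests a rough shape for such a system. The sketch below is a toy Python rendering of the idea, not an existing implementation: the agent classes, edge fields, confidence values, and the placeholder variant and entity identifiers are all assumptions made for illustration.

```python
# Toy sketch of agent-based annotation feeding a shared knowledge graph.
# All names, scores, and identifiers below are placeholders, not real results.
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Holds (variant, relation, entity) edges together with their evidence."""
    edges: list = field(default_factory=list)

    def assert_edge(self, variant, relation, entity, evidence, confidence):
        self.edges.append({
            "variant": variant,
            "relation": relation,
            "entity": entity,
            "evidence": evidence,
            "confidence": confidence,
        })


class PhenotypeAgent:
    domain = "phenotype association"

    def annotate(self, variant, graph):
        # In a real system this would launch a phenotype-association tool and
        # parse its output; here it just records a placeholder edge.
        graph.assert_edge(variant, "associated_with", "phenotype:placeholder",
                          evidence=self.domain, confidence=0.4)


class EnhancerAgent:
    domain = "enhancer mapping"

    def annotate(self, variant, graph):
        # Placeholder for launching and parsing an enhancer-mapping tool.
        graph.assert_edge(variant, "overlaps", "enhancer:placeholder_region",
                          evidence=self.domain, confidence=0.7)


def orchestrate(variant, agents):
    """Fan a variant out to every agent and collect their evidence edges."""
    graph = KnowledgeGraph()
    for agent in agents:
        agent.annotate(variant, graph)
    return graph


if __name__ == "__main__":
    graph = orchestrate("chr1:12345:A>G", [PhenotypeAgent(), EnhancerAgent()])
    for edge in graph.edges:
        print(edge)
```

The interesting work lies in the parts this sketch skips: how each agent schedules and parses real tools, and how the graph re-validates or retires edges as new evidence arrives.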

“If I had unlimited resources,” he said, “I’d build a multi-agent swarm, a distributed system of tools trained to interpret genomic, phenotypic, proteomic, and environmental data across different contexts.” He described the final product as a dynamic, navigable “knowledge galaxy”: a conceptual space where different branches of biological insight orbit around shared anchors, validating one another and illuminating relationships between distant data points.

Toward the end of our conversation, Busby offered one of the most evocative metaphors I’ve heard for the current state of variant annotation. “Think of it like an octopus reading a book,” he said. “Each arm is picking up part of the story. One’s got the clinical data, another’s pulling from proteomics, another from enhancer predictions, but they’re not coordinated yet. We need a brain to bring them together and understand the plot.”

This image of a multi-armed octopus, intelligent but uncoordinated, surrounded by the swirling data of a knowledge galaxy, perfectly captures both the promise and the fragmentation of modern genomics. We’ve built powerful tools, each capable of grasping a specific aspect of the problem. Until we develop systems that can synthesize those insights into something actionable, we’ll remain stuck with partial answers.

Some projects, like OpenCRAVAT, have begun to move in this direction. OpenCRAVAT, an open-source platform developed at Johns Hopkins, lets users build custom annotation pipelines by selecting from a library of annotators spanning clinical, molecular, population, and regulatory domains. Busby praised it as “a great starting point,” noting that its modular framework allows users to mix and match tools according to their needs. Even so, this approach is only a step toward the kind of intelligent, adaptive infrastructure the field ultimately needs.
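For readers who want to try this kind of layering themselves, the hedged sketch below drives OpenCRAVAT from Python by shelling out to its `oc` command line. The flags and annotator module names follow OpenCRAVAT’s documented usage as best I recall; treat them as assumptions and confirm against `oc run --help` and the module store before relying on them.

```python
# Hedged sketch: invoking OpenCRAVAT's command-line interface from Python.
# The flags (-l, -a, -t) and module names (clinvar, gnomad, cosmic) are assumed
# from OpenCRAVAT's documentation; verify them against your installed version.
import subprocess


def run_opencravat(vcf_path, annotators, genome="hg38"):
    """Run a chosen set of OpenCRAVAT annotators on a single VCF."""
    cmd = ["oc", "run", vcf_path, "-l", genome, "-a", *annotators, "-t", "text"]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Mixing clinical, population, and cancer annotators is the kind of
    # layering Busby describes; swap in whichever modules suit your project.
    run_opencravat("cohort.vcf", ["clinvar", "gnomad", "cosmic"])
```

The appeal of the modular design is that the pipeline changes by editing a list of module names rather than rewriting parsers, which is exactly the friction the earlier sketch illustrated.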

Busby also highlighted proteomics as a crucial dimension of interpretation. In Alzheimer’s research, for example, proteomic profiling has identified at least five distinct molecular subtypes of the disease, each with its own trajectory and potential therapeutic implications. “We have a foothold in proteomics to deconvolute the relationship between SNPs and disease,” he said. The same is true in other domains: for example, triple-negative breast cancer, which resists classification through traditional genomic methods alone, is likely to yield more answers when multi-omic approaches are applied.

Ultimately, Busby’s message was one of optimism, not frustration. “There’s real hope that in five years we might understand 80% of diseases,” he said. “But it’s going to take multi-modal data.” The work ahead involves not just building better tools, but building better systems. He hopes for a system that recognizes the inherently layered nature of biology and allows researchers to explore those layers. In a field often defined by its silos (bioinformatics, clinical genomics, systems biology, population genetics), Busby’s call to action is refreshingly integrative. The octopus has the limbs. The galaxy of knowledge is there, waiting. Now we just need the coordination to bring them together.

We would like to thank Pankaj Vats and everyone else at NVIDIA who contributed to editing the final blog post.