Breaking out of bioinformatic data silos

The oxbow effect

Trevor Manz, Garrett Ng, and Nezar Abdennur

May 16, 2023

Specialized file formats create a lot of friction for computational biologists. We present a solution called oxbow: a data unification bridge to unlock native bioinformatics data for greater accessibility and high-performance analytics. Join us!

Image of oxbow lakes — An oxbow lake forms when a winding river **redirects to a straighter path**, leaving behind a unique, cut-off body of water.

It’s an exciting, yet challenging time to be a data scientist in biology. Over the last decade, we have seen revolutionary developments in genomic and multiomic technologies, accompanied by immense growth in the field traditionally known as bioinformatics. With the onset of the genomic information age, the bioinformatics community has invented numerous highly specialized file formats to manage the continuous influx of data. Despite their ubiquity, many of these forward-thinking data storage technologies have been largely insulated from a rapidly evolving ecosystem that is transforming modern data science and AI/ML. To illustrate the issue succinctly, a 2015 career profile brochure estimates that bioinformatics scientists spend half of their work day converting data between file formats [1]. How riveting!

A complicated web of file conversions to and from various specialized file formats in bioinformatics. ref: https://raw.githubusercontent.com/bioconvert/bioconvert/main/doc/conversion.png — A snapshot of just some of the tangled web of bioinformatic format copy & conversion. See the bioconvert project (image source) for more details.

Specialized file formats create a lot of friction for computational biologists. The most critical file formats for high-volume genomic data are so bespoke that they are only accessible through file manipulation libraries written in low-level languages. Researchers wanting to interface with such files must choose between grappling directly with these libraries or settling for alternatives that are more intuitive, but much less efficient. Indeed, a considerable amount of code in bioinformatic software is dedicated to reading and writing these files (and plenty of custom ones, too!). In turn, it’s commonplace for users to contort their data to accommodate their tools’ requirements. As a result of this friction, most users – especially newcomers and those with less technical backgrounds – conceptualize their analyses through “off-the-shelf” building blocks, shuffling bioinformatics data through meandering workflows rather than directly accessing problem-specific information.

This image depicts the complexity of transitioning from specialized file formats to a variety of applications. At the bottom of the image, there are diverse bioinformatics file formats, which are the base of the data transformation process. In the middle, there's a visual representation of complexity, symbolizing the intricate process of data manipulation and transformation required to make these formats usable. This cloud of complexity encapsulates the challenges and technical barriers encountered during this transformation process. At the top, three categories of applications are displayed, signifying the destinations for the transformed data: 'Exploratory Analysis & AI/ML', 'Cross-Language Applications & Web Services', and 'Cloud Computing & Data Pipelines'. These categories represent the broad spectrum of tools and platforms in which the data can be utilized once the complexity of file formats has been navigated.

The friction from data silos creates downstream inefficiencies and stifles innovation. Computational methods, algorithms, and models become confined within the boundaries dictated by data formats. However, these constraints are not direct consequences of the formats themselves, but rather stem from a fragmented ecosystem of software that tightly couples specialized I/O (i.e., reading and writing data) to analytics. Addressing this situation requires enhancing accessibility, interoperability, and reusability of our file formats to empower a new and wider community with direct access to their underlying data, all while respecting their critical role in existing workflows and data sharing practices.

We present oxbow: a data unification layer to unlock native bioinformatics data for high performance analytics. Oxbow is not a new file standard; instead, it is a low-level bridge designed to complement existing workflows and democratize data science innovation in genomics.

The key to oxbow is the realization that most specialized bioinformatic formats are just big tables in disguise. Beneath their disparate on-disk representations lies a fundamentally tabular structure – a commonality often obscured by format-specific byte-layouts and clever mechanisms to compress and enable more efficient access to a large number of records. Oxbow materializes this common structure by channeling data from specialized formats into a unified in-memory representation. In other words, oxbow exposes a cross-language ‘API for multiomics data’.
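To make the "big tables in disguise" idea concrete, here is a minimal sketch in plain Python that pivots a few lines of SAM text (the plain-text sibling of BAM) into named columns. The two records are made up, and `sam_to_columns` is a hypothetical helper written for illustration; it is not how oxbow or noodles parse files, and it ignores optional tags.

```python
# Illustrative only: a hand-rolled pivot of SAM text into columns.
# This is NOT oxbow's or noodles' implementation.

# The 11 mandatory SAM fields, in order.
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]
INT_COLUMNS = {"flag", "pos", "mapq", "pnext", "tlen"}

# Two made-up alignment records (tab-separated, mandatory fields only).
SAM_TEXT = (
    "@HD\tVN:1.6\n"
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n"
    "read2\t16\tchr1\t250\t42\t4M1I3M\t*\t0\t0\tTTGACCAA\tIIIIHHHH\n"
)


def sam_to_columns(text):
    """Return a dict mapping each mandatory field name to a list of values."""
    columns = {name: [] for name in SAM_COLUMNS}
    for line in text.splitlines():
        # Skip blank lines and header lines, which start with "@".
        if not line or line.startswith("@"):
            continue
        # Keep only the mandatory fields; optional tags are ignored here.
        fields = line.split("\t")[:len(SAM_COLUMNS)]
        for name, value in zip(SAM_COLUMNS, fields):
            columns[name].append(int(value) if name in INT_COLUMNS else value)
    return columns


table = sam_to_columns(SAM_TEXT)
mean_mapq = sum(table["mapq"]) / len(table["mapq"])
print(table["pos"], mean_mapq)
```

Once records are arranged this way, a question such as "what is the mean mapping quality?" reads a single column rather than every field of every record.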

At the heart of oxbow’s design is Apache Arrow: a standardized in-memory representation of tabular-like data for high-performance computing. Its language-agnostic nature has spurred widespread adoption across various computational fields. Arrow provides a columnar layout that is optimized for computational performance, enabling analytics libraries to be parallelized by default and eliminating the need to fit entire datasets into memory. Moreover, it accelerates data sharing between tools and over a network, promoting the reuse of algorithms across libraries and languages.

In addition to Arrow, a confluence of other recent technical innovations has paved the way for oxbow. Principally, efficient file manipulation libraries have emerged in modern low-level programming languages. These libraries not only provide performance-focused and memory-safe alternatives for building other command line tools, but the corresponding language ecosystems more readily integrate with higher-level languages common to data science. The Rust programming language stands out in this landscape, offering modern development tooling, managed packaging, and excellent Python and R integration. We built oxbow on top of noodles, a standards-compliant, independent Rust implementation of readers and writers for the big bioinformatic formats that are stewarded by the Global Alliance for Genomics and Health (GA4GH). Oxbow serves as a minimal layer to bridge file formats to Arrow, empowering other higher-level tools to take advantage of this tabular mapping.

Since Arrow is cross-language, we can seamlessly connect our Rust-based bridge to the most popular computational data frame libraries in use today. Here’s an example of oxbow in action in both Python and R:

Python

import oxbow as ox

# Read one genomic region of the BAM file as Arrow IPC data
arrow_ipc = ox.read_bam("sample.bam", "chr1:1-1000000")

# Read into polars
import polars as pl
df = pl.read_ipc(arrow_ipc)

# Read into pandas
import io
import pyarrow.ipc
df = pyarrow.ipc.open_file(io.BytesIO(arrow_ipc)).read_pandas()

Python Jupyter Notebook loading a BAM file as a data frame using oxbow.

R

# Read into native R data.frame
arrow_ipc <- oxbow::read_bam("sample.bam", "chr1:1-1000000")
df <- arrow::read_ipc_file(arrow_ipc)
df

RStudio session loading BAM file as a dataframe with oxbow.

With oxbow, loading genomic data into computational notebooks, web apps, and other applications becomes effortless. This allows developers and data engineers to focus on innovating and scaling their applications, rather than wrestling with file formats. Researchers and data scientists can efficiently query and access data directly using familiar tools, circumventing the need for file converters, redundant code, or expertise in low-level storage details. Reads, alignments, variants, even entire DNA sequences, can really just be (structured) data!

Though still in its nascent stages, oxbow has already shown promising potential – and we are just getting started! Our initial open-source proof-of-concept was so compelling it immediately inspired derived works, adding to our excitement about the project’s future.

We are eager to share our vision for oxbow and invite you to be part of our collaborative journey. No matter your background – be it as a student, researcher, data scientist, developer, TechBio enthusiast, or simply interested party – we warmly welcome your involvement. Join our community, and let’s collectively shape the future of bioinformatics!

[1] Surely an underestimate!

Acknowledgements: We thank Justine Pinskey for writing support. We also thank Vedat Yilmaz, Michael Macias, Isaac Virshup, Danila Bredikhin, Peter Kerpedjiev, Simon Grosse-Holz, Aleksandra Galitsyna, and Eeshit Dhaval Vaishnav for feedback.

Life in Bytes
