It's been quiet here for quite a while, I know. But a lot has happened on the software side of the lab! I was planning to do a 2024 review post before New Year’s and didn't get around to it, but it's my resolution to post here more often, so here it is — better late than never…
Milestones for Open2C
2024 was an eventful year for Open2C. For those not in the know, the Open Chromosome Collective (Open2C) is a collaborative community that several colleagues and I spun out of the interconnected software packages we had co-developed, most of us originally as graduate students. Our libraries focus on the processing and analysis of data from Chromosome Conformation Capture (3C/Hi-C) and related 3D genomics technologies, leveraging the Python data ecosystem. We also develop tools to support functional genomics analysis more broadly.
As the year began, our four most popular packages were granted NumFOCUS affiliation. NumFOCUS is a non-profit organization that supports open-source projects in research and scientific computing, including many of the packages you probably use on a regular basis. Our affiliated projects are:
Bioframe brings genomic interval operations natively to Pandas DataFrames. This is in contrast to the file-centric approach of the canonical CLI program, bedtools. Bioframe is therefore similar to GenomicRanges in the R/Bioconductor ecosystem, but operates directly on data frames rather than specialized data structures or wrapper objects.
Pairtools is a critical suite of command line tools for processing and interpreting sequence alignments from proximity ligation NGS assays, like Hi-C.
Cooler is a scalable file format and support package for storing multi-scale contact matrices and other genomic interaction maps.
Cooltools provides common downstream analysis for Hi-C data.
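To make the "interval operations on data frames" idea a bit more concrete, here is a minimal sketch in plain Pandas of what an overlap query between two interval tables looks like. This is not Bioframe's API or implementation: the `naive_overlap` helper, the toy `genes` and `peaks` tables, and the column names are all assumptions made up for illustration. A naive join like this compares every pair of intervals on the same chromosome, so it would not scale to real datasets.

```python
import pandas as pd

# Hypothetical toy intervals in half-open, BED-style coordinates.
genes = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 100],
    "end": [200, 600, 300],
    "name": ["geneA", "geneB", "geneC"],
})
peaks = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [150, 700, 250],
    "end": [250, 800, 400],
    "name": ["peak1", "peak2", "peak3"],
})


def naive_overlap(left, right):
    """Return all pairs of overlapping intervals (illustrative helper only)."""
    # Pair up every left/right interval that shares a chromosome.
    pairs = left.merge(right, on="chrom", suffixes=("_left", "_right"))
    # Two half-open intervals overlap when each starts before the other ends.
    hits = (pairs["start_left"] < pairs["end_right"]) & (
        pairs["start_right"] < pairs["end_left"]
    )
    return pairs[hits].reset_index(drop=True)


overlaps = naive_overlap(genes, peaks)
print(overlaps[["chrom", "name_left", "name_right"]])
```

The result is itself an ordinary data frame, so it can be filtered, grouped, or joined further with the usual Pandas tools instead of being written out to an intermediate file.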
We published! While cooler was published back in 2020 and has accumulated almost 1500 citations to date, our team had let the preprints for the other packages simmer during the pandemic, and we finally got around to submitting them for publication. The bioframe manuscript was published in Bioinformatics in February 2024, while cooltools and pairtools came out in PLoS Computational Biology in May.
A spring hackathon
In April, my lab and Nils Gehlenborg's HIDIVE lab hosted scverse's first US-based hackathon at Harvard Medical School, with focus themes on data infrastructure and visualization. It was a great event with over 20 participants, bringing together talent from around the Boston area and the scverse core team. Projects included improving support for SpatialData in Vitessce, implementing specialized features for a pseudotime trajectory explorer widget, interactive reprs for AnnData in Jupyter, and improving access to curated Bioconductor databases through Python. See the tweet summary.
A summer of code
Open2C was also accepted as a mentor organization for the 2024 edition of the Google Summer of Code! We had three outstanding mentees working on projects related to genome assembly metadata, more efficient storage and processing of pairs data, and tutorials for bioframe. You can learn more about their projects here.
A big cheer for BigWigs
In March, we released Jack Huey's Bigtools, a Rust implementation of the UCSC BigWig and BigBed formats, along with a preprint. The application note was published in Bioinformatics in June. Bigtools represents the first feature-complete alternative implementation of BigWig and BigBed, supporting reading and writing of both formats, and boasts better performance and ergonomics. It provides a command line interface — a drop-in replacement for the ubiquitous UCSC binaries — and Python bindings via pybigtools. Unlike the reference BigWig/BigBed implementation, which is embedded in the monolithic UCSC genome browser C source tree and was never designed for reuse as a standalone library, Bigtools is a bona fide, modular Rust crate (library) that can be used as a dependency in other Rust programs or exposed to other languages through bindings. Notably, we use Bigtools in Oxbow to provide efficient access to BigWig and BigBed data as Apache Arrow-backed data structures (think data frames!).
Connecting it all together with widgets
Data visualization is something we care a lot about, and our close collaborator in the HIDIVE lab, Trevor Manz, spent his PhD working on making interactive visualization of biological data accessible to the different kinds of users who need it. The environment we use most often for data analysis and machine learning is Jupyter, through the various platforms that support its architecture, including JupyterLab, Colab, and VS Code. The platform has huge potential to support advanced data visualization and interactivity, but from maintaining several data visualization packages as Jupyter widgets, we learned firsthand how difficult and fragile widget development was in such a fragmented ecosystem.
Enter anywidget! Anywidget is a specification and toolset for making reusable web-based visualizations and user interfaces that work in interactive computing environments. Originally focused on the Jupyter compatibility issue, anywidget allows authors to write a single module for the front-end logic that works across all Jupyter-compatible platforms and does not require building and distributing specialized front-end code for each platform. It enables rapid prototyping — lowering the barrier to entry — and today has become the recommended toolkit for authoring Jupyter widgets. Furthermore, by providing a narrow specification based on modern web standards, various platforms for dashboarding and online publishing have begun to support anywidget integration, even adopting anywidget front-end modules natively as a plugin framework.
Trevor gave a great presentation on anywidget at the SciPy 2024 conference in Tacoma, WA. He and I also delivered a tutorial on anywidget, which was extremely well received. We published two manuscripts on anywidget: a conference proceeding detailing how the Jupyter compatibility problem was solved, and a JOSS article describing the project architecture and goals more broadly. Oh, and Trevor also successfully defended his PhD! 🎉
Towards a composable future
I presented some of my group’s ideas on “Composable Bioinformatics” at the Bioinformatics Open Source Conference (BOSC) track of the ISMB/ICSB conference 2024 in Montreal. Our colleague Fritz Lekschas also gave a great complementary talk on composable data visualization in the BioVis track (video not yet available).
Watch my talk for a preview of where we think the bioinformatics and genomics software ecosystem needs to be heading, and some of what we’re working on to make it happen:
More on composability to come in 2025!