CRAM: The Genomics Compression Standard


Compress. Connect. Collaborate.

CRAM: A COMMUNITY EFFORT

As genomic sequencing increases around the globe, it becomes vital to store this data efficiently and sustainably. GA4GH’s CRAM file format for genomic data compression tackles this challenge and helps facilitate global collaboration.

Scroll through the videos below to learn how it has benefited existing users and how you can get involved, too. 

“CRAM is a fundamental part of the GA4GH suite of standards. It’s how we think about storing DNA sequence and it works as a package with other standards to allow scientists, healthcare professionals, and commercial researchers to access the information they want when they want it.”
Ewan Birney

Chair of GA4GH, Director of EMBL-EBI


WHAT IS CRAM?

CRAM is a file format that compresses the data it stores using several algorithms. Some of these algorithms are general-purpose; others exploit the fact that most human genomes are very similar to the human reference genome, so much of a sample can be recorded as differences from that reference.
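To make the reference-based idea concrete, here is a minimal Python sketch. It is a toy illustration, not the CRAM codec: the function names `encode_read` and `decode_read` and the record layout are assumptions made for this example, and it handles only single-base substitutions (real CRAM files also deal with insertions, deletions, quality scores, and more).

```python
def encode_read(reference, pos, read):
    """Store a read as its start position, its length, and only the bases
    that differ from the reference (hypothetical toy layout)."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return {"pos": pos, "length": len(read), "diffs": diffs}


def decode_read(reference, record):
    """Rebuild the original read from the reference plus the stored differences."""
    start = record["pos"]
    bases = list(reference[start:start + record["length"]])
    for offset, base in record["diffs"]:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"
read = "ACGAACGT"  # matches reference[4:12] except at offset 3
record = encode_read(reference, 4, read)
restored = decode_read(reference, record)
```

In this sketch an eight-base read shrinks to a position, a length, and one difference. The more closely a sample matches the reference, the less there is to store.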

CRAM files store data in columns that are aligned to the reference sequence. This lets users efficiently extract particular subsets of a file, such as the data for a particular chromosome, which is one of the major use cases of DNA data.
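The column-oriented layout can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the CRAM container format: the field names, the `to_columns` and `query` helpers, and the sample reads are all hypothetical, and a simple linear scan stands in for the index lookup a real CRAM reader would use.

```python
from collections import defaultdict

# Hypothetical aligned reads, one dictionary per read.
reads = [
    {"chrom": "chr2", "pos": 50, "name": "r3", "mapq": 30},
    {"chrom": "chr1", "pos": 100, "name": "r1", "mapq": 60},
    {"chrom": "chr1", "pos": 250, "name": "r2", "mapq": 55},
]


def to_columns(records):
    """Sort reads by reference coordinate, then store each field as its own column."""
    columns = defaultdict(list)
    for rec in sorted(records, key=lambda r: (r["chrom"], r["pos"])):
        for field, value in rec.items():
            columns[field].append(value)
    return dict(columns)


def query(columns, chrom, start, end, field):
    """Return one field for reads starting in [start, end) on a chromosome.
    Only the coordinate columns and the requested column are read."""
    return [
        value
        for c, p, value in zip(columns["chrom"], columns["pos"], columns[field])
        if c == chrom and start <= p < end
    ]


columns = to_columns(reads)
names_on_chr1 = query(columns, "chr1", 0, 200, "name")
```

Because each field lives in its own column, a query for read names on one chromosome never has to touch the mapping-quality column.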


“Even if an organization is not producing large volumes of data, it can be very beneficial to use the same file formats that other organizations are using. The community has a vital role in making sure that a standard is suitable for everybody and not just one individual or one group.”
James Bonfield

Wellcome Sanger Institute

CRAM IS…

GLOBAL

Anthony Philippakis

Broad Institute

SUSTAINABLE

Tiffany Boughtwood

Australian Genomics

ECONOMICAL

Malachi Griffith

Variant Interpretation for Cancer Consortium

EFFICIENT

Nicola Mulder

H3Africa

TRANSFORMATIVE

Thomas Keane

EMBL-EBI

OPEN SOURCE

Peter Counter

Genomics England

INTEROPERABLE

Paul Flicek

Ensembl

EVOLVING

Albert Vernon-Smith

University of Michigan

THE FUTURE

Pär Lundin

SciLifeLab

COLLABORATIVE

Ira Hall

McDonnell Genome Institute

“The CRAM file format is essential toward reducing the footprint of genomic data files enabling more efficient, large-scale analyses and queries and also supporting population scale sized projects such as Genomics England. Importantly, the CRAM format was developed by genomics experts within the community to solve a unique challenge in scaling experiments and applications. The adoption of CRAM shows the user benefit when tools are developed by the community for the community.”
Susan Tousi

Illumina, Inc.

Benefits of CRAM

  • Reduces disk space and storage costs by 30-50%
  • Interoperable with other industry standards and best practices
  • Accurately tracks the reference genome, improving integration with the field
  • Makes it easy to transfer and share data with collaborators
  • Continuous community effort to upgrade the format
  • Free to the community

“It becomes absolutely critical that [genomic] data is in a format that can be easily and effectively shared amongst many investigators. In other words, it really becomes very wasteful if an investigator has to go and reprocess or reformat the data every time they get a new dataset.”
Stacey Gabriel

The Broad Institute of MIT and Harvard; NIH All of Us Research Program

Start the Conversation at your Institute

There are many ways to get involved. Start the dialogue at your organization or institute around using CRAM.

    1. Adopt the CRAM specification
    2. Join the CRAM development community
    3. Share this page with your colleagues

“Open and unencumbered file formats are essential for the science and growing commerce of this industry. The innovations in genomics that will serve humanity are in the interpretation of the data. GA4GH standards allow data and algorithm sharing between institutions, vital for a vibrant informatics ecosystem.”
Warren Kaplan

Garvan Institute


LIBRARIES & TOOLS SUPPORTING CRAM

Software Libraries

htslib | htsjdk | PySam | Bio::DB::HTS | RustBio

Tools

Samtools | GATK | Picard | IGV | Crumble

Data Archives

European Nucleotide Archive (ENA) | European Genome-phenome Archive (EGA)

Genome Browsers

Ensembl | JBrowse | UCSC Genome Browser