Nothing Special   »   [go: up one dir, main page]

Bonfield, 2022 - Google Patents

CRAM 3.1: advances in the CRAM file format

Bonfield, 2022

View PDF
Document ID
5932725982371527274
Author
Bonfield J
Publication year
Publication venue
Bioinformatics

External Links

Snippet

Motivation CRAM has established itself as a high compression alternative to the BAM file format for DNA sequencing data. We describe updates to further improve this on modern sequencing instruments. Results With Illumina data CRAM 3.1 is 7–15% smaller than the …
Continue reading at academic.oup.com (PDF) (other versions)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30286Information retrieval; Database structures therefor; File system structures therefor in structured data stores
    • G06F17/30312Storage and indexing structures; Management thereof
    • G06F17/30321Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30067File systems; File servers
    • G06F17/30129Details of further file system functionalities
    • G06F17/3015Redundancy elimination performed by the file system
    • G06F17/30153Redundancy elimination performed by the file system using compression, e.g. sparse files
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20Handling natural language data
    • G06F17/21Text processing
    • G06F17/22Manipulating or registering by use of codes, e.g. in sequence of text characters
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30861Retrieval from the Internet, e.g. browsers
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30943Information retrieval; Database structures therefor; File system structures therefor details of database functions independent of the retrieved data type
    • G06F17/30946Information retrieval; Database structures therefor; File system structures therefor details of database functions independent of the retrieved data type indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30017Multimedia data retrieval; Retrieval of more than one type of audiovisual media
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F19/00Digital computing or data processing equipment or methods, specially adapted for specific applications
    • G06F19/10Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology
    • G06F19/22Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or SNP [Single-Nucleotide Polymorphism] discovery or sequence alignment
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F19/00Digital computing or data processing equipment or methods, specially adapted for specific applications
    • G06F19/10Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology
    • G06F19/28Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for programme control, e.g. control unit
    • G06F9/06Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
    • HELECTRICITY
    • H03BASIC ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same information or similar information or a subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity

Similar Documents

Publication Publication Date Title
Bonfield CRAM 3.1: advances in the CRAM file format
Cox et al. Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform
Benoit et al. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph
Jones et al. Compression of next-generation sequencing reads aided by highly efficient de novo assembly
Tembe et al. G-SQZ: compact encoding of genomic sequence and quality data
JP7079786B2 (en) Methods, computer-readable media, and equipment for accessing structured bioinformatics data in access units.
Hach et al. SCALCE: boosting sequence compression algorithms using locally consistent encoding
Zhu et al. High-throughput DNA sequence data compression
US10790044B2 (en) Systems and methods for sequence encoding, storage, and compression
Holt et al. Merging of multi-string BWTs with applications
Patro et al. Data-dependent bucketing improves reference-free compression of sequencing reads
Bose et al. BIND–An algorithm for loss-less compression of nucleotide sequence data
Saha et al. NRGC: a novel referential genome compression algorithm
Holley et al. Dynamic alignment-free and reference-free read compression
Shi et al. High efficiency referential genome compression algorithm
Tatwawadi et al. GTRAC: fast retrieval from compressed collections of genomic variants
Deorowicz et al. AGC: compact representation of assembled genomes with fast queries and updates
Wertenbroek et al. XSI—a genotype compression tool for compressive genomics in large biobanks
Kim et al. MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression
Sirén et al. GBZ file format for pangenome graphs
CN110168652B (en) Method and system for storing and accessing bioinformatic data
El Allali et al. MZPAQ: a FASTQ data compression tool
Cánovas et al. CSAM: compressed SAM format
JP7362481B2 (en) A method for encoding genome sequence data, a method for decoding encoded genome data, a genome encoder for encoding genome sequence data, a genome decoder for decoding genome data, and a computer-readable recording medium
Habib et al. Modified HuffBit compress algorithm–an application of R