CRAM 3.1: advances in the CRAM file format
Bonfield, 2022
- Document ID
- 5932725982371527274
- Author
- Bonfield J
- Publication year
- 2022
- Publication venue
- Bioinformatics
Snippet
Motivation: CRAM has established itself as a high-compression alternative to the BAM file format for DNA sequencing data. We describe updates to further improve this on modern sequencing instruments. Results: With Illumina data, CRAM 3.1 is 7–15% smaller than the …
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30312—Storage and indexing structures; Management thereof
- G06F17/30321—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30067—File systems; File servers
- G06F17/30129—Details of further file system functionalities
- G06F17/3015—Redundancy elimination performed by the file system
- G06F17/30153—Redundancy elimination performed by the file system using compression, e.g. sparse files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/21—Text processing
- G06F17/22—Manipulating or registering by use of codes, e.g. in sequence of text characters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30861—Retrieval from the Internet, e.g. browsers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30943—Information retrieval; Database structures therefor; File system structures therefor details of database functions independent of the retrieved data type
- G06F17/30946—Information retrieval; Database structures therefor; File system structures therefor details of database functions independent of the retrieved data type indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30017—Multimedia data retrieval; Retrieval of more than one type of audiovisual media
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F19/00—Digital computing or data processing equipment or methods, specially adapted for specific applications
- G06F19/10—Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology
- G06F19/22—Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or SNP [Single-Nucleotide Polymorphism] discovery or sequence alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F19/00—Digital computing or data processing equipment or methods, specially adapted for specific applications
- G06F19/10—Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology
- G06F19/28—Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
-
- H—ELECTRICITY
- H03—BASIC ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same information or similar information or a subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Bonfield | CRAM 3.1: advances in the CRAM file format | |
Cox et al. | Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform | |
Benoit et al. | Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph | |
Jones et al. | Compression of next-generation sequencing reads aided by highly efficient de novo assembly | |
Tembe et al. | G-SQZ: compact encoding of genomic sequence and quality data | |
JP7079786B2 (en) | Methods, computer-readable media, and equipment for accessing structured bioinformatics data in access units. | |
Hach et al. | SCALCE: boosting sequence compression algorithms using locally consistent encoding | |
Zhu et al. | High-throughput DNA sequence data compression | |
US10790044B2 (en) | Systems and methods for sequence encoding, storage, and compression | |
Holt et al. | Merging of multi-string BWTs with applications | |
Patro et al. | Data-dependent bucketing improves reference-free compression of sequencing reads | |
Bose et al. | BIND–An algorithm for loss-less compression of nucleotide sequence data | |
Saha et al. | NRGC: a novel referential genome compression algorithm | |
Holley et al. | Dynamic alignment-free and reference-free read compression | |
Shi et al. | High efficiency referential genome compression algorithm | |
Tatwawadi et al. | GTRAC: fast retrieval from compressed collections of genomic variants | |
Deorowicz et al. | AGC: compact representation of assembled genomes with fast queries and updates | |
Wertenbroek et al. | XSI—a genotype compression tool for compressive genomics in large biobanks | |
Kim et al. | MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression | |
Sirén et al. | GBZ file format for pangenome graphs | |
CN110168652B (en) | Method and system for storing and accessing bioinformatic data | |
El Allali et al. | MZPAQ: a FASTQ data compression tool | |
Cánovas et al. | CSAM: compressed SAM format | |
JP7362481B2 (en) | A method for encoding genome sequence data, a method for decoding encoded genome data, a genome encoder for encoding genome sequence data, a genome decoder for decoding genome data, and a computer-readable recording medium | |
Habib et al. | Modified HuffBit compress algorithm–an application of R |