BGZip slow performance near end of chromosomes #153

davmlaw · 2019-07-26T05:41:17Z

It can take over a minute to retrieve a few bases:

$ time faidx --no-rebuild Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 7:117199563-117199564
>7:117199563-117199564
GG

real	1m8.888s
user	0m31.042s
sys	0m37.785s

Low coordinates are fine:

$time faidx --no-rebuild Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 7:10000-11000 > /dev/null

real	0m0.295s
user	0m0.241s
sys	0m0.056s

You said in a previous issue:

there is still a large performance penalty for fetching small substrings near the end of a record, and I'll open an issue to remind me to explore a solution.

I can't find that issue, so am raising this one. Good luck!

The text was updated successfully, but these errors were encountered:

Maarten-vd-Sande · 2022-02-11T13:47:14Z

Seems that the lookup time just scales with the distance from the "start" of a contig. I just quickly scanned the internals, can't say I fully understand, but it seems that this is due to the way bgzip is implemented in biopython:

https://github.com/biopython/biopython/blob/master/Bio/bgzf.py#L699

It seems to read the whole part before the contig you need...?

mdshw5 · 2022-02-11T18:19:14Z

@Maarten-vd-Sande this is definitely not due to the Bio.bgzf implementation and is definitely due to my incomplete implementation of virtual offset calculations from the start of each contig. I started work to fully support using the .gzi sidecar files in #164, but have not taken the time to complete the work. From the .fai index we can know how many bases (characters) to skip, and can seek directly to the requested region in an uncompressed FASTA. For BGZF compressed FASTA in current pyfaidx implementation can know which BGZF block to seek to the beginning of a contig, but without implementing logic to incorporate the .gzi (which tells us the internal BGZF block structure of a contig) the safest thing we can do is start reading from the beginning of the sequence. This is not ideal. The alternative is to seek to the BGZF block nearest to the coordinate of interest, and then start reading from the beginning of that block. This is the relevant code that I wrote 1.5 years ago:

pyfaidx/pyfaidx/__init__.py

Lines 766 to 776 in f878775

    
           if self._bgzf:  
        
               from Bio import bgzf 
        
               insertion_i = bisect.bisect_left(self.gzi_index, bstart) 
        
               if insertion_i == 0:  # bisect_left already returns index to the left if values are the same 
        
                   start_block = self.gzi_index[insertion_i] 
        
               else: 
        
                   start_block = self.gzi_index[insertion_i - 1] 
        
               within_block_offset = start_block.ustart + bstart 
        
               print(self.gzi_index) 
        
               print(locals()) 
        
               self.file.seek(bgzf.make_virtual_offset(start_block.cstart, within_block_offset))

You can see that I was still trying to figure out how this works, and never was able to make an entire round-trip (read a .gzi file and construct and write an identical .gzi file) without some slight errors.

Maarten-vd-Sande · 2022-02-12T09:46:24Z

@mdshw5 thanks for the reply, that makes sense! I guess I'll just load the whole fasta in memory for now 😄

wsdewitt mentioned this issue Jul 30, 2020

performance issues with accessing bgzipped ancestral fastas harrispopgen/mutyper#11

Open

mdshw5 mentioned this issue Feb 19, 2022

Version in APT doesn't have correct dependenceis #187

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BGZip slow performance near end of chromosomes #153

BGZip slow performance near end of chromosomes #153

BGZip slow performance near end of chromosomes #153

BGZip slow performance near end of chromosomes #153

Comments