Question

Does each read in a BAM file have approximately the same size?

0

6.2 years ago

Aleksandra • 0

I am doing research that involves processing reads from a BAM file.

I need to predict the number of reads at the start of my program without reading the whole file, because that takes too long for large input files.

My idea is to read the first few reads, estimate the approximate size of a read, and then, knowing the total size of the file, calculate the approximate number of reads in the file. An approximate count is good enough for me.

So the question: is there a big difference in the size between the reads in the file?

For example, in my current data each read takes between 95 and 105 bytes, which is fine for me. But I'm not sure whether this holds for other files.
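The extrapolation described above can be sketched as below. This is only an illustration of the arithmetic, not a tested estimator: the function and parameter names are hypothetical, and it assumes the sample is representative of the whole file. One caveat worth stating: BAM files are BGZF-compressed, so the on-disk file size reflects compressed bytes. The sample therefore has to be measured in compressed bytes consumed, not in the uncompressed 95 to 105 bytes per record.

```python
def estimate_read_count(file_size_bytes, sample_compressed_bytes,
                        sample_read_count, header_bytes=0):
    """Extrapolate the total read count from a sample at the start of the file.

    All names here are hypothetical. The caller is assumed to know how many
    compressed bytes of the file were consumed to obtain the sampled reads.
    """
    if sample_read_count <= 0 or sample_compressed_bytes <= 0:
        raise ValueError("sample must contain at least one read and one byte")

    # Average on-disk (compressed) cost of one read in the sample.
    bytes_per_read = sample_compressed_bytes / sample_read_count

    # Ignore the part of the file that holds the header, if its size is known.
    payload_bytes = max(file_size_bytes - header_bytes, 0)

    return round(payload_bytes / bytes_per_read)


# Example: 1,000 reads took 25,000 compressed bytes in a 1,000,000-byte file.
print(estimate_read_count(1_000_000, 25_000, 1_000))
```

The estimate is only as good as the assumption that read size and compression ratio stay roughly constant through the file, which may not hold if, for instance, unmapped reads are grouped at the end.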

bam • 1.6k views

updated 13 months ago by Ram 43k • written 6.2 years ago by Aleksandra • 0

2

is there a big difference in the size between the reads in the file?

The simple answer (which is going to sound vague) is yes and no. A dataset comes from a run of N cycles in one direction, so untrimmed reads are all N bases long. If the data were trimmed to remove adapters, read lengths will fall anywhere between the minimum read length selected during trimming and N. A similar distribution can also come from aligners clipping reads during alignment: hard clipping drops the bases that do not align from the record, whereas soft clipping keeps them in the stored sequence.

As for the number of reads present, I don't think there is a way to estimate that by looking at a fraction of the total file. Using samtools flagstat may be a relatively fast way to get that number.
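As a rough sketch of how the flagstat suggestion could be used from a program, the snippet below runs `samtools flagstat` and parses the "in total" line of its text output. The helper names are hypothetical, `samtools` is assumed to be on the PATH, and the parser assumes the default text format, where that line reads like `2000 + 15 in total (QC-passed reads + QC-failed reads)`.

```python
import re
import subprocess

# Matches the summary line "<passed> + <failed> in total ...".
_TOTAL_LINE = re.compile(r"^(\d+) \+ (\d+) in total")


def parse_flagstat_total(flagstat_text):
    """Return QC-passed plus QC-failed records from flagstat text output."""
    for line in flagstat_text.splitlines():
        match = _TOTAL_LINE.match(line.strip())
        if match:
            passed, failed = (int(group) for group in match.groups())
            return passed + failed
    raise ValueError("no 'in total' line found in flagstat output")


def count_records(bam_path):
    """Run samtools flagstat on bam_path (assumes samtools is installed)."""
    result = subprocess.run(
        ["samtools", "flagstat", bam_path],
        capture_output=True, text=True, check=True,
    )
    return parse_flagstat_total(result.stdout)
```

Note that this total counts alignment records, so it can exceed the number of sequenced reads when secondary or supplementary alignments are present.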

6.2 years ago by GenoMax 141k

1

Long read sequencing could further add complexity, since the assumption of N cycles doesn't work anymore.

6.2 years ago by WouterDeCoster 47k

score 3 · Accepted Answer · 2018-02-25

I don't think there can be a conclusive general answer to your question, because it depends on the input reads (fastq files) that were provided to the mapping software. If no trimming was performed on the input reads (for example, to remove low-quality bases), then the read length should be the same for most of the reads. But if the user trimmed the reads, it depends entirely on which trimming parameters were used. Moreover, some mapping software can split or clip the reads.

What I can suggest is to run FastQC on your BAM file (this should be relatively fast), which will give you the read length distribution for your data.
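To illustrate what a read length distribution tells you here, below is a plain-Python stand-in (not FastQC itself, and the function name is hypothetical). It assumes the read sequences have already been extracted as strings by whatever reader you use, and it summarizes how much their lengths vary.

```python
from collections import Counter
from statistics import mean


def read_length_summary(sequences):
    """Summarize read lengths for an iterable of sequence strings."""
    lengths = [len(seq) for seq in sequences]
    if not lengths:
        raise ValueError("no sequences given")
    return {
        # Maps each observed length to the number of reads with that length.
        "histogram": dict(sorted(Counter(lengths).items())),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


summary = read_length_summary(["ACGT", "ACG", "ACGT"])
print(summary["histogram"], summary["min"], summary["max"])
```

If the minimum and maximum are close to each other, a fixed per-read size is a reasonable assumption for that file; a wide histogram suggests trimming or clipping and a less reliable estimate.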