I am working on a research project that involves processing reads from a BAM file.
I need to predict the number of reads at the start of my program, but without reading the whole file, because that takes too long for large input files.
My idea: read the first few reads, estimate the approximate size of a read, and then, knowing the total size of the file, calculate the approximate number of reads in it. The result doesn't have to be exact; an approximation suits me.
So the question: is there a big difference in size between the reads within a file?
For example, in my current data the reads are between 95 and 105 bytes each, which works fine for this approach. But I'm not sure whether that holds for other files.
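Here is a minimal sketch of what I have in mind, in Python with pysam (the helper name `estimate_read_count` and the sample size of 10,000 are my own arbitrary choices). Since BAM is BGZF-compressed, it estimates compressed bytes per read from the virtual file offset rather than from raw record sizes:

```python
import os
import pysam

def estimate_read_count(bam_path, sample_size=10000):
    """Rough estimate: sample the first reads, measure the compressed
    bytes they occupy, extrapolate over the whole file size."""
    file_size = os.path.getsize(bam_path)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        n = 0
        for _ in bam:  # sequential iteration, no index needed
            n += 1
            if n >= sample_size:
                break
        # tell() returns a BGZF virtual offset; the upper 48 bits are
        # the offset of the current block in the compressed file, i.e.
        # roughly how many compressed bytes the header plus the sampled
        # reads occupied.
        compressed_bytes = bam.tell() >> 16
    if n == 0 or compressed_bytes == 0:
        return 0
    return int(file_size * n / compressed_bytes)
```

The header bytes are included in the sampled span, and compression ratio can vary across the file, so this stays a rough estimate, but that would be acceptable for my purposes.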
The simple answer (which is going to sound vague) is yes and no. Datasets come from a run of N cycles in one direction. If the data was trimmed to remove adapters, then there will be a distribution of read lengths anywhere between N and the minimum read length selected during trimming. Such a distribution can also come from aligners soft/hard-clipping reads during alignment (dropping bases that do not align).
As for the number of reads present, I don't think there is a reliable way to estimate that by looking at a fraction of the total file. Running samtools flagstat may be a relatively fast way to get the exact number.
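For example, via pysam (the file name is a placeholder, and this is just one way to do it):

```python
import pysam

# flagstat scans the whole file but is fast; prints the usual text report.
print(pysam.flagstat("sample.bam"))

# If an index exists (`samtools index sample.bam`), idxstats reads the
# per-contig counts straight from the index and is near-instant even on
# large files. Each output line is: name, length, mapped, unmapped.
total = 0
for line in pysam.idxstats("sample.bam").splitlines():
    name, length, mapped, unmapped = line.split("\t")
    total += int(mapped) + int(unmapped)
print(total)
```

Note that idxstats counts alignment records, so secondary/supplementary alignments inflate the total relative to the number of reads; flagstat breaks those categories out.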
Long-read sequencing adds further complexity, since the assumption of a fixed N cycles no longer holds.