samtools subsampling reads
Entering edit mode
2.4 years ago
zpl015 • 0


I use samtools view -s to subsample a fraction of reads for all individuals in my dataset. I tried seed being 0 for all individuals, seed being 1 for all individuals, and random seed between 0 to 99 for each individual.

My code for subsampling:

samtools view -b -s seed.proportion inBam > OutBam

Subsampling seem to have to no problems as there is no empty output or error messages...but when I calculate the plot the average depth per individual, different seeds make almost no difference which seem to be a bit strange. Average depth comparison between different seed

Here is how I calculate the average depth:

samtools depth inBam | gzip -c > sample01.gz
COV_LEN=$(zcat sample01.gz | wc -l)
SUM=$(zcat sample01.gz | awk 'BEGIN{sum=0}{sum=sum+$3}END{print sum}')
AVE_DEP=$(\${SUM} / \$COV_LEN)

I am not sure if I have done anything wrong or that's how it supposed to be.

Thank you in advance,


samtools view • 1.7k views
Entering edit mode

Which samtools version are you using? Some time ago (I don't remember the version) you needed to use a comma delimited number in which the integer is the seed and the decimal the proportion you want to extract. I had similar issues, and posted everything here on Biostars. You can give a look at this old post and see if you find something helpful in there. Sample Sam File

Entering edit mode

Hi Fabio, Thanks for your comment! I use SAMtools/1.10-GCC-9.2.0. I read about your post and I use the same command as you. I guess using seed 0 for all individuals, seed 1 for all individuals and random seed for each individual seed is giving me different sets of reads as there is very little difference (yellow and orange dots are not completely under the green dots) in the average depth pattern...I just wasn't expecting there would be a 'pattern' in average depth, as well as reference genome coverage no matter what seed I use.

Entering edit mode

The pattern is perfectly fine, and statistically expected. Highly covered regions in the original genome, are also more likely to be sampled by random sampling, so you will always (more or less) keep the same pattern of coverage, but at lower levels.


Login before adding your answer.

Traffic: 1456 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6