Downloading 1000G file using samtools -b -L results in bam files of different sizes for the same input command
7.1 years ago
raunaq.123 • 0

Hi

We are working on the 1000 genomes data and trying to download a subsection of the data using genomic coordinates listed in a bed file.

The command we are using is

samtools view -b -L ../XYZ_coordinates.all.bed ftp://ftptrace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.ow_coverage.20120522.bam > ../1000data/HG00096.bam


The bed file lists 90 different coordinates. We submitted a batch job for all of the 1000 Genomes samples and were able to download the bam files corresponding to the coordinates in the bed file. However, when re-downloading some genomes individually, we get a larger bam file. For example, the batch run gave:

428K Sep  1 08:46 NA21090.bam


while the same command, run again individually, gave a file size of

3.4M Sep  8 19:57 NA21090.bam


Could someone please explain why the same command gives different file sizes in batch submission mode versus individual submissions? We used PBS scripts to submit the batch jobs, where each batch file downloaded data from 50 FTP locations one after another.

TIA

I think it's a network issue. Nothing to do with samtools per se.

Thanks! It seems more like a network issue than a samtools problem. We were downloading multiple files simultaneously, and the bam files that shared the same generation timestamp were usually the ones that showed this problem.

Check your free disk space, RAM capacity, firewalls and antivirus software. Try on different OSs: Win 7-10, Ubuntu, MacOS. Perhaps the small file is incomplete, in other words only partially downloaded or only partially converted.
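One quick way to test the suggestion above that the smaller file is incomplete is to check whether it ends with the BGZF end-of-file marker, the 28-byte empty block that the SAM/BAM specification defines as the last block of a BAM file. The sketch below is only an illustration; the function names `has_bgzf_eof` and `find_suspect_bams` are made up for this example. Recent samtools versions offer `samtools quickcheck`, which performs a similar check. Note that a file can carry the marker and still be missing reads if the connection dropped in a way samtools treated as a normal end of input, so comparing read counts between the two runs (for example with `samtools view -c`) is a useful second check.

```python
from pathlib import Path

# The 28-byte empty BGZF block that marks the end of a well-formed BAM file
# (as given in the SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF marker."""
    p = Path(path)
    if p.stat().st_size < len(BGZF_EOF):
        # Too small to contain the marker at all.
        return False
    with p.open("rb") as fh:
        # Seek to 28 bytes before the end of the file and read the tail.
        fh.seek(-len(BGZF_EOF), 2)
        return fh.read() == BGZF_EOF


def find_suspect_bams(paths):
    """Return the subset of `paths` that lack the EOF marker."""
    return [str(p) for p in paths if not has_bgzf_eof(p)]
```

For the files in the question, something like `find_suspect_bams(Path("../1000data").glob("*.bam"))` would list the downloads that are worth re-running; any file it reports was almost certainly cut off before samtools finished writing it.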