We are working with the 1000 Genomes data and trying to download a subset of it using genomic coordinates listed in a BED file.
The command we are using is:
samtools view -b -L ../XYZ_coordinates.all.bed ftp://ftptrace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.ow_coverage.20120522.bam > ../1000data/HG00096.bam
The BED file contains 90 different coordinates. We submitted a batch job for all of the 1000 Genomes samples and were able to download the BAM files corresponding to the coordinates in the BED file. However, when re-downloading some genomes individually, we get a larger BAM file. For example:
Downloading NA21090 in the batch job gave a file size of
428K Sep 1 08:46 NA21090.bam
while running the same command again gave a file size of
3.4M Sep 8 19:57 NA21090.bam
Could someone please explain why the same command gives different file sizes in batch submission mode versus individual submission? We used PBS scripts to submit the batch jobs, where each batch file downloaded data from 50 FTP locations one after another.
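
In case it helps narrow this down, here is a minimal sketch of one check we could run on both files to see whether the smaller one was simply cut off mid-transfer. It assumes the output is a standard BGZF-compressed BAM, which should end with the fixed 28-byte end-of-file block described in the SAM/BAM specification. The function name `bam_has_eof` is our own, hypothetical helper and not part of samtools. We believe `samtools quickcheck` performs a similar test, but this version needs only `tail`, `od` and `tr`.

```shell
#!/bin/sh
# Hypothetical helper (not a samtools command): succeeds if the file ends
# with the 28-byte BGZF end-of-file block that a complete BAM should carry.
bam_has_eof() {
    # Hex dump of the last 28 bytes, with spaces and newlines stripped
    tail_hex=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    # Expected marker, taken from the BGZF section of the SAM/BAM specification
    [ "$tail_hex" = "1f8b08040000000000ff0600424302001b0003000000000000000000" ]
}
```

For example, `bam_has_eof NA21090.bam && echo complete || echo "possibly truncated"` could be run on the 428K and the 3.4M file in turn. A missing marker would only suggest that the stream ended early; it would not by itself explain why.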