Question: Picard AddOrReplaceReadGroups Results In Smaller File
Asked 6.8 years ago by Dan Gaston (Canada):

Hi Everyone,

I have recently started mapping and variant calling for six whole-exome sequencing projects (six different individuals). I have already mapped the reads to the reference and converted the SAM files to BAM using Picard. I then added read group data to each file with AddOrReplaceReadGroups and took the opportunity to sort by coordinate as well. Since I added data, and the input files were already BAM, I am a little puzzled that the resulting files are smaller: each one is about 3-4 GB smaller than its input. Is this normal, or should I be worried? An example command line was:

java -Xmx2g -jar /usr/local/bin/AddOrReplaceReadGroups.jar INPUT=1804.bam OUTPUT=1804.sorted.bam SORT_ORDER=coordinate RGLB=8 RGPL=Illumina RGPU=1 RGSM=1804

Thanks everyone.

Answer written 6.8 years ago by brentp (Salt Lake City, UT):

What is the original file size? Sorting should aid compression, because similar records end up close together. You can always check the number of mapped reads by doing something like:

samtools view -F 4 -c 1804.sorted.bam
samtools view -F 4 -c 1804.bam

and you should get the same thing.
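
The two counts above can also be compared automatically. The following is only a minimal sketch: `same_mapped_count` is a hypothetical helper name, and it assumes `samtools` is on the PATH. As in the commands above, `-F 4` excludes unmapped reads and `-c` prints a count instead of the records; dropping `-F 4` would count every record, unmapped reads included.

```shell
# Hypothetical helper: compare the mapped-read counts of two BAM files.
# Assumes samtools is installed and on the PATH.
same_mapped_count() {
    # -c prints only the number of matching records; -F 4 skips unmapped reads
    before=$(samtools view -c -F 4 "$1") || return 2
    after=$(samtools view -c -F 4 "$2") || return 2
    echo "$1: $before mapped reads"
    echo "$2: $after mapped reads"
    # Exit status 0 only when both counts are identical
    [ "$before" -eq "$after" ]
}

# Usage (file names taken from the question):
# same_mapped_count 1804.bam 1804.sorted.bam && echo "counts match"
```

The function returns a non-zero status when the counts differ or when samtools fails, so it can be used directly in a pipeline script.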


1804.bam: 15 GB
1804.sorted.bam: 11 GB

And you were right, it looks like the same number of reads. Apparently the sorted BAM files just compress better, which isn't something I expected, but it makes perfect sense once I think about it.

Reply written 6.8 years ago by Dan Gaston
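
The effect is easy to reproduce on toy data. BAM files are stored in BGZF, a blocked gzip (DEFLATE) format, so plain `gzip` is a reasonable stand-in here. The sketch below uses simulated data, not real sequencing output: it draws 50-base "reads" from a random reference, writes them in random order, sorts a copy by position, and compresses both. All file names, sizes, and the seed are arbitrary assumptions chosen for illustration.

```shell
# Toy demonstration with simulated data (not real reads): coordinate-sorted
# records put overlapping sequences next to each other, which helps gzip.
tmp=$(mktemp -d)

# Build a random 50,000-base reference and sample 10,000 reads of 50 bases,
# written as "position<TAB>sequence" in random order.
awk -v seed=42 'BEGIN {
    srand(seed)
    split("A C G T", base, " ")
    ref = ""
    for (i = 0; i < 50000; i++) ref = ref base[int(rand() * 4) + 1]
    for (r = 0; r < 10000; r++) {
        pos = int(rand() * (50000 - 50)) + 1
        printf "%d\t%s\n", pos, substr(ref, pos, 50)
    }
}' > "$tmp/unsorted.txt"

# Sort numerically by position, analogous to SORT_ORDER=coordinate
sort -n -k1,1 "$tmp/unsorted.txt" > "$tmp/sorted.txt"

# Same number of records before and after sorting
n_unsorted=$(wc -l < "$tmp/unsorted.txt" | tr -d ' ')
n_sorted=$(wc -l < "$tmp/sorted.txt" | tr -d ' ')

# Compressed size in bytes of each ordering
unsorted_size=$(gzip -c "$tmp/unsorted.txt" | wc -c | tr -d ' ')
sorted_size=$(gzip -c "$tmp/sorted.txt" | wc -c | tr -d ' ')

echo "records: $n_unsorted unsorted, $n_sorted sorted"
echo "gzip bytes: $unsorted_size unsorted, $sorted_size sorted"

rm -r "$tmp"
```

Both files hold exactly the same records, yet the sorted copy should compress to fewer bytes, because each read overlaps heavily with its neighbours and the compressor can encode it as a back-reference to nearby text.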

That's an excellent explanation. Nice thinking.

Reply written 6.8 years ago by Matt Shirley