Insert Size For Illumina Gaiix Paired-End Library From Sam/Bam File
2
0
Entering edit mode
10.1 years ago
bioinfo ▴ 830

From the fastq data (read 1 and read 2) from illumina GAIIx platform ( paired-end library), I created the Sam and bam file using BWA. I got the statistics of number of uniquely-paired reads and total reads mapped to my reference genome. I was wondering how to get a complete list of insert size across the data set from my Sam or bam file? It could be just for uniquely paired reads or all mapped reads. I don't want to write a long script for that. Is there any easy commands from samtools or other options that I can use?

illumina next-gen sam bam • 6.8k views
ADD COMMENT
0
Entering edit mode

BWA's command line output already gives info about the mean and the SD of insert size while its running.

ADD REPLY
0
Entering edit mode

What command are you talking about? Tha samtools flagstat usually gives a basic statistics of the mapping like uniquely-paired, total reads mapped, singletons etc. I'm not sure which step you are mentioning. Can you explain me a bit more?

ADD REPLY
0
Entering edit mode

when you ran bwa it displayed the insert size info - but if you did not save that then that information is probably lost now

ADD REPLY
1
Entering edit mode
10.1 years ago
matted 7.6k

The Picard function CollectInsertSizeMetrics will do that for you, including text output and making a histogram PDF.

ADD COMMENT
0
Entering edit mode

I can't remember why now, but I wasn't satisfied with the mean and SD that was given by picard tools. It resulted in huge reduction of the % of mapped reads. I wrote my own script to compute the inner distance mean and sd, seems to work fine.

ADD REPLY
0
Entering edit mode

I tried with the Picard function "CollectInsertMetrics"

my commands:

java -Xmx2g -jar ~/bin/picard-tools-1.70/CollectInsertSizeMetrics.jar H=GMMhist.txt DEVIATIONS=10 I=GMM.bam O=GMMinsertsize.txt R=ref.fasta VALIDATIONSTRINGENCY=LENIENT

The programme took the input as below:

[Mon Aug 20 10:00:00 BST 2012] net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAMFILE=GMMhist.pdf DEVIATIONS=10.0 INPUT=SODpalBWAGMM.PE.sorted.bam.sortedcleanedGMM.bam OUTPUT=GMMinsertsize.txt REFERENCESEQUENCE=Sodalisref.fasta VALIDATIONSTRINGENCY=LENIENT MINIMUMPCT=0.05 METRICACCUMULATIONLEVEL=[ALLREADS] ASSUMESORTED=true STOPAFTER=0 VERBOSITY=INFO QUIET=false COMPRESSIONLEVEL=5 MAXRECORDSINRAM=500000 CREATEINDEX=false CREATEMD5FILE=false [Mon Aug 20 10:00:00 BST 2012] Executing as acdguest2@euler01.black on Linux 2.6.38.4 amd64; Java HotSpot(TM) 64-Bit Server VM 1.6.022-b04; Picard version: 1.70(1215)

Error message:

WARNING 2012-08-20 09:43:49 SinglePassSamProgram File reports sort order 'queryname', assuming it's coordinate sorted anyway.

WARNING 2012-08-20 09:43:49 CollectInsertSizeMetrics All data categories were discarded because they contained < 0.05 of the total aligned paired data.

WARNING 2012-08-20 09:43:49 CollectInsertSizeMetrics Total mapped pairs in all categories: 0.0

[Mon Aug 20 09:43:49 BST 2012] net.sf.picard.analysis.CollectInsertSizeMetrics done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=506396672

I didn't really understand: "All data categories were discarded because they contained < 0.05 of the total aligned paired data."...what does it mean? how can I solve the problem?

ADD REPLY
0
Entering edit mode

What output does samtools flagstat give for your bam file?

ADD REPLY
0
Entering edit mode

Here is my flagstat ouput

6699458 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

5036221 + 0 mapped (75.17%:nan%)

6699458 + 0 paired in sequencing

3349729 + 0 read1

3349729 + 0 read2

4807142 + 0 properly paired (71.75%:nan%)

4942444 + 0 with itself and mate mapped

93777 + 0 singletons (1.40%:nan%)

22334 + 0 with mate mapped to a different chr

20960 + 0 with mate mapped to a different chr (mapQ>=5)

*My reference genome (1 chromosome, 4 plasmids and 1 phage)

ADD REPLY
0
Entering edit mode

Hm... so the error message is coming from the MINIMUM_PCT argument. The docs say: "When generating the histogram, discard any data categories (out of FR, TANDEM, RF) that have fewer than this percentage of overall reads. (Range: 0 to 1) Default value: 0.05." So it means Picard doesn't think many of your reads fit these categories (which they should, hopefully). What does Picard ValidateSamFile say about your bam? What did you use to make it? My only guess now is maybe the mate pair information isn't written in the bam in the way that Picard expects it.

ADD REPLY
0
Entering edit mode

The bam file I used initially was created by BWA and then cleaned up by GATK (realigned by " RealignerTargetCreator" and "IndelRealigner") and sorted by Samtools. Now I just checked whether there is any incompatibility between the bam file and the picard tools. So, I used the bam file created by samtools before cleaning up step with GATK. Now, I got some validation error below.

.................... ..................... I picked up a few lines of error meaage

Ignoring SAM validation error: ERROR: Record 5130866, Read name HWUSI-EAS1643R:29:FC:6:15:5339:15720, MAPQ should be 0 for unmapped read.

Ignoring SAM validation error: ERROR: Record 5130880, Read name HWUSI-EAS1643R:29:FC:6:25:18867:7265, MAPQ should be 0 for unmapped read.

Ignoring SAM validation error: ERROR: Record 5130881, Read name HWUSI-EAS1643R:29:FC:6:31:6733:1961, MAPQ should be 0 for unmapped read. Ignoring SAM validation error: ERROR: Record 5130916, Read name HWUSI-EAS1643R:29:FC:6:98:1162:2696, MAPQ should be 0 for unmapped read.

Ignoring SAM validation error: ERROR: Record 5130942, Read name HWUSI-EAS1643R:29:FC:6:48:19322:1813, MAPQ should be 0 for unmapped read.

................................................ ..........................................

ADD REPLY
0
Entering edit mode

Those errors are expected (see the FAQ here). My only remaining guess is that somehow the BAM sort order matters... but I don't see why that would be the case. In my scripts I only run CollectInsertSizeMetrics on coordinate-sorted BAMs, passing ASSUME_SORTED=true. Might be something to try. Sorry I can't be more helpful.

ADD REPLY
1
Entering edit mode

I also faced the same issue. Without the bam file sorted, it produces the error. After sorted, it works well.

ADD REPLY
0
Entering edit mode

I ran the Picard CollectInsertSizeMetrics function it failed at generating the histogram with this following error: Exception in thread "main" picard.PicardException: R script picard/analysis/insertSizeHistogram.R failed with return code 127

ADD REPLY
0
Entering edit mode
2.8 years ago
Gabriel R. ★ 2.8k

I have also a super simple program in C++ built on samtools/htslib to do just that:

https://github.com/grenaud/insertsize

It skips sec. alignments, QC fail. you can restrict the computation to mapped or properly paired for paired reads.

ADD COMMENT

Login before adding your answer.

Traffic: 2278 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6