Question: Insert Size For Illumina Gaiix Paired-End Library From Sam/Bam File
0
gravatar for bioinfo
8.1 years ago by
bioinfo790
bioinfo790 wrote:

From the fastq data (read 1 and read 2) from illumina GAIIx platform ( paired-end library), I created the Sam and bam file using BWA. I got the statistics of number of uniquely-paired reads and total reads mapped to my reference genome. I was wondering how to get a complete list of insert size across the data set from my Sam or bam file? It could be just for uniquely paired reads or all mapped reads. I don't want to write a long script for that. Is there any easy commands from samtools or other options that I can use?

illumina next-gen sam bam • 5.2k views
ADD COMMENTlink modified 9 months ago by Gabriel R.2.8k • written 8.1 years ago by bioinfo790

BWA's command line output already gives info about the mean and the SD of insert size while its running.

ADD REPLYlink written 8.1 years ago by Arun2.3k

What command are you talking about? Tha samtools flagstat usually gives a basic statistics of the mapping like uniquely-paired, total reads mapped, singletons etc. I'm not sure which step you are mentioning. Can you explain me a bit more?

ADD REPLYlink written 8.1 years ago by bioinfo790

when you ran bwa it displayed the insert size info - but if you did not save that then that information is probably lost now

ADD REPLYlink written 8.1 years ago by Istvan Albert ♦♦ 85k
1
gravatar for matted
8.1 years ago by
matted7.3k
Boston, United States
matted7.3k wrote:

The Picard function CollectInsertSizeMetrics will do that for you, including text output and making a histogram PDF.

ADD COMMENTlink written 8.1 years ago by matted7.3k

I can't remember why now, but I wasn't satisfied with the mean and SD that was given by picard tools. It resulted in huge reduction of the % of mapped reads. I wrote my own script to compute the inner distance mean and sd, seems to work fine.

ADD REPLYlink written 8.1 years ago by Arun2.3k

I tried with the Picard function "CollectInsertMetrics"

my commands:

java -Xmx2g -jar ~/bin/picard-tools-1.70/CollectInsertSizeMetrics.jar H=GMMhist.txt DEVIATIONS=10 I=GMM.bam O=GMMinsertsize.txt R=ref.fasta VALIDATIONSTRINGENCY=LENIENT

The programme took the input as below:

[Mon Aug 20 10:00:00 BST 2012] net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAMFILE=GMMhist.pdf DEVIATIONS=10.0 INPUT=SODpalBWAGMM.PE.sorted.bam.sortedcleanedGMM.bam OUTPUT=GMMinsertsize.txt REFERENCESEQUENCE=Sodalisref.fasta VALIDATIONSTRINGENCY=LENIENT MINIMUMPCT=0.05 METRICACCUMULATIONLEVEL=[ALLREADS] ASSUMESORTED=true STOPAFTER=0 VERBOSITY=INFO QUIET=false COMPRESSIONLEVEL=5 MAXRECORDSINRAM=500000 CREATEINDEX=false CREATEMD5FILE=false [Mon Aug 20 10:00:00 BST 2012] Executing as acdguest2@euler01.black on Linux 2.6.38.4 amd64; Java HotSpot(TM) 64-Bit Server VM 1.6.022-b04; Picard version: 1.70(1215)

Error message:

WARNING 2012-08-20 09:43:49 SinglePassSamProgram File reports sort order 'queryname', assuming it's coordinate sorted anyway.

WARNING 2012-08-20 09:43:49 CollectInsertSizeMetrics All data categories were discarded because they contained < 0.05 of the total aligned paired data.

WARNING 2012-08-20 09:43:49 CollectInsertSizeMetrics Total mapped pairs in all categories: 0.0

[Mon Aug 20 09:43:49 BST 2012] net.sf.picard.analysis.CollectInsertSizeMetrics done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=506396672

I didn't really understand: "All data categories were discarded because they contained < 0.05 of the total aligned paired data."...what does it mean? how can I solve the problem?

ADD REPLYlink modified 8.1 years ago • written 8.1 years ago by bioinfo790

What output does samtools flagstat give for your bam file?

ADD REPLYlink written 8.1 years ago by matted7.3k

Here is my flagstat ouput

6699458 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

5036221 + 0 mapped (75.17%:nan%)

6699458 + 0 paired in sequencing

3349729 + 0 read1

3349729 + 0 read2

4807142 + 0 properly paired (71.75%:nan%)

4942444 + 0 with itself and mate mapped

93777 + 0 singletons (1.40%:nan%)

22334 + 0 with mate mapped to a different chr

20960 + 0 with mate mapped to a different chr (mapQ>=5)

*My reference genome (1 chromosome, 4 plasmids and 1 phage)

ADD REPLYlink modified 8.1 years ago • written 8.1 years ago by bioinfo790

Hm... so the error message is coming from the MINIMUM_PCT argument. The docs say: "When generating the histogram, discard any data categories (out of FR, TANDEM, RF) that have fewer than this percentage of overall reads. (Range: 0 to 1) Default value: 0.05." So it means Picard doesn't think many of your reads fit these categories (which they should, hopefully). What does Picard ValidateSamFile say about your bam? What did you use to make it? My only guess now is maybe the mate pair information isn't written in the bam in the way that Picard expects it.

ADD REPLYlink written 8.1 years ago by matted7.3k

The bam file I used initially was created by BWA and then cleaned up by GATK (realigned by " RealignerTargetCreator" and "IndelRealigner") and sorted by Samtools. Now I just checked whether there is any incompatibility between the bam file and the picard tools. So, I used the bam file created by samtools before cleaning up step with GATK. Now, I got some validation error below.

.................... ..................... I picked up a few lines of error meaage

Ignoring SAM validation error: ERROR: Record 5130866, Read name HWUSI-EAS1643R:29:FC:6:15:5339:15720, MAPQ should be 0 for unmapped read.

Ignoring SAM validation error: ERROR: Record 5130880, Read name HWUSI-EAS1643R:29:FC:6:25:18867:7265, MAPQ should be 0 for unmapped read.

Ignoring SAM validation error: ERROR: Record 5130881, Read name HWUSI-EAS1643R:29:FC:6:31:6733:1961, MAPQ should be 0 for unmapped read. Ignoring SAM validation error: ERROR: Record 5130916, Read name HWUSI-EAS1643R:29:FC:6:98:1162:2696, MAPQ should be 0 for unmapped read.

Ignoring SAM validation error: ERROR: Record 5130942, Read name HWUSI-EAS1643R:29:FC:6:48:19322:1813, MAPQ should be 0 for unmapped read.

................................................ ..........................................

ADD REPLYlink modified 8.1 years ago • written 8.1 years ago by bioinfo790

Those errors are expected (see the FAQ here). My only remaining guess is that somehow the BAM sort order matters... but I don't see why that would be the case. In my scripts I only run CollectInsertSizeMetrics on coordinate-sorted BAMs, passing ASSUME_SORTED=true. Might be something to try. Sorry I can't be more helpful.

ADD REPLYlink written 8.1 years ago by matted7.3k
1

I also faced the same issue. Without the bam file sorted, it produces the error. After sorted, it works well.

ADD REPLYlink written 5.8 years ago by a_khanzehady10
0
gravatar for Gabriel R.
9 months ago by
Gabriel R.2.8k
Danmarks Tekniske Universitet
Gabriel R.2.8k wrote:

I have also a super simple program in C++ built on samtools/htslib to do just that:

https://github.com/grenaud/insertsize

It skips sec. alignments, QC fail. you can restrict the computation to mapped or properly paired for paired reads.

ADD COMMENTlink written 9 months ago by Gabriel R.2.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 961 users visited in the last hour