samtools - Finding Average Fragment size for pair?
Entering edit mode
5.7 years ago
oars ▴ 200

I'm trying to find the average fragment size for reads in my .bam file and want to also exclude read pairs with a fragment size of greater than 1kb (Note: not read length, but fragment size for the pair.) I put together a command line script but the output is all wrong; this is what i tried:

enter code here
$ samtools view -h SRR1611183.bam chr22 | awk 'length($10) > 1000 || $1 ~ /^@/' | samtools view -h - > bar2.bam

This basically gives me that same output as the idxstat command:

samtools idxstats SRR1611183.bam chr22

These are good for finding the sequence length but I'm seeking the fragment size.

Finally, I found a command line script that looked like a winner, but the output is all zeros?

samtools view -h SRR1611183.bam chr22 | awk -F'\t' '{if((/^@/) || (length($10)>1000)){print $0}}' | samtools stats | grep '^SN' | cut -f 2-

And here is the output (something is off):

raw total sequences:    0
filtered sequences: 0
sequences:  0
is sorted:  1
1st fragments:  0
last fragments: 0
reads mapped:   0
reads mapped and paired:    0   # paired-end technology bit set + both mates mapped
reads unmapped: 0
reads properly paired:  0   # proper-pair bit set
reads paired:   0   # paired-end technology bit set
reads duplicated:   0   # PCR or optical duplicate bit set
reads MQ0:  0   # mapped and MQ=0
reads QC failed:    0
non-primary alignments: 0
total length:   0   # ignores clipping
bases mapped:   0   # ignores clipping
bases mapped (cigar):   0   # more accurate
bases trimmed:  0
bases duplicated:   0
mismatches: 0   # from NM fields
error rate: 0.000000e+00    # mismatches / bases mapped (cigar)
average length: 0
maximum length: 30
average quality:    0.0
insert size average:    0.0
insert size standard deviation: 0.0
inward oriented pairs:  0
outward oriented pairs: 0
pairs with other orientation:   0
pairs on different chromosomes: 0

Anyone see where I'm off?

samtools • 5.0k views
Entering edit mode

If you do a length on field $10, you're measuring the read length, which will never be longer than 1000 unless you're working with long read technologies and therefore will never show up in your output. If you want fragment size, you need to use field $9 or calculate it yourself from the read position information, but that means you'd need to process the two lines of each pair together somehow (e.g. by sorting by read name and then comparing names when calculating the fragment size).

Entering edit mode
5.7 years ago
michael.ante ★ 3.8k

Do you have RNA-Seq or DNA-Seq data?

With DNA, you may just use bedtool's bamtobed with the bedpe option and compute the length.

For RNA-Seq you can have a look at RSEQC's Which you could adapt to your needs.

Cheers, Michael

Entering edit mode

whole-exome sequencing.

Entering edit mode

I'm not familiar with exome sequencing. Did you have splices in your aligned reads, or is the whole gene including introns covered?

With splicings have a look at RSeQC; without splicings have a look at bedtools

Entering edit mode

If I understand correctly, you target exon sequences, but might pull out a bit of attached intron sequence, since the whole DNA is sheared and only then captured. If I don't understand correctly, please ignore my rambling.


Login before adding your answer.

Traffic: 1273 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6