Question: samtools - Finding Average Fragment size for pair?
0
gravatar for oars
9 months ago by
oars150
oars150 wrote:

I'm trying to find the average fragment size for reads in my .bam file and want to also exclude read pairs with a fragment size of greater than 1kb (Note: not read length, but fragment size for the pair.) I put together a command line script but the output is all wrong; this is what i tried:

enter code here
$ samtools view -h SRR1611183.bam chr22 | awk 'length($10) > 1000 || $1 ~ /^@/' | samtools view -h - > bar2.bam

This basically gives me that same output as the idxstat command:

samtools idxstats SRR1611183.bam chr22

These are good for finding the sequence length but I'm seeking the fragment size.

Finally, I found a command line script that looked like a winner, but the output is all zeros?

samtools view -h SRR1611183.bam chr22 | awk -F'\t' '{if((/^@/) || (length($10)>1000)){print $0}}' | samtools stats | grep '^SN' | cut -f 2-

And here is the output (something is off):

raw total sequences:    0
filtered sequences: 0
sequences:  0
is sorted:  1
1st fragments:  0
last fragments: 0
reads mapped:   0
reads mapped and paired:    0   # paired-end technology bit set + both mates mapped
reads unmapped: 0
reads properly paired:  0   # proper-pair bit set
reads paired:   0   # paired-end technology bit set
reads duplicated:   0   # PCR or optical duplicate bit set
reads MQ0:  0   # mapped and MQ=0
reads QC failed:    0
non-primary alignments: 0
total length:   0   # ignores clipping
bases mapped:   0   # ignores clipping
bases mapped (cigar):   0   # more accurate
bases trimmed:  0
bases duplicated:   0
mismatches: 0   # from NM fields
error rate: 0.000000e+00    # mismatches / bases mapped (cigar)
average length: 0
maximum length: 30
average quality:    0.0
insert size average:    0.0
insert size standard deviation: 0.0
inward oriented pairs:  0
outward oriented pairs: 0
pairs with other orientation:   0
pairs on different chromosomes: 0

Anyone see where I'm off?

samtools • 727 views
ADD COMMENTlink modified 9 months ago by michael.ante2.9k • written 9 months ago by oars150
3

If you do a length on field $10, you're measuring the read length, which will never be longer than 1000 unless you're working with long read technologies and therefore will never show up in your output. If you want fragment size, you need to use field $9 or calculate it yourself from the read position information, but that means you'd need to process the two lines of each pair together somehow (e.g. by sorting by read name and then comparing names when calculating the fragment size).

ADD REPLYlink written 9 months ago by cschu1811.4k
1
gravatar for michael.ante
9 months ago by
michael.ante2.9k
Austria/Vienna
michael.ante2.9k wrote:

Do you have RNA-Seq or DNA-Seq data?

With DNA, you may just use bedtool's bamtobed with the bedpe option and compute the length.

For RNA-Seq you can have a look at RSEQC's inner_distance.py. Which you could adapt to your needs.

Cheers, Michael

ADD COMMENTlink written 9 months ago by michael.ante2.9k

whole-exome sequencing.

ADD REPLYlink written 9 months ago by oars150

I'm not familiar with exome sequencing. Did you have splices in your aligned reads, or is the whole gene including introns covered?

With splicings have a look at RSeQC; without splicings have a look at bedtools

ADD REPLYlink written 9 months ago by michael.ante2.9k

If I understand correctly, you target exon sequences, but might pull out a bit of attached intron sequence, since the whole DNA is sheared and only then captured. If I don't understand correctly, please ignore my rambling.

ADD REPLYlink written 9 months ago by cschu1811.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 774 users visited in the last hour