How To Extract Reads With Wrong Insert Size With Samtools
2
0
Entering edit mode
11.5 years ago
Assa Yeroslaviz ★ 1.8k

Hi all,

I would like to know how to extract paired reads from a bam file with the "wrong" insert size. I am extracting the reads with the "right" size with this command:

samtools view G_AGTCAA.bam | gawk ' and($2,0x0002) && and($2,0x0040)' | awk '{print $1,"\t", $2, "\t", $4,"\t" $8,"\t" $9}' |sed 's/-//g' > G_insert_size.txt

This command gives me the reads with the "correct" insert size

HWIST863:138:D0WT7ACXX:5:1101:1110:56612      15090     15326     320
HWIST863:138:D0WT7ACXX:5:1101:1168:81336      5294     5061     334
HWIST863:138:D0WT7ACXX:5:1101:1169:91247      15799     15625     275
HWIST863:138:D0WT7ACXX:5:1101:1207:97805      8616     8704     189

But what I would like to get are the reads which have the FLAGS 97 or 145, like these two:

 HWI-ST863:138:D0WT7ACXX:5:2105:15878:99962    145    MT1    4599    37    101M    =    16312    11612    CAATCTCACTTCTATGAAATAAAACTCCAGCAATACTAACTATAATCTCACTGATATTACTATCCCTAGGAGGCCTTCCACCACTAACAGGATTCTTACCA    @CDCCACAEEAFEFFFEDHHHHGHIHJIGFGCA@/HCIHJJJIIJJJIHF?FDGD?HG?IHFCJJJJJJJJJIJJIHEGAGIFGHHGJHHHHHFFFDDCC@    XT:A:U    NM:i:0    SM:i:37    AM:i:0    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:101
 HWI-ST863:138:D0WT7ACXX:5:2105:15878:99962    97    MT1    16312    0    101M    =    4599      -11612    TAATAACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTATCCCATAAACACAAAGGTTTGGTCCTGGCCTTATAATTAATTAGAGGTAAAATTAC    CCCFFFFFHHHHHIJJJJJJJJJJJJJIJJJJJJJJJJJJJJJFHIHIJIJJIJJIHIJJJIFGIIIGGIIGIHHEHHGECEDEEFEDECEDCAC>AD@C@    XT:A:R    NM:i:0    SM:i:0    AM:i:0    X0:i:2    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:101    XA:Z:MT1,+13,101M,0;

this pair has an insert size of 11612.

I would like to know what hex strings I need to use to find this reads?

Thanks

Assa

bam samtools awk • 9.7k views
ADD COMMENT
2
Entering edit mode
11.5 years ago
matted 7.8k

I'm not exactly sure what class of reads you're looking for, but spend some quality time with the SAMtools explain flags page.

Piping through gawk and manually testing the flags is crazy and totally misses the entire point of them. It's slow and error-prone. You should be doing something like:

samtools view -f66 in.bam | cut -f 1,2,4,8,9 | tr -d "-" > out.txt

This matches your original output (up to whitespace differences) and uses the samtools flag filtering.

Once you figure out exactly what you want, use the -f and -F flags to samtools view to extract the correct reads.

For example, if you want reads that were paired and both mapped, but not properly, you'd do:

samtools view -f1 -F14 in.bam.

Reads the samtools docs and the flag explainer page to verify this snippet.

ADD COMMENT
0
Entering edit mode

I thought proper pair means the the insert size is within the "correct" size. But I find it very confusing to choose the right parameters for the correct FLAG. But i'll look into it. thanks

ADD REPLY
1
Entering edit mode
11.5 years ago

The other way to do this is to use awk to screen the contents of the insert size column, and only keep the reads where the absolute value of the insert size is outside your desired parameters.

ADD COMMENT
0
Entering edit mode

and how do i do that?

ADD REPLY

Login before adding your answer.

Traffic: 2042 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6