Question: How To Extract Reads-Pairs Aligned Concordantly Exactly 1 Time?
7
gravatar for Rahul Sharma
3.9 years ago by
Rahul Sharma520
Germany and India
Rahul Sharma520 wrote:

Hi,

I want to extract read-pairs that aligned concordantly exactly 1 time to the reference genome. Is there any way to parse the output SAM file? I would really appreciate your suggestions.

Best regards,

Rahul

bowtie2 mapping • 9.5k views
ADD COMMENTlink modified 6 months ago by josm0 • written 3.9 years ago by Rahul Sharma520
2

By "map only one time", do you mean (1) don't have any equivalently good mapping or (2) don't have any other "valid" mappings (where "valid" is defined by the --score-min option in bowtie2)?

Edit: Since you updated the title to specify concordant-only alignments, the only difference there is the requirement of the 0x2 bit being set in the flag field (i.e., samtools view -f 0x2 alignments.bam would produce only concordant alignments).

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Devon Ryan76k

Thanks for your reply, actually I want to fetch out only those reads that aligned concordantly exactly 1 time using the generated SAM files. In this case I will have to go for option 2 you specified.

ADD REPLYlink written 3.9 years ago by Rahul Sharma520

Why is this (samtools view -f 0x2 alignments.bam)command different from the one below (samtools view -hf 0x2 alignments.bam)? Why is the h doing? Could that be reason for what I'm seeing here (other post: Concordant pairs bam file larger than accepted_hits bam file?)

ADD REPLYlink written 3.3 years ago by bmpbowen40

The -h simply tells samtools to print out the header as well as the reads. Thus if you were to pipe this into another bam file (by also using the -b option) you have a fully formed bam file.

ADD REPLYlink written 23 months ago by seidel6.3k

This has helped me too.  How did you run bowtie2?  I have been using the -k2 setting then removing any with the XS:i flag.

ADD REPLYlink written 3.8 years ago by Ian5.1k
13
gravatar for Devon Ryan
3.9 years ago by
Devon Ryan76k
Freiburg, Germany
Devon Ryan76k wrote:

The simplest way is likely something along the lines of:

samtools view -hf 0x2 alignments.bam | grep -v "XS:i:" > filtered.alignments.sam

As I mentioned in the comment above, the -f 0x2 part will get only "properly paired" alignments, which will effectively be the concordant alignments. For the option #2 definition of "map only once", you can take advantage of the fact that bowtie2 will add an XS auxiliary tag to reads that have another "valid" mapping. So, a quick inverse grep (grep -v) can get rid of those. One possible problem with this is if only one read of a pair can map to multiple places (e.g., the original fragment partly overlapped a simple tandem repeat) then you'd end up with orphans. The easiest fix to that (if it's a problem) would be to just check if the read names of pairs of alignments are the same. I'm sure someone will come up with a way to do that in awk, but the following python script is likely simpler:

#!/usr/bin/env python
import csv
import sys

f = csv.reader(sys.stdin, dialect="excel-tab")
of = csv.writer(sys.stdout, dialect="excel-tab")
last_read = None
for line in f :
    #take care of the header
    if(line[0][0] == "@") :
        of.writerow(line)
        continue

    if(last_read == None) : 
        last_read = line
    else :
        if(last_read[0] == line[0]) :
            of.writerow(last_read)
            of.writerow(line)
            last_read = None
        else :
            last_read = line

If you saved that as foo.py and made it executable and in your PATH, then the following would solve the aforementioned issue:

samtools view -hf 0x2 alignments.bam | grep -v "XS:i:" | foo.py > filtered.alignments.sam

Note that this won't work with coordinate-sorted alignments, but bowtie2 doesn't produce those.

ADD COMMENTlink written 3.9 years ago by Devon Ryan76k

Thank you so much it worked :)

ADD REPLYlink written 3.9 years ago by Rahul Sharma520

Great reply, really useful!

ADD REPLYlink written 3.8 years ago by Ian5.1k

After further investigation I noticed that the read1 and read2 in the flagstat output where not equal.  There was a subset of reads (in my data anyway) that were designated SAM flag 256 "not primary alignment".  I removed these using -F 256 alongside -f 2.  My original bowtie parameters were --sensitive and -k2.

ADD REPLYlink written 3.8 years ago by Ian5.1k

thanks a lot, I had exactly the same problem. :)

ADD REPLYlink written 3.2 years ago by biola10

Hi,Ryan: The problem of filtering reads after bowtie2 mapping ,i conclude that situtation :first ,filter the mapped unmapped reads by using the command -F 4;second ,(when pair-end data ) ,by using the -f 2 ,-q 40 to filter ;third ,by using the sam flag XS ,to filter the reads map to multiple places.But here you say " One possible problem with this is if only one read of a pair can map to multiple places (e.g., the original fragment partly overlapped a simple tandem repeat) then you'd end up with orphans. " I could not understand this possible meaning ,why we need to filter this possible probllem reads .And i could not understand the python script function ?Could you explain this possible problem clearly . Best !

ADD REPLYlink written 18 months ago by yinyao4100

The original goal was to keep pairs that only align once (given the bowtie2 settings) in the genome. That's easy enough if both mates in the pair align only once, but it's problematic if, e.g., read #1 aligns once and read #2 aligns twice (such as to a simple tandem repeat). Since grep -v "XS:i:" would get rid of read #2, read #1 would be left orphaned (i.e., its mate is missing). My script then filters out such orphans.

ADD REPLYlink written 18 months ago by Devon Ryan76k

This didn't work for me. The number of extracted reads does not match the number in the summary made by Bowtie.

An example:

Time loading reference: 00:00:20 Time loading forward index: 00:00:49 Time loading mirror index: 00:00:26 Multiseed full-index search: 00:02:43 1020534 reads; of these: 1020534 (100.00%) were paired; of these: 98406 (9.64%) aligned concordantly 0 times 766336 (75.09%) aligned concordantly exactly 1 time 155792 (15.27%) aligned concordantly >1 times ---- 98406 pairs aligned concordantly 0 times; of these: 52167 (53.01%) aligned discordantly 1 time ---- 46239 pairs aligned 0 times concordantly or discordantly; of these: 92478 mates make up the pairs; of these: 51106 (55.26%) aligned 0 times 14846 (16.05%) aligned exactly 1 time 26526 (28.68%) aligned >1 times 97.50% overall alignment rate Time searching: 00:11:38 Overall time: 00:12:29

So I want to extract those 766336 (75.09%) that are aligned concordantly exactly 1 time.

I can get those that are concordantly aligned exactly 1 time or > 1 times (766336 + 155792 = 922128) by:

samtools view alignments.bam | grep "YT:Z:CP" | wc -l

or

samtools view -f 0x2 alignments.bam | wc -l

Which gives 1844256 = 922128*2

Your solution:

samtools view -f 0x2 alignments.bam | grep -v "XS:i:" | foo.py | wc -l

gives 1459732 = 2*729866, which is not the desired result.

ADD REPLYlink modified 14 months ago • written 14 months ago by josm0

A little bite late but I use the following awk scripts on concordantly mapped paired-end reads:

 awk '$0!~/^@/ {key=$0; getline; print key "delimiter you want" $0;next}{print $0}' mysample.sam > collapsed.sam

It actually collapse pairs of rows to put paired-end reads on the same line (ignoring the header). use a delimiter that is not present in your sam file ( '{' or '}' for illumina phred33 for example).

awk '$0!~/XS:i:/' collapsed.sam > filtered_collapsed.sam

Then filter line if the pattern "XS:i:" is found

awk '{n=split($0,a,"delimiter you want"); for (i = 1; i <= n; ++i) print a[i]}'  filtered_collapsed.sam > filtered_unique.sam

and finally split remaining lines according to the delimiter to rebuild a correct sam file. Be careful to sort your sam file by read name so reads from the same pair are consecutive.

hope it helps

ADD REPLYlink modified 10 months ago • written 10 months ago by gcorre290
0
gravatar for josm
6 months ago by
josm0
josm0 wrote:

Perhaps this works.

GitHub - Extracting read pairs that have been concordantly aligned exactly 1 time #58

ADD COMMENTlink written 6 months ago by josm0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 948 users visited in the last hour