Question: How to remove reads with hard/soft clipping along with its mate?
5
gravatar for komal.rathi
5.7 years ago by
komal.rathi3.7k
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.7k wrote:

Hi everyone,

I have an output bam file from GSNAP. I want to remove all the reads that have hard (H) & soft (S) clipping in the CIGAR string AND the corresponding mate for those reads. Note, I only have properly-paired reads in the starting bam file. For the first part I use the following command:

# remove reads with hard/soft clips info in the CIGAR string

samtools view -h sample.bam | awk '$6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam

However, this only removes the read with Hard/Soft Clipping & not its corresponding mate. Here is an example:

# example read HWI-ST1006:108:D245BACXX:6:1103:17271:82507

# extract this read from sample.bam

samtools view sample.bam | grep 'HWI-ST1006:108:D245BACXX:6:1103:17271:82507'

HWI-ST1006:108:D245BACXX:6:1103:17271:82507    163    chr8    19810830    20    99M    =    19810929    896    AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT    @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:1    NM:i:0    XW:i:0    XV:i:0    SM:i:20    XQ:i:40    X2:i:35    CT:Z:2F99M0T1R4M698N95M

HWI-ST1006:108:D245BACXX:6:1103:17271:82507    419    chr8    19810830    0    99M    =    19811630    896    AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT    @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:2    NM:i:0    XW:i:0    XV:i:0    SM:i:0    XQ:i:35    X2:i:35

HWI-ST1006:108:D245BACXX:6:1103:17271:82507    83    chr8    19810929    20    4M698N95M    =    19810830    -896    ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC    ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA??    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:1    NM:i:0    XW:i:0    XV:i:0    SM:i:20    XQ:i:40    X2:i:35    XS:A:+

HWI-ST1006:108:D245BACXX:6:1103:17271:82507    339    chr8    19811630    0    3S96M    =    19810830    -896    ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC    ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA??    RG:Z:C00060    MD:Z:96    NH:i:2    HI:i:2    NM:i:0    XW:i:0    XV:i:0    SM:i:0    XQ:i:35    X2:i:35

# extract the same read from sample.noclips.bam 

samtools view sample.noclips.bam | grep 'HWI-ST1006:108:D245BACXX:6:1103:17271:82507'

HWI-ST1006:108:D245BACXX:6:1103:17271:82507    163    chr8    19810830    20    99M    =    19810929    896    AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT    @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:1    NM:i:0    XW:i:0    XV:i:0    SM:i:20    XQ:i:40    X2:i:35    CT:Z:2F99M0T1R4M698N95M

HWI-ST1006:108:D245BACXX:6:1103:17271:82507    419    chr8    19810830    0    99M    =    19811630    896    AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT    @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:2    NM:i:0    XW:i:0    XV:i:0    SM:i:0    XQ:i:35    X2:i:35

HWI-ST1006:108:D245BACXX:6:1103:17271:82507    83    chr8    19810929    20    4M698N95M    =    19810830    -896    ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC    ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA??    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:1    NM:i:0    XW:i:0    XV:i:0    SM:i:20    XQ:i:40    X2:i:35    XS:A:+

So, how do I remove reads with hard/soft clipping along with its mate? 

gsnap rna-seq • 7.0k views
ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by komal.rathi3.7k
2

How important are those secondary alignments for your downstream analysis? If they aren't important you can follow what Pierre suggested. In the other case, when secondary alignments are important for the downstream analysis, it may get little trickier as you want to make sure that you are removing the correct pair and not everything. For example, 83 and 163 will have to be removed together. Similarly 339 and 419. 

If you only one secondary alignment for each primary alignment, you can create two separate bam files using "samtools view -F 256" and then process both the files as Pierre suggested and then merge them later on. 

**Comment edited after Pierre's answer**

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Ashutosh Pandey12k

Ashutosh Pandey I do have more than one secondary alignment for primary alignments & I want to keep them. The only reason I am removing hard/soft clipping is because the downstream analysis (with R package asSeq) does not recognize these clippings. But it also wants reads to be paired. In the above example, 83 & 163 shouldn't be removed but 339 & 419 should be.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by komal.rathi3.7k

it would be easy with pysam. You can recreate a bam file with reads of your interest.

ADD REPLYlink written 5.7 years ago by geek_y11k

Hello Komal,

I am trying to use the command you have in your original post:

samtools view -h sample.bam | awk '$6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam

I have an alignment file(bam from single end). My goal is to remove the soft/hard clipped alignments from the original bam file. When I used the command you have in your original post I do not get the expected alignment in sample.noclips.bam

Meaning, when I use

samtools view sample.bam | grep read1

I get the resulting alignment. However, when I try to find the same read:

samtools view sample.noclips.bam | grep read1

I do not find it. There are very few reads in my sample.noclips.bam. Am I doing anything wrong here?

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by KVC_bioinfo510
3
gravatar for Pierre Lindenbaum
5.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

with your awk, extract and sort the name of the clipped reads:

 

samtools view sample.bam | awk -F '\t' '($6 !~ /H|S/)' | cut -f 1 | LC_ALL=C sort | uniq > names.txt

sort you original sam on names:

samtools view sample.bam | LC_ALL=C sort -t '\t' -k1,1 > tmp.sam

outer-join

samtools view -H sample.bam > new.sam
join -t '\t' -v 1 -1 1 -2 1 tmp.sam names.txt >> new.sam

 

resort on 'coordinate' if needed.

 

 

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Pierre Lindenbaum131k

Pierre,  Would it be possible to save the secondary alignments involving the read pairs where the primary alignment of one of the read shows clipping ? 

ADD REPLYlink written 5.7 years ago by Ashutosh Pandey12k

yes filter with samtools view -F 64, then with samtools view -F 128 ( https://broadinstitute.github.io/picard/explain-flags.html )

ADD REPLYlink written 5.7 years ago by Pierre Lindenbaum131k

Pierre Lindenbaum In the above example, 83 & 163 shouldn't be removed but 339 & 419 should be, because 339 has soft clipping & its mate is 419. This code removes every alignment (primary as well as other secondary) with the read name that has hard/soft clipping. 

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by komal.rathi3.7k
1

Komal, I had the exact same concern. May be I was not clear enough and Pierre misunderstood my question. An easy approach would be to store read name (column 1), position (column 4), mate position(column 8) for reads showing soft/hard clipping. This information can be used to remove all the reads with the same name and whose position is either one of the above stored positions. I am sure Pierre will come up with an elegant approach :-). I can write a python script but you will have to wait until the weekend.  

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Ashutosh Pandey12k

Ashutosh Pandey Thanks, that is exactly what I am working on right now (with awk and what not). Appreciate your help. 

Pierre Lindenbaum To extract clipped reads, I just modified your first line by removing '!' but I am still getting some reads with hard clips, and I don't have a clue for the life of me.

ADD REPLYlink written 5.7 years ago by komal.rathi3.7k

Hey Komal,

Do you have the original SAM files that were produced by GSNAP? They should be sorted by queryname and the read-pairs should be correctly paired. You can then use sth like:

 awk '{cigar=$6; sam_record=$0; getline}{if(cigar !~ /H|S/ && $6 !~ /H|S/){print sam_record,"\n",$0}}' Input.sam

and it should give you all the consecutive SAM records where both the records were aligned without using any clipping. I haven't tested it but you should be able to make small tweaks if needed.  

ADD REPLYlink written 5.7 years ago by Ashutosh Pandey12k

The BAM files I am using are filtered (not the original output of GSNAP) & sorted by coordinates. I'm trying the above awk command (maybe with a few changes). 

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by komal.rathi3.7k

They may not be sorted correctly. In the example above, 83 should be grouped with 163 (163 should follow right after 83) and 339 should be grouped with 419. My proposed solution won't work for the bam snippet that you have pasted above. The read pairs from an alignment should be grouped together. Try "samtools -n" and see if it can group them correctly. 

ADD REPLYlink written 5.7 years ago by Ashutosh Pandey12k

Sorry, I meant sorted by coordinates. 

ADD REPLYlink written 5.7 years ago by komal.rathi3.7k
3
gravatar for komal.rathi
5.7 years ago by
komal.rathi3.7k
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.7k wrote:

I gave some thought to our pipeline & the downstream program (extractAsReads function of asSeq) and realize that secondary alignments won't make things any better for me. I couldn't get Pierre's code to work as is, so I had to make a few changes:

#extract name of clipped reads:

samtools view sample.bam | awk '$6 ~ /H|S/{print $1}' | sort -k1,1 | uniq > names.txt

# sort original bam on names:

samtools view sample.bam | sort -k1,1 > tmp.sam

# get header

samtools view -H sample.bam > new.sam

# outer-join (I use ctrl+v+tab to explicitly get the tab character)

join -t '   ' -v 1 -1 1 -2 1 tmp.sam names.txt >> new.sam

# convert to bam

samtools view -bS new.sam > new.bam

Nevertheless, thank you Ashutosh Pandey and Pierre Lindenbaum for taking time off to help me. 

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by komal.rathi3.7k

Great!! Thanks for the update.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Ashutosh Pandey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2091 users visited in the last hour