Question

How to remove reads with hard/soft clipping along with its mate?

7

Entering edit mode

9.1 years ago

komal.rathi ★ 4.1k

Hi everyone,

I have an output bam file from GSNAP. I want to remove all the reads that have hard (H) & soft (S) clipping in the CIGAR string AND the corresponding mate for those reads. Note, I only have properly-paired reads in the starting bam file. For the first part I use the following command:

# remove reads with hard/soft clips info in the CIGAR string
samtools view -h sample.bam | awk '$6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam

However, this only removes the read with Hard/Soft Clipping & not its corresponding mate. Here is an example:

# example read HWI-ST1006:108:D245BACXX:6:1103:17271:82507
# extract this read from sample.bam
samtools view sample.bam | grep 'HWI-ST1006:108:D245BACXX:6:1103:17271:82507'

HWI-ST1006:108:D245BACXX:6:1103:17271:82507    163    chr8    19810830    20    99M    =    19810929    896    AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT    @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:1    NM:i:0    XW:i:0    XV:i:0    SM:i:20    XQ:i:40    X2:i:35    CT:Z:2F99M0T1R4M698N95M
HWI-ST1006:108:D245BACXX:6:1103:17271:82507    419    chr8    19810830    0    99M    =    19811630    896    AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT    @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:2    NM:i:0    XW:i:0    XV:i:0    SM:i:0    XQ:i:35    X2:i:35
HWI-ST1006:108:D245BACXX:6:1103:17271:82507    83    chr8    19810929    20    4M698N95M    =    19810830    -896    ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC    ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA??    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:1    NM:i:0    XW:i:0    XV:i:0    SM:i:20    XQ:i:40    X2:i:35    XS:A:+
HWI-ST1006:108:D245BACXX:6:1103:17271:82507    339    chr8    19811630    0    3S96M    =    19810830    -896    ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC    ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA??    RG:Z:C00060    MD:Z:96    NH:i:2    HI:i:2    NM:i:0    XW:i:0    XV:i:0    SM:i:0    XQ:i:35    X2:i:35

# extract the same read from sample.noclips.bam
samtools view sample.noclips.bam | grep 'HWI-ST1006:108:D245BACXX:6:1103:17271:82507'

HWI-ST1006:108:D245BACXX:6:1103:17271:82507    163    chr8    19810830    20    99M    =    19810929    896    AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT    @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:1    NM:i:0    XW:i:0    XV:i:0    SM:i:20    XQ:i:40    X2:i:35    CT:Z:2F99M0T1R4M698N95M
HWI-ST1006:108:D245BACXX:6:1103:17271:82507    419    chr8    19810830    0    99M    =    19811630    896    AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT    @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:2    NM:i:0    XW:i:0    XV:i:0    SM:i:0    XQ:i:35    X2:i:35
HWI-ST1006:108:D245BACXX:6:1103:17271:82507    83    chr8    19810929    20    4M698N95M    =    19810830    -896    ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC    ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA??    RG:Z:C00060    MD:Z:99    NH:i:2    HI:i:1    NM:i:0    XW:i:0    XV:i:0    SM:i:20    XQ:i:40    X2:i:35    XS:A:+

So, how do I remove reads with hard/soft clipping along with its mate?

gsnap rna-seq • 13k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.1 years ago by komal.rathi ★ 4.1k

2

Entering edit mode

How important are those secondary alignments for your downstream analysis? If they aren't important you can follow what Pierre suggested. In the other case, when secondary alignments are important for the downstream analysis, it may get little trickier as you want to make sure that you are removing the correct pair and not everything. For example, 83 and 163 will have to be removed together. Similarly 339 and 419.

If you only one secondary alignment for each primary alignment, you can create two separate bam files using "samtools view -F 256" and then process both the files as Pierre suggested and then merge them later on.

Comment edited after Pierre's answer

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Ashutosh Pandey I do have more than one secondary alignment for primary alignments & I want to keep them. The only reason I am removing hard/soft clipping is because the downstream analysis (with R package asSeq) does not recognize these clippings. But it also wants reads to be paired. In the above example, 83 & 163 shouldn't be removed but 339 & 419 should be.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

it would be easy with pysam. You can recreate a bam file with reads of your interest.

ADD REPLY • link 9.1 years ago by GouthamAtla 12k

0

Entering edit mode

Hello Komal,

I am trying to use the command you have in your original post:

samtools view -h sample.bam | awk '$6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam

I have an alignment file(bam from single end). My goal is to remove the soft/hard clipped alignments from the original bam file. When I used the command you have in your original post I do not get the expected alignment in sample.noclips.bam

Meaning, when I use

samtools view sample.bam | grep read1

I get the resulting alignment. However, when I try to find the same read:

samtools view sample.noclips.bam | grep read1

I do not find it. There are very few reads in my sample.noclips.bam. Am I doing anything wrong here?

ADD REPLY • link 6.4 years ago by KVC_bioinfo ▴ 590

Ram · Answer 1 · 2015-04-09

I gave some thought to our pipeline & the downstream program (extractAsReads function of asSeq) and realize that secondary alignments won't make things any better for me. I couldn't get Pierre's code to work as is, so I had to make a few changes:

#extract name of clipped reads:
samtools view sample.bam | awk '$6 ~ /H|S/{print $1}' | sort -k1,1 | uniq > names.txt

# sort original bam on names:
samtools view sample.bam | sort -k1,1 > tmp.sam

# get header
samtools view -H sample.bam > new.sam

# outer-join (I use ctrl+v+tab to explicitly get the tab character)
join -t '   ' -v 1 -1 1 -2 1 tmp.sam names.txt >> new.sam

# convert to bam
samtools view -bS new.sam > new.bam

Nevertheless, thank you Ashutosh Pandey and Pierre Lindenbaum for taking time off to help me.

Ram · Answer 2 · 2015-04-08

5

Entering edit mode

9.1 years ago

Pierre Lindenbaum 161k

With your awk, extract and sort the name of the clipped reads:

samtools view sample.bam | awk -F '\t' '($6 !~ /H|S/)' | cut -f 1 | LC_ALL=C sort | uniq > names.txt

Sort you original sam on names:

samtools view sample.bam | LC_ALL=C sort -t '\t' -k1,1 > tmp.sam

outer-join

samtools view -H sample.bam > new.sam
join -t '\t' -v 1 -1 1 -2 1 tmp.sam names.txt >> new.sam

Re-sort on 'coordinate' if needed.

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Pierre, Would it be possible to save the secondary alignments involving the read pairs where the primary alignment of one of the read shows clipping?

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

yes filter with samtools view -F 64, then with samtools view -F 128 ( https://broadinstitute.github.io/picard/explain-flags.html )

ADD REPLY • link 9.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Pierre Lindenbaum In the above example, 83 & 163 shouldn't be removed but 339 & 419 should be, because 339 has soft clipping & its mate is 419. This code removes every alignment (primary as well as other secondary) with the read name that has hard/soft clipping.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by komal.rathi ★ 4.1k

1

Entering edit mode

Komal, I had the exact same concern. May be I was not clear enough and Pierre misunderstood my question. An easy approach would be to store read name (column 1), position (column 4), mate position(column 8) for reads showing soft/hard clipping. This information can be used to remove all the reads with the same name and whose position is either one of the above stored positions. I am sure Pierre will come up with an elegant approach :-). I can write a python script but you will have to wait until the weekend.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Ashutosh Pandey Thanks, that is exactly what I am working on right now (with awk and what not). Appreciate your help.

Pierre Lindenbaum To extract clipped reads, I just modified your first line by removing '!' but I am still getting some reads with hard clips, and I don't have a clue for the life of me.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Hey Komal,

Do you have the original SAM files that were produced by GSNAP? They should be sorted by queryname and the read-pairs should be correctly paired. You can then use something like:

awk '{cigar=$6; sam_record=$0; getline}{if(cigar !~ /H|S/ && $6 !~ /H|S/){print sam_record,"\n",$0}}' Input.sam

and it should give you all the consecutive SAM records where both the records were aligned without using any clipping. I haven't tested it but you should be able to make small tweaks if needed.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

The BAM files I am using are filtered (not the original output of GSNAP) & sorted by coordinates. I'm trying the above awk command (maybe with a few changes).

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

They may not be sorted correctly. In the example above, 83 should be grouped with 163 (163 should follow right after 83) and 339 should be grouped with 419. My proposed solution won't work for the bam snippet that you have pasted above. The read pairs from an alignment should be grouped together. Try "samtools -n" and see if it can group them correctly.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Sorry, I meant sorted by coordinates.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by komal.rathi ★ 4.1k