How to remove reads with hard/soft clipping along with its mate?
2
7
Entering edit mode
7.1 years ago
komal.rathi ★ 4.0k

Hi everyone,

I have an output bam file from GSNAP. I want to remove all the reads that have hard (H) & soft (S) clipping in the CIGAR string AND the corresponding mate for those reads. Note, I only have properly-paired reads in the starting bam file. For the first part I use the following command:

# remove reads with hard/soft clips info in the CIGAR string

samtools view -h sample.bam | awk '6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam However, this only removes the read with Hard/Soft Clipping & not its corresponding mate. Here is an example: # example read HWI-ST1006:108:D245BACXX:6:1103:17271:82507 # extract this read from sample.bam samtools view sample.bam | grep 'HWI-ST1006:108:D245BACXX:6:1103:17271:82507' HWI-ST1006:108:D245BACXX:6:1103:17271:82507 163 chr8 19810830 20 99M = 19810929 896 AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:1 NM:i:0 XW:i:0 XV:i:0 SM:i:20 XQ:i:40 X2:i:35 CT:Z:2F99M0T1R4M698N95M HWI-ST1006:108:D245BACXX:6:1103:17271:82507 419 chr8 19810830 0 99M = 19811630 896 AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:2 NM:i:0 XW:i:0 XV:i:0 SM:i:0 XQ:i:35 X2:i:35 HWI-ST1006:108:D245BACXX:6:1103:17271:82507 83 chr8 19810929 20 4M698N95M = 19810830 -896 ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA?? RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:1 NM:i:0 XW:i:0 XV:i:0 SM:i:20 XQ:i:40 X2:i:35 XS:A:+ HWI-ST1006:108:D245BACXX:6:1103:17271:82507 339 chr8 19811630 0 3S96M = 19810830 -896 ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA?? RG:Z:C00060 MD:Z:96 NH:i:2 HI:i:2 NM:i:0 XW:i:0 XV:i:0 SM:i:0 XQ:i:35 X2:i:35 # extract the same read from sample.noclips.bam samtools view sample.noclips.bam | grep 'HWI-ST1006:108:D245BACXX:6:1103:17271:82507' HWI-ST1006:108:D245BACXX:6:1103:17271:82507 163 chr8 19810830 20 99M = 19810929 896 AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:1 NM:i:0 XW:i:0 XV:i:0 SM:i:20 XQ:i:40 X2:i:35 CT:Z:2F99M0T1R4M698N95M HWI-ST1006:108:D245BACXX:6:1103:17271:82507 419 chr8 19810830 0 99M = 19811630 896 AACTACCCTCTGGACAATGTCCATCTCTTGGGATACAGCCTTGGAGCCCATGCTGCTGGCATTGCAGGAAGTCTGACCAATAAGAAAGTCAACAGAATT @<DD;DDF?FFFF4:<<FHFH9<FH>HID9AE)::C<1?C;?DF9B?D??B3D*8/==>F>@F;@@3=?CE=ACB@D;?;@ACA6-,;@>AA>@??BAA RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:2 NM:i:0 XW:i:0 XV:i:0 SM:i:0 XQ:i:35 X2:i:35 HWI-ST1006:108:D245BACXX:6:1103:17271:82507 83 chr8 19810929 20 4M698N95M = 19810830 -896 ACTGGCCTCGATCCAGCTGGACCTAACTTTGAGTATGCAGAAGCCCCGAGTCGTCTTTCTCCTGATGATGCAGATTTTGTAGACGTCTTACACACATTC ############################A>>A@A>@3DD@?6DDA;;<'@7<DB?DBBBEDC*<<>>CEAAFDBFE<EA?@<2@BC?D?DDD==DDA?? RG:Z:C00060 MD:Z:99 NH:i:2 HI:i:1 NM:i:0 XW:i:0 XV:i:0 SM:i:20 XQ:i:40 X2:i:35 XS:A:+ So, how do I remove reads with hard/soft clipping along with its mate? rna-seq gsnap • 9.5k views ADD COMMENT 2 Entering edit mode How important are those secondary alignments for your downstream analysis? If they aren't important you can follow what Pierre suggested. In the other case, when secondary alignments are important for the downstream analysis, it may get little trickier as you want to make sure that you are removing the correct pair and not everything. For example, 83 and 163 will have to be removed together. Similarly 339 and 419. If you only one secondary alignment for each primary alignment, you can create two separate bam files using "samtools view -F 256" and then process both the files as Pierre suggested and then merge them later on. **Comment edited after Pierre's answer** ADD REPLY 0 Entering edit mode Ashutosh Pandey I do have more than one secondary alignment for primary alignments & I want to keep them. The only reason I am removing hard/soft clipping is because the downstream analysis (with R package asSeq) does not recognize these clippings. But it also wants reads to be paired. In the above example, 83 & 163 shouldn't be removed but 339 & 419 should be. ADD REPLY 0 Entering edit mode it would be easy with pysam. You can recreate a bam file with reads of your interest. ADD REPLY 0 Entering edit mode Hello Komal, I am trying to use the command you have in your original post: samtools view -h sample.bam | awk '6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam

I have an alignment file(bam from single end). My goal is to remove the soft/hard clipped alignments from the original bam file. When I used the command you have in your original post I do not get the expected alignment in sample.noclips.bam

Meaning, when I use

samtools view sample.bam | grep read1

I get the resulting alignment. However, when I try to find the same read:

samtools view sample.noclips.bam | grep read1

I do not find it. There are very few reads in my sample.noclips.bam. Am I doing anything wrong here?

8
Entering edit mode
7.1 years ago
komal.rathi ★ 4.0k

I gave some thought to our pipeline & the downstream program (extractAsReads function of asSeq) and realize that secondary alignments won't make things any better for me. I couldn't get Pierre's code to work as is, so I had to make a few changes:

samtools view sample.bam | awk '$6 ~ /H|S/{print$1}' | sort -k1,1 | uniq > names.txt

# sort original bam on names:

samtools view sample.bam | sort -k1,1 > tmp.sam

samtools view -H sample.bam > new.sam

# outer-join (I use ctrl+v+tab to explicitly get the tab character)

join -t '   ' -v 1 -1 1 -2 1 tmp.sam names.txt >> new.sam

# convert to bam

samtools view -bS new.sam > new.bam

Nevertheless, thank you Ashutosh Pandey and Pierre Lindenbaum for taking time off to help me.

0
Entering edit mode

Great!! Thanks for the update.

5
Entering edit mode
7.1 years ago

with your awk, extract and sort the name of the clipped reads:

samtools view sample.bam | awk -F '\t' '(6 !~ /H|S/)' | cut -f 1 | LC_ALL=C sort | uniq > names.txt sort you original sam on names: samtools view sample.bam | LC_ALL=C sort -t '\t' -k1,1 > tmp.sam outer-join samtools view -H sample.bam > new.sam join -t '\t' -v 1 -1 1 -2 1 tmp.sam names.txt >> new.sam resort on 'coordinate' if needed. ADD COMMENT 0 Entering edit mode Pierre, Would it be possible to save the secondary alignments involving the read pairs where the primary alignment of one of the read shows clipping ? ADD REPLY 0 Entering edit mode yes filter with samtools view -F 64, then with samtools view -F 128 ( https://broadinstitute.github.io/picard/explain-flags.html ) ADD REPLY 0 Entering edit mode Pierre Lindenbaum In the above example, 83 & 163 shouldn't be removed but 339 & 419 should be, because 339 has soft clipping & its mate is 419. This code removes every alignment (primary as well as other secondary) with the read name that has hard/soft clipping. ADD REPLY 1 Entering edit mode Komal, I had the exact same concern. May be I was not clear enough and Pierre misunderstood my question. An easy approach would be to store read name (column 1), position (column 4), mate position(column 8) for reads showing soft/hard clipping. This information can be used to remove all the reads with the same name and whose position is either one of the above stored positions. I am sure Pierre will come up with an elegant approach :-). I can write a python script but you will have to wait until the weekend. ADD REPLY 0 Entering edit mode Ashutosh Pandey Thanks, that is exactly what I am working on right now (with awk and what not). Appreciate your help. Pierre Lindenbaum To extract clipped reads, I just modified your first line by removing '!' but I am still getting some reads with hard clips, and I don't have a clue for the life of me. ADD REPLY 0 Entering edit mode Hey Komal, Do you have the original SAM files that were produced by GSNAP? They should be sorted by queryname and the read-pairs should be correctly paired. You can then use sth like: awk '{cigar=6; sam_record=$0; getline}{if(cigar !~ /H|S/ &&$6 !~ /H|S/){print sam_record,"\n",\$0}}' Input.sam

and it should give you all the consecutive SAM records where both the records were aligned without using any clipping. I haven't tested it but you should be able to make small tweaks if needed.

0
Entering edit mode

The BAM files I am using are filtered (not the original output of GSNAP) & sorted by coordinates. I'm trying the above awk command (maybe with a few changes).

0
Entering edit mode

They may not be sorted correctly. In the example above, 83 should be grouped with 163 (163 should follow right after 83) and 339 should be grouped with 419. My proposed solution won't work for the bam snippet that you have pasted above. The read pairs from an alignment should be grouped together. Try "samtools -n" and see if it can group them correctly.

0
Entering edit mode

Sorry, I meant sorted by coordinates.