How to extract only 5-prime hard/soft clipped reads?
2
1
Entering edit mode
5.0 years ago
ThePresident ▴ 180

I use this command to extract a few particular fields of all hard and soft clipped reads from a SAM alignment (Illumina, paired-end):

awk -Ft '$6 ~ /H|S/{print $2,$4,$9}' input.sam> output.txt

However, I would like to extract only reads that have been clipped on the 5-prime end. Any suggestions?

TP

SAM clipping • 3.6k views
ADD COMMENT
1
Entering edit mode
5.0 years ago

Use pysam and check if the first entry in read.cigartuples has the desired operation.

ADD COMMENT
0
Entering edit mode

Never used pysam before, but when I tried I got this error:

>>> import pysam
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
  File "pysam/libchtslib.pyx", line 1, in init pysam.libchtslib (pysam/libchtslib.c:12515)
ImportError: dlopen(/Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/libcutils.so, 2): Library not loaded: @rpath/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/libcsamtools.so
  Referenced from: /Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/libcutils.so
  Reason: image not found

Missing libraries?

I tried to uninstall it by running pip uninstall pysam but got this:

Cannot remove entries from nonexistent file /Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/easy-install.pth
ADD REPLY
0
Entering edit mode

You might want to reinstall pysam.

ADD REPLY
0
Entering edit mode

Already tried:

pip install pysam Requirement already satisfied: pysam in ./Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg

But I've figured out another way around my problem using awk (see below). Eventually, I'll have to figure out this pysam problem since I guess I'll need it eventually.

ADD REPLY
0
Entering edit mode

You'll need to delete ./Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg.

ADD REPLY
1
Entering edit mode
5.0 years ago
ThePresident ▴ 180

I couldn't figure out how to get this using pysam, but I might have found another solution using simple awk

awk -Ft '$6 ~ /^..?S/ {print $2, $4, $9}' input.sam > output.txt

This will extract columns 2,4,9 from the SAM file whenever character S (used to mark soft clipped nucleotides in the CIGAR string) is encountered in the 1st or 2nd position in the CIGAR string (which is the case in my data set because reads are <100 bp long).

Also, I can get rid of the reads that have 3-prime clipping and only 5-prime clipping with:

awk -Ft '$6 ~ /S$/ {next} $6 ~ /^..?S/ {print $2, $4, $9}' input.sam > output.txt

TP

ADD COMMENT
1
Entering edit mode

Reads can align to either strand so you'll actually need two filters: one where the soft clip is at the end of the CIGAR and the read aligned to negative strand flag is not set, and another when the soft clip is at the start of the CIGAR and the negative strand flag is set.

ADD REPLY
0
Entering edit mode

You're right. I am just unsure for which flag I should filter (all these extracted reads are flagged either as 147, 163, 99 or 83). I'm currently looking at this, but if you have a suggestion, go ahead. Thanks for pointing out this detail.

ADD REPLY
0
Entering edit mode

You want flags in which the 5 least significant bit is either set or unset.

https://broadinstitute.github.io/picard/explain-flags.html

In awk, you'd probably want to do something like and(16, $FLAGS) != 0

ADD REPLY
0
Entering edit mode

I am not sure I follow. Following your initial suggestions, I should filter for reads:

  1. That are soft clipped at the start of the read but are mapped on reverse strand (so flagged as 83 or 147 which translates as mapped on reverse strand and being either first or second in pair respectively).
  2. That are soft clipped at the end of the read and mapped on forward strand (so flagged as 99 or 163 which translates as mapped on forward strand and being first or second in pair respectively).

In did it this way in awk (with help from folks at stackoverflow):

awk '($2 ~ /147|83/ && $6 ~ /^..?S/) || ($2 ~ /99|163/ && $6 ~ /S$/) {next;} {print $2, $4, $6, $9 }' file.sam > output.txt

However, this way I get only fully mapped reads (a.k.a all flagged as 76M, no soft clipping). Something went wrong somewhere...

EDIT: when I get rid of the reads that are not clipped, I actually see reads that satisfy all criteria and are supposedly 1) clipped at the start and mapped to forward strand (flag 99/163) and those that are clipped at the end and map to reverse strand (flag 83/147). I am currently looking if this makes sense with my respective analysis.

EDIT2: Actually, this seems to be working. Here's the final statement I used:

awk '($2 ~ /147|83/ && $6 ~ /^..?S/) || ($2 ~ /99|163/ && $6 ~ /S$/) {next;} $6 ~ /^..?S/ {print $2, $4, $6, $9 }' input.sam > output.txt
ADD REPLY

Login before adding your answer.

Traffic: 2000 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6