How to extract only 5-prime hard/soft clipped reads?
2
1
Entering edit mode
5.0 years ago
ThePresident ▴ 180

I use this command to extract a few particular fields of all hard and soft clipped reads from a SAM alignment (Illumina, paired-end):

awk -Ft '$6 ~ /H|S/{print$2,$4,$9}' input.sam> output.txt


However, I would like to extract only reads that have been clipped on the 5-prime end. Any suggestions?

TP

SAM clipping • 3.6k views
1
Entering edit mode
5.0 years ago

Use pysam and check if the first entry in read.cigartuples has the desired operation.

0
Entering edit mode

Never used pysam before, but when I tried I got this error:

>>> import pysam
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/__init__.py", line 5, in <module>
from pysam.libchtslib import *
File "pysam/libchtslib.pyx", line 1, in init pysam.libchtslib (pysam/libchtslib.c:12515)
ImportError: dlopen(/Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/libcutils.so, 2): Library not loaded: @rpath/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/libcsamtools.so
Referenced from: /Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/libcutils.so


Missing libraries?

I tried to uninstall it by running pip uninstall pysam but got this:

Cannot remove entries from nonexistent file /Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/easy-install.pth

0
Entering edit mode

You might want to reinstall pysam.

0
Entering edit mode

pip install pysam Requirement already satisfied: pysam in ./Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg

But I've figured out another way around my problem using awk (see below). Eventually, I'll have to figure out this pysam problem since I guess I'll need it eventually.

0
Entering edit mode

You'll need to delete ./Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg.

1
Entering edit mode
5.0 years ago
ThePresident ▴ 180

I couldn't figure out how to get this using pysam, but I might have found another solution using simple awk

awk -Ft '$6 ~ /^..?S/ {print$2, $4,$9}' input.sam > output.txt


This will extract columns 2,4,9 from the SAM file whenever character S (used to mark soft clipped nucleotides in the CIGAR string) is encountered in the 1st or 2nd position in the CIGAR string (which is the case in my data set because reads are <100 bp long).

Also, I can get rid of the reads that have 3-prime clipping and only 5-prime clipping with:

awk -Ft '$6 ~ /S$/ {next} $6 ~ /^..?S/ {print$2, $4,$9}' input.sam > output.txt


TP

1
Entering edit mode

Reads can align to either strand so you'll actually need two filters: one where the soft clip is at the end of the CIGAR and the read aligned to negative strand flag is not set, and another when the soft clip is at the start of the CIGAR and the negative strand flag is set.

0
Entering edit mode

You're right. I am just unsure for which flag I should filter (all these extracted reads are flagged either as 147, 163, 99 or 83). I'm currently looking at this, but if you have a suggestion, go ahead. Thanks for pointing out this detail.

0
Entering edit mode

You want flags in which the 5 least significant bit is either set or unset.

In awk, you'd probably want to do something like and(16, $FLAGS) != 0 ADD REPLY 0 Entering edit mode I am not sure I follow. Following your initial suggestions, I should filter for reads: 1. That are soft clipped at the start of the read but are mapped on reverse strand (so flagged as 83 or 147 which translates as mapped on reverse strand and being either first or second in pair respectively). 2. That are soft clipped at the end of the read and mapped on forward strand (so flagged as 99 or 163 which translates as mapped on forward strand and being first or second in pair respectively). In did it this way in awk (with help from folks at stackoverflow): awk '($2 ~ /147|83/ && $6 ~ /^..?S/) || ($2 ~ /99|163/ && $6 ~ /S$/) {next;} {print $2,$4, $6,$9 }' file.sam > output.txt


However, this way I get only fully mapped reads (a.k.a all flagged as 76M, no soft clipping). Something went wrong somewhere...

EDIT: when I get rid of the reads that are not clipped, I actually see reads that satisfy all criteria and are supposedly 1) clipped at the start and mapped to forward strand (flag 99/163) and those that are clipped at the end and map to reverse strand (flag 83/147). I am currently looking if this makes sense with my respective analysis.

EDIT2: Actually, this seems to be working. Here's the final statement I used:

awk '($2 ~ /147|83/ &&$6 ~ /^..?S/) || ($2 ~ /99|163/ &&$6 ~ /S$/) {next;}$6 ~ /^..?S/ {print $2,$4, $6,$9 }' input.sam > output.txt