Question: How to extract only 5-prime hard/soft clipped reads?
1
gravatar for ThePresident
2.6 years ago by
ThePresident140
ThePresident140 wrote:

I use this command to extract a few particular fields of all hard and soft clipped reads from a SAM alignment (Illumina, paired-end):

awk -Ft '$6 ~ /H|S/{print $2,$4,$9}' input.sam> output.txt

However, I would like to extract only reads that have been clipped on the 5-prime end. Any suggestions?

TP

sam clipping • 1.9k views
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by ThePresident140
1
gravatar for Devon Ryan
2.6 years ago by
Devon Ryan94k
Freiburg, Germany
Devon Ryan94k wrote:

Use pysam and check if the first entry in read.cigartuples has the desired operation.

ADD COMMENTlink written 2.6 years ago by Devon Ryan94k

Never used pysam before, but when I tried I got this error:

>>> import pysam
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
  File "pysam/libchtslib.pyx", line 1, in init pysam.libchtslib (pysam/libchtslib.c:12515)
ImportError: dlopen(/Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/libcutils.so, 2): Library not loaded: @rpath/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/libcsamtools.so
  Referenced from: /Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg/pysam/libcutils.so
  Reason: image not found

Missing libraries?

I tried to uninstall it by running pip uninstall pysam but got this:

Cannot remove entries from nonexistent file /Users/TP/Documents/Tools/HTSeq-0.9.1/.eggs/easy-install.pth
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by ThePresident140

You might want to reinstall pysam.

ADD REPLYlink written 2.6 years ago by Devon Ryan94k

Already tried:

pip install pysam Requirement already satisfied: pysam in ./Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg

But I've figured out another way around my problem using awk (see below). Eventually, I'll have to figure out this pysam problem since I guess I'll need it eventually.

ADD REPLYlink written 2.6 years ago by ThePresident140

You'll need to delete ./Documents/Tools/HTSeq-0.9.1/.eggs/pysam-0.11.2.2-py2.7-macosx-10.6-intel.egg.

ADD REPLYlink written 2.6 years ago by Devon Ryan94k
1
gravatar for ThePresident
2.6 years ago by
ThePresident140
ThePresident140 wrote:

I couldn't figure out how to get this using pysam, but I might have found another solution using simple awk

awk -Ft '$6 ~ /^..?S/ {print $2, $4, $9}' input.sam > output.txt

This will extract columns 2,4,9 from the SAM file whenever character S (used to mark soft clipped nucleotides in the CIGAR string) is encountered in the 1st or 2nd position in the CIGAR string (which is the case in my data set because reads are <100 bp long).

Also, I can get rid of the reads that have 3-prime clipping and only 5-prime clipping with:

awk -Ft '$6 ~ /S$/ {next} $6 ~ /^..?S/ {print $2, $4, $9}' input.sam > output.txt

TP

ADD COMMENTlink written 2.6 years ago by ThePresident140
1

Reads can align to either strand so you'll actually need two filters: one where the soft clip is at the end of the CIGAR and the read aligned to negative strand flag is not set, and another when the soft clip is at the start of the CIGAR and the negative strand flag is set.

ADD REPLYlink written 2.6 years ago by d-cameron2.1k

You're right. I am just unsure for which flag I should filter (all these extracted reads are flagged either as 147, 163, 99 or 83). I'm currently looking at this, but if you have a suggestion, go ahead. Thanks for pointing out this detail.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by ThePresident140

You want flags in which the 5 least significant bit is either set or unset.

https://broadinstitute.github.io/picard/explain-flags.html

In awk, you'd probably want to do something like and(16, $FLAGS) != 0

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by d-cameron2.1k

I am not sure I follow. Following your initial suggestions, I should filter for reads:

  1. That are soft clipped at the start of the read but are mapped on reverse strand (so flagged as 83 or 147 which translates as mapped on reverse strand and being either first or second in pair respectively).
  2. That are soft clipped at the end of the read and mapped on forward strand (so flagged as 99 or 163 which translates as mapped on forward strand and being first or second in pair respectively).

In did it this way in awk (with help from folks at stackoverflow):

awk '($2 ~ /147|83/ && $6 ~ /^..?S/) || ($2 ~ /99|163/ && $6 ~ /S$/) {next;} {print $2, $4, $6, $9 }' file.sam > output.txt

However, this way I get only fully mapped reads (a.k.a all flagged as 76M, no soft clipping). Something went wrong somewhere...

EDIT: when I get rid of the reads that are not clipped, I actually see reads that satisfy all criteria and are supposedly 1) clipped at the start and mapped to forward strand (flag 99/163) and those that are clipped at the end and map to reverse strand (flag 83/147). I am currently looking if this makes sense with my respective analysis.

EDIT2: Actually, this seems to be working. Here's the final statement I used:

awk '($2 ~ /147|83/ && $6 ~ /^..?S/) || ($2 ~ /99|163/ && $6 ~ /S$/) {next;} $6 ~ /^..?S/ {print $2, $4, $6, $9 }' input.sam > output.txt
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by ThePresident140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1990 users visited in the last hour