Question: extracting the soft clipped seq only from a sam file
2
gravatar for Varun Gupta
4.4 years ago by
Varun Gupta1.1k
United States
Varun Gupta1.1k wrote:

HI Everyone,

I have a sam file something like this

SRR959756.117   272     RNU6-170P       141     9       23S21M  *       0       0       TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG
SRR959756.117   272     RNU6-171P       139     9       24S20M  *       0       0       TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG
SRR959756.1740  272     RNU2-65P        349     14      2S23M19S        *       0       0       CAGCTGGGATTACAGGCATGAGCCACCACGCCTGGCACCCAGCT
SRR959756.117   16      RNU6-1295P      110     9       19S25M  *       0       0       TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG
SRR959756.212   256     RNU2-41P        38      1       25M19S  *       0       0       ATCTGTTCTTATCAGTTTAATATCTGATACGTCCTCTATCCGAG

(I am only showing the first 10 columns of it). I want to extract the soft clipped sequence and make a fasta file with header as >column1_column2_column3_column6. How can I do this. I am a little confused with the strand/flag column and hence having difficulty interpreting how to extract the soft clipped seq. Also where the CIGAR STRING is 2S23M19S (as in 3rd case), I just want to extract the 19S part of the seq. How can I do this.

Hope to hear soon

Regards

Varun

soft clip sam bam • 6.7k views
ADD COMMENTlink modified 2.8 years ago by Biostar ♦♦ 20 • written 4.4 years ago by Varun Gupta1.1k
3
gravatar for Devon Ryan
4.4 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

I actually wrote a tool to do something similar that you could base yours off of. It's the extractSoftclipped.c program in this github repo. It extracts soft-clipped regions from alignments in a BAM or SAM file into a fastq formatted file and uses the aligned position in the header line, but it'd be easy enough to change all of that. That program also allows filtering by MAPQ and soft-clipped region length (in the case of multiple soft-clipped regions, it'll use every region above whatever the length threshold is).

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Devon Ryan90k

HI Devon,

No idea about C. I will still look at the code to make out some thing from it..

ADD REPLYlink written 4.4 years ago by Varun Gupta1.1k

You might wanna check out Devon's solution. His perspective is more relevant than mine - I see the sequence as a string and do not make any assumptions on the features, which is not a good approach at times.

ADD REPLYlink written 4.4 years ago by RamRS21k

Hi Devon,

Your script works fine. I made a dummy bam file to test

@HD    VN:1.0    SO:unsorted
@SQ    SN:RNU6-170P    LN:304
@SQ    SN:RNU6-1295P    LN:303
@PG    ID:bowtie2    PN:bowtie2    VN:2.0.0-beta7
SRR959756.117    16    RNU6-1295P    110    9    19S25M    *    0    0    TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG    VW[XY__c_f]^a^^a[dad[YYZ\V]]b]YZa\Y^[^^afaaa    AS:i:50    XS:i:42    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:25    YT:Z:UU
SRR959756.117    272    RNU6-170P    141    9    23S21M    *    0    0    TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG    VW[XY__c_f]^a^^a[dad[YYZ\V]]b]YZa\Y^[^^afaaa    AS:i:42    XS:i:42    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:21    YT:Z:UU

and ran the tool as

extractSoftclipped test.bam > fest.fq.gz

and my fastq looks like

@RNU6-1295P:109
TAGAGACGGGGTCTCGCTA
+
VW[XY__c_f]^a^^a[da
@RNU6-170P:140
TAGAGACGGGGTCTCGCTATGTT
+
VW[XY__c_f]^a^^a[dad[YY

I just want read name and flag info in the header. Is it possible for you to modify it?

Is it this line in your c file where I have to modify it

fprintf(of, "@%s:%"PRId32"\n", hdr->target_name[b->core.tid], pos);

ADD REPLYlink written 4.4 years ago by Varun Gupta1.1k

Yup, you would want something like:

fprintf(of, ">%s_%i_%i\n", bam_get_qname(b), b->core.tid, b->core.flag);

That won't add the CIGAR, which actually requires a loop since it's not stored as a string (I can show you how to add this if you really want it). You can also comment out (Just add // to the beginning) lines 26-29 so you get a fasta file instead (I already changed the @ to a > above so the headers will be correct).

ADD REPLYlink written 4.4 years ago by Devon Ryan90k

Hi Devon,

If I also wanted to include the position pos which you already had in your original code , the new code should look like this

fprintf(of, ">%s_%s_%i_%"PRId32"\n", bam_get_qname(b), hdr->target_name[b->core.tid], b->core.flag,pos+1);

Also It would be nice if you can show me how to add the CIGAR string in the header itself. Also may I ask , is this all coming from pysam and C together?

 

ADD REPLYlink written 4.4 years ago by Varun Gupta1.1k

That looks correct for adding position. I'm glad you caught the tid mistake!

To insert a CIGAR string (we'll add it to the end), you'd do something like:

Note lines 23-24, which are the main changes. I've not tested this, but something along those lines should work.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Devon Ryan90k

BTW, this is all pure C. There are other programs in that repository written in python, but this isn't one of them.

ADD REPLYlink written 4.4 years ago by Devon Ryan90k

Hi Devon,

I am interested in extractSoftclipped C script, which unfortunately I can't run. I downloaded the extractSoftclipped.c and Makefile files to my project directory. The next step I am trying to compile the C code with make function. So I run the following: make extractSoftclipped
And I get:
make -C htslib
make: *** htslib: No such file or directory.  Stop.
make: *** [htslib] Error 2


If I compile with: gcc extractSoftclipped.c
I get following error: 'extractSoftclipped.c:1:24: fatal error: htslib/sam.h: No such file or directory
compilation terminated'

Could you please help me with this issue, as I have never done this type of work before and just don't know how ELSE to tackle the problem. Probably I am running the script from the wrong directory,

I would appreciate your help.
Thanks

ADD REPLYlink written 3.4 years ago by IrK20

you need to install the C htslib library : https://github.com/samtools/htslib

ADD REPLYlink written 3.4 years ago by Pierre Lindenbaum120k

What Pierre said, though you can just git submodule update --init while in the directory with that source code. I tend to put htslib in as a submodule on everything that uses it.

ADD REPLYlink written 3.4 years ago by Devon Ryan90k

Thank you Pierre Lindenbaum and Devon Ryan  for prompt respond,

we have all libraries at the right location, but for some reasons I get the same error message. Unfortunately, I have never been dealing with such stuff so it's very difficult for me to debug the problem.  However, I assume that it can't find a path for unknown to me reasons. Is it possible to get pre-compiled version of that particular script, because I am interest in extracting soft clipped reads. 

Or if its not difficult for you could you navigate me in debugging

Thanks once again,

ADD REPLYlink written 3.4 years ago by IrK20

If you just type the following it should work:

git clone https://github.com/dpryan79/SE-MEI.git
cd SE-MEI
git submodule update --init
make
ADD REPLYlink written 3.4 years ago by Devon Ryan90k

Hi, IrK, sorry for bothering you. Have you solved this problem? It will be very appreciated that you can give some advice.

ADD REPLYlink written 12 weeks ago by SDin0

I tried out the script. And it seems to work. Every once in a while I get a read that is peculiar. Probably, my expectation is off. This is one of the peculiar softclipped parts:

@T:2468
GTAAAACAATTTAAAA
+
JJJJJJJJJJJJJJJJ

I looked for a read, mapping on chromosome T at position 2469 in the file that I fed into the extractSoftclipped.c, yet there is none like it. This is my expectation. But maybe this is only true for reads that are softclipped at front? These are the reads, that fall into the the region of position T : 2468:

HWI-ST999:188:C49E6ACXX:8:1102:18487:21113  147 T   2443    41  23M1D3M1D25M    =   2362    -134 TTACCAAATTATAAACAAATAAGTCAATTTAATATTTTATATACGAAATCA    GIGJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJIJJJHHHHHFFFFFCCC AS:i:78 XN:i:0  XM:i:1  XO:i:2  XG:i:2  NM:i:3  MD:Z:23^T3^T7T17    YS:i:78 YT:Z:CP

HWI-ST999:188:C49E6ACXX:8:1101:12833:68128  163 T   2458    41  8M1D3M1D40M =   2563    156 CAAATAAGTCAATTTAATATTTTATATACGAAATCAATTAAATCTTAATCA B@?DDEDDBFHFHJJJIJJJIJJIJJGGJIJJIIIIJJIJJJIHIIIJGGF AS:i:78 XN:i:0  XM:i:1  XO:i:2  XG:i:2  NM:i:3  MD:Z:8^T3^T7T32 YS:i:87 YT:Z:CP

HWI-ST999:188:C49E6ACXX:8:1101:7229:39427   133 T   2471    0   *   =   2471    0   CGCCATTTATATATGTTTTTGTTCCTGATGAGCCACATAACTTATAGATTT <@?=DD;DAHDB4CE:@F>B<?EHIBC4FHCG*:ED><9DGEDD9HBF>FD YT:Z:UP

HWI-ST999:188:C49E6ACXX:8:1101:7229:39427   89  T   2471    36  7S44M   =   2471    0   TAAGTCAATTTAATATTTTATATACAAAATCAATTAAATCTTAATCACATA <GHHB<IGAGGDIGGGJJIJIIFEIFJIGJIIIGIJIGGHHDHFEDDF@@@ AS:i:74 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:7T10G25    YT:Z:UP

HWI-ST999:188:C49E6ACXX:8:1101:15690:30209  163 T   2484    36  49M2S   =   2752    318 TATACGAAATCAATTAAATATTAATCACATAAATATTAGCACAAAAAAATA CCCFFFFFHHHHHJJJJJJJJJJJJJJIJJIJJIJJIJJIJJJIJJJJJIH AS:i:74 XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:19C13A4A10 YS:i:76 YT:Z:CP

I am sorry to bother you with this.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by michaela_boell70
1
gravatar for Pierre Lindenbaum
4.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

I wrote a tool named SamExtractClip https://github.com/lindenb/jvarkit/wiki/SamExtractClip but I haven't much used it.

input is a SAM/BAM, ouput is a FASTQ

$ curl -L -s "https://raw.githubusercontent.com/samtools/samtools/develop/test/dat/mpileup.1.sam" | java -jar dist/samextractclip.jar 2> /dev/null 
@ERR013140.3521432/1:0:17:1:99
AGAGGTCCCCAACTTCTTTGCA
+
@AEDGBHIIIIIFJGIKHGHIJ
@ERR156632.12704932/2:0:17:1:163
TGGAGAAGGGGACAAGAGGTCCCCAACTTCTTTGCA
+
BFAFGFEIGFEFHHEIDKJGHHHJIIE=@KKGGKJG
@ERR156632.9601178/1:0:17:1:99
CTATGACAGGGAGGTCATGTGCAGGCTGGAGAAGGGGACAAGAGGTCCCCAACTTCTTTGCA
+
DEEEIIHHKIJILKHLHIKEKHHMKLKKJGKKKKLKLFIHEKIKL=KLJLKIILHKMH9LJJ
@ERR162872.21706338/1:0:17:1:99
CTTCTTTGCA
+
BHBFH<eifg @err243039.1049231="" 2:0:17:1:163="" tgcaggctggagaaggggacaagaggtccccaacttctttgca="" +="" aeefifhijdgigijhhiaggglgjiejhjhhfijgjjdfjig="" 

 

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Pierre Lindenbaum120k

HI Pierre

I am still working on it to download. Meanwhile I have a doubt. So if I have a CIGAR STRING as

2S23M19S then will it output only the 19S part of the seq because I see -m option which has a default of 5.

Also installing the tool with this command

git clone "https://github.com/lindenb/jvarkit.git" gave me this error

Cloning into jvarkit...
warning: templates not found /usr/local/share/git-core/templates
error: SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed while accessing https://github.com/lindenb/jvarkit.git/info/refs

fatal: HTTP request failed

ADD REPLYlink written 4.4 years ago by Varun Gupta1.1k

1) yes, this is the minimal length

2) ssl error: http://stackoverflow.com/questions/3777075/ssl-certificate-rejected-trying-to-access-github-over-https-behind-firewall

ADD REPLYlink written 4.4 years ago by Pierre Lindenbaum120k

Hi Pierre

I downloaded the tool and ran it. I used this as my dummy sam file

@HD    VN:1.0    SO:unsorted
@SQ    SN:RNU6-170P    LN:304
@SQ    SN:RNU6-1295P    LN:303
@PG    ID:bowtie2    PN:bowtie2    VN:2.0.0-beta7
SRR959756.117    16    RNU6-1295P    110    9    19S25M    *    0    0    TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG    VW[XY__c_f]^a^^a[dad[YYZ\V]]b]YZa\Y^[^^afaaa    AS:i:50    XS:i:42    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:25    YT:Z:UU
SRR959756.117    272    RNU6-170P    141    9    23S21M    *    0    0    TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG    VW[XY__c_f]^a^^a[dad[YYZ\V]]b]YZa\Y^[^^afaaa    AS:i:42    XS:i:42    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:21    YT:Z:UU

But I get only this output

@SRR959756.117:0:RNU6-1295P:110:16
TAGAGACGGGGTCTCGCTA
+
VW[XY__c_f]^a^^a[da

Why not the other read??

Also for some alignments, my quality string , 11th column is a *. Would that matter??

ADD REPLYlink written 4.4 years ago by Varun Gupta1.1k
0
gravatar for RamRS
4.4 years ago by
RamRS21k
Houston, TX
RamRS21k wrote:

awk should do it:

awk -F "\t" '$6 ~ /19S$/ { print ">"$1"_"$2"_"$3"_"$6"\n"substr($10,27,19); next } { print ">"$1"_"$2"_"$3"_"$6"\n"$10 }'

Assumptions:

a. All sequences are of equal length

b. All CIGAR strings have "19S" at the end when appropriate

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by RamRS21k

Hi Ram

This gives me this error

awk: $6 ~ /19S$/ { print ">"$1"_"$2"_"$3"_"$6\nsubstr($10,44,8)} { print ">"$1"_"$2"_"$3"_"$6\n$10 }
awk:                                         ^ backslash not last character on line

ADD REPLYlink written 4.4 years ago by Varun Gupta1.1k
1

You need to surround the \n with quotes:

awk -F "\t" '$6 ~ /19S$/ { print ">"$1"_"$2"_"$3"_"$6"\n"substr($10,44,8)} { print ">"$1"_"$2"_"$3"_"$6"\n"$10 }'
ADD REPLYlink written 4.4 years ago by Devon Ryan90k

Hmmm. I'll try and tweak it.

ADD REPLYlink written 4.4 years ago by RamRS21k

I made a wrong assumption. How do you pick the 19S part of a sequence? Can you demonstrate with the 3rd case as your example?

ADD REPLYlink written 4.4 years ago by RamRS21k

It's CCACGCCTGGCACCCAGCT, the last 19 bases. Aside from hard-clipped bases, soft-clipped bases will always be on the start/end of reads.

ADD REPLYlink written 4.4 years ago by Devon Ryan90k

Hi Devon I agree with you but doesn't flag info matter. Like read mapping to forward or the reverse strand? Like in the third case read maps to the reverse strand.

ADD REPLYlink written 4.4 years ago by Varun Gupta1.1k
1

The flag tells you the orientation of the original read relative to what you're seeing. The soft-clipped region of the original read is AGCTGGGTGCCAGGCGTGG, which is just the reverse complement of what I wrote above. Whether you care to reverse complement or not depends on what you want to do next.

ADD REPLYlink written 4.4 years ago by Devon Ryan90k

So basically i want to extract all these soft clip sequences and then map them to hg19 genome. So tools like bowtie will automatically align the read to forward or reverse strand, but AGCTGGGTGCCAGGCGTGG is basically the actual 5p part of the read SRR959756.1740 from the fastq file

@SRR959756.1740 HWI-EAS222_0038:7:1:14629:1494 length=50
AGCTGGGTGCCAGGCGTGGTGGCTCATGCCTGTAATCCCAGCTG
+SRR959756.1740 HWI-EAS222_0038:7:1:14629:1494 length=50
gggggggfgddfdefcafcZ^``\adbddfgdfdfdfbfafffg
ADD REPLYlink written 4.4 years ago by Varun Gupta1.1k

Right, the github repo has tools that are basically doing something very similar to what you're talking about (they can be used to find novel mobile element insertions using soft-clipped single-end reads). If your end goal is to find either large deletions or chromosomal breakpoints then tracking orientation might make life simpler (it'd be easier to locate the exact breakpoint). The original tool I wrote wasn't intended for those purposes, so I didn't bother. Regardless, you'd get the same alignment of the soft-clipped region either way, just the flag would be different.

ADD REPLYlink written 4.4 years ago by Devon Ryan90k

if that's the case, add another comparison on the strand field and reverse the string based on the strand.

ADD REPLYlink written 4.4 years ago by RamRS21k

Reverse complement, not reverse :)

ADD REPLYlink written 4.4 years ago by Devon Ryan90k

Ah, of course. I'm god with awk, bad with biology, it would seem :)

ADD REPLYlink written 4.4 years ago by RamRS21k

I suspect I've just made that exact mistake a few more times than you, give it time :)

ADD REPLYlink written 4.4 years ago by Devon Ryan90k

Thank you for the reassurance, Devon :)

ADD REPLYlink written 4.4 years ago by RamRS21k

I am interested in any soft clipped thing which is equal to 19 or more than that like 19S,20S,21S,22S etc. The assumption for 19S and more(20S,21S etc) is random. Also is your code giving read seq for 23S as shown in the example?

ADD REPLYlink written 4.4 years ago by Varun Gupta1.1k

If you're interested in all >=19S, don't compare to strings ending in 19S, compare to strings containing the regex "(19|[2-9][0-9])S"

ADD REPLYlink written 4.4 years ago by RamRS21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 646 users visited in the last hour