I have a sam file something like this
SRR959756.117 272 RNU6-170P 141 9 23S21M * 0 0 TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG SRR959756.117 272 RNU6-171P 139 9 24S20M * 0 0 TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG SRR959756.1740 272 RNU2-65P 349 14 2S23M19S * 0 0 CAGCTGGGATTACAGGCATGAGCCACCACGCCTGGCACCCAGCT SRR959756.117 16 RNU6-1295P 110 9 19S25M * 0 0 TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG SRR959756.212 256 RNU2-41P 38 1 25M19S * 0 0 ATCTGTTCTTATCAGTTTAATATCTGATACGTCCTCTATCCGAG
(I am only showing the first 10 columns of it). I want to extract the soft clipped sequence and make a fasta file with header as >column1_column2_column3_column6. How can I do this. I am a little confused with the strand/flag column and hence having difficulty interpreting how to extract the soft clipped seq. Also where the CIGAR STRING is 2S23M19S (as in 3rd case), I just want to extract the 19S part of the seq. How can I do this.
Hope to hear soon
Your script works fine. I made a dummy bam file to test
and ran the tool as
and my fastq looks like
I just want read name and flag info in the header. Is it possible for you to modify it?
Is it this line in your c file where I have to modify it
Yup, you would want something like:
That won't add the CIGAR, which actually requires a loop since it's not stored as a string (I can show you how to add this if you really want it). You can also comment out (Just add
//to the beginning) lines 26-29 so you get a fasta file instead (I already changed the
>above so the headers will be correct).
If I also wanted to include the position pos which you already had in your original code , the new code should look like this
Also It would be nice if you can show me how to add the CIGAR string in the header itself. Also may I ask , is this all coming from pysam and C together?
That looks correct for adding position. I'm glad you caught the tid mistake!
To insert a CIGAR string (we'll add it to the end), you'd do something like:
Note lines 23-24, which are the main changes. I've not tested this, but something along those lines should work.
BTW, this is all pure C. There are other programs in that repository written in python, but this isn't one of them.
Can you also add a feature for which end does the softclip reads comes from (5 Prime or 3 Prime) ? This one is very important for my downstream analysis and I've no idea about C...
I am interested in
extractSoftclippedC script, which unfortunately I can't run. I downloaded the
Makefilefiles to my project directory. The next step I am trying to compile the C code with
makefunction. So I run the following:
And I get:
If I compile with:
I get following error:
Could you please help me with this issue, as I have never done this type of work before and just don't know how ELSE to tackle the problem. Probably I am running the script from the wrong directory,
I would appreciate your help.
you need to install the C htslib library
What Pierre said, though you can just
git submodule update --initwhile in the directory with that source code. I tend to put htslib in as a submodule on everything that uses it.
Thank you Pierre Lindenbaum and Devon Ryan for prompt respond,
we have all libraries at the right location, but for some reasons I get the same error message. Unfortunately, I have never been dealing with such stuff so it's very difficult for me to debug the problem. However, I assume that it can't find a path for unknown to me reasons. Is it possible to get pre-compiled version of that particular script, because I am interest in extracting soft clipped reads.
Or if its not difficult for you could you navigate me in debugging
Thanks once again,
If you just type the following it should work:
Hi, IrK, sorry for bothering you. Have you solved this problem? It will be very appreciated that you can give some advice.
No idea about C. I will still look at the code to make out some thing from it..
You might wanna check out Devon's solution. His perspective is more relevant than mine - I see the sequence as a string and do not make any assumptions on the features, which is not a good approach at times.
I tried out the script. And it seems to work. Every once in a while I get a read that is peculiar. Probably, my expectation is off. This is one of the peculiar softclipped parts:
I looked for a read, mapping on chromosome T at position 2469 in the file that I fed into the extractSoftclipped.c, yet there is none like it. This is my expectation. But maybe this is only true for reads that are softclipped at front? These are the reads, that fall into the the region of position T : 2468:
I am sorry to bother you with this.