Question: SAM CIGAR parsing and position counter in R
1
gravatar for niu.shengyong
19 months ago by
niu.shengyong50 wrote:

I have a SAM file like this:

SRR400619.13    0       NC_000913.3     3440402 40      1S25M10S        *       0       0       NTGCCGTTGGTTTCCATTTCGATGACNNNNNNNNNN    %-:9986986666:5#####################    HI:i:1  NH:i:1  NM:i:0
SRR400619.14    0       NC_000913.3     3442282 20      26M10S  *       0       0       CACATTTGGCAACTTCGTCACGCAGCNNNNNNNNNN    BBCBBB@@BCACC@BBB8BB################    HI:i:1  NH:i:1  NM:i:1

I hope to build up a counter of each position. For example, in the first row, I have start position 3440402 and CIGAR string 1S25M10S. Hence, the counter will increase by 1 in the positions of 3440403 to 3440427. Is there any R package or script can do it?

PS: I hope to extract the alignment length and increase the corresponding position in my counter array. For example, if the start position is 100, and CIGAR is 5M1S, then the count would be [100] 1 [101] 1 [102]1[103]1[104]1[105]0

Simon

R sam parsing cigar • 763 views
ADD COMMENTlink modified 19 months ago • written 19 months ago by niu.shengyong50
2

Just to be sure you want to extract from cigar the alignment length (so 25 in the first row). And add it to the start position. Could you rephrase your question please ?

ADD REPLYlink written 19 months ago by Nicolas Rosewick8.5k

Hi Nicolas, thanks for your advice and fast reply! I hope to extract the alignment length and increase the corresponding position in my counter array. For example, if the start position is 100, and CIGAR is 5M1S, then the count array would be [100] 1 [101] 1 [102]1[103]1[104]1[105]0

ADD REPLYlink modified 19 months ago • written 19 months ago by niu.shengyong50

What about insertion, deletion and splice junction encoded within CIGAR ? If I understand well you want to report all position where one read aligned.

ADD REPLYlink written 19 months ago by Nicolas Rosewick8.5k

Yes, you are right. I want to report all position where one read aligned. I also need to consider all other situation in CIGAR string including insertion, deletion, splice, mismatch, etc. (like the operators list here http://bioinformatics.cvr.ac.uk/blog/tag/cigar-string/

ADD REPLYlink written 19 months ago by niu.shengyong50

What is the actual problem you are trying to solve? Are you attempting to get per-base read depth information? There's already R packages that will give you that.

ADD REPLYlink written 19 months ago by d-cameron2.1k

I'm now building up a machine learning based tool. It takes the per-base read information as training dataset. The input is like the following array: [1] 1 [2] 1 [3] 0 .... etc. which means the first and second positions are mapped with one read, and there is no read mapped in the position 3.

Thanks!

ADD REPLYlink modified 19 months ago • written 19 months ago by niu.shengyong50

Using something like Rsamtools::pileup will be much easier than having to parse read CIGAR strings yourself.

ADD REPLYlink written 19 months ago by d-cameron2.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 898 users visited in the last hour