SAM CIGAR parsing and position counter in R
0
1
Entering edit mode
6.0 years ago

I have a SAM file like this:

SRR400619.13    0       NC_000913.3     3440402 40      1S25M10S        *       0       0       NTGCCGTTGGTTTCCATTTCGATGACNNNNNNNNNN    %-:9986986666:5#####################    HI:i:1  NH:i:1  NM:i:0
SRR400619.14    0       NC_000913.3     3442282 20      26M10S  *       0       0       CACATTTGGCAACTTCGTCACGCAGCNNNNNNNNNN    BBCBBB@@BCACC@BBB8BB################    HI:i:1  NH:i:1  NM:i:1

I hope to build up a counter of each position. For example, in the first row, I have start position 3440402 and CIGAR string 1S25M10S. Hence, the counter will increase by 1 in the positions of 3440403 to 3440427. Is there any R package or script can do it?

PS: I hope to extract the alignment length and increase the corresponding position in my counter array. For example, if the start position is 100, and CIGAR is 5M1S, then the count would be [100] 1 [101] 1 [102]1[103]1[104]1[105]0

Simon

SAM CIGAR parsing R • 2.5k views
ADD COMMENT
2
Entering edit mode

Just to be sure you want to extract from cigar the alignment length (so 25 in the first row). And add it to the start position. Could you rephrase your question please ?

ADD REPLY
0
Entering edit mode

Hi Nicolas, thanks for your advice and fast reply! I hope to extract the alignment length and increase the corresponding position in my counter array. For example, if the start position is 100, and CIGAR is 5M1S, then the count array would be [100] 1 [101] 1 [102]1[103]1[104]1[105]0

ADD REPLY
0
Entering edit mode

What about insertion, deletion and splice junction encoded within CIGAR ? If I understand well you want to report all position where one read aligned.

ADD REPLY
0
Entering edit mode

Yes, you are right. I want to report all position where one read aligned. I also need to consider all other situation in CIGAR string including insertion, deletion, splice, mismatch, etc. (like the operators list here http://bioinformatics.cvr.ac.uk/blog/tag/cigar-string/

ADD REPLY
0
Entering edit mode

What is the actual problem you are trying to solve? Are you attempting to get per-base read depth information? There's already R packages that will give you that.

ADD REPLY
0
Entering edit mode

I'm now building up a machine learning based tool. It takes the per-base read information as training dataset. The input is like the following array: [1] 1 [2] 1 [3] 0 .... etc. which means the first and second positions are mapped with one read, and there is no read mapped in the position 3.

Thanks!

ADD REPLY
0
Entering edit mode

Using something like Rsamtools::pileup will be much easier than having to parse read CIGAR strings yourself.

ADD REPLY

Login before adding your answer.

Traffic: 2507 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6