Replace ambigious nucleotide with correct nucleotides
1
0
Entering edit mode
4.5 years ago

I have multi fasta nucleotide file, I am interested to replace the ambigious nucleotides (R, Y, S, K, e.t.c) with a possible combination of nucleotides and save both options. For example, the following nucleotide sequence (sudo example):

>Seq1
ATGGKCRCCGCSGT

Contain ambiguous nucleotides (K, R, S), where K= G or T, R=A or G, and S=G or C

so the Seq1 variant will be like this:

>Seq1_a
ATGGGCACCGCGGT

And Seq1_b variant will be like this:

>Seq1_b
 ATGGTCGCCGCCGT

One option is to use sed using the following command:

sed 's/K/G/g; s/R/A/g; s/S/G/g' input.txt > output_1

to generate:

>Seq1_a
ATGGGCACCGCGGT

And again use sed command:

sed 's/K/T/g; s/R/G/g; s/S/C/g' input.txt > output_2

to generate:

>Seq1_b
ATGGTCGCCGCCGT

And then combine output from both files to generate output like this:

>Seq1_a
ATGGGCACCGCGGT
>Seq1_b
ATGGTCGCCGCCGT

There must be a more elegant way to do this, Any help will be highly appreciated.

next-gen RNA-Seq • 1.4k views
ADD COMMENT
1
Entering edit mode

Note that you have 3 ambiguous nucleotides, and each of them can be 2 ATCG nucleotides, so you'll have 23 = 8 possible sequences, not 2.

ADD REPLY
0
Entering edit mode
4.5 years ago
Ram 44k

I guess you were unable to find the regex expansion algorithm I was referring to earlier. The alternative is to write your own code. Follow this algorithm:

  1. For each sequence with ambiguous nucleotides, create a dictionary that stores the position of ambiguous nucleotides as well as the ambiguous nucleotide.
  2. For each entry in the dictionary, create another dictionary where you store all combinations of the sub-sequence from (previous position + 1) to the current position.
  3. Once you have a bunch of these subsequences, you should be able to use them as layers of nodes and navigate to get to all possible full sequences.
ADD COMMENT
0
Entering edit mode

Many thanks for this guidance, I have tried to work on regex but didn't get success, So I thought instead of finding the nucleotide sequence for ambiguous codon, I can start with CDS sequence that is coding for those codons and can sort out the ambiguous nucleotide. I understand the first point but wasn't able to understand completely the second point but I will definitely give it a try.

ADD REPLY
2
Entering edit mode

I'll explain using your example.

>Seq1
ATGGKCRCCGCSGT

The dictionary would be:

5 : K
7 : R
12 : S

From the above dictionary, you'd create a new set of dictionaries:

subseq1.1 : ATGGT
subseq1.2 : ATGGG

subseq2.1 : CA
subseq2.1 : CG

subseq3.1 : CCGCG
subseq3.2 : CCGCC

subseq4.1 : GT

Now you can use these as layers of nodes to go from one to the other layer using any path:

graph with nodes

ADD REPLY

Login before adding your answer.

Traffic: 2086 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6