Question: Replace ambigious nucleotide with correct nucleotides
0
gravatar for waqaskhokhar999
5 months ago by
waqaskhokhar99980 wrote:

I have multi fasta nucleotide file, I am interested to replace the ambigious nucleotides (R, Y, S, K, e.t.c) with a possible combination of nucleotides and save both options. For example, the following nucleotide sequence (sudo example):

>Seq1
ATGGKCRCCGCSGT

Contain ambiguous nucleotides (K, R, S), where K= G or T, R=A or G, and S=G or C

so the Seq1 variant will be like this:

>Seq1_a
ATGGGCACCGCGGT

And Seq1_b variant will be like this:

>Seq1_b
 ATGGTCGCCGCCGT

One option is to use sed using the following command:

sed 's/K/G/g; s/R/A/g; s/S/G/g' input.txt > output_1

to generate:

>Seq1_a
ATGGGCACCGCGGT

And again use sed command:

sed 's/K/T/g; s/R/G/g; s/S/C/g' input.txt > output_2

to generate:

>Seq1_b
ATGGTCGCCGCCGT

And then combine output from both files to generate output like this:

>Seq1_a
ATGGGCACCGCGGT
>Seq1_b
ATGGTCGCCGCCGT

There must be a more elegant way to do this, Any help will be highly appreciated.

rna-seq next-gen • 171 views
ADD COMMENTlink modified 5 months ago by RamRS30k • written 5 months ago by waqaskhokhar99980
1

Note that you have 3 ambiguous nucleotides, and each of them can be 2 ATCG nucleotides, so you'll have 23 = 8 possible sequences, not 2.

ADD REPLYlink modified 5 months ago • written 5 months ago by RamRS30k
0
gravatar for RamRS
5 months ago by
RamRS30k
Baylor College of Medicine, Houston, TX
RamRS30k wrote:

I guess you were unable to find the regex expansion algorithm I was referring to earlier. The alternative is to write your own code. Follow this algorithm:

  1. For each sequence with ambiguous nucleotides, create a dictionary that stores the position of ambiguous nucleotides as well as the ambiguous nucleotide.
  2. For each entry in the dictionary, create another dictionary where you store all combinations of the sub-sequence from (previous position + 1) to the current position.
  3. Once you have a bunch of these subsequences, you should be able to use them as layers of nodes and navigate to get to all possible full sequences.
ADD COMMENTlink written 5 months ago by RamRS30k

Many thanks for this guidance, I have tried to work on regex but didn't get success, So I thought instead of finding the nucleotide sequence for ambiguous codon, I can start with CDS sequence that is coding for those codons and can sort out the ambiguous nucleotide. I understand the first point but wasn't able to understand completely the second point but I will definitely give it a try.

ADD REPLYlink written 5 months ago by waqaskhokhar99980
2

I'll explain using your example.

>Seq1
ATGGKCRCCGCSGT

The dictionary would be:

5 : K
7 : R
12 : S

From the above dictionary, you'd create a new set of dictionaries:

subseq1.1 : ATGGT
subseq1.2 : ATGGG

subseq2.1 : CA
subseq2.1 : CG

subseq3.1 : CCGCG
subseq3.2 : CCGCC

subseq4.1 : GT

Now you can use these as layers of nodes to go from one to the other layer using any path:

graph with nodes

ADD REPLYlink modified 5 months ago • written 5 months ago by RamRS30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 728 users visited in the last hour