Question: Replace Ambiguous codons with Amino acid sequences
6 months ago by
waqaskhokhar999 wrote:

I have two multi-FASTA files. The first contains nucleotide sequences; an excerpt, the record for ORF1, is:

>ORF1_BnaA03g18710D_S45:82:1509
ATGGCCGCCGCAGTTTCCACCGTCGGTGCCATCAACAGAGCTCCGTTGAGCTTGAACGGG
TCAGGAGCAGGAGCTGCTTCAGTCCCAGCTACGACCTTCTTGGGAAAGAAAGTTGTAACC
GCGTCGAGATTCACACAGAGCAACAACAAGAAGAGCAACGGATCATTCAAAGTGGTCGCT
GTCAAAGAAGACAAACAAACCGATGGAGACAGATGGAGGGGACTTGCCTACGACACGTCT
GATGATCAACAAGACATCACCAGAGGCAAAGGTATGGTTGACTCTGTCTTCCAAGCTCCC
ATGGGAACCGGAACTCACAATGCCGTTCTTAGCTCCTATGAGTACATTAGCCAAGGTCTT
AAGCAGTACAACTTGGACAACATGATGGATGGGCTTTACATTGCTCCTGCATTCATGGAC
AAGCTTGTTGTTCACATCACCAAGAACTTCTTGACTTTACCTAACATCAAGGTTCCACTT
ATTTTGGGTATTTGGGGAGGCAAAGGTCAAGGTAAATCCTTCCAGTGTGAGCTTGTCATG
GCCAAGATGGGCATTAACCCAATCATGATGAGTGCTGGAGAGCTTGAGAGTGGAAACGCA
GGAGAACCAGCCAAGCTGATCCGTCAAAGGTACCGTGAAGCAGCAGACATGATCAAAAAG
GGAAAAATGTGTTGTCTATTCATCAACGATCTCGACGCTGGTGCTGGTCGTATGGGTGGT
ACTACYYAGTACACAGTCAACAACCWGATGGTTAACGCAACCYTCATGAACMTTGCTGAT
AACCCAACCAACGTCCAGCTCCCGGGAATGTACAACAAGGAAGAAAACGCACGTGTCCCC
ATCATCGTCACCGGTAACGATTTCTCCACTCTCTACGCACCTCTCATCCGTGACGGGCGT
ATGGAGAARTTCTACTGGGCACCCACACGTGAGGACCKTATTGGTGTCTGCAAGGGTATC
TTCAGGACTGATAACGTTAAGGATGAAGACATTGTCACGCTTGTTGACCAGTTCCCTGGA
CAATCTATCGATTTCTTTGGTGCATTGAGGGCGAGAGTGTACGATGATGAAGTGAGGAAG
TTCGTTGAGGGACTTGGAGTKGAGAAGATAGGAAAGAGGCTGGTGAACTCTAGGGAAGGT
CCTCCAGTGTTCGAGCAACCAGCGATGACTCTTGAGAAGCTTATGGAGTACGGAAACATG
CTTGTGATGGAGCAAGAGAACGTCAAGAGAGTCCAACTTGCTGACCAATACCTTAACGAG
GCTGCCTTGGGAGACGCAAACGCGGACGCCATTGGCCGCGGAACTTTCTATGGGAAAGCA
GCACAGCAAGTGAACCTTCCTGTTCCAGAAGGGTGTACTGATCCTCAAGCCGACAACTTT
GATCCAACAGCTAGAAGTGATGATGGAACTTGTGTCTACAACTTTTGA

The second file contains the corresponding amino acid sequences; the matching record is:

>ORF1_BnaA03g18710D_S45:82:1509
MAAAVSTVGAINRAPLSLNGSGAGAASVPATTFLGKKVVTASRFTQSNNKKSNGSFKVVA
VKEDKQTDGDRWRGLAYDTSDDQQDITRGKGMVDSVFQAPMGTGTHNAVLSSYEYISQGL
KQYNLDNMMDGLYIAPAFMDKLVVHITKNFLTLPNIKVPLILGIWGGKGQGKSFQCELVM
AKMGINPIMMSAGELESGNAGEPAKLIRQRYREAADMIKKGKMCCLFINDLDAGAGRMGG
TTXYTVNNXMVNATXMNJADNPTNVQLPGMYNKEENARVPIIVTGNDFSTLYAPLIRDGR
MEKFYWAPTREDXIGVCKGIFRTDNVKDEDIVTLVDQFPGQSIDFFGALRARVYDDEVRK
FVEGLGVEKIGKRLVNSREGPPVFEQPAMTLEKLMEYGNMLVMEQENVKRVQLADQYLNE
AALGDANADAIGRGTFYGKAAQQVNLPVPEGCTDPQADNFDPTARSDDGTCVYNF

The problem is that some positions in the amino acid sequence hold an X instead of an amino acid (the sequence above contains four X). I would like to look up the nucleotide codon behind each of these ambiguous positions, enumerate the possible combinations for the ambiguous nucleotide, and replace the X with the correct amino acid(s) using a codon table.

For example, the first X is at position 243 of the amino acid sequence (TTXYTV). The codon behind this X is YAG, where Y stands for C or T, so the possible codons are CAG and TAG: CAG codes for Gln (Q), and TAG is a stop codon.
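A minimal Python sketch of the lookup described above, assuming the standard genetic code (NCBI translation table 1) and the IUPAC nucleotide ambiguity codes. The names `expand_codon` and `possible_amino_acids` are made up for illustration and are not from any library.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code (assumed), codons ordered TCAG x TCAG x TCAG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO)
}


def expand_codon(codon):
    """Return every unambiguous codon an IUPAC-coded codon can stand for."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in codon.upper()))]


def possible_amino_acids(codon):
    """Map each concrete codon to its amino acid ('*' marks a stop codon)."""
    return {c: CODON_TABLE[c] for c in expand_codon(codon)}


print(possible_amino_acids("YAG"))  # {'CAG': 'Q', 'TAG': '*'}
```

An ambiguity code does not always change the residue: in this sketch `possible_amino_acids("AAR")` gives K for both AAA and AAG.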

The output could be saved in Excel format, something like the following image, or in any other suitable format.

image

Any help will be highly appreciated.

Tags: rna-seq, sequence

Let's say the possible combinations allow for 2 different amino acids. How would you pick the "right" one?

written 6 months ago by RamRS

I would like to keep both combinations. In sequence a, YAG becomes CAG and codes for Q, while in sequence b, YAG becomes TAG, which is a stop codon (*). The outputs would look like this:

>ORF1_BnaA03g18710D_S45:82:1509_a
MAAAVSTVGAINRAPLSLNGSGAGAASVPATTFLGKKVVTASRFTQSNNKKSNGSFKVVA
VKEDKQTDGDRWRGLAYDTSDDQQDITRGKGMVDSVFQAPMGTGTHNAVLSSYEYISQGL
KQYNLDNMMDGLYIAPAFMDKLVVHITKNFLTLPNIKVPLILGIWGGKGQGKSFQCELVM
AKMGINPIMMSAGELESGNAGEPAKLIRQRYREAADMIKKGKMCCLFINDLDAGAGRMGG
TTQYTVNNXMVNATXMNJADNPTNVQLPGMYNKEENARVPIIVTGNDFSTLYAPLIRDGR
MEKFYWAPTREDXIGVCKGIFRTDNVKDEDIVTLVDQFPGQSIDFFGALRARVYDDEVRK
FVEGLGVEKIGKRLVNSREGPPVFEQPAMTLEKLMEYGNMLVMEQENVKRVQLADQYLNE
AALGDANADAIGRGTFYGKAAQQVNLPVPEGCTDPQADNFDPTARSDDGTCVYNF

>ORF1_BnaA03g18710D_S45:82:1509_b
MAAAVSTVGAINRAPLSLNGSGAGAASVPATTFLGKKVVTASRFTQSNNKKSNGSFKVVA
VKEDKQTDGDRWRGLAYDTSDDQQDITRGKGMVDSVFQAPMGTGTHNAVLSSYEYISQGL
KQYNLDNMMDGLYIAPAFMDKLVVHITKNFLTLPNIKVPLILGIWGGKGQGKSFQCELVM
AKMGINPIMMSAGELESGNAGEPAKLIRQRYREAADMIKKGKMCCLFINDLDAGAGRMGG
TT*YTVNNXMVNATXMNJADNPTNVQLPGMYNKEENARVPIIVTGNDFSTLYAPLIRDGR
MEKFYWAPTREDXIGVCKGIFRTDNVKDEDIVTLVDQFPGQSIDFFGALRARVYDDEVRK
FVEGLGVEKIGKRLVNSREGPPVFEQPAMTLEKLMEYGNMLVMEQENVKRVQLADQYLNE
AALGDANADAIGRGTFYGKAAQQVNLPVPEGCTDPQADNFDPTARSDDGTCVYNF

I think this would be a better alternative than the Excel format.

written 6 months ago by waqaskhokhar999 (modified 6 months ago by RamRS)

You should look for a way to expand regexes - some piece of code that can take [bcr]at as input and give you bat cat rat. Once you have that, you should linearize your nucleotide FASTA and replace all ambiguous codes with their corresponding regexes (Y would become [CT] for example). Then, apply the regex expansion algorithm to the sequence field of your linearized fasta and you'd get multiple sequences per header. Use a custom awk to write each resultant sequence with a suffixed header and you'll be all set.

The key is the regex expansion algorithm though, that's what you'll need to find here.
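One way to sketch that expansion step in Python: expanding a pattern made only of single-character classes such as `[CT]` amounts to a Cartesian product over the alternatives at each position, so `itertools.product` can stand in for a general regex expander. This is an illustration under that assumption, not a ready-made tool. The helper names (`expand_sequence`, `suffix`, `expanded_fasta`) and the `_a`, `_b` header scheme are hypothetical, modelled on the output shown earlier in the thread.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_sequence(seq):
    """Yield every unambiguous sequence matching an IUPAC-coded sequence."""
    for combo in product(*(IUPAC[b] for b in seq.upper())):
        yield "".join(combo)


def suffix(i):
    """Hypothetical header suffix: 0 -> 'a', 1 -> 'b', ..., 26 -> 'aa'."""
    s = ""
    i += 1
    while i > 0:
        i, r = divmod(i - 1, 26)
        s = chr(ord("a") + r) + s
    return s


def expanded_fasta(header, seq, width=60):
    """Return FASTA text with one suffixed record per expanded sequence."""
    lines = []
    for i, variant in enumerate(expand_sequence(seq)):
        lines.append(f">{header}_{suffix(i)}")
        lines.extend(variant[j:j + width] for j in range(0, len(variant), width))
    return "\n".join(lines)


print(expanded_fasta("demo", "YAG"))
# >demo_a
# CAG
# >demo_b
# TAG
```

The number of records multiplies with every ambiguous base (n two-fold codes give 2^n sequences), so for long ORFs it may be more practical to expand one codon at a time.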

written 6 months ago by RamRS

6 months ago by
Asaf (Israel) wrote:

You can use sed to write each codon (three nucleotides) on its own line. For instance, if the file name is /tmp/r1.fa:

tail -n +2 /tmp/r1.fa |tr -d "\n" | sed 's/\(...\)/\1\n/g' |cat -n

You can do the same for the AA file; just remove two of the three dots from the sed expression so that each residue gets its own line.
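The same side-by-side listing can be sketched in Python, assuming both sequences have already been read into plain strings (FASTA parsing is left out). The function names are hypothetical. The demo input is the first six codons of line 13 of the nucleotide record above and the matching residues TTXYTV.

```python
def codon_residue_pairs(nt_seq, aa_seq):
    """Pair each codon with its residue; positions are 1-based."""
    codons = [nt_seq[i:i + 3] for i in range(0, len(nt_seq) - 2, 3)]
    return [(i, c, r) for i, (c, r) in enumerate(zip(codons, aa_seq), start=1)]


def ambiguous_positions(nt_seq, aa_seq):
    """Keep only the pairs whose codon contains a non-ACGT character."""
    return [
        (i, codon, residue)
        for i, codon, residue in codon_residue_pairs(nt_seq, aa_seq)
        if set(codon.upper()) - set("ACGT")
    ]


nt = "ACTACYYAGTACACAGTC"  # codons ACT ACY YAG TAC ACA GTC
aa = "TTXYTV"
print(ambiguous_positions(nt, aa))  # [(2, 'ACY', 'T'), (3, 'YAG', 'X')]
```

Positions are relative to the fragment passed in; on the full record these two codons sit at positions 242 and 243.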


I don't think this answers OP's question at all.

written 6 months ago by RamRS

Yeah, I don't do Excel. This is the closest he'll get.

written 6 months ago by Asaf

It's not about Excel, I think. We should help OP with the ambiguous nucleotides, not help them with their Excel requirements. Also, in their reply to my comment, OP has clarified that they want a multi-fasta, not a spreadsheet.

written 6 months ago by RamRS

I read it as finding a way to match nucleotides to their corresponding amino acids, and I think my solution gives him a way to do so. It will require some manual work, but unless he wants to write some code, I don't know of a tool that can do that.

written 6 months ago by Asaf