Add nucleotides at the beginning of FASTA sequences
3.5 years ago
amitpande74 ▴ 20

Hi, I want to add 2 nucleotides at the beginning of each sequence line in a FASTA file.

> 
GCATAGGC

The desired output:

>
TAGCATAGGC

Can someone help?


What have you tried? This can be done with a sed command that matches the first character of each sequence line and replaces the line-beginning anchor with TA.


sed -i 's/^/TA/' file.fasta


That does not match the first character in each line. You'll end up adding TA to the header lines too, in front of the >, essentially corrupting the FASTA file.

Also, don't use -i until you're 100% sure the command is exactly what you want.
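
To make both points concrete, here is a minimal sketch of previewing the command without -i. The temporary sample file and the variable names are made up for illustration; only the sed expression comes from the reply above.

```shell
# Made-up sample file in a temporary directory, for illustration only.
tmp=$(mktemp -d)
printf '>seq1\nGCATAGGC\n' > "$tmp/file.fasta"

# Without -i, sed prints the result to stdout and leaves the file alone.
preview=$(sed 's/^/TA/' "$tmp/file.fasta")
printf '%s\n' "$preview"

# The header line no longer starts with ">", so the header count drops.
headers_before=$(grep -c '^>' "$tmp/file.fasta" || true)
headers_after=$(printf '%s\n' "$preview" | grep -c '^>' || true)
echo "headers before: $headers_before, after: $headers_after"

# The original file is still intact.
unchanged=$(cat "$tmp/file.fasta")
rm -rf "$tmp"
```

Comparing the number of header lines before and after is a cheap sanity check for any FASTA edit: if the counts differ, the command touched the headers.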


Yes, it does add a TA to the header. Then what exactly should the command be?


amitpande74, please accept all answers that solve your question.


A: Fasta file edition

Replace "ACTG" with "TA".

3.5 years ago

seqkit mutate can edit FASTA sequences (point mutation, insertion, deletion). Please use v0.14.0rc1 or a later version, which fixes a bug in insertion.

seqkit mutate -i supports inserting bases at any position. For example, given two sequences (one of them multi-line):

$ cat seqs.fa 
>seq1
GCATAGGC
>seq2
AAACCC
GGGTTT

1). At the beginning

$ cat seqs.fa | seqkit mutate -i 0:TA
>seq1
TAGCATAGGC
>seq2
TAAAACCCGGGTTT

2). At the end.

$ cat seqs.fa | seqkit mutate -i -1:TA
>seq1
GCATAGGCTA
>seq2
AAACCCGGGTTTTA

3). Behind the 5th base

$ cat seqs.fa | seqkit mutate -i 5:TA
>seq1
GCATATAGGC
>seq2
AAACCTACGGGTTT

nice solution, great to know, this most certainly simplifies the task

3.5 years ago
Fatima ▴ 1000

If each sequence is on one and only one line, and the sequences are in capital letters, the command below works. (It works for both nucleotide and amino acid sequences; you can replace [A-Z] with [ATGC] if you want to be more specific.)

sed '/^[A-Z]/s/^/TA/' file.fasta > output.fasta

If you also have multi-line sequences, then you can first use this command to convert it to one-liner sequences:

awk '/^>/ {printf("%s%s\n", (NR>1 ? "\n" : ""), $0); next} {printf("%s", $0)} END {printf("\n")}' input.fasta > file.fasta
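
If you would rather not linearize first, a single awk pass can prefix only the first sequence line of each record. This is just a sketch under a few assumptions: the sample file is made up (same records as seqs.fa in the seqkit answer), the variable names are mine, and the line wrapping is not redone, so the first line of each multi-line record ends up two bases longer than the rest.

```shell
# Made-up sample file (same records as seqs.fa in the seqkit answer).
tmp=$(mktemp -d)
printf '>seq1\nGCATAGGC\n>seq2\nAAACCC\nGGGTTT\n' > "$tmp/seqs.fa"

prefixed=$(awk -v prefix="TA" '
  /^>/        { print; first = 1; next }            # header: keep it, flag the next line
  first && NF { print prefix $0; first = 0; next }  # first non-empty sequence line
              { print }                             # remaining lines unchanged
' "$tmp/seqs.fa")
printf '%s\n' "$prefixed"
rm -rf "$tmp"
```

For the sample above this prints TAGCATAGGC under >seq1, and TAAAACCC followed by GGGTTT under >seq2.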
3.5 years ago

Close. Skip the headers (assuming each sequence is on a single line):

$ sed '/^>/! s/^/TA/' test.fa

Or, with GNU sed (the 0~2 address is a GNU extension), prefix every second line:

$ sed '0~2 s/^/TA/' test.fa

with Awk:

$ awk -v OFS="\n" '/^>/ {getline seq; print $0,"TA"seq}' test.fa
$ awk '{print ((NR%2)? "":"TA") $0}' test.fa
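
The every-second-line variants above only work if headers and sequences strictly alternate. A rough way to check that assumption first is sketched below; the function name is_two_line_fasta and the sample files are made up for illustration, and blank lines in even positions are not detected.

```shell
# Hypothetical helper: succeeds only if odd lines are headers and even lines are not.
is_two_line_fasta() {
  awk '
    (NR % 2 == 1) != ($0 ~ /^>/) { bad = 1; exit }  # line is in the wrong position
    END { exit (bad || NR % 2) ? 1 : 0 }            # also fail on a trailing header
  ' "$1"
}

# Made-up sample files: one strictly two-line, one with a wrapped sequence.
tmp=$(mktemp -d)
printf '>seq1\nGCATAGGC\n>seq2\nAAACCC\n' > "$tmp/ok.fa"
printf '>seq1\nGCATAGGC\n>seq2\nAAACCC\nGGGTTT\n' > "$tmp/multi.fa"

if is_two_line_fasta "$tmp/ok.fa"; then ok_status=pass; else ok_status=fail; fi
if is_two_line_fasta "$tmp/multi.fa"; then multi_status=pass; else multi_status=fail; fi
echo "ok.fa: $ok_status, multi.fa: $multi_status"
rm -rf "$tmp"
```

If the check fails, linearize the file first or use one of the multi-line-safe approaches in the other answers.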
3.5 years ago

When sequences may span multiple lines and the resulting FASTA should be well-formed (wrapped at a consistent line length), one needs to chain more commands.

My best bet makes use of both bioawk and seqkit (both are installable with bioconda):

cat foo.fa | bioawk -v prefix="TATA" -c fastx '{ printf(">%s\n%s%s\n", $name, prefix, $seq) }' | seqkit seq

prints

>foo
TATAATGGACTCTCGTCCTCAGAAAGTCTGGATGACGCCGAGTCTCACTGAATCTGACAT
GGATTACCACAAGATCTTGACAGCAGGTCTGTCCGTTCAACAGGGGGTTGTTCGGCAAAG
AGTCATCCCAGTGTATCAAGTAAACAATCTTGAGATCCCAGTGTATCAAGTAAACAATCT
TGAGATCCCAGTGTATCAAGTAAACAATCTTGAGATCCCAGTGTATCAAGTAAACAATCT
TGAGATCCCAGTGTATCAAGTAAACAATCTTGAG

Uses the trick shown in A: Fasta file edition
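
If neither bioawk nor seqkit is at hand, roughly the same thing can be sketched in plain awk: collect each record, prepend the prefix, and rewrap. The function name prefix_and_wrap and the sample file are made up for illustration. Unlike the bioawk version, this keeps the whole header line, and it assumes the file starts with a header.

```shell
# prefix_and_wrap PREFIX WIDTH FILE  (hypothetical helper, not part of any tool)
prefix_and_wrap() {
  awk -v prefix="$1" -v width="$2" '
    function flush_record(   i, n) {
      if (name == "") return              # nothing collected yet
      print name
      seq = prefix seq                    # prepend the prefix to the joined sequence
      n = length(seq)
      for (i = 1; i <= n; i += width) print substr(seq, i, width)
    }
    /^>/ { flush_record(); name = $0; seq = ""; next }  # new record starts
         { seq = seq $0 }                               # join wrapped sequence lines
    END  { flush_record() }                             # emit the last record
  ' "$3"
}

# Made-up sample file; a width of 4 is used only to make the wrapping visible.
tmp=$(mktemp -d)
printf '>seq1\nGCATAGGC\n>seq2\nAAACCC\nGGGTTT\n' > "$tmp/seqs.fa"
wrapped=$(prefix_and_wrap TA 4 "$tmp/seqs.fa")
printf '%s\n' "$wrapped"
rm -rf "$tmp"
```

With a width of 60, for example prefix_and_wrap TATA 60 foo.fa, the wrapping should match the layout shown above.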
