Question: Edit a Multifasta File Through awk
pthom010 wrote, 6 months ago:

I have a multi-FASTA file that looks like this:

>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA

I also have a tab-delimited text file (txt.file) that looks like this:

TRINITY_DN10231_c0_g1_i1        UBQ5_TOBAC
TRINITY_DN10231_c0_g1_i2        UBQ5_TOBAC

The text file contains abbreviated transcript names that I would like to use to rename the sequences in my FASTA file, while dropping the len= and path= sections from the headers. The header format I would like to get is shown first below, followed by the code I ran:

>TRINITY_DN10231_c0_g1_i1_UBQ5_TOBAC

awk '
FNR==NR{
a[$1]=$1 $2
next
}
($2 in a) && /^>/{
print ">"a[$2]
next
}
1
' txt.file FS="[> ]" fasta.fa > newfasta.fasta

What I get, however, is this:

TRINITY_DN10231_c0_g1_i1UBQ5_TOBAC

I've tried tweaking the code in the initial argument defining the array but that removes all of the headers. Not sure where to go next. Any help would be appreciated.


Maybe a[$1]=$2 only?

written 6 months ago by Asaf

That only gives me this:

TRINITY_DN10231_c0_g
written 6 months ago by pthom010
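
The script in the question stores a[$1]=$1 $2, which joins the ID and the name with nothing in between, so the missing underscore most likely comes from that line. A minimal sketch of one possible fix follows, assuming the mapping file is whitespace- or tab-delimited with the ID in column 1 and the name in column 2. The rename_demo directory is a hypothetical scratch location used only to recreate the example inputs; the awk program itself is the one from the question with an explicit "_" added.

```shell
#!/bin/sh
# Sketch only: recreate the two example inputs in a scratch directory
# (rename_demo is a hypothetical name, not part of the original question).
mkdir -p rename_demo
printf 'TRINITY_DN10231_c0_g1_i1\tUBQ5_TOBAC\nTRINITY_DN10231_c0_g1_i2\tUBQ5_TOBAC\n' > rename_demo/txt.file
cat > rename_demo/fasta.fa <<'EOF'
>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA
EOF

awk '
FNR==NR {                 # first file: the ID/name table (default whitespace FS)
    a[$1] = $1 "_" $2     # explicit "_" between the ID and the name
    next
}
/^>/ && ($2 in a) {       # FASTA header whose ID has an entry in the table
    print ">" a[$2]       # with FS="[> ]", $1 is empty and $2 is the ID
    next
}
1                         # all other lines are printed unchanged
' rename_demo/txt.file FS='[> ]' rename_demo/fasta.fa > rename_demo/newfasta.fasta

cat rename_demo/newfasta.fasta
```

Headers whose ID is not found in the table fall through to the final rule and are printed unchanged, including their len= and path= parts, so it may be worth checking the output for any such leftovers.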

With seqkit and awk:

$ awk '{print $1}' file.fa
>TRINITY_DN10231_c0_g1_i1
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2
GGCGCGCGGAGAGAGA

$ awk '{print $1}' file.fa | seqkit replace  --quiet  -p "(.+)" -r '{kv}' -k file.txt

>UBQ5_TOBAC
ATATATATATAT
>UBQ5_TOBAC
GGCGCGCGGAGAGAGA

input:

$ cat file.fa
>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA
written 6 months ago by cpad0112
Zhilong Jia (London) wrote, 6 months ago:

cat 1.txt

TRINITY_DN10231_c0_g1_i1        UBQ5_TOBAC
TRINITY_DN10231_c0_g1_i2        UBQ5_TOBAC

cat 2.txt

>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA

awk 'FNR==NR{data[$1]=$2; next} /^>/{id=substr($1,2); if (id in data) print ">" id "_" data[id]; else print $1; next} {print}' 1.txt 2.txt

>TRINITY_DN10231_c0_g1_i1_UBQ5_TOBAC
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2_UBQ5_TOBAC
GGCGCGCGGAGAGAGA

Should I cat both the fasta and the txt file? I'm a bit confused.

written 6 months ago by pthom010

No, the cat commands are only there to show the contents of 1.txt and 2.txt.

written 6 months ago by Zhilong Jia