Edit a Multifasta File Through awk
4.0 years ago
pthom010 ▴ 40

I have a multifasta file that looks like this:

>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA

I also have a tab-delimited text file (txt.file) that looks like this:

TRINITY_DN10231_c0_g1_i1        UBQ5_TOBAC
TRINITY_DN10231_c0_g1_i2        UBQ5_TOBAC

The text file has abbreviated transcript names that I would like to use to rename the sequences in my fasta file. I would also like to remove the len= and path= sections from the new fasta headers. The header I would like to get is shown first, followed by the code I ran to rename the sequences:

>TRINITY_DN10231_c0_g1_i1_UBQ5_TOBAC

awk '
FNR==NR{
a[$1]=$1 $2
next
}
($2 in a) && /^>/{
print ">"a[$2]
next
}
1
' txt.file FS="[> ]" fasta.fa > newfasta.fasta

What I get, however, is this:

TRINITY_DN10231_c0_g1_i1UBQ5_TOBAC

I've tried tweaking the first block of the code, where the array is defined, but that removes all of the headers. I'm not sure where to go next. Any help would be appreciated.

fasta unix awk • 1.2k views

Maybe a[$1]=$2 only?


That only gives me this:

TRINITY_DN10231_c0_g

With seqkit and awk:

$ awk '{print $1}' file.fa
>TRINITY_DN10231_c0_g1_i1
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2
GGCGCGCGGAGAGAGA

$ awk '{print $1}' file.fa | seqkit replace --quiet -p "(.+)" -r '{kv}' -k file.txt

>UBQ5_TOBAC
ATATATATATAT
>UBQ5_TOBAC
GGCGCGCGGAGAGAGA

input:

$ cat file.fa
>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA
4.0 years ago
Zhilong Jia ★ 2.2k

cat 1.txt

TRINITY_DN10231_c0_g1_i1        UBQ5_TOBAC
TRINITY_DN10231_c0_g1_i2        UBQ5_TOBAC

cat 2.txt

>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA

awk 'FNR==NR{data[">" $1]=$2; next}{if ($1 ~ /^>/ && ($1 in data)) {print $1 "_" data[$1]} else {print} }' 1.txt 2.txt

>TRINITY_DN10231_c0_g1_i1_UBQ5_TOBAC
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2_UBQ5_TOBAC
GGCGCGCGGAGAGAGA
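
The two-file approach from the question appears to be missing only the separator: in awk, writing `$1 $2` concatenates the two fields with nothing in between. Below is a minimal sketch of that approach with an explicit "_" added. It assumes the file names and sample contents from the question (txt.file, fasta.fa) and a mapping file delimited by tabs or spaces; the printf lines only recreate those sample inputs so the example can be run on its own.

```shell
# Recreate the sample inputs from the question (file names assumed from the post).
printf 'TRINITY_DN10231_c0_g1_i1\tUBQ5_TOBAC\nTRINITY_DN10231_c0_g1_i2\tUBQ5_TOBAC\n' > txt.file
printf '>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]\nATATATATATAT\n>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]\nGGCGCGCGGAGAGAGA\n' > fasta.fa

awk '
FNR==NR {                 # first file: build the lookup ID -> "ID_name"
    a[$1] = $1 "_" $2     # explicit "_" between the two fields
    next
}
/^>/ && ($2 in a) {       # header line whose ID is present in the lookup
    print ">" a[$2]       # the len= and path= parts are not printed
    next
}
1                         # any other line (sequences) is printed unchanged
' txt.file FS="[> ]" fasta.fa > newfasta.fasta

cat newfasta.fasta
```

Because FS="[> ]" is given on the command line after txt.file, it only applies to fasta.fa; there the header splits into an empty first field and the bare ID in $2, which is why the lookup uses $2. Headers whose ID is not in the mapping file fall through to the final `1` and are printed unchanged.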

Should I cat both the fasta and the txt file? I'm a bit confused.


No, the cat commands just show the contents of the files, to clarify what 1.txt and 2.txt are.
