Question

Rename fasta headers using a variable already present in the header?

0

Entering edit mode

4.6 years ago

FreshBio • 0

Hello Community!

I know several similar questions have been asked but they all seem to want to rename their fasta headers entirely using a new variable name, or they have a separate text file with names that they would like to use to replace their sequence headers.

In my case, I just want to rename the fasta headers in my very large fasta file using a chunk from the header already present.

Here is an example of the header I have:

>lcl|NC_019859.2_cds_XP_023809260.1_3 [gene=gemin6] [db_xref=GeneID:110014398] [protein=gem-associated protein 6] [protein_id=XP_0238
ACTCCGGAGTG....

My desired output:

>gemin6
ACTCCGGAGTG....

I want to just remove everything except the gene name that shows up after "gene="

I tried using sed:

grep ">" LARGE_MSA.fna | perl -pe "s/>\w+\.\d+\_(.+)/\1/"|perl -pe "s/>[A-Za-z0-9]+_(.+)/\1/"|perl -pe "s/[0-9]+_(.+)/\1/" | perl -pe "s/>[gene=.*]/\1/"

However, this is just returning everything, the entire header and I cannot figure out how to properly chunk out the gene name and how to properly label the square brackets or the equal sign within my line of code as I think this is why it is failing. Still a beginner with sed, grep, and awk! Thank you for the help!

fasta • 1.8k views

ADD COMMENT • link 4.6 years ago by FreshBio • 0

0

Entering edit mode

Try this:

grep '>' LARGE_MSA.fna | sed 's/>.*.gene=/>/' | sed 's/].*//'

ADD REPLY • link 4.6 years ago by patelk26 ▴ 290

0

Entering edit mode

That would give you new headers for the FASTA. You could use sed to replace lines in-place, matching only those beginning with > and replacing them with a capture group that captures \[gene=([^\]])\].

ADD REPLY • link 4.6 years ago by Ram 43k

0

Entering edit mode

Oh that worked! So I want to understand, what does the />/ after the equal sign mean in the line of code? I understand that sed 's/>.*' looks for the string starting with ">" and it can be any single character with any amount, but then I don't understand the period before gene and the "/>/" bit.

Also, once I obtain these names can I use this line fo code to replace the names (I kept this from another biostar question) awk '/^>/{print ">" ++i; next}{print}' < file.fasta.

I am afraid of testing it out though and screwing my up gigantic file lol!

ADD REPLY • link 4.6 years ago by FreshBio • 0

0

Entering edit mode

I would probably try:

grep '>' LARGE_MSA.fna | sed 's/>.*.gene=/>/' | sed 's/].*//' |  awk '/^>/{print ">" ++i; next}{print}' < LARGE_MSA.fna

ADD REPLY • link 4.6 years ago by FreshBio • 0

0

Entering edit mode

what does the />/ after the equal sign mean in the line of code

That's sed syntax. sed s tells sed it is performing a substitute (replace) operation. The character after s is taken as the delimiter. sed s/ -> substitute the expression between the two next / characters with the expression between the next two /

sed s/FIRST/SECOND/' would replaceFIRSTwithSECOND. Any delimiter can be used, sosed s#FIRST#SECOND#` is also valid.

Here, sed s/>.*.gene=/>/ replaces >.*.gene= with >, essentially removing everything from the first character up until gene= in a header. The next sed removed everything starting from the first ] character, giving you the header you want.

Instead of using awk to replace headers, go for BBUtils.

ADD REPLY • link 4.6 years ago by Ram 43k

0

Entering edit mode

I have tried using sed

No, that's perl, not sed :-)

ADD REPLY • link 4.6 years ago by Ram 43k

0

Entering edit mode

Oh haha, yes I used it from another biostars post and tried to modify it. I really need to learn this properly! So what does the perl -pe line do if it is not perl being used?

ADD REPLY • link 4.6 years ago by FreshBio • 0

0

Entering edit mode

It is perl being used. Your post said sed while it used perl.

ADD REPLY • link 4.6 years ago by Ram 43k

0

Entering edit mode

The sed command is not replacing anything in the file. It just prints the correct gene name out but when I check the Large_MSA.fna again, the names are the same. Did I miss something?

ADD REPLY • link 4.6 years ago by FreshBio • 0

0

Entering edit mode

You can use sed to generate the new content. Substituting existing content is the next step in the process. From your comment, I thought you understood that this is a two step process:

Also, once I obtain these names can I use this line fo code to replace the names (I kept this from another biostar question) awk '/^>/{print ">" ++i; next}{print}' < file.fasta.

ADD REPLY • link 4.6 years ago by Ram 43k

0

Entering edit mode

Oh yea I did but with my awk command which I thought would replace it, it's not at all. It prints it out fine but I thought if I fed the same file back into my awk command, instead of printing it as stdout it would just replace the names int he file but it didn't do that. I am trying to figure out why that is.