Rename fasta headers using a variable already present in the header?
0
0
Entering edit mode
5.0 years ago
FreshBio • 0

Hello Community!

I know several similar questions have been asked but they all seem to want to rename their fasta headers entirely using a new variable name, or they have a separate text file with names that they would like to use to replace their sequence headers.

In my case, I just want to rename the fasta headers in my very large fasta file using a chunk from the header already present.

Here is an example of the header I have:

>lcl|NC_019859.2_cds_XP_023809260.1_3 [gene=gemin6] [db_xref=GeneID:110014398] [protein=gem-associated protein 6] [protein_id=XP_0238
ACTCCGGAGTG....

My desired output:

>gemin6
ACTCCGGAGTG....

I want to just remove everything except the gene name that shows up after "gene="

I tried using sed:

grep ">" LARGE_MSA.fna | perl -pe "s/>\w+\.\d+\_(.+)/\1/"|perl -pe "s/>[A-Za-z0-9]+_(.+)/\1/"|perl -pe "s/[0-9]+_(.+)/\1/" | perl -pe "s/>[gene=.*]/\1/"

However, this is just returning everything, the entire header and I cannot figure out how to properly chunk out the gene name and how to properly label the square brackets or the equal sign within my line of code as I think this is why it is failing. Still a beginner with sed, grep, and awk! Thank you for the help!

fasta • 2.0k views
ADD COMMENT
0
Entering edit mode

Try this:

grep '>' LARGE_MSA.fna | sed 's/>.*.gene=/>/' | sed 's/].*//'
ADD REPLY
0
Entering edit mode

That would give you new headers for the FASTA. You could use sed to replace lines in-place, matching only those beginning with > and replacing them with a capture group that captures \[gene=([^\]])\].

ADD REPLY
0
Entering edit mode

Oh that worked! So I want to understand, what does the />/ after the equal sign mean in the line of code? I understand that sed 's/>.*' looks for the string starting with ">" and it can be any single character with any amount, but then I don't understand the period before gene and the "/>/" bit.

Also, once I obtain these names can I use this line fo code to replace the names (I kept this from another biostar question) awk '/^>/{print ">" ++i; next}{print}' < file.fasta.

I am afraid of testing it out though and screwing my up gigantic file lol!

ADD REPLY
0
Entering edit mode

I would probably try:

grep '>' LARGE_MSA.fna | sed 's/>.*.gene=/>/' | sed 's/].*//' |  awk '/^>/{print ">" ++i; next}{print}' < LARGE_MSA.fna
ADD REPLY
0
Entering edit mode

what does the />/ after the equal sign mean in the line of code

That's sed syntax. sed s tells sed it is performing a substitute (replace) operation. The character after s is taken as the delimiter. sed s/ -> substitute the expression between the two next / characters with the expression between the next two /

sed s/FIRST/SECOND/' would replaceFIRSTwithSECOND. Any delimiter can be used, sosed s#FIRST#SECOND#` is also valid.

Here, sed s/>.*.gene=/>/ replaces >.*.gene= with >, essentially removing everything from the first character up until gene= in a header. The next sed removed everything starting from the first ] character, giving you the header you want.

Instead of using awk to replace headers, go for BBUtils.

ADD REPLY
0
Entering edit mode

I have tried using sed

No, that's perl, not sed :-)

ADD REPLY
0
Entering edit mode

Oh haha, yes I used it from another biostars post and tried to modify it. I really need to learn this properly! So what does the perl -pe line do if it is not perl being used?

ADD REPLY
0
Entering edit mode

It is perl being used. Your post said sed while it used perl.

ADD REPLY
0
Entering edit mode

The sed command is not replacing anything in the file. It just prints the correct gene name out but when I check the Large_MSA.fna again, the names are the same. Did I miss something?

ADD REPLY
0
Entering edit mode

You can use sed to generate the new content. Substituting existing content is the next step in the process. From your comment, I thought you understood that this is a two step process:

Also, once I obtain these names can I use this line fo code to replace the names (I kept this from another biostar question) awk '/^>/{print ">" ++i; next}{print}' < file.fasta.

ADD REPLY
0
Entering edit mode

Oh yea I did but with my awk command which I thought would replace it, it's not at all. It prints it out fine but I thought if I fed the same file back into my awk command, instead of printing it as stdout it would just replace the names int he file but it didn't do that. I am trying to figure out why that is.

ADD REPLY
0
Entering edit mode

Like I said in an earlier comment,

Instead of using awk to replace headers, go for BBUtils.

ADD REPLY

Login before adding your answer.

Traffic: 859 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6