Fasta headers column split or selection
5 months ago

How do I extract a specific column from the sequence header identifiers of a FASTA file?

My headers look like this:

>PGM0100236.1 [Candida]  scaffold00238
>PGM0100236.1 [Candida]  scaffold00239
>PGM0100236.1 [Candida]  scaffold00240
>PGM0100236.1 [Candida]  scaffold00241


I would like to take my third column alone, i.e. scaffold00238, for all the headers in my fasta file. Please suggest a simple command. I am new to bioinformatics and Linux scripting.

Thank you.

awk '{print $3}' input > output

This solution prints just the scaffold names and drops all other information, which is what OP asked for:

"I would like to take my third column alone, i.e. scaffold00238, for all the headers in my fasta file"

If your file contains only the headers and not the sequences, another easy solution is:

cat my_file | cut -f3 > my_new_filtered_file

If it does contain the sequences, then:

cat my_file | grep ">" | cut -f3 > my_new_filtered_file

This assumes that the delimiter between columns is a tab (\t). If it is a space, you need to set the delimiter with cut -d " " -f3.

Neither of these solutions does what OP wants, as far as I can tell. OP wants to replace each header of a multi-fasta file with a single word taken from it.

palani: Please confirm that you want to change

>PGM0100236.1 [Candida] scaffold00238
AGCATCG

to

>scaffold00238
AGCATCG

Yes, exactly like that. Thanks for all the responses. This is my first time on Biostars. I am happy with all the suggestions. Thank you all.

Thank you all for your suggestions, I will try them. I am grateful for all your support.

5 months ago antmantras

Edit: Apologies, I thought OP wanted only the names of the scaffolds. Then a solution could be:

awk '/^>/{$0=">"$NF}1' myfile.fasta > output.fasta

This replaces each fasta header with its last field and leaves the sequence lines untouched.

Congratulations, 2/3 of your commands qualify for the UUOC award!

Yeah, I know it can be written as:

grep ">" myfile.fasta | awk '{print $3}' > output.txt

if one is only looking for the names in the third column. However, I think it is easier for someone new to Unix to understand what is going on in that command sequence when it starts with cat. Anyway, since that is not what OP wanted, I removed that part.
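For anyone who wants to see what the awk header rewrite does before running it on real data, here is a minimal sketch. The function name rename_headers and the two sample records are made up for illustration; only the awk program itself comes from the answer above.

```shell
# Hypothetical helper (name chosen for illustration): rewrite every FASTA
# header to ">" plus its last whitespace-separated field. Sequence lines
# do not match /^>/ and are printed unchanged by the trailing "1".
rename_headers() {
    awk '/^>/ { $0 = ">" $NF } 1' "$@"
}

# Made-up two-record sample that mirrors the headers in the question.
printf '%s\n' \
    '>PGM0100236.1 [Candida]  scaffold00238' \
    'AGCATCG' \
    '>PGM0100236.1 [Candida]  scaffold00239' \
    'TTGACCA' |
    rename_headers
# prints:
# >scaffold00238
# AGCATCG
# >scaffold00239
# TTGACCA
```

On a real file the call would be rename_headers myfile.fasta > output.fasta. Using $NF rather than $3 is a design choice: it still works if some headers have more or fewer fields, as long as the scaffold name comes last.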

That's a good reason to use cat where it's not required (as the Wiki page says). I also use it when I'm "building" a piped command sequence: I often start out with head file | ... and then go back to the working command and replace head with cat. But here on the forum you can skip the cat, as ultimately people should learn better ways of using commands. We don't need to be Perl-like in complexity, but we can avoid over-simplification as well.
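To make the cat-free version concrete, here is a rough sketch that lists only the scaffold names. It also shows a pitfall with cut when the separator is a space. It assumes the headers really contain two consecutive spaces before the scaffold name, as they appear in the question; the function name scaffold_names is made up for illustration.

```shell
# Hypothetical helper: print the last field of every header line,
# reading the file directly (no cat, no separate grep).
scaffold_names() {
    awk '/^>/ { print $NF }' "$@"
}

# One header copied from the question, including the double space.
header='>PGM0100236.1 [Candida]  scaffold00238'

# awk splits on runs of blanks, so the double space does not matter.
printf '%s\n' "$header" | scaffold_names      # prints: scaffold00238

# cut treats every single space as a separator, so with two spaces
# field 3 is empty and the scaffold name ends up in field 4.
printf '%s\n' "$header" | cut -d ' ' -f3      # prints an empty line
printf '%s\n' "$header" | cut -d ' ' -f4      # prints: scaffold00238
```

So if the columns are separated by a variable number of spaces, awk is the safer choice; cut is fine when the separator is a single tab or a single space.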

5 months ago

seqkit replace -p ".+(scaffold[0-9]+$)" -r "\$1" file.fasta
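If seqkit is not installed, roughly the same substitution can be sketched with sed. This is an approximation, not a drop-in replacement: as far as I understand, seqkit applies the pattern to the header text without the leading ">", so the sed expression below has to match and re-emit the ">" itself. The function name scaffold_headers is made up for illustration.

```shell
# Hypothetical sed counterpart of the seqkit pattern: on header lines only,
# keep ">" plus the trailing "scaffold<digits>" token. Headers that do not
# end in such a token, and all sequence lines, are left as they are.
scaffold_headers() {
    sed -E '/^>/ s/^>.*(scaffold[0-9]+)$/>\1/' "$@"
}

printf '%s\n' \
    '>PGM0100236.1 [Candida]  scaffold00238' \
    'AGCATCG' |
    scaffold_headers
# prints:
# >scaffold00238
# AGCATCG
```

On a real file: scaffold_headers file.fasta > output.fasta. Like the seqkit command, this writes to standard output, so redirect it to a new file rather than overwriting the input.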