Question

FASTA header column split or selection

0

19 months ago

genomics_buddy • 0

How can I extract a specific column from the sequence header identifiers of a FASTA file?

My headers look like this:

>PGM0100236.1 [Candida]  scaffold00238
>PGM0100236.1 [Candida]  scaffold00239
>PGM0100236.1 [Candida]  scaffold00240
>PGM0100236.1 [Candida]  scaffold00241

I would like to keep only the third column, i.e. scaffold00238, for all the headers in my FASTA file. Please suggest a simple command. I am new to bioinformatics and Linux scripting.

Thank you.

Fasta • 1.5k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 19 months ago by genomics_buddy • 0

0

awk '{print $3}' input > output

ADD REPLY • link 19 months ago by young_bioinformatician ▴ 230

2

This solution prints only the scaffold names, losing all other information.

What OP wants:

I would like to take my third column alone i.e scaffold00238 for all the headers in my fasta file

ADD REPLY • link 19 months ago by GenoMax 141k

0

If your file only contains the headers and not the sequence, another easy solution is

cat my_file | cut -f3 > my_new_filtered_file

If it does contain the sequence then

cat my_file | grep ">" | cut -f3 > my_new_filtered_file

This assumes that the delimiter between columns is a tab (\t). If it is a space, you need to set the delimiter explicitly with cut -d " " -f3
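One caveat worth illustrating: the headers as quoted in the question appear to have two spaces before the scaffold name (this may just be a formatting artifact of the post). The sketch below, using a made-up file called demo.fasta, shows how cut and awk differ in that case. The file names are hypothetical and only there for the demonstration.

```shell
# Build a tiny demo file (hypothetical name demo.fasta); note the two spaces
# before the scaffold name, as in the headers quoted in the question.
printf '>PGM0100236.1 [Candida]  scaffold00238\nAGCATCG\n>PGM0100236.1 [Candida]  scaffold00239\nTTGACA\n' > demo.fasta

# cut counts every single space as a separator, so field 3 is the empty
# string between the two spaces and the scaffold name lands in field 4.
grep '^>' demo.fasta | cut -d ' ' -f3 > cut_f3.txt
grep '^>' demo.fasta | cut -d ' ' -f4 > cut_f4.txt

# awk splits on runs of whitespace by default, so $3 is the scaffold name
# no matter how many spaces or tabs separate the columns.
awk '/^>/ {print $3}' demo.fasta > awk_f3.txt
```

If the real file uses a single space or a single tab between columns, cut -f3 with the matching delimiter behaves the same as the awk version.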

ADD REPLY • link 19 months ago by Antonio R. Franco ★ 5.1k

1

Neither of these solutions does what OP wants, as far as I can tell.

OP wants to use a word to modify the header of a multi-fasta file.

ADD REPLY • link 19 months ago by GenoMax 141k

0

palani: Please confirm that you want to change

>PGM0100236.1 [Candida] scaffold00238
AGCATCG

to

>scaffold00238
AGCATCG

ADD REPLY • link 19 months ago by GenoMax 141k

0

Yes, exactly like that. Thanks for all the responses. This is my first time on Biostars. I am happy for all the suggestions. Thank you all.

ADD REPLY • link 19 months ago by genomics_buddy • 0

0

Thank you all for your suggestions, I will try it. I am glad for all your support.

ADD REPLY • link 19 months ago by genomics_buddy • 0

score 1 · Answer 1 · 2022-09-28

1

19 months ago

antmantras ▴ 80

Edit: Apologies, I thought OP wanted only the names of the scaffolds. A solution could then be:

awk '/^>/{$0=">"$NF}1' myfile.fasta > output.fasta

This replaces each FASTA header with ">" followed by its last whitespace-separated field and leaves the sequence lines unchanged.
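As a small worked example of that one-liner, the sketch below runs it on a two-record demo file. The file names demo_in.fasta and demo_out.fasta and the sequences are made up for illustration.

```shell
# Two-record demo input (hypothetical file name and sequences).
printf '>PGM0100236.1 [Candida]  scaffold00238\nAGCATCG\n>PGM0100236.1 [Candida]  scaffold00239\nTTGACA\n' > demo_in.fasta

# /^>/        act only on header lines
# $0=">"$NF   replace the whole line with ">" plus its last field
# 1           always-true pattern, so every line (header or sequence) is printed
awk '/^>/{$0=">"$NF}1' demo_in.fasta > demo_out.fasta

# Show the result: headers reduced to the scaffold name, sequences untouched.
cat demo_out.fasta
```

Because $NF is the last field, this relies on the scaffold name being the final column of every header. If some headers carry extra trailing fields, ">"$3 would be the safer choice.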

ADD COMMENT • link 19 months ago by antmantras ▴ 80

1

Congratulations, 2/3 of your commands qualify for the UUOC (Useless Use of Cat) award!

ADD REPLY • link 19 months ago by Ram 43k

1

Yeah, I know it can be written with:

grep ">" myfile.fasta | awk '{print $3}' > output.txt

if one is only looking for the names in the third column. However, I think it is easier for someone new to Unix to understand what is going on with that command sequence (by first using cat). Anyway, since that is not what OP wanted, I removed that part.

ADD REPLY • link 19 months ago by antmantras ▴ 80

1

That's a good reason to use cat where it's not required (as the Wiki page says). I also use it when I'm "building" a piped command sequence: I often start out with head file | ... and then go back to the working command and replace head with cat. Here on the forum, though, you can skip the cat-ing; ultimately, people should learn better ways of using commands, and while we don't need to be Perl-like in complexity, we can avoid over-simplification as well.
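To make the comparison concrete, here is a minimal sketch of the same extraction written with and without cat, plus the head-based prototyping habit mentioned above. The file demo_uuoc.fasta and its contents are invented for the demonstration.

```shell
# Small demo input (hypothetical file name and contents).
printf '>seq1 a b\nACGT\n>seq2 c d\nGGCC\n' > demo_uuoc.fasta

# With cat: an extra process whose only job is to feed the pipe.
cat demo_uuoc.fasta | grep '^>' | awk '{print $3}' > with_cat.txt

# Without cat: grep reads the file itself.
grep '^>' demo_uuoc.fasta | awk '{print $3}' > without_cat.txt

# Single process: awk both selects header lines and prints the field.
awk '/^>/ {print $3}' demo_uuoc.fasta > awk_only.txt

# While prototyping, head can stand in for cat to limit the input.
head -n 2 demo_uuoc.fasta | grep '^>' | awk '{print $3}'
```

All three variants write the same output; they differ only in how many processes are started and how easy the pipeline is to read.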

ADD REPLY • link 19 months ago by Ram 43k

score 1 · Answer 2 · 2022-09-28

1

19 months ago

rpolicastro 13k

A seqkit answer.

seqkit replace -p ".+(scaffold[0-9]+$)" -r "\$1" file.fasta
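For readers without seqkit installed, the same regular-expression idea can be sketched with sed. This is a swapped-in tool rather than the seqkit command itself, and the demo file names are hypothetical.

```shell
# Two-record demo input (hypothetical file name and sequences).
printf '>PGM0100236.1 [Candida]  scaffold00238\nAGCATCG\n>PGM0100236.1 [Candida]  scaffold00239\nTTGACA\n' > demo_sed_in.fasta

# -E enables extended regular expressions (supported by GNU and BSD sed).
# On header lines, the greedy .* consumes everything up to the final
# "scaffold<digits>" at the end of the line; \1 is that captured name.
# Sequence lines do not start with ">" and pass through unchanged.
sed -E 's/^>.*(scaffold[0-9]+)$/>\1/' demo_sed_in.fasta > demo_sed_out.fasta
```

Like the seqkit pattern, this only rewrites headers that end in scaffold followed by digits; a header with trailing whitespace or a different naming scheme would be left as it is.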

ADD COMMENT • link 19 months ago by rpolicastro 13k