Question

Remove part of the header from multi-fasta file (another one)

4.6 years ago

macielrodriguez2 ▴ 50

Hi!!!

I have a multifasta file with headers like:

>trnN-GUU_INIA601-ARAGORN_v1.2.38 ccsA_INIA601-blatX
>rpl16_INIA601-blatX ndhF_INIA601-blatX psbJ_INIA601-blatX
>trnW-CCA-I_INIA601-ARAGORN_v1.2.38 trnL-UAG_INIA601-ARAGORN_v1.2.38
>psaC_INIA601-blatX trnR-UCU_INIA601-ARAGORN_v1.2.38 ndhA_INIA601-blatX
>trnC-ACA_INIA601-ARAGORN_v1.2.38 trnW-CCA-II_INIA601-ARAGORN_v1.2.38

I would like a way to keep only the gene name, like:

>rpl16 
>trnW 
>psaC 
>trnC

Thank you so much for your kind help :)

gene sequence fasta • 835 views

ADD COMMENT • link updated 4.6 years ago by zx8754 11k • written 4.6 years ago by macielrodriguez2 ▴ 50

with seqkit:

$ seqkit replace -p "[-_].*" -r "" input.fa

Check first whether it makes sense to remove "_INIA601" and everything after it from the FASTA headers.
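One way to run that check before rewriting the file is to preview the trimmed headers and list any that would no longer be unique. For example, headers beginning with trnW-CCA-I and trnW-CCA-II would both collapse to ">trnW". This is only a sketch with standard tools, not part of seqkit; the function name `trimmed_duplicates` is made up for illustration.

```shell
# Hypothetical helper: print trimmed header names that would occur more than once.
trimmed_duplicates() {
    # Keep header lines only, cut from the first "-" or "_" onward,
    # then report names that appear at least twice.
    grep '^>' "$1" | sed 's/[-_].*//' | sort | uniq -d
}

# Usage (an empty result means every trimmed header stays unique):
# trimmed_duplicates input.fa
```

If this prints anything, trimming only at "_" (keeping the part after the first "-") may be the safer choice.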

ADD REPLY • link 4.6 years ago by cpad0112 21k

score 0 · Answer 1 · 2019-12-06

Looking at the file, I assumed that the tRNA gene names include the anticodon sequence, e.g. "trnN-GUU", "trnW-CCA-I", "trnC-ACA". If you omit the anticodon you lose information and can no longer distinguish cases such as "trnR-UCG" and "trnR-CCG". So if you need to keep the full gene names, you can use:

perl -pe 's/_.*//g' multifasta_file

This substitution matches from the first _ to the end of the line and replaces it with nothing, so only the text before the first underscore remains.

Otherwise, if you want the result exactly as you wrote it, you can use:

perl -pe 's/[_-].*//g' multifasta_file

This expression removes everything from the first _ or - to the end of the line. Perl quantifiers are greedy by default, so .* matches as much as it can, which here is the rest of the line.