Question

Modify FASTA headers

0


7.6 years ago

mpbiology.dna • 0

Hello! I have a FASTA file and I need a script that reads in the file, changes all the headers to a new format, and writes all the sequences to a new output file. The modified headers should contain, for each sequence, the species name (with "_" in place of the space), a space character, and the identifier in square brackets with "ALKBH1:" in front of it. Overall, the headers should look like this:

 > Homo_sapiens [ALKBH1:NP_001192039]

If any of you can generate a script for me with some line descriptions I would really appreciate it. I am a beginner in this field and I need some help. Thank you all in advance!

FASTA Python script headers • 3.3k views

ADD COMMENT • link updated 7.6 years ago by Pierre Lindenbaum 166k • written 7.6 years ago by mpbiology.dna • 0

3


You could (you should, in fact) start by trying something and ask for help after getting stuck. Then describe what you tried (show the code) and show some relevant input and output. This question has been asked many times here and elsewhere; you can get started by reading these threads:

Question: Renaming Entries In A Fasta File

Question: How To Rename FASTA Headers

Question: Editing header of a fasta file

how to rename fasta file headers using sed

Modifying FASTA headers with Unix command line tools

Any script to parse fasta headers?

ADD REPLY • link 7.6 years ago by h.mon 35k

1


An example of the input is needed.

ADD REPLY • link 7.6 years ago by Pierre Lindenbaum 166k

0


Each sequence in the input file looks like this:

>gi|122937263|ref|NP_001073901.1|/1-505 alkB1 DNA repair protein ALKBH1 [Homo sapiens]
MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL
HKHGCLFRDLVRIQGKDLLTPVSRILIGNPGCTYKYLNTRLFTVPWPVKGSNIKHTEAEIAAACETFLKLND
YLQIETIQALEELAAKEKANEDAVPLCMSADFPRVGMGSSYNGQDEVDIKSRAAYNVTLLNFMDPQKMPYLK
EEPYFGMGKMAVSWHHDENLVDRSAVAVYSYSCEGPEEESEDDSHLEGRDPDIWHVGFKISWDIETPGLAIP
LHQGDCYFMLDDLNATHQHCVLAGSQPRFSSTHRVAECSTGTLDYILQRCQLALQNVCDDVDNDDVSLKSFE
PAVLKQGEEIHNEVEFEWLRQFWFQGNRYRKCTDWWCQPMAQLEALWKKMEGVTNAVLHEVKREGLPVEQRN
EILTAILASLTARQNLRREWHARCQSRIARTLPADQKPECRPYWEKDDASMPLPFDLTDIVSELRGQLLEAK
P

ADD REPLY • link updated 7.6 years ago by Pierre Lindenbaum 166k • written 7.6 years ago by mpbiology.dna • 0
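
For readers who want to see which parts of such a header carry the needed information, here is a minimal Python sketch that takes the example header apart. It assumes every header follows the same old-style NCBI layout as the one posted (pipe-delimited IDs with the accession in the fourth field, species in the last pair of square brackets); `split_ncbi_header` is a hypothetical helper name, not part of any library.

```python
def split_ncbi_header(header):
    """Split an old-style NCBI header into (accession, species).

    Illustrative helper; assumes the layout of the example in this thread.
    """
    # Drop the leading '>' and separate the ID block from the description.
    id_block, _, description = header.lstrip(">").strip().partition(" ")
    # The ID block is pipe-delimited: gi|<number>|ref|<accession>|<range>
    fields = id_block.split("|")
    accession = fields[3]
    # The species is the text inside the last pair of square brackets.
    species = description[description.rfind("[") + 1 : description.rfind("]")]
    return accession, species


header = (
    ">gi|122937263|ref|NP_001073901.1|/1-505 "
    "alkB1 DNA repair protein ALKBH1 [Homo sapiens]"
)
print(split_ncbi_header(header))  # ('NP_001073901.1', 'Homo sapiens')
```

Headers that deviate from this layout (no `ref|` block, no bracketed species) would need their own handling.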

0


A few assumptions: the species is Homo sapiens in all the sequences, and the protein ID is always 6 characters long (e.g. ALKBH1). It should work with a multi-FASTA file. Code:

$ sed  '/^>/ s/.*\(NP.*\)|.*protein\s\(.\{6\}\)\s\[\(.\{4\}\)\s\(.\{7\}\).*/>\3_\4 \[\2: \1\]/g' test.fa

output:

>Homo_sapiens [ALKBH1: NP_001073901.1]

MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL

input:

$ cat test.fa 
>gi|122937263|ref|NP_001073901.1|/1-505 alkB1 DNA repair protein ALKBH1 [Homo sapiens]
MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL

ADD REPLY • link 7.6 years ago by cpad0112 21k
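
Since the question is tagged Python and asks for commented code, here is a rough Python sketch of the same idea that does not depend on fixed-width species or protein names. It is only a sketch under assumptions: headers look like the example in this thread (`...|ref|<accession>|... [<species>]`), the gene label is the literal `ALKBH1` the question asks for, and the names `HEADER_RE`, `rename_header`, and `rename_fasta` are made up for illustration. The example in the question shows the accession without its version suffix, so `keep_version=False` is offered as an option.

```python
import re
import sys

# Assumed header shape: >gi|<number>|ref|<accession>|... <description> [<species>]
HEADER_RE = re.compile(r"^>.*?\|ref\|([^|]+)\|.*\[([^\]]+)\]\s*$")


def rename_header(line, gene="ALKBH1", keep_version=True):
    """Return the reformatted header, or the line unchanged if it does not match."""
    match = HEADER_RE.match(line)
    if match is None:
        return line.rstrip("\n")
    accession, species = match.groups()
    if not keep_version:
        accession = accession.split(".")[0]  # NP_001073901.1 -> NP_001073901
    # Replace spaces in the species name with underscores.
    return f">{species.strip().replace(' ', '_')} [{gene}:{accession}]"


def rename_fasta(in_path, out_path, gene="ALKBH1"):
    """Copy a FASTA file, rewriting header lines and leaving sequence lines as-is."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(rename_header(line, gene) + "\n")
            else:
                dst.write(line)


if __name__ == "__main__" and len(sys.argv) == 3:
    rename_fasta(sys.argv[1], sys.argv[2])
```

Saved under a name of your choice (say `rename_headers.py`), it could be run as `python rename_headers.py test.fa renamed.fa`. Headers that do not match the assumed pattern are written out unchanged, so it is worth checking the output for any that slipped through.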