Question: Modify FASTA headers
2.5 years ago, mpbiology.dna wrote:

Hello! I have a FASTA file and I need a script that reads in the file, changes all the headers to a new format, and writes out all the sequences to a new output file. For each sequence, the modified header should contain the species name (with "_" in place of the space), a space character, and the identifier in square brackets with "ALKBH1:" in front of it. Overall, the headers should look like this:

>Homo_sapiens [ALKBH1:NP_001192039]

If any of you can generate a script for me with some line descriptions I would really appreciate it. I am a beginner in this field and I need some help. Thank you all in advance!

headers script python fasta • 933 views
modified 2.5 years ago by Pierre Lindenbaum • written 2.5 years ago by mpbiology.dna

You could (you should, in fact) start by trying something and ask for help after getting stuck. Then you describe what you tried (show the code), and show some relevant input / output. This question has been asked many times here and elsewhere, you can get started by reading the threads:

Question: Renaming Entries In A Fasta File

Question: How To Rename FASTA Headers

Question: Editing header of a fasta file

how to rename fasta file headers using sed

Modifying FASTA headers with Unix command line tools

Any script to parse fasta headers?

modified 2.5 years ago • written 2.5 years ago by h.mon

An example of the input is needed.

written 2.5 years ago by Pierre Lindenbaum

Each sequence in the input file looks like this:

>gi|122937263|ref|NP_001073901.1|/1-505 alkB1 DNA repair protein ALKBH1 [Homo sapiens]
MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL
HKHGCLFRDLVRIQGKDLLTPVSRILIGNPGCTYKYLNTRLFTVPWPVKGSNIKHTEAEIAAACETFLKLND
YLQIETIQALEELAAKEKANEDAVPLCMSADFPRVGMGSSYNGQDEVDIKSRAAYNVTLLNFMDPQKMPYLK
EEPYFGMGKMAVSWHHDENLVDRSAVAVYSYSCEGPEEESEDDSHLEGRDPDIWHVGFKISWDIETPGLAIP
LHQGDCYFMLDDLNATHQHCVLAGSQPRFSSTHRVAECSTGTLDYILQRCQLALQNVCDDVDNDDVSLKSFE
PAVLKQGEEIHNEVEFEWLRQFWFQGNRYRKCTDWWCQPMAQLEALWKKMEGVTNAVLHEVKREGLPVEQRN
EILTAILASLTARQNLRREWHARCQSRIARTLPADQKPECRPYWEKDDASMPLPFDLTDIVSELRGQLLEAK
P
modified 2.5 years ago by Pierre Lindenbaum • written 2.5 years ago by mpbiology.dna

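A few assumptions: the species is Homo sapiens (Hsa) in all the sequences (the pattern matches a 4-letter genus and a 7-letter species name), and the protein ID is always 6 characters long (e.g. ALKBH1). It should work with a multi-FASTA file. Code: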

$ sed  '/^>/ s/.*\(NP.*\)|.*protein\s\(.\{6\}\)\s\[\(.\{4\}\)\s\(.\{7\}\).*/>\3_\4 \[\2: \1\]/g' test.fa

output:

>Homo_sapiens [ALKBH1: NP_001073901.1]
MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL

input:

$ cat test.fa 
>gi|122937263|ref|NP_001073901.1|/1-505 alkB1 DNA repair protein ALKBH1 [Homo sapiens]
MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL
modified 2.5 years ago • written 2.5 years ago by cpad0112