Trim FASTA header in multifasta file
4.0 years ago
j.m.carsten ▴ 10

Hi,

I have never done any programming on my own, so I have no idea how to handle this problem. I searched the internet and found several solutions using "sed", "awk", or a pipeline of both together with "cat". I also found a Python script and many other things, but nothing worked.

First of all: I don't have a Unix system. I'm trying to do this on a Windows 10 machine using Windows PowerShell.

I have Python 3.5 and GnuWin with sed and awk support, and the programs seem to work fine.

Now the problem: my multifasta file looks like this:

>Q0HIK0|6-phosphogluconate dehydratase|Shewanella sp. (strain MR-4)
M-HSVVQSVTDRIIARSKASREAYLAALNDARNHKACQEVGSVAQVAGVPCDGVTQGQPG
MELSLLSREVIAMATAVGLSHNMFDGALLLGICDKIVPGLLIGALSFGHLPMLFVPAGPG
KVDRAQLLEAEAQSYHSAGTCTFYGQLMLEVMGLQLPGSSFVNPDDPLREALNKMAAKQV
CRLTELGTQYSPIGEVVNEKSVVNGIVALLATGGSTNLTMHIVAAARAAGIIVNWDDFSE
KLLDGEVRWVDGPTVSLDTEVLTSVATPFQNNGGLKLLKGNLGRAVIKVSAVQEKHRVVE
APAVVIDDQNKLDALFKSGALDRDCVVVVKGQGPKANGMPELHKLTPLLGSLQDKGFKVA
LMTDGRMSGASGKVPAAIHLTPEAIDGGLIAKVQDGDLIRVDALTGELSLLVSDAELAAR
TATEIDLRHSRYGMGRELFGALRSNLSSPETGARSTSAIDELY*

>A0KX60|6-phosphogluconate dehydratase|Shewanella sp. (strain ANA-3)
M-HSVVQSVTDRIIARSKASREAYLAALNDARNHKACQEVGSVAQVAGVPCDGVTQGQPG
MELSLLSREVIAMATAVGLSHNMFDGALLLGICDKIVPGLLIGALSFGHLPMLFVPAGPG
KVDRAQLLEAEAQSYHSAGTCTFYGQLMLEVMGLQLPGSSFVNPDDPLREALNKMAAKQV
CRLTELGTQYSPIGEVVNEKSVVNGIVALLATGGSTNLTMHIVAAARAAGIIVNWDDFSE
KLLDGELRWVDGPTVSLDTEVLTSVATPFQNNGGLKLLKGNLGRAVIKVSAVQEKHRVVE
APAVVIDDQNKLDALFKSGALDRDCVVVVKGQGPKANGMPELHKLTPLLGSLQDKGFKVA
LMTDGRMSGASGKVPAAIHLTPEAIDGGLIAKVQDGDLIRVDALTGELSLLVSDAELAAR
TAAEIDLRHSRYGMGRELFGALRSNLSSPETGARSTSAIDELY*


I want the FASTA headers to be trimmed so that each is less than 99 characters long, because the program MrBayes only accepts headers shorter than that. I don't know why this is the case, but OK. And yes, the headers shown above are shorter than 99 characters, but my multifasta file contains more than 5000 sequences, and many of their headers are longer than 99 characters. So I searched the internet for programs that can do this. I found several solutions here on Biostars, but none of them seem to work on my machine.

I tried:

• awk -F"|" '/>/{$0=">"$2}1' file ==> gave the error: syntax error at >
• awk -vs1=">" -F"|" '/^>/ {print s1 \$2;next} 1' file_name ==> gave the warning: escape sequence '>' treated as plain '>', and printed the multifasta file to the shell, but nothing else
• cat input.txt | sed 's/.*[/>/' | sed 's/]//' ==> did nothing except print the multifasta file to the shell
• sed -e 's/\len=//' -e 's/\path.*// MultiFasta.txt > OutLength.txt ==> gave nothing except a >> prompt, which I had to interrupt with CTRL+C
• I tried this python script after installing the pyfaidx package (post: https://www.biostars.org/p/105338/):

from pyfaidx import Fasta, wrap_sequence

key_fn = lambda x: ' '.join(x.replace('len=', '').split()[:2])
fa = Fasta('multi.fasta', key_function = key_fn)
with open('out.fasta', 'w') as out:
    for seq in Fasta:
        out.write('>{name}\n'.format(name=seq.name))
        for line in wrap_sequence(70, str(seq)):
            out.write(line)


I replaced multi.fasta with the name of my own file. However, this script gave the following error: for seq in Fasta: TypeError: 'type' object is not iterable

I'm still looking for a solution. I would be super happy if you could help me out.

sequence fasta header trimming • 2.5k views
4.0 years ago
cut -c 1-99 in.fa > out.fa

4.0 years ago
James Ashmore ★ 3.1k

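I have no access to Windows PowerShell, but this Python script should do it:

with open('output.fasta', 'w') as outfile:
    with open('input.fasta', 'r') as infile:
        for line in infile:
            if line.startswith('>'):
                # Strip the line ending before cutting, so that a header
                # shorter than 99 characters is not followed by a blank line.
                line = line.rstrip('\n')[:99] + '\n'
            # Sequence lines are written through unchanged.
            print(line, end='', file=outfile)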


Replace 'output.fasta' and 'input.fasta' with your filenames...
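
If the limit needs to be adjusted (the question asks for fewer than 99 characters, which would be a limit of 98), the same idea can be wrapped in a small function. This is only a sketch: the names trim_fasta_headers and trim_fasta_file and the max_len parameter are made up for illustration, and the limit is assumed to count the leading '>' as part of the header.

```python
from typing import Iterable, Iterator


def trim_fasta_headers(lines: Iterable[str], max_len: int = 99) -> Iterator[str]:
    """Yield lines unchanged, except header lines, which are cut to max_len characters.

    Assumption: max_len counts the leading '>' but not the line ending.
    """
    for line in lines:
        if line.startswith(">"):
            # Remove the line ending first so it does not count toward the limit.
            yield line.rstrip("\r\n")[:max_len] + "\n"
        else:
            yield line


def trim_fasta_file(in_path: str, out_path: str, max_len: int = 99) -> None:
    """Copy in_path to out_path, trimming only the header lines."""
    with open(in_path) as infile, open(out_path, "w") as outfile:
        outfile.writelines(trim_fasta_headers(infile, max_len))
```

Calling trim_fasta_file('input.fasta', 'output.fasta', max_len=98) would keep every header below 99 characters. Note that cutting headers can make two of them identical, so it may be worth checking the output for duplicate names before passing it on.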