Question: Trim FASTA header in multifasta file
j.m.carsten wrote (2.7 years ago):

Hi,

I have never done any programming myself, so I have no idea how to handle this problem. I searched the internet and found several solutions using "sed", "awk", or a pipeline of both together with "cat". I also found a Python script and many other things, but nothing worked.

First of all: I don't have a Unix system. I'm trying to do this on a Win10 machine using Windows PowerShell.

I have Python 3.5 and GnuWin with sed and awk support, and the programs seem to work fine.

Now the problem: my multifasta file looks like this:

>Q0HIK0|6-phosphogluconate dehydratase|Shewanella sp. (strain MR-4)
M-HSVVQSVTDRIIARSKASREAYLAALNDARNHKACQEVGSVAQVAGVPCDGVTQGQPG
MELSLLSREVIAMATAVGLSHNMFDGALLLGICDKIVPGLLIGALSFGHLPMLFVPAGPG
KVDRAQLLEAEAQSYHSAGTCTFYGQLMLEVMGLQLPGSSFVNPDDPLREALNKMAAKQV
CRLTELGTQYSPIGEVVNEKSVVNGIVALLATGGSTNLTMHIVAAARAAGIIVNWDDFSE
LSDAVPLLARVYPNGHADINHFHAAGGMAFLIKELLDAGLLHEDVNTVAGFGLRRYTQEP
KLLDGEVRWVDGPTVSLDTEVLTSVATPFQNNGGLKLLKGNLGRAVIKVSAVQEKHRVVE
APAVVIDDQNKLDALFKSGALDRDCVVVVKGQGPKANGMPELHKLTPLLGSLQDKGFKVA
LMTDGRMSGASGKVPAAIHLTPEAIDGGLIAKVQDGDLIRVDALTGELSLLVSDAELAAR
TATEIDLRHSRYGMGRELFGALRSNLSSPETGARSTSAIDELY*

>A0KX60|6-phosphogluconate dehydratase|Shewanella sp. (strain ANA-3)
M-HSVVQSVTDRIIARSKASREAYLAALNDARNHKACQEVGSVAQVAGVPCDGVTQGQPG
MELSLLSREVIAMATAVGLSHNMFDGALLLGICDKIVPGLLIGALSFGHLPMLFVPAGPG
KVDRAQLLEAEAQSYHSAGTCTFYGQLMLEVMGLQLPGSSFVNPDDPLREALNKMAAKQV
CRLTELGTQYSPIGEVVNEKSVVNGIVALLATGGSTNLTMHIVAAARAAGIIVNWDDFSE
LSDAVPLLARVYPNGHADINHFHAAGGMAFLIKELLDAGLLHEDVNTVAGFGLRRYTQEP
KLLDGELRWVDGPTVSLDTEVLTSVATPFQNNGGLKLLKGNLGRAVIKVSAVQEKHRVVE
APAVVIDDQNKLDALFKSGALDRDCVVVVKGQGPKANGMPELHKLTPLLGSLQDKGFKVA
LMTDGRMSGASGKVPAAIHLTPEAIDGGLIAKVQDGDLIRVDALTGELSLLVSDAELAAR
TAAEIDLRHSRYGMGRELFGALRSNLSSPETGARSTSAIDELY*

I want the FASTA headers trimmed to fewer than 99 characters, because the program MrBayes only accepts headers shorter than that. I don't know why this is the case, but OK. And yes, the headers shown above are shorter than 99 characters, but my multifasta file contains more than 5000 sequences, and many of their headers are longer than 99 characters. I searched the internet for programs that can do this and found several solutions here on Biostars, but none of them seems to work on my machine.

I tried:

  • awk -F"|" '/>/{$0=">"$2}1' file ==> gave the error: syntax error at >
  • awk -vs1=">" -F"|" '/^>/ {print s1 $2;next} 1' file_name ==> gave the warning: escape sequence '>' treated as plain '>', and printed the multifasta file to the shell, but nothing else
  • cat input.txt | sed 's/.*[/>/' | sed 's/]//' ==> did nothing except print the multifasta file to the shell
  • sed -e 's/\len=//' -e 's/\path.*// MultiFasta.txt > OutLength.txt ==> gave no output, only a >> prompt, which I had to interrupt with CTRL+C
  • I tried this python script after installing the pyfaidx package (post: https://www.biostars.org/p/105338/):

from pyfaidx import Fasta, wrap_sequence

key_fn = lambda x: ' '.join(x.replace('len=', '').split()[:2])
fa = Fasta('multi.fasta', key_function = key_fn)
with open('out.fasta', 'w') as out:
    for seq in Fasta:
        out.write('>{name}\n'.format(name=seq.name))
        for line in wrap_sequence(70, str(seq)):
            out.write(line)

I replaced multi.fasta with the name of my own file. However, the script gave the following error: for seq in Fasta: TypeError: 'type' object is not iterable

I'm still looking for a solution. I would be super happy if you could help me out.

Pierre Lindenbaum wrote (2.7 years ago):

cut -c 1-99 in.fa > out.fa
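
Note that this command shortens every line of the file to its first 99 positions, not only the header lines. That is harmless for the file in the question, because its sequence lines are 60 characters wide. In case `cut` is not available on the Windows machine, here is a minimal Python sketch of the same idea. The function name `cut_columns` and the small in-memory list of lines are assumptions made for illustration, not part of the original answer.

```python
def cut_columns(lines, width=99):
    """Keep at most `width` characters of every line, newline excluded."""
    return [line.rstrip("\n")[:width] for line in lines]


# Hypothetical input: one over-long header and one short sequence line.
records = [
    ">" + "x" * 120,
    "MELSLLSREVIAMATAV",
]

trimmed = cut_columns(records)
print([len(line) for line in trimmed])
```

The header is cut down to 99 characters, while the sequence line is already shorter than the limit and passes through unchanged.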
James Ashmore wrote (2.7 years ago):

I don't have access to Windows PowerShell, but this Python script should do it:

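with open('output.fasta', 'w') as outfile:
    with open('input.fasta', 'r') as infile:
        for line in infile:
            if line.startswith('>'):
                # Strip the newline before slicing; otherwise headers shorter
                # than 99 characters would end up with two newlines.
                line = line.rstrip('\n')[:99] + '\n'
            print(line, end='', file=outfile)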

Replace 'output.fasta' and 'input.fasta' with your filenames...
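
One thing to watch for with any fixed-width truncation: two headers that differ only after position 99 become identical once trimmed, which is likely to cause trouble if the downstream program expects unique sequence names. A possible alternative, sketched below, keeps only the accession (the text before the first `|`) and then checks for duplicates. The function names `accession_header` and `find_duplicates` are hypothetical, and the sketch assumes that every header follows the `>accession|description|organism` layout shown in the question.

```python
from collections import Counter


def accession_header(header, max_len=99):
    """Reduce '>accession|description|organism' to '>accession'."""
    accession = header.lstrip(">").split("|", 1)[0].strip()
    return (">" + accession)[:max_len]


def find_duplicates(headers):
    """Return the headers that occur more than once."""
    counts = Counter(headers)
    return [h for h in counts if counts[h] > 1]


# The two example headers from the question.
headers = [
    ">Q0HIK0|6-phosphogluconate dehydratase|Shewanella sp. (strain MR-4)",
    ">A0KX60|6-phosphogluconate dehydratase|Shewanella sp. (strain ANA-3)",
]

short = [accession_header(h) for h in headers]
print(short)
print(find_duplicates(short))
```

If `find_duplicates` returns anything for the real file, those names would need to be made distinct by hand or with a suffix before running the analysis.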
