Hi all friends,
I have a large fasta file that most sequences have a identical header (they differ from the length). I usually extracted the sequences of interest with the following python codes:
#!/usr/bin/env python # -*- coding: utf-8 -*- """Extract sequences from a fasta file if their name is in a 'wanted' file. Wanted file contains one sequence name per line. Usage: %program <input_file> <wanted_file> <output_file>""" import sys import re try: from Bio import SeqIO except: print "This program requires the Biopython library" sys.exit(0) try: fasta_file = sys.argv # Input fasta file wanted_file = sys.argv # Input wanted file, one gene name per line result_file = sys.argv # Output fasta file except: print __doc__ sys.exit(0) wanted = set() with open(wanted_file) as f: for line in f: line = line.strip() if line != "": wanted.add(line) fasta_sequences = SeqIO.parse(open(fasta_file),'fasta') with open(result_file, "w") as f: for seq in fasta_sequences: name = seq.id if name in wanted and len(seq.seq.tostring()) > 0: wanted.remove(name) # Output only the first appearance for a name SeqIO.write([seq], f, "fasta")
But the code just extracted one of the sequences with identical header, while I want to all sequences of interest that may have identical headers and adding the sequential number at the beginning or end of identical headers in the extracted fasta file? something like:
header_1 header_2 header_3 etc.