I have a fasta file which contains thousands of sequences, with headers as such:
Each pipe-delimited section of the header can vary from sequence to sequence, and some sequences might have identical headers except for the first or second sections.
I need to be able to search through this large file, pick out specific sequences based on their headers, and print them to another file. The search needs to allow for some degeneracy, however. I have seen examples where a library text file was used, but those only worked for exact matches between the fasta file and the library file.
For instance, let's say I want all sequences which have any variation on 'piggyBac' in their header (so PiggyBac, piggybac, DNA-piggyBac, etc.).
I'm just at a loss as to how to do this exactly. Is there some way to index this file and then search the keys for variations on 'piggyBac'? If anyone has suggestions or can point me to code that does something similar it would really be helpful.
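To make the goal concrete, here is a minimal sketch of the behaviour I have in mind, assuming a case-insensitive substring match on the whole header is enough to count as a "variation". It uses only the standard library; the function names `read_fasta` and `filter_fasta` and the default pattern are my own placeholders, not from any existing tool, and it streams the file rather than indexing it.

```python
import re


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def filter_fasta(in_handle, out_handle, pattern="piggybac"):
    """Write records whose header matches `pattern`, ignoring case.

    Assumption: `pattern` is a regular expression searched anywhere in the
    header, so 'PiggyBac', 'piggybac' and 'DNA-piggyBac' would all match.
    Returns the number of records written.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    kept = 0
    for header, seq in read_fasta(in_handle):
        if regex.search(header):
            out_handle.write(f">{header}\n{seq}\n")
            kept += 1
    return kept
```

With hypothetical file names, this would be called as `filter_fasta(open("input.fasta"), open("piggybac.fasta", "w"))`. Note that it writes each matching sequence on a single line instead of preserving the original line wrapping, and a looser pattern such as `piggy[-_ ]?bac` would be needed to catch spellings with a separator.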
I appreciate it.