Question

Search through thousands of fasta sequences via the description field of the header using bioperl

0

Entering edit mode

9.9 years ago

cjy8709 • 0

Hi everyone

I have a fasta file where the header typically looks like this:

>FBpp0071678 type=protein;loc=2R:join(18050425..18051199,18052282..18052494,18056749..18058222,18058283..18059490,18059587..18059757,18059821..18059938,18060002..18060032); ID=FBpp0071678; name=a-PB; parent=FBgn0000008,FBtr0071764; dbxref=FlyBase_Annotation_IDs:CG6741-PB,FlyBase:FBpp0071678,GB_protein:AAF46809.2,REFSEQ:NP_524641,GB_protein:AAF46809; MD5=9eb6e9e4c12ec62fdeb31cca5b0683b6; length=1329; release=r5.57; species=Dmel;
________________________________________________________________________________________________________________________________________________________________________________________________________^----------------^

The file has thousands of sequences and I need to search according to the parent field of the header (I've highlighted it in the header I'm showing). Basically I'm searching a sequence based on the FBgn number.

I'm using bioperl's Bio::DB::Fasta module but from what I understand when the db indexes my database it only recognizes the first non-space characters as the identifier and than everything after the white space is ignored.

Anyways is there any function in bioperl to search the database through the description field of the header?

Or would I have to make a perl script to change all the headers so that it fits my current specific criteria.

Thanks!

fasta-search bioperl • 2.5k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.9 years ago by cjy8709 • 0

1

Entering edit mode

9.9 years ago

Prakki Rama ★ 2.7k

If your fasta is oneline format, then you can try searching using fgrep.

If you have all your FBgn pattern in a file and use the following.

LC_ALL=C fgrep -A 1 -f patternList fasta.fa

else, if you do not have a pattern list, but want to search the sequences with pattern parent=FBgn in their header, and want to print out, then try

LC_ALL=C fgrep -A 1 'parent=FBgn' fasta.fa

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.9 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Thanks for the comment but I was looking for a way to index my DB with different identifiers in bioperl.

ADD REPLY • link 9.9 years ago by cjy8709 • 0

Ram · Accepted Answer · 2014-06-18

You can use the -makeid option. From the documentation:

The -makeid option gives you a chance to modify sequence IDs during indexing. The option value should be a code reference that will take a scalar argument and return a scalar result...

If you have a sequence object, $seq, you can get the part of the header after the space (which is called the description) like this:

$seq->desc

Then you could combine that with the ID in your -makeid subroutine, for indexing.