Question: Using blast or similar tool to annotate ORFs using Bacmet database
0
gravatar for Longshotx
15 months ago by
Longshotx 50
Longshotx 50 wrote:

Hello all - I am interested in annotating some ORFs I assembled using the 753 experimentally confirmed genes in Bacmet (http://bacmet.biomedicine.gu.se/) as a reference. Basically I want to do a blast search against this database and annotate similarly matched ORFs if they have an evalue less than or equal to 0.00001.

What would be the best tool or strategy for this?

ADD COMMENTlink modified 15 months ago by Mensur Dlakic7.3k • written 15 months ago by Longshotx 50
1
gravatar for Mensur Dlakic
15 months ago by
Mensur Dlakic7.3k
USA
Mensur Dlakic7.3k wrote:

I can think of at least two strategies.

First, split your 753 proteins into single files, do BLASTp searches for each of them against your ORF database using the E-value threshold of 0.00001, and save the output of each search into individual files. BLASTp search files with no matches have the size of about 2500 characters - you can find the exact range of numbers easily by finding out the smallest 5-10 files after the search. After that there will be a jump in file sizes of at least 500, which are the files that contain matches. NOTE: if your ORF database is smaller than 753 proteins, you may want to do a reverse strategy: split your database into individual proteins, and BLASTp against the Bacmet ORFs.

Run jackhmmer, HHpred or some other automated search strategy on your Bacmet proteins and build HMM files for each of them. Concatenate all HMMs into a single file, and annotate your ORFs against this database using Prokka.

ADD COMMENTlink written 15 months ago by Mensur Dlakic7.3k

I tried to do something similar to the hmm profile strategy but got confused. I originally aligned all the the proteins in Bacmet and build a hmm profile, but when I performed a search the query said "Bacmet alignment" for all of them which is not helpful at all. Are you suggesting I build a hmmprofile for each of the 753 individual genes? So hmmbuild Bacmetprotein1.fa Bacmetprotein1.hmm....same for all other proteins. Concatenate the HMMs. Hmmpress them and use in Prokka?

Lastly I was not aware Prokka could annotate existing ORFs . I thought you had to feed it the assemblies, and it would use a specified HMM profile and Prodigal (since Prokka is somewhat of a wrapper for other tools) to create the ORFs and annotate them. Note: (I just looked at the Github...it cannot annotate ORFs from Prodigal...you must rerun it on the contigs).

Thanks!

ADD REPLYlink modified 15 months ago • written 15 months ago by Longshotx 50

Are you suggesting I build a hmmprofile for each of the 753 individual genes?

That is what I am suggesting. From a quick glance at Bacmet database it seems that it contains 753 different proteins. I don't think making an alignment out of them would be productive.

By the same token, making HMMs out of individual proteins is not very productive either. I suggested you do a search of 753 individual proteins using jackhmmer or HHpred, which will return alignments of your starting proteins with their homologs. Those alignments should be used to build HMMs.

HMMs can be named during the building process using the -n switch, or you can edit the NAME field manually since HMMs are simple text files. If they are not named explicitly, you may get a generic name such as Bacmet alignment creep into all of them.

Lastly I was not aware Prokka could annotate existing ORFs

It is meant to annotate (meta)genome assemblies. However, one can skip the Prodigal step and feed it ORFs instead. Commenting out Prodigal section (starts around line 700 in main prokka script) and changing a few lines should do the trick.

ADD REPLYlink written 15 months ago by Mensur Dlakic7.3k

I have never used either of those tools, but it appears I can use the jackhmmer script on the cluster I am working on. Could you suggest a db to use? I am assuming nr? Or I could use the BacMet predicted genes as a database...that has 150,000 genes.

ADD REPLYlink modified 15 months ago • written 15 months ago by Longshotx 50

It will probably be good enough if you use UniProt90 database (available as gzipped file here). That database is considerably smaller than nr.

If you are confident that Bacmet database of predicted genes is comprehensive enough, that will save you quite a bit of time compared to either UniProt90 or nr.

ADD REPLYlink written 15 months ago by Mensur Dlakic7.3k

Excellent thank you. I'm going to try the BacMet 735 experimentally confirmed genes against the against BacMet predicted database of 155k gene homologs for now.

ADD REPLYlink written 15 months ago by Longshotx 50

Hi again Mensur

So I hope I am setting this up correctly because I just ran it using parallel and it finished without any output file.

So I have a directory with a .fasta for everyone of the 735 genes and the database containing the 155k genes is in a directory above.

I ran this command:

ls *.fa | parallel --gnu "jackhmmer {} Bacmet_predicted.fa"

I didn't realize I had to specify an output file (very dumb of me). I'm not sure what I need it to output though. Am I looking to get a Stockholm (.sto) alignment file for each query gene by passing the -A option?

Then I build a hmm profile for each...

I'm not familiar with combining the hmm profiles.

Please let me know if this sounds correct. Thanks

Just in case someone wants to run this on therir own I edited my code:

ls *.fa | parallel --gnu "jackhmmer --cpu 16 -o {.}.out -A {.}.sto --tblout {.}.tab {} Bacmet_predicted.fa"
ADD REPLYlink modified 15 months ago • written 15 months ago by Longshotx 50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1871 users visited in the last hour