Hello biologists and bioinformaticians around the world!
I'm having serious problems to accomplish a task.
I need to perform a megablast search from metagenomic reads (454) in order to determine the number of hits and extract the reads that mached the reference sequences of a specific genera.
Performing this, I need to compare the results with Bowtie2.
The first goal is choose between megablast and Bowtie, through number of hits.
The second goal is remove these sequences, which belong to this specific genera, from the original set of reads, in order to analyze the community with and without this genera.
First of all I tried to construct my custom database for megablast.
I downloaded from NCBI the list of fasta sequences and the list of GI's from my genera.
Next step was to try run the formatdb for custom databases. My code was:
formatdb -i nt -o T -p F ---> format the original database with parsing active --> it's OK
formatdb -F microcystis_gis.txt -B mic_gis.gi --> format my GI list to binary format --> it's OK
formatdb -i nt -p F -L mic_gi -F mic_gis.gi -t my_mic_db --> format database generating a alias for my GI list
Here I have no success, the error was: [formatdb] FATAL ERROR: Unable to find mic_gis.gi
formatdb -i microcystis_fasta.fasta -p F -n mic_db --> no success again
Then I tried fastacmd:
fastacmd -d nt -p F -i microcystis_gis.txt -D 1 -o microcystis.fasta ---> after a long time of processing, the result was a file with all the sequences, there wasn't the GI's filtering.
Then I tried makeblastdb:
makeblastdb -in microcystis_fasta.fasta -dbtype 'nucl' –parse_seqids -title microcystis_db
and the error was: Error: Too many positional arguments (1), the offending value: –parse_seqids
I think that my problem can be the format of my input files. Maybe the header of my microcystis_fasta.fasta isn't correct, here it is:
>gi|469475955|gb|KC166868.1| Shewanella putrefaciens strain DCH-5 16S ribosomal RNA gene, partial sequence GTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTA ATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCTGGGCTC......
and this file is with a empty line between each new sequence.
My GI list and my fasta file are the ones downloaded from NCBI, I dont understand how they can be in a wrong format.... Maybe it is necessary to change this files in some way? I can do this by Perl script, but I was wondering if you can tell something more effective before.
By constructing my database for Bowtie2, it worked, but with this warning:
Warning: Encountered empty reference sequence
But, as usually, Bowtie2 performed fine and worked.....
I think that this warning is a indication that there is something wrong with the format of my reference fasta file.
Thank you everyone, Brazilian greetings!!