Question: Why Is Blast Creating Duplicates In My Output Files?!
3
gravatar for JayB
4.7 years ago by
JayB30
JayB30 wrote:

Wow, my first post on BioStars. I have been browsing through the archives for a while now, but the time has come where I can't find the answer I need and I have to actually ASK. Thanks in advance for all your advice!

So I am trying to compare 4 amino acid fasta files (.faa) to find the common set of proteins. I have a local database with all the protein sequences which I use to retrieve the sequences and convert the BLAST output into fasta format. Here's what I have so far:

Step 1.BLAST File_1 and File_2, outputting to a new file "hits1"

   blastp -query file_2.faa -subject file_1.faa  -outfmt "6 sseqid" -evalue 1e-20 >hits1

Step 2. I then take the output file, which contains only the IDs and convert it to fasta format.

blastdbcmd -db local_db -dbtype prot -entry_batch hits1 -outfmt %f -out results1.faa

Step 3. Then I can take this file and compare it to file_3. Repeat steps 2 and 3 until all the files are compared...

blastp -query file_2.faa -subject results1.fasta  -outfmt "6 sseqid" -evalue 1e-20 >hits1

The problem I am having is that with each round of BLAST, duplicates of each ID are created. For example in file_1 (and in the database) the gene ID "003375649" is only present 1x. But the hits1 file (after running the first round of BLAST) contains this ID 2x. Then on the second round of BLAST (hits2) the output contains this ID 4x. This means that although the original file contains only 3000 protein sequences, the hits 2 file already contains >42,000 sequences and the BLAST takes for.ev.er to run! I want to know how to solve this problem but more importantly I am interested in knowing why the program is creating duplicates for my general knowledge.

Here's what I have tried: a. making more stringent e-values (1e-30), this does not help b. creating local databases from the .faa files themselves rather than one larger one and running blast like so:

makeblastdb -in file_1.faa -dbtype prot -out file_1_db -parse_seqids
blastp -query file_2.faa -db file_1_db -outfmt "6 sseqid" -evalue 1e-20 >results1.faa

this gives the exact same results as this first method. Soooo why the duplicates?

fasta • 2.9k views
ADD COMMENTlink modified 10 weeks ago by ryleehackley0 • written 4.7 years ago by JayB30

Have you tried adding "-max_target_seqs 1" to the blastp command?

ADD REPLYlink written 4.7 years ago by Philipp Bayer5.6k

A general remark on this: Do not use Blast tabular format as a primary output format of blast but use outfmt 11 (ASN.1) for long running jobs. This might be subject to debate whether this is such a strict rule, but at least for long running blast jobs, tabular format is an absolute NoGo. Tabular is lossy and you can convert ASN.1 into any other format using blast_formatter including tabular if you need.

ADD REPLYlink modified 21 months ago • written 21 months ago by Michael Dondrup44k
3
gravatar for Michael Dondrup
4.7 years ago by
Bergen, Norway
Michael Dondrup44k wrote:

I think there are duplicates because the same subject produces a hit for multiple queries or the subject has several HSPs for the same query (repeats, gene duplications?). In other words the output of blast should be correct. Please check some alignments in the standard output format to verify and check your input file for duplications.

I suppose an easy hack: piping blast output through unix command uniq should work, e.g.:

cat hits1 | uniq > hits1unique

That will spare you from re-running the first step if it took a lot of time. or:

blastp -query file_2.faa -subject file_1.faa -outfmt "6 sseqid" -evalue 1e-20 | uniq > hits1

Using max_target_seqs alone might not help, because the same subject could still be pulled up by multiple queries as best hit, and still would appear multiple times. Setting -max_target_seqs 1 will produce only the best subject sequence per query (even if there are very similar scoring hits) for including it in subsequent queries; you have to decide if that is what you want. I wouldn't use it but set some sort of score decay threshold or e-value instead.

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Michael Dondrup44k

Well, first of all Thanks SO much Michael, this was very useful. As I tried to express in the original post, I did check that there were no duplicates in the original file so I know the duplication is a result of BLAST. I did go back and check the alignments as you suggested and they seem correct. I discovered that only some of the IDs were being duplicated and others not, so I now believe that this is an artifact of the the BLAST matrix. Some of the sequences are likely to have matches WITHIN other sequences, but are themselves a separate gene, which would explain why increasing the e-value does very little to bring down the number of repeats - that is because they are correct matches. Rather than playing around with the matrix I took your advice and used the sort ... | uniq method. That is working well right now using a small number of files. But if I need to repeat this process to compare a larger number of files, I will look into using a different matrix. THANKS AGAIN!

ADD REPLYlink written 4.7 years ago by JayB30
0
gravatar for ryleehackley
10 weeks ago by
ryleehackley0 wrote:

I wanted to add a caveat (years later) because I haven't seen this mentioned elsewhere, -max_target_seqs 1 will return duplicate results if there are multiple possible alignments of your query to the same subject. Maybe this can be avoided by additional filtering with the initial blast command, but I have been taking the duplicate sseqids and selecting the alignment with the maximum bitscore, pident, or alignment length.

ADD COMMENTlink written 10 weeks ago by ryleehackley0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1351 users visited in the last hour