Why Is Blast Creating Duplicates In My Output Files?!
JayB ▴ 50 · 10.3 years ago

Wow, my first post on BioStars. I have been browsing through the archives for a while now, but the time has come where I can't find the answer I need and I have to actually ASK. Thanks in advance for all your advice!

So I am trying to compare 4 amino acid fasta files (.faa) to find the common set of proteins. I have a local database with all the protein sequences which I use to retrieve the sequences and convert the BLAST output into fasta format. Here's what I have so far:

Step 1. BLAST file_2 against file_1, outputting to a new file "hits1":

   blastp -query file_2.faa -subject file_1.faa  -outfmt "6 sseqid" -evalue 1e-20 >hits1

Step 2. I then take the output file, which contains only the IDs, and convert it to fasta format:

blastdbcmd -db local_db -dbtype prot -entry_batch hits1 -outfmt %f -out results1.faa

Step 3. Then I can take this file and compare it to file_3. Repeat steps 2 and 3 until all the files are compared...

blastp -query file_3.faa -subject results1.faa -outfmt "6 sseqid" -evalue 1e-20 > hits2

The problem I am having is that with each round of BLAST, duplicates of each ID are created. For example, in file_1 (and in the database) the gene ID "003375649" is present only once, but the hits1 file (after the first round of BLAST) contains this ID twice, and after the second round (hits2) the output contains it four times. This means that although the original file contains only 3000 protein sequences, the hits2 file already contains >42,000 sequences and the BLAST takes forever to run! I want to know how to solve this problem, but more importantly, for my general knowledge, I am interested in why the program is creating duplicates.
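A quick way to see how often each ID actually shows up in the hits file (a sketch using standard sort/uniq/head; hits1 is the one-column ID list from step 1):

    # count occurrences of every subject ID and list the most duplicated ones first
    sort hits1 | uniq -c | sort -rn | head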

Here's what I have tried:

a. Making the e-value cutoff more stringent (1e-30); this does not help.

b. Creating local databases from the .faa files themselves, rather than one larger one, and running BLAST like so:

makeblastdb -in file_1.faa -dbtype prot -out file_1_db -parse_seqids
blastp -query file_2.faa -db file_1_db -outfmt "6 sseqid" -evalue 1e-20 >results1.faa

This gives the exact same results as the first method. So why the duplicates?

Have you tried adding "-max_target_seqs 1" to the blastp command?
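On the first step that would look something like this (a sketch; -max_target_seqs is a standard blastp option, the file names are from the question):

    # report at most one subject sequence per query
    blastp -query file_2.faa -subject file_1.faa -outfmt "6 sseqid" -evalue 1e-20 \
        -max_target_seqs 1 > hits1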

A general remark on this: do not use BLAST tabular format as the primary output format; use -outfmt 11 (ASN.1) instead, at least for long-running jobs. Whether this should be a strict rule is open to debate, but for long-running BLAST jobs tabular is an absolute no-go: it is lossy, whereas ASN.1 can be converted into any other format afterwards with blast_formatter, including tabular if you need it.
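Applied to the first step of the question it would look roughly like this (a sketch; blast_formatter, -archive and -outfmt 11 are standard BLAST+ features, the file names follow the question):

    # keep the full ASN.1 archive, then derive the tabular ID list from it
    blastp -query file_2.faa -subject file_1.faa -evalue 1e-20 -outfmt 11 -out hits1.asn
    blast_formatter -archive hits1.asn -outfmt "6 sseqid" > hits1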

Michael 54k · 10.3 years ago

I think there are duplicates because the same subject produces a hit for multiple queries, or the subject has several HSPs for the same query (repeats, gene duplications?). In other words, the output of BLAST should be correct. Please check some alignments in the standard output format to verify this, and check your input file for duplicates.

I suppose an easy hack: piping the BLAST output through the Unix commands sort and uniq should work (uniq only collapses adjacent duplicates, so sort first), e.g.:

sort hits1 | uniq > hits1unique

That spares you from re-running the first step if it took a long time. Or, in one go:

blastp -query file_2.faa -subject file_1.faa -outfmt "6 sseqid" -evalue 1e-20 | sort | uniq > hits1

Using max_target_seqs alone might not help, because the same subject could still be pulled up as the best hit by multiple queries and would then still appear multiple times. Setting -max_target_seqs 1 keeps only the best subject sequence per query (even when other hits score almost as well) for inclusion in subsequent queries; you have to decide whether that is what you want. I wouldn't use it, but would instead set some sort of score-decay threshold or an e-value cutoff.
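As a sketch of such a score-decay filter (this assumes the tabular output also carries the query ID and bit score, e.g. -outfmt "6 qseqid sseqid bitscore", and the 90% cutoff is just a placeholder):

    # keep hits whose bit score is within 90% of the best score for that query;
    # sorting puts the best hit of each query first
    sort -t$'\t' -k1,1 -k3,3gr hits1 \
      | awk -F'\t' '{ if ($1 != q) { q = $1; best = $3 } if ($3 >= 0.9 * best) print }' \
      > hits1.filtered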

Well, first of all, thanks SO much Michael, this was very useful. As I tried to express in the original post, I did check that there were no duplicates in the original file, so I know the duplication is a result of BLAST. I went back and checked the alignments as you suggested and they seem correct. I discovered that only some of the IDs were being duplicated and others not, so I now believe this is an artifact of the BLAST scoring matrix: some sequences are likely to have matches WITHIN other sequences while being a separate gene themselves, which would explain why making the e-value more stringent does very little to bring down the number of repeats - those are correct matches. Rather than playing around with the matrix, I took your advice and used the sort ... | uniq method. That is working well right now with a small number of files. If I need to repeat this process to compare a larger number of files, I will look into using a different matrix. THANKS AGAIN!

ryleehackley ▴ 10 · 5.8 years ago

I wanted to add a caveat (years later) because I haven't seen this mentioned elsewhere: -max_target_seqs 1 will still return duplicate results if there are multiple possible alignments of your query to the same subject. Maybe this can be avoided by additional filtering in the initial blast command, but I have been taking the duplicate sseqids and selecting the alignment with the maximum bitscore, pident, or alignment length.
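One way to do that selection, as a sketch (this assumes the default -outfmt 6 column order, where the subject ID is column 2 and the bit score is column 12):

    # keep, for each subject ID, only the alignment with the highest bit score
    sort -t$'\t' -k2,2 -k12,12gr blast.output.txt \
      | awk -F'\t' '!seen[$2]++' > blast.best_per_subject.txt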

I have recently started doing a similar type of work and have noticed the same thing. What I do is a post-filtering step where I sort by lowest e-value first and then keep only the unique IDs.

What commands do you use to do that? So far I have awk -F'\t' '{print $1}' blast.output.txt | sort -u, which extracts the unique values of the first column of the BLAST output, assuming that column is stitle.

Assuming that the output format was -outfmt "7 stitle qacc saccver staxids sscinames pident length mismatch gapopen qstart qend sstart send evalue bitscore". This should work:

grep -v "^#" blast.output.txt | sort -t$'\t' -k14,14 | sort -u -t$'\t' -k1,1

grep -v "^#" blast.output.txt: removes the pound signs (keep everything except the lines starting with #)

sort -t$'\t' -k14,14: sort column 14 (-k14,14, evalue in my case), with tab separated file -t$'\t'

sort -u -t$'\t' -k1,1: sort column 1 (-k1,1, stitle in my case), with tab separated file -t$'\t', but keep unique only -u.

I feel there should be a better solution...
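For what it is worth, a single-pass alternative could look like this (same assumed layout: stitle in column 1, evalue in column 14; whether it is better is a matter of taste):

    # keep, for each stitle, the line with the smallest e-value; output order is arbitrary
    grep -v "^#" blast.output.txt \
      | awk -F'\t' '!($1 in best) || $14 < best[$1] { best[$1] = $14; line[$1] = $0 }
                    END { for (k in line) print line[k] }' > blast.best_per_title.txt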
