Blastn command line tab format gives duplicates
4.2 years ago
elesbb91 ▴ 70

When I use the tabular output format, the same hit is reported multiple times, each time with a different % identity and other attributes.

When I use the standard output, the results match what I see in the web-based version of the blastn tool.

How can I create a tabular form of results without getting all these duplicate hits?

blastn command-line • 1.8k views
ADD COMMENT
0

a tabular form of results without getting all these duplicate hits

Are you referring to different HSPs being reported as separate entries?

ADD REPLY
0

You probably haven't set any thresholds or filtering criteria, which is why you are getting so many hits. You can add these options:

-num_alignments 1

And

 -evalue 1e-3

You can replace 3 with any number lower than 300; a larger exponent gives a stricter E-value cutoff.

See the other options in

blastn -help
ADD REPLY
0

No, -num_alignments will still show partial alignments in tabular format.

ADD REPLY
0

Even if there's a global alignment for that query that has a better identity score?

ADD REPLY
1

Yes, it will show the full alignment in sections. For example, if I slightly modify a ferroxidase:

>AAM45881.1 multicopper ferroxidase [Chlamydomonas reinhardtii]
MDSKEAAEPASVHVNVDVEAQKAQAQAEAAAKGGACATSGMSKGKIIVTSLVIFLGVAVGVGLGVGLGVG
LKKDDGSSAYTSLDLGTGSGGGNTYFVAADKIQWNYAPSGRNKCFPPDLAAKYLAMQPGITRVGGTFAKA
IYRAYTDSSFNTLATTPAEWQHLGNVGPVMYGAVGQVIRVVFKNNLDFPVNMAPSGGLIAWDGNGRRSAR
IDPVKPGQTVTYLWQIPEDAGPVANATVTSRLWLYRSSVDPQKHDNAGLVGPIIVTSAANADANGRARDV
DRDVVAIFQLVQERASPLLFQEDTSLTAGTSYTKMAINGYTWCNMPDGAITIKTGERVRWHVASIGSSES
LHNFHWHGHVVELNGHHVDQFTAIPTATYSVNMVPDEPGTWMFHCHVNFHMDGGMVALYTVTGDPAPLPT
GGVERVYYVRAQEVEWSYSGPNNTQACAVPELQFSSEPGSEEVNGNVFLEGPSTDPVRLGHIYTKTLLIE
YTDASFTTVKPRPADEQYLGLLGPVMRANVGDTIKVVLKNDAKIDVSLHPHGVRYSKANEGTLYEDGTSG
CACCACACACACACACACACACACACCACACACCACACACACACACACACCACACACACACACACCACAC
ADKADDVVAPGTTYTYVWNVPDRAGPGPCDPSSMLWMYHSHIDETAETYAGVAGGIIVTAKDMARSTADL
TPKDVDREIVIFFTVVDEIKSSNFMENLANKLGDGGALAAQLAANATEMTALVTDPVFMEHMLKHGINGH
MYCHMPRLTFEQGDKVRLHVMVLGTLEDMHTPNMGGPRFDYNGMHTDSIQISPGGMVSADVQMTSPGDYE
LQCRVADHVMAGMRAKYTVTANASRMVVNPSGVTRTYYIQAEAVNWDYAPAGYQKCTDTDFSYQSSVYLR
RTSYTIGSRYRKAVYRAYTDATFSTRVPTPAYYGTMGPMIIAEVGDRIVVHFKNAVTDLEEYPLNISPGG
GLLVEGAADENCAEVAAGETCVYRWIVPDSSGPGTADFNTAVYGYTSSVDVATAPSAGLAGALVVAGRGQ
LVAGPDGSLLPRGVDLMVPLYWQVVDENSSPFLDLNVEAAQLNVTKFENDAVLSADFDEGNRMHSINGYV
YCNQPLVTIAKGKKLRWVLVAYGTEGDFHSPQFTGQSLEADKSGYSTLASLMPSIARVADMTAADVGTWL
LYCDVHDHYMAGMMSQFAVTAA

(If it is not obvious, I added an "AC" repeat line.)

then I run Blast:

$ blastp -task blastp-fast -db nr -remote -evalue 1e-3 -num_alignments 1 -out blast.tsv -outfmt 6 -query AAM4588.1.fa
$ cat blast.tsv
AAM45881.1      XP_001694585.1  100.000 582     0       0       631     1212    561     1142    0.0     1209
AAM45881.1      XP_001694585.1  99.465  561     3       0       1       561     1       561     0.0     1165
AAM45881.1      XP_001694585.1  31.767  617     343     19      636     1211    212     791     2.71e-65    253
AAM45881.1      XP_001694585.1  31.015  532     286     16      95      572     427     931     6.78e-55    221
AAM45881.1      XP_001694585.1  31.232  349     206     11      871     1210    89      412     4.16e-37    164
AAM45881.1      XP_001694585.1  30.946  349     207     11      89      412     801     1140    1.52e-35    159
ADD REPLY
0

Oh, okay! It had worked for me in the past. I used a dataset of genes and a comprehensive set of genes from bacteria, and using -num_alignments 1 and -evalue 1e-300 I only got one hit for each gene!

ADD REPLY
0

Take a look at my comment. I just don't understand how changing the output format would change the hits too.

ADD REPLY
0

So, this is all great news and stuff. But why, when using standard output (not adding the output parameter), are those other hits not included? If I use -outfmt 7 I get more hits. Like, a heckin load more. It seems odd because the ONLY thing that is changing is the outfmt.

My only guess is that changing outfmt causes other parameters' defaults to change. Does it?

Here is the output:

STANDARD:

BLASTN 2.10.0+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: RNA_HUMAN
           159,998 sequences; 565,721,404 total letters



Query= XM_017006770.1 PREDICTED: Homo sapiens SET domain containing 5
(SETD5), transcript variant X7, mRNA

Length=10666
                                                                      Score        E
Sequences producing significant alignments:                          (Bits)     Value

XM_017006770.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  19697      0.0  
XM_017006784.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  12946      0.0  
XM_017006782.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  12946      0.0  
XR_001740195.2 PREDICTED: Homo sapiens SET domain containing 5 (S...  12490      0.0  
XM_017006785.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  12251      0.0  
XM_017006786.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  12249      0.0  
NM_001292043.1 Homo sapiens SET domain containing 5 (SETD5), tran...  11527      0.0  
NM_001349451.1 Homo sapiens SET domain containing 5 (SETD5), tran...  11250      0.0  
XM_017006775.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  10741      0.0  
XM_017006772.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  10741      0.0  
XM_005265301.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  10741      0.0  
NM_001080517.2 Homo sapiens SET domain containing 5 (SETD5), tran...  10741      0.0  
XM_017006767.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  8506       0.0  
XM_017006773.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  8067       0.0  
XM_024453621.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  6758       0.0  
XM_024453620.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  6758       0.0  
XM_017006783.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  6758       0.0  
XM_017006780.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  6758       0.0  
XM_017006779.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  6758       0.0  
XM_017006778.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  6758       0.0  
XM_017006777.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  6758       0.0  
XM_017006776.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  6758       0.0  
XM_017006774.1 PREDICTED: Homo sapiens SET domain containing 5 (S...  6758       0.0

TAB OUTPUT:

# BLASTN 2.10.0+
# Query: XM_017006770.1 PREDICTED: Homo sapiens SET domain containing 5 (SETD5), transcript variant X7, mRNA
# Database: RNA_HUMAN
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 119 hits found
XM_017006770.1  XM_017006770.1  100.000 10666   0   0   1   10666   1   10666   0.0 19697
XM_017006770.1  XM_017006784.1  100.000 7010    0   0   1   7010    1   7010    0.0 12946
XM_017006770.1  XM_017006782.1  100.000 7010    0   0   1   7010    1   7010    0.0 12946
XM_017006770.1  XM_017006782.1  100.000 999 0   0   7008    8006    7065    8063    0.0 1845
XM_017006770.1  XR_001740195.2  100.000 6763    0   0   1   6763    1   6763    0.0 12490
XM_017006770.1  XR_001740195.2  100.000 2260    0   0   6761    9020    6858    9117    0.0 4174
XM_017006770.1  XM_017006785.1  99.970  6640    2   0   1   6640    1   6640    0.0 12251
XM_017006770.1  XM_017006785.1  100.000 90  0   0   6631    6720    6744    6833    2.56e-38    167
XM_017006770.1  XM_017006786.1  99.970  6639    2   0   1   6639    1   6639    0.0 12249
XM_017006770.1  XM_017006786.1  100.000 59  0   0   6631    6689    6695    6753    4.37e-21    110
XM_017006770.1  NM_001292043.1  100.000 6242    0   0   4425    10666   857 7098    0.0 11527
XM_017006770.1  NM_001292043.1  100.000 750 0   0   1   750 1   750 0.0 1386
ADD REPLY
0

For example, XM_017006782.1 is listed twice in tab format but NOT in standard format.
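
One quick way to see which subjects are reported more than once is to count tabular rows per query/subject pair. This is only a minimal sketch: it assumes the default 12-column tabular layout shown above, splits on whitespace, and uses a hypothetical helper name (`hsps_per_subject`) that is not part of BLAST.

```python
from collections import Counter

# A few rows copied from the tabular output in this thread.
TABULAR = """
XM_017006770.1  XM_017006770.1  100.000 10666   0   0   1   10666   1   10666   0.0 19697
XM_017006770.1  XM_017006784.1  100.000 7010    0   0   1   7010    1   7010    0.0 12946
XM_017006770.1  XM_017006782.1  100.000 7010    0   0   1   7010    1   7010    0.0 12946
XM_017006770.1  XM_017006782.1  100.000 999 0   0   7008    8006    7065    8063    0.0 1845
XM_017006770.1  XR_001740195.2  100.000 6763    0   0   1   6763    1   6763    0.0 12490
XM_017006770.1  XR_001740195.2  100.000 2260    0   0   6761    9020    6858    9117    0.0 4174
"""


def hsps_per_subject(text):
    """Count tabular rows for each (query, subject) pair (hypothetical helper)."""
    counts = Counter()
    for line in text.splitlines():
        # Skip blank lines and the '#' comment lines that -outfmt 7 adds.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        counts[(fields[0], fields[1])] += 1
    return counts


counts = hsps_per_subject(TABULAR)
for (query, subject), n in counts.items():
    print(subject, n)
```

Any subject with a count above 1 has several rows for the same query, which is what looks like a duplicate in the table.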

ADD REPLY
1

It is listed once in the summary, but both HSPs (high-scoring segment pairs) are displayed when you scroll down to the pairwise alignments in HTML format. Since alignments are not shown in format 7, you get two entries for the two HSPs.
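
If one row per subject is what you want, one option is to post-filter the tabular file and keep only the best-scoring HSP for each query/subject pair. The sketch below assumes the default 12-column layout with the bit score in the last column; `best_hsp_per_pair` is a made-up name for illustration. BLAST+ also has a -max_hsps option that limits the number of HSPs saved per subject, which may be simpler if your version supports it.

```python
# Rows copied from the tabular output earlier in the thread.
ROWS = """
XM_017006770.1  XM_017006785.1  99.970  6640    2   0   1   6640    1   6640    0.0 12251
XM_017006770.1  XM_017006785.1  100.000 90  0   0   6631    6720    6744    6833    2.56e-38    167
XM_017006770.1  NM_001292043.1  100.000 6242    0   0   4425    10666   857 7098    0.0 11527
XM_017006770.1  NM_001292043.1  100.000 750 0   0   1   750 1   750 0.0 1386
"""


def best_hsp_per_pair(lines):
    """Keep the highest bit-score row for each (query, subject) pair."""
    best = {}
    for line in lines:
        # Skip blank lines and '#' comment lines from -outfmt 7.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        key = (fields[0], fields[1])
        bitscore = float(fields[11])  # assumption: column 12 is the bit score
        if key not in best or bitscore > best[key][0]:
            best[key] = (bitscore, line.strip())
    # Dicts keep insertion order, so subjects stay in their original order.
    return [row for _, row in best.values()]


kept = best_hsp_per_pair(ROWS.splitlines())
for row in kept:
    print(row)
```

Note that this throws away real information: the shorter HSPs are genuine alignments, not noise.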

ADD REPLY
0

In standard output, I do not see both alignments in either the summary or the pairwise alignments. This goes for the HTML version too. Even with the link you provided, I am only seeing one hit for XM_017006782.1, in the summary as well as in the alignments view. I seriously am so lost xD

NVM I see them. So they are just matches that were found in different areas of the query?

ADD REPLY
2

You either need to scroll down a ways or use the Next Match link that you see in the screenshots below.

They are called high-scoring segment pairs (HSPs).

[Two screenshots, dated 2020-02-20, of the web BLAST alignment view]

ADD REPLY
0

I think that in standard format and in online BLAST, it outputs both alignments but only shows the max score and total score in the summary. For example, in online blastn, if you align XM_017006770.1 and XM_017006782.1 you'll get

PREDICTED: Homo sapiens SET domain containing 5 (SETD5), transcript variant X26, mRNA   12946   14792   75% 0.0 100.00% XM_017006782.1

That is the max score and the total score, and there are two different alignments, one from 1 to 7010 and one from 7008 to 8006 (with the same coordinates and scores as in your table). I believe you should have two alignments in your standard format too.

I'm trying to say that the hits are not different; only the format is different.

Query  6961  TCACATCTCTTACTACTGCTAGTCGCTGCAACACTCCTCTACAGTTTGAG  7010
             ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  6961  TCACATCTCTTACTACTGCTAGTCGCTGCAACACTCCTCTACAGTTTGAG  7010


Range 2: 7065 to 8063
Alignment statistics for match #2
Score   Expect  Identities  Gaps    Strand
1845 bits(999)  0.0 999/999(100%)   0/999(0%)   Plus/Plus
Query  7008  GAGCTTTGTCACCGAAAAGACCTGGATTTGGCAAAAGTAGGATACCTTGACTCCAACACT  7067
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  7065  GAGCTTTGTCACCGAAAAGACCTGGATTTGGCAAAAGTAGGATACCTTGACTCCAACACT  7124
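
The summary numbers above can be roughly recomputed from the two tabular rows for XM_017006782.1. This is a sketch under a few assumptions: the max score is taken as the best HSP bit score, the total score as the sum of the HSP bit scores, and the query cover as the fraction of the query covered by the merged HSP intervals. The function name `summarize` is hypothetical. The sum of the rounded bit scores comes out one unit below the 14792 shown online, presumably because the per-HSP bit scores in the table are rounded.

```python
QUERY_LENGTH = 10666  # "Length=10666" in the standard output quoted earlier

# (q. start, q. end, bit score) for the two XM_017006782.1 rows in the table.
HSPS = [(1, 7010, 12946), (7008, 8006, 1845)]


def summarize(hsps, query_length):
    """Return (max score, total score, query cover in percent)."""
    max_score = max(score for _, _, score in hsps)
    total_score = sum(score for _, _, score in hsps)

    # Merge overlapping or adjacent query intervals so that shared
    # positions are counted only once.
    covered = 0
    cur_start, cur_end = None, None
    for start, end, _ in sorted(hsps):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1

    return max_score, total_score, 100.0 * covered / query_length


max_score, total_score, cover = summarize(HSPS, QUERY_LENGTH)
print(max_score, total_score, round(cover))
```

So a single summary line and two tabular rows describe the same pair of alignments.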
ADD REPLY
0

Yes, I do not have two alignments in my standard format. I only see the two in the tabular format. This is the whole problem lol. I have no idea why. I also do not see both alignments in the online blast either, even if I download the results.

ADD REPLY
0

So I figured it out. Thank you guys. The second hits were partial alignments of the same strand as JC posted. So I understand now the differences. Thank you everyone!

ADD REPLY
5
4.2 years ago
JC 13k

Those are not duplicates, those are partial alignments.

ADD COMMENT
