NcbiblastpCommandline alignment results are different from blast webpage
4
0
Entering edit mode
11 weeks ago

Hi I used this code to blast alignment between my fasta files.

output = NcbiblastpCommandline(query=seq1, subject=seq2, outfmt=0)()[0]

But the results are completely different from what I obtain from blast webpage. (below link)

https://blast.ncbi.nlm.nih.gov/Blast.cgi

for example the alignment results between A0A023W3H0.fasta and A0A0H2ZGX7.fasta files with webpage is :

Identities =44/756(6%), Positives = 73/756(9%), Gaps = 529/756(69%)

But with my code is

Identities = 6/14 (43%), Positives = 8/14 (57%), Gaps = 0/14 (0%)

What should I do? Thanks

0
Entering edit mode

Any another suggestion? Anybody can help me?

0
Entering edit mode

So this issue is not fixed? Why did you accept the answer from Mensur? Accepting an answer indicates that your problem has been fixed.

You will need to provide the source files (fasta files you are using) for people to try and reproduce the problem. Use pastebin.com to post sequences.

0
Entering edit mode

Unfortunately the problem still exists. Excuse me. I though accepting answer means I read and tried it! However, I posted the link of two of fasta file to easily download them from uniprot site. The links are:

I tried all codes with my sequences and no changes happened in the results. Even I tried to change format of sequences in order to fix the problem but I didn't succeed. What should I do? Thanks

0
Entering edit mode

No. Accepting an answer means it actually solved the problem described in original post.

4
Entering edit mode
10 weeks ago
Mensur Dlakic ★ 20k

What you are trying to do is fairly simple, and you are complicating it by: 1) not providing your sequences so that someone can reproduce your attempt; 2) giving a result in a form that is impossible to read. Be honest, can you make any sense of the result you posted above?

Here are my two sequences:

::::::::::::::
p13_hs.fas
::::::::::::::
>NP_001018025.1 protein p13 MTCP-1 [Homo sapiens]
MAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQQIQVPLGDAARPSHLLTSQLPLMWQ
LYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD
::::::::::::::
p13_mi.fas
::::::::::::::
>XP_019483182.1 PREDICTED: protein p13 MTCP-1 [Hipposideros armiger]
MSGEDVGPPPDHLWVHQEGIYRDEYQRTWVAVLEEDTNFLRARVQQVQVPLGDAARPSHLLTSQLPLMWQ
LYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD


After I paste them in, I select all the lines I just pasted, and click on the button above that says 101/010 and they get formatted in a way that is legible.

Here is my code, and I will also format it as above so it can be read:

from Bio.Blast.Applications import NcbiblastpCommandline
from Bio import SeqIO

fasta_file1 = 'p13_hs.fas'
fasta_file2 = 'p13_mi.fas'
output = NcbiblastpCommandline(query=fasta_file1, subject=fasta_file2, outfmt=0)()[0]
print(output)


And finally here is the output from running the above script:

BLASTP 2.11.0+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.

Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.

Database: User specified sequence set (Input: p13_mi.fas).
1 sequences; 107 total letters

Query= NP_001018025.1 protein p13 MTCP-1 [Homo sapiens]

Length=107
Score     E
Sequences producing significant alignments:                          (Bits)  Value

XP_019483182.1 PREDICTED: protein p13 MTCP-1 [Hipposideros armiger]   210     1e-77

> XP_019483182.1 PREDICTED: protein p13 MTCP-1 [Hipposideros armiger]
Length=107

Score = 210 bits (535),  Expect = 1e-77, Method: Compositional matrix adjust.
Identities = 101/107 (94%), Positives = 106/107 (99%), Gaps = 0/107 (0%)

Query  1    MAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQQIQVPLGDAARPSHL  60
M+GEDVG PPDHLWVHQEGIYRDEYQRTWVAV+EE+T+FLRARVQQ+QVPLGDAARPSHL
Sbjct  1    MSGEDVGPPPDHLWVHQEGIYRDEYQRTWVAVLEEDTNFLRARVQQVQVPLGDAARPSHL  60

Query  61   LTSQLPLMWQLYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD  107
LTSQLPLMWQLYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD
Sbjct  61   LTSQLPLMWQLYPEERYMDNNSRLWQIQHHLMVRGVQELLLKLLPDD  107

Lambda      K        H        a         alpha
0.321    0.136    0.431    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
0.267   0.0410    0.140     1.90     42.6     43.6

Effective search space used: 9025

Database: User specified sequence set (Input: p13_mi.fas).
Posted date:  Unknown
Number of letters in database: 107
Number of sequences in database:  1

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40


So if something like this doesn't work for you, it is either because the sequences are not formatted properly (we don't know, because you never showed us), or maybe your BioPython version is not recent (we don't know, you never told us). Other than that, the script should do what is expected.

0
Entering edit mode

>tr|A0A0E1R3H3|A0A0E1R3H3_LISMN Penicillin-binding protein 3
OS=Listeria monocytogenes serotype  4b str. LL195 OX=1230340 GN=pbpC PE=3 SV=1
MKGCIMASYGGKKRNNKKAIIIGSIAAVAVLAGIAIYFFIQNQHKDEKNALAAAETFTSN
IAKEKYDKLGSSVSSESLKKVEFTKKEMEDKYQAVYDGIGAKDIKVKNLKSVYDDKENKF
NLTYELEMRTSLGKLATQKYKTTISKQDDDWKIDWKPALIFPGMVKTDKVRITEDEAERG
GDVIGKSGLERYYDKQLRGKNGGAIKIINDQTKQEDTLQKIDKKDGEEIKLTIDAAVQKK
AFDSLGSETGAVTMINPTNGELLALVSTPSYDANQMVLGITSEDYAKYNDDKRLPFLARY
ANRYAPGSTFKTITATIGLDTGVTKPDKVREISGLQWQKDASWGKYFVTRVHDVPKVNMT
TSYGQGELLMSPIQQAIAYSAIASDGKMPYPKLDTKEKAGKETQATEAASANQVKAALVK
VKGRGGSGLVIDKLKPVIESMYK


>tr|A0A0H2UQB5|A0A0H2UQB5_STRPN LicC protein
OS=Streptococcus pneumoniae serotype 4 (strain ATCC BAA-334 / TIGR4)
OX=170187 GN=licC PE=1 SV=1
MKAIILAAGLGTRLRPMTENTPKALVQVNQKPLIEYQIEFLKEKGINDIIIIVGYLKEQF
SVYREDCTNEWFLVYGDDYKVQDIIVDSKAGRILSGVSFWDAPTAEKIVSFIDKAYVSGE
FVDLYWDNMVKDNIKELDVYVEELEGNSIYEIDSVQDYRKLEEILKNEN


And code is:

output = NcbiblastpCommandline(query=A0A0E1R3H3_FileAdress, subject=A0A0H2UQB5_fileAdress, outfmt=0)()[0]


and I tried this :

output =NcbiblastpCommandline(query=A0A0E1R3H3_FileAdress, subject=A0A0H2UQB5_fileAdress, ungapped=False)()


However the results with code is:

 Query= tr|A0A023W3H0|A0A023W3H0_KLEPN Class D OXA-48 carbapenemase (Fragment)
OS=Klebsiella pneumoniae OX=573 GN=OXA-48 PE=4 SV=1

Length=145
Score  (Bits)       E Value
Sequences producing significant alignments:  13.5             6.0

tr|O08355|O08355_PSEFL Mannitol dehydrogenase OS=Pseudomonas fluo...

> tr|O08355|O08355_PSEFL Mannitol dehydrogenase OS=Pseudomonas  fluorescens
OX=294 GN=mtlD PE=1 SV=1 Length=493

Score = 13.5 bits (23),  Expect = 6.0, Method: Composition-based stats.
Identities = 11/31 (35%), Positives = 12/31 (39%), Gaps = 3/31 (10%)

Query  105  RIVKQAMLTEANGDYIIRAKTGYSTRIEPKI  135
RIV    LT   G Y I    G      P+I
Sbjct  125  RIVS---LTITEGGYCIDDSNGEFMAHLPQI  152

Score = 13.1 bits (22),  Expect = 6.6, Method: Composition-based stats.
Identities = 3/6 (50%), Positives = 4/6 (67%), Gaps = 0/6 (0%)

Query  132  EPKIGW  137
EP + W Sbjct  259  EPFVQW  264

Lambda      K        H         a      alpha
0.320    0.135    0.421    0.792     4.96

Gapped Lambda   K        H     a      alpha   sigma
0.267   0.0410   0.140   1.90     42.6     43.6

Effective search space used: 56160

0
Entering edit mode
10 weeks ago

There is a length discrepancy between the 2 methods. The online option is suggesting that the sequence is 756 whereas the other one is suggesting 14. I would investigate the seq1 and seq2 objects in python to make sure they are of the correct length and also the input fasta files. Hope this helps.

0
Entering edit mode

Thank you very much for replying. Of course I noticed this difference and tried to fix it but I couldn't. In offline mode it shows the real length of two sequences but unfortunately use a short length for alignment. what can I do? thanks

0
Entering edit mode
10 weeks ago
Mensur Dlakic ★ 20k

It appears that your local command is creating an ungapped alignment, while the remote will always create a gapped alignment by default. See if this command helps:

output = NcbiblastpCommandline(query=seq1, subject=seq2, ungapped=False, outfmt=0)()[0]


By the way, you would have made this easier to troubleshoot by posting alignments from both commands.

0
Entering edit mode

Thank you Mensur Dlakic for response. I tried it but the results didn't change properly.

For Example for A0A023W3H0.fasta and A0A0H2UQB5.fasta the Web Results :

Query: tr|A0A023W3H0|A0A023W3H0_KLEPN Class D OXA-48 carbapenemase (Fragment) OS=Klebsiella pneumoniae
OX=573 GN=OXA-48 PE=4 SV=1 Query ID: lcl|Query_551911 Length: 145

>tr|A0A0H2UQB5|A0A0H2UQB5_STRPN LicC protein
OS=Streptococcus pneumoniae serotype 4 (strain ATCC BAA-334 / TIGR4)
OX=170187 GN=licC PE=1 SV=1 Sequence ID: Query_551913 Length: 229 Range 1: 1 to 229

NW Score = -109,Identities = 28/236(12%), Positives = 62/236(26%), Gaps =98/236(41%)


whereas the code result:

 BLASTP 2.13.0+
Query= tr|A0A023W3H0|A0A023W3H0_KLEPN Class D OXA-48 carbapenemase
(Fragment) OS=Klebsiella pneumoniae OX=573 GN=OXA-48 PE=4 SV=1

> tr|A0A0H2UQB5|A0A0H2UQB5_STRPN LicC protein OS=Streptococcus pneu.. 14.6    1.1

> tr|A0A0H2UQB5|A0A0H2UQB5_STRPN LicC protein
OS=Streptococcus  pneumoniae serotype 4 (strain ATCC BAA-334 / TIGR4)
OX=170187  GN=licC    PE=1 SV=1 Length=229

Score = 14.6 bits (26),  Expect = 1.1, Method: Compositional matrix adjust.
Identities = 9/23 (39%), Positives = 13/23 (57%), Gaps = 4/23 (17%)


Any other suggestion? Thanks

0
Entering edit mode

laila.jafari : Please do not use " button in edit bar when formatting parts of post that needs to stay mono-spaced or aligned. Your post is impossible to format (I tried it) and as posted it is difficult to see what the result looks like. Please edit the post above and use 101010 button to format the text.

0
Entering edit mode

Hi, I edited the whole text. Hope the problem is fixed.

0
Entering edit mode

Not it is not. Compare your post to the one from Mensur. Even the result that you posted needs to be formatted with 101 button.

0
Entering edit mode

Ok, thank you very much. I am beginner and it is my first post in this webpage. so I apologize you for this mistakes.

0
Entering edit mode

No problem. I think there is something about the way you copy and pasted the results. For example this sections looks like

> tr|O08355|O08355_PSEFL Mannitol dehydrogenase OS=Pseudomonas  fluorescens OX=294 GN=mtlD PE=1 SV=1 Length=493

Score = 13.5 bits (23),  Expect = 6.0, Method: Composition-based stats.  Identities = 11/31 (35%), Positives = 12/31 (39%), Gaps = 3/31 (10%)

Query  105  RIVKQAMLTEANGDYIIRAKTGYSTRIEPKI  135
RIV    LT   G Y I    G      P+I Sbjct  125  RIVS---LTITEGGYCIDDSNGEFMAHLPQI  152

Score = 13.1 bits (22),  Expect = 6.6, Method: Composition-based stats.  Identities = 3/6 (50%), Positives = 4/6 (67%), Gaps = 0/6 (0%)

Query  132  EPKIGW  137
EP + W Sbjct  259  EPFVQW  264


Where as it needs to look like this to see the alignments properly.

> tr|O08355|O08355_PSEFL Mannitol dehydrogenase OS=Pseudomonas  fluorescens OX=294 GN=mtlD PE=1 SV=1 Length=493

Score = 13.5 bits (23),  Expect = 6.0, Method: Composition-based stats.  Identities = 11/31 (35%), Positives = 12/31 (39%), Gaps = 3/31 (10%)

Query  105  RIVKQAMLTEANGDYIIRAKTGYSTRIEPKI
135  RIV    LT   G Y I    G      P+I
Sbjct  125  RIVS---LTITEGGYCIDDSNGEFMAHLPQI  152

Score = 13.1 bits (22),  Expect = 6.6, Method: Composition-based stats.  Identities = 3/6 (50%), Positives = 4/6 (67%), Gaps = 0/6 (0%)
Query  132  EPKIGW
137  EP + W
Sbjct  259  EPFVQW  264

0
Entering edit mode

Thank you for your help. Hope all mistakes fixed.

0
Entering edit mode
9 weeks ago
Mensur Dlakic ★ 20k

I admire your persistence, but really: is this so important that two weeks later you still can't let go and move on to the next problem?

But since you are so persistent, here is a solution to your problem. When you want to do 2 sequence alignment with BLAST, this is the link. When you go there, paste your two sequences and increase the E-value to 10, you will get exactly the same alignment as with the local blast version.

What you have linked above is a global alignment using Needleman-Wunsch algorithm, which is different from what BLAST does ("LA" in BLAST stands for local alignment). That is why the results are different because your local BLAST copy is doing a local alignment, and the link above is doing a global alignment.

0
Entering edit mode

Thank you for your appreciate. I persist because I need it's results for my PhD project. I know I can use this link but I need about 3500,000 alignments that needs a code to do it! Even with web-scarping, it takes about one year to get results! So I need offline code. I thought BLAST is global! I need global alignment and I couldn't find proper code that can get me the results of the second link.

3
Entering edit mode

I need global alignment and I couldn't find proper code that can get me the results of the second link.

You can use the program needle from EMBOSS suite. There is another program called stretcher in EMBOSS suite that can also be used. You can download EMBOSS suite locally.

ggsearch program from FASTA suite (LINK) (not to be confused with Fasta format) generates global alignments. You can download FASTA suite.

0
Entering edit mode