Failure in confirming BWA mapping with BLASTN
1
0
Entering edit mode
4.7 years ago

Hello,

I aligned my fastq files to a fusion genome made by the human GRCh38 and a virus genome that I called chrV. I extracted the starting position of the reads and the ending point, so that I can extract the sequence of interest with: samtools view -h -q 10 -f67 -F1408 $r_file $r_locus | grep -v @ | cut -f 10 where $r_file is the deduplicated bam file and $r_locus is the string that defines the position of the read, in this case "chrV:254748988-254749089". The string that I get out is: "GTGGGGCCGGTCGGGGTACGGCGTGAACGTGCGCCGCGCTTCATGCGGCGATCTCCAGCGTCCAGCCCCGACCGGGGTCGTGGACGACCCGATGCCCCGCG\n"

which I have saved on a file called query.fa. I then ran command line blastn against the dabase build upon the reference genome, then against the human genome alone and then in the remote nt collection. I was expecting a hit with the reference genome, zero on the second and a hit on the organism identified by the mapping with BWA (in this case Actinoplanes phage phiAsp2). Instead, I got nothing:

$ blastn -db ~/refSeq/fusion/fusion38-10k.fa  -query ~/Downloads/h32/confirm/query.fa 
BLASTN 2.9.0+

Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.

Database: fusion38-10k.fa
           196 sequences; 3,469,986,645 total letters
Query= 
Length=101

***** No hits found *****

Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850 

Effective search space used: 256778620122
Database: fusion38-10k.fa
Posted date:  Aug 13, 2019  3:54 PM
Number of letters in database: 3,469,986,645
Number of sequences in database:  196
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5

$ blastn -db ~/REM_refseq/blastDB/nt -query ~/Downloads/h32/confirm/query.fa 
BLASTN 2.9.0+

Database: Nucleotide collection (nt)
           52,478,804 sequences; 218,532,353,722 total letters

Query= 
Length=101

***** No hits found *****

Lambda      K        H
    1.33    0.621     1.12 
   Gapped
Lambda      K        H
    1.28    0.460    0.850 

Effective search space used: 14962859207586
  Database: Nucleotide collection (nt)
    Posted date:  Jun 15, 2019  6:40 PM
  Number of letters in database: 218,532,353,722
  Number of sequences in database:  52,478,804

Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5

$ blastn -db nt -remote -query ~/Downloads/h32/confirm/query.fa 
BLASTN 2.9.0+

Database: Nucleotide collection (nt)
           53,494,200 sequences; 234,541,331,245 total letters

Query= 
Length=101

RID: N5H8D8TW014

***** No hits found *****
Lambda      K        H
    1.33    0.621     1.12 
    Gapped
Lambda      K        H
    1.28    0.460    0.850 
    Effective search space used: 16089245153517

      Database: Nucleotide collection (nt)
    Posted date:  Aug 11, 2019 12:00 AM
  Number of letters in database: 234,541,331,245
  Number of sequences in database:  53,494,200
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5

Also, it is weird that Query= given that the file contains the entry:

$ cat ~/Downloads/h32/confirm/query.fa 
GTGGGGCCGGTCGGGGTACGGCGTGAACGTGCGCCGCGCTTCATGCGGCGATCTCCAGCGTCCAGCCCCGACCGGGGTCGTGGACGACCCGATGCCCCGCG

In addition, if I run blastn online I get a series of hits (different from the expectation):

                                                                  Score        E     Max
Sequences producing significant alignments:                       (Bits)     Value  ident

CT573213.2 Frankia alni str. ACN14A chromosome, complete sequence  45.5       0.59   96%         
CP000249.1 Frankia casuarinae strain CcI3, complete genome         45.5       0.59   96%         
CP041061.1 Micromonospora sp. HM134 chromosome, complete genome    44.6       0.59   91%         
LR132003.1 Gouania willdenowi genome assembly, chromosome: 3       44.6       0.59   80%         
...

What am I getting wrong? why I can't find the same hit with the command line version?

sequencing next-gen blastn confirmation • 1.1k views
ADD COMMENT
0
Entering edit mode

why I can't find the same hit with the command line version?

I see -db ~/refSeq/fusion/fusion38-10k.fa and -db ~/REM_refseq/blastDB/nt in the two searches. Or are you referring to nt being used locally and via online blast page?

If you click on Algorithm parameters option on online blast page you can see defaults blast uses for additional parameters. You may need to reproduce them locally. Most likely you have to add -task blastn-short.

ADD REPLY
0
Entering edit mode

That is right, fusion38-10k.fa is built on the reference file i used for the BWA alignment, blastDB/nt is a local build of the human genome alone and then -db nt -remote is in remote. I'll try -task blastn-short even if id did not used that option online...

ADD REPLY
0
Entering edit mode

-db nt -remote is in remote

This is entire GenBank so keep that in mind, if

lastDB/nt is a local build of the human genome alone

So they are definitely not the same database.

ADD REPLY
1
Entering edit mode
4.7 years ago
JC 13k

The problem is you are not using the same algorithms. When you use blastn in the command line, the default -task value is "megablast". When you search online if you use the same, you don't get any significant hit, however, if you switch your task to "blastn", it find the hits your reported.

Run your command line as:

blastn -task blastn -db nt -remote -query ~/Downloads/h32/confirm/query.fa

ADD COMMENT
0
Entering edit mode

I tried but:

$ blastn -task blastn -db nt -remote -query ~/Downloads/h32/confirm/query.fa
BLASTN 2.9.0+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.

Database: Nucleotide collection (nt)
           53,555,755 sequences; 234,891,250,153 total letters

Query= 
Length=101

RID: N7CMMPHW015
Error: NCBI C++ Exception:
    T0 "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_560232_130.14.18.6_9008__PrepareRelease_Linux64-Centos_1552331742/c++/compilers/unix/../../src/objects/seq/../seqalign/Score_.cpp", line 90: Error: CScore::C_Value::GetInt(): Invalid choice selection: NCBI-Seqalign::Score.value.real

Is it normal that the query is empty?

Anyway it worked on the local database:

$ blastn -task blastn -db ~/REM_refseq/blastDB/nt -query ~/Downloads/h32/confirm/query.fa
BLASTN 2.9.0+

Database: Nucleotide collection (nt)
           52,478,804 sequences; 218,532,353,722 total letters

Query= 
Length=101
                                                                      Score        E
Sequences producing significant alignments:                          (Bits)     Value

CT573213.2 Frankia alni str. ACN14A chromosome, complete sequence     45.5       0.55 
CP000249.1 Frankia casuarinae strain CcI3, complete genome            45.5       0.55 
LR132003.1 Gouania willdenowi genome assembly, chromosome: 3          44.6       0.55 
...
ADD REPLY

Login before adding your answer.

Traffic: 2495 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6