Custom BLAST database with mask
3.9 years ago

Hi,

I am new to bioinformatics, but I have a dataset that I want to analyze. I want to build a custom BLAST database following the method described in a paper. Here is how the authors constructed their database:

"A viral genomic sequence database was constructed as follows. First, of the viral sequences, sequence regions that highly resemble the human or bacterial genomes were masked (i.e., replaced by the sequence “NNN …”). The sequence regions to be masked were determined by a local sequence similarity search using BLASTn (in BLAST+ version 3.9.0) [77]. The word size and E value parameters were set at 11 and 1.0e−3, respectively. As sources of the human and bacterial genome sequences, the human reference genome (GRCh38/hg38) and the prokaryotic representative genomes were used, respectively. "

I want to construct a similar database, but with bacterial RefSeq sequences. How can I obtain the masking data to make my own database?

alignment • 1.1k views

We may assume that

BLAST+ version 3.9.0

is a typo? (Or did you copy it from the paper?) The most recent version of BLAST is 2.10.1.


I copied it from the paper. I assume it is a typo.


How can I obtain the masking data to make my own database?

That you will most likely have to do yourself. No such data is going to be readily available.


Yeah, I know. I plan to construct my own database. I am just confused about how I can mask the bacterial RefSeq sequences using the human genome as a source.


I am just confused how can I mask the bacterial refseq sequences using human genome as a source.

You are masking viral genomes, not bacterial ones, if I read the excerpt in your original post correctly.

First, of the viral sequences, sequence regions that highly resemble the human or bacterial genomes were masked

As they describe, you will take the human and bacterial (representative) genomes and BLAST them against your viral database using the thresholds given in the excerpt (word size 11, E-value 1.0e-3). You will then mask any sequence regions that show a hit in your viral genome.
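The masking step can be sketched in a few lines of Python. This is only a rough illustration, not the paper's pipeline: it assumes the search was run with `blastn ... -outfmt 6` (default 12 tab-separated columns, with `sstart` and `send` in columns 9 and 10) and that the viral genomes were the BLAST subject. The function names and the toy data are made up for the example.

```python
from collections import defaultdict

def subject_intervals(blast_tab_lines):
    """Collect 0-based, half-open hit intervals per subject sequence
    from BLAST tabular output (-outfmt 6, default column order)."""
    hits = defaultdict(list)
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        sseqid, sstart, send = cols[1], int(cols[8]), int(cols[9])
        # Minus-strand hits are reported with sstart > send, so normalise.
        lo, hi = min(sstart, send), max(sstart, send)
        hits[sseqid].append((lo - 1, hi))  # BLAST coordinates are 1-based, inclusive
    return hits

def mask(seq, intervals):
    """Replace every interval in seq with N."""
    chars = list(seq)
    for start, end in intervals:
        chars[start:end] = "N" * (end - start)
    return "".join(chars)

# Toy data (hypothetical): one viral sequence and two hits against it.
viral = {"virus1": "ACGTACGTACGTACGTACGT"}
tab = [
    "chr1\tvirus1\t98.0\t5\t0\t0\t100\t104\t3\t7\t1e-5\t50.0",
    "chr2\tvirus1\t97.0\t4\t0\t0\t200\t203\t15\t12\t1e-4\t40.0",
]
hits = subject_intervals(tab)
masked = {name: mask(seq, hits.get(name, [])) for name, seq in viral.items()}
print(masked["virus1"])  # ACNNNNNTACGNNNNTACGT
```

For real genomes you would read the sequences from FASTA and write the masked records back out before running `makeblastdb` on them.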

One additional way is to use bbmask.sh from BBTools as described here. You would get your human and bacterial genomes, shred them (create fake reads), and align those reads to a database of viral genomes. You would then mask all regions of the viral genomes that show a hit.
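To make "shredding" concrete, here is a minimal pure-Python sketch of the idea: cut a genome into fixed-length, overlapping fake reads. It is not how BBTools implements it, and the `shred` function and its `read_len` and `step` parameters are names invented for this example.

```python
def shred(seq, read_len=100, step=50):
    """Yield (start, read) pairs covering seq with overlapping windows.

    start is the 0-based offset of the read in the source sequence.
    A final window is added so the end of the sequence is always covered.
    """
    if read_len <= 0 or step <= 0:
        raise ValueError("read_len and step must be positive")
    if not seq:
        return
    if len(seq) <= read_len:
        yield 0, seq
        return
    last_start = len(seq) - read_len
    for start in range(0, last_start + 1, step):
        yield start, seq[start:start + read_len]
    if last_start % step != 0:
        # The regular grid stopped short of the end; add one last read.
        yield last_start, seq[last_start:]

# Toy example with a 10-base "genome"
reads = list(shred("ACGTACGTAC", read_len=4, step=3))
print(reads)  # [(0, 'ACGT'), (3, 'TACG'), (6, 'GTAC')]
```

Keeping the start offset with each read makes it easy to trace a hit back to where it came from in the source genome.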


You are masking viral genomes not bacterial

Not quite. They already provide the masked viral database. I want to build my own bacterial database using their method.

You will then mask any sequence regions that show a hit in your viral genome.

So, what I get from your explanation is that I build a human database and BLAST my bacterial sequences against it. Then I extract the bacterial sequences that show a hit and mask them. But how can I specify the exact region that should be masked based only on a BLAST hit? The approach you showed would probably answer this question, but is it possible to do it with the method described in that paper?
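On the exact-region part: BLAST tabular output (`-outfmt 6`) reports the coordinates of every hit, so with the bacterial sequences as the query, the region to mask is given by `qstart` and `qend` (columns 7 and 8 in the default layout). A rough sketch, assuming that layout, which merges overlapping hits and prints BED-style intervals; the helper names and the toy hits are made up for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def query_hits_to_bed(blast_tab_lines):
    """Turn query coordinates of BLAST hits into merged BED lines
    (0-based start, exclusive end)."""
    per_query = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        qseqid, qstart, qend = cols[0], int(cols[6]), int(cols[7])
        lo, hi = min(qstart, qend), max(qstart, qend)
        per_query.setdefault(qseqid, []).append((lo - 1, hi))
    bed = []
    for qseqid in sorted(per_query):
        for start, end in merge_intervals(per_query[qseqid]):
            bed.append(f"{qseqid}\t{start}\t{end}")
    return bed

# Toy hits (hypothetical): bacterial queries against a human database.
tab = [
    "bact1\tchr1\t95.0\t41\t0\t0\t10\t50\t1000\t1040\t1e-8\t60.0",
    "bact1\tchr7\t93.0\t41\t0\t0\t40\t80\t500\t540\t1e-6\t55.0",
    "bact1\tchr2\t99.0\t51\t0\t0\t200\t250\t90\t40\t1e-9\t70.0",
    "bact2\tchr1\t96.0\t26\t0\t0\t5\t30\t10\t35\t1e-4\t45.0",
]
bed = query_hits_to_bed(tab)
print("\n".join(bed))
```

A BED file like this could then be handed to a masking tool (for example `bedtools maskfasta`) or used directly to replace those positions with N.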
