Question

Local blast custom databases

0

Entering edit mode

7.6 years ago

treitlis ▴ 40

Hi,

I work on some protist genomes and because the protists live in symbiosis with bacteria usually I have bacterial contamination in the NGS data.

Previously, I was using a script to decontaminate the genomic data from bacterial hits, and among other things, the script was based on blastn and blastp, and instead of having subset of bacteria+archea database, I was using GI lists to filter the blast against nt and nr (using -gilist -negativegilist parameter)

I did not create custom subsets because I was thinking that is easier to update from time to time the gi-lists and the entire nt and nr database, instead of doing everything mention above and also to update the custom subset databases.

Now, NCBI is phasing out GI numbers, as I am sure everybody knows, and I am quite stuck with my script. I cannot use accession number as a way to filter the results, and also I don't know any way to make a custom subset of the nt or nr database using accession numbers and not GI numbers. I know that with GI numbers you could use the blastdb_aliastool to create a custom database based on your gilist but is there a way to do the same thing with accession numbers?

For my script it is very important somehow to have the filtering done in the blast, or to have custom blast database based on taxonomy, because I use custom outfmt, and one of the methods used in classification of a sequence as "contaminant" are coverage and identity thresholds which are calculated for the entire sequence length, and based on values from the blast results. It would not work if I would not get the results in the same order as blastn or blastp outputs them, which most probably would be the case if I would use some post filtering.

Thank you in advance for any help

ncbi gi numbers gilist • 4.1k views

ADD COMMENT • link 7.1 years ago by treitlis ▴ 40

score 1 · Answer 1 · 2016-09-20

1

Entering edit mode

7.6 years ago

5heikki 11k

With BLAST 2.4.0+

 *** Restrict search or results
 -gilist <String>
   Restrict search of database to list of GI's
    * Incompatible with:  negative_gilist, seqidlist, remote, subject,
   subject_loc
 -seqidlist <String>
   Restrict search of database to list of SeqId's
    * Incompatible with:  gilist, negative_gilist, remote, subject,
   subject_loc

I imagine -seqidlist refers to accession.version type identifiers. There are still many things that cannot be done without GI's (e.g. custom databases with taxonomy information). I would think that they will eventually fix it all. Be sure to let them know if there's something that can't be done anymore..

ADD COMMENT • link 7.6 years ago by 5heikki 11k

0

Entering edit mode

Hello,

I think I completely missed this parameter of the blast. I will try to download the accession number for some organism and try if to see how it will work and post back.

Anyway thank you very much for pointing out this parameter.

ADD REPLY • link 7.6 years ago by treitlis ▴ 40

0

Entering edit mode

There's a new version of blast (2.5.0).

BLAST+ 2.5.0 released

Fri, 23 Sep 2016 17:00:00 EST

A new version (2.5.0) of the BLAST+ executables is now available.

The new version offers support for HTTPS, accession.version as the primary sequence identifier, support for composition-based statistics with RPSTBLASTN, and a new taxonomic organism report. The release notes are at https://www.ncbi.nlm.nih.gov/books/NBK131777/

Two BLAST+ features require communication with the NCBI website (and HTTPS). First, the –remote flag sends the search to the NCBI for processing. Second, BLAST can take a sequence ID as a query and retrieve the sequence from the NCBI. You should update your BLAST+ executables by November 9, 2016 to ensure that these features continue to work. More information about the HTTPS transition is available at https://www.ncbi.nlm.nih.gov/home/develop/https-guidance.shtml

Improved support for accession.version as the primary identifier affects a number of BLAST+ executables. Blastdbcmd now produces FASTA formatted sequences from BLAST databases that have accession.version as the identifier rather than the traditional bar-delimited FASTA ID. See https://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers for more information. Makeblastdb can now guess the type of ID (e.g., GenBank or Refseq) from the accession alone if the –parse_seqids flag is used. The search executables now produce reports identifying matches using accession.version by default.

The new executables are available on the NCBI FTP site at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST

ADD REPLY • link 7.6 years ago by 5heikki 11k

0

Entering edit mode

Sorry for the late rely. Thank you for the information, but I don't know how well this version will help me. However, as I saw there is still possibility to use the GI numbers, and even now you can still download the GI list. I will try to use eutils to download the accession list and use seqid as trial but I am worried that I will get an error as with gilist. I saw that if the GI list is too big I get an error in blast "Error vector::reserve". So I usually use gilist (if the target seqs are the smaller set) or negative_gilist, if the target GI list is huge. But there is no negative_seqidlist, so I will see if I get the error if the seqidlist is too big.

ADD REPLY • link 7.5 years ago by treitlis ▴ 40

0

Entering edit mode

Hello,

I don't know what's with NCBI but to download the accession list for a group it will take ages. I don't know why the download of the accession list goes with max 1 kb/s but I am able to download the gilist with 250kb/s and if I try to download all sequences as fasta for a group I can download with several MB/s. It took me half a day to download the archea nt accession list, which is roughly 6 MB file. In this way I think it will be faster to download using entrez the entire eukaryote and bacteria fasta files for both nt and nr and concatenate the file and afterwards create the custom database, since the download goes much more faster even if I download tens of GB of data.

ADD REPLY • link 7.6 years ago by treitlis ▴ 40

score 0 · Answer 2 · 2016-09-19

0

Entering edit mode

7.6 years ago

Chris Cole ▴ 800

Rather than do this with your own scripts I recommend this tool: https://blobtools.readme.io

It is designed exactly for this kind of job and works very nicely. It is an update of the tools described here: http://journal.frontiersin.org/article/10.3389/fgene.2013.00237/full

ADD COMMENT • link 7.6 years ago by Chris Cole ▴ 800

0

Entering edit mode

Hello,

I know those tools, but they are not helpful when you are not working on well known organisms, or when the contaminants are not present in the nt database. I still use the tools to generate the plot before and after, but they are not well suited for my needs.

ADD REPLY • link 7.6 years ago by treitlis ▴ 40

0

Entering edit mode

I'm surprised. I've used it on novel dictystelid organisms and found the GC vs read depth very useful for filtering out the majority of the contaminant sequences. See our preprint here. http://biorxiv.org/content/early/2016/05/24/054536

ADD REPLY • link 7.6 years ago by Chris Cole ▴ 800

0

Entering edit mode

Hello,

Good example in your paper, but my case is very different. I don't want to enter in detail why I chosen this path. Even prodege, with it's very complex algorithm is unable to decontaminate the data. Also in FastQC I see a nice clear peak of GC content, not two peaks, or more, and in my "contaminants" file generated in my own way the CG content is extremely similar to the "clean" file. This approach is not suitable for my data. The eukaryotic genome and it's epibiotic bacteria, have both AT rich genomes, with a GC content of around 35%, in bacteria and the eukaryote as well.

The only sequences in NCBI for these eukaryotes are the SSU sequences, nothing more.

Moreover these data are from single cell genomes, which complicate even further the procedure. Here you cannot use coverage cutoff, to remove certain reads, because you have highly uneven coverage. Honestly I have tried a lot of things before choosing to use my own way.

ADD REPLY • link 7.6 years ago by treitlis ▴ 40

0

Entering edit mode

OK. Fair enough. Sounds like you've tried a few things already. Sorry I can't be more help.

Good luck!

ADD REPLY • link 7.6 years ago by Chris Cole ▴ 800

score 0 · Answer 3 · 2017-03-22

Hi there treitlis,

I am also a protistologist working on a similar issue of decontamination. For this, I typically use deconseq on my reads prior to assembly (using custom databases, but I suppose you could use nr as well). See here for their package. Its super fast so you can do it on reads.

But, if you would like to use blast, I have been trying to set up taxonomically restrictive blastdbs. I have been in touch with NCBI about this exact issue over the past couple of weeks and I think I have a solution for you.

(1) use blastdbcmd to extract the seqid and its corresponding Kingdom* information with the -outfmt "%i,%K" to a file. Takes anywhere from 30min - 2 hours

(2) grep this file for which Kingdom you want and write the accessions to a file

(3) make a blastdb_alias file using -seqidlist flag

I have only done it with GIs, but I think you are right to want to move towards accessions. I will try that later today and hope it works!

See you in Prague @ Protist 2017 ?

/Courtney

P.S. *Yes, the term Kingdom is outdated.. it should be domain.. and yes there are probably only two domains.. :P

score 0 · Answer 4 · 2017-03-22

Do you have reference genomes for your bug and/or probable contaminants?

You can simply map and discard/keep reads accordingly with the -F flag of samtools, before assembling. I've done this to remove host DNA from plasmid purifications in order to assemble the plasmids properly.

Have you tried running your data through Kraken as well to identify the contaminants etc.?

score 0 · Answer 5 · 2017-03-22

Hello,

Thank you for your replies.

Deconseq kinda works and kinda not. It's not the best tool for my job. I don't have reference for the contaminants (they are unsequenced bacteria) but I have some close relatives in NCBI. Deconseq removes kinda 4-5% of the reads which obliviously it's not correct. I know that I have more reads from bacteria (based on my manual decontamination). Even if I manually try to add some bacteria from NCBI it does not improve the outcome significantly to take it into account.

Since I needed to figure it out somehow, I was still using GI lists to filter the blast results using awk( I am a big fan of this small unix software) But now, I am able to do it also in different way, using accession numbers and not using GIlist in blasting step.

Currently if I need to decontaminate I do blast (blastn and blastp, depending on the situation), and then using accession lists created from the NCBI taxonomy and the accession2taxid (ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/ ) I separate the blast hits in two main groups (Eukaryotes, and anything else) (using awk script). And then further awk scripts tell me the coverage, identity and the overall "score" of a specific contig/scaffold from the blast results and if it's a bacterial contamination or not.

However, I have a small issue with this method and I am not sure what's the problem. NCBI nt database is big, with a lot of sequences, and if you check the taxonomy for Eukaryotes (just as an example) https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=2759&lvl=3&lin=f&keep=1&srchmode=1&unlock you can see that there are around 170 million sequences. But the accession2taxid from NCBI for gb database (I assume it's nt) has just around 120 million entries (and 12 more in the "dead" ones). So I don't know why is this inconsistency. When I do the separation into certain taxa (eg prokaryotes eukaryotes viruses, synthetic constructs) of the nucl_gb.accession2taxid.gz file I get around 90 million accession numbers for EUK. Obviously is much less than the 170 which you can see in the taxonomy, and much less than in my gilist. I don't have this issue in the nr database, which is kinda interesting, or at least I don't have such discrepancy. Do you think the fact that one accession number can have more versions (revision 1, 2 and so on) could be the problem. It can be that each version gets a different gi number but the accession number does not change, just the version number changes. This is my working theory but I am not sure about this idea. Any other opinions about this?

Most probably I will be in Prague at Protist 2017. My supervisor is one of the organizers:-).

LE: I have never tried Kraken, but as I read, it relies on good references, which I don't have, but none the less I will check the software when I will have time. But looking for Kraken I found an article http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1397-7 which could be useful for certain applications, because it combines database methods with other methods which don't rely on databases:-). Probably worth a try.