Species coverage in the NCBI protein NR database ?
0
0
Entering edit mode
3 months ago
HERMAN ▴ 10

Hi Biostars,

I am currently trying to build a Eukaryote version of the NCBI NR database and I am not really sure that I fully understand how the NR is implemented.

Here is the code that I'm using to do so :

#!/usr/bin/bash 

##############
# DOWNLOAD FULL NR
##############

baseURL="https://ftp.ncbi.nlm.nih.gov/blast/db/" 

for i in $(seq -f "%02g" 0 82); do
     echo "$baseURL/nr.$i.tar.gz"
done > download_list.txt       

aria2c -i download_list.txt --max-concurrent-downloads=10

# Not shown here but I also download and check the md5 files

for file in nr.*.tar.gz; do
    tar -I pigz -xf "$file"
done

###########################
# FETCH ALL SEQID FOR EUKARYOTE SPECIES
###########################

# From https://bioinf.shenwei.me/taxonkit/tutorial/#making-nr-blastdb-for-specific-taxids

id=2759 # Eukaryotes taxid

taxonkit list --ids $id --indent "" > $id.taxid.txt

# prot.accession2taxid.gz is an NCBI file, name pretty straightforward

pigz -dc prot.accession2taxid.gz \
    | csvtk grep -t -f taxid -P $id.taxid.txt \
    | csvtk cut -t -f accession.version,taxid \
    | sed 1d \
    > $id.acc2taxid.txt

cut -f 1 $id.acc2taxid.txt > $id.acc.txt

#######################
# EXTRACT THE EUKARYOTE PROTEINS
#######################

blastdbcmd -db NR/nr -entry_batch $id.acc.txt -out NR_EUK/nr_euk.faa -dbtype prot -logfile extract.log

My problem is as following, for some fasta sequences I have this kind of header :

>A0A067SLB9.1 RecName: Full=Alpha-amanitin proprotein 1; Contains: RecName: Full=Alpha-amanitin; AltName: Full=Amatoxin; AltName: Full=Gamma-amanitin; Flags: Precursor [Galerina marginata CBS 339.88] >5N4C_E Chain E, Alpha-amanitin proprotein [Galerina marginata] >5N4C_F Chain F, Alpha-amanitin proprotein [Galerina marginata] >5N4C_G Chain G, Alpha-amanitin proprotein [Galerina marginata] >5N4C_H Chain H, Alpha-amanitin proprotein [Galerina marginata] >5N4E_C Chain C, Alpha-amanitin proprotein [Galerina marginata] >5N4E_D Chain D, Alpha-amanitin proprotein [Galerina marginata] >AYM46699.1 alpha-amanitin [Galerina sulciceps] >AEX26935.1 alpha-amanitin proprotein [Galerina marginata] >AYM46698.1 alpha-amanitin [Galerina marginata] >KDR68474.1 hypothetical protein GALMADRAFT_1387421 [Galerina marginata CBS 339.88] >QKM76215.1 alpha-amanitin [Galerina marginata]
MFDTNATRLPIWGIGCNPWTAEHVDQTLASGNDIC

As I understand it, the NIH " compress " sequences that look exactly alike into one representative sequence ? The thing is I would like every species that have this protein ( or other proteins, this one is just an example ) to be represented in the NR since I'm working with a datatation pipeline. One thing that confuses me aswell is that when I run a website BLASTP with the MFDTNATRLPIWGIGCNPWTAEHVDQTLASGNDIC peptide, I find every protein present in the header but as unique entries.

Sorry for the very long post and thank you for the ones of you that will read this until the end !

Best regard, Simon

NR • 394 views
ADD COMMENT
1
Entering edit mode

as an answer to your question on how do they represent multiple entries with a single sequence

The header of the fasta file contains all headers that match identical sequences and these headers are separated with the /r (carriage return) character.

Hence each individual header contains multiple headers for each identical sequence. In your example if you keep scrolling right you will see

>A0A067SLB9.1  [...]  >5N4C_E Chain E, Alpha-amanitin [...] >AEX26935.1

it is a weird hack actually, but as most bioinformatics it kind of works

ADD REPLY
0
Entering edit mode

As I understand it, the NIH " compress " sequences that look exactly alike into one representative sequence ? The thing is I would like every species that have this protein ( or other proteins, this one is just an example ) to be represented in the NR since I'm working with a datatation pipeline.

Then you need to follow the recommended option a in the tutorial.

From nr.fa exported from pre-formated blastdb (faster, smaller output file, recommended). DO NOT directly download nr.gz from ncbi ftp, in which the FASTA headers are not well formated.

ADD REPLY
0
Entering edit mode

You might also check the new Blast+ version: NCBI BLAST+ 2.2.15 now available for download

New taxonomy limit option

You can now also limit your search to sequences from a group of organisms without first using a custom script to get all species-level Taxonomy IDs (taxids) for the group.

You just need to run

blastp -db nr -taxids 2759 -query ...
ADD REPLY
0
Entering edit mode

Thank you for your quick response ! ( And for all the softwares haha ). I indeed overlooked this comment on the headers format.

Have a great day !

ADD REPLY

Login before adding your answer.

Traffic: 2757 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6