Question: Custom nr database with only arthropods ?
1
gravatar for Picasa
3 months ago by
Picasa390
Picasa390 wrote:

Hi,

I would like to make a custom database (nr) for blast local running but containing only arthropods sequences.

Any way to do this ?

Thanks a lot.

blast nr arthropods • 396 views
ADD COMMENTlink modified 3 months ago by shenwei3564.3k • written 3 months ago by Picasa390

I think you should be able to modify the taxonomy ID in tutorials provided in the following post to suit your requirements.

Vertebrate Subset Nr Database? Build My Own?

ADD REPLYlink modified 3 months ago • written 3 months ago by Sej Modha3.9k

Thanks for you answer !

But this topic is 8 years old and seems like those gi methods don't work anymore (message from genomax)

ADD REPLYlink written 3 months ago by Picasa390
1

Yes, you're right, I completely forgot about that. I guess you could either use -seqidlist which is equivalent of -gilist in the current version of BLAST+ 2.7 or you could use the alpha release of the newer version of BLAST+ that allows taxonomy ID based filter: https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf

Recent enhancements to BLAST+
We have made some recent enhancement to the BLAST+ applications that allow you to:
1.) Limit your search by taxonomy using information built into the BLAST databases.
ADD REPLYlink written 3 months ago by Sej Modha3.9k

Extract all protein sequences of specific taxons from the NCBI nr database, and makeblastdb.

ADD REPLYlink written 3 months ago by shenwei3564.3k

I have used your tutorial but it seems that the fasta at the end (made with seqkit) is not complete:

grep '>' nr.arthropods.fasta -c
4119698

wc -l arthropods.taxid.acc.txt

8228966 arthropods.taxid.acc.txt

I have downloaded the last prot.accession2taxid.gz and nr.gz .

ADD REPLYlink modified 3 months ago • written 3 months ago by Picasa390

do you have a way to fix ?

ADD REPLYlink written 3 months ago by Picasa390

what's the taxid are you using?

ADD REPLYlink written 3 months ago by shenwei3564.3k

Using Arthropods: 6656

ADD REPLYlink written 3 months ago by Picasa390

change

seqkit grep -f virus.taxid.acc.txt nr.gz -o nr.virus.fa.gz

to

zcat nr.gz \
    | perl -e 'BEGIN{ $/ = ">"; <>; } while(<>){s/>$//;  $i = index $_, "]\n"; $h = substr($_, 0, $i)."]"; $s = substr $_, $i+2; while($h =~ /(.+?) \[(.+?)\]/g){ print ">$1 [$2]\n$s";} } ' \
    | seqkit grep -f arthropods.taxid.acc.txt -o nr.arthropods.fa.gz

Where the Perl script is for unfolding the records with same sequences in different species

$ cat t.fa 
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]EAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVA
INGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQ
PYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPV
SLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM

to

$ cat t.fa | perl -e 'BEGIN{ $/ = ">"; <>; } while(<>){s/>$//;  $i = index $_, "]\n"; $h = substr($_, 0, $i)."]"; $s = substr $_, $i+2; while($h =~ /(.+?) \[(.+?)\]/g){ print ">$1 [$2]\n$s";} } '
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVA
INGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQ
PYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPV
SLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>EAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVA
INGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQ
PYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPV
SLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM

ADD REPLYlink modified 3 months ago • written 3 months ago by shenwei3564.3k

done.

But I still got low number of sequence compared to the accession number:

grep '>' nr.arthropods.seqkit2.fasta -c
4133470

is it normal ?

ADD REPLYlink written 3 months ago by Picasa390

Where the Perl script is for unfolding the records with same sequences in different species

Possibly because shenwei356 is discarding records with same sequence.

ADD REPLYlink written 3 months ago by genomax59k

Ok.

Anyway. I have tried to make a database with this new fasta, with this command:

makeblastdb -in ../nr.arthropods.fasta -parse_seqids -dbtype prot -out nr.arthropods

and I got this error:

Building a new DB, current time: 09/10/2018 21:29:34
New DB name:   db_blast/nr.arthropods
New DB title:  ../nr.arthropods.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 279052: 3-9, 20-21, 31, 44, 46-48, 51-55, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279053: 1-5, 16-17, 27, 52-53, 57-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279054: 1-3, 14-15, 25, 45-46, 50-56
FASTA-Reader: Ignoring invalid residues at position(s): On line 279055: 6-7, 17, 42-43, 47-53
FASTA-Reader: Ignoring invalid residues at position(s): On line 279056: 3-4, 14, 33, 50-52, 56-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279057: 1-2, 12-13, 23, 42, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279058: 1, 5-11, 22-23, 33, 54-55, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279059: 1-5, 23-24, 34, 46, 52-57
FASTA-Reader: Ignoring invalid residues at position(s): On line 279060: 1-7, 17-18, 28, 50-51, 55-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279061: 1, 11-12, 22, 44-45, 49-55
FASTA-Reader: Ignoring invalid residues at position(s): On line 279062: 5-6, 16, 38-39, 43-49, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279063: 9, 31-32, 36-42, 52-53
FASTA-Reader: Ignoring invalid residues at position(s): On line 279064: 4, 26-27, 31-37, 47-48, 58
FASTA-Reader: Ignoring invalid residues at position(s): On line 279065: 12, 19-21, 26-27, 31-37, 48-49, 59
FASTA-Reader: Ignoring invalid residues at position(s): On line 279066: 17-18, 22-28, 39-40, 50
FASTA-Reader: Ignoring invalid residues at position(s): On line 279067: 8-9, 13-19, 30-31, 41
FASTA-Reader: Ignoring invalid residues at position(s): On line 279068: 3-4, 8-14, 25-26, 36, 51-52, 56-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279069: 1-2, 13-14, 24, 36, 38, 43-48, 52-58
FASTA-Reader: Ignoring invalid residues at position(s): On line 279070: 8, 10, 20, 37-38, 42-48, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279071: 9, 30-31, 35-41, 50

volume: db_blast/nr.arthropods

file: db_blast/nr.arthropods.pin
file: db_blast/nr.arthropods.phr
file: db_blast/nr.arthropods.psq

BLAST Database creation error: Input doesn't start with a defline or comment around line 279072

by looking at this line, I have something indeed a bit weird:

sed -n '279072p' < nr.arthropods.fasta 
[Alpheus gracilipes]AFN55284.1 histone H3, partial [Delecto
ADD REPLYlink modified 3 months ago • written 3 months ago by Picasa390

I'll check it after a full download of nr.gz. And have to go to work now.

ADD REPLYlink written 3 months ago by shenwei3564.3k

yes so apparently, some sequences don't start with '>'

ADD REPLYlink written 3 months ago by Picasa390

It's the bug of the perl one-liner, because of the not well formated nr.gz.

I gave up after few hours. May be we should blast on whole nr and filter the result.

ADD REPLYlink written 3 months ago by shenwei3564.3k
3
gravatar for shenwei356
3 months ago by
shenwei3564.3k
China
shenwei3564.3k wrote:

Data:

Hardware in this tutorial

  • CPU: AMD 8-cores/16-threads 3.7Ghz
  • RAM: 32GB
  • DISK:
    • Taxonomy files stores in NVMe SSD
    • blatdb files stores in 7200rpm HDD

Tools:

Steps:

  1. Listing all taxids below $id using taxonkit.

    id=6656
    
    # 6656 is the phylum Arthropoda
    # echo 6656 | taxonkit lineage | taxonkit reformat 
    # 6656    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda    Eukaryota;Arthropoda;;;;;
    
    # time: 3s 
    taxonkit list --ids $id --indent "" > $id.taxid.txt 
    
    wc -l $id.taxid.txt
    # 518373 6656.taxid.txt
    
  2. Retrieving target accessions. There are two options:

    1. From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in nr.

      # time: 4min
      pigz -dc prot.accession2taxid.gz \
          | csvtk grep -t -f taxid -P $id.taxid.txt \
          | csvtk cut -t -f accession.version \
          | sed 1d \
          > $id.acc.txt
      
      
      wc -l $id.acc.txt
      # 8174609 6656.acc.txt
      
    2. From pre-formated nr blastdb

      # time: 40min
      blastdbcmd -db nr -entry all -outfmt "%a %T" | pigz -c > nr.acc2taxid.txt.gz
      
      pigz -dc nr.acc2taxid.txt.gz | wc -l
      # 555220892
      
      # time: 3min
      pigz -dc nr.acc2taxid.txt.gz \
          | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \
          | cut -d ' '  -f 1 \
          > $id.acc.txt
      
      
      wc -l $id.acc.txt
      # 6928021 6656.acc.txt
      
  3. Retrieving FASTA sequences from pre-formated blastdb. There are two options:

    1. From nr.fa exported from pre-formated blastdb (faster, smaller output file, recommended). DO NOT directly download nr.gz from ncbi ftp, in which the FASTA headers are not well formated.

      # time: 117min
      blastdbcmd -db nr -dbtype prot -entry all -outfmt "%f" -out - | pigz -c > nr.fa.gz
      
      # time: 80min
      # perl one-liner is used to unfold records having mulitple accessions
      cat <(echo) <(pigz -dc nr.fa.gz) \
          | perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//;  $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } ' \
          | seqkit grep --delete-matched -f $id.acc.txt -o nr.$id.fa.gz
      
      # counting sequences
      # 
      # ls -lh nr.$id.fa.gz
      # -rw-r--r-- 1 shenwei shenwei 902M 9月  13 01:42 nr.6656.fa.gz
      # 
      pigz -dc nr.$id.fa.gz | grep '^>' -c
      
      # 6928017
      # Here 6928017 ~=  6928021 ($id.acc.txt)
      
    2. Directly from pre-formated blastdb

      # time: 5h20min
      blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz
      
      
      # counting sequences
      #
      # Note that the headers of outputed fasta by blastdbcmd are "folded"
      # for accessions from different species with same sequences, so the 
      # number may be small than $(wc -l $id.acc.txt).
      pigz -dc nr.$id.fa.gz | grep '^>' -c
      # 1577383
      
      # counting accessions
      #  
      # ls -lh nr.$id.fa.gz
      # -rw-r--r-- 1 shenwei shenwei 2.1G 9月  13 03:38 nr.6656.fa.gz
      #
      pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\n>/g' | grep '^>' -c
      # 288415413
      
  4. makeblastdb

    pigz -dc nr.$id.fa.gz > nr.$id.fa
    
    # time: 3min ($nr.$id.fa from step 3 option 1)
    #
    # building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error:
    #
    #     BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1
    #
    makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id
    
    # rm nr.$id.fa
    
ADD COMMENTlink modified 3 months ago • written 3 months ago by shenwei3564.3k

test: blastp

# blastdb from step 3 option 2
# 
time blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast
# real    0m20.866s

# $ cat t4.fa.blast | grep Query= -A 10
# Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a
#    
# Length=35
                                                                        Score     E
# Sequences producing significant alignments:                          (Bits)  Value

# 2MPQ_A  Chain A, Solution structure of the sodium channel toxin Hd1a  72.4    2e-17
# A0A0J9X1W9.2  RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-...  72.4    2e-17
# ADB56726.1  HNTX-IV.2 precursor [Haplopelma hainanum]                 66.6    9e-15
# D2Y233.1  RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H...  66.6    9e-15
# ADB56830.1  HNTX-IV.3 precursor [Haplopelma hainanum]                 66.6    9e-15
ADD REPLYlink modified 3 months ago • written 3 months ago by shenwei3564.3k

finally, there's no bug and sequences count matches (Here 6928017 ~= 6928021).

ADD REPLYlink written 3 months ago by shenwei3564.3k
1
gravatar for shenwei356
3 months ago by
shenwei3564.3k
China
shenwei3564.3k wrote:

Do not follow this answer.

Here's the whole commands (replacing pigz with gzip if pigz not installed).

id=6656

# Step 1) list all taxid below $id
taxonkit list --ids $id --indent "" > $id.taxid.txt



# Step 2) get all accessions
# 3 min for me
pigz -d -c prot.accession2taxid.gz \
    | csvtk -t grep -f taxid -P $id.taxid.txt \
    | csvtk -t cut -f accession.version \
    > $id.taxid.acc.txt



# Step 3) Option A (very slow)
blastdbcmd -db nr -entry all -outfmt "%a\t%T" | \
    csvtk -t grep -f 2 -P $id.taxid.acc.txt | \
    csvtk -t cut -f 1 | \
    blastdbcmd -db nr -entry_batch - -out nr.$id.fa


# Step 3) Option B (relative fast but buggy)
# reformat nr.gz and retrieve sequences (buggy)
# 16 min for me
cat <(echo) <(pigz -d -c nr.gz) \
    | perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//;  $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ /\[.+\]/) { print ">$_"; next; } while($h =~ /(.+? \[.+?\])/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } '\
    | seqkit grep --delete-matched -f $id.taxid.acc.txt -o nr.$id.fa.gz

# checking sequences amount (with Option B)
seqkit stats nr.$id.fa.gz
# file           format  type      num_seqs        sum_len  min_len  avg_len  max_len
# nr.6656.fa.gz  FASTA   Protein  6,913,200  2,860,805,704        6    413.8   28,516

wc -l $id.taxid.acc.txt
# 8174610 $id.taxid.acc.txt



# Step 4) makeblastdb
gzip -d -c nr.$id.fa.gz > nr.$id.fa    
makeblastdb -in nr.$id.fa -dbtype prot -out nr.$id

Well, as for 6,913,200 < 8,174,610:

Since many genes in different species have same sequences, they are recorded as one record in nr.gz.

The intuitional way is matching FASTA record heads with all target accesions using regular expression matching, however, it's very slow for such a big dataset:

$ pigz -d -c nr.gz | seqkit stats | tee stats.txt
file  format  type        num_seqs         sum_len  min_len  avg_len  max_len
-     FASTA   Protein  169,199,080  61,790,996,688        6    365.2   74,488

So a better solution is "unfolding" these FASTA records.
But the FASTA headers in nr.gz are not well formated, it's hard to rightly unfold all records shared by muliple species. Here are some cases:

  1. Normal. accession product [species]accession2 product2 [species2] (easy to solve)

    >XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]EAL68957.1 hypothetical protein DDB_G0276911 [Dictyosteli
    um discoideum AX4]
    Mxxxxxxxx
    
  2. Some FASTA headers contain 0x01 (shown as ^ below) in front of accession (easy to solve)

    $ pigz -d -c nr.gz | grep '^>' | grep CAH60339.1 | cat -A
    >CAH60339.1 histone h3, partial [Copelatus sp. 028TA588]^AABG79397.1 histone H3, partial [Lumbrineris magnidentata]^AABG79414.1 histone H3, partial [Polygordius lacteus]^AABX89916.1 histone 3, partial [Akalyptoischion atrichos]
    xxxx
    >^accesion product
    xxxxxx
    
  3. Mix 1. Some headers contains species information in [], but thelast one does not. (there are ways)

    >xxxxx [xxxx]aaaa bbbbbb
    
  4. Mix 2. Where B3N7H1.1 does not end with ], so the last character 4 and the later accession A1Z7U3.1 lay together -_-! (HARD to solve)

    >NP_001027396.1 uncharacterized protein Dmel_CG33774 [Drosophila melanogaster]XP_001970317.1 uncharacterized protein Dere_GG10558 [Drosophila erecta]B3N7H1.1 RecName: Full=Dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit 4A1Z7U3.1 RecName: Full=Dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit 4AAZ52820.1 uncharacterized protein Dmel_CG33774 [Drosophila melanogaster]EDV59376.1 uncharacterized protein Dere_GG10558 [Drosophila erecta]
    
ADD COMMENTlink modified 3 months ago by genomax59k • written 3 months ago by shenwei3564.3k

Thanks I will try.

1) For the option A: How do you create your database (nr) and how much time is very slow ?

Because I got this error:

[ERRO] field (2) out of range (1) in file: -

2) I am not sure to understand the option B, because you mentioned "buggy", but the makeblastdb is working ?

ADD REPLYlink modified 3 months ago • written 3 months ago by Picasa390

buggy means failing to retrieve some sequences, makeblastdb works fine.

option a is ouputtin from pre-built blastdb downloaded from ftp.

ADD REPLYlink written 3 months ago by shenwei3564.3k

for option A, do you have a link to the pre-build blastb ?

because I have tried 2 methods :

makeblastb from nr.gz directly

and

update_blastdb.pl --decompress nr

and still got the

[ERRO] field (2) out of range (1) in file: -

ADD REPLYlink written 3 months ago by Picasa390

Get the pre-made blast indexes for nr directly from NCBI.

May want to try and see if the solution I posted here works for nr.

ADD REPLYlink written 3 months ago by genomax59k

Sorry for this bug. I've downloaded the nr and rewrite the steps posted as a new answer. There's no bug now.

ADD REPLYlink written 3 months ago by shenwei3564.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1763 users visited in the last hour