Question: FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)
sfcarroll (Switzerland), 4.5 years ago, wrote:

I am not sure if this is a problem or if in fact the process is correct. Any help is much appreciated.

I am trying to make BLAST databases from assembly FASTA files, and I am seeing the above error. makeblastdb still generated the database files, but how do I know they are correct?

I followed these steps:

1) Downloaded assembly fasta file archive 

site

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips

file

chromFa.tar.gz   

2) Unpacked the file

tar zxvf chromFa.tar.gz

3) Ran makeblastdb

/home/sean/blast/ncbi-blast-2.2.29+/bin/makeblastdb -dbtype nucl -title chr1.fa.blast -in ../chr1.fa -parse_seqids

4) Received an error 

Building a new DB, current time: 09/04/2014 13:18:53
New DB name:   ../chr1.fa
New DB title:  chr1.fa.blast
Sequence type: Nucleotide
Deleted existing BLAST database with identical name.
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)
Adding sequences from FASTA; added 1 sequences in 20.1816 seconds.

5) Output files generated

-rw-rw-r-- 1 sean sean  62359693 Sep  4 13:19 chr1.fa.nsq
-rw-rw-r-- 1 sean sean        59 Sep  4 13:19 chr1.fa.nsi
-rw-rw-r-- 1 sean sean        18 Sep  4 13:19 chr1.fa.nsd
-rw-rw-r-- 1 sean sean        36 Sep  4 13:19 chr1.fa.nog
-rw-rw-r-- 1 sean sean        96 Sep  4 13:19 chr1.fa.nin
-rw-rw-r-- 1 sean sean        43 Sep  4 13:19 chr1.fa.nhr

 

6) The start of the assembly file does contain a lot of N's

➜  hg19  head chr1.fa
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
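
To put a number on step 6, here is a minimal sketch that reports what percentage of the first sequence line is N. The helper name first_line_n_percent is made up for illustration and is not part of BLAST+. It counts only N/n, so it may undercount if the reader's notion of "ambiguous" also covers other IUPAC codes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BLAST+): print the percentage of N/n
# characters on the first non-header, non-empty line of a FASTA file.
first_line_n_percent() {
    awk '!/^>/ && length($0) > 0 {
             line = $0
             n = gsub(/[Nn]/, "", line)   # gsub returns the number of replacements
             printf "%d\n", 100 * n / length($0)
             exit                          # only the first data line matters here
         }' "$1"
}

# Small demo file that mimics the start of chr1.fa
demo=$(mktemp)
printf '>chr1\nNNNNNNNNNN\nACGTACGTAC\n' > "$demo"
first_line_n_percent "$demo"
rm -f "$demo"
```

On the real file this would be called as first_line_n_percent chr1.fa. A result near 100 matches the warning text and only says the sequence begins with a masked region.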

 

 

 

Devon Ryan (Freiburg, Germany), 4.5 years ago, wrote:

This is a warning with "Error" prepended rather than an actual error (see the source code here). For something like the human genome, you will get this warning simply because the telomeres are hard-masked, so the sequence starts with a long run of N's. The resulting files should work regardless.
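
One way to check the database yourself is to query it with blastdbcmd, which ships with BLAST+. This is a rough sketch, not an official validation procedure: the wrapper name check_blast_db and the example range are assumptions. It prints the database summary, then pulls a slice of one entry back out so you can compare it with the original FASTA.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the database summary, then extract a range
# from one entry. Looking up an entry by ID relies on the database having
# been built with -parse_seqids, as in step 3 of the question.
check_blast_db() {
    local db=$1 entry=$2 range=$3
    if ! command -v blastdbcmd >/dev/null; then
        echo "blastdbcmd not found on PATH" >&2
        return 127
    fi
    blastdbcmd -db "$db" -info &&
        blastdbcmd -db "$db" -entry "$entry" -range "$range"
}

# Example call (the range is an assumption; pick one past the leading N run):
# check_blast_db ../chr1.fa chr1 10001-10060
```

If -info reports one sequence and the extracted bases match the same positions in chr1.fa, the database was built from the file as intended.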


Thanks, I thought so, just wanted to sanity check my process. I know the BLAST databases can be downloaded from the NIH, but I am just trying to own the process. 

Reply written 4.5 years ago by sfcarroll

Hi Devon! I have the same issue as sfcarroll, except my sequences don't have a single "N" in them. Should I be concerned about this error?

Reply written 2.2 years ago by catarina.fa