Makeblastdb Error
4
3
Entering edit mode
10.1 years ago

I've download a cds (coding sequence) databse from Ensembl in order to make a BLASTN and I am now trying to use makeblastdb of this database using the following command:

makeblastdb -in Ory_cun_cds_5.fasta -input_type fasta -dbtype nucl

And I've been obtaining the following:

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 63% ambiguous nucleotides (shouldn't be over 40%)

I believe that this is due to the amount of N's in the sequences that I can find by making a quick search, butI I am not sure. Even if this is the case, should I delete this sequences? Does anyone know how to solve this?

makeblastdb blastn error ensembl • 8.8k views
ADD COMMENT
3
Entering edit mode
10.1 years ago

There are two possibilities here:

  1. You transferred the file from a mac to a unix system, or a windows to a unix system or similar, and that screwed up the symbols for line endings - in that case, makeblastdb is errroneously loading your sequence names as actual sequences and thinks that the H, B, D, X, N etc. in the name are ambiguous IUPAC codes (as listed here). In that case, you should fix your line-endings; check if you have "\n", "\r", "\n\r" at the end of your lines, I don't know on what OS you are so I don't know how, using Python I would just do .rstrip("\r\n") on all lines and add a "\n". Linux also has commands like mac2unix in case you know where your file comes from. Also check if you don't have any accidental linebreaks inside your sequence names.

  2. There are actually too many ambiguous codes in one of your sequences, in which case I would remove the sequence in question. If it's so ambiguous, you can't align anything to it - why would you?

But I never had that error so I just tried it with this file and blast 2.2.26:

>first
XBDHXBHDXBHDBNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNnHBHXBHBHD
>second
ATGCNNNNNNNNNNNNNNNNNNNNNNNATGC
>third
ATCGTCGATCGATGCTAGTAGCTAGCATGCTAGCA

with the command makeblastdb -in test.fasta -dbtype nucl -input_type fasta and that worked fine... which makes me think the first possibility is more likely.

Output:

Building a new DB, current time: 03/06/2014 10:35:12
New DB name:   test.fasta
New DB title:  test.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 3 sequences in 0.123005 seconds.
ADD COMMENT
0
Entering edit mode

In case others are having this issue - this was my problem, I had edited the seq ids with regex which blastn did not like at all. it would take neither \r nor \n to delineate between the title and sequence, and so I had to manually edit the file in a text editor in order for things to work.

Interestingly, even though my file was on a windows machine it would not blast locally OR on our unix system.

ADD REPLY
2
Entering edit mode
9.9 years ago
keith ▴ 130

I blogged about this problem http://www.acgt.me/blog/2014/5/15/fun-with-an-error-message-from-ncbi-blast

I think the point of this message is to warn you that if your sequence is highly enriched for ambiguous characters (X for proteins, N, R, S etc. for DNA) then you probably won't be able to make much of a useful BLAST database from it. However, the warning gets prefixed with 'Error' to make it sound worse than it is.

It only check the first line of a sequence. So if your sequence has >40% Xs on the first line and then no more ambiguous characters for the rest of the sequence, you will still see this error. The code behind it triggers as long as there is 5 characters on the first line. So if the first line of your sequence just starts with 'XXXSW', you will see the error.

More importantly, you should be independently checking your sequences to see how many ambiguous characters there are. Too many people send sequences to all manner of bioinformatics tools, even if they won't be able to do anything useful because of the high content of Xs or Ns.

ADD COMMENT
0
Entering edit mode
10.1 years ago

Second possibility: Now I'm puzzled because if I had to guess the reason why this happens it would be your second possibility! I've checked manually and I do find some sequences why a large amount of N's! I guess I'll try to cp them into a new file and try to makeblastdb with only them and see if it works.

First possibility: I also thought about your first guess but I had excluded it because of two things:

  1. I have another data set of peptides that I downloaded from Ensembl and another from NCBI and that I transferred to Unix in the same way I transferred the cds data set (Download to Windows and transferred to Unix via WindSCP) and they both worked fine.
  2. When I have this problem with scripts that I create in Windows and then transfer to Unix, I open a new file in Unix and cp the content of the script I created in Windows into this files and it usually works fine. I did this with the data set and it doesn't work.

I don't really know Python (very new at this) but I guess I could ask for help to use your suggestion with .rstrip. I don't find any symbols like "\n", "\r", "\n\r" in the end of lines. I do have some extra empty lines between some sequences (between the end of one and the beginning (the > symbol) of the next) but this also happens in the peptide data set so I'm guessing that this isn't the problem. Would add an asterisk also mark the end of the sequence? There is an asterisk in every end of the peptide sequence in my peptide reference data set. I excluded this because I tried to makeblastdb with only the first three sequences of the cds data set and I could do it just fine. .

What I'm thinking: I was thinking about deleting the sequences with 40% or more ambiguous bases (that is, N's) to solve this problem but I will try your suggestions first and see if they work

Thanks a lot for your answers!

ADD COMMENT
0
Entering edit mode

Since you transferred files from Windows to Linux, I suggest installing dos2unix on Linux and running that on your input files, that might fix your problem.

sudo apt-get install dos2unix

or

sudo yum install dos2unix

Warning: On standard settings this will change your original file.

Edit: And here's a simple Python script which uses BioPython to remove sequences with more than 40% missing alleles which reads "your_input_file.fasta" (change that) and writes to "Your_cleaned_file.fasta"

from Bio import SeqIO
output = open("Your_cleaned_file.fasta", "w")
for seq in SeqIO.parse("your_input_file.fasta", "fasta"):
  sequence = str(seq.seq).upper()
  non_missing = sequence.count("A") + sequence.count("T") + sequence.count("C") + sequence.count("G")
  missing = len(sequence) - non_missing
  if float(missing)/non_missing > 0.4:
    continue
  output.write(seq.format("fasta"))
ADD REPLY
0
Entering edit mode

Hi!

I decided to linearize the fasta sequence (put the sequence in one line) and now it is working fine. It was weird because I was only getting one error what means that the FASTA-reader was only complaining about one line with 63% of ambiguous nucleotides when I new that several other lines had several different percentages of N's, from the sequence was all made of N (60/60) to 1/60. This solved the problem.

Anyway, thank you a lot to take the time to try to help me.

ADD REPLY
0
Entering edit mode
9.9 years ago

Dear all,

I am still stuck with same problem and not getting any idea to resolve it

I checked back to my seq and i dont have any ambiguous nucleotides such as N, Y, W etc

Second possibility I try to remove \r\n (might be by conversion problem dos2unix or mac2unix)

by dos2unix program

dos2unix -c mac input.fasa #(but not worked same error ??)
dos2unix -n input.fasta output.fasta

also used

tr '\r' '\n' <input> output #(but not worked same error ??)

but the problem is remain same

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 42% ambiguous nucleotides (shouldn't be over 40%))?????

Please its humble request you all, suggest me something that how to over come this problem.

The seq I downloaded from miRBase mature.

ADD COMMENT
1
Entering edit mode

Hello folks,

I got it. it was problem of RNA, its only take T (i.e DNA) instead U (RNA). i just simply replaced U into T and its work !!!

ADD REPLY
0
Entering edit mode

I encountered the same problem, but with DNA sequences. Error or Warning message is always distracting, but it seems that my file has been masked correctly.

ADD REPLY

Login before adding your answer.

Traffic: 2125 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6