I've download a cds (coding sequence) databse from Ensembl in order to make a BLASTN and I am now trying to use makeblastdb of this database using the following command:
makeblastdb -in Ory_cun_cds_5.fasta -input_type fasta -dbtype nucl
And I've been obtaining the following:
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 63% ambiguous nucleotides (shouldn't be over 40%)
I believe that this is due to the amount of N's in the sequences that I can find by making a quick search, butI I am not sure. Even if this is the case, should I delete this sequences? Does anyone know how to solve this?
You transferred the file from a mac to a unix system, or a windows to a unix system or similar, and that screwed up the symbols for line endings - in that case, makeblastdb is errroneously loading your sequence names as actual sequences and thinks that the H, B, D, X, N etc. in the name are ambiguous IUPAC codes (as listed here). In that case, you should fix your line-endings; check if you have "\n", "\r", "\n\r" at the end of your lines, I don't know on what OS you are so I don't know how, using Python I would just do .rstrip("\r\n") on all lines and add a "\n". Linux also has commands like mac2unix in case you know where your file comes from. Also check if you don't have any accidental linebreaks inside your sequence names.
There are actually too many ambiguous codes in one of your sequences, in which case I would remove the sequence in question. If it's so ambiguous, you can't align anything to it - why would you?
But I never had that error so I just tried it with this file and blast 2.2.26:
with the command makeblastdb -in test.fasta -dbtype nucl -input_type fasta and that worked fine... which makes me think the first possibility is more likely.
Output:
Building a new DB, current time: 03/06/2014 10:35:12
New DB name: test.fasta
New DB title: test.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 3 sequences in 0.123005 seconds.
In case others are having this issue - this was my problem, I had edited the seq ids with regex which blastn did not like at all. it would take neither \r nor \n to delineate between the title and sequence, and so I had to manually edit the file in a text editor in order for things to work.
Interestingly, even though my file was on a windows machine it would not blast locally OR on our unix system.
I think the point of this message is to warn you that if your sequence is highly enriched for ambiguous characters (X for proteins, N, R, S etc. for DNA) then you probably won't be able to make much of a useful BLAST database from it. However, the warning gets prefixed with 'Error' to make it sound worse than it is.
It only check the first line of a sequence. So if your sequence has >40% Xs on the first line and then no more ambiguous characters for the rest of the sequence, you will still see this error. The code behind it triggers as long as there is 5 characters on the first line. So if the first line of your sequence just starts with 'XXXSW', you will see the error.
More importantly, you should be independently checking your sequences to see how many ambiguous characters there are. Too many people send sequences to all manner of bioinformatics tools, even if they won't be able to do anything useful because of the high content of Xs or Ns.
ADD COMMENT
• link
updated 4.8 years ago by
Ram
44k
•
written 10.4 years ago by
keith
▴
130
Second possibility: Now I'm puzzled because if I had to guess the reason why this happens it would be your second possibility! I've checked manually and I do find some sequences why a large amount of N's! I guess I'll try to cp them into a new file and try to makeblastdb with only them and see if it works.
First possibility: I also thought about your first guess but I had excluded it because of two things:
I have another data set of peptides that I downloaded from
Ensembl and another from NCBI and that I transferred to Unix in the
same way I transferred the cds data set (Download to Windows and
transferred to Unix via WindSCP) and they both worked fine.
When I have this problem with scripts that I create in Windows
and then transfer to Unix, I open a new file in Unix and cp the
content of the script I created in Windows into this files and it
usually works fine. I did this with the data set and it doesn't
work.
I don't really know Python (very new at this) but I guess I could ask for help to use your suggestion with .rstrip. I don't find any symbols like "\n", "\r", "\n\r" in the end of lines. I do have some extra empty lines between some sequences (between the end of one and the beginning (the > symbol) of the next) but this also happens in the peptide data set so I'm guessing that this isn't the problem. Would add an asterisk also mark the end of the sequence? There is an asterisk in every end of the peptide sequence in my peptide reference data set. I excluded this because I tried to makeblastdb with only the first three sequences of the cds data set and I could do it just fine. .
What I'm thinking: I was thinking about deleting the sequences with 40% or more ambiguous bases (that is, N's) to solve this problem but I will try your suggestions first and see if they work
Since you transferred files from Windows to Linux, I suggest installing dos2unix on Linux and running that on your input files, that might fix your problem.
sudo apt-get install dos2unix
or
sudo yum install dos2unix
Warning: On standard settings this will change your original file.
Edit: And here's a simple Python script which uses BioPython to remove sequences with more than 40% missing alleles which reads "your_input_file.fasta" (change that) and writes to "Your_cleaned_file.fasta"
from Bio import SeqIO
output = open("Your_cleaned_file.fasta", "w")
for seq in SeqIO.parse("your_input_file.fasta", "fasta"):
sequence = str(seq.seq).upper()
non_missing = sequence.count("A") + sequence.count("T") + sequence.count("C") + sequence.count("G")
missing = len(sequence) - non_missing
if float(missing)/non_missing > 0.4:
continue
output.write(seq.format("fasta"))
I decided to linearize the fasta sequence (put the sequence in one line) and now it is working fine. It was weird because I was only getting one error what means that the FASTA-reader was only complaining about one line with 63% of ambiguous nucleotides when I new that several other lines had several different percentages of N's, from the sequence was all made of N (60/60) to 1/60. This solved the problem.
Anyway, thank you a lot to take the time to try to help me.
I encountered the same problem, but with DNA sequences. Error or Warning message is always distracting, but it seems that my file has been masked correctly.
In case others are having this issue - this was my problem, I had edited the seq ids with regex which blastn did not like at all. it would take neither \r nor \n to delineate between the title and sequence, and so I had to manually edit the file in a text editor in order for things to work.
Interestingly, even though my file was on a windows machine it would not blast locally OR on our unix system.