Entering edit mode
4.1 years ago
anikcropscience
▴
270
Hello,
I am using MAFFT to align multiple nucleotide sequences. I am using the following code:
system(paste0("mafft --maxiterate 1000 Alignment_output/",GENE,".fa > ",GENE,".mafft.fa"))
But I have received the following error message:
nthread = 0
nthreadpair = 0
nthreadtb = 0
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00
=========================================================================
=========================================================================
===
=== Alphabet 'z' is unknown.
=== Please check site 2208 in sequence 4.
===
=== To make an alignment that has unusual characters (U, @, #, etc), try
=== % mafft --anysymbol input > output
===
=========================================================================
=========================================================================
Illegal character z
Then I inserted --anysymbol
in that above code. But again I receive the following message:
inputfile = orig
Characters '= < >' can be used only in the title lines in the --anysymbol or --text mode.
The output file is empty and no result.
Can you please tell me how can I solve this issue?
Thanks a lot.
To clarify, here are the full codes that I used to run. I ran the scripts on a cluster.
The folder structure looks like the following:
ST16CH_Denovo
...ISOLATE_ID1
......scaffolds.fasta
...ISOLATE_ID2
......scaffolds.fasta
# Making a BLAST database:
mkdir BLAST_DB
mkdir BLAST_output
# Define the path to the folder containing de novo assemblies
DENOVOFOLDER=ST16CH_Denovo
# Create Blast databases for all assemblies
isolates=( $( ls $DENOVOFOLDER ) )
# generate Blast DB + fasta index (fai) for each scaffolds.fasta file
for isolatespades in ${isolates[@]}
do
echo $isolatespades
isolate=$( echo $isolatespades )
echo $isolate
makeblastdb -in $DENOVOFOLDER/$isolatespades/scaffolds.fasta -dbtype nucl -out BLAST_DB/$isolate; samtools faidx $DENOVOFOLDER/$isolatespades/scaffolds.fasta
samtools faidx $DENOVOFOLDER/$isolatespades/scaffolds.fasta
done
# Define the fasta file you want to use
ORIGFILE= gene_name.fa
cp $ORIGFILE blast.gene_name.query.txt
# define a label to be used later
GENE=(gene_name)
isolates=( $( ls $DENOVOFOLDER ) )
# run blast
for isolatespades in ${isolates[@]}
do
echo $isolatespades
isolate=$( echo $isolatespades)
echo $isolate
blastn -query blast.$GENE.query.txt -db BLAST_DB/$isolate -out BLAST_output/$isolate.$GENE.blastn.txt -outfmt 6
done
Are you using this statement in a loop? If so, please refine your command to error out with the file that produces the error - maybe print each
GENE
before running the mafft command. If your command is not in a loop (i.e. you know the value of the failingGENE
), you can skip the above debugging step.Once you know the failing input file, show us the first 20 lines of the file and also the output to
grep -m15 z input.fa | grep -Ev "^>"
Yes, I am using the following loop first in R to process all the files. This is the code that I am running first:
The
GENE
here stands forGENE= gene_id
. How can I get that error you are talking about?I don't see
GENE
being assigned values, so I'm guessing there is yet another loop wrapping the code above. Write the value ofGENE
to stdout before running the mafft command so you know where it fails.I have added the full script to my original post. Please have a look. It ran alright until the R-part. Also, when I see
blastn
in R, there is only one file outputted instead of 145. I think something is wrong with the naming.I still don't see your R code that runs mafft.
Side note: Your shell code is quite inefficient. Arrays are not great practice, especially when they're not absolutely required. See:
Your code:
Without arrays:
Ok, here is the R-code that runs MAFFT
We're back to square one again. How does the value for the variable
GENE
get assigned? In this comment, you show your R code but not the MAFFT part. I asked for the full context and the edit you make has ALL the shell context and NO R context where you call MAFFT. I ask where is MAFFT now and you give me the initial one-liner. How do these pieces fit together?The variable GENE gets assigned before running the R-script like this:
And then the whole R-script above runs and also the one liner above. I am using R in the command line as I am working on a Cluster. Do you think there is something missing? Because that is all the codes I have.
It would have helped if you had added the 3 above lines in your R comment from earlier. If you're assigning a static value to
GENE
manually, debugging should be easier.Let's say the value of
GENE
is actually"gene_name"
. You need to look at:And see what's going on. Please run the shell commands I gave above and show us the output.
Ok, Thank you very much. I will run the command and share the outputs as soon as I can.
Hi Ram
Sorry for the delay. My vpn contract ended and had to renew it. So, I see the problem. After running the code
system(paste0("cat Alignment_output/*_",GENE,".fa > Alignment_output/",GENE,".fa"))
I am supposed to get each fasta sequence in a new line. But for some samples, I see that the fasta indicator is appended to the end of the previous sequence. Please have a look below. That is why MAFFT is detecting a 'z' I guess? Do you have any solution to avoid this formatting in the output file?There are probably a few files that do not have a new line character before the End of File character (the last visible character is a sequence character for these files). You'd need to probably use some sed-foo to ensure that all
>
character are preceded by a new line. Instead of redirecting directly fromcat
, pipe it to ased
that replaces all>
s that are preceded by a non-new-line character with a>
preceded by a\n
while retaining the previous preceding character.I did some quick testing and this worked:
You may need to adjust the sed flags based on your distribution. the
-Ee
should work with gnu-sed but you may need to use-re
for a different sed distribution.Ok thanks a lot. I used the
system(paste0("printf \"\n\" >> Alignment_output/temp.fa"))
after the samtools command and the last command in the revcomp loop in the R-script. It solved the problem by this way.What does your
system()
command look like now, after having added--anysymbol
? I am guessing you added--anysymbol
right before that instance of>
in your command.Maybe try something like this:
Also, since 'GENE' is referring to
FASTA
files, shouldn't you remove that.fa
you're adding viapaste0()
?It looks like the following:
GENE here refers to a gene name, not any sequence.