Question

How to perform multiple alignment using MAFFT?

0

Entering edit mode

4.2 years ago

anikcropscience ▴ 270

Hello,

I am using MAFFT to align multiple nucleotide sequences. I am using the following code:

system(paste0("mafft --maxiterate 1000 Alignment_output/",GENE,".fa > ",GENE,".mafft.fa"))

But I have received the following error message:

nthread = 0
nthreadpair = 0
nthreadtb = 0
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00
========================================================================= 
========================================================================= 
=== 
=== Alphabet 'z' is unknown.
=== Please check site 2208 in sequence 4.
=== 
=== To make an alignment that has unusual characters (U, @, #, etc), try  
=== % mafft --anysymbol input > output
=== 
========================================================================= 
========================================================================= 
Illegal character z

Then I inserted --anysymbol in that above code. But again I receive the following message:

inputfile = orig

Characters '= < >' can be used only in the title lines in the --anysymbol or --text mode.

The output file is empty and no result.

Can you please tell me how can I solve this issue?

Thanks a lot.

To clarify, here are the full codes that I used to run. I ran the scripts on a cluster.

The folder structure looks like the following:

ST16CH_Denovo
    ...ISOLATE_ID1
    ......scaffolds.fasta
    ...ISOLATE_ID2
    ......scaffolds.fasta

# Making a BLAST database:

mkdir BLAST_DB
mkdir BLAST_output
# Define the path to the folder containing de novo assemblies


 DENOVOFOLDER=ST16CH_Denovo
# Create Blast databases for all assemblies


 isolates=( $( ls $DENOVOFOLDER ) )

# generate Blast DB + fasta index (fai) for each scaffolds.fasta file
    for isolatespades in ${isolates[@]} 
    do  
    echo $isolatespades 
    isolate=$( echo $isolatespades ) 
    echo $isolate 
    makeblastdb -in $DENOVOFOLDER/$isolatespades/scaffolds.fasta -dbtype nucl -out BLAST_DB/$isolate; samtools faidx $DENOVOFOLDER/$isolatespades/scaffolds.fasta 
    samtools faidx $DENOVOFOLDER/$isolatespades/scaffolds.fasta
done

# Define the fasta file you want to use
ORIGFILE= gene_name.fa
cp $ORIGFILE blast.gene_name.query.txt
# define a label to be used later
GENE=(gene_name)

isolates=( $( ls $DENOVOFOLDER ) )

 # run blast  

for isolatespades in ${isolates[@]} 
do   
    echo $isolatespades   
    isolate=$( echo $isolatespades)  
    echo $isolate   
    blastn -query blast.$GENE.query.txt -db BLAST_DB/$isolate -out BLAST_output/$isolate.$GENE.blastn.txt -outfmt 6
done

genome alignment gene sequence • 4.7k views

ADD COMMENT • link 4.2 years ago by anikcropscience ▴ 270

0

Entering edit mode

Are you using this statement in a loop? If so, please refine your command to error out with the file that produces the error - maybe print each GENE before running the mafft command. If your command is not in a loop (i.e. you know the value of the failing GENE), you can skip the above debugging step.

Once you know the failing input file, show us the first 20 lines of the file and also the output to grep -m15 z input.fa | grep -Ev "^>"

ADD REPLY • link 4.2 years ago by Ram 45k

0

Entering edit mode

Yes, I am using the following loop first in R to process all the files. This is the code that I am running first:

blastn.filelist <- system(paste0("ls BLAST_output/*",GENE,"*"), intern=T)


for (blastn.file in blastn.filelist) {

    ### create isolate name variable
    isolate.name <- gsub("BLAST_output/", "", blastn.file)
    isolate.name <- gsub(paste0(".",GENE,".blastn.txt"), "", isolate.name)


    ### read in blastn file of specific isolate

   # check if the file is NOT empty
   if (file.info(blastn.file)$size > 0) {

    blastn <- read.table(paste(blastn.file))
    names(blastn) <- c("gene", "scaffold", "identity", "align.length", "mismatches", "gap.open", "qstart", "qend", "sstart", "send", "evalue", "bitscore")

    ### check where the best hit is located
    best.hit.index <- which(blastn$bitscore == max(blastn$bitscore))

   if (length(best.hit.index) > 1) {
      print(paste("Warning two top hits in isolate",isolate.name))
      best.hit.index <- best.hit.index[1]
      }

    ### check whether the scaffold seq needs to be rev complemented
    rev.comp <- FALSE

    if (blastn[best.hit.index,"sstart"] > blastn[best.hit.index,"send"]) {
        rev.comp <- TRUE
        }

    ### Detect issues when padding would extend beyond scaffold lengths
   faidx <- read.table(paste0(DENOVOFOLDER,"/",isolate.name,"/scaffolds.fasta.fai"))

   # find scaff.length
   scaff.length <- faidx$V2[faidx$V1 == as.character(blastn$scaffold[best.hit.index])]

   # find whether sstart or ssend is the lower number, higher number, add padding
   lower.bound <- min(blastn$sstart[best.hit.index], blastn$send[best.hit.index]) - pad
   upper.bound <- max(blastn$sstart[best.hit.index], blastn$send[best.hit.index]) + pad


   # reset lower, upper bounds depending on whether these exceed the scaffold 
   if (lower.bound < 1) {   
      lower.bound <- 1
      }

   if (upper.bound > scaff.length) {   
      upper.bound <- scaff.length
      }

    ### extract a defined sequence based on the blast hit location
   system(paste0("samtools faidx ",DENOVOFOLDER,"/",isolate.name,"/scaffolds.fasta ",blastn[best.hit.index,"scaffold"],":",lower.bound,"-",upper.bound," > Alignment_output/temp.fa"))


    if (rev.comp) {

        # revcomp the produced sequence
        system("tail -n+2 Alignment_output/temp.fa > Alignment_output/temp.seqonly.fa")
        system("head -n 1 Alignment_output/temp.fa > Alignment_output/temp.nameonly.fa")

        system("tr -d '\n' < Alignment_output/temp.seqonly.fa | tr '[ATGCatgcNn]' '[TACGtacgNn]' | rev > Alignment_output/temp.seqonly.rev.fa")
        system("cat Alignment_output/temp.nameonly.fa Alignment_output/temp.seqonly.rev.fa > Alignment_output/temp.fa")

        }

    system(paste0("sed '1s/.*/>",isolate.name,"_",GENE,"/' Alignment_output/temp.fa > Alignment_output/",isolate.name,"_",GENE,".fa"))

    system("rm Alignment_output/temp*")

   }    
    }

# concatenate all individual fasta files and align these
system(paste0("cat Alignment_output/*_",GENE,".fa > Alignment_output/",GENE,".fa"))
system(paste0("cat blast.",GENE,".query.txt >> Alignment_output/",GENE,".fa"))

The GENE here stands for GENE= gene_id. How can I get that error you are talking about?

ADD REPLY • link 4.2 years ago by anikcropscience ▴ 270

0

Entering edit mode

I don't see GENE being assigned values, so I'm guessing there is yet another loop wrapping the code above. Write the value of GENE to stdout before running the mafft command so you know where it fails.

for GENE in list_of_strings {
    cat("Processing gene: ", GENE) ## The last of these printed will be the one that fails
    retval <- system(paste0("mafft ....."))
    if (retval == 0) { cat("Gene: ", GENE, "processed successfully") ## This won't appear for the GENE that fails }
    else { stop(cat("Stopping here just in case the shell exec did not already exit with an error. The gene that failed is: ", GENE))
}

ADD REPLY • link 4.2 years ago by Ram 45k

0

Entering edit mode

I have added the full script to my original post. Please have a look. It ran alright until the R-part. Also, when I see blastn in R, there is only one file outputted instead of 145. I think something is wrong with the naming.

ADD REPLY • link 4.2 years ago by anikcropscience ▴ 270

1

Entering edit mode

I still don't see your R code that runs mafft.

Side note: Your shell code is quite inefficient. Arrays are not great practice, especially when they're not absolutely required. See:

Your code:

DENOVOFOLDER=ST16CH_Denovo
    # Create Blast databases for all assemblies


     isolates=( $( ls $DENOVOFOLDER ) )

    # generate Blast DB + fasta index (fai) for each scaffolds.fasta file
        for isolatespades in ${isolates[@]}

Without arrays:

DENOVOFOLDER=ST16CH_Denovo
# Create Blast databases for all assemblies

# generate Blast DB + fasta index (fai) for each scaffolds.fasta file
for isolatespades in $DENOVOFOLDER/*

ADD REPLY • link 4.2 years ago by Ram 45k

0

Entering edit mode

Ok, here is the R-code that runs MAFFT

system(paste0("mafft --maxiterate 1000 Alignment_output/",GENE,".fa > ",GENE,".mafft.fa"))

ADD REPLY • link 4.2 years ago by anikcropscience ▴ 270

0

Entering edit mode

We're back to square one again. How does the value for the variable GENE get assigned? In this comment, you show your R code but not the MAFFT part. I asked for the full context and the edit you make has ALL the shell context and NO R context where you call MAFFT. I ask where is MAFFT now and you give me the initial one-liner. How do these pieces fit together?

ADD REPLY • link 4.2 years ago by Ram 45k

0

Entering edit mode

The variable GENE gets assigned before running the R-script like this:

GENE <- "gene_name"
DENOVOFOLDER <- "ST16CH_Denovo"
pad <- 2000

And then the whole R-script above runs and also the one liner above. I am using R in the command line as I am working on a Cluster. Do you think there is something missing? Because that is all the codes I have.

ADD REPLY • link 4.2 years ago by anikcropscience ▴ 270

0

Entering edit mode

It would have helped if you had added the 3 above lines in your R comment from earlier. If you're assigning a static value to GENE manually, debugging should be easier.

Let's say the value of GENE is actually "gene_name". You need to look at:

head -n 20 Alignment_output/gene_name.fa
grep -m15 z | grep -Ev "^>" Alignment_output/gene_name.fa

And see what's going on. Please run the shell commands I gave above and show us the output.

ADD REPLY • link 4.2 years ago by Ram 45k

0

Entering edit mode

Ok, Thank you very much. I will run the command and share the outputs as soon as I can.

ADD REPLY • link 4.2 years ago by anikcropscience ▴ 270

0

Entering edit mode

Hi Ram
Sorry for the delay. My vpn contract ended and had to renew it. So, I see the problem. After running the code system(paste0("cat Alignment_output/*_",GENE,".fa > Alignment_output/",GENE,".fa")) I am supposed to get each fasta sequence in a new line. But for some samples, I see that the fasta indicator is appended to the end of the previous sequence. Please have a look below. That is why MAFFT is detecting a 'z' I guess? Do you have any solution to avoid this formatting in the output file?

>a12_3B_12_Zt09_7_00581
ACACGGCGCAGCACCATCAGTCACCACCACGCAAC.....CAGTAT>a12_3B_14_Zt09_7_00581
ACACGGCGCAGCACCATCAGTCACCACCACGCAACGAGACACCACTC....CCATTG>a12_3B_15_Zt09_7_00581
ACACGGCGCAGCACCATCAGTCACCACCACGCAACGAGACACCACTCGGTCAACATGCAG

ADD REPLY • link 4.2 years ago by anikcropscience ▴ 270

0

Entering edit mode

There are probably a few files that do not have a new line character before the End of File character (the last visible character is a sequence character for these files). You'd need to probably use some sed-foo to ensure that all > character are preceded by a new line. Instead of redirecting directly from cat, pipe it to a sed that replaces all >s that are preceded by a non-new-line character with a > preceded by a \n while retaining the previous preceding character.

I did some quick testing and this worked:

echo '>a12_3B_12_Zt09_7_00581
ACACGGCGCAGCACCATCAGTCACCACCACGCAAC.....CAGTAT>a12_3B_14_Zt09_7_00581
ACACGGCGCAGCACCATCAGTCACCACCACGCAACGAGACACCACTC....CCATTG>a12_3B_15_Zt09_7_00581
ACACGGCGCAGCACCATCAGTCACCACCACGCAACGAGACACCACTCGGTCAACATGCAG' |\
sed -Ee 's/([^\n])>/\1\n>/g'

>a12_3B_12_Zt09_7_00581
ACACGGCGCAGCACCATCAGTCACCACCACGCAAC.....CAGTAT
>a12_3B_14_Zt09_7_00581
ACACGGCGCAGCACCATCAGTCACCACCACGCAACGAGACACCACTC....CCATTG
>a12_3B_15_Zt09_7_00581
ACACGGCGCAGCACCATCAGTCACCACCACGCAACGAGACACCACTCGGTCAACATGCAG

You may need to adjust the sed flags based on your distribution. the -Ee should work with gnu-sed but you may need to use -re for a different sed distribution.

ADD REPLY • link 4.2 years ago by Ram 45k

0

Entering edit mode

Ok thanks a lot. I used the system(paste0("printf \"\n\" >> Alignment_output/temp.fa")) after the samtools command and the last command in the revcomp loop in the R-script. It solved the problem by this way.

ADD REPLY • link 4.2 years ago by anikcropscience ▴ 270

0

Entering edit mode

What does your system() command look like now, after having added --anysymbol? I am guessing you added --anysymbol right before that instance of > in your command.

Maybe try something like this:

system(paste0("mafft --anysymbol --maxiterate 1000 Alignment_output/", GENE, ".fa > ", GENE, ".mafft.fa"))

Also, since 'GENE' is referring to FASTA files, shouldn't you remove that .fa you're adding via paste0()?

ADD REPLY • link updated 4.2 years ago by Ram 45k • written 4.2 years ago by Dunois ★ 2.9k

0

Entering edit mode

It looks like the following:

system(paste0("mafft --anysymbol --maxiterate 1000 Alignment_output/",GENE,".fa > ",GENE,".mafft.fa"))

GENE here refers to a gene name, not any sequence.

ADD REPLY • link 4.2 years ago by anikcropscience ▴ 270