How To Do Alignment, Stop Codon Removal And dN/dS Calculation In One Go?

9

12.3 years ago

Naren ★ 1.0k

I have over 1000 files, each containing 30 sequences. Manually aligning them, removing stop codons and then calculating the average dN/dS for each file is not feasible for me.
Is there a way to do this from the command line?
(I know PAML, but I don't know of a tool that produces alignments in PAML format or one that removes stop codons.)
Even three different tools, one per step, would do; I just need to be able to run everything from the command prompt.

(I'm on Win7)

Thanks in advance.

paml • 18k views

ADD COMMENT • link updated 3.0 years ago by sunnykevin97 ▴ 1000 • written 12.3 years ago by Naren ★ 1.0k

1

What aligner would you like to use? Most, if not all, have a command-line interface. Deleting the stop codons afterwards should be "trivial". For example, Biopython has wrappers for most aligners and will run PAML as well. However, please keep in mind that dN/dS calculations are (obviously) very dependent on a good alignment. The big downside of this automated approach is that you will probably not quality-check each alignment before moving on.
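To show what "trivial" could look like, here is a minimal sketch in plain Python (no Biopython needed). It assumes the alignment is already in frame (codon 1 starts at column 1) and uses the standard genetic code; the function names `read_fasta` and `mask_stop_codons` are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code only


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def mask_stop_codons(seq, replacement="---"):
    """Replace every stop codon found in frame 0 with `replacement`."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join(
        replacement if codon.upper().replace("U", "T") in STOP_CODONS else codon
        for codon in codons
    )


if __name__ == "__main__":
    demo = ">seq1\nATGAAATAG\nGGGTGA\n"
    for name, seq in read_fasta(demo):
        print(f">{name}\n{mask_stop_codons(seq)}")
```

Because only frame 0 is inspected, a stop-like triplet that straddles two codons is left alone, which is what you want for a codon alignment. It will not help if the alignment itself is out of frame.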

ADD REPLY • link 12.3 years ago by Whetting ★ 1.6k

0

Hello, I have a similar problem. I have a large dataset and need to delete the stop codons from the sequences. Can you give me some suggestions?

Thanks!

ADD REPLY • link 8.8 years ago by wangqingqing • 0

7

5.5 years ago

Vincent Ranwez ▴ 90

Hello,

the MACSE_V2 toolkit provides several tools to deal with nucleotide coding sequences. The alignSequences subprogram of MACSE builds reliable codon alignments even in the presence of frameshifts or stop codons (especially useful for dN/dS analysis and pseudogene analysis). Moreover, this subprogram can handle sequences that use different genetic codes. MACSE also includes a subprogram specifically designed to replace stop codons (and frameshift codons) in an alignment. This subprogram (exportAlignment) lets you specify the codon (three letters of your choice) that will replace the stop codons. You can even provide two different replacement codons: one for stops appearing within the sequence (unexpected except in pseudogenes) and one for stop codons at the end of the sequences. While there are several options (e.g. to specify the output file name and the genetic code to use), the basic usage is quite straightforward:

java -jar macse.jar -prog exportAlignment -align align.fasta -codonForFinalStop --- -codonForInternalStop NNN

To ease the alignment of coding nucleotide sequences, we also provide ready-to-use alignment pipelines (provided as a Singularity container), which include optional filtering steps. These pipelines output the (filtered) nucleotide alignment, the corresponding (filtered) amino acid alignment and the details of the filtering steps (if any filtering steps were selected).
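For the "1000 files" part of the question, the exportAlignment command above can be looped over a folder. The following is only a sketch: it uses exactly the flags shown in the command above and nothing else, and the folder name, the `*.fasta` pattern, the jar location and the helper names are assumptions for illustration. It works the same way on Windows and Linux as long as `java` is on the PATH.

```python
import subprocess
from pathlib import Path


def macse_export_command(alignment, jar="macse.jar"):
    """Build the exportAlignment command line for one alignment file."""
    return [
        "java", "-jar", str(jar),
        "-prog", "exportAlignment",
        "-align", str(alignment),
        "-codonForFinalStop", "---",
        "-codonForInternalStop", "NNN",
    ]


def run_all(folder, pattern="*.fasta", jar="macse.jar", dry_run=True):
    """Run (or, with dry_run=True, just print) the command for every matching file."""
    commands = [macse_export_command(path, jar)
                for path in sorted(Path(folder).glob(pattern))]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop at the first file MACSE fails on
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "alignments" is a hypothetical folder holding one codon alignment per gene
    run_all("alignments", dry_run=True)
```

Starting with `dry_run=True` lets you inspect the generated commands before launching a thousand Java processes.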

ADD COMMENT • link 5.5 years ago by Vincent Ranwez ▴ 90

0

I tried the above command and got this error:

java.lang.StringIndexOutOfBoundsException: String index out of range: 1002

How do I solve this error?

ADD REPLY • link 3.0 years ago by Ramana • 0

2

You have to use the nucleotide (CDS) alignment file as input.
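A quick pre-flight check can catch a protein file being passed by mistake before MACSE is started. This is a rough sketch, not part of MACSE: the function name and the list of accepted characters are my own assumptions (in particular, `!` is included because, as far as I know, MACSE marks frameshifts with it).

```python
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
GAP_CHARS = set("-.?!")  # assumption: '!' is MACSE's frameshift symbol


def check_cds_alignment(records):
    """Return a list of problems for (name, sequence) pairs; empty means OK."""
    problems = []
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    for name, seq in records:
        # Letters such as E, F, I, L, P, Q only occur in protein sequences
        odd = sorted(set(seq.upper()) - IUPAC_NUCLEOTIDES - GAP_CHARS)
        if odd:
            problems.append(f"{name}: non-nucleotide characters {''.join(odd)}")
        if len(seq) % 3:
            problems.append(f"{name}: length {len(seq)} is not a multiple of 3")
    return problems


if __name__ == "__main__":
    for problem in check_cds_alignment([("seq1", "ATGAAA"), ("seq2", "MKLEF")]):
        print(problem)
```

An empty list does not prove the file is a valid codon alignment; it only rules out the most obvious mix-ups.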

ADD REPLY • link 3.0 years ago by sunnykevin97 ▴ 1000

6

12.3 years ago

jprmachado ▴ 80

Hi,

A while ago I had the same problem. I solved it using the following Perl script:

	ReplaceStopsWithGaps.pl is a perlscript written by Joseph Hughes, University of Glasgow

	use this to remove stop codons from an alignment
	typically, this would be done to calculate dN/dS in HYPHY
	Usage:
	perl ../Scripts/ReplaceStopWithGaps.pl -pep 104D5_pep.fasta -nuc 104D5.fasta -output 104D5_nostop.fasta

	use this to replace stop codons from the nucleotide alignment
	the nucleotide and the peptide alignments are necessary


	#!/usr/bin/perl -w
	#
	# use this to remove stop codons from an alignment
	# typically, this would be done to calculate dN/dS in HYPHY
	# Usage: perl ../Scripts/ReplaceStopWithGaps.pl -pep 104D5_pep.fasta -nuc 104D5.fasta -output 104D5_nostop.fasta
	# use this to replace stop codons from the nucleotide alignment
	# the nucleotide and the peptide alignments are necessary

	use strict;
	use Getopt::Long;
	use Bio::Seq;
	use Bio::SeqIO;

	my ($inpep, $innuc, $output, $i, %stop);
	# GetOptions needs references to the variables it fills in
	&GetOptions(
	    'pep:s'    => \$inpep,   # peptide alignment (fasta)
	    'nuc:s'    => \$innuc,   # nucleotide alignment (fasta)
	    'output:s' => \$output,  # output file with stop codons replaced by gaps
	);

	my $pep = Bio::SeqIO->new(-file => "$inpep",   '-format' => 'fasta');
	my $nuc = Bio::SeqIO->new(-file => "$innuc",   '-format' => 'fasta');
	my $out = Bio::SeqIO->new(-file => ">$output", '-format' => 'fasta');

	# Pass 1: record the (0-based) position of every '*' in each peptide sequence
	while ( my $pepseq = $pep->next_seq() ) {
	    my $pep_str = uc($pepseq->seq);
	    if ($pep_str =~ /\*/) {
	        my $pep_id = $pepseq->id();
	        my @aa = split(//, uc($pepseq->seq));
	        for ($i = 0; $i < scalar(@aa); $i++) {
	            if ($aa[$i] =~ /\*/) {
	                $stop{$pep_id}{$i}++;
	                print "$pep_id peptide sequence has a stop $aa[$i] at ".($i+1)."\n";
	            }
	        }
	    }
	}

	# Pass 2: replace the matching codon in the nucleotide sequence with '---'.
	# Sequences are matched by id, so both files must use identical ids.
	while (my $nucseq = $nuc->next_seq()) {
	    my $nuc_id  = $nucseq->id();
	    my $nuc_str = uc($nucseq->seq);
	    foreach my $pid (keys %stop) {
	        if ("$nuc_id" eq "$pid") {
	            foreach my $site (keys %{$stop{$pid}}) {
	                #print "match $nuc_id and $pid\n";
	                #print "The sequence for $nuc_id is \n$nuc_str\n";
	                my $nucpos = $site * 3;   # peptide column -> first base of the codon
	                my $codon  = substr $nuc_str, $nucpos, 3;
	                print "$codon ";
	                # TAA, TAG, TAR or TGA (T may also be written as U)
	                if ($codon =~ /(((U|T)A(A|G|R))|((T|U)GA))/i) {
	                    substr($nuc_str, $nucpos, 3) = '---';
	                    print "=> Match to a stop codon at nucleotide position ".($nucpos+1)."\nNew sequence for $nuc_id\n$nuc_str\n";
	                } else {
	                    print "Doesn't seem to match a stop codon at nucleotide position ".($nucpos+1)." in $nuc_id\n";
	                }
	            }
	        }
	    }
	    my $newseq = Bio::Seq->new(-seq => "$nuc_str",
	                               -display_id => $nuc_id);
	    $out->write_seq($newseq);
	}


Since the script needs both the nucleotide and the amino acid alignments as input, I used T-Coffee to translate.
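If T-Coffee is not at hand, the peptide file can also be produced with a few lines of Python. This is a sketch under two assumptions: the standard genetic code, and a nucleotide alignment that is already in frame, so that peptide column i corresponds to nucleotide columns 3i to 3i+2 (which is what the Perl script relies on when it computes `$site*3`). The function name `translate_aligned` is made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_aligned(seq):
    """Translate an in-frame, gapped nucleotide sequence codon by codon.

    '---' becomes '-', stop codons become '*', and anything that cannot be
    translated (ambiguity codes, partially gapped codons) becomes 'X'.
    A trailing incomplete codon is ignored.
    """
    seq = seq.upper().replace("U", "T")
    peptide = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon == "---":
            peptide.append("-")
        else:
            peptide.append(CODON_TABLE.get(codon, "X"))
    return "".join(peptide)


if __name__ == "__main__":
    print(translate_aligned("ATG---TAAGGG"))
```

Keeping one peptide character per codon, gaps included, is the important part: if the peptide file is ungapped while the nucleotide file is aligned, the positions no longer line up.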

This has worked fine for me. I did it on Linux; on Windows you may need to write a .bat file to run it easily. Take a look at a tutorial on .bat files for the syntax if you are not familiar with it. I think that will work.

You can use FASTA format for the PAML sequence file; there is no need for the .pml format.

Regards,

Joao

ADD COMMENT • link 12.3 years ago by jprmachado ▴ 80

0

Thanks so much!!

ADD REPLY • link 10.2 years ago by tlorin ▴ 370

0

Hi @jprmachado

I am working on dN/dS for 4431 gene clusters. I tried to remove stop codons with the pal2nal program, but for some reason around 198 clusters still have stop codons.

I have now tried to remove the stop codons with the above script; I don't get any error, but the stop codons are not removed.

Any suggestions? Thank you.

ADD REPLY • link 8.3 years ago by krp0001 ▴ 40

0

I tried it with a dummy data set and it worked, but with the actual data set it does not work when I run the dN/dS calculation.

ADD REPLY • link 5.0 years ago by 1769mkc ★ 1.3k

0

Hi, I'm also facing the same problem: I am unable to get rid of the stop codons in my dataset.

ADD REPLY • link 3.1 years ago by sunnykevin97 ▴ 1000

5

12.3 years ago

SES 8.6k

Pal2Nal will generate a codon alignment without stop codons, given a MSA of proteins and the corresponding DNA sequences. If the input is a pairwise alignment, I believe it will calculate dN/dS ratios (using PAML) for you automatically. Otherwise, you can just input your alignments to PAML to calculate dN/dS. Pal2Nal is written in Perl, so it should work on your Win7 machine (I don't know about PAML though, unless there is a Windows version available).
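To make the idea concrete, here is a small Python sketch of the general technique (threading each unaligned coding sequence onto its aligned protein, one codon per residue and `---` per gap). It is an illustration only, not Pal2Nal's actual code: the function name is hypothetical, the standard stop codons are assumed, and unlike a real tool it does not check that each codon actually encodes the residue it is paired with.

```python
def thread_codons(aligned_protein, cds, stops=("TAA", "TAG", "TGA")):
    """Build a codon-aligned nucleotide sequence from a gapped protein and its CDS.

    A terminal stop codon in the CDS is dropped; the number of remaining
    codons must equal the number of non-gap residues in the protein.
    """
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if codons and codons[-1] in stops:
        codons.pop()  # remove the terminal stop codon

    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError(
            f"{n_residues} residues in the protein but {len(codons)} codons in the CDS"
        )

    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining)
                   for aa in aligned_protein)


if __name__ == "__main__":
    print(thread_codons("M-KG", "ATGAAAGGGTAA"))
```

The length check is worth keeping in any variant of this: a mismatch between protein and CDS is the usual reason a codon alignment comes out shifted or still contains stops.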

ADD COMMENT • link 12.3 years ago by SES 8.6k

1

Should the nucleotide alignment be trimmed? And should the protein alignment be trimmed as well?

ADD REPLY • link 9.0 years ago by lilepisorus ▴ 40
