Question: DNA multialignment and conservation profile
3.0 years ago by
Kevin D30
INRA France
Kevin D30 wrote:

Hello everyone,

I would like to identify the most conserved (and, conversely, the most variable) sites in a multiple alignment of genomic DNA sequences that carries annotation information.

My final goal is to reconstruct the phylogenetic tree of a plant genus containing around 100 species. I have genomic data (draft genomes) for only 15 of them. I selected about 1000 genes shared by all 15, and now I want to identify variable regions flanked by conserved ones in those 1000 genes. I will then assume that those regions also exist in the other species of the genus. I'm interested in variable regions flanked by conserved ones because I will design primers in the conserved parts and then amplify those regions in the species that have no sequenced genome. I expect exons to be more conserved than introns, which is why I need annotation information on the multiple alignments. For instance, the output could be a multiple alignment with an annotation layer linked to a conservation profile.

So my question is: do you know of any software (ideally command-line) that can perform this task?

Any suggestions will be appreciated.



alignment genome • 973 views
ADD COMMENTlink written 3.0 years ago by Kevin D30

I think this is a multi-step project. You can initially "measure" conservation and variability by %GC in contigs and/or by looking for SNPs (bowtie2, samtools), but you need the reads.

Map reads to contigs (draft genomes) <bowtie2>
Count mapped reads per annotated gene <HTSeq>
Search for SNPs <samtools>
%GC per contig <perl or R>

I would start with that.
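For the %GC step, a minimal sketch in Python (rather than the perl or R mentioned above) could look like this. The function names `read_fasta` and `gc_percent` are made up for illustration, and the choice to ignore N and other ambiguity codes in the denominator is an assumption, not something stated in this thread.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def gc_percent(seq: str) -> float:
    """Percent G+C among unambiguous bases (assumption: N etc. are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt
```

With a real file, something like `read_fasta(open("contigs.fa"))` (hypothetical file name) followed by `gc_percent` on each sequence would give one value per contig.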

ADD REPLYlink written 3.0 years ago by Buffo1.8k

Thanks for your reply. Actually, I have already obtained the sequences of the 15 species for each of the 1000 genes, so by aligning them gene by gene I would get an idea of where the SNPs/indels are, but this would only be "visual".

What I need is a 2-column table for each gene with:
column 1: position in the multiple DNA sequence alignment
column 2: an index showing the conservation score for this position

Then I could implement an algorithm that searches the alignment for the best region to amplify (i.e. a variable area flanked by 2 conserved regions), based on the conservation score table. Having the annotation would be a plus, but that could be done later.

Regards
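The table and the region search described here can be sketched in Python under some loudly stated assumptions. The conservation index used below (fraction of sequences sharing the most common character in a column, with gaps counted as a character) is only one possible choice, and the function names, window sizes and thresholds are all hypothetical. The input is assumed to be an already aligned set of equal-length sequences for one gene.

```python
from collections import Counter
from typing import List, Tuple


def conservation_scores(alignment: List[str]) -> List[float]:
    """Per-column score: fraction of sequences with the most common character."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        counts = Counter(char.upper() for char in column)
        scores.append(counts.most_common(1)[0][1] / len(column))
    return scores


def score_table(alignment: List[str]) -> List[Tuple[int, float]]:
    """Two-column table: (1-based alignment position, conservation score)."""
    return list(enumerate(conservation_scores(alignment), start=1))


def find_candidate_regions(
    scores: List[float],
    flank: int = 20,          # assumed primer-site length
    min_inner: int = 50,      # assumed shortest variable region
    max_inner: int = 400,     # assumed longest variable region
    flank_min: float = 0.95,  # flanks need a mean score at least this high
    inner_max: float = 0.85,  # inner region needs a mean score at most this
) -> List[Tuple[int, int, int, int]]:
    """Return 0-based half-open hits (left_start, inner_start, inner_end, right_end)."""
    if flank < 1 or min_inner < 1:
        raise ValueError("flank and min_inner must be at least 1")
    n = len(scores)
    # Prefix sums make the mean of any window an O(1) lookup.
    prefix = [0.0]
    for score in scores:
        prefix.append(prefix[-1] + score)

    def mean(start: int, end: int) -> float:
        return (prefix[end] - prefix[start]) / (end - start)

    hits = []
    for left in range(0, n - 2 * flank - min_inner + 1):
        if mean(left, left + flank) < flank_min:
            continue
        inner_start = left + flank
        for inner_len in range(min_inner, max_inner + 1):
            inner_end = inner_start + inner_len
            right_end = inner_end + flank
            if right_end > n:
                break
            if (mean(inner_end, right_end) >= flank_min
                    and mean(inner_start, inner_end) <= inner_max):
                hits.append((left, inner_start, inner_end, right_end))
    return hits
```

Overlapping hits are all reported, so a real run would probably need a ranking or merging step before primer design. An entropy-based score could replace the majority fraction without changing the search.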

ADD REPLYlink written 3.0 years ago by Kevin D30

You can make that table:

1.- Compare your genomes using nucmer
2.- Align reads to the genome (.sam file)
3.- Convert the .sam file to .bam, then sort and index it (samtools)
4.- You can use depth of coverage as a "measure" of conservation.
5.- Combine the results, even in an Excel file, and that's it: you will get your table.
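As a rough illustration of steps 4 and 5, the sketch below summarises per-position depth into fixed windows. It assumes the depth values come from `samtools depth -a` on a single BAM file, whose output is tab-separated lines of contig, 1-based position and depth. The function name and the window size are hypothetical, and whether read depth is a good proxy for conservation is left open here.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def mean_depth_per_window(
    depth_lines: Iterable[str], window: int = 100
) -> List[Tuple[str, int, int, float]]:
    """Return (contig, window_start, window_end, mean_depth), 1-based inclusive."""
    if window < 1:
        raise ValueError("window must be at least 1")
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        contig, pos, depth = fields[0], int(fields[1]), int(fields[2])
        key = (contig, (pos - 1) // window)
        sums[key] += depth
        counts[key] += 1
    rows = []
    for contig, index in sorted(sums):
        start = index * window + 1
        # Mean over the positions actually listed in the input for this window.
        mean = sums[(contig, index)] / counts[(contig, index)]
        rows.append((contig, start, start + window - 1, mean))
    return rows
```

The resulting rows can be written out as a tab-separated file and joined with other per-region results in a spreadsheet or in R.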

I'm not an expert, and I don't know whether a program exists that does it in one step (I don't think so), but that is how I would do it. Good luck.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Buffo1.8k

Thank you for your interesting ideas. I didn't know about nucmer, it sounds nice. I'll try it!

ADD REPLYlink written 3.0 years ago by Kevin D30