Question

DNA multialignment and conservation profile

8.5 years ago

Kevin D ▴ 30

Hello everyone,

I would like to identify the most conserved (and, conversely, the most variable) sites in a multiple alignment of genomic DNA sequences, together with annotation information.

My final goal is to reconstruct the phylogenetic tree of a plant genus containing around 100 species. I have genomic data (draft genomes) for only 15 of them. I selected about 1000 genes shared by all 15, and I now want to identify variable regions flanked by conserved ones in those 1000 genes, on the assumption that these regions also exist in all 100 species. I'm interested in variable regions flanked by conserved ones because I will design primers in the conserved parts and then amplify those regions in the species that have no sequenced genome. I expect exons to be more conserved than introns, which is why I need annotation information on the multiple alignments. For instance, the output could be a multiple alignment with an annotation layer linked to a conservation profile.

So my question is: do you know of any software (ideally command-line) that can perform this task?

Any suggestions will be appreciated.

Regards

Kevin

alignment genome • 2.3k views

ADD COMMENT • link 8.5 years ago by Kevin D ▴ 30

I think this is a multi-step project. You can initially "measure" conservation and variability by %GC in the contigs and/or by looking for SNPs (bowtie2, samtools), but you need the reads for that.

Map reads to contigs (draft genomes) <bowtie2>
Count mapped reads per annotated gene <HTSeq>
Search for SNPs <samtools>
%GC per contig <perl or R>

That is where I would start.
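
For the %GC step, a minimal sketch in Python (instead of perl or R) could look like the one below. The function name `gc_per_contig` is made up for illustration, and it assumes plain FASTA input where ambiguous bases such as N are left out of the denominator.

```python
from typing import Dict, Iterable, List


def gc_per_contig(fasta_lines: Iterable[str]) -> Dict[str, float]:
    """Return %GC for each record found in FASTA-formatted lines.

    Assumption: only A, C, G and T are counted, so N and other
    ambiguity codes do not affect the percentage.
    """
    counts: Dict[str, List[int]] = {}  # name -> [gc_count, acgt_count]
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = [0, 0]
        elif name is not None:
            seq = line.upper()
            gc = seq.count("G") + seq.count("C")
            at = seq.count("A") + seq.count("T")
            counts[name][0] += gc
            counts[name][1] += gc + at
    return {
        n: (100.0 * gc / total if total else 0.0)
        for n, (gc, total) in counts.items()
    }
```

Usage would be something like `gc_per_contig(open("contigs.fasta"))`, where `contigs.fasta` is a placeholder file name.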

ADD REPLY • link 8.5 years ago by Buffo ★ 2.4k

Thanks for your reply. Actually, I've already obtained the sequences of the 15 species for each of the 1000 genes, so by aligning them gene by gene I would get an idea of where the SNPs/indels are, but this would only be "visual". What I need is a two-column table for each gene:

column 1: position in the multiple DNA sequence alignment
column 2: an index giving the conservation score at this position

I could then implement an algorithm that searches the alignment for the best region to amplify (i.e. a variable area flanked by two conserved regions), based on the conservation score table. Having the annotation would be a plus, but that could be done later.

Regards
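
To make the idea concrete, here is a rough sketch of such a table and of the region search in Python. Everything in it is an assumption made for illustration: the names `conservation_table` and `candidate_regions`, the choice of score (the fraction of sequences carrying the most common character in a column, with gaps counted as a character so that indels lower the score), and the parameters `flank`, `min_score` and `max_len`. The input is one gene's alignment, already read into equal-length strings.

```python
from collections import Counter
from typing import List, Tuple


def conservation_table(aligned: List[str]) -> List[Tuple[int, float]]:
    """Return (1-based position, score) for every alignment column.

    Score = share of sequences with the most common character in the
    column; 1.0 means the column is identical in all sequences.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    table = []
    for i in range(length):
        column = [seq[i].upper() for seq in aligned]
        top_count = Counter(column).most_common(1)[0][1]
        table.append((i + 1, top_count / len(column)))
    return table


def candidate_regions(
    table: List[Tuple[int, float]],
    flank: int = 20,
    min_score: float = 1.0,
    max_len: int = 800,
) -> List[Tuple[int, int]]:
    """Return (start, end) of regions lying between two conserved flanks.

    A flank is a run of at least `flank` consecutive columns scoring
    >= min_score. Coordinates are 1-based and inclusive, and regions
    longer than `max_len` columns are skipped.
    """
    conserved = [score >= min_score for _, score in table]
    runs = []  # 0-based (first, last) index of each long conserved run
    start = None
    for i, is_conserved in enumerate(conserved + [False]):
        if is_conserved and start is None:
            start = i
        elif not is_conserved and start is not None:
            if i - start >= flank:
                runs.append((start, i - 1))
            start = None
    regions = []
    for (_, left_end), (right_start, _) in zip(runs, runs[1:]):
        if right_start - left_end - 1 <= max_len:
            regions.append((left_end + 2, right_start))
    return regions
```

The flanks would need to be at least as long as a primer, and relaxing `min_score` slightly below 1.0 tolerates the occasional mismatch in the primer site. Other scores, such as Shannon entropy per column, could be swapped in without changing the region search.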

ADD REPLY • link 8.5 years ago by Kevin D ▴ 30

You can make that table:

1.- Compare your genomes using nucmer
2.- Align reads to the genome (.sam file)
3.- Sort the .sam file into a .bam file and index it (samtools)
4.- You can use depth of coverage as a "measure" of conservation.
5.- Merge the results, even in an Excel file, and that's it, you will get your table.

I'm not an expert, and I don't know whether a program exists that does this in one step (I don't think so), but that is how I would do it. Good luck.
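
For step 4, the per-position depth can be pulled into a table with a few lines of Python rather than Excel. This is only a sketch: it assumes the depth values come as tab-separated text with reference name, 1-based position and depth in the first three columns (the layout that `samtools depth` writes for a single BAM file), and the function name `depth_table` is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def depth_table(lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Group (position, depth) pairs by reference name.

    Assumption: each line holds reference name, 1-based position and
    depth, separated by tabs. Lines with fewer than three fields are
    skipped.
    """
    table: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        contig, pos, depth = fields[0], int(fields[1]), int(fields[2])
        table[contig].append((pos, depth))
    return dict(table)
```

Each value in the returned dictionary is then a two-column table (position, depth) for one gene or contig.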

ADD REPLY • link 8.5 years ago by Buffo ★ 2.4k

Thank you for your interesting ideas. I didn't know about nucmer, it sounds nice. I'll try it!

ADD REPLY • link 8.5 years ago by Kevin D ▴ 30