DNA multialignment and conservation profile
7.3 years ago
Kevin D ▴ 30

Hello everyone,

I would like to identify the most conserved (and, conversely, the most variable) sites in a multiple alignment of genomic DNA sequences, together with annotation information.

My final goal is to reconstruct the phylogenetic tree of a plant genus containing around 100 species. I have genomic data (draft genomes) for only 15 of them. I selected about 1000 genes shared by all 15, and now I want to identify variable regions flanked by conserved ones in those 1000 genes, on the assumption that these regions also exist in the other species of the genus. I'm interested in variable regions flanked by conserved ones because I will then design primers in the conserved parts and amplify those regions in the species that have no sequenced genome. I expect exons to be more conserved than introns, which is why I need annotation information on the multiple alignments. For instance, the output could be a multiple alignment with an annotation layer linked to a conservation profile.

So my question is: do you know of any software (ideally command-line) that can perform this task?

Any suggestions will be appreciated.

Regards

Kevin

alignment genome • 1.9k views

I think this is a multi-step project. You can initially "measure" conservation and variability by %GC in contigs and/or by looking for SNPs (bowtie2, samtools), but you need the reads.

Map reads to contigs (draft genomes) <bowtie2>
Count mapped reads per annotated gene <HTSeq>
Search for SNPs <samtools>
%GC per contig <perl or R>

I would start with that.
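The %GC-per-contig step can be sketched in a few lines. This is only a minimal illustration, written in Python rather than the perl or R mentioned above; the function name `gc_per_contig` is made up for this example, and it assumes plain FASTA input where N and other ambiguous bases should be left out of the percentage.

```python
def gc_per_contig(fasta_lines):
    """Return {contig_name: %GC} from an iterable of FASTA lines.

    Assumption: only A, C, G and T are counted, so N and other
    ambiguity codes do not affect the percentage.
    """
    counts = {}  # contig name -> [gc_count, acgt_count]
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            counts[name] = [0, 0]
        elif name is not None:
            seq = line.upper()
            gc = seq.count("G") + seq.count("C")
            at = seq.count("A") + seq.count("T")
            counts[name][0] += gc
            counts[name][1] += gc + at
    return {
        contig: (100.0 * gc / total if total else 0.0)
        for contig, (gc, total) in counts.items()
    }
```

For a real file, something like `with open("contigs.fasta") as fh: gc = gc_per_contig(fh)` should work; the file name here is a placeholder.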


Thanks for your reply. Actually, I have already obtained the sequences of the 15 species for each of the 1000 genes, so by aligning them gene by gene I would get an idea of where the SNPs/indels are, but this would only be "visual". What I need is a 2-column table for each gene with:

column 1: position in the multiple DNA sequence alignment
column 2: an index showing the conservation score for this position

Then I could implement an algorithm that searches the alignment for the best region to amplify (i.e. a variable area flanked by 2 conserved regions), based on the conservation score table. Having the annotation would be a plus, but that could be done later. Regards
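To make the idea concrete, here is a rough sketch of both parts: a per-column conservation score and a search for variable stretches flanked by conserved ones. Everything in it is an assumption for illustration, not an existing tool: the score is 1 minus the normalised Shannon entropy of each column (gaps counted as a fifth state, N and other ambiguity codes ignored), and the function names, thresholds and window sizes are hypothetical and would need tuning for real primer design.

```python
import math
from collections import Counter

# Five states (A, C, G, T, gap) give a maximum entropy of log2(5) bits.
MAX_ENTROPY = math.log2(5)


def conservation_scores(aligned_seqs):
    """One score per alignment column: 1.0 = identical in all sequences."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in range(length):
        counts = Counter(s[col].upper() for s in aligned_seqs)
        used = [counts[c] for c in "ACGT-" if counts[c] > 0]
        total = sum(used)
        if total == 0:  # column holds only ambiguous characters
            scores.append(0.0)
            continue
        entropy = -sum((n / total) * math.log2(n / total) for n in used)
        scores.append(1.0 - entropy / MAX_ENTROPY)
    return scores


def conserved_runs(scores, threshold, min_len):
    """Return (start, end) of runs with score >= threshold, end exclusive."""
    runs, start = [], None
    # The trailing sentinel closes a run that reaches the last column.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


def candidate_regions(scores, threshold=0.9, flank_len=20,
                      min_var=50, max_var=500):
    """Variable stretches lying between two adjacent conserved runs.

    Returns (var_start, var_end, mean_score) tuples, 0-based and end
    exclusive, sorted so the most variable stretch comes first.
    """
    runs = conserved_runs(scores, threshold, flank_len)
    found = []
    for (_, left_end), (right_start, _) in zip(runs, runs[1:]):
        var_len = right_start - left_end
        if min_var <= var_len <= max_var:
            mean_score = sum(scores[left_end:right_start]) / var_len
            found.append((left_end, right_start, mean_score))
    return sorted(found, key=lambda region: region[2])
```

The 2-column table is then just `for pos, score in enumerate(scores, start=1): print(pos, score, sep="\t")`, and `candidate_regions(scores)` gives the stretches to inspect by eye before designing primers in the flanks.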


You can make that table:

1.- Compare your genomes using nucmer
2.- Align reads to the genome (.sam file)
3.- Convert the .sam file to .bam, then sort and index it (samtools)
4.- You can use depth of coverage as a "measure" of conservation.
5.- Merge the results, even in an Excel file, and that's it: you will get your table.

I'm not an expert, and I don't know whether a program exists that does this in one step (I don't think so), but that is how I would do it. Good luck.
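As a rough sketch of steps 4 and 5, the text output of `samtools depth` for a single BAM file (tab-separated contig, 1-based position, depth) can be turned into a per-position table. The function names below are made up for this example, and scaling by the maximum depth of each contig is just one possible choice; how well coverage reflects conservation would still have to be checked on real data.

```python
def depth_table(depth_lines):
    """Parse `samtools depth` lines into {contig: [(position, depth), ...]}.

    Assumption: a single BAM was given, so each line has three
    tab-separated fields (contig, 1-based position, depth).
    """
    table = {}
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        contig, pos, depth = fields[0], int(fields[1]), int(fields[2])
        table.setdefault(contig, []).append((pos, depth))
    return table


def relative_depth(rows):
    """Scale depths to 0..1 by dividing by the contig's maximum depth."""
    max_depth = max((depth for _, depth in rows), default=0)
    return [(pos, depth / max_depth if max_depth else 0.0)
            for pos, depth in rows]
```

The resulting rows can be written out as a 2-column table per gene or contig and merged with the other results.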


Thank you for your interesting ideas. I didn't know about nucmer, it sounds nice. I'll try it!
