
How can I obtain protein-coding sequences from an assembled genome/exome dataset?


4.6 years ago

DNAngel ▴ 250

I use bwa-mem to assemble my genome and exome datasets so that I can work with just the CDS of my various species. So far, though, I have only been able to do this one CDS at a time, using individual CDS reference sequences from different reference species.

Of course, this is not feasible when I want to explore the entire genomic/exonic dataset and test for selection on all the protein-coding genes in my species. I am not sure how to assemble my raw single-end reads in this case: should I download all the CDS sequences for the specific reference species and run them all in one file? My custom script ends by producing a single MSA file when I use a single CDS as the reference, so would this produce one giant MSA? I would then have to run various models on each gene individually, or BLAST them, so I need individual MSAs.

Any advice on how to do this most efficiently? End goal: obtain MSAs for all protein-coding genes in my genomic/exonic datasets so that I can run various models testing for selection pressure on each gene.
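To make the "individual MSAs" part concrete, here is a minimal sketch of the regrouping step I have in mind. It assumes (this is not what my current script does) that each species ends up with one consensus FASTA containing one record per reference CDS, with the record ID being the CDS name. The function names and the file layout are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def read_fasta(path):
    """Yield (record_id, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the first word of the header as the record ID.
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_gene(sample_fastas, out_dir):
    """Regroup per-sample consensus FASTAs into one FASTA per gene.

    sample_fastas: dict of sample name -> consensus FASTA path
                   (assumed layout: one record per reference CDS).
    Returns a dict of gene name -> path of the per-gene FASTA written.
    """
    per_gene = defaultdict(list)
    for sample, path in sample_fastas.items():
        for gene, seq in read_fasta(path):
            per_gene[gene].append((sample, seq))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for gene, records in per_gene.items():
        # Assumes gene IDs are safe to use as file names.
        out_path = out_dir / f"{gene}.fasta"
        with open(out_path, "w") as out:
            for sample, seq in records:
                out.write(f">{sample}\n{seq}\n")
        written[gene] = out_path
    return written
```

Each per-gene file would then hold one sequence per species, which is the unit I would need to align (if not already in reference coordinates) and pass to the selection models one gene at a time.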

PAML bwa • 774 views
