Question

How should I handle mapping to an incomplete genome with many homologous regions?

4.1 years ago

O.rka ▴ 710

I have a novel diatom. These are known to have fairly complex genomes because of secondary endosymbiosis.

I have the following data:

(1) A lot of short-read RNA-seq data
(2) A draft genome (I believe it is incomplete because a fair number of reads don't map to it)
(3) Gene models from Funannotate, built using (2), (1), and all diatom proteins as reference evidence
(4) Transcripts assembled from the reads that did not map to the genome/gene models

When I map the reads to the gene models, my count tables are very sparse, many reads are multimapped, and many reads don't map to the draft genome at all (imagine the most frustrating scenario). I ran BLAST, and many of the genes that share homologous regions are annotated with similar functions. I want to somehow "group" my gene models (i.e. high-quality transcripts supported by RNA and protein evidence) with the de-novo transcripts assembled from reads that did not map to the draft genome.

The tricky part is that I will need to do this iteratively whenever a new sequencing set comes out. I have 4 sequencing runs, and for each of them I did a coassembly of the unmapped reads to get a de-novo transcriptome.

My thoughts were to do the following:

i. Merge all of the transcripts (from the gene models and each of the de-novo assemblies)
ii. Run an all-vs-all BLAST on the transcripts from (i)
iii. Group the sequences that have hits to each other with % identity above some cutoff (I'm not sure what should be used here)

RNA-Seq • 567 views

updated 4.0 years ago by Biostar 20 • written 4.1 years ago by O.rka ▴ 710