Aligning RNA-Seq reads against a newly assembled genome involves numerous issues and difficulties that we don't normally meet when analyzing data from model organisms.
For instance, many genome regions can be repeated across different contigs and scaffolds, reducing the number of unique regions. These duplicated regions are the result of poor genome assembly and are distinct from the repeated regions we normally see in complete genomes.
This in turn dramatically increases the number of multimapped reads in RNA-Seq analyses (to more than 80-90% in some genomes).
I would be really happy to hear how colleagues handle these issues during alignment and, most importantly, in downstream analyses (e.g. expression quantification).
How do you (or would you) take into account multimapped reads in gene (or exon or transcript) expression?
Do you (or would you) perform pairwise alignments between the contigs/scaffolds to check for non-redundancy?
What measures do you (or would you) take if your genome is highly repeated?
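To make the multimapped-read question concrete, one option I have in mind is fractional counting, where a read that hits n distinct genes contributes 1/n to each of them. Below is a minimal sketch of that idea only. It assumes the alignments have already been reduced to (read_id, gene_id) pairs; the function name and the input format are hypothetical and not taken from any particular tool.

```python
from collections import defaultdict

def fractional_gene_counts(hits):
    """Distribute each read equally over the distinct genes it maps to.

    hits: iterable of (read_id, gene_id) pairs, one per reported alignment
    (hypothetical input format, assumed for illustration).
    """
    # Collect the set of distinct genes hit by each read.
    genes_per_read = defaultdict(set)
    for read_id, gene_id in hits:
        genes_per_read[read_id].add(gene_id)

    # Each read carries a total weight of 1, split across its genes.
    counts = defaultdict(float)
    for genes in genes_per_read.values():
        weight = 1.0 / len(genes)
        for gene in genes:
            counts[gene] += weight
    return dict(counts)

# Toy data: r1 is unique, r2 hits two genes, r3 hits two loci of the same gene.
example_hits = [
    ("r1", "geneA"),
    ("r2", "geneA"), ("r2", "geneB"),
    ("r3", "geneB"), ("r3", "geneB"),
]
counts = fractional_gene_counts(example_hits)
```

Note that a read mapping to several copies of the same gene (as can happen with redundant contigs) still counts once for that gene here, which may or may not be the right choice for a fragmented assembly.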
Thank you all,