Question

RNA-Seq technologies for genes with similar sequences

7.4 years ago

dktarathym ▴ 40

I was asked this in an assignment:

In the genome, some genes have sequences that are very very similar to others (e.g. there might be two genes whose sequences are 99% identical). What difficulties would that cause to the analysis of RNA-seq data?

I searched Google for it and also read some posts on seqanswers.com, but could not find a clear answer.

I am not sure, but is it related to alignment or to duplicated reads?

RNA-Seq • 2.1k views

ADD COMMENT • link updated 7.0 years ago by Biostar 20 • written 7.4 years ago by dktarathym ▴ 40

You are on the right track. In RNA-seq, the aim is often to quantify the expression of all genes. To approach the "real" expression level of a gene, the (normalized) number of reads which map to that gene is used as a proxy/estimate. So, what's then the problem for genes that are 99% identical?

Reply by WouterDeCoster · 7.4 years ago

So the 99% similar genes would give wrong estimates, because reads get mapped to the wrong one of the similar genes. That is, a read that actually belongs to gene A is mapped to gene B because of the 99% sequence similarity.

Reply by dktarathym · 7.4 years ago

Actually, the reads will most likely map equally well to both genes (or to more than two) and be reported as multi-mapped. As such, it will be impossible to reliably estimate the expression of those genes. In a typical analysis those reads are discarded during the read counting step and therefore...

Reply by WouterDeCoster · 7.4 years ago

score 7 · Answer 1 · 2016-12-03

This can be a large problem for RNA-seq. Its magnitude depends on how biologically relevant these genes are in a given context and on how the quantification is performed. This has nothing to do with alignment or duplication rates, but everything to do with multi-mapping.

For the sake of an example, let us suppose that the two similar genes are the human KCNJ12 and KCNJ18 genes, which can differ by as few as two bases (at least in the CDS).

Suppose one uses a classic analysis pipeline where an aligner like STAR is used first and featureCounts second. Reads arising from KCNJ12 will almost all align equally well to both it and KCNJ18; the same is the case for reads arising from KCNJ18. Since multimapping reads such as these are excluded from downstream analysis in classic pipelines, almost all of the signal from these genes will be ignored. As the power to detect differences depends on the number of reads counted for a gene, there will be essentially no power to determine whether either gene is differentially expressed.
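To make the effect concrete, here is a minimal sketch of unique-only counting. It is not how STAR or featureCounts are implemented; the `alignments` dictionary and the function name `count_unique_only` are hypothetical, and the read-to-gene assignments are made-up toy data.

```python
from collections import Counter

# Hypothetical toy data: read ID -> set of genes the read aligns to equally well.
alignments = {
    "read1": {"KCNJ12", "KCNJ18"},
    "read2": {"KCNJ12", "KCNJ18"},
    "read3": {"KCNJ12", "KCNJ18"},
    "read4": {"KCNJ12"},  # assumed to overlap one of the few differing bases
    "read5": {"GAPDH"},
    "read6": {"GAPDH"},
}

def count_unique_only(alignments):
    """Count reads per gene, dropping every read that aligns to more than one gene."""
    counts = Counter()
    for genes in alignments.values():
        if len(genes) == 1:
            counts[next(iter(genes))] += 1
    return counts

counts = count_unique_only(alignments)
# A Counter returns 0 for genes that never received a read.
print(counts["KCNJ12"], counts["KCNJ18"], counts["GAPDH"])
```

In this toy case four reads came from the KCNJ12/KCNJ18 pair, but only one of them survives counting, which is the loss of signal described above.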

If one used something like salmon or kallisto, then things would be a bit better. These use expectation-maximization (EM) to incorporate multimapping reads back into the counts. This isn't a panacea, of course, since a difference of a couple of bases isn't exactly much to go on, but the results are going to be as good as you can hope for.
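The EM idea can be illustrated with a very simplified sketch. This is not the actual salmon or kallisto algorithm: it ignores transcript lengths, bias models and everything else those tools handle, and the function name `em_abundance` and the read numbers are assumptions chosen for illustration. Each read is represented only by the set of genes it is compatible with.

```python
def em_abundance(reads, genes, n_iter=200):
    """Toy EM: estimate per-gene read counts from reads that may be ambiguous.

    reads: list of sets, each holding the genes a read is compatible with.
    """
    # Start by assuming all genes are equally abundant.
    theta = {g: 1.0 / len(genes) for g in genes}
    for _ in range(n_iter):
        expected = {g: 0.0 for g in genes}
        # E-step: split each read across its compatible genes,
        # in proportion to the current abundance estimates.
        for compatible in reads:
            total = sum(theta[g] for g in compatible)
            for g in compatible:
                expected[g] += theta[g] / total
        # M-step: new abundances are the normalized expected counts.
        theta = {g: expected[g] / len(reads) for g in genes}
    return {g: theta[g] * len(reads) for g in genes}

# Hypothetical toy data: 90 ambiguous reads, 8 unique to KCNJ12, 2 unique to KCNJ18.
reads = (
    [{"KCNJ12", "KCNJ18"}] * 90
    + [{"KCNJ12"}] * 8
    + [{"KCNJ18"}] * 2
)
estimates = em_abundance(reads, ["KCNJ12", "KCNJ18"])
print({g: round(c, 2) for g, c in estimates.items()})
```

Here the 90 ambiguous reads end up split roughly 4:1, following the ratio of the few uniquely assignable reads. That also shows the weakness: the whole estimate rests on 10 informative reads.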

So, if you're really interested in families of similar genes, then use something like salmon for quantification and be aware that statistical power won't be as high as for other genes.

score 6 · Answer 2 · 2016-12-05

Last year, there was a paper in Genome Biology about the problem of multi-mapping or ambiguous reads: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0734-x The authors suggest using a "two-stage analysis of RNA-Seq data in which multi-mapped or ambiguous reads can instead be uniquely assigned to groups of genes." In the case of KCNJ12 and KCNJ18, for example, this approach would put the two genes (or the regions that are highly similar) into a group, call it KCNJ12|KCNJ18, and then count the reads for this group. As Devon mentioned, many classical pipelines would discard these reads, and both genes might end up with 0 reads even though they are well expressed. There are other, more complicated ways to deal with multi-reads, but this approach is, in my opinion, very straightforward and intuitive to understand.
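A rough sketch of the grouping idea, not the paper's actual implementation: each read is counted under a label built from all the genes it maps to, so ambiguous reads land in a group such as KCNJ12|KCNJ18 instead of being thrown away. The `alignments` data and the function name `count_by_group` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: read ID -> set of genes the read aligns to equally well.
alignments = {
    "read1": {"KCNJ12", "KCNJ18"},
    "read2": {"KCNJ18", "KCNJ12"},
    "read3": {"KCNJ12", "KCNJ18"},
    "read4": {"KCNJ12"},
    "read5": {"GAPDH"},
    "read6": {"GAPDH"},
}

def count_by_group(alignments):
    """Count each read under a label made from the sorted genes it maps to."""
    counts = Counter()
    for genes in alignments.values():
        label = "|".join(sorted(genes))  # e.g. "KCNJ12|KCNJ18"
        counts[label] += 1
    return counts

group_counts = count_by_group(alignments)
for label, n in sorted(group_counts.items()):
    print(label, n)
```

The group count cannot say which of the two genes is expressed, but it does keep the evidence that at least one of them is.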