Hello all,
Today I was working on a Bowtie alignment of RNA-seq data from HepG2 cells against a human reference genome. I used RSEM to prepare the reference and build the Bowtie indexes, based on Ensembl release 75 (Feb. 2014). I created indexes for both the "all" and the "rm" (masked) genome sets and aligned my data against each. Both alignments ran successfully, with alignment percentages of 82.36% and 69.85%, respectively.
However, when I compared these results with an earlier one from a colleague who ran the same analysis on the same data, but using the masked Ensembl release 58 (May 2010), I noticed that his alignment percentage was 51.26%. I repeated the alignment with release 58 to be sure and obtained the same percentage, which suggests that I'm following the same alignment pipeline he did.
My question is how much one release can differ from another. New genes and transcripts can be added in a new release, but I don't know whether that is enough to account for a difference of almost 20 percentage points on my data (69.85% with release 75 versus 51.26% with release 58, both masked). Does anyone have any advice on that?
Regards,
I think Alastair is right about the patches. The genome assembly has not changed since release 55, so that is not the source of the difference. However, we have introduced many patches and haplotypes since then. There's a help video here to explain what we mean by patches and haplotypes.
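
One rough way to check how much patch and haplotype sequence each release contributes is to count the sequence names in the genome FASTA file that was given to RSEM. The following is only a minimal sketch: the markers "PATCH" and "HSCHR" are assumptions about how patch and haplotype sequences are named, and the function names are made up for illustration, so please compare them with the headers in your own files before trusting the counts.

```python
import sys
from collections import Counter

# Assumed naming markers, NOT confirmed by this thread; adjust them to match
# the headers that actually appear in your FASTA files.
PATCH_MARKER = "PATCH"
HAPLOTYPE_PREFIX = "HSCHR"


def classify_sequence(name):
    """Label a sequence name as 'patch', 'haplotype' or 'primary' (hypothetical helper)."""
    if PATCH_MARKER in name:
        return "patch"
    if name.startswith(HAPLOTYPE_PREFIX):
        return "haplotype"
    return "primary"


def count_sequence_types(lines):
    """Count FASTA header lines by sequence type, ignoring the sequence lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # The sequence name is the first whitespace-separated token.
                counts[classify_sequence(fields[0])] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_types.py release58.fa release75.fa
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, dict(count_sequence_types(handle)))
```

Running this on the FASTA files from both releases should show whether the newer one carries noticeably more patch or haplotype sequences. It only counts sequences; it does not show how many reads align to them.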