Question

Genome type to build transcript reference with RSEM?

8.9 years ago

rna-seq_researcher ▴ 60

Hi all,

I am currently preparing the reference transcriptome used by RSEM for RNA-seq experiments. For this, I use rsem-prepare-reference with GTF and FASTA files downloaded from Ensembl (latest release, v.80).

However, I have some questions regarding the masking level of the genome (which can be the complete, unmasked genome, or soft- or hard-masked for repetitive sequences). Does the masking level have any influence when I build the transcript reference? For example, if I use a hard-masked genome instead of the complete genome, will that have a large impact on my final transcript set (considering that I will be using the same GTF coordinates in both scenarios)?

I ask because I have seen that the human transcriptome may contain some repetitive sequences, and I don't know whether these sequences are completely lost in the hard-masked genome.

Does anyone have some insight on that matter?

Thanks!

rsem RNA-Seq genome alignment • 2.7k views

updated 15 months ago by Ram 43k • written 8.9 years ago by rna-seq_researcher ▴ 60

True! I just checked my transcripts.fa file and there are some small sequences (~10-20) full of Ns...

Thank you very very much!

8.9 years ago by rna-seq_researcher ▴ 60

score 2 · Accepted Answer · 2015-06-02

I would strongly encourage you not to use the hard-masked genome for this. Hard masking replaces repetitive sequence with Ns, so you're pretty much guaranteed to get a bunch of excess Ns in the resulting transcript sequences. Either the soft-masked or the plain fasta files will work fine; soft masking only lowercases the repeats, so the two should in fact produce equivalent results.
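
If you want to check how badly a given reference is affected, something like the following rough Python sketch could help. It is not part of RSEM; the function names (read_fasta, n_fraction, flag_masked) and the 0.5 threshold are made up for illustration. It reads FASTA records and reports the transcripts whose sequence is mostly N. The toy records at the bottom stand in for a real transcripts file.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_fraction(seq):
    """Fraction of bases that are N (case-insensitive); 0.0 for empty input."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def flag_masked(handle, threshold=0.5):
    """Return {name: N fraction} for records at or above the threshold.

    The default threshold of 0.5 is an arbitrary illustrative choice.
    """
    flagged = {}
    for name, seq in read_fasta(handle):
        frac = n_fraction(seq)
        if frac >= threshold:
            flagged[name] = frac
    return flagged


# Toy records: tx1 is unmasked, tx2 is mostly hard-masked,
# tx3 is partly soft-masked (lowercase) with two Ns.
toy = io.StringIO(">tx1\nACGTACGT\n>tx2\nNNNNNNNNAC\n>tx3\nacgtnnACGT\n")
print(flag_masked(toy))
```

To run it on a real reference, open the transcripts.fa file mentioned above (adjust the path to wherever your reference was written) and pass the handle to flag_masked instead of the toy StringIO.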