Which genome masking type to use when building a transcript reference with RSEM?
8.3 years ago

Hi all,

I am currently preparing the reference transcriptome that RSEM will use in RNA-seq experiments. For this, I run rsem-prepare-reference with GTF and FASTA files downloaded from Ensembl (latest release, v80).

However, I have some questions about the masking level of the genome (which can be unmasked, soft-masked, or hard-masked for repetitive sequences). Does the masking level matter when I build the transcript reference? For example, if I use a hard-masked genome instead of the unmasked one, will that have a large impact on my final transcript set, given that I will be using the same GTF coordinates in both scenarios?

I ask because I saw that the human transcriptome may contain some repetitive sequences, and I don't know whether these sequences are completely lost in the hard-masked genome.

Does anyone have some insight on that matter?


True! I just checked my transcripts.fa file and there are some small sequences (~10-20) full of Ns...

Thank you very very much!

8.3 years ago

I would strongly encourage you not to use the hard-masked genome for this. If you do, you're pretty much guaranteed to end up with a bunch of excess Ns in the resulting transcript sequences, because hard-masking replaces repeat bases with N. Either the soft-masked or the plain FASTA files will work fine; in fact they should produce equivalent results, since soft-masking only lowercases the repeat bases.
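
If you want to check how badly a given reference is affected, here is a minimal Python sketch that reports the fraction of N bases per transcript. It assumes the extracted transcripts are in a plain FASTA file such as the transcripts.fa mentioned in the comment above (the actual file name depends on the reference name you gave RSEM). The helper names `read_fasta`, `n_fraction`, and `flag_masked`, and the 0.5 threshold, are illustrative choices of mine and not part of RSEM.

```python
import sys
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_fraction(seq: str) -> float:
    """Fraction of bases that are N (case-insensitive); 0.0 if empty."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def flag_masked(lines: Iterable[str], threshold: float = 0.5) -> Dict[str, float]:
    """Return {id: N fraction} for records at or above the threshold.

    The default threshold of 0.5 is an arbitrary illustrative cutoff.
    """
    flagged = {}
    for name, seq in read_fasta(lines):
        frac = n_fraction(seq)
        if frac >= threshold:
            flagged[name] = frac
    return flagged


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical path): python check_ns.py transcripts.fa
    with open(sys.argv[1]) as handle:
        for name, frac in flag_masked(handle).items():
            print(f"{name}\t{frac:.1%}")
```

Running this on references built from the hard-masked and the unmasked genome with the same GTF should show which transcripts lose sequence to masking. Lowercase (soft-masked) bases are not counted as masked here, since only literal N characters are tallied.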


