Question

Saccharomyces cerevisiae - mapping against transcriptome

1

Entering edit mode

15 months ago

yeast_rnaseq ▴ 10

Hi,

I'm trying to do a typical transcriptomics analysis on mRNA samples from several Saccharomyces cerevisiae CEN.PK113-7D strains. By typical I mean:

1) Cleaning raw reads: trimming of adapters and removal of low quality reads. 2) Mapping reads against reference genome/transcriptome and quantification of reads 3) Performing differential gene expression (DGE) analysis from quantified reads in the different samples/conditions 4) Function or pathway enrichment from DGE results

The sequencing company did step 1 already, so my starting point would be to map the "clean" reads into a reference genome/transcriptome. I don't know if I'm overcomplicating everything or if my views are not entirely correct, but I'm already stuck here...

I was initially planning to use salmon to map the reads against the transcriptome. However, I'm struggling to find a good transcriptome for yeast. I can see that in most complex eukaryotes, like human or mouse, if I download the cDNA transcriptome from ENSEMBL, as I would expect I can find several mRNA transcripts belonging to the same gene (alternative splicing), the transcripts contain 5' and 3'UTR regions, etc. In the case of yeast, however, when I download the transcriptome there's just a single transcript per gene (alternative splicing is completely ignored even though it's been demonstrated for several genes) and all the transcripts miss UTR regions, i.e, literally all transcripts start with ATG and end with a stop codon, which is completely wrong. It looks like the transcriptome has been built from computational predictions based on the genome sequence rather than from mapping sequenced cDNA into the genome.

I thought that these issues could be specific to my strain (CEN.PK113-7D, https://www.ncbi.nlm.nih.gov/bioproject/PRJNA398797/). but that's also the case for the S. cerevisiae reference genome! (https://ftp.ensembl.org/pub/release-108/fasta/saccharomyces_cerevisiae/cdna/). Am I missing out something, or there is no "experimental" transcriptome for baker's yeast uploaded to the most common databases? There's some 5' UTR and 3' UTR fasta files on SGD (http://sgd-archive.yeastgenome.org/sequence/S288C_reference/), but they belong to the reference strain S288C. Also, I don't know how correct it would be to just join those UTR regions to transcripts, as still alternative splicing would not be taken into account. For my strain I found this record http://genomebrowser.uams.edu/cenpk1137/ that has the CEN.PK113-7D genome sequence plus the genome annotation in GFF3 or BED that contains info about all exons, CDS, and UTR regions. However, I'm not sure how correct it would be to build a transcriptome from the genome + GFF3 annotation, plus I'm a bioinformatics newbie and don't know how to do it.

So basically, if I use the transcriptome for my strain that I can get from the most common databases, reads that cover only in the UTR region would be ignored, as well as probably some reads derived from splicing variants not taken into account in the single transcript annotation. Not sure if this is something standard, or ideally you should count also all those reads.

As an alternative to this situation, I'm considering mapping against the CEN.PK113-7D genome. There I would use hisat2 for the mapping and featureCounts to count reads that have mapped on exon regions. Again, I don't know if it's a common practice but then reads mapping into potential UTR regions would be ignored?

Could anyone enlighten me? Maybe I'm overthinking about all possible details that are not really relevant in the analysis...

Thanks!!

salmon transcriptome hisat2 featurecounts Yeast • 1.1k views

ADD COMMENT • link 15 months ago by yeast_rnaseq ▴ 10

0

Entering edit mode

I'm not gonna lie your question went way over my head but I always just use these prebuilt indices from kallisto (with kallisto, not salmon, however) and they work great for me. Do these have the same problem as those from ensembl?

https://github.com/pachterlab/kallisto-transcriptome-indices/releases

ADD REPLY • link 15 months ago by bkleiboeker ▴ 370

0

Entering edit mode

Thanks for your reply. It seems that they are built from ensembl, so they have the same issue. Transcripts miss out UTR regions and do not seem to consider alternative splicing variants.

ADD REPLY • link 15 months ago by yeast_rnaseq ▴ 10

score 2 · Accepted Answer · 2023-01-12

You also have the strains subdirectory at yeastgenome.org, that contains a CEN.PK113-7D assmebly from TU Delft, the resource you found is probably better. The question is how much you'll gain with the matching CEN.PK genome over the S288c and how much you loose because of the (presumably) superior annotation of S288c.

Maybe Joseph Schacherer's papers can shed some light on how closely they're related, for example his 2018 publication

To extract transcripts you can here's some potentially helpful resources

Extracting transcript sequences with gene name (gffread)

https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html

Mapping to the genome seems to me like a good alternative (though I'd prefer STAR personally) - you can map to both the S288c and CEN.PK reference genome and quickly see how different your basic QC stats are, like unmapped reads, etc. For more details, of course you'd have to dig deeper a bit.