Question: Using Ensembl ID History Converter on GRCh37 Transcripts for salmon
3.1 years ago, stefanos.bamopoulos wrote:

Hello guys,

I have a question regarding the use of salmon with the GRCh37 ensembl reference.

For my analysis I run salmon for gene-level quantification, using the following reference transcriptome: ftp://ftp.ensembl.org/pub/grch37/release-88/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.cdna.all.fa.gz

To get gene-level estimates I need to provide a mapping between Ensembl transcript IDs and Ensembl gene IDs (as a tabular file). I created this file from the GTF file provided on the Ensembl FTP server: ftp://ftp.ensembl.org/pub/grch37/release-88/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz
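One way such a mapping could be derived from the GTF is sketched below. It is a minimal illustration, not the exact procedure used in this post: the function name `tx2gene_from_gtf` and the example line are made up for the demonstration, and it assumes the usual GTF layout with `key "value";` pairs in the ninth column.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def tx2gene_from_gtf(lines):
    """Yield (transcript_id, gene_id) for every 'transcript' feature line."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript features carry one ID pair each
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            yield attrs["transcript_id"], attrs["gene_id"]


# Illustrative input only; the attribute set of the real file may differ.
example = [
    "#!genome-build GRCh37.p13\n",
    '7\thavana\ttranscript\t75616301\t75623934\t.\t-\t.\t'
    'gene_id "ENSG00000189077"; transcript_id "ENST00000417509"; '
    'gene_biotype "polymorphic_pseudogene";\n',
]

for tx, gene in tx2gene_from_gtf(example):
    print(f"{tx}\t{gene}")
```

For the real file, the lines could be read with `gzip.open(path, "rt")`. It is worth checking whether the IDs in the GTF carry the same version suffix as the names in the FASTA headers before joining the two.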

I noticed that 16723 transcripts do not have a corresponding Ensembl gene ID in the GTF file. I believe this is because those transcripts were added later as patches to GRCh37. Using the Ensembl ID History Converter I can convert the transcript IDs with missing gene IDs to the corresponding transcript IDs of newer releases and then look up their gene IDs.

Now my question: Should I include in my analysis the transcripts that do not have a corresponding gene ID in the original GTF file? Or is it incorrect to include them, because I would be using information from two different Ensembl releases?

Note: I used GRCh37 and not the newest ensembl release to ensure comparability with other analyses I run.

Thanks in advance!

Stefan

Tags: rna-seq, salmon, grch37, ensembl
3.1 years ago, Magali_Ensembl (United Kingdom) wrote:

Hi Stefan,

The gene sets for GRCh37 have been frozen since release 76, so any files taken from the FTP after that release will refer to the same gene set.

However, the files you are using are referring to slightly different subsets of the annotation.

For the GTF files, we provide 3 different files:

  • .chr.gtf contains all gene annotations for all the chromosomes in the assembly
  • .gtf is our default recommended file, it contains all gene annotations for all reference toplevel sequences. This includes genes on chromosomes as well as scaffolds.
  • .chr_patch_hapl_scaff.gtf contains all gene annotations on all toplevel sequences, including alternate sequences such as haplotypes and patch fixes. This results in duplicate annotation, as a gene can be annotated on both the reference and a haplotype. It is probably not suitable for your use case.

For the FASTA files, we provide different files based on the annotation type.

  • .cdna contains all transcript sequences that will be transcribed
  • .ncrna contains all non coding transcript sequences. I suspect this might not be useful for transcriptome analysis.

This means that the transcripts from the FASTA file which do not have a corresponding Ensembl Gene ID in the GTF file will be transcripts annotated on the alternate sequences, which I suspect are not useful to you. This can be verified in the FASTA header, where the location tag will be something like chromosome:GRCh37:HG7_PATCH:142074311:142074850:1
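A quick way to run that check over many headers is sketched below. It is only an illustration: the helper name `location` and the transcript ID `TX_ON_PATCH.1` are hypothetical, and the parsing assumes the location tag always has the six colon-separated fields seen in the examples in this thread.

```python
from collections import Counter


def location(header):
    """Return (coord_system, seq_region_name) from a cDNA FASTA header, or None.

    Assumes a location tag of the form
    <coord_system>:GRCh37:<name>:<start>:<end>:<strand>.
    """
    for token in header.split():
        parts = token.split(":")
        if len(parts) == 6 and parts[1] == "GRCh37":
            return parts[0], parts[2]
    return None


headers = [
    # Header quoted in this thread (trailing biotype tags omitted).
    ">ENST00000417509.1 havana:known chromosome:GRCh37:7:75616301:75623934:-1 "
    "gene:ENSG00000189077.6",
    # Hypothetical ID combined with the patch location tag quoted above.
    ">TX_ON_PATCH.1 havana:known chromosome:GRCh37:HG7_PATCH:142074311:142074850:1",
]

# Tally transcripts per sequence region to see which regions are present.
counts = Counter(location(h) for h in headers)
for (coord_system, name), n in counts.items():
    print(coord_system, name, n)
```

Tallying the region names for the transcripts that are missing from the GTF should show whether they all sit on patch or haplotype regions, as suggested here.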

Additionally, the FASTA header should also contain the Ensembl Gene ID, so you might be able to use that directly, for example: ENST00000417509.1 havana:known chromosome:GRCh37:7:75616301:75623934:-1 gene:ENSG00000189077.6 gene_biotype:polymorphic_pseudogene transcript_biotype:processed_transcript
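Reading the pair straight from the header could look like the sketch below. The function name `tx2gene_from_header` is made up for illustration, and it assumes the transcript ID is the first whitespace-separated token and the gene ID follows a `gene:` prefix, as in the header above.

```python
def tx2gene_from_header(header):
    """Return (transcript_id, gene_id) parsed from one FASTA header line, or None."""
    tokens = header.lstrip(">").split()
    if not tokens:
        return None
    transcript_id = tokens[0]
    for token in tokens[1:]:
        # "gene_biotype:..." does not match because of the colon in the prefix.
        if token.startswith("gene:"):
            return transcript_id, token[len("gene:"):]
    return None


header = (
    ">ENST00000417509.1 havana:known chromosome:GRCh37:7:75616301:75623934:-1 "
    "gene:ENSG00000189077.6 gene_biotype:polymorphic_pseudogene "
    "transcript_biotype:processed_transcript"
)

pair = tx2gene_from_header(header)
if pair is not None:
    print("\t".join(pair))
```

A mapping built this way uses the same versioned transcript names that appear in the FASTA file, which avoids any mismatch with IDs taken from a separate annotation file.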

Hope that helps,

Magali


Hi Magali,

first of all, thank you very much for taking the time to answer! It definitely clarified a few things for me. You suspect correctly that I do not wish to include haplotypes or alternative sequences in my differential expression pipeline, since it would interfere with the analysis (e.g. for p-value adjustment). If I understand you correctly though, salmon will still try to map to transcripts on alternate sequences/haplotypes, since they will still be present in the fasta file, even if I do not infer gene counts from them. Do you recommend removing the sequences from the fasta file before running salmon, or is it sufficient to exclude said transcripts before running the differential expression analysis?

Thanks again! Stefan
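For the first option, removing sequences from the FASTA file before indexing, a streaming filter along the following lines might do. This is a sketch under stated assumptions: `filter_fasta` and `TX_ON_PATCH.1` are hypothetical names, the sequences are placeholders, and the set of IDs to keep is assumed to come from the transcript-to-gene mapping (adjusted so that it matches the versioned names in the headers).

```python
def filter_fasta(lines, keep_ids):
    """Yield only the lines of FASTA records whose ID (first header token) is in keep_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in keep_ids
        if keep:
            yield line


# Placeholder records; the sequences are not real.
fasta = [
    ">ENST00000417509.1 havana:known chromosome:GRCh37:7:75616301:75623934:-1\n",
    "ACGTACGT\n",
    ">TX_ON_PATCH.1 havana:known chromosome:GRCh37:HG7_PATCH:142074311:142074850:1\n",
    "TTTTCCCC\n",
]

kept = list(filter_fasta(fasta, {"ENST00000417509.1"}))
print("".join(kept), end="")
```

Because the filter works line by line, it can be applied to the gzipped file through `gzip.open(path, "rt")` without loading the whole transcriptome into memory.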

3.1 years ago, Magali_Ensembl (United Kingdom) wrote:

Hi Stefan,

That will probably depend on the method used for the alignment. If you are only looking for the best hit, there is a small chance that a salmon model might match a human transcript on an alternate sequence rather than the reference. After filtering, you would end up with no target at all for that transcript.

If you allow multiple alignments for the same model though, the salmon sequence will align against both reference and alternate sequence, in which case the filtering will leave you with the correct target.

Hope that helps,

Magali


Hi Magali,

thanks for your insight. You helped me out a lot!

Best,

Stefan
