I noticed that 16723 transcripts do not have a corresponding Ensembl Gene ID in the GTF file.
I believe this is because those transcripts were added later as patches to GRCh37.
Using the Ensembl ID History Converter I can convert the Transcript IDs with missing Gene IDs to the corresponding Transcript IDs of the newer releases and then find their corresponding Gene IDs.
Now my question:
Should I include in my analysis the transcripts that do not have a corresponding Gene ID in the original GTF file?
Or is it incorrect to include them, because I would be using information from two different Ensembl releases?
Note: I used GRCh37 and not the newest Ensembl release to ensure comparability with other analyses I run.
The gene sets for GRCh37 have been frozen since release 76, so any files taken from the FTP after that release will refer to the same gene set.
However, the files you are using refer to slightly different subsets of the annotation.
For the GTF files, we provide 3 different files:
.chr.gtf contains all gene annotations for all the chromosomes in the
.gtf is our default recommended file. It contains all gene annotations for all reference toplevel sequences, which includes genes on chromosomes as well as scaffolds.
.chr_patch_hapl_scaff.gtf contains all gene annotations on all toplevel sequences, including alternate sequences like haplotypes and patch fixes. This will result in duplicate annotation as genes can be annotated on the reference and haplotype. It is probably not recommended in your use case.
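To check which transcript IDs a given GTF file actually covers, a rough sketch along these lines could build the transcript-to-gene mapping from the attribute column. The function name `transcript_to_gene` is made up for illustration, and the code assumes the usual tab-separated GTF layout with `key "value";` pairs in the ninth column.

```python
import re

# Matches attribute pairs such as: gene_id "ENSG..."; transcript_id "ENST...";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def transcript_to_gene(gtf_lines):
    """Return {transcript_id: gene_id} for every annotated transcript.

    Hypothetical helper: assumes 9 tab-separated columns per feature line,
    with the attributes in the last column.
    """
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):  # header/comment lines
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:  # skip malformed or empty lines
            continue
        attributes = dict(ATTRIBUTE.findall(columns[8]))
        # Gene-level lines carry no transcript_id, so they are skipped here.
        if "transcript_id" in attributes and "gene_id" in attributes:
            mapping[attributes["transcript_id"]] = attributes["gene_id"]
    return mapping
```

It can be called on an open file handle, e.g. `transcript_to_gene(open(path))`. If the FASTA IDs carry a version suffix (such as `.1`) and the GTF IDs do not, the suffix has to be stripped before the two sets are compared.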
For the FASTA files, we provide different files based on the annotation type.
.cdna contains all transcript sequences that will be transcribed
.ncrna contains all non coding transcript sequences. I suspect this might not be useful for transcriptome analysis.
This means that the transcripts from the FASTA file which do not have a corresponding Ensembl Gene ID in the GTF file will be transcripts annotated on the alternate sequences, which I suspect are not useful to you. This can be verified in the FASTA header, where the location tag will be something like chromosome:GRCh37:HG7_PATCH:142074311:142074850:1
Additionally, the FASTA header should also contain the Ensembl Gene IDs, so you might be able to use that directly:
ENST00000417509.1 havana:known chromosome:GRCh37:7:75616301:75623934:-1 gene:ENSG00000189077.6 gene_biotype:polymorphic_pseudogene transcript_biotype:processed_transcript
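As an illustration of reading those header fields, here is a minimal sketch that pulls the transcript ID, sequence region and gene ID out of a header shaped like the one above. The function name `parse_header` and the `PRIMARY_CHROMOSOMES` set are assumptions made for this example. Note that the set lists only 1-22, X, Y and MT, so it also excludes the unplaced scaffolds that the default .gtf file includes.

```python
# Sequence region names treated as primary chromosomes (assumption for GRCh37).
PRIMARY_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def parse_header(header):
    """Extract IDs and location from an Ensembl-style FASTA header line.

    Hypothetical helper: assumes whitespace-separated fields, a location tag
    of the form coord_system:assembly:seq_region:start:end:strand and a
    gene:<ID> field, as in the example header.
    """
    fields = header.lstrip(">").split()
    seq_region = None
    gene_id = None
    for field in fields[1:]:
        parts = field.split(":")
        if len(parts) == 6:  # the location tag has six colon-separated parts
            seq_region = parts[2]
        elif len(parts) == 2 and parts[0] == "gene":
            gene_id = parts[1]
    return {
        "transcript_id": fields[0],
        "gene_id": gene_id,
        "seq_region": seq_region,
        # False for patches/haplotypes such as HG7_PATCH (and for scaffolds)
        "on_chromosome": seq_region in PRIMARY_CHROMOSOMES,
    }


example = (
    ">ENST00000417509.1 havana:known chromosome:GRCh37:7:75616301:75623934:-1 "
    "gene:ENSG00000189077.6 gene_biotype:polymorphic_pseudogene "
    "transcript_biotype:processed_transcript"
)
record = parse_header(example)
```

For the example header, `record["seq_region"]` is `"7"` and `record["on_chromosome"]` is true, whereas a header located on `HG7_PATCH` would be flagged as not on a chromosome.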
That will probably depend on the method used for the alignment.
If you are only looking for the best hit, there is a small chance that a salmon model might match a human transcript on an alternate sequence rather than the reference. After filtering, you would end up with no target at all for that transcript.
If you allow multiple alignments for the same model though, the salmon sequence will align against both reference and alternate sequence, in which case the filtering will leave you with the correct target.
First of all, thank you very much for taking the time to answer! It definitely clarified a few things for me. You suspect correctly that I do not wish to include haplotypes or alternate sequences in my differential expression pipeline, since they would interfere with the analysis (e.g. with p-value adjustment). If I understand you correctly though, salmon will still try to map to transcripts on alternate sequences/haplotypes, since they will still be present in the FASTA file, even if I do not infer gene counts from them. Do you recommend removing those sequences from the FASTA file before running salmon, or is it sufficient to exclude the transcripts before running the differential expression analysis?
Thanks again! Stefan