50% Missed exons when trying to replicate annotation reference
4 months ago
Mike ▴ 10

Hello, I have an annotation file which I'm trying to replicate.

So far, the results are horrible:

[image: comparison of my assembled annotation against the reference annotation]

I'm intentionally not feeding the reference annotation I have into the pipeline, so that I get there "without help".

I build the index with STAR, clean my FASTQ files, and then align with:

STARlong --runThreadN 4 --genomeDir {args.dest} --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat --readFilesIn {fastqs} --outFileNamePrefix {args.org} --sjdbOverhang {max_len-1} --twopassMode Basic --outSAMattributes All

I then sort the BAM with samtools and assemble transcripts with StringTie:

stringtie -p 4 {sorted_bam} -o stringOutput{itr}.gtf

Finally, I merge all the GTFs with stringtie --merge.

I compare the final merged GTF file to the reference annotation I have, and the results are in the image above.

I don't have the SRA data used to make the reference annotation, so I tried downloading data from 2 different projects on NCBI, but both led me to more or less the same poor results.
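For readers who want to reproduce the assemble, merge, and compare steps described above, here is a minimal Python sketch that only builds the command lines (it does not run them). The `stringtie` flags come from the post. The use of `gffcompare -r` for the comparison is an assumption based on the thread's tags, and the file names and the `cmp` output prefix are hypothetical.

```python
import shlex


def stringtie_assemble_cmd(sorted_bam, out_gtf, threads=4):
    """Per-sample assembly, mirroring the command in the post."""
    return ["stringtie", "-p", str(threads), sorted_bam, "-o", out_gtf]


def stringtie_merge_cmd(gtfs, merged_gtf):
    """Merge the per-sample GTFs into one non-redundant set of transcripts."""
    return ["stringtie", "--merge", "-o", merged_gtf, *gtfs]


def gffcompare_cmd(reference_gtf, merged_gtf, prefix="cmp"):
    """Compare the merged GTF to the reference (assumed tool: gffcompare)."""
    return ["gffcompare", "-r", reference_gtf, "-o", prefix, merged_gtf]


def plan(sorted_bams, reference_gtf, merged_gtf="merged.gtf"):
    """Return the full list of commands, in the order they would be run."""
    # Hypothetical output naming, following the stringOutput{itr}.gtf pattern.
    gtfs = [f"stringOutput{i}.gtf" for i, _ in enumerate(sorted_bams)]
    cmds = [stringtie_assemble_cmd(b, g) for b, g in zip(sorted_bams, gtfs)]
    cmds.append(stringtie_merge_cmd(gtfs, merged_gtf))
    cmds.append(gffcompare_cmd(reference_gtf, merged_gtf))
    return cmds


if __name__ == "__main__":
    # Hypothetical inputs; print the commands instead of executing them.
    for cmd in plan(["s1.sorted.bam", "s2.sorted.bam"], "reference.gtf"):
        print(shlex.join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the paths are real. Keeping the commands as argument lists avoids shell-quoting problems with unusual file names.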

  1. What am I doing wrong? Is the main problem that I don't have the SRA inputs used to make the annotation I'm trying to replicate?
  2. I'm planning to add/change the current parameters, but I don't think that will have a significant effect. Am I correct?
  3. I'll be happy to receive any information you can share with me on this subject. The main goal is to build a gene prediction and annotation pipeline.

Thanks a lot!

star gffcompare stringtie annotation pipeline • 336 views
4 months ago

Transcriptome assembly is notoriously difficult to get right.

Some reasons are objective and very straightforward: many of your transcripts are not expressed at sufficient levels.

Other reasons have to do with the complexity of the task at hand; sometimes it is quite difficult to tell transcripts apart.

Look at some of your missed exons and see if you observe a pattern. Are these missed exons even covered?
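One way to start looking at the missed exons is to list them explicitly. The sketch below is a rough approximation, not gffcompare's own logic: it assumes a reference exon counts as "missed" when no assembled exon on the same sequence overlaps it, and it ignores strand. All function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def read_exons(gtf_lines):
    """Collect unique exon intervals per sequence name from GTF lines."""
    exons = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # GTF coordinates are 1-based and inclusive.
        exons[fields[0]].add((int(fields[3]), int(fields[4])))
    return exons


def merge_intervals(intervals):
    """Merge overlapping intervals into a sorted list of disjoint ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def missed_exons(ref_exons, query_exons):
    """Return reference exons not overlapped by any query exon."""
    missed = []
    for seq in sorted(ref_exons):
        merged = merge_intervals(query_exons.get(seq, ()))
        ends = [m[1] for m in merged]
        for start, end in sorted(ref_exons[seq]):
            # First merged query interval that ends at or after this exon's start.
            i = bisect_left(ends, start)
            if i == len(merged) or merged[i][0] > end:
                missed.append((seq, start, end))
    return missed


def to_bed_lines(exons):
    """Convert 1-based inclusive intervals to 0-based half-open BED lines."""
    return [f"{seq}\t{start - 1}\t{end}" for seq, start, end in exons]
```

With real files, something like `missed_exons(read_exons(open("reference.gtf")), read_exons(open("merged.gtf")))` gives the list to inspect (file names are placeholders). Writing `to_bed_lines(...)` to a BED file lets you check read support for those regions against the sorted BAM, for example with `samtools bedcov`. That answers whether the missed exons are covered at all.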

By and large, the problem usually is that you get too many false positives.

4 months ago
liorglic ▴ 870

Obtaining a high quality annotation usually requires multiple types of evidence, not just RNA-seq, e.g.:

  • Protein and/or full length transcript sequences from closely related species or accessions
  • Ab-initio gene predictions
  • Reference gene lift-over

Results from multiple alignment and prediction tools are then combined to produce gene models. I suggest you take a look at an annotation pipeline such as MAKER, or at EVidenceModeler, which combines annotation results.

Having said that, when comparing to the reference annotation you should keep in mind that:
a. The reference annotation is not necessarily correct, and most gene models in it have never been validated.
b. Reference annotations usually undergo some stage of manual curation, which improves the quality at the cost of much hard work.
Therefore, I wouldn't expect you to be able to obtain an annotation that is highly similar to the reference annotation, especially since you are not using the same evidence.


Hi Lior, thank you for the informative and very helpful comment! I can't seem to find anything useful regarding "Reference gene lift-over". Could you provide more context or explain a bit more what you meant by that? Thanks again.

