Question

Using StringTie output GTF with featureCounts to assign reads: low assigned percentage

4.5 years ago

sg197 ▴ 40

Hi,

I originally used featureCounts to assign reads to known transcripts of the mm10 genome. The percentage of fragments assigned was 70-75% for my samples. With this original GTF file, the number of features (exons) was 841916 and the number of meta-features (genes) was 55421.

I have since come across StringTie and wanted to repeat the read assignment using the StringTie output GTF, which should contain novel transcripts as well as the known ones from the original GTF.

However, when I used featureCounts with this new GTF, I got a much lower assignment of reads (20-25%). The number of features is also much smaller (438272), whereas the number of meta-features is larger (80351), meaning fewer exons but more genes in the new GTF? The code I used to make the new GTF and then assign reads is below.

stringtie all_samples_sortedByCoord.out.bam -o all_samples_sortedByCoord.gtf -p 8 -G gencode.vM23.primary_assembly.annotation.gtf --fr -A all_samples_sortedByCoord.tab

featureCounts -T 4 -p -g gene_id -s 2 -a all_samples_sortedByCoord.gtf -o PE_samples_featureCounts_novel_gtf.txt BA*_sortedByCoord.out.bam

I'm not sure where I've gone wrong. Why are fewer reads assigned with a GTF that should contain both known and novel transcripts than with the known transcripts alone? Why are there more genes (meta-features) but fewer exons (features) in my new GTF? Any help appreciated!

RNA-Seq featurecounts stringtie • 1.8k views

score 0 · Answer 1 · 2019-10-30

4.5 years ago

Mark ★ 1.5k

The StringTie manual states:

Note that if option -e is not used the reference transcripts need to be fully covered by reads in order to be included in StringTie's output. In that case, other transcripts assembled from the data by StringTie and not present in the reference file will be printed as well.

Try the -e option in StringTie to see what effect it has.
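To see that effect directly, it may help to count what each GTF contains before handing it to featureCounts. The snippet below is only a rough sketch: `count_gtf_features` is a hypothetical helper name, and it assumes that featureCounts "features" correspond to exon lines and "meta-features" to the distinct gene_id values on those lines (the defaults when -t is left at exon and -g gene_id is used).

```shell
# Hypothetical helper: count exon lines (features) and distinct gene_id
# values on those exon lines (meta-features) in a GTF file.
# Assumes a tab-separated GTF with gene_id "..." in the 9th column.
count_gtf_features() {
    awk -F'\t' '
        $0 !~ /^#/ && $3 == "exon" {
            exons++
            # Extract the value between the quotes after gene_id.
            if (match($9, /gene_id "[^"]+"/)) {
                genes[substr($9, RSTART + 9, RLENGTH - 10)] = 1
            }
        }
        END {
            n = 0
            for (g in genes) n++
            printf "features(exons)=%d meta-features(genes)=%d\n", exons, n
        }
    ' "$1"
}

# Example calls (file names taken from the question):
#   count_gtf_features gencode.vM23.primary_assembly.annotation.gtf
#   count_gtf_features all_samples_sortedByCoord.gtf
```

Running it on the reference GTF and on the StringTie output, with and without -e, should show whether reference transcripts were dropped from the assembly.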

Thanks for the suggestion. I tried it, and with that output GTF featureCounts gave me the original assigned percentage. But I read in the manual that the -e option causes read bundles with no reference transcripts to be skipped, so I think this leaves out any novel transcripts?

Manual description of -e:

Limits the processing of read alignments to only estimate and output the assembled transcripts matching the reference transcripts given with the -G option (requires -G, recommended for -B/-b). With this option, read bundles with no reference transcripts will be entirely skipped, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes, for example.

Reply 4.5 years ago by sg197 ▴ 40

Yes, that's weird indeed. I don't use StringTie at all, so this is new to me. I think the two operations need to be combined. If you follow the "Differential expression analysis" workflow in their manual, you'll see a merge step that generates a merged GTF file: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

This should satisfy both of your requirements of having novel and known transcripts annotated.
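As a rough illustration of that merge approach (not a tested pipeline), the steps could look something like the sketch below. The function names `run` and `merge_and_count`, the `DRY_RUN` switch, and the output file names are all made up for this example. The featureCounts flags are copied from the question, and the StringTie calls use only -G, -o, -p and --merge.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: merge_and_count REFERENCE_GTF SAMPLE1.bam [SAMPLE2.bam ...]
merge_and_count() {
    ref_gtf=$1
    shift
    : > mergelist.txt
    for bam in "$@"; do
        sample_gtf="${bam%.bam}.stringtie.gtf"
        # Step 1: assemble each sample, guided by the reference annotation.
        # Add --rf or --fr here if the library is stranded; it should describe
        # the same protocol as the featureCounts -s value used below.
        run stringtie "$bam" -G "$ref_gtf" -o "$sample_gtf" -p 8
        echo "$sample_gtf" >> mergelist.txt
    done
    # Step 2: merge the per-sample assemblies with the reference, so the
    # merged GTF keeps known transcripts and adds novel ones.
    run stringtie --merge -G "$ref_gtf" -o stringtie_merged.gtf mergelist.txt
    # Step 3: count reads against the merged annotation.
    run featureCounts -T 4 -p -g gene_id -s 2 -a stringtie_merged.gtf \
        -o PE_samples_featureCounts_merged_gtf.txt "$@"
}

# Example (prints the commands only):
#   DRY_RUN=1; merge_and_count gencode.vM23.primary_assembly.annotation.gtf BA*_sortedByCoord.out.bam
```

Because the merge step is given the reference with -G, reference transcripts should be carried into the merged GTF even when they are not fully covered by reads, which is what the plain assembly run was missing. Gene IDs in the merged file may be StringTie-assigned rather than the original reference IDs, so the counts table may need to be mapped back to reference gene names afterwards.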

Reply 4.5 years ago by Mark ★ 1.5k