Question: Stringtie elimination of important transcripts
0
gravatar for ra2967
4 weeks ago by
ra296710
ra296710 wrote:

Dear all,

I have nine RNA-Seq files that I aligned using the hisat2 aligner and this default command:

hisat2 -x grch37_snp_tran/genome_snp_tran -1 reads_R.fastq.gz -2 reads_F.fastq.gz -S sample.sam

The -x file was the one they recommened in the hisat2 website.

Files looked fine and I proceed to the bam, sorted bam and bai using samtools.

However, when I use stringtie with this command:

stringtie file.sorted.bam -G Homo.sapiens.GRCh37.75.gtf -o file.gtf

It happens that in some sample I am losing genes as important as CCND1, while I keep them in other files. It is very strange as I have some transcripts with zero coverage/FPKM/TPM, so I expected that they were in my file.gtf. Am I doing something wrong? If I use the same command for all the samples, how can it be that I lose this transcript in one but not in others? In addition, how it can be that I don't have the same number of genes in all my generated files if I am using the same Homo.sapiens.GRCh37.75.gtf?

Thanks

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by ra296710

I answer it on the bottom

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by ra296710

What version of stringtie are you using? This issue has been mentionned on the stringtie github issues: https://github.com/gpertea/stringtie/issues/141

One user reported that using version v1.3.3b as opposed to 1.2.3 fixed the issue (plotted), but the other users still report seeing this with newer versions. If would try using the latest version or 1.3.3b specifically and see what happens.

You could also try different aligners: https://github.com/gpertea/stringtie/issues/158 (STAR seems to do well according to the last poster's comment. STAR is actually my go-to aligner and I did use it with stringtie in the past.)

This user reports this issue as well (albeit on the older 1.2.3 with the acknowledged bug) but due to hisat2 parameters: https://github.com/gpertea/stringtie/issues/61

Related issue: https://github.com/gpertea/stringtie/issues/102

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by manuel.belmadani460
1

Have you looked at the Hisat bam file to see if there acutally is any reads for that gene? Can easily be done with a genome browser such as IGV.

ADD REPLYlink written 4 weeks ago by kristoffer.vittingseerup1.4k
1
gravatar for manuel.belmadani
4 weeks ago by
Canada
manuel.belmadani460 wrote:

My guess is that there might be insufficient coverage (for example) in some samples for that loci, or something causing it to be discarded. For example, the stringtie default parameters can exclude short transcripts:

-G <guide_gff> reference annotation to include in the merging (GTF/GFF3)
-o <out_gtf> output file name for the merged transcripts GTF (default: stdout)
-m <min_len> minimum input transcript length to include in the merge (default: 50)
-c <min_cov> minimum input transcript coverage to include in the merge (default: 0)
-F <min_fpkm> minimum input transcript FPKM to include in the merge (default: 0)
-T <min_tpm> minimum input transcript TPM to include in the merge (default: 0)

There might be other defaults at play here. See: https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

My suggestion to troubleshoot this: Look into your BAM file with samtools, and see if anything gets mapped at that location. Something like:

samtools view sample.bam "Chr11:69455855-69469242" > CCND1.sam

Compare with your samples that do have the gene. If you don't see anything, then there's nothing stringtie can do about it. Maybe there's a hisat parameter you can tweak, or maybe it's just not in your raw reads for some reason. Feel free to comment and tell us what you see after that.

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by manuel.belmadani460
1

There is also the :

-c <float> Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 2.5

argument for the individual StringTie runs.

ADD REPLYlink written 4 weeks ago by kristoffer.vittingseerup1.4k

Sounds about right - it migth simply not be detected/expressed in a subset of the samples. That is also why when you do a StringTie analysis you afterwards needs to merge and requantify - see the overview figure on the StringTie website here.

ADD REPLYlink written 4 weeks ago by kristoffer.vittingseerup1.4k
0
gravatar for ra2967
4 weeks ago by
ra296710
ra296710 wrote:

Thank you all,

However, if this was the case, I would not have any "zero coverage"/ zero FPKM in my final gtf. In all my gtf (one per sample) I have transcripts that have zero FPKM, and they are not removed.

Regarding the merge algorithm, we performed it and we are not recovering this gene in the sample: although it is in the merged gtf, it doesn't appear in the gtf from this sample (but it appears in others). May it be a parameter of the hisat2? I will check the samtools suggestion from manuel and I will let you know, but I don't think that I have zero reads for this gene just in one of the samples...

ADD COMMENTlink written 4 weeks ago by ra296710

And that is why you need to re-quantify all your samples based on the merged GTF file as described in step 4 of the StringTie workflow. By using the -eB option you will get an expression table in the end which will include zero for non-expressed.

ADD REPLYlink written 4 weeks ago by kristoffer.vittingseerup1.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2415 users visited in the last hour