Stringtie elimination of important transcripts
2
0
Entering edit mode
5.3 years ago
ra2967 ▴ 20

Dear all,

I have nine RNA-Seq files that I aligned using the hisat2 aligner and this default command:

hisat2 -x grch37_snp_tran/genome_snp_tran -1 reads_R.fastq.gz -2 reads_F.fastq.gz -S sample.sam

The -x file was the one they recommened in the hisat2 website.

Files looked fine and I proceed to the bam, sorted bam and bai using samtools.

However, when I use stringtie with this command:

stringtie file.sorted.bam -G Homo.sapiens.GRCh37.75.gtf -o file.gtf

It happens that in some sample I am losing genes as important as CCND1, while I keep them in other files. It is very strange as I have some transcripts with zero coverage/FPKM/TPM, so I expected that they were in my file.gtf. Am I doing something wrong? If I use the same command for all the samples, how can it be that I lose this transcript in one but not in others? In addition, how it can be that I don't have the same number of genes in all my generated files if I am using the same Homo.sapiens.GRCh37.75.gtf?

Thanks

RNA-Seq alignment assembly stringtie hisat2 • 2.4k views
ADD COMMENT
0
Entering edit mode

I answer it on the bottom

ADD REPLY
0
Entering edit mode

What version of stringtie are you using? This issue has been mentionned on the stringtie github issues: https://github.com/gpertea/stringtie/issues/141

One user reported that using version v1.3.3b as opposed to 1.2.3 fixed the issue (plotted), but the other users still report seeing this with newer versions. If would try using the latest version or 1.3.3b specifically and see what happens.

You could also try different aligners: https://github.com/gpertea/stringtie/issues/158 (STAR seems to do well according to the last poster's comment. STAR is actually my go-to aligner and I did use it with stringtie in the past.)

This user reports this issue as well (albeit on the older 1.2.3 with the acknowledged bug) but due to hisat2 parameters: https://github.com/gpertea/stringtie/issues/61

Related issue: https://github.com/gpertea/stringtie/issues/102

ADD REPLY
2
Entering edit mode

Have you looked at the Hisat bam file to see if there acutally is any reads for that gene? Can easily be done with a genome browser such as IGV.

ADD REPLY
1
Entering edit mode
5.3 years ago

My guess is that there might be insufficient coverage (for example) in some samples for that loci, or something causing it to be discarded. For example, the stringtie default parameters can exclude short transcripts:

-G <guide_gff> reference annotation to include in the merging (GTF/GFF3)
-o <out_gtf> output file name for the merged transcripts GTF (default: stdout)
-m <min_len> minimum input transcript length to include in the merge (default: 50)
-c <min_cov> minimum input transcript coverage to include in the merge (default: 0)
-F <min_fpkm> minimum input transcript FPKM to include in the merge (default: 0)
-T <min_tpm> minimum input transcript TPM to include in the merge (default: 0)

There might be other defaults at play here. See: https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

My suggestion to troubleshoot this: Look into your BAM file with samtools, and see if anything gets mapped at that location. Something like:

samtools view sample.bam "Chr11:69455855-69469242" > CCND1.sam

Compare with your samples that do have the gene. If you don't see anything, then there's nothing stringtie can do about it. Maybe there's a hisat parameter you can tweak, or maybe it's just not in your raw reads for some reason. Feel free to comment and tell us what you see after that.

ADD COMMENT
1
Entering edit mode

There is also the :

-c <float> Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 2.5

argument for the individual StringTie runs.

ADD REPLY
0
Entering edit mode

Sounds about right - it migth simply not be detected/expressed in a subset of the samples. That is also why when you do a StringTie analysis you afterwards needs to merge and requantify - see the overview figure on the StringTie website here.

ADD REPLY
0
Entering edit mode
5.3 years ago
ra2967 ▴ 20

Thank you all,

However, if this was the case, I would not have any "zero coverage"/ zero FPKM in my final gtf. In all my gtf (one per sample) I have transcripts that have zero FPKM, and they are not removed.

Regarding the merge algorithm, we performed it and we are not recovering this gene in the sample: although it is in the merged gtf, it doesn't appear in the gtf from this sample (but it appears in others). May it be a parameter of the hisat2? I will check the samtools suggestion from manuel and I will let you know, but I don't think that I have zero reads for this gene just in one of the samples...

ADD COMMENT
0
Entering edit mode

And that is why you need to re-quantify all your samples based on the merged GTF file as described in step 4 of the StringTie workflow. By using the -eB option you will get an expression table in the end which will include zero for non-expressed.

ADD REPLY

Login before adding your answer.

Traffic: 2394 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6