Question: Too low FPKM values from StringTie
0
gravatar for SJ Basu
8 days ago by
SJ Basu30
India
SJ Basu30 wrote:

Hello People,

I am analyzing 20 samples RNA-Seq data of leukemia. To this end, I have used the Hisat2-StingTie2 pipeline to analyze the samples and arrive at FPKM counts for each gene. After I have run the complete process, I found that the FPKM values are way too low for calculating Differential gene expression and it seems there is no difference in expression across samples.

                B85         B86         B78         B89
       Minimum  0           0           0           0
First Quartile  0.004382    0.004093    0.019453    0.011406
        Median  0.150255    0.153044    0.167367    0.144218
Third Quartile  1.41        1.463101    1.180386    1.064904
       Maximum  120290.64   144817.40   116797.73   119615.86
      Skewness  91.670      97.440      86.229      93.291

These FPKM values were taken from StringTie file using -A options. These samples were processed using Illumina truseq stranded total mrna kit alongside ribo-depletion kit and was sequenced on NextSeq 550 with 2X75bp reads (avg 60mil reads per sample). Peculiar thing is Hisat2 have shown >90% mapping and there isn't even rRNA contamination. And on top, I have even mapped the reads to cDNA database which also showed >55% reads mapping.

The commands were as follows:

hisat2 -p 60 --dta -x hsa_GRCh38_index -1 sample_R1_pair.fastq.gz -2 sample_R2_pair.fastq.gz -S sample_map2genom.sam && samtools view -bSh -@ 45 sample_map2genom.sam | samtools sort -@ 45 > sample_map2genom.bam

stringtie ${i}_map2genom.bam -G Homo_sapiens.GRCh38.95.gtf -o sample_assembly_strg.gtf -p 60 -A ${i}_abundance.txt -B

This pipeline has worked great for me before but this particular output is totally puzzling me. Has anyone encountered this before ?? Please advice.

Thanks and regards

ADD COMMENTlink modified 8 days ago • written 8 days ago by SJ Basu30
1
gravatar for shawn.w.foley
8 days ago by
shawn.w.foley190
shawn.w.foley190 wrote:

How many genes have >0, >1, or >10 FPKM. What type of samples are these (cell line, tissue, etc). And how many genes are in your genome?

It's not unreasonable to have an Ensembl annotation with >55,000 genes of which only ~8-15,000 genes are expressed. I'm wondering if annotated but unexpressed genes are hiding the true signal. This would be especially apparent in tissue culture cell lines.

ADD COMMENTlink written 8 days ago by shawn.w.foley190

Hi Shawn,

These are WBCs (leukemia) extracted from FFPE samples but rna quality was good.

For number of genes >0 is ~35k , >1 is ~15-18k and >10 is ~3-5k

"I'm wondering if annotated but unexpressed genes are hiding the true signal"...I didn't exactly get that....

ADD REPLYlink written 8 days ago by SJ Basu30
1

Poor wording on my part. I think your data looks reasonable, 15-18k with >1FPKM is in the realm of what I normally see. I think the fact that you have ~75% of genes with <1 FPKM is because most of them just aren't expressed. I would set an expression threshold and then analyze everything above that threshold in your differential expression.

I normally generate a density plot in R, it'll often show a bimodal curve, then I determine my cutoff for expression empirically.

ADD REPLYlink written 8 days ago by shawn.w.foley190

Actually the problem is I don't see a "CURVE", its just a really long, sharp PEAK at ~0 FPKM value....it looks I might as well take a cut-off of FPKM value of 2 and still would retain more than 25K genes.

Another thing is, which FPKM do you consider ? The one that is given by -A option or the one that comes out from t_data.ctab....

ADD REPLYlink written 8 days ago by SJ Basu30
1

Try making a density plot of the log10(FPKM), that should show a more appropriate curve. And I'm not sure which output to recommend, I haven't used HiSat before, I tend to stick to STAR/HTSeq/DESeq2 or more recently Salmon/DESeq2.

ADD REPLYlink written 6 days ago by shawn.w.foley190
1

All RNA-seq count data is shifted toward zero, so, that is expected. RNA-seq counts follow a negative binomial distribution. I don't believe that production of FPKM expression values results in an overall change from this data distribution. A transformation of normalised counts via, e.g., variance-stabilisation, logging, Z-scaling, or a combination of these, will alter the distribution.

Also keep in mind that FPKM units are not suitable for differential expression comparisons.

ADD REPLYlink modified 5 days ago • written 5 days ago by Kevin Blighe37k

Hi Kevin,

Thanks for reassurance and the information. As for the FPKM units, so which units can reliably be used for differential expression comparisons ?? TPM ??

ADD REPLYlink written 5 days ago by SJ Basu30
1

Hey, if you just follow the recommendation by shawn (Salmon --> DESeq2), then you'll have data that is suitable for differential expression. Salmon will require input FASTQ files, I believe, and bypasses the production of a BAM file.

The other method, STAR --> HTSeq --> DESeq2, will produce a BAM file, with raw counts then derived by HTSeq.

Note, that normalisation and differential expression analysis occurs in DESeq2.

As you have already used StringTie, you can actually use a function, prepDE.py, which conveniently converts your HISAT2 / StringTie dataset to raw counts that are suitable for DESeq2: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq

Criticism of FPKM for differential expression analysis is out there - just do a search. You will not find many voices actively promoting it for such analyses (although publications do exist).

ADD REPLYlink written 4 days ago by Kevin Blighe37k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1953 users visited in the last hour