Question

Question in RNA-Seq gene expression profiling

6.3 years ago

Cy • 0

Hi all, I'm new to RNA-seq analysis and confused about gene expression profiling. How do I get an overview of the expression profile, such as the total number of protein-coding genes, non-coding genes and pseudogenes? I tried the workflow from the article "Toward a Reference Gene Catalog of Human Primary Monocytes" (https://doi.org/10.1089/omi.2016.0124):

FastQC
Trimmomatic
HISAT
StringTie
Cuffnorm (The FPKM >0.1 threshold was used to determine expressed transcripts)
Cuffmerge
Cuffdiff

The article also reports that, by applying an FPKM > 0.1 threshold, the authors identified a total of 20,371 genes and 82,996 transcripts expressed in their monocyte datasets.

The part that confuses me is how to apply the FPKM > 0.1 threshold, and which file I should apply it to (the Cuffnorm output file genes.fpkm_table, or the transcript.gtf file)? And how did they determine the numbers of protein-coding genes, non-coding genes and pseudogenes among these 20,371 genes?

Many articles report how many genes and transcripts were found in their datasets, but I'm really confused about how they obtain these numbers.

I really need some help to understand this. Thank you

rna-seq gene expression profiling • 1.3k views

score 0 · Answer 1 · 2019-03-22

This workflow is quite old. FPKM normalization has repeatedly been shown to be unreliable for differential analysis, as it fails to properly account for differences in library composition (see the StatQuest videos on YouTube for a nice and easily accessible illustration of FPKM, or search PubMed or the web for more scientific literature). Spend quality time reading this guide and this example workflow.

As for FPKM thresholds, please use Google and the search function; this has been asked many times before. Short answer: there is no strict or reliable cutoff. It is also not necessary to prefilter your genes for differential expression, as low-count genes will most commonly not be statistically significant. They typically lack the necessary power (= not enough counts).

To get information on the type of gene, you have to look up the genes in the respective annotation databases, such as GENCODE, RefSeq, etc.; see "How to sort genes into coding and non-coding mRNA?"
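As a rough illustration of the annotation lookup, here is a minimal Python sketch that reads the gene type from the `gene` records of a GENCODE-style GTF and tallies broad classes for a list of expressed gene IDs. The toy GTF lines, the gene IDs and the `broad_class` grouping are made up for illustration. GENCODE stores the type under `gene_type`, while Ensembl GTFs use `gene_biotype`, so adjust `type_key` to your annotation. How a given paper grouped biotypes into "non-coding" or "pseudogene" is an assumption here, not something taken from the article.

```python
import re
from collections import Counter

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gene_types_from_gtf(lines, type_key="gene_type"):
    """Map gene_id -> biotype using only the 'gene' records of a GTF."""
    types = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # ignore transcript/exon/CDS records
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if "gene_id" in attrs and type_key in attrs:
            types[attrs["gene_id"]] = attrs[type_key]
    return types


def broad_class(biotype):
    """Hypothetical coarse grouping; real studies may group differently."""
    if biotype == "protein_coding":
        return "protein-coding"
    if "pseudogene" in biotype:
        return "pseudogene"
    return "non-coding/other"


# Toy annotation standing in for a real GTF file (made-up records).
toy_gtf = [
    'chr1\tTOY\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "A";',
    'chr1\tTOY\tgene\t2000\t2500\t.\t-\t.\tgene_id "G2"; gene_type "lncRNA"; gene_name "B";',
    'chr1\tTOY\tgene\t3000\t3400\t.\t+\t.\tgene_id "G3"; gene_type "processed_pseudogene"; gene_name "C";',
    'chr1\tTOY\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
]

expressed = ["G1", "G2", "G3"]  # IDs that passed your expression filter
types = gene_types_from_gtf(toy_gtf)
tally = Counter(broad_class(types[g]) for g in expressed if g in types)
print(dict(tally))
```

With a real annotation you would pass an open file handle instead of `toy_gtf`. Note that the lookup only works if the gene IDs in your expression table match the IDs in the annotation; novel loci assembled by StringTie get their own IDs and will not be found.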

score 0 · Answer 2 · 2019-03-22

6.3 years ago

Ashastry ▴ 60

If they say they used 0.1 as the cutoff, then they probably applied it to the expression table (the FPKM table) and not to the GTF file. There are several discussions about FPKM cutoffs, and you can look them up on Biostars to understand the arguments.
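A minimal sketch of what applying the cutoff to a table could look like, in plain Python. It assumes a tab-delimited table with one ID column followed by one FPKM column per sample; the header and values below are made up. It also assumes that "expressed" means FPKM above the threshold in at least one sample, which is my guess and not something the article confirms.

```python
import csv
import io


def expressed_ids(handle, threshold=0.1):
    """Return IDs whose FPKM exceeds `threshold` in at least one sample."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row (ID column, then one column per sample)
    kept = []
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        gene_id = row[0]
        values = [float(v) for v in row[1:]]
        if any(v > threshold for v in values):
            kept.append(gene_id)
    return kept


# Toy table standing in for an FPKM table (made-up IDs and values).
toy_table = (
    "tracking_id\tq1_0\tq1_1\n"
    "GENE_A\t12.5\t9.8\n"
    "GENE_B\t0.05\t0.0\n"
    "GENE_C\t0.0\t0.3\n"
)

expressed = expressed_ids(io.StringIO(toy_table))
print(len(expressed), expressed)
```

For a real file, replace the `io.StringIO(...)` object with `open("your_table.tsv", newline="")`. Running the same function on the gene-level and the isoform-level table would give the gene and transcript totals separately.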

If you want an overview of what is in your RNA-seq data, I would recommend RSeQC or Picard. These tools take your BAM files and give you statistics such as how reads are distributed across exons and introns, and you can also check for contamination in your data (like rRNA). If you run MultiQC on the folder containing the results from these tools, it compiles the stats from all samples so you can visualize them nicely. Another way of visualizing your data is to load your BAM files into a genome browser (something I learnt from a senior, and it has been very useful to me in validating DE genes).

I don't want to distract from the main answers (since I think everybody needs to do some testing of their own, meaning you shouldn't lock down the workflow ahead of time). However, in terms of the FPKM threshold, maybe these are relevant:

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

http://cdwscience.blogspot.com/2019/02/variance-stabilization-and-pseudocounts.html

You could also require genes to pass a certain FPKM threshold in a minimum number of samples, which is closer to what I have here (although that is much messier to look at, and one of my main points is that you really need to make your own templates to be sure you understand everything you are doing).

6.3 years ago by Charles Warden 8.3k
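The "threshold in a minimum number of samples" idea from the comment above could be sketched like this in Python. The gene names, FPKM values and the default `min_samples=2` are purely illustrative assumptions; this is not the code behind the linked template.

```python
def passes_filter(fpkms, threshold=0.1, min_samples=2):
    """True if at least `min_samples` values are above `threshold`."""
    return sum(v > threshold for v in fpkms) >= min_samples


# Made-up FPKM values for three samples per gene.
table = {
    "GENE_A": [12.5, 9.8, 11.0],
    "GENE_B": [0.05, 0.0, 0.4],
    "GENE_C": [0.0, 0.3, 0.2],
}

kept = [gene for gene, values in table.items() if passes_filter(values)]
print(kept)

# How the number of "expressed" genes depends on the sample requirement.
counts = {
    k: sum(passes_filter(v, min_samples=k) for v in table.values())
    for k in (1, 2, 3)
}
print(counts)
```

Printing the count for several `min_samples` values is a quick way to see how sensitive the reported number of expressed genes is to this choice.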