Question: Filtering in RNA-Seq to get Differentially Expressed lncRNA
0
gravatar for conor93
26 days ago by
conor9310
Hong Kong
conor9310 wrote:

Hi All,

I am quite new to RNA-Seq analysis and I have some confusion as to how to perform certain filtering steps in my pipeline.

Experiment Conditions: I have three conditions tested based on viral infection experiments in human cells; Uninfected, Wildtype Virus Infected, and Mutant Virus Infected with three replicates in each condition (9 samples in total).

I have a few goals in mind:

  1. I would like to identify significantly upregulated human long-non coding RNA between my conditions.

  2. Compare expression of significantly upregulated human lncRNA with protein coding genes (Co-Expression?)

I have aligned my raw fastq files using merged annotation consisting of the human (GENCODE) and viral genomes and annotation from LNCipedia (lncRNA only) using HISAT2. After QC of alignments, i performed HTSeq-counts to generate a read count matrix. As i am aware of the bias of using lncRNA annotation only, HTSeq was repeated using full annotation (human genome (GENCODE), viral genome and lncipedia annotation) to generate a second count matrix with both lncRNA and protein coding genes.

DESeq2 was then used to perform DE analysis on both sets of counts. Genes which had row counts less than 30 where filtered out. I now have 3 gene-list csv files of differentially expressed genes (Wildtype vs Uninfected and Mutant vs Uninfected and Wildtype vs Mutant) . (6 files in tota- 3 from deseq2 using lncRNA only count matrix and 3 from deseq2 using full annotation count matrix)

Recently I was told that I also need to filter by a FPKM threshold value of 1.0. I performed StringTie on the bam files generated from HISAT2 using the same annotation (GENOME Human genome, Viral genome and lncipedia annotation) to generate FPKM values but i notice that many lncRNA shown to be significant from the DESeq2 result files only have FPKM values less than 0.1. I am unsure as to how to filter out low expression genes as i am aware that lncRNA have low expression in contrast to mRNA.

Questions:

  1. Would it be better to use the full annotation rather than lncRNA only annotation for HTSeq to identify differentially expressed lncRNA?
  2. Are there more appropriate filters that I should use to identify differentially upregulated lncRNA genes?
  3. At what stage in the RNA-Seq pipeline is filtering supposed to be done (before or after DESeq2)?

Your thoughts and feedback are most appreciated. Thanks,

assembly rna-seq R • 175 views
ADD COMMENTlink modified 25 days ago by Charles Warden5.4k • written 26 days ago by conor9310
2
gravatar for Kevin Blighe
26 days ago by
Kevin Blighe30k
Republic of Ireland
Kevin Blighe30k wrote:

Hey conor,

Alignment of reads to the genome and then the determination of raw counts via HTseq, followed by differential expression analysis using DESeq2, is common / typical. Filtering raw counts <30 seems high but this may not affect your results that much. I usually go for <5 or <10.

Another way to do this is to pseudo-align to a reference transcriptome FASTA using Kallisto or Salmon, and then feed the count estimates from these into DESeq2.

I don't see the point in the StringTie / FPKM step, given what you have already performed. Nobody should even be using FPKM anymore. FPKM is derived from a normalisation method that renders your samples incomparable via differential expression. Indeed, no cross-sample normalisation is performed. Feed this back to your colleague who recommended FPKM, too.

Would it be better to use the full annotation rather than lncRNA only annotation for HTSeq to identify differentially expressed lncRNA?

You can use either. General results should be the same where it really matters. The differences may only lie in transcripts (lncRNAs) that lay on the fringes of statistical significance, which may otherwise have not even been used in whichever downstream analyses you aim to perform. By only focusing on lncRNAs, you save yourself a harsh FDR threshold because you have less tests.

In any case, your reads have already been aligned to the genome, which helps to mitigate some bias in how you then derive raw counts via HTseq.

Are there more appropriate filters that I should use to identify differentially upregulated lncRNA genes?

From DESeq2, typically start at FDR Q < 0.05 and absolute log (base 2) fold change > 2. In DESeq2, note that you should perform lfc shrinkage, like this:

res <- results(dds, contrast=c("Condition", "Treatment", "Control"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH", parallel=TRUE)
res <- lfcShrink(dds, contrast=c("Condition", "Treatment", "Control"), res=res)

At what stage in the RNA-Seq pipeline is filtering supposed to be done (before or after DESeq2)?

Usually on the raw counts, but 30 seems high to me. One can also remove when the counts have been normalised. Ask 2 people and you'll get 2 different answers.

Kevin

ADD COMMENTlink written 26 days ago by Kevin Blighe30k

Hi Kevin,

I have used the filtering criteria as per your suggestions, but what i have recently noticed is that with some highly expressed lncRNA genes, particularly with intronic ones, they tend to overlap with highly expressed protein coding genes. When i look at the raw counts generated using lncRNA only annotation, these genes show very high read counts and they pass the FDR and log2FC filters. But when I look at the read counts for the same lncRNA genes generated using the full annotation (whole human genome with lncipedia annotation) the counts for these genes across most samples are down to zero in many cases. I know this probably due to the way htseq regards these counts as ambiguous (i used the default UNION mode in htseq), but in this context, would the use of lncRNA only annotation simply produce an overestimation of the number of long non-coding RNA differentially expressed.

So my question is, if i do use lncRNA only annotation to generate the count matrix with HTSeq and do subsequent DE with DESeq2, is there a particular way to assess if the identified differentially expressed lncRNA genes are indeed being expressed and if so how to filter out these false-positives. I have performed kallisto analysis on the same data with full annotation, but i am having difficulty in understanding how to interpret both pipelines together (kallisto and hisat2/htseq) (Hope that made sense....)

Thanks,

Conor

ADD REPLYlink written 25 days ago by conor9310

Hey Conor, I imagine that the issue there is with multi-mapping reads. HTseq is tuned to not naïvely count all reads that align to multiple locations. There are other issues, too, such as situations where 2 genes share the same genomic position. It is impossible through short-read NGS to understand the origin of the majority of such reads.

Take a look at FAQ: https://htseq.readthedocs.io/en/master/count.html

Other read counting tools that can manage these types of 'multi-mappers' include featureCounts, mmquant, and even the 'old faithful`, BEDTools (the functionality of which most of these other tools are just replicating).

ADD REPLYlink modified 25 days ago • written 25 days ago by Kevin Blighe30k

Hi Kevin,

I can definitely identify this problem in regards to multi-mapped reads, but what i'd like to know is, is there any way to differentiate expression of lncRNA genes located at intronic sites of protein-coding genes, which are also simultaneously being expressed. I haven't been able to find much literature which addresses this.

Thanks,

Conor

ADD REPLYlink written 24 days ago by conor9310
1
gravatar for Charles Warden
25 days ago by
Charles Warden5.4k
Duarte, CA
Charles Warden5.4k wrote:

If you use different annotation sources, separate quantification can avoid issues with ambiguous reads in overlapping annotations from the different sources (where a similar or related overlapping annotation has a different name in the different annotation files). Ideally, having a merged list prior to quantification is theoretically best, but you'd have to have some system for naming the overlapping annotations.

I agree with Kevin that is is probably most helpful to filter prior to testing.

While I might have a preference on characterizing genes with higher FPKM values, I think 50% (or 20%, etc) greater than 1 is often a bit strict. However, if no expression is above 0.1, I would be a little concerned. Also, I might be concerned about having high expression that isn't consistent among your replicates (which I think can be more of an issue for lncRNAs), but I think DESeq2 replicate concordance is typically pretty good (although I'm sure you'll see exceptions to any general rule, if you make enough comparisons).

ADD COMMENTlink written 25 days ago by Charles Warden5.4k

Hi Charles,

Thanks for getting back to me,

Ideally, having a merged list prior to quantification is theoretically best, but you'd have to have some system for naming the overlapping annotations.

May I ask what system you mean when dealing with overlapping annotations?

Thanks,

Conor

ADD REPLYlink written 25 days ago by conor9310

If you are using GENCODE annotations, it should be OK to use both coding and non-coding gene annotations (unless you were doing something like using RefSeq coding genes and GENCODE lncRNAs).

My experience has been in a slightly different context, so I don't know if it relevant for your project. However, if you were studying cancer, there are MiTranscriptome lncRNAs (where you can get some functional assessment based upon their expression between cancer types). In this case, the name indicates the cancer gene expression pattern - however, I don't know if there is a strict lack of stranded overlap with your coding annotations.

In other words, if there is overlap between the coding genes and the lncRNAs (even on different strands, if you have an unstranded library), then overlapping regions would be considered ambiguous in htseq-count. So, if you concatenate the annotation files (which have overlapping annotations, but with different gene names), you may find there are more genes with low counts (compared to separate quantification).

ADD REPLYlink written 24 days ago by Charles Warden5.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1770 users visited in the last hour