Filtering in RNA-Seq to get Differentially Expressed lncRNA
2
1
Entering edit mode
2.6 years ago
conor93 ▴ 60

Hi All,

I am quite new to RNA-Seq analysis and I have some confusion as to how to perform certain filtering steps in my pipeline.

Experiment Conditions: I have three conditions tested based on viral infection experiments in human cells; Uninfected, Wildtype Virus Infected, and Mutant Virus Infected with three replicates in each condition (9 samples in total).

I have a few goals in mind:

1. I would like to identify significantly upregulated human long-non coding RNA between my conditions.

2. Compare expression of significantly upregulated human lncRNA with protein coding genes (Co-Expression?)

I have aligned my raw fastq files using merged annotation consisting of the human (GENCODE) and viral genomes and annotation from LNCipedia (lncRNA only) using HISAT2. After QC of alignments, i performed HTSeq-counts to generate a read count matrix. As i am aware of the bias of using lncRNA annotation only, HTSeq was repeated using full annotation (human genome (GENCODE), viral genome and lncipedia annotation) to generate a second count matrix with both lncRNA and protein coding genes.

DESeq2 was then used to perform DE analysis on both sets of counts. Genes which had row counts less than 30 where filtered out. I now have 3 gene-list csv files of differentially expressed genes (Wildtype vs Uninfected and Mutant vs Uninfected and Wildtype vs Mutant) . (6 files in tota- 3 from deseq2 using lncRNA only count matrix and 3 from deseq2 using full annotation count matrix)

Recently I was told that I also need to filter by a FPKM threshold value of 1.0. I performed StringTie on the bam files generated from HISAT2 using the same annotation (GENOME Human genome, Viral genome and lncipedia annotation) to generate FPKM values but i notice that many lncRNA shown to be significant from the DESeq2 result files only have FPKM values less than 0.1. I am unsure as to how to filter out low expression genes as i am aware that lncRNA have low expression in contrast to mRNA.

Questions:

1. Would it be better to use the full annotation rather than lncRNA only annotation for HTSeq to identify differentially expressed lncRNA?
2. Are there more appropriate filters that I should use to identify differentially upregulated lncRNA genes?
3. At what stage in the RNA-Seq pipeline is filtering supposed to be done (before or after DESeq2)?

Your thoughts and feedback are most appreciated. Thanks,

RNA-Seq R assembly • 2.3k views
2
Entering edit mode
2.6 years ago

Hey conor,

Alignment of reads to the genome and then the determination of raw counts via HTseq, followed by differential expression analysis using DESeq2, is common / typical. Filtering raw counts <30 seems high but this may not affect your results that much. I usually go for <5 or <10.

Another way to do this is to pseudo-align to a reference transcriptome FASTA using Kallisto or Salmon, and then feed the count estimates from these into DESeq2.

I don't see the point in the StringTie / FPKM step, given what you have already performed. Nobody should even be using FPKM anymore. FPKM is derived from a normalisation method that renders your samples incomparable via differential expression. Indeed, no cross-sample normalisation is performed. Feed this back to your colleague who recommended FPKM, too.

Would it be better to use the full annotation rather than lncRNA only annotation for HTSeq to identify differentially expressed lncRNA?

You can use either. General results should be the same where it really matters. The differences may only lie in genes (e.g. lncRNAs) that lie on the fringes of statistical significance, which may otherwise have not even been used in whichever downstream analyses you aim to perform. By only focusing on lncRNAs, you save yourself a harsh FDR threshold because you have less tests.

In any case, your reads have already been aligned to the genome, which helps to mitigate some bias in how you then derive raw counts via HTseq.

Are there more appropriate filters that I should use to identify differentially upregulated lncRNA genes?

From DESeq2, typically start at FDR Q < 0.05 and absolute log (base 2) fold change > 2. In DESeq2, note that you should perform lfc shrinkage, like this:

res <- results(dds, contrast=c("Condition", "Treatment", "Control"),
independentFiltering = TRUE, alpha = 0.05, pAdjustMethod = 'BH', parallel = TRUE)
res <- lfcShrink(dds, contrast=c("Condition", "Treatment", "Control"), res = res)


At what stage in the RNA-Seq pipeline is filtering supposed to be done (before or after DESeq2)?

Usually on the raw counts, but 30 seems high to me. One can also remove when the counts have been normalised. Ask 2 people and you'll get 2 different answers.

Kevin

0
Entering edit mode

Hi Kevin,

I have used the filtering criteria as per your suggestions, but what i have recently noticed is that with some highly expressed lncRNA genes, particularly with intronic ones, they tend to overlap with highly expressed protein coding genes. When i look at the raw counts generated using lncRNA only annotation, these genes show very high read counts and they pass the FDR and log2FC filters. But when I look at the read counts for the same lncRNA genes generated using the full annotation (whole human genome with lncipedia annotation) the counts for these genes across most samples are down to zero in many cases. I know this probably due to the way htseq regards these counts as ambiguous (i used the default UNION mode in htseq), but in this context, would the use of lncRNA only annotation simply produce an overestimation of the number of long non-coding RNA differentially expressed.

So my question is, if i do use lncRNA only annotation to generate the count matrix with HTSeq and do subsequent DE with DESeq2, is there a particular way to assess if the identified differentially expressed lncRNA genes are indeed being expressed and if so how to filter out these false-positives. I have performed kallisto analysis on the same data with full annotation, but i am having difficulty in understanding how to interpret both pipelines together (kallisto and hisat2/htseq) (Hope that made sense....)

Thanks,

Conor

0
Entering edit mode

Hey Conor, I imagine that the issue there is with multi-mapping reads. HTseq is tuned to not naïvely count all reads that align to multiple locations. There are other issues, too, such as situations where 2 genes share the same genomic position. It is impossible through short-read NGS to understand the origin of the majority of such reads.

Take a look at FAQ: https://htseq.readthedocs.io/en/master/count.html

Other read counting tools that can manage these types of 'multi-mappers' include featureCounts, mmquant, and even the 'old faithful, BEDTools (the functionality of which most of these other tools are just replicating).

1
Entering edit mode

Hi Kevin,

I can definitely identify this problem in regards to multi-mapped reads, but what i'd like to know is, is there any way to differentiate expression of lncRNA genes located at intronic sites of protein-coding genes, which are also simultaneously being expressed. I haven't been able to find much literature which addresses this.

Thanks,

Conor

0
Entering edit mode

hi kevin Do you faver me? I need help. I have analyzied my RNA-Seq data. I have used this tools: Download sequences(SRA) from ncbi database. FastQC (Check quality of sequencing). Trimmomatic(the quality of each raw library is analyzed and sequencing adapters and bad quality reads are removed) I have used paired end datas as input in hisat2 and I convert sam file to bam file. then i have used htseq to count. I had htseq count. I have used deseq2 package to get up and dawn genes. The genome I used was zea mays. could you please tell me how can i separate coding and noncoding sequence from deseq2 result?

0
Entering edit mode

As you have not shared examples of any data or showed any commands, I cannot advise further. Also, my knowledge of Zea mays ssp. is minimal.

0
Entering edit mode

thank you for reply kevin deseq2 result: partial my up genes:

        logFC               adj p value
Zm00001d041070_T001 269.4464931 3.64950283  0.186943942 19.52191008 7.15E-85    1.61E-81
Zm00001d037029_T001 114.0862173 7.672199837 0.398213312 19.26655789 1.03E-82    2.18E-79
Zm00001d039335_T001 136.4866379 7.387211015 0.397223594 18.59711034 3.39E-77    4.32E-74
Zm00001d028473_T001 59.75778355 3.987002596 0.219482898 18.16543626 9.69E-74    8.24E-71
Zm00001d008007_T001 147.1912346 7.697534624 0.429727556 17.91259258 9.41E-72    6.66E-69
Zm00001d053268_T001 439.8040952 6.018885854 0.340842564 17.65884456 8.70E-70    5.45E-67
Zm00001d008526_T001 58.51596929 7.916104543 0.464029338 17.05949148 2.97E-65    1.56E-62
Zm00001d044534_T001 43.19029759 6.995537812 0.417361465 16.76134094 4.68E-63    2.29E-60
Zm00001d031991_T001 81.8949371  6.345340927 0.379550608 16.71803654 9.69E-63    4.69E-60
Zm00001d026317_T004 97.7715811  6.707198851 0.40311967  16.63823264 3.68E-62    1.68E-59
Zm00001d026452_T001 1933.107047 2.803483578 0.168532727 16.63465388 3.91E-62    1.76E-59
Zm00001d052972_T001 128.1900299 3.617879048 0.218309531 16.57224508 1.11E-61    4.86E-59
Zm00001d012982_T001 92.79295428 5.004067664 0.301993771 16.57010223 1.15E-61    4.98E-59
Zm00001d015407_T001 26.97099876 5.784248559 0.3519198   16.43626916 1.05E-60    4.42E-58
Zm00001d053438_T002 45.55245534 5.695980785 0.350632843 16.24485811 2.43E-59    9.88E-57
Zm00001d008041_T001 99.22661808 4.908960788 0.306202585 16.0317418  7.67E-58    2.90E-55
Zm00001d053269_T001 82.97002115 7.293252694 0.458017365 15.92352878 4.35E-57    1.62E-54
Zm00001d003062_T001 482.5299175 4.216236147 0.270404891 15.59230727 8.21E-55    2.73E-52
Zm00001d042492_T001 50.12685464 7.786971418 0.507871978 15.33254789 4.63E-53    1.42E-50
`
0
Entering edit mode

So, you have the gene IDs. You need to match these to a gene biotype. Do you have a GTF for this species?

0
Entering edit mode

yes, there is GTF file and i have it

0
Entering edit mode

Cool, does it encode the gene biotype?

0
Entering edit mode

yes, I thought. thank you in advance

1
Entering edit mode
2.6 years ago

If you use different annotation sources, separate quantification can avoid issues with ambiguous reads in overlapping annotations from the different sources (where a similar or related overlapping annotation has a different name in the different annotation files). Ideally, having a merged list prior to quantification is theoretically best, but you'd have to have some system for naming the overlapping annotations.

I agree with Kevin that is is probably most helpful to filter prior to testing.

While I might have a preference on characterizing genes with higher FPKM values, I think 50% (or 20%, etc) greater than 1 is often a bit strict. However, if no expression is above 0.1, I would be a little concerned. Also, I might be concerned about having high expression that isn't consistent among your replicates (which I think can be more of an issue for lncRNAs), but I think DESeq2 replicate concordance is typically pretty good (although I'm sure you'll see exceptions to any general rule, if you make enough comparisons).

0
Entering edit mode

Hi Charles,

Thanks for getting back to me,

Ideally, having a merged list prior to quantification is theoretically best, but you'd have to have some system for naming the overlapping annotations.

May I ask what system you mean when dealing with overlapping annotations?

Thanks,

Conor

0
Entering edit mode

If you are using GENCODE annotations, it should be OK to use both coding and non-coding gene annotations (unless you were doing something like using RefSeq coding genes and GENCODE lncRNAs).

My experience has been in a slightly different context, so I don't know if it relevant for your project. However, if you were studying cancer, there are MiTranscriptome lncRNAs (where you can get some functional assessment based upon their expression between cancer types). In this case, the name indicates the cancer gene expression pattern - however, I don't know if there is a strict lack of stranded overlap with your coding annotations.

In other words, if there is overlap between the coding genes and the lncRNAs (even on different strands, if you have an unstranded library), then overlapping regions would be considered ambiguous in htseq-count. So, if you concatenate the annotation files (which have overlapping annotations, but with different gene names), you may find there are more genes with low counts (compared to separate quantification).