Question: Analysis of RNA-seq data
0
gravatar for lorenzosola96
2 days ago by
lorenzosola960 wrote:

Hi to all,

I have RNA-seq data from 31 tissues of 2 different individuals. My goal is to evaluate the expression of various genes of interest and to compare the expression level among these tissues. I read throughout a lot of papers and posts and I tried to set up my pipeline. I mean first of all after quality check of my reads, I trimmed them and mapped with HISAT2. To have an initial idea about gene expression I visualized the results on IGV (I perfectly known that with IGV it is not possible to quantify the expression but I did it just to have an idea). Then I calculated TPM with Salmon from raw reads and I realized different plots in Python (actually TPM from Salmon matched with IGV results). Now my intention is to proceed with Differential Expression Analysis with DESeq2. I have two questions:

1) is it a good pipeline to analyze data from RNA-seq data step by step? (I perfectly know that for expert bioinformaticians all these procedures seem to be like wasting time but unfortunately for me it is the first time I have to analyze RNA-seq data and I want to be sure step by step).

2) Reading different papers and posts I realized that TPM is not so good to compare gene expression among different samples (in my case tissues) and that there isn’t a real cut-off to consider a gene as low or high expressed. So it could be a good thing compare TPM of genes I m interesting in with ”housekeeping” genes to have an idea about low/medium/esxpression before DEA?

As I said before I want to proceed with DESeq2 as every people recommend to have stronger results.

Thank you in advance for your time and suggestion!

rna-seq • 136 views
ADD COMMENTlink modified 2 days ago by Istvan Albert ♦♦ 85k • written 2 days ago by lorenzosola960

Please read the DESeq2 manual at Bioconductor, it tells you everything you have to know when starting with transcript abundance estimates from salmon towards proper differential analysis.

ADD REPLYlink modified 2 days ago • written 2 days ago by ATpoint41k

Thank you for you suggestion. Actually I’m reading and following that pipeline. I’m not sure you replied exactly to my questions. I did this post to have ideas/feedbacks about my work and to know if I understood correctly that manual. Considering what I wrote, I’m right or wrong? Thank you again!

ADD REPLYlink written 2 days ago by lorenzosola960

Sorry for not being precise. If you follow the DESeq2 manual you will be fine, so yes that pipeline is basically fine. You should realize thought that differential expression is commonly done on the gene level rather than the transcript level, the latter is what salmon outputs, therefore (as explained in the manual) you need to use tools like tximport to summarize transcript level abundances to the gene level. As for 2) no comparing transcript abundances is not ideal, use the normalized gene level counts from DESeq2, e.g. via vst function. Alternatively use the normalized counts directly via counts(dds, normalized=TRUE).

ADD REPLYlink modified 1 day ago • written 2 days ago by ATpoint41k

Indeed I really appreciate your suggestions especially the second one. So basically txt import takes the output from Salmon (transcript level abundances) and transform this to the gene level to perform Differential Expression Analysis right?

ADD REPLYlink written 1 day ago by lorenzosola960

Yes, that is the idea. The problem with transcripts is that individual isoforms per gene share large part of the sequence information (most exons are identical between isoforms), therefore it is difficult to know which isoform a read comes from. For tx-level DE analysis you need special approaches that take this so-called mapping uncertainty into account. For the gene level, which is more robust, one simply sums these per-tx counts and then performs DE on these values. The tximport manual has some literature references for a more extensive discussion. You can directly use the output of tximport for DESeq2, see https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#transcript-abundance-files-and-tximport-tximeta

In most cases one only wants to know whether a gene is differentially-expressed and not whether individual isoforms change, therefore the gene level analysis is the most common DE approach.

ADD REPLYlink modified 1 day ago • written 1 day ago by ATpoint41k

Thank you for your very precious advices. In this case I have to focus on differentially-expressed genes not on isoforms. Anyway this can be very useful for future analysis.

ADD REPLYlink written 1 day ago by lorenzosola960
0
gravatar for Istvan Albert
2 days ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

The "raw" readcounts are not appropriate to quantify expression because long transcripts produce more reads over one copy of the transcript.

But when you view it in IGV what you see are the per base coverages, those do in fact correlate with the gene expression, thus what you do see in IGV is not so different from what you would get as a TPM.

That being said it is true that we should not take these measures too accurately, but both TPM and IGV coverages may be used as a qualitative rather than quantitative assessment of the data.

As for housekeeping genes, I would not try to identify these by a certain level of expression: low medium or high, that does not seem to make biological sense. What we call housekeeping genes those where the expression is tuned to the tissue and it stays the same over different conditions. It is not about the magnitude of that expression level.

To identify housekeeping genes find those where the deviations seem to be the smallest both within replicates and across them.

ADD COMMENTlink modified 2 days ago • written 2 days ago by Istvan Albert ♦♦ 85k

Thank you your suggestion is greatly appreciate and it is exactly what I waiting for. So if I understood right now I did qualitative evaluation in the right way. Now to do a quantitative assessment I have to proceed with DESeq2 (DEA) right? Instead talking only about TPM how can I set up a cut off to understand if the expression of ”my gene” is low/medium/high in the different tissues? I read different papers and I realized there isn’t a real cut off because it is depends from many aspects like the library size and so on.

ADD REPLYlink written 2 days ago by lorenzosola960
1

DESeq2 by the way offers a test against differential expression so you can then filter for genes with strong evidence of fold changes close to zero, that is more reliable than pulling deviations, see https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#tests-of-log2-fold-change-above-or-below-a-threshold

ADD REPLYlink written 1 day ago by ATpoint41k

Thank you again very clear indeed if the fold changes will be very close to zero it will mean no expression right? So we can filter these genes because are basically they are not expressed.

ADD REPLYlink written 1 day ago by lorenzosola960
1

No, fold changes refer differences between conditions. If you have strong evidence against differential expression then (e.g. house keepers) they will have high expression, but expression levels are almost the same between conditions. If they are lowly expressed they would probably not be significant because you simply do not have enough evidence (so counts) to make a statement about their expression level. In general, the higher the expression, the larger the statistical power.

ADD REPLYlink modified 1 day ago • written 1 day ago by ATpoint41k

As ATpoint also stated "fold change" deals with change between the condition, not the absolute value of the expression in each condition.

The fold change is basically the log2(expression2/expression1). A zero fold change means expression1 is equal to expression2 but no statement is made on the magnitude of it.

The actual value of expression could be 1, 100 or 10,000. Housekeeping genes are among those that do express above zero, without changing much (foldchange is close to 0) and the deviation in expression level across all samples is sufficiently small. These conditions are necessary but not sufficient to designate a gene as housekeeping.

The p-values you get with the statistical method DeSeq2 cannot be used to identify housekeeping genes since that is not what the statistical test was designed for.

Do a literature search to see what other scientists have come up with like this:

another method would be to evaluate known house keeping genes and using filtering criteria observed for those to find new ones.

ADD REPLYlink modified 18 hours ago • written 1 day ago by Istvan Albert ♦♦ 85k

If you check the so-called MA-plots a proper house keeping gene from my understanding would be on the right of it near y=0, so small change, high expression. DESeq2 has a couple of MA-plots in the manual, they are awesome for QC purposes and visualization of results (relationship between FC and expression level). Actually I disagree with the above statemtn that DESeq2 cannot identify house keeping genes. See the comment I made above, there is a test to test against differential expression, so if you take those genes and additionally filter for proper base mean (so average expression between samples) then you have candidates. You would obviously need additional information on the function of the genes but it might narrow down the list. Of course a bona fide list of house keepers from the literature is preferred, or can be combined with the DE expression results.

ADD REPLYlink modified 1 day ago • written 1 day ago by ATpoint41k

Thank you Albert and thank you ATpoint. Indeed my idea was to do some searches on literature in order to see house-keeping genes in different tissues and compare their expression with the expression my genes. Of course I will take in account all the paratemers you stated above. After these comments I have a clear idea about the next part of the analysis. Very useful suggestions, thank you again!!!

ADD REPLYlink written 1 day ago by lorenzosola960

the problem with using the p values from the differential expression method to identify housekeeping genes is that some (maybe most) high p-values that indicate non-differentially expressed genes will come out as such not because the expressions levels are consistently the same across all samples (same basemans), but because the expression varies too much even within replicates. So one would need to account for the variance as well. It is not so easy to filter for that information since you don't get that in the DESeq2 output.

ADD REPLYlink modified 18 hours ago • written 18 hours ago by Istvan Albert ♦♦ 85k

You are not getting my point. You can explicitely define an alternative hypothesis to test against differential expression the same way you test for differential expression in the standard workflow, but with this so-called altHypothesis you still take dispersion and baseMean into account and filter for low p-value as usual, with the difference that the Null hypothesis is different. The rejection of the Null (so padj < 0.05) here would mean that the gene is not DE. See http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#tests-of-log2-fold-change-above-or-below-a-threshold Testing for FCs below a threshold, it is the MA-plot at this link on the topright that is of interest here.

ADD REPLYlink modified 18 hours ago • written 18 hours ago by ATpoint41k

right, so fundamentally this is not a matter of using the correct "pipeline" or cutoffs, instead it is about what is an appropriate interpretation of your particular data.

There is no formula that works in every case since it is down to the mechanisms of each organism operates. Reading the materials from the vignettes and seeing how the data looks in other cases is very helpful. In the end, the particular details of the experiment and the data will be decisive in determining what makes sense.

ADD REPLYlink written 1 day ago by Istvan Albert ♦♦ 85k

I totally agree with you, It is not just matter to able to follow a pipeline but the most important thing is the interpretation of the results and the details of the experiment. Let's me say that if I follow correctly the most reccomended/used pipelines I will be more sure when I will analyze my results. Thank you again!

ADD REPLYlink written 1 day ago by lorenzosola960
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2049 users visited in the last hour