Question: Should I keep all HTSeq table informations for edgeR analysis?
0
gravatar for silas008
9 days ago by
silas00880
Brazil
silas00880 wrote:

I am using edgeR to analyse a RNAseq data and I do not know if I should keep the information below for the analysis:

__no_feature
__ambiguous
__too_low_aQual
__not_aligned
__alignment_not_unique

I think I should keep them because I think edgeR need the information of the reads counted in these features to calculate CPM and logCPM but I am not sure.

Can someone help me?

Thank you

edger rnaseq htseq • 101 views
ADD COMMENTlink modified 8 days ago by SMK350 • written 9 days ago by silas00880
2
gravatar for SMK
8 days ago by
SMK350
Gentopia, Belgium
SMK350 wrote:

I noticed this when reading the codes at StatQuest: Filtering genes with low read counts. (line 97 - 103):

If you used htseq-count to count the reads per gene, then the last five rows of data are not for specific genes, but are summaries of how the counting went. We must remove them from our new data frame, or else edgeR will treat those last 5 rows as individual genes and throw off the statistics.

So you have to filter out the last five lines in the raw *.htseq, otherwise, your DGEList object will keep containing that summary information:

> head(raw.data)
              id  wt1  wt2  wt3  ko1  ko2  ko3
1 LOC001_0001.1    0    0    0    0    0    0
2 LOC001_0002.1   82   71   77   91   79   84
3 LOC001_0003.1 2651 2785 2759 2802 2142 3117
4 LOC001_0004.1  543  436  455  629  430  550
5 LOC001_0005.1  399  373  324  420  279  386
6 LOC001_0006.1  108   91   82  116   88  125
> y <- DGEList(counts = raw.data[, 2:ncol(raw.data)], genes = raw.data[, 1])
> tail(n = 5, y$genes)
                       genes
12858           __no_feature
12859            __ambiguous
12860        __too_low_aQual
12861          __not_aligned
12862 __alignment_not_unique
ADD COMMENTlink modified 8 days ago • written 8 days ago by SMK350
1
gravatar for ATpoint
8 days ago by
ATpoint16k
Germany
ATpoint16k wrote:

If you have the original fastq files, I suggest you use salmon to quantify them against a reference transcriptome of your choice. salmon can correct for GC bias and has an elaborate way to deal with multimappers. Following salmon read your quantifications into tximport to aggregate them to the gene level and correct for different transcript length, then feed it into edgeR. Please see the manuals of salmon and the vignettes of tximport at Bioconductor for the necessary steps.

ADD COMMENTlink written 8 days ago by ATpoint16k
0
gravatar for shawn.w.foley
9 days ago by
shawn.w.foley550
USA
shawn.w.foley550 wrote:

Yes, edgeR takes the full HTSeq output in order to perform its normalizations.

ADD COMMENTlink written 9 days ago by shawn.w.foley550
1

Are you sure edgeR uses these last 5 lines with __ prefix? I don't think those are important. Do you have any source for your claim?

For your information, here is the edgeR code: https://rdrr.io/bioc/edgeR/src/R/readDGE.R
As you can see a warning will be raised whenever a line starting with __ is read, as those are not actual gene lines.

ADD REPLYlink written 9 days ago by WouterDeCoster39k

I understand your point. But in this case how can edgeR know the total number of reads to calculate CPM, for example?

The total mapped reads number included those information. No?!

Thank you for helping.

ADD REPLYlink written 9 days ago by silas00880
1

CPM is calculated versus the total number of reads assigned to a gene, not necessarily the total number of (mapped) reads..

ADD REPLYlink written 8 days ago by WouterDeCoster39k

I am analysing miRNAs. If I cut those lines my CPM will be "number of reads/number of miRNA aligned reads". But in most of the cases I prefered "number of reads/number of aligned reads". For exemple, if I want to compare miRNAs with piRNAs in the same sample, if I cut those lines it is not possible to know the difference between the the total number of miRNAs e total number of piRNAs. For exemplo which of them is more expressed in my sample.

I think the best option is to keep "no feature" and "ambiguos" to have the number of total reads aligned by the aligner. Am I wrong?

ADD REPLYlink written 6 days ago by silas00880

Sum of mapped reads in each sample?

> head(raw.data)
              id  wt1  wt2  wt3  ko1  ko2  ko3
1 LOC_0001.1    0    0    0    0    0    0
2 LOC_0002.1   82   71   77   91   79   84
3 LOC_0003.1 2651 2785 2759 2802 2142 3117
4 LOC_0004.1  543  436  455  629  430  550
5 LOC_0005.1  399  373  324  420  279  386
6 LOC_0006.1  108   91   82  116   88  125

> raw.data %>% summarize_if(is.numeric, sum, na.rm = TRUE)
      wt1     wt2     wt3     ko1     ko2     ko3
1 4930639 4553016 4053316 5669966 3908059 5089968

> y <- DGEList(counts = raw.data[, 2:ncol(raw.data)], genes = raw.data[, 1])
> y$samples
    group lib.size norm.factors
wt1     1  4930639            1
wt2     1  4553016            1
wt3     1  4053316            1
ko1     1  5669966            1
ko2     1  3908059            1
ko3     1  5089968            1
ADD REPLYlink modified 8 days ago • written 8 days ago by SMK350

Sorry, I didn't understand. Did you keep those lines in the first case and deleted them before use DGElist? If this is the case I agree with that.

More specifically, my question is: if I exclude those lines will edger normalize the data by the total number of mapped reads or by the number of reads mapped as genes (or something else, like miRNA, in my case?

ADD REPLYlink written 4 days ago by silas00880

Sorry for being unclear, I meant that the input to edgeR is the entire HTSeq output, it doesn't need to be edited prior to running the script.

ADD REPLYlink written 8 days ago by shawn.w.foley550
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1654 users visited in the last hour