Question: how to select genes after obtaining read counts
0
gravatar for statfa
3 months ago by
statfa310
statfa310 wrote:

Hi.

I downloaded the fastq files from ebi.ac.uk Then I aligned them to hg38 on galaxy. And then I downloaded the latest version of gtf (release version 27) from http://www.gencodegenes.org/stats/archive.html#a27 to obtain read counts. As you see, the gtf file has 58288 genes of which 19836 genes are protein coding.

The problem is that when you obtain the count matrix from your bam files using this gtf, you will be given a matrix with 58288 rows displaying the total number of genes in the gtf file. And thousands of these genes are counted zero for all samples. How should I filter my count data to a reasonable number of genes? 58288 genes is way too much and weird. Especially, when many of them are zero.

rna-seq genes read counts gtf • 249 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by statfa310

58288 genes is way too much and weird.

In what way? There is only 19836 protein coding genes out of that lot.

Also note that some of those genes have zero counts because they may be covered by multi-mapped reads, which are not counted by default by both fearureCounts/htseq-count.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax46k

Hey genomax. How are you? I want to estimate the parameters of NB distribution from a real count data. And I think that I should filter some of those genes because they may not represent a real datset and may negatively affect the NB distribution parameters.

And yes. 19836 genes are protein coding. Should I filter all of those 58288 genes that have zero counts?

ADD REPLYlink modified 3 months ago • written 3 months ago by statfa310

Yes, you should filter out zero counts, and also low-expressed genes which might be observed due to noise. You can calculate FPKM or TPM Click here for simple explanation for each gene and plot the distribution and decide an arbitrary cut-off. If the distribution follows a normal distribution, you can use standard deviation approach to filter, if not use some non-parametric methods like lower-quartile.

ADD REPLYlink written 3 months ago by arta410

Thank you very much arta. Yeah I think I will prefilter zero count transcripts first for simulation purpose. Then I will filter low expressed ones for analysis.

ADD REPLYlink written 3 months ago by statfa310

You should pre-filter based on the biotypes rather than the zero counts. As we have discussed in your other threads zero counts for some genes is a real world result. Best course here would be to use the protein coding genes for the simulation.

ADD REPLYlink written 3 months ago by genomax46k

Are you planning to use DESeq2/edgeR for the DE analysis? You could use only the protein coding genes if that is all you are interested in.

ADD REPLYlink written 3 months ago by genomax46k

Yes, I have done that in the past. In fact, one thing that I did was analyse different biotypes separately, such as protein coding, antisense RNA, linc-RNA, etc. One other thing I did was analyse all transcripts annotated in RefSeq (~22,000, covering various transcripts).

I think that it's also quite standard to remove transcripts that have low mean raw counts, such as those with mean < 5 or < 10.

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe17k

Aha, Thank you so much Kevin. I wanted to use the count table for simulation purpose.

ADD REPLYlink written 3 months ago by statfa310

No, I was going to simulate data using that count table.

ADD REPLYlink written 3 months ago by statfa310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1140 users visited in the last hour