List of human protein coding genes with given name (known function?)
2
0
Entering edit mode
2.9 years ago

Hello,

To put it simply, I am doing differential expression analysis on human RNA-seq data and I want to focus my analysis of genes that are: 1) Protein coding, so no SNOR or MIR 2) Genes with a given name (I see genes like AC006064.5 and I am not sure what they are)

I am hoping that by refocusing my analysis on a smaller set of genes I would get more hits as FDR seems to be impacted by sample size (amount of genes tested).

Any advice on sub-sampling a .gtf file to protein coding genes only?

Thanks

RNA-Seq Annotation Human • 1.5k views
2
Entering edit mode
2.9 years ago
ATpoint 62k

The GTF file should typically contain annotation information on the gene type. Select for example only protein_coding. That is how it is called in the GENCODE files, not sure what the exact term would be in RefSeq GTF. I typically remove a couple of RNA species before doing DEG, which is all smallRNAs (micro, sno, sn..., because they are not well-captured in standard RNA-seq and therefore not reliable) and TECs (to be experimentally confirmed) genes. What is left is (at least imho) the meaningful part of a standard RNA-seq experiment which then goes into dispersion estimation and model fitting. What you could probably do is to remove further gene types after the main DEG but prior to multiple testing correction if you have the feeling that you lose too many significant genes being eliminated by FDR. The FDR is definitely affected by the number of tested genes, and reducing the number of tests might make sense. From a "strategical" standpoint you could probably remove everything from which you a priori know that you will not benefit from finding these genes significant. That would be genes such as non-protein coding transcripts of (unknown) function. Finding these as significant might potentially harbour an interesting biological finding but if it is not the focus of your study or you are not willing to chase something like lincRNAs one could probably filter them out to reduce multiple testing burden. I am by no means a statistician so feel free to comment.

0
Entering edit mode

Makes a lot of sense, thanks. So I just did "cat Homo_sapiens.GRCh38.95.gtf | grep "gene_biotype \"protein_coding\"" > Homo_sapiens.GRCh38.95.protein_coding.gtf", is that ok in terms of creating a new gtf file?

0
Entering edit mode

Yeah, I guess that would do it. Basically reducing genes to protein-coding is what many microarray platforms do from what I understand, so I guess there is not really an argument against it.

0
Entering edit mode
10 months ago
Shicheng Guo ★ 9.1k
wget http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.chr.gtf.gz ./
zcat Homo_sapiens.GRCh38.104.chr.gtf.gz | grep "gene_biotype \"protein_coding\"" > Homo_sapiens.GRCh38.104.protein_coding.gtf
awk '$3=="gene"{print$14}' Homo_sapiens.GRCh38.104.protein_coding.gtf | sed 's/\"//g' | sed 's/\;//g'| sort -u > Homo_sapiens.GRCh38.104.protein_coding.txt


### here is the output

A1BG
A1CF
A2M
A2ML1
A3GALT2
A4GALT
A4GNT
AAAS
AACS
AAGAB
AAK1
AAMDC
AAMP
AANAT
AAR2
AARD
AARS1