differential RNA seq analysis
1
0
Entering edit mode
5.9 years ago

I am having trouble selecting a proper GTF file to perform differential RNA-seq analysis (in mm9), using annotation for protein coding genes (or at least all genes, but not pseudogenes). Importantly, I want at the end the quantification of gene expression per gene (not per transcript). Example command:

htseq-count -f bam -t CDS ${name}_mapped_reads/accepted_hits.bam Mus_musculus/Ensembl/GRCm38/Annotation/Archives/archive-2014-05-23-16-04-56/Genes/genes.gtf

I have tried using two different files, both downloaded from the following link: http://cole-trapnell-lab.github.io/cufflinks/igenome_table. First I downloaded the ensembl file and then the UCSC one but, having inspected the contents- neither satisfies the above criteria.

Does anyone know of a file/ how to filter in order to have a GTF file that will fulfil the above described criteria?

rna-seq • 1.5k views
ADD COMMENT
0
Entering edit mode
5.9 years ago

The -t CDS will direct htseq-count to only take the rows with "CDS" in the third column (this is my assumption, which is true for featureCounts, and IMHO has a high probability of being true for htseq-counts, too). To check which other entries are possible, you could use the following command:

$ cut -f3 gencode.vM16.annotation.gtf | sort | uniq
CDS
##contact: gencode-help@sanger.ac.uk
##date: 2017-12-01
##description: evidence-based annotation of the mouse genome (GRCm38), version M16 (Ensembl 91)
exon
##format: gtf
gene
##provider: GENCODE
Selenocysteine
start_codon
stop_codon
transcript
UTR

In my example file, you probably would want to go with "gene", so -t gene should do the trick.

Removing pseudogenes is a bit more annoying, that information tends to be stuck in the last column of a GTF file, in my file this looks like this:

$ egrep pseudogene gencode.vM16.annotation.gtf | head -n 1
chr10   HAVANA  transcript  129376961   129377401   .   +   .   gene_id "ENSMUSG00000107790.1"; transcript_id "ENSMUST00000205105.1"; gene_type "unprocessed_pseudogene"; gene_name "Olfr783-ps1"; transcript_type "unprocessed_pseudogene"; transcript_name "Olfr783-ps1-201"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTMUSG00000056564.1"; havana_transcript "OTTMUST00000139699.1"

# find out your options for gene type
$ cut -f9 gencode.vM16.annotation.gtf |  sed 's/.*gene_type//' | sed 's/;.*//' | sed 's/.\"\(.*\)\"/\1/' | sort | uniq
3prime_overlapping_ncRNA
antisense_RNA
bidirectional_promoter_lncRNA
##contact: gencode-help@sanger.ac.uk
##date: 2017-12-01
##description: evidence-based annotation of the mouse genome (GRCm38), version M16 (Ensembl 91)
##format: gtf
IG_C_gene
IG_C_pseudogene
IG_D_gene
IG_D_pseudogene
IG_J_gene
IG_LV_gene
IG_pseudogene
IG_V_gene
IG_V_pseudogene
lincRNA
macro_lncRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
polymorphic_pseudogene
processed_pseudogene
processed_transcript
protein_coding
##provider: GENCODE
pseudogene
ribozyme
rRNA
scaRNA
scRNA
sense_intronic
sense_overlapping
snoRNA
snRNA
sRNA
TEC
transcribed_processed_pseudogene
transcribed_unitary_pseudogene
transcribed_unprocessed_pseudogene
TR_C_gene
TR_D_gene
TR_J_gene
TR_J_pseudogene
TR_V_gene
TR_V_pseudogene
unitary_pseudogene
unprocessed_pseudogene

You could then use egrep -v to extract only those rows of the gtf that do not match the gene types you don't like, or, alternatively, egrep 'gene_type "protein_coding"' gencode.vM16.annotation.gtf might give you what you want.

ADD COMMENT

Login before adding your answer.

Traffic: 2154 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6