Question: HTSEq-Count returns 0 counts for all genes in the counts file!
0
gravatar for manish.nu
4 weeks ago by
manish.nu0
manish.nu0 wrote:

Hi all!

I've been teaching myself to do RNA-Seq.

I prepared my libraries using KAPA mRNA HyperPrep Kit and had them sequenced single read 50bp

To generate my counts' file for DESeq2, I

Downloaded the reference/GTF files from Genecode:

Gencode/GRCh38.p10.genome.fa
Homo_sapiens.GRCh38.91.gtf

Indexed them as:

STAR --runMode genomeGenerate -- genomeDir $genomeDir --genomeFastaFiles /hg/GRCh38.p10.genome.fa --sjdbGTFfile /hg/Homo_sapiens.GRCh38.91.gtf --sjdb Overhang 49 --runThreadN 20

Aligned the fastq files as:

STAR --runThreadN 20 --genomeDir $genome_dir --readFilesCommand zcat --readFilesIn /projects/FastQ/post/SK1-1_S1.fastq.gz --outSAMtype BAM Unsorted --outFileNamePrefix /projects/aligned/post/P1_pre_

created HTSeqCount files using:

htseq-count -f bam -q -m union -s yes -t exon -i gene_id /projects/aligned/P1_post_Aligned.out.bam /Homo_sapiens.GRCh38.91.gtf  > /projects/counts/htseq/P1_pre_hts_counts

My counts file look like this:

ENSG00000284741 0
ENSG00000284742 0
ENSG00000284743 0
ENSG00000284744 0
ENSG00000284745 0
ENSG00000284746 0
ENSG00000284747 0
ENSG00000284748 0
__no_feature    22375495
__ambiguous 0
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  18649806

All of the genes are reflected = 0. Can someone please tell me what am I doing wrong?

rna-seq htseq-count • 227 views
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by manish.nu0
1

Does your GTF contain exon lines, actually?

ADD REPLYlink written 4 weeks ago by Macspider2.2k

this is what my gtf file shows:

$ head Homo_sapiens.GRCh38.91.gtf 
#!genome-build GRCh38.p10
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.25
#!genebuild-last-updated 2017-06
1   havana  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1   havana  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1   havana  exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1   havana  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";

It looks like it has 'exon' but I wouldn't trust myself right now.

ADD REPLYlink modified 4 weeks ago by GenoMax42k • written 4 weeks ago by manish.nu0
1

Hi, Two suggestions - 1) Plz check the Log.final.out created by STAR at the end of the run. Check the mapping % there.

2) If the mapping% is ok, then as a check you could re-run STAR with

--quantMode GeneCounts

option. You wouldn't need to run a count generator separately.

ADD REPLYlink written 4 weeks ago by Amitm1.4k

Very stupidly deleted the log files but as far as I remember seeing them in the beginning, all files mappped >30% except one which was in 20-30%

I'm very tempted to run the alignment again with --quantMode GeneCounts

ADD REPLYlink written 4 weeks ago by manish.nu0

Are the chromosome identifiers in your fasta genome the same as in the gtf?

ADD REPLYlink written 4 weeks ago by WouterDeCoster26k
1

The standard Genecode fasta has a header like "chr1 1". As far as I remember, STAR uses the full header to name its chromosomes. Can you report the result of samtools view -H /projects/aligned/P1_post_Aligned.out.bam | head and compare it to the first column of /Homo_sapiens.GRCh38.91.gtf ?

ADD REPLYlink written 4 weeks ago by michael.ante2.2k

Hi WouterDeCoster, to me, it looks like they're not the same (see below please). I went by the assumption that if you download your genome file and GTF file from the same source (Gencode in my case), they will be 'compatible'. This makes me doubt my whole alignment step.

Would you recommend that I start from scratch (Indexing>Alignment) and could you please suggest a source for a GTF file? Can I still salvage this by switching the GTF file?

ADD REPLYlink written 4 weeks ago by manish.nu0
2

If you get the correct GTF file from Gencode (http://www.gencodegenes.org/releases/current.html ) then the counting will work.

##description: evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90)
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2017-08-01
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; 
gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by GenoMax42k

It's working! Thanks so much!

ADD REPLYlink written 4 weeks ago by manish.nu0
2

Just using the right gtf for counting would be sufficient, no need to realign, unless you want to use the gtf in alignment as well for defining the known exon structure.

ADD REPLYlink written 4 weeks ago by WouterDeCoster26k
$ samtools view -H P1_post_Aligned.out.bam | head
@HD VN:1.4
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717

GTF File:

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; 1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1"; 1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1"; 1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";

Does this information help? To me it looks like they're not the same.

ADD REPLYlink modified 4 weeks ago by GenoMax42k • written 4 weeks ago by manish.nu0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1226 users visited in the last hour