Question: feature count at gene and transcript level
2
gravatar for Prakash
4 weeks ago by
Prakash1.7k
India
Prakash1.7k wrote:

Hi all,

I have aligned my RNAseq data to the reference genome and extracted raw count using featureCounts tool. I used two different GTF file from NCBI, one having only Refseq id as gene_id and other having gene symbol as gene_id.

Refseq as gene_id

chr11 mm10_refGene exon 100883834 100885169 0.000000 + . gene_id "NM_001164062"; transcript_id "NM_001164062";

Gene symbol as gene_id

chr11 stdin exon 100883834 100885169 . + . gene_id "Stat5a"; transcript_id "NM_001164062"; exon_number "18"; exon_id "NM_001164062.18"; gene_name "Stat5a";

Surprisingly when I looked into the raw count for above feature that I have highlighted is showing different count from featuteCounts output for these two different GTF file It would be really great if someone could help me understanding these different results or If I am missing something here.

gene_id Count

NM_001164062     0

NM_011488  11

Stat5a  14046

Here is the code that I used to extract the count

featureCounts --verbose -p -B -T 20 -a mm10_refseq.gtf -o Emp Emp_R1.sorted.bam

and IGV browser snapshot of the bam coverage at the highlighted features.

Screenshot-from-2019-10-12-12-48-10-png

rna-seq • 174 views
ADD COMMENTlink written 4 weeks ago by Prakash1.7k

There are probably many differences between your GTF files in addition to the one line you posted. Based on the formatting, they don't look like they are actually from NCBI.

Based on that screenshot, there might be something wrong with your GTF file. The exons should be connected like they are in the RefSeq track.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by igor8.8k

Actually both the gtf file were downloaded from UCSC table browser, First one is NCBI refseq track from table UCSC Refseq (refgene) downloaded as GTF file and second one downloaded as "all fields from selected table" and further converted to GTF format using genePredToGtf

I checked several features and they are having same number of count from both the gtf file but only few features I guess is behaving differently.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Prakash1.7k

I am not sure how the table browser works in this particular case, but it's possible that it does not convert between formats properly. The initial table may not have all the necessary fields. I would use proper GTF files.

Also, I would modify the initial question to specify how you generated the GTF files. Although some of the contents may be from NCBI, the files are not.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by igor8.8k

What I understood that there is some issue with NCBI REFseq GTF file downloaded from UCSC table browser.

The second GTF file having gene symbol as gene_id i generated using following line of code

rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz ./
gzip -d refGene.txt.gz
cut -f 2- refGene.txt > refGene.input
genePredToGtf file refGene.input mm10refGene.gtf
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Prakash1.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 962 users visited in the last hour