Using RefSeq gene annotation file to determine gene lengths
1
0
Entering edit mode
2.8 years ago

I am trying to calculate Fragments Per Kilobase of transcript per Million mapped reads (FPKM) for an RNA-seq experiment. The gene annotation file I used to count genes with htseq was UCSC's NCBI RefSeq file (https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz). This file has a not so straightforward structure without column headings:

head mm10.ncbiRefSeq.gtf
chr1    ncbiRefSeq  transcript  3199733 3672278 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3";  gene_name "Xkr4";
chr1    ncbiRefSeq  exon    3199733 3207317 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "1"; exon_id "XM_006495550.3.1"; gene_name "Xkr4";
chr1    ncbiRefSeq  3UTR    3199733 3207317 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "1"; exon_id "XM_006495550.3.1"; gene_name "Xkr4";
chr1    ncbiRefSeq  exon    3213439 3216968 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "2"; exon_id "XM_006495550.3.2"; gene_name "Xkr4";
chr1    ncbiRefSeq  3UTR    3213439 3216021 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "2"; exon_id "XM_006495550.3.2"; gene_name "Xkr4";
chr1    ncbiRefSeq  CDS 3216025 3216968 .   -   2   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "2"; exon_id "XM_006495550.3.2"; gene_name "Xkr4";
chr1    ncbiRefSeq  exon    3421702 3421901 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "3"; exon_id "XM_006495550.3.3"; gene_name "Xkr4";
chr1    ncbiRefSeq  CDS 3421702 3421901 .   -   1   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "3"; exon_id "XM_006495550.3.3"; gene_name "Xkr4";
chr1    ncbiRefSeq  exon    3670552 3672278 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "4"; exon_id "XM_006495550.3.4"; gene_name "Xkr4";
chr1    ncbiRefSeq  CDS 3670552 3671348 .   -   0   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "4"; exon_id "XM_006495550.3.4"; gene_name "Xkr4";

How does one use this file to calculate feature lengths that can then be used to calculate FPKM?

I can see that columns 4 and 5 are the start and stop of each feature so you could get length from that, but how do you condense all of the other information so you just have gene_id and length?

I am sure this has been done before I just don't know how you parse through all the gene regions to just get one length per gene.

Thanks in advance!

refseq rstudio htseq featurelength gtf fpkm geneid • 1.3k views
ADD COMMENT
0
Entering edit mode

that file is a typical GFF (or GTF, very similar) format. You will quickly find what the columns are when searching for it.

Not 100% sure of it but I think the AGAT package might have some sub-programs to calculate lengths of genes etc.

AGAT - Another Gff/Gtf Analysis Toolkit

Additional note: FPKM is a somewhat deprecated measure (except for very specific use-cases) , nowadays people rely more on TPM or CPM values.

ADD REPLY
0
Entering edit mode

Not such feature in AGAT :/ but might be a good idea to implement something in AGAT to perform this type of taskā€¦

ADD REPLY
0
Entering edit mode
2.8 years ago

Don't try to do this yourself. Gene length is not enough, because genes can have multiple transcripts of differing length.

Use either RSEM, or something like Salmon. Those are smart enough to calculate this properly, and they will return FPKM.

ADD COMMENT
3
Entering edit mode

s/FPKM/TPM/ :)

ADD REPLY

Login before adding your answer.

Traffic: 1430 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6