Cufflinks Output File Genes.Fpkm_Tracking Number Of Genes
3
2
Entering edit mode
10.1 years ago

Hi,

I'm analyzing 16 datasets from RNA-seq of human cell line samples. I've ran TopHat and cufflinks and now want to look at the FPKM values for human protein coding genes. First I wonder why I get different number of genes in the output file genes.fpkm_tracking for the different samples, I also used biomaRt to extract the protein coding genes from the total lists of >60 thousands genes that were generated from cufflinks, and the I found out that there were only about 6000 protein coding genes in the output which sound like a very low number, I expected all ~20 000 protein coding genes to be there but with many of them unexpressed (ie. FPKM 0) ...

Would appreciate if anyone could explain how this works.

Frida

rna cufflinks ensembl • 7.4k views
ADD COMMENT
0
Entering edit mode

You need to supply more information. When you ran tophat and cufflinks, did you supply a GTF file of gene descriptions? If you give the software your description of 20,000 protein coding locations, and restrict it to counting those locations, it will give you data about those locations (including FPKM 0). Describe your pipeline more clearly. What were your parameters? How did you interface cufflinks and biomaRt?

ADD REPLY
0
Entering edit mode

Hi, thanks. I had a gtf-file but just realized that I had used the flag -g instead of -G that guides a RABT assembly, wich I did not intend. I hope that will solve the problem.

ADD REPLY
2
Entering edit mode
10.0 years ago
Emily 23k

You're going to want to install the Perl API.

Then run this script:

#usr!/bin/perl
use warnings;
use strict;

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db (
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous');

open (IN, "<inputfilename.txt");
open (OUT, ">protein_coding.txt");

my $ga = $registry->get_adaptor( "human", "core", "gene");

while (<IN>) {
    my @fields = split / /, $_;
    my $gene = $ga->fetch_by_stable_id($fields[0]);
    if ($gene->biotype eq 'protein_coding' or 'IG_V_gene' or 'IG_D_gene' or 'TR_C_gene' or 'TR_J_gene' or 'TR_V_gene' or 'IG_J_gene' or 'IG_C_gene' or 'TR_D_gene'){
        print OUT $_;
    }
}

You can edit the file names to match your files.

ADD COMMENT
0
Entering edit mode

It's so nice when people appreciate your help.

ADD REPLY
0
Entering edit mode
10.1 years ago
Emily 23k

Don't use BioMart to try to extract whole genome data. It can't handle that amount of data and breaks down partway through your query, only giving you a partial dataset, but not giving you any warning that it's done it. What format do you have your output in, and what format are you looking to get?

ADD COMMENT
0
Entering edit mode

Hi,

interesting, I didn't know that biomaRt had that kind of limitation, my output is just a table with ENSG-id's and FPKM values and I tried to extract the protein coding genes by using the bio_type annotations.

ADD REPLY
0
Entering edit mode

Your best best is our Perl API. Write a simple script that runs through your list and gets the biotype of each gene. If the biotype is one of the protein coding ones (ie protein_coding and the various IG and TR biotypes), then print the gene ID and the FPKM values into a new document. I can help you with this if you need it.

ADD REPLY
0
Entering edit mode

Hi Emily, thanks for your answer. Yes, If you could help me with that it would be super. /Frida

ADD REPLY
0
Entering edit mode

How's your Perl? What does your input file look like (give me a sample line)?

ADD REPLY
0
Entering edit mode

Input file looks like this:

ENSG00000241599 - - ENSG00000241599 RP11-34P13.9 - 1:160445-161525 - - 0 0 0 OK

What I want to do is simply to extract all protein coding genes and analyze their FPKM values which is indicated in column 10.

ADD REPLY
0
Entering edit mode

How's your Perl?

ADD REPLY
0
Entering edit mode

Hm, what do you mean by that?

ADD REPLY
0
Entering edit mode

Perl is a programming language. From your answer I'm going to guess "non-existent".

ADD REPLY
0
Entering edit mode

Yes, my pearl skills are limited, I mainly use R..

ADD REPLY
0
Entering edit mode
10.0 years ago
seidel 11k

If you know R, and your GTF file used to generate cufflinks output is based on ensembl ids, I don't think this is a difficult task for biomaRt at all. If I understand correctly, you just want to know which ids are protein coding, and then grab those results from your result files. Here is an example of doing that, assuming you can point to a directory containing directories of cufflinks results:

# biomaRt
library(biomaRt)
bm <- useMart("ensembl")
bm <- useDataset("hsapiens_gene_ensembl", mart=useMart("ensembl"))

# grab some annotation
anno <- getBM(attributes=c("ensembl_gene_id", "external_gene_id", "transcript_biotype", "description"), mart=bm)

coding <- c("protein_coding","IG_V_gene","IG_D_gene"
            ,"TR_C_gene","TR_J_gene","TR_V_gene"
            ,"IG_J_gene","IG_C_gene","TR_D_gene")

# narrow it down
anno.coding <- anno[anno$transcript_biotype %in% coding,]

# get a list of cufflinks directories
cufflinksDirs <- dir()

# create something to hold the data
fpkm <- data.frame(matrix(NA, nrow=nrow(anno.coding), ncol=length(cufflinksDirs)))
rownames(fpkm) <- anno.coding$ensembl_gene_id

for( i in 1:nrow(cufflinksDirs) ){ 
  fname <- paste(cufflinksDirs[i], "/genes.fpkm_tracking",sep="")
  # tell me
  cat(fname, "\n")
  # read in the data
  x <- read.table(file=fname, sep="\t", header=T, as.is=T)
  # match data ids to master table
  iv <- match(x[,1], rownames(fpkm))
  fpkm[iv,i] <- x[,"FPKM"]
}

colnames(fpkm) <- cufflinksDirs
write.table(fpkm, file="fpkm.txt", sep="\t", col.names=NA)
ADD COMMENT

Login before adding your answer.

Traffic: 3443 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6