Question: Cufflinks Output File Genes.Fpkm_Tracking Number Of Genes
2
gravatar for frida.danielsson
5.1 years ago by
European Union
frida.danielsson40 wrote:

Hi,

I'm analyzing 16 datasets from RNA-seq of human cell line samples. I've ran TopHat and cufflinks and now want to look at the FPKM values for human protein coding genes. First I wonder why I get different number of genes in the output file genes.fpkm_tracking for the different samples, I also used biomaRt to extract the protein coding genes from the total lists of >60 thousands genes that were generated from cufflinks, and the I found out that there were only about 6000 protein coding genes in the output which sound like a very low number, I expected all ~20 000 protein coding genes to be there but with many of them unexpressed (ie. FPKM 0) ...

Would appreciate if anyone could explain how this works.

Frida

ensembl cufflinks rna • 4.9k views
ADD COMMENTlink modified 5.1 years ago by seidel6.8k • written 5.1 years ago by frida.danielsson40

You need to supply more information. When you ran tophat and cufflinks, did you supply a GTF file of gene descriptions? If you give the software your description of 20,000 protein coding locations, and restrict it to counting those locations, it will give you data about those locations (including FPKM 0). Describe your pipeline more clearly. What were your parameters? How did you interface cufflinks and biomaRt?

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by seidel6.8k

Hi, thanks. I had a gtf-file but just realized that I had used the flag -g instead of -G that guides a RABT assembly, wich I did not intend. I hope that will solve the problem.

ADD REPLYlink written 5.1 years ago by frida.danielsson40
2
gravatar for Emily_Ensembl
5.1 years ago by
Emily_Ensembl18k
EMBL-EBI
Emily_Ensembl18k wrote:

You're going to want to install the Perl API.

Then run this script:

#usr!/bin/perl
use warnings;
use strict;

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db (
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous');

open (IN, "<inputfilename.txt");
open (OUT, ">protein_coding.txt");

my $ga = $registry->get_adaptor( "human", "core", "gene");

while (<IN>) {
    my @fields = split / /, $_;
    my $gene = $ga->fetch_by_stable_id($fields[0]);
    if ($gene->biotype eq 'protein_coding' or 'IG_V_gene' or 'IG_D_gene' or 'TR_C_gene' or 'TR_J_gene' or 'TR_V_gene' or 'IG_J_gene' or 'IG_C_gene' or 'TR_D_gene'){
        print OUT $_;
    }
}

You can edit the file names to match your files.

ADD COMMENTlink written 5.1 years ago by Emily_Ensembl18k

It's so nice when people appreciate your help.

ADD REPLYlink written 5.1 years ago by Emily_Ensembl18k
0
gravatar for Emily_Ensembl
5.1 years ago by
Emily_Ensembl18k
EMBL-EBI
Emily_Ensembl18k wrote:

Don't use BioMart to try to extract whole genome data. It can't handle that amount of data and breaks down partway through your query, only giving you a partial dataset, but not giving you any warning that it's done it. What format do you have your output in, and what format are you looking to get?

ADD COMMENTlink written 5.1 years ago by Emily_Ensembl18k

Hi,

interesting, I didn't know that biomaRt had that kind of limitation, my output is just a table with ENSG-id's and FPKM values and I tried to extract the protein coding genes by using the bio_type annotations.

ADD REPLYlink written 5.1 years ago by frida.danielsson40

Your best best is our Perl API. Write a simple script that runs through your list and gets the biotype of each gene. If the biotype is one of the protein coding ones (ie protein_coding and the various IG and TR biotypes), then print the gene ID and the FPKM values into a new document. I can help you with this if you need it.

ADD REPLYlink written 5.1 years ago by Emily_Ensembl18k

Hi Emily, thanks for your answer. Yes, If you could help me with that it would be super. /Frida

ADD REPLYlink written 5.1 years ago by frida.danielsson40

How's your Perl? What does your input file look like (give me a sample line)?

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Emily_Ensembl18k

Input file looks like this:

ENSG00000241599 - - ENSG00000241599 RP11-34P13.9 - 1:160445-161525 - - 0 0 0 OK

What I want to do is simply to extract all protein coding genes and analyze their FPKM values which is indicated in column 10.

ADD REPLYlink written 5.1 years ago by frida.danielsson40

How's your Perl?

ADD REPLYlink written 5.1 years ago by Emily_Ensembl18k

Hm, what do you mean by that?

ADD REPLYlink written 5.1 years ago by frida.danielsson40

Perl is a programming language. From your answer I'm going to guess "non-existent".

ADD REPLYlink written 5.1 years ago by Emily_Ensembl18k

Yes, my pearl skills are limited, I mainly use R..

ADD REPLYlink written 5.1 years ago by frida.danielsson40
0
gravatar for seidel
5.1 years ago by
seidel6.8k
United States
seidel6.8k wrote:

If you know R, and your GTF file used to generate cufflinks output is based on ensembl ids, I don't think this is a difficult task for biomaRt at all. If I understand correctly, you just want to know which ids are protein coding, and then grab those results from your result files. Here is an example of doing that, assuming you can point to a directory containing directories of cufflinks results:

# biomaRt
library(biomaRt)
bm <- useMart("ensembl")
bm <- useDataset("hsapiens_gene_ensembl", mart=useMart("ensembl"))

# grab some annotation
anno <- getBM(attributes=c("ensembl_gene_id", "external_gene_id", "transcript_biotype", "description"), mart=bm)

coding <- c("protein_coding","IG_V_gene","IG_D_gene"
            ,"TR_C_gene","TR_J_gene","TR_V_gene"
            ,"IG_J_gene","IG_C_gene","TR_D_gene")

# narrow it down
anno.coding <- anno[anno$transcript_biotype %in% coding,]

# get a list of cufflinks directories
cufflinksDirs <- dir()

# create something to hold the data
fpkm <- data.frame(matrix(NA, nrow=nrow(anno.coding), ncol=length(cufflinksDirs)))
rownames(fpkm) <- anno.coding$ensembl_gene_id

for( i in 1:nrow(cufflinksDirs) ){ 
  fname <- paste(cufflinksDirs[i], "/genes.fpkm_tracking",sep="")
  # tell me
  cat(fname, "\n")
  # read in the data
  x <- read.table(file=fname, sep="\t", header=T, as.is=T)
  # match data ids to master table
  iv <- match(x[,1], rownames(fpkm))
  fpkm[iv,i] <- x[,"FPKM"]
}

colnames(fpkm) <- cufflinksDirs
write.table(fpkm, file="fpkm.txt", sep="\t", col.names=NA)
ADD COMMENTlink written 5.1 years ago by seidel6.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 744 users visited in the last hour