If you know R, and the GTF file you used to generate the cufflinks output is based on Ensembl ids, this should not be a difficult task for biomaRt. If I understand correctly, you just want to know which ids are protein coding and then pull those rows out of your result files. Here is an example of doing that, assuming your working directory contains one subdirectory of cufflinks results per sample:
# biomaRt
library(biomaRt)
bm <- useMart("ensembl")
bm <- useDataset("hsapiens_gene_ensembl", mart=bm)
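# grab some annotation
# (newer Ensembl releases call "external_gene_id" "external_gene_name"; check listAttributes(bm) if this fails)
anno <- getBM(attributes=c("ensembl_gene_id", "external_gene_id", "transcript_biotype", "description"), mart=bm)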
coding <- c("protein_coding","IG_V_gene","IG_D_gene"
,"TR_C_gene","TR_J_gene","TR_V_gene"
,"IG_J_gene","IG_C_gene","TR_D_gene")
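# narrow it down
anno.coding <- anno[anno$transcript_biotype %in% coding,]
# one row per gene, so the ids can be used as row names below
anno.coding <- anno.coding[!duplicated(anno.coding$ensembl_gene_id),]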
# get a list of cufflinks directories
cufflinksDirs <- dir()
# create something to hold the data
fpkm <- data.frame(matrix(NA, nrow=nrow(anno.coding), ncol=length(cufflinksDirs)))
rownames(fpkm) <- anno.coding$ensembl_gene_id
for( i in seq_along(cufflinksDirs) ){
  fname <- paste(cufflinksDirs[i], "/genes.fpkm_tracking",sep="")
  # tell me
  cat(fname, "\n")
  # read in the data
  x <- read.table(file=fname, sep="\t", header=TRUE, as.is=TRUE)
  # match data ids to master table
  iv <- match(x[,1], rownames(fpkm))
  # ids missing from the coding table come back as NA; skip them
  keep <- !is.na(iv)
  fpkm[iv[keep],i] <- x[keep,"FPKM"]
}
colnames(fpkm) <- cufflinksDirs
write.table(fpkm, file="fpkm.txt", sep="\t", col.names=NA)
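The id-matching step can be tried on toy data before pointing it at real cufflinks output. This is only a minimal sketch: the gene ids, the sample name and the FPKM values below are made up for illustration and do not come from any real run.

```r
# Toy master table: rows are the protein-coding ids to keep (hypothetical ids)
coding_ids <- c("ENSG_A", "ENSG_B", "ENSG_C")
fpkm <- data.frame(sample1 = rep(NA_real_, length(coding_ids)),
                   row.names = coding_ids)

# Toy stand-in for one genes.fpkm_tracking file; ENSG_X is not in the coding table
x <- data.frame(tracking_id = c("ENSG_B", "ENSG_X", "ENSG_A"),
                FPKM = c(2.5, 9.0, 0.0),
                stringsAsFactors = FALSE)

# match() returns NA for ids that are absent from the master table
iv <- match(x$tracking_id, rownames(fpkm))
keep <- !is.na(iv)

# assign only the rows that matched
fpkm[iv[keep], "sample1"] <- x$FPKM[keep]
print(fpkm)
```

Genes that never show up in a sample stay NA, so you can tell "not reported" apart from a real FPKM of 0.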
You need to supply more information. When you ran tophat and cufflinks, did you supply a GTF file of gene descriptions? If you give the software your description of 20,000 protein coding locations, and restrict it to counting those locations, it will give you data about those locations (including FPKM 0). Describe your pipeline more clearly. What were your parameters? How did you interface cufflinks and biomaRt?
Hi, thanks. I had a GTF file, but I just realized that I had used the -g flag, which guides a RABT assembly, instead of -G. That was not what I intended. I hope fixing that will solve the problem.