Thanks for taking the time to answer a newcomer's question.
I am trying to find out whether certain proteins are expressed in an RNA-Seq dataset I was given. Unfortunately, the data are not raw reads: the file is a list of RPKM values associated with RefSeq genes. It was the only 'raw' dataset provided on GEO.
Over half of these 37,000 genes have an RPKM of 0; the others span a wide range of expression. However, I cannot find any RefSeq IDs corresponding to the genes that encode my proteins of interest. I wonder whether this is my fault or the dataset's.
Here's what I did:
1. Took the list of proteins of interest and found the HUGO gene names for the genes encoding them, fed those gene names into BioMart to get a list of associated RefSeq IDs, then searched my dataset for those IDs (none!)
2. Input the 37,000 RefSeq IDs I have into DAVID, generated an annotation report, and searched it for mentions of my proteins of interest (none!)
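
In case it helps spot a mistake on my side, the lookup in step 1 boils down to roughly the Python sketch below. The tab-delimited layout, the column names `refseq_id` and `rpkm`, and the toy accessions are assumptions for illustration, not the actual GEO file. The sketch also ignores case, stray whitespace, and accession version suffixes (such as `.2`) when matching, which is a guess about how IDs from two sources might differ rather than something I know to be the cause.

```python
import csv
import io

def normalize_refseq(acc):
    """Trim whitespace, uppercase, and drop a trailing version such as '.3'."""
    acc = acc.strip().upper()
    base, dot, version = acc.rpartition(".")
    # Only strip when the part after the last dot is a numeric version.
    if dot and version.isdigit():
        return base
    return acc

def load_rpkm(handle, id_col="refseq_id", rpkm_col="rpkm"):
    """Read a tab-delimited table into {normalized accession: RPKM}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {normalize_refseq(row[id_col]): float(row[rpkm_col]) for row in reader}

def lookup(rpkm_by_id, query_ids):
    """Split query accessions into those present (with RPKM) and those absent."""
    found, missing = {}, []
    for raw in query_ids:
        acc = normalize_refseq(raw)
        if acc in rpkm_by_id:
            found[acc] = rpkm_by_id[acc]
        else:
            missing.append(acc)
    return found, missing

if __name__ == "__main__":
    # Toy stand-ins for the GEO table and the BioMart export (made-up accessions).
    table = io.StringIO("refseq_id\trpkm\nNM_000001.2\t0\nNM_000002\t12.5\n")
    found, missing = lookup(load_rpkm(table), ["nm_000002.4", "NM_000003"])
    print("found:", found)
    print("missing:", missing)
```

An accession that ends up in `found` with an RPKM of 0 is listed but has no measured expression, whereas one in `missing` is not in the table at all; those are the two cases I am trying to tell apart.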
What other sanity checks should I do before I claim that this RNA-Seq dataset does not, in fact, contain data for the genes encoding the proteins we're interested in? I don't know whether my proteins are actually expressed in this tissue. Still, it seems too unlikely that they would be missing from the dataset entirely for me to jump to that conclusion: I've always thought of RNA-Seq as being 'comprehensive', and assumed that every protein-encoding gene would have at least a tiny number of reads. It would be weird if this dataset listed so many genes with an RPKM of 0 and yet excluded other genes completely.