I am assisting on a project attempting shotgun proteomic analysis of a non-model organism (an invertebrate), and the sample being queried has had RNA-seq performed in parallel. I was not involved in the assembly and only have a fasta of the de novo assembly, created with unknown parameters. This fasta has 96k entries with an average length of 552 bp, ranging from 224 to 14,000 bp. The number of entries seems too large to me and the sequences too short, but I have no experience or point of comparison.
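For what it's worth, those summary numbers came from a quick pass over the fasta; a minimal sketch of that kind of length-statistics pass (including N50, which assemblers usually report), shown here on a toy in-memory fasta rather than my real file:

```python
import io

def fasta_lengths(handle):
    """Collect per-record sequence lengths from a fasta stream."""
    lengths, current = [], None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Smallest length L such that contigs of length >= L hold at least half the bases."""
    half, running = sum(lengths) / 2.0, 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Toy example; in practice the handle would be the real assembly fasta.
toy = io.StringIO(">a\nACGT\nACGT\n>b\nACG\n>c\nACGTACGTACGT\n")
lengths = fasta_lengths(toy)
print(len(lengths), min(lengths), max(lengths), n50(lengths))  # → 3 3 12 12
```

I mention N50 because count/mean/min/max alone (the numbers above) don't say much about how the assembled bases are distributed across contigs.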
If this were DNA, Mascot (the protein search engine I am using) would do a 6-frame translation on the fly; but since this is transcriptome data I only want a 3-frame translation, which I need to perform myself before loading the database.
The first issue is that a 3-frame translation (via either transeq or getorf from EMBOSS) means my search space will be almost 300k entries, which seems pretty ridiculous to me (UniProtKB for all of Mammalia is 1.4 million entries, so this one individual would contribute as much as a fifth as many sequences as all known mammalian proteins). This has deleterious effects on search performance, both in computational cost and in decreased detection sensitivity.
So how do you gauge whether a de novo assembly is "good" and usable for an application such as proteomics when all you have to go on is the generated fasta? I was thinking of using something like CD-HIT to remove redundancy from the fasta, but I am sure there are other clever suggestions folks could offer.
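On the CD-HIT idea: as I understand it, CD-HIT clusters sequences at a chosen identity threshold, whereas even plain exact-duplicate removal can shrink a redundant de novo assembly noticeably. The trivial end of that can be sketched as follows (the record names are made up):

```python
def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive).

    `records` is an iterable of (name, sequence) pairs. Note this only
    removes exact copies; CD-HIT goes further and collapses near-identical
    sequences at a chosen identity threshold.
    """
    seen, kept = set(), []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept

# Hypothetical toy records: contig_2 duplicates contig_1.
records = [("contig_1", "ATGGCC"), ("contig_2", "atggcc"), ("contig_3", "ATGTTT")]
print(drop_exact_duplicates(records))  # → [('contig_1', 'ATGGCC'), ('contig_3', 'ATGTTT')]
```

Whether clustering at, say, 95% identity is appropriate before a proteomics search (it could merge genuine isoforms) is part of what I am asking.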
Also, is there anything else I am missing regarding best practices here?
The optimal answer is going to be to take the original fastq files and process them myself, so that I know the assembly conditions, but for the time being I wanted to see what the community had to say, if anything.