I'm biology trying to analyze data from a RNA-seq experiment. My experience in this field is zero and I'm learning as much as I can.
First let me contextualize my problem: I have a de novo assembly transcriptome obtained from a mollusk conformed by 125k transcripts/contigs.
My first attempt to annotate that was to blastx against H.sapiens proteome and I found a reasonable number of 35k unique hits. However, I think human proteome could be too much stringent to find matches so I construct a database composed by 9 proteomes (octopus, oyster, mytilus, sea urchin, drosophila, amphioxus, lamprey, zebrafish, human) to do that I have use
blastdb_aliastool. Once I had my database I blasted it using
blastx (I only change e-value threshold to 1E-3 and -outfmt 5) but I obtained a lot of tophits with the description "unchraracterized". In this case I had the 87% of my transcripts with one hit.
In order to fix that, I repeat the annotation but including the option
-max_target_seqs 5 -max_hsps 1 and then I ran the next commands to filter out the "uncharacterized" hits, I reordered all the hits for the same query (max. 5) based on the score and e-value and I recovered only the top hit (this time witouth "uncharacterized" hits):
grep -v -i "uncharacterized" FILE_A.xls > FILE_B.xls sort -k1,1 -k12,12nr -k13,13n FILE_B.xls | sort -u -k1,1 --merge > FILE_C.xls
This is the order of my fields in my output:
Query Name | Query Length | Subject Name | Subject Length | Alignment Length | Query Start | Query End | Subject Start | Subject End | Query Sequence | Subject Sequence | Hsp Score | Hsp Expect | Hsp Identities | Percent Match | Number_of_gaps
Finally I obtained more or less the same number of hits than the first time (
blastx only vs human proteome), again more or less 35k and my annotation was reduced to 25%. I have checked and it looks fine but I come here to know your opinion (not only about the code, also about the procedure). In my case the annotation it's fundamental because my DGE analysis follows the pipeline: tximport (to summarize to gene level) > DESeq2; so it depends totally TranscriptID==Gene.
It's the approach/code right? It' s common to find more or less the same results vs only human than vs a more complete database (when I take out the uncharacterized proteins).
Thank you for your attention