Hello all, I stuck with the following problem
Having performed a de novo RNA seq alignment with Trinity and then 6 frame translation I would like to filter my Amino Acid sequence fasta file vs a known reference proteome such as Uniprot, to exact all sequences that do not occur in it.
Unfortunately I cannot filter by ID due to the de novo tags. Does anybody have a solution to filter each sequence (700,000) to see if the same sequence occurs in the reference file.?
Thanks Diamond is a good idea, I shall give it a go- It is hosted on Galaxy so processing power is easy.. What I am hoping to separate out is a list of all sequences that are not in the reference proteome (as I can then attempt to align this to novel MS data).