Question: How to filter a peptide sequence FASTA file against a known proteome
9 weeks ago by
ieuangw0 wrote:

Hello all, I am stuck with the following problem.

Having performed a de novo RNA-seq alignment with Trinity followed by a 6-frame translation, I would like to filter my amino acid sequence FASTA file against a known reference proteome such as UniProt, to extract all sequences that do not occur in it.

Unfortunately I cannot filter by ID because of the de novo tags. Does anybody have a way to check each sequence (there are 700,000) to see whether the same sequence occurs in the reference file?

rna-seq assembly • 189 views
modified 9 weeks ago by evabrown950 • written 9 weeks ago by ieuangw0
9 weeks ago by
United States
genomax64k wrote:

I would suggest that you take the RNA-seq assembly (Trinity would create an assembly) and use DIAMOND to search it against the UniProt database. Use tab-delimited output for ease of parsing and keep only long, strict (e.g. high-identity) hits. DIAMOND does require a good bit of RAM, so make sure you do this on appropriate hardware.

You could also try BLAT (by Jim Kent at UCSC), since you are only looking for very similar (or identical?) hits.
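
For the parsing step, here is a minimal Python sketch. It assumes the search was written in the default 12-column BLAST-style tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `strict_hit_ids` and the cutoffs (95% identity, 30 aligned residues) are only illustrative assumptions, not recommended values, so adjust them to whatever "strict" should mean for your data.

```python
import csv
import io

# Default field order of BLAST-style tabular output (format 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def strict_hit_ids(handle, min_identity=95.0, min_length=30):
    """Return the set of query IDs with at least one hit passing both cutoffs.

    The cutoffs are hypothetical defaults chosen for illustration only.
    """
    ids = set()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        rec = dict(zip(COLUMNS, row))
        if float(rec["pident"]) >= min_identity and int(rec["length"]) >= min_length:
            ids.add(rec["qseqid"])
    return ids


# Tiny made-up example: pep1 has a long identical hit, pep2 only a weak one.
example = io.StringIO(
    "pep1\tsp|P1|A\t100.0\t120\t0\t0\t1\t120\t1\t120\t1e-80\t250\n"
    "pep2\tsp|P2|B\t62.5\t40\t15\t0\t1\t40\t5\t44\t1e-5\t45\n"
)
print(strict_hit_ids(example))  # {'pep1'}
```

With a real result file you would pass `open("hits.tsv", newline="")` instead of the `StringIO` object (the file name is a placeholder).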

genomax64k

Thanks, DIAMOND is a good idea, I shall give it a go. It is hosted on Galaxy, so processing power should not be a problem. What I am hoping to separate out is a list of all sequences that are not in the reference proteome (as I can then attempt to align these to novel MS data).
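
That separation step could be sketched in Python roughly as follows, assuming you already have a set of query IDs that matched the reference (called `hit_ids` here, a hypothetical name) and that the ID is the first whitespace-separated word of each FASTA header. The helper names `read_fasta` and `write_unmatched` are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_unmatched(fasta_in, fasta_out, hit_ids):
    """Write records whose ID is NOT in hit_ids; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(fasta_in):
        # Assumption: the sequence ID is the first word of the header line.
        seq_id = header.split()[0] if header else ""
        if seq_id not in hit_ids:
            fasta_out.write(f">{header}\n{seq}\n")
            kept += 1
    return kept


# Made-up example: pep1 matched the reference, pep2 did not.
peptides = io.StringIO(">pep1 frame=1\nMAAA\nLLL\n>pep2 frame=2\nMKV\n")
novel = io.StringIO()
print(write_unmatched(peptides, novel, {"pep1"}))  # 1
print(novel.getvalue(), end="")
```

For 700,000 sequences this streams through the file one record at a time, so memory use is dominated by the ID set rather than the sequences.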

ieuangw0
