I am trying to analyze MS data to find novel proteins. So far, my workflow has been to convert mzML files to MGF files and then search the MGF files against a FASTA file of annotated protein sequences from UniProt (~20,000 proteins). The search is done with SearchGUI.
This gives me .txt files listing the proteins that were found and the spectra that matched them. What I want to do next is search the unmatched spectra against a customised database in order to discover novel proteins, similar to how this paper (Erady, 2020) describes it:
"In order to evade the increase in false-positive rates, MS data is first mapped to known proteins in UniProt database, and then the unmatched spectra are mapped to the custom proteogenomic database as done by us previously in Prabakaran et al."
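To make the goal concrete, the filtering step I have in mind would look roughly like the sketch below. It is only an illustration under assumptions I have not verified: that the matched spectrum titles can be pulled out of the result text files, and that they are identical to the `TITLE=` lines in the MGF files. The function names (`iter_mgf_spectra`, `spectrum_title`, `unmatched_spectra`) are made up for this example and are not part of SearchGUI.

```python
from typing import Iterable, Iterator, List, Set


def iter_mgf_spectra(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each BEGIN IONS ... END IONS block as a list of lines."""
    block = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "BEGIN IONS":
            block = [line]
        elif block is not None:
            block.append(line)
            if line.strip() == "END IONS":
                yield block
                block = None


def spectrum_title(block: List[str]) -> str:
    """Return the value of the TITLE= line, or an empty string if absent."""
    for line in block:
        if line.startswith("TITLE="):
            return line[len("TITLE="):].strip()
    return ""


def unmatched_spectra(lines: Iterable[str], matched_titles: Set[str]) -> List[List[str]]:
    """Keep only the spectra whose title is not in the matched set."""
    return [
        block
        for block in iter_mgf_spectra(lines)
        if spectrum_title(block) not in matched_titles
    ]


# Tiny in-memory example; real input would be the lines of an MGF file.
example = """BEGIN IONS
TITLE=scan_1
PEPMASS=500.25
CHARGE=2+
100.0 10.0
END IONS
BEGIN IONS
TITLE=scan_2
PEPMASS=612.80
CHARGE=2+
150.0 20.0
END IONS
""".splitlines()

# Assumption: these titles were extracted from the first search's results.
matched = {"scan_1"}

remaining = unmatched_spectra(example, matched)
print([spectrum_title(block) for block in remaining])
```

The remaining blocks could then be written to a new MGF file and searched against the custom database in a second pass.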
This seems like a pretty common thing to do, as I've seen a number of papers describe it. However, I can't figure out how to do it in practice. Does anyone have experience with this?