I would like to ask how to choose a sequence identity (pident, %) cutoff for filtering tblastn results.
I have a table of tblastn results in Galaxy that includes about 800,000 sequences. I would like to filter them by sequence identity, but if I filter at 98% I lose almost all of them. What is the accepted cutoff, considering that this is protein data? I think it should not be as strict as for blastn filtering (commonly 98 or 99%). Please give me advice and point me to any publication that recommends an appropriate percentage.
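To make the problem concrete, here is a minimal sketch of how the number of surviving hits could be counted at several pident cutoffs. It assumes the table is the standard 12-column BLAST tabular output, where pident is the third column; the function name, the thresholds, and the example rows are made up for illustration and are not my real data.

```python
import csv
import io

# Assumption: standard 12-column BLAST tabular output,
# where the third column (index 2) holds pident.
PIDENT_COLUMN = 2


def count_hits_by_threshold(handle, thresholds):
    """Return {threshold: number of hits with pident >= threshold}."""
    counts = {t: 0 for t in thresholds}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (present in some tabular exports).
        if not row or row[0].startswith("#"):
            continue
        pident = float(row[PIDENT_COLUMN])
        for t in thresholds:
            if pident >= t:
                counts[t] += 1
    return counts


# Hypothetical example rows, only to show the shape of the input.
example = io.StringIO(
    "q1\ts1\t99.10\t110\t1\t0\t1\t110\t301\t630\t1e-60\t210\n"
    "q1\ts2\t72.50\t120\t33\t0\t1\t120\t10\t369\t2e-40\t150\n"
    "q2\ts3\t45.00\t100\t55\t0\t5\t104\t900\t1199\t3e-15\t70\n"
    "q3\ts4\t31.20\t90\t60\t2\t1\t88\t50\t313\t1e-05\t40\n"
)

counts = count_hits_by_threshold(example, [30, 50, 70, 98])
for threshold, n in counts.items():
    print(f"pident >= {threshold}%: {n} hits")
```

With a real file, the `io.StringIO` object would be replaced by `open("results.tsv", newline="")`, where the file name is a placeholder. Seeing how the counts drop across cutoffs is what I would like to weigh against a published recommendation.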
All answers are greatly appreciated. :)