My background is in RNA-seq, but I’m now starting to work with protein (amino acid) sequences. There is a lot of literature on best practices and quality checks for every stage of RNA-seq analysis (from raw reads to, say, differential expression analysis or variant calling).
I’m having trouble finding similar literature for working with protein sequences. The things I’m thinking about are:
1) If I need to filter redundant sequences, how do different sequence-similarity thresholds affect my results?
2) How do I check the accuracy of a multiple sequence alignment?
3) How do I check accuracy when I align sequences to a structure?
4) Do gaps (insertions/deletions) need any special consideration?
5) Are there other issues I’m not aware of?
Does anyone have resources that would help answer these questions?
In case it matters, I’m aligning multiple sequences to a structure, clustering, and evaluating point mutations.
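
To make question 1 concrete, here is a minimal sketch of the kind of threshold sweep I have in mind. Everything in it is an assumption made for illustration: the sequences are made up, the identity measure (identical residues over columns where neither sequence has a gap) is only one of several possible definitions, and `greedy_filter` is a toy stand-in rather than what any real redundancy-filtering tool does.

```python
from typing import List


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap.

    Assumes both sequences come from the same alignment (equal length, '-' for gaps).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def greedy_filter(seqs: List[str], threshold: float) -> List[str]:
    """Keep a sequence only if its identity to every already-kept sequence is below threshold."""
    kept: List[str] = []
    for s in seqs:
        if all(pairwise_identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


# Hypothetical toy alignment; real input would come from an alignment file.
toy_alignment = [
    "MKTAYIAKQR",
    "MKTAYIAKQK",
    "MKTGYLAKQR",
    "MRSAY-AHQR",
]

# Count how many sequences survive at each identity threshold.
kept_counts = {t: len(greedy_filter(toy_alignment, t)) for t in (0.95, 0.85, 0.75, 0.60)}

for t, n in kept_counts.items():
    print(f"threshold {t:.2f}: {n} sequences kept")
```

Even on four sequences the number kept changes at each threshold, and the result of a greedy filter like this also depends on input order, which is exactly the kind of sensitivity I would like to see discussed in the literature.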