Hello all! I am analysing a large number of genes and want to find the top few paralogs for each one. However, since essentially all genes are paralogs of one another at some depth, I need to set a cut-off. I'm hoping to retrieve paralogs that arose within roughly the last 500 million years (shortly before, during, and after the two rounds of vertebrate genome duplication).
I've downloaded a protein FASTA file containing the protein sequence of every gene I'm interested in. My plan is to run blastp to bring up the top paralogs. I know this is a somewhat pointless question, as each gene family has a different rate of change. For example, GPCRs retain conserved sections which allow paralog relationships to be mapped back over a billion years, while their protein ligands diverge so quickly that paralogy analysis can barely go back a few hundred million. So a single set of thresholds that corresponds to the same evolutionary depth for every gene isn't really possible . . . yet having said that . . .
In a very general sense, can anyone recommend settings for my BLAST analysis that would exclude the more spurious matches? At the moment I've settled on an E-value cut-off of 1e-20 and I'm considering an alignment % threshold as well. Alternatively, can anyone recommend a program that focuses on retrieving in-species paralogs (something like http://inparanoid.sbc.su.se/cgi-bin/index.cgi, but for paralogs within a single species)?
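To make the kind of filtering I have in mind concrete, here is a minimal Python sketch, not a recommendation of thresholds. It assumes the BLAST results were saved in the standard 12-column tabular format (`-outfmt 6`), and it interprets "alignment %" as percent identity. The function name `top_paralogs`, the 30% identity value, and the limit of five hits per query are placeholders I made up for illustration; only the 1e-20 E-value comes from the text above.

```python
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def top_paralogs(lines, max_evalue=1e-20, min_pident=30.0, top_n=5):
    """Return {query: [best subjects]} from BLAST tabular lines.

    min_pident and top_n are illustrative placeholders, not recommendations.
    """
    best = defaultdict(dict)  # query -> {subject: best bit score seen}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        query, subject = hit["qseqid"], hit["sseqid"]
        if query == subject:
            continue  # a protein always matches itself; not a paralog
        if float(hit["evalue"]) > max_evalue:
            continue  # too weak a match
        if float(hit["pident"]) < min_pident:
            continue  # below the percent-identity threshold
        score = float(hit["bitscore"])
        # A subject can appear several times (multiple alignments); keep the best.
        if score > best[query].get(subject, float("-inf")):
            best[query][subject] = score
    # Rank each query's subjects by bit score, highest first, and keep top_n.
    return {
        query: [s for s, _ in sorted(subjects.items(), key=lambda kv: -kv[1])[:top_n]]
        for query, subjects in best.items()
    }
```

Note that percent identity alone says nothing about how much of the protein was aligned; a coverage filter would need an extra column in the BLAST output (for example `qcovs`), which this sketch does not handle.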
I tried using the Ensembl BioMart paralogy data, but it isn't in a format I can use. I need to manually review the data for thousands of genes, so I need to be able to view it in Excel in this layout:
- Query 1 . . . Target 1 . . . Target 2 . . . etc
- Query 2 . . . Target 1 . . . Target 2 . . . etc
Ensembl provides it in:
- Query 1 Target 1
- Query 1 Target 2
- Query 1 Target . . .
- Query 2 Target 1
- Query 2 Target 2
- Query 2 Target . . .
(Unless anyone knows a convenient way to line up the targets of each query in a single long row, rather than each target getting its own row?)
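For reference, the reshaping described above (one row per query/target pair turned into one row per query) could be sketched in Python roughly as follows. This is only an illustration under assumptions: that the Ensembl export is a tab-separated file with a header row, the query gene in the first column and the paralog in the second. The function names and any file names passed to `convert_file` are hypothetical.

```python
import csv


def long_to_wide(pairs):
    """Turn (query, target) pairs into rows of [query, target1, target2, ...]."""
    grouped = {}  # query -> list of targets, in first-seen order
    for query, target in pairs:
        targets = grouped.setdefault(query, [])
        # Skip empty targets (genes with no paralog) and repeated pairs.
        if target and target not in targets:
            targets.append(target)
    return [[query] + targets for query, targets in grouped.items()]


def convert_file(in_path, out_path, delimiter="\t"):
    """Read a two-column export and write a CSV file that Excel can open."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        next(reader, None)  # assumption: the first line is a header row
        pairs = ((row[0], row[1]) for row in reader if len(row) >= 2)
        csv.writer(dst).writerows(long_to_wide(pairs))
```

Calling `convert_file("paralogs.tsv", "paralogs_wide.csv")` (both file names made up here) would write one line per query gene, with its paralogs spread across the following columns. It uses only Python's standard library, so nothing beyond Python itself needs installing.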
Please assume I have no coding/scripting skills whatsoever. The tiny amount of skill I do possess is poor and haphazard (I'm in a non-bioinformatics lab doing my best to teach myself). Thanks for any help or suggestions you can provide!