Paralog analysis of a proteome
7.5 years ago
MWBFurlong • 0

Hello all! I am analysing a large number of genes and want to find the top few paralogs for each one. However, considering that essentially all genes are some form of paralog, I need to set a cut-off. I'm hoping to bring up paralogs that arose within the last 500 million years (shortly before, during, and after the two rounds of vertebrate genome duplication).

I've downloaded the protein FASTA file, which has the protein sequence of every gene I'm interested in. My plan is to run blastp to bring up the top paralogs. I know this is a somewhat pointless question, as each gene has a different rate of change. E.g., GPCRs retain conserved sections that allow paralog relationships to be traced back over a billion years, while their protein ligands diverge so quickly that paralogy analysis can barely go back a few hundred million. So creating a unified set of restrictions that will capture a consistent rate of change for everything is pointless . . . yet having said that . . .

In a very general sense, can anyone recommend restrictions on my BLAST settings to prevent the more spurious matches? At the moment I've settled on an E-value cut-off of 1E-20, and I'm considering an alignment percentage threshold as well. Alternatively, can anyone recommend a program that focuses on retrieving in-species paralogs (for example, something like http://inparanoid.sbc.su.se/cgi-bin/index.cgi, but for paralogs within a single species)?
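For anyone wanting to try the filtering described above, here is a minimal sketch in Python. It assumes the proteome was blasted against itself and the results saved in BLAST's default tabular format (`-outfmt 6`), which has twelve tab-separated columns. The function name `top_paralogs`, the 30% identity floor and the limit of five hits per query are illustrative assumptions, not recommended values; only the 1E-20 cut-off comes from the post.

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def top_paralogs(handle, max_evalue=1e-20, min_pident=30.0, top_n=5):
    """Return {query: [subject, ...]}, best bit score first.

    min_pident and top_n are placeholder thresholds (assumptions).
    """
    best = defaultdict(dict)  # query -> {subject: best bit score seen}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        query, subject = row["qseqid"], row["sseqid"]
        if query == subject:
            continue  # every protein hits itself; that is not a paralog
        if float(row["evalue"]) > max_evalue:
            continue  # E-value too weak
        if float(row["pident"]) < min_pident:
            continue  # percent identity too low
        score = float(row["bitscore"])
        # A pair can appear several times (multiple HSPs); keep the best one.
        if score > best[query].get(subject, 0.0):
            best[query][subject] = score
    return {q: sorted(hits, key=hits.get, reverse=True)[:top_n]
            for q, hits in best.items()}


# Tiny made-up example: protein A against A, B, C, D and E.
rows = [
    ["A", "A", "100.0", "300", "0", "0", "1", "300", "1", "300", "0.0", "600"],
    ["A", "B", "60.0", "280", "90", "2", "5", "290", "3", "285", "1e-80", "300"],
    ["A", "C", "25.0", "100", "70", "3", "1", "100", "1", "100", "1e-25", "90"],
    ["A", "D", "45.0", "200", "99", "4", "1", "200", "1", "200", "1e-10", "60"],
    ["A", "E", "50.0", "250", "99", "4", "1", "250", "1", "250", "1e-40", "200"],
]
sample = io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")
print(top_paralogs(sample))  # {'A': ['B', 'E']}
```

On a real results file, the `io.StringIO` sample would be replaced by something like `open("results.tsv", newline="")`, where the file name is hypothetical. Because rates of change differ between gene families, a single identity threshold will be too strict for some families and too loose for others, so the numbers are worth adjusting per family where that matters.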

I tried using the Ensembl BioMart paralogy data, but it isn't in a format I can use. I need to manually review the data for thousands of genes, so I need to be able to view it in Excel in the following layout:

  1. Query 1 . . . Target 1 . . . Target 2 . . . etc
  2. Query 2 . . . Target 1 . . . Target 2 . . . etc

Ensembl provides it in:

  1. Query 1 Target 1
  2. Query 1 Target 2
  3. Query 1 Target . . .
  4. Query 2 Target 1
  5. Query 2 Target 2
  6. Query 2 Target . . . .

(Unless anyone knows a convenient method to align the target results of each query into a single long row, rather than each one getting their own row).
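One way to do this reshaping is with a short Python script. This is a minimal sketch that assumes the export was saved as a comma-separated file with a header line, the query ID in the first column and the target ID in the second. The function name `long_to_wide` and the file names in the usage note are hypothetical.

```python
import csv
import io


def long_to_wide(in_handle, out_handle, has_header=True):
    """Collapse (query, target) rows into one row per query."""
    reader = csv.reader(in_handle)
    if has_header:
        next(reader, None)  # discard the column titles
    grouped = {}  # dicts keep insertion order, so queries stay in file order
    for row in reader:
        if len(row) < 2 or not row[0] or not row[1]:
            continue  # skip blank lines and queries with no target
        grouped.setdefault(row[0], []).append(row[1])
    writer = csv.writer(out_handle)
    for query, targets in grouped.items():
        writer.writerow([query] + targets)


# Small made-up example in the two-column layout described above.
source = io.StringIO("query,target\nQ1,T1\nQ1,T2\nQ2,T3\n")
result = io.StringIO()
long_to_wide(source, result)
print(result.getvalue().splitlines())  # ['Q1,T1,T2', 'Q2,T3']
```

With real files, the two `io.StringIO` objects would be replaced by `open("paralogs_long.csv", newline="")` and `open("paralogs_wide.csv", "w", newline="")`. The resulting file opens directly in Excel, with each query in column A and its targets spread across the columns to the right.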

Please assume I have no coding/scripting skills whatsoever. The tiny amount of skill I do possess is poor and haphazard (I'm in a non-bioinformatic lab doing my best to teach myself). Thanks for any help or suggestions you can provide!

blast • 1.7k views

Are you sure about the aim of your quest? Paralogous sequences are those within a species; from what you describe, it looks more like you are looking for orthologs. If you are working on model organisms, using paralogs/orthologs that are already annotated by, e.g., Ensembl Compara might be preferable to just blasting.
