Question: Sequence similarity scores between two sets of genes from different genomes
avaneesh.t20 wrote:

I have a set of genes from the Yeast genome (~3000) and a set of genes from Human genome (~6000). I want to align each yeast gene against each human gene, and get a similarity score for each pair. The lengths of the genes would be different, many pairs would be very dissimilar.

1) How would I go about doing this, say with R? 2) Are there specific things I should take into account while doing my analysis?

modified 2.1 years ago by genebow150 • written 2.1 years ago by avaneesh.t20

Do you need to do this in R? There are loads of great command-line utilities for alignment.

written 2.1 years ago by jrj.healey12k
Benn6.6k wrote:

Maybe inParanoid can help you further?

http://inparanoid.sbc.su.se/cgi-bin/index.cgi

There are R Bioconductor packages available, but they are limited (only one yeast species: S. cerevisiae).

https://bioconductor.org/packages/release/BiocViews.html#___InparanoidDb

written 2.1 years ago by Benn6.6k

InParanoid and other databases give me a list of orthologs. While this would help me validate my pairwise "similarity scores" (orthologs should have higher similarity scores?), they do not tell me how similar non-orthologous genes are.

written 2.1 years ago by avaneesh.t20

Do you want 3000 x 6000 similarity scores (18 M)??

written 2.1 years ago by Benn6.6k

Yes. That is the idea. Though, now that you bring that up, I should probably try and target a smaller subset.

written 2.1 years ago by avaneesh.t20

It is possible to run these 18M alignments on your own computer, but how to interpret the results is something to consider.

If you want to do these 18M pairwise alignments, you can use the EMBOSS command-line tools. Depending on whether you want global or local alignment, you can use needle or water, respectively (needleall, used below, runs the needle global alignment for every pair between two sets of sequences). The output also contains the identity for each pair, so you'll need some shell skills to extract it in the right way (e.g., using grep).

For example:

needleall -auto true -asequence yeast.fasta -bsequence human.fasta \
-datafile EDNAFULL -outfile yeast_human.needleall -aformat markx0

grep "Identity:" yeast_human.needleall > yeast_human.needleall.identity
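
Note that grep on "Identity:" alone drops the sequence names, so the scores can no longer be matched to their pairs. A minimal Python sketch of one way to keep them together is below. It assumes that each alignment in the output carries EMBOSS-style header lines of the form "# 1: name", "# 2: name" and "# Identity: n/m (p%)"; check this against your own needleall output before relying on it. The function name parse_identities and the sample sequence names are made up for illustration.

```python
import re

# Patterns for the header lines assumed to precede each alignment.
SEQ_A = re.compile(r"#\s*1:\s*(\S+)")
SEQ_B = re.compile(r"#\s*2:\s*(\S+)")
IDENT = re.compile(r"#\s*Identity:\s*(\d+)/(\d+)\s*\(\s*([\d.]+)%\)")

def parse_identities(text):
    """Return a list of (seq_a, seq_b, matches, aligned_length, percent)."""
    results = []
    seq_a = seq_b = None
    for line in text.splitlines():
        line = line.strip()
        m = SEQ_A.match(line)
        if m:
            seq_a = m.group(1)
            continue
        m = SEQ_B.match(line)
        if m:
            seq_b = m.group(1)
            continue
        m = IDENT.match(line)
        if m:
            # One Identity line closes one pair's header.
            results.append((seq_a, seq_b, int(m.group(1)),
                            int(m.group(2)), float(m.group(3))))
    return results

# Hypothetical header fragment, for illustration only.
sample = """\
# 1: YAL001C
# 2: GENE_X
# Identity:      12/40 (30.0%)
"""

for row in parse_identities(sample):
    print("\t".join(str(field) for field in row))
```

For a real run you would read the output file in chunks or line by line rather than loading 18M alignments into memory at once; the line-by-line logic stays the same.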
written 2.1 years ago by Benn6.6k
genebow150 wrote:

For a large set of genomes, alignment may not be practical because it takes a very long time. You may want to consider an alignment-free method. My paper, cited below, has MATLAB code available; the link to the programs is inside the paper. The method can handle DNA sequences of different lengths (using even scaling).

Yin, C., Chen, Y., & Yau, S. S. T. (2014). A measure of DNA sequence similarity by Fourier transform with applications on hierarchical clustering. Journal of Theoretical Biology, 359, 18-28.
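
To give a feel for what "alignment-free" means in practice, here is a minimal Python sketch. It is not the Fourier transform method from the paper above; it swaps in a much simpler and more generic technique, cosine similarity between k-mer count vectors, purely as an illustration. The function names and the choice of k=3 are assumptions made for this example.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_cosine_similarity(seq_a, seq_b, k=3):
    """Cosine similarity of the two k-mer count vectors, between 0 and 1."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute to the dot product.
    dot = sum(counts_a[kmer] * counts_b[kmer]
              for kmer in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sequence shorter than k has no k-mers
    return dot / (norm_a * norm_b)

# Sequences of different lengths can be compared directly.
print(kmer_cosine_similarity("ACGTACGTACGT", "ACGTACGT"))
print(kmer_cosine_similarity("AAAAAA", "CCCCCC"))
```

Because each sequence is reduced to a fixed-size vector once, comparing all pairs needs no alignment step, which is where the speed advantage of alignment-free methods comes from. The score is not an identity percentage and should not be read as one.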

modified 2.1 years ago • written 2.1 years ago by genebow150
genebow150 wrote:

Also, please see this paper for an improved method using even scaling, with code.

Yin, C., & Yau, S. S. T. (2015). An improved model for whole genome phylogenetic analysis by Fourier transform. Journal of Theoretical Biology. doi:10.1016/j.jtbi.2015.06.033

https://www.mathworks.com/matlabcentral/fileexchange/52072-phylogenetic-analysis-of-dna-sequences-or-genomes-by-fourier-transform

written 2.1 years ago by genebow150