Question

How To Search For Putative Homologs Of A Protein In Genome Sequencing Raw Data?

3

10.4 years ago

sanchezcavani ▴ 220

I have a paired-end DNA-seq dataset (FASTQ). Could anyone please tell me how I should search for homologs of my gene sequence in this dataset?

I think I need to first convert the FASTQ file to FASTA and then use BLAT. Would that be OK?

New comment from the user

I want to confirm that the homolog of my gene from species A (query) exists in the species B genome (subject). Species A and species B are closely related (the sequence similarity is expected to be 70% at the DNA level). There is currently no whole-genome sequence available for species B, so the DNA-seq reads are the only material I can use. Also, would it be possible to get (or assemble) the full-length homolog in species B?

ADD COMMENT • link updated 10.4 years ago by Ashutosh Pandey 12k • written 10.4 years ago by sanchezcavani ▴ 220

score 3 · Answer 1 · 2013-11-15

3

10.4 years ago

Ashutosh Pandey 12k

If by homolog you mean sequences with high sequence identity, then what you have suggested seems fine: convert the FASTQ to FASTA and then create a BLAT database. But remember that the reads may be only 75-100 bp, a lot shorter than your gene of interest. So the first problem is that if you BLAT your gene against the paired-end read database, you will get a lot of reads in return, each aligning to a different part of the gene. The second problem is that your BLAT database may be huge (I don't know how big your FASTQ dataset is), and searching against it may require a lot of computational resources. If you can be clearer about your ultimate goal, or why you want to do this particular task, I may be able to help you much better.
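For the FASTQ-to-FASTA step, here is a minimal sketch in plain Python. It assumes the common four-lines-per-read FASTQ layout (no wrapped sequence lines); the function name `fastq_to_fasta` and the demo reads are made up for illustration. For real datasets an established tool is probably the safer choice.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy reads from a 4-line-per-record FASTQ stream to FASTA.

    Returns the number of reads written. Quality strings are dropped.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        if not header.startswith("@"):
            raise ValueError("Expected '@' at start of record: %r" % header)
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + seq + "\n")
        count += 1
    return count


# Tiny in-memory demo with two invented reads.
demo = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n")
out = io.StringIO()
n = fastq_to_fasta(demo, out)
print(n)
print(out.getvalue())
```

With real files you would pass open file handles instead of the `io.StringIO` objects, for example `open("reads.fastq")` and `open("reads.fasta", "w")` (file names are placeholders).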

ADD COMMENT • link 10.4 years ago by Ashutosh Pandey 12k

0

Thanks so much. I want to confirm that the homolog of my gene from species A (query) exists in the species B genome (subject). Species A and species B are closely related (the sequence similarity is expected to be 70% at the DNA level). There is currently no whole-genome sequence available for species B, so the DNA-seq reads are the only material I can use. Also, would it be possible to get (or assemble) the full-length homolog in species B? Thanks a lot!

ADD REPLY • link 10.4 years ago by sanchezcavani ▴ 220

0

That makes it clearer. So you are looking for an ortholog. Why don't you align the FASTQ reads from species B to the species A sequence and see whether your gene of interest in species A has good coverage? That will answer your first question. For your second question, since these species are only 70% identical at the DNA level, it would be hard to get the full-length or exact sequence just by aligning. You don't know whether the gene in species B has extra exons or an extended 3' UTR, which you can't see by alignment alone. You may have to perform de novo assembly to come up with the exact sequence. I am not an expert in that, so I hope somebody can answer it better. I will add your comments to the question.
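To make "good coverage" concrete, here is a rough sketch of how you could summarise it once you have the alignments. It assumes you have already extracted, for each aligned read, its start and end position on the species A gene (0-based, end-exclusive), for example from the aligner's output. The function names and the toy intervals are hypothetical, not from any particular tool.

```python
def per_base_depth(gene_length, alignments):
    """Count how many aligned reads overlap each position of the gene.

    alignments: iterable of (start, end) pairs in gene coordinates,
    0-based with the end position excluded.
    """
    depth = [0] * gene_length
    for start, end in alignments:
        # Clip alignments that hang over either end of the gene.
        for pos in range(max(start, 0), min(end, gene_length)):
            depth[pos] += 1
    return depth


def uncovered_regions(depth):
    """Return (start, end) stretches of the gene that no read covers."""
    gaps = []
    gap_start = None
    for pos, d in enumerate(depth):
        if d == 0 and gap_start is None:
            gap_start = pos
        elif d > 0 and gap_start is not None:
            gaps.append((gap_start, pos))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, len(depth)))
    return gaps


# Toy example: a 10 bp "gene" and three invented read alignments.
depth = per_base_depth(10, [(0, 4), (2, 6), (8, 10)])
breadth = sum(1 for d in depth if d > 0) / len(depth)  # fraction covered
mean_depth = sum(depth) / len(depth)
print(breadth, mean_depth, uncovered_regions(depth))
```

A high covered fraction with no large gaps would support the presence of the ortholog. Long uncovered stretches could mean either that the region is too diverged for the reads to align or that it is missing, which is where de novo assembly comes in.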

ADD REPLY • link 10.4 years ago by Ashutosh Pandey 12k