How to obtain regions in a whole genome that do not align with any genes/proteins in a blast search?
5.7 years ago
mirza ▴ 180


I have been given a genome sequence and asked to run a BLAST search against the whole nr database, then mark or extract the regions (sequences) of this genome that do not align with any gene or protein in the database. How can I obtain the sequences from a whole-genome sequence that show no alignment or homology to any known gene or protein? What should my strategy be? Are there any tools available?

whole genome alignment unmapped blast • 1.7k views
5.7 years ago

Do you want to look at the whole genomic sequence or only predicted genes? Anyway, I would:

  • Run BLAST (blastx for the whole genomic sequence, blastp for predicted protein-coding genes) with a sensible E-value cutoff, e.g. 1e-6 or 1e-10.
  • If you are looking at genes only, simply select those without hits and you are done.
  • If you are looking at all genomic regions, extract the BLAST HSP coordinates as ranges on the genome (chr, start, end), e.g. in BED or GFF format. With the genome as the blastx query, these are the query coordinates of each HSP. This can be done with BioPerl (preferably) or from the tabular BLAST output.
  • Load the regions into BEDTools or R and compute the gaps, i.e. the regions with zero coverage relative to the full chromosomes. If necessary, extend each chromosome's range to its real sequence length so that uncovered ends are included.
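
The last two steps can be sketched in plain Python. This is only a minimal illustration, not a tested pipeline. It assumes the default 12-column tabular BLAST output (`-outfmt 6`), with the genome as the query so that columns 7 and 8 (qstart, qend) are positions on the chromosome. The function name `uncovered_regions`, the example hit lines, and the chromosome lengths are all made up for the example.

```python
from collections import defaultdict

def uncovered_regions(blast_lines, chrom_lengths, max_evalue=1e-6):
    """Return {chrom: [(start, end), ...]} for regions without any BLAST hit.

    Intervals are 0-based and half-open, as in BED.
    Assumes default tabular columns: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) > max_evalue:
            continue  # skip hits weaker than the chosen cutoff
        qstart, qend = int(fields[6]), int(fields[7])
        # blastx reports reverse-strand hits with qstart > qend
        low, high = min(qstart, qend), max(qstart, qend)
        hits[fields[0]].append((low - 1, high))  # 1-based closed -> 0-based half-open

    gaps = {}
    for chrom, length in chrom_lengths.items():
        covered_until = 0
        chrom_gaps = []
        for start, end in sorted(hits.get(chrom, [])):
            if start > covered_until:
                chrom_gaps.append((covered_until, start))
            covered_until = max(covered_until, end)
        if covered_until < length:
            chrom_gaps.append((covered_until, length))  # uncovered chromosome end
        gaps[chrom] = chrom_gaps
    return gaps

# Hypothetical example data: two accepted hits on chr1 and one weak hit that is filtered out.
example_hits = [
    "chr1\tprotA\t90.0\t30\t3\t0\t101\t190\t1\t30\t1e-20\t100",
    "chr1\tprotB\t80.0\t20\t4\t0\t260\t201\t1\t20\t1e-8\t60",
    "chr1\tprotC\t50.0\t10\t5\t0\t400\t430\t1\t10\t0.5\t20",
]
example_gaps = uncovered_regions(example_hits, {"chr1": 500, "chr2": 100})
for chrom, intervals in example_gaps.items():
    for start, end in intervals:
        print(f"{chrom}\t{start}\t{end}")
```

If the hits are already in a sorted BED file, `bedtools complement` with a genome file of chromosome lengths should give the same kind of result without custom code.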

Thank you so much Michael. I have the whole genomic sequence right now.

