Question

Count Gene Copy Numbers In Whole Genome Sequence

12.2 years ago

Zewei ▴ 30

Hi, All

I'm trying to count the copies of rDNA in the genome of G. trabeum, sequenced by JGI (http://genome.jgi.doe.gov/Glotr11/Glotr11.home.html), but I have no idea where to start.

The data are available for download, but I cannot open the files with Chromas Pro. Is there a program I can use instead?

Thanks.

copynumber • 5.6k views

updated 12.2 years ago by Gustavo ▴ 530 • written 12.2 years ago by Zewei ▴ 30

score 3 · Answer 1 · 2012-02-18

You start by downloading some of the data, examining the file contents (just a plain text editor will do) and figuring out if any of the files contain what you require (rDNA genes).

In this case, I would start with the "filtered models". You can eliminate the protein and CDS files, since rDNA genes are not protein-coding. You can probably also eliminate the transcripts file, since the naming suggests it contains only the nucleotide sequences of the protein-coding genes.

I've looked at the file Glotr1_1_GeneCatalog_genes_20100928.gff and again I see only CDS and exons, so that looks like protein-coding genes too.
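If you would rather not scroll through the GFF by eye, a small script can tally the feature types for you. This is only a sketch: it assumes the usual tab-separated GFF layout with the feature type in the third column, and the function name is made up for illustration.

```python
from collections import Counter

def count_feature_types(gff_lines):
    """Count the values found in column 3 (feature type) of GFF lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts

# Usage (file name taken from the post above):
# with open("Glotr1_1_GeneCatalog_genes_20100928.gff") as handle:
#     print(count_feature_types(handle).most_common())
```

If the tally lists only types such as CDS and exon, and nothing like rRNA, that supports the reading that the file covers protein-coding genes only.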

So I conclude that the assembly has not yet been annotated with rDNA genes. In that case, you would have to download the assembly scaffolds file and search for them yourself, e.g. by creating a BLAST database from the file and querying it with rDNA sequences. Alternatively, you could use the JGI BLAST page.
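For the local route, a rough sketch with NCBI BLAST+ might look like the following. It assumes `makeblastdb` and `blastn` are installed and on your PATH. The file names, the database name and the e-value cutoff are placeholders, not anything JGI provides.

```python
import subprocess

def makeblastdb_cmd(scaffolds_fasta, db_name):
    """Command to build a nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", scaffolds_fasta,
            "-dbtype", "nucl", "-out", db_name]

def blastn_cmd(query_fasta, db_name, out_tsv, evalue="1e-20"):
    """Command to query the database and write tabular output (outfmt 6)."""
    return ["blastn", "-query", query_fasta, "-db", db_name,
            "-evalue", evalue, "-outfmt", "6", "-out", out_tsv]

def run_search(scaffolds_fasta, query_fasta,
               db_name="glotr_scaffolds", out_tsv="rdna_hits.tsv"):
    """Build the database, run the search, and return the output path."""
    # check=True raises CalledProcessError if either tool exits non-zero.
    subprocess.run(makeblastdb_cmd(scaffolds_fasta, db_name), check=True)
    subprocess.run(blastn_cmd(query_fasta, db_name, out_tsv), check=True)
    return out_tsv

# Usage (hypothetical file names):
# run_search("assembly_scaffolds.fasta", "rdna_query.fasta")
```

The tabular output gives one line per hit, which is easier to post-process than the default pairwise report.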

score 2 · Answer 2 · 2012-02-18

One way to explore this is to grab relevant segments of sequence from GenBank, e.g. the partial 28S subunit:

--> http://www.ncbi.nlm.nih.gov/nuccore/166361628?report=fasta

Copy that sequence and run a BLAST search against the genome:

--> http://genome.jgi.doe.gov/pages/blast.jsf?db=Glotr1_1

Within a few seconds you should get the result of this search. Even though the query is rather short (slightly over 1 kb), the result shows matches to three different contigs that together span the query (contigs 19158, 19242 and 19727), plus a number of much shorter and less significant matches.
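To check how the significant hits tile the query, one option is to compute how many hits cover each query position. The sketch below assumes you have the hits in BLAST+ tabular form (outfmt 6, default column order), which is not what the JGI web page shows by default. The function names and the e-value cutoff are illustrative.

```python
def parse_hits(tsv_lines, max_evalue=1e-20):
    """Return (subject_id, query_start, query_end) for significant hits.

    Assumes the default outfmt 6 columns: qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore.
    """
    hits = []
    for line in tsv_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular hit line
        if float(fields[10]) > max_evalue:
            continue  # drop the short, weak matches
        q_start, q_end = sorted((int(fields[6]), int(fields[7])))
        hits.append((fields[1], q_start, q_end))
    return hits

def query_depth(hits, query_length):
    """Number of hits covering each query position (1-based coordinates)."""
    depth = [0] * query_length
    for _, start, end in hits:
        for pos in range(max(start, 1), min(end, query_length) + 1):
            depth[pos - 1] += 1
    return depth

# Usage (hypothetical file name and query length):
# with open("rdna_hits.tsv") as handle:
#     depth = query_depth(parse_hits(handle), query_length=1000)
# print(sum(d > 0 for d in depth), "bases covered; max depth", max(depth))
```

If most positions are covered exactly once, the assembly itself contains a single representation of that stretch, whatever the true copy number is.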

Based on this very preliminary exploration, one could simplistically conclude that there is evidence for just one copy of the rDNA sequence in this genome draft. On the other hand, if there are several nearly identical copies, they would probably be "over collapsed" in the unfinished draft, particularly given how fragmented the assembly is at this locus.

At this point, the next step would be to determine the read coverage of this locus relative to that of the unique regions of the genome. That ratio would probably give the best estimate of the copy number.
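As a back-of-the-envelope illustration of that idea: if you can get per-base read depths for the rDNA contigs and for some regions you trust to be single-copy (for example from a read alignment summarised with a tool such as `samtools depth`), the ratio of the two medians gives a rough copy number. This is a sketch under those assumptions. I have not checked whether the raw reads are available for this genome, and the function name is made up.

```python
from statistics import median

def estimate_copy_number(locus_depths, background_depths):
    """Rough copy number: median depth at the locus / median unique-region depth."""
    if not locus_depths or not background_depths:
        raise ValueError("both depth lists must be non-empty")
    baseline = median(background_depths)
    if baseline <= 0:
        raise ValueError("background depth must be positive")
    return median(locus_depths) / baseline

# Usage (made-up numbers): a locus at ~300x in a genome sequenced to ~10x
# print(estimate_copy_number([300, 310, 290], [10, 10, 11, 9, 10]))
```

Medians are used rather than means so that a few badly mapped positions do not dominate the estimate.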