Sequence BLASTed against itself... why?
9.3 years ago

I am reading a paper that describes the identification of "single copy genes" in plant species.

I'm trying to understand why the below described process is useful:

To establish a useful criterion for declaring a gene as single copy, each of the five data sets was blasted against itself using BLASTN

If a gene has been duplicated, there will naturally be two copies of it, so how does BLASTing it against itself tell you that there are two or more copies?

genome blast • 3.7k views

When referring to a paper, you should provide a link or PMID to the article.

Because if a genome contains a single copy of a gene, you won't find more than one hit for it. If there is only one gene X in a genome and I BLAST gene X against that genome, I expect to find only one hit. If there are two copies of X, I expect two BLAST hits, and so on.

If I have a bag of colored balls and there's only one green ball, when I look in the bag for green balls I should only find one.
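The counting idea can be sketched in a few lines of Python. This is only an illustration, assuming the search was run with BLAST's 12-column tabular output (`-outfmt 6`), where column 1 is the query ID and column 11 is the e-value. The function name `count_hits`, the e-value cutoff, and the example rows are made up for demonstration.

```python
from collections import Counter

def count_hits(tabular_lines, max_evalue=1e-10):
    """Count rows per query in BLAST tabular (-outfmt 6) output.

    Assumed column order (BLAST's default for -outfmt 6):
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    counts = Counter()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        evalue = float(fields[10])
        if evalue <= max_evalue:
            counts[query_id] += 1
    return counts

# Hypothetical rows: geneX matches one locus, geneY matches two.
example = [
    "geneX\tchr1\t100.0\t900\t0\t0\t1\t900\t5000\t5899\t0.0\t1663",
    "geneY\tchr2\t100.0\t800\t0\t0\t1\t800\t100\t899\t0.0\t1478",
    "geneY\tchr4\t96.5\t800\t28\t0\t1\t800\t7000\t7799\t0.0\t1320",
]
hits = count_hits(example)
print(hits)
```

Note that tabular output has one row per alignment (HSP), so a single gene copy can produce several rows, for example when a cDNA query is split across exons. In practice you would probably want to merge rows that fall at the same locus before counting.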


Please note this refers to the reference genome, which is a theoretical construct, not the real genome of any particular cell. In humans, a 'single copy' gene will probably have two copies if it is on an autosome, or one or two if it is on a sex chromosome. And many genes will be multicopy in the individual regardless of the reference genome. The article chrisclarkson20 found is talking about artificial maps.


Yes, of course, this would be the 'monoploid' genome (which may or may not be the biological reality). However, I was always under the impression that this is what the term "genome" meant.

http://ghr.nlm.nih.gov/handbook/hgp/genome:

A genome is an organism's complete set of DNA, including all of its genes.

I guess it depends on how you interpret "set". In a strict sense it means (imo) that the additional copies contributed by polyploidy (n>1) wouldn't be included. However, duplicates of a gene within a chromosome would be included, since they're distinct genetic elements.

For what it's worth, Wikipedia has this to say:

When people say that the genome of a sexually reproducing species has been "sequenced", typically they are referring to a determination of the sequences of one set of autosomes and one of each type of sex chromosome, which together represent both of the possible sexes. Even in species that exist in only one sex, what is described as a "genome sequence" may be a composite read from the chromosomes of various individuals.

That said, as long as the ploidy is known and is consistent across the cells/tissue sampled, you should be able to apply the same method by dividing the number of hits by the ploidy. E.g. if you have diploid data, a gene with two copies per chromosome set should give 4 BLAST hits.

Using the example above, this is like having two bags filled with the same mix of balls. If there's one green ball per bag, you expect to find two in total, but still only one per bag. If there are two red balls per bag, you'd expect to find four after looking in both bags, but that still means there are two per bag.
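As a tiny worked version of the arithmetic above, here is a sketch in Python. The function name `copies_per_chromosome_set` is hypothetical. It assumes the hit count is noise-free and that every chromosome set carries the same number of copies.

```python
def copies_per_chromosome_set(total_hits, ploidy):
    """Divide the total number of BLAST hits by the ploidy.

    Assumes every chromosome set carries the same copy number
    and that the hit count itself is reliable.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    return total_hits / ploidy

# One green ball per bag, two bags: 2 hits in diploid data -> 1 copy per set.
print(copies_per_chromosome_set(2, 2))
# Two red balls per bag, two bags: 4 hits in diploid data -> 2 copies per set.
print(copies_per_chromosome_set(4, 2))
```

A non-integer result would suggest that one of the assumptions does not hold, for instance copy number differing between chromosome sets or spurious hits.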

Barring noise and issues that may come up from isoforms (which could confuse the BLAST process), you should be able to use this approach with just about any sequencing data from any kind of organism as long as you know the ploidy of your source material and know that it is consistent enough to average out noise from any weirdness.

I also realize that many classes of genes (e.g. transcription factors) will have more than one copy, and in plants this seems to be true of a very large proportion of genes. The point of the post was to demonstrate the general concept, not to get into details of the biology, which can vary widely depending on the species in question.


I believe this is the paper you are referring to. I suspect that you are misunderstanding the "itself" part. It refers to the data set, not the individual gene. An all-against-all BLASTN search was performed for all the genes in a data set (e.g. Arabidopsis) with an e-value threshold of 1e-10. Genes that do not have any BLAST hits (other than the trivial match of each gene to itself) are considered single-copy genes.
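A rough sketch of that filtering step in Python follows. It is not the paper's actual pipeline. It assumes the all-against-all search was written in BLAST's tabular format (`-outfmt 6`), that self-hits are ignored, and that a gene counts as multi-copy if it appears as either query or subject of a passing hit. The function name `single_copy_genes` and the example rows are hypothetical.

```python
def single_copy_genes(gene_ids, tabular_lines, max_evalue=1e-10):
    """Return genes with no non-self hit at or below max_evalue.

    tabular_lines: rows of an all-against-all BLASTN run in -outfmt 6
    (column 1 = query ID, column 2 = subject ID, column 11 = e-value).
    """
    has_other_hit = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        evalue = float(fields[10])
        if query_id == subject_id:
            continue  # every gene trivially matches itself
        if evalue <= max_evalue:
            # Assumption: both partners of a passing hit are multi-copy.
            has_other_hit.add(query_id)
            has_other_hit.add(subject_id)
    return [g for g in gene_ids if g not in has_other_hit]

# Hypothetical data set of three genes.
genes = ["geneA", "geneB", "geneC"]
rows = [
    "geneA\tgeneA\t100.0\t900\t0\t0\t1\t900\t1\t900\t0.0\t1663",
    "geneA\tgeneB\t95.0\t880\t44\t0\t1\t880\t1\t880\t1e-50\t1300",
    "geneB\tgeneB\t100.0\t880\t0\t0\t1\t880\t1\t880\t0.0\t1626",
    "geneC\tgeneC\t100.0\t700\t0\t0\t1\t700\t1\t700\t0.0\t1293",
    "geneC\tgeneA\t78.0\t60\t13\t0\t5\t64\t10\t69\t1e-3\t40",
]
singles = single_copy_genes(genes, rows)
print(singles)
```

Here geneA and geneB hit each other well below the threshold, so they are treated as duplicates, while geneC's only non-self hit is too weak to count.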


Yes, of course, sorry, that is indeed the paper. That makes much more sense, thank you.
