Question

What does it mean if a gene is not found in an assembly?

5.0 years ago

bioinfo2345 ▴ 40

I have two questions:

1) Can you reliably and robustly predict the absence of a gene (either missing entirely or being non-functional) from an organism simply by not finding it in an assembly based on whole genome sequencing from a bacterial cell culture taken from clinical samples?

2) If not, is the lack of a gene in an assembly still some degree of evidence for its absence in the original biological context? Would you bet on the gene being absent in the organism if you did not find it in an assembly, even if you knew that its absence could not be robustly inferred?

I suspect the answer to these questions is no, because:

The sampling could have gone wrong (e.g. one clone was sampled from an infection containing multiple clones, and this particular clone happens to lack the gene while the others carry it).
DNA extraction could have gone wrong, so even though the gene exists in the organism, it might not end up in the DNA that gets successfully extracted.
The kit used for converting the DNA into a form that can be sequenced on a specific sequencing platform might have been less than theoretically perfect.
The library happened to be low complexity.
The gene might be more difficult to sequence than other genes due to sequence biases.
The reads from that gene might be of too low quality and be filtered out in the quality filtering step.
The gene might have features that make it difficult to assemble, or it might exist in multiple copies so that the assembly collapses them and the specific variant one is looking for is not detected.
Due to the specific idiosyncrasies of the assembler, the gene happened to be split among many contigs.
The algorithm used to detect the gene from the assembly might have limitations.
The database you were using did not even contain the gene you were looking for.

...or any number of other biological or bioinformatics reasons.

In other words, there are so many things that could theoretically have gone wrong that it is unwise to claim that the gene is not in the organism just because it is not in an assembly.

Is this largely accurate? Would you consider it obviously flawed to conclude absence of a gene in the organism from the mere observation that it is not found in an assembly?
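
One way to frame question 2 is as a Bayesian update on "gene not found". The sketch below is only an illustration: `prior_absent`, `sensitivity` and `specificity` are hypothetical inputs that would have to be estimated for a specific sampling, sequencing and detection pipeline, and the example numbers are made up.

```python
def posterior_absent(prior_absent, sensitivity, specificity=1.0):
    """P(gene truly absent | gene not found in the assembly).

    prior_absent: assumed probability that the gene is absent before looking.
    sensitivity:  assumed probability that the whole pipeline finds the gene
                  when it is really there (all the failure modes listed above
                  lower this number).
    specificity:  assumed probability of "not found" when the gene is absent.
    """
    p_miss_given_present = 1.0 - sensitivity
    p_miss_given_absent = specificity
    numerator = prior_absent * p_miss_given_absent
    denominator = numerator + (1.0 - prior_absent) * p_miss_given_present
    if denominator == 0:
        raise ValueError("'not found' is impossible under these inputs")
    return numerator / denominator


# Made-up numbers: 50% prior, pipeline finds a present gene 95% of the time.
print(round(posterior_absent(0.5, 0.95), 3))
# A pipeline that never finds the gene carries no information: prior unchanged.
print(posterior_absent(0.5, 0.0))
```

Under these assumptions, a non-detection shifts the odds towards absence in proportion to how sensitive the pipeline is, so the question becomes how well that sensitivity is known.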

Assembly gene identification • 1.7k views

ADD COMMENT • link updated 5.0 years ago by Carambakaracho ★ 3.2k • written 5.0 years ago by bioinfo2345 ▴ 40

IMO, if you have 20X+ coverage for your contigs, the assembly is of decent quality (~100 contigs or fewer, N50 > 50k), and blastn returns nothing, then the conclusion is that the sequenced organism does not have that specific gene.
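
As a rough illustration of that rule of thumb, here is a minimal sketch that computes the contig count and N50 from a list of contig lengths and applies the thresholds mentioned above. The function names and the example lengths are hypothetical, and the mean depth is assumed to be known from elsewhere (e.g. from read mapping).

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def passes_rule_of_thumb(contig_lengths, mean_depth,
                         max_contigs=100, min_n50=50_000, min_depth=20):
    """Apply the thresholds from the comment: 20X+, ~100 contigs or fewer, N50 > 50k."""
    return (len(contig_lengths) <= max_contigs
            and n50(contig_lengths) > min_n50
            and mean_depth >= min_depth)


# Made-up assembly with five contigs (1 Mb in total).
lengths = [400_000, 300_000, 200_000, 60_000, 40_000]
print(n50(lengths))                         # 300000
print(passes_rule_of_thumb(lengths, 35.0))  # True
```

Passing these checks only says the assembly looks contiguous and deeply covered; it says nothing about regions that never made it into the reads.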

ADD REPLY • link 5.0 years ago by 5heikki 11k

This is contradicted by cases where you have, e.g., a low-complexity library. It is entirely possible to have high coverage, few contigs and a high N50, but still be missing a considerable part of the genome.

Just switching assemblers can also change the gene content by hundreds of protein-coding genes; see this admittedly somewhat old paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0021400

ADD REPLY • link 5.0 years ago by bioinfo2345 ▴ 40

I don't know about eukaryote genomes, but I disagree that you can have a high-coverage, few-contig, high-N50 prokaryote genome assembly (you specified bacteria in the OP) and still miss a considerable part of the genome. "Hundreds of genes" is less than 1% of all genes in the context of your typical mammalian genome.

ADD REPLY • link 5.0 years ago by 5heikki 11k

The scenario that I outlined was that, due to a low-complexity library, you have sequenced only, let us say for the sake of argument, half of the genome, over and over (at approximately twice the expected depth). Your N50 would be large, you would have few contigs, and coverage would be high.

The paper was only an example of the impact of assembler choice. Hundreds of genes might be small as a proportion of all genes, but it could still impact analyses.
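
To make the distinction concrete, here is a toy sketch separating breadth of coverage (the fraction of the true genome with any reads) from mean depth over the covered part. The per-base `depths` list is a hypothetical input measured against the true genome, which is exactly what you do not have when you only look at the assembly.

```python
def coverage_summary(depths):
    """Return (breadth, mean depth over covered positions) for per-base depths."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = [d for d in depths if d > 0]
    breadth = len(covered) / len(depths)
    mean_depth_covered = sum(covered) / len(covered) if covered else 0.0
    return breadth, mean_depth_covered


# Toy genome of 10 positions: the first half is sequenced at 40X,
# the second half never made it into the library.
depths = [40] * 5 + [0] * 5
print(coverage_summary(depths))  # (0.5, 40.0)
```

Depth computed over the contigs only sees the covered half, so it reports 40X and gives no hint that half of the genome is missing.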

ADD REPLY • link 5.0 years ago by bioinfo2345 ▴ 40

I haven't done any wet-lab work for a very long time, but IMO, starting from a cell culture, it would take extraordinary skill to somehow manage to extract DNA covering just 50% of a prokaryotic genome.

ADD REPLY • link 5.0 years ago by 5heikki 11k

Dear bioinfo2345, you're stretching every theoretically possible problem to its very extreme.

"there are so many things that could theoretically have gone wrong that it is unwise to claim that the gene is not in the organism just because it is not in an assembly."

In its extreme case, there is truth in your statement. I wouldn't use "unwise", but there is no absolute guarantee that the absence of a gene is real. There almost never is. Likewise, the absence of a PCR band doesn't prove the absence of the amplification target.

BTW, if you know the sequence of your gene of interest (which you must, otherwise you couldn't make assumptions about its presence in the first place), all you need is a PCR. That is, assuming you don't have the extraordinary skills 5heikki mentioned. As mentioned above, the lack of a band doesn't prove absence, but it adds evidence.

Finally, much depends on the effort you put into it. A lousy DNA prep is the foundation of a lousy assembly, and a sloppily designed primer set decreases the chances of a successful PCR. Conversely, you can do targeted sequencing to focus on your gene, up to the point of primer walking ;-)

There's no guarantee, but there are so many options.

ADD REPLY • link 5.0 years ago by Carambakaracho ★ 3.2k

PCR can handle the absence of evidence problem by also running the sample with other primers in such a way that you will get another predictable product if the gene is missing.
There are many cases where you would like to identify thousands of genes per sample in a rapid and high-throughput way. As you can probably imagine, saying "all you need is a PCR" is not adequate here.

ADD REPLY • link 5.0 years ago by bioinfo2345 ▴ 40

I am sure you are aware that there are many experimental papers out there whose results rest on lab errors of various kinds, and that there are many steps that can go wrong or be carried out incompletely.

ADD REPLY • link 5.0 years ago by bioinfo2345 ▴ 40

Another thing you can do is to pick the reference genome that is closest to your sequenced genome. Then you map the reads from your genome onto the reference genome and see what is missing (other than your gene of interest, that is).

ADD REPLY • link 5.0 years ago by 5heikki 11k

It is a good idea, but unfortunately genome size varies substantially within the species of interest. It is also still an absence-of-evidence argument.

Do you think synteny could be robustly used to estimate gene absence? Assuming that the order of the genes is:

gene A - gene B - gene C

then is finding a contig with gene A and gene C, but no contig with all three, sufficient to argue for gene loss?
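
To spell out what such a synteny check would look at, here is a minimal sketch. It assumes a hypothetical input: a dict mapping each contig name to its annotated genes in order. The function name and the category labels are made up for illustration; a real check would also need to look at the distance between the flanking genes and at whether gene B turns up elsewhere.

```python
def synteny_call(contigs, left, target, right):
    """Classify the evidence for loss of `target` between `left` and `right`.

    contigs: dict of contig name -> list of gene names in contig order
             (hypothetical annotation output).
    """
    # Any hit for the target anywhere settles the question.
    for genes in contigs.values():
        if target in genes:
            return "present"
    # Otherwise look for a single contig that spans both flanking genes.
    for genes in contigs.values():
        if left in genes and right in genes:
            between = abs(genes.index(left) - genes.index(right)) - 1
            if between == 0:
                return "flanks_adjacent"        # consistent with a deletion
            return "flanks_same_contig_other_genes_between"
    # Flanks on different contigs (or missing): the target could sit in a gap.
    return "inconclusive"


print(synteny_call({"c1": ["geneA", "geneC", "geneD"]},
                   "geneA", "geneB", "geneC"))  # flanks_adjacent
print(synteny_call({"c1": ["geneA"], "c2": ["geneC"]},
                   "geneA", "geneB", "geneC"))  # inconclusive
```

The case where A and C sit next to each other on one uninterrupted contig is the only one that gives positive evidence; when the flanks end up on separate contigs, the gap between them could hide gene B.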

ADD REPLY • link 5.0 years ago by bioinfo2345 ▴ 40

Extract your gene of interest plus a +/- 10 kb flanking region from a large number of reference genomes of your species. Do they align nicely? If yes, that region is conserved across your species. Now map your reads against this region. If you see something like "|||||||||||||.....|||||||||||||||" (with pipes indicating good coverage and dots indicating absence of coverage), the conclusion is that your sequenced organism includes the conserved region but is missing a part of it. Then you have the 0.00001% chance that your organism has this region, but it has been moved to another part of the genome and somehow magically, in DNA library preparation/sequencing/whatever, this particular region was left out. You're always going to have uncertainty.
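
For illustration, a small sketch that turns per-base depths over such a region into the pipe/dot picture and reports the uncovered stretches. The `depths` list is a made-up example; in practice it could come from something like the third column of `samtools depth -a` output, which is an assumption about the workflow rather than part of the suggestion above.

```python
def coverage_track(depths, min_depth=1):
    """Render depths as '|' (covered) and '.' (below min_depth)."""
    return "".join("|" if d >= min_depth else "." for d in depths)


def uncovered_runs(depths, min_depth=1):
    """Return half-open (start, end) intervals where depth < min_depth."""
    runs, start = [], None
    for i, d in enumerate(depths):
        if d < min_depth:
            if start is None:
                start = i          # a gap opens here
        elif start is not None:
            runs.append((start, i))  # the gap closed at position i
            start = None
    if start is not None:          # gap runs to the end of the region
        runs.append((start, len(depths)))
    return runs


# Made-up region: covered flanks with an uncovered stretch in the middle.
depths = [30] * 6 + [0] * 4 + [28] * 6
print(coverage_track(depths))   # ||||||....||||||
print(uncovered_runs(depths))   # [(6, 10)]
```

A gap that coincides with the gene while both flanks are well covered is the pattern described above; a gap that extends to the edge of the region is much harder to interpret.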

ADD REPLY • link 5.0 years ago by 5heikki 11k

score 0 · Answer 1 · 2019-05-07

5.0 years ago

GokalpC ▴ 100

I would suggest mapping reads to transcriptomes to see if the gene of interest has at least some degree of presence among the reads. It is quite likely that you will find something that may resemble the gene (albeit with low homology) that you are searching for.

ADD COMMENT • link 5.0 years ago by GokalpC ▴ 100

In this scenario, transcriptome data is not available, there is no chance of it ever being available for many, many years, and low homology never happens / is never relevant.

Your suggestion also does not address the issue of whether absence of evidence is evidence of absence. It is merely an assembly-free absence-of-evidence argument. I was looking more for positive evidence of absence. For instance, can you reliably trust synteny to detect the absence of genes?

ADD REPLY • link 5.0 years ago by bioinfo2345 ▴ 40

score 0 · Answer 2 · 2019-05-07

I'm only commenting on the points I think I know enough about.

"The sampling could have gone wrong (e.g. one clone was sampled from an infection containing multiple clones, and this particular clone happens to lack the gene while the others carry it)."

This can always happen: a patient might have superinfections and subpopulations.

"DNA extraction could have gone wrong, so even though the gene exists in the organism, it might not end up in the DNA that gets successfully extracted."

This might happen for plasmid-encoded genes, like many resistance genes.

"The kit used for converting the DNA into a form that can be sequenced on a specific sequencing platform might have been less than theoretically perfect."

I think this is a theoretical possibility, but less relevant in practice.

"The reads from that gene might be of too low quality and be filtered out in the quality filtering step."

"The gene might have features that make it difficult to assemble, or it might exist in multiple copies so that the assembly collapses them and the specific variant one is looking for is not detected."

"Due to the specific idiosyncrasies of the assembler, the gene happened to be split among many contigs."

In most cases you will find at least one copy (see, for example, ribosomal clusters).

"The algorithm used to detect the gene from the assembly might have limitations."

This would be BLAST, one of the most widely used and stable pieces of software in bioinformatics.