Question

Illumina sequencing reads with high quality scores have a low percentage mapping to RefSeq and low classification rate?


4.9 years ago

cecilio11 ▴ 110

Hello Biostars,

I am assembling my first Illumina genome. It is a bacterial genome about 2.5 Mb in size, from the genus Ignatzschineria.

FastQC shows the following stats for the R1 and R2 reads:

File type: Conventional base calls

Encoding: Sanger / Illumina 1.9

Total Sequences: 16770385

Sequences flagged as poor quality: 0

Sequence length: 124

%GC: 41

I used several other tools to evaluate the raw reads before the assembly.

Bowtie2 maps 11.86% (~2 million) of the total reads to the reference genome Ignatzschineria larvae.

Minimap2 maps 19.12% (~3.2 million) of the total reads to the same Ignatzschineria larvae reference.

Kraken2 (default options) is able to classify only 42.9% (7187522) of those 16.7 million reads as follows: 42.6% bacterial and 0.0239 viral. Only 5.78 million reads are assigned to Ignatzschineria larvae.

This is a very LOW classification/mapping percentage to a sister species (assuming that we indeed sequenced an Ignatzschineria sp.). Kraken2 reports that 57.1% of the 16770385 reads are unclassified. I wonder what living entities those 57.1% unclassified reads belong to. I used Kraken2 databases that include archaea, bacteria, viruses, plasmids, humans, parasites of invertebrates. I did not try plants. The sample was collected from a field experiment on blowflies and the bacteria were cultured in the lab. The target colonies were isolated from those cultures.
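As a quick sanity check, the percentages quoted above can be recomputed from the raw counts. This is only a small arithmetic sketch; the variable and function names are illustrative, and the counts are the ones reported by FastQC and Kraken2 in this post.

```python
# Counts as reported in this post (FastQC total, Kraken2 classified reads).
TOTAL_READS = 16_770_385
KRAKEN2_CLASSIFIED = 7_187_522


def pct(count: int, total: int = TOTAL_READS) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Reads Kraken2 could not assign to any taxon in the database.
kraken2_unclassified = TOTAL_READS - KRAKEN2_CLASSIFIED

print(f"classified:   {KRAKEN2_CLASSIFIED} ({pct(KRAKEN2_CLASSIFIED):.1f}%)")
print(f"unclassified: {kraken2_unclassified} ({pct(kraken2_unclassified):.1f}%)")
```

So roughly 9.58 million reads are the ones left without any assignment.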

I used Metaxa2.22 to check the 16S rRNA sequences contained in the reads, and the results of this survey are as follows:

Out of 54584 hits:

unclassified Gammaproteobacteria: 14059

Ignatzschineria: 3613

unclassified Xanthomonadaceae: 33173

Of course, there are other minor hits.
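To put the Metaxa2 counts in proportion, here is a short sketch that turns them into fractions of the 54584 hits and works out how many hits fall under "other minor hits". The dictionary keys are just labels for this example; the numbers are the ones listed above.

```python
# Metaxa2 hit counts as listed in this post.
TOTAL_HITS = 54_584
major_hits = {
    "unclassified Gammaproteobacteria": 14_059,
    "Ignatzschineria": 3_613,
    "unclassified Xanthomonadaceae": 33_173,
}

# Everything not covered by the three groups above.
minor_hits = TOTAL_HITS - sum(major_hits.values())

for label, count in major_hits.items():
    print(f"{label}: {count} ({100.0 * count / TOTAL_HITS:.1f}%)")
print(f"other minor hits: {minor_hits}")
```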

So, I checked the Kraken2 database and was able to verify that all published genomes of Xanthomonadaceae are present.

Does anyone have a suggestion for a tool/database that would be appropriate for identifying the entities that Kraken2 leaves unclassified and that Metaxa2 reports as "unclassified Xanthomonadaceae"?

Any help will be appreciated.

Regards,

genome sequencing • 1.5k views


This post is kind of hard to read and perhaps that is one reason it has had no responses yet.

> The sample was collected from a field experiment on blowflies and the bacteria were cultured in the lab. The target colonies were isolated from those cultures.

I assume this experiment is referring to a pure colony of a bacterium that was used to create a library for genome sequencing. I am not sure why you have so much other contamination in your data if this is supposed to be one single organism. If you did not use a single colony to make the libraries then perhaps this would explain some of the other stuff you appear to have picked up in this experiment.

> Ignatzschineria larvae

I assume that is the actual name of the bacterium. The word "larvae" just happens to be the species name?

4.9 years ago by GenoMax 141k

> Ignatzschineria larvae
>
> I assume that is the actual name of the bacterium. Word larvae just happens to be a species name?

Yes sir, "larvae" is a funny and confusing specific epithet for a bacterial species, right?

> I assume this experiment is referring to a pure colony of a bacterium that was used to create a library for genome sequencing. I am not sure why you have so much other contamination in your data if this is supposed to be one single organism. If you did not use a single colony to make the libraries then perhaps this would explain some of the other stuff you appear to have picked up in this experiment.

Sir, we have our hypothesis on how we got the "other stuff" in our genome. But that is not the point of my post.

I would appreciate it if someone could direct me to some databases/tools (besides the ones I used) that would allow me to find out what the "other stuff" is. I used the latest version of Kraken2, which uses the latest NCBI taxonomy database. I also used Metaxa2, which uses, according to their website, the latest release of the SILVA database for 16S rRNA.
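Whatever database ends up being used, a first step is to separate the unclassified reads so they can be searched on their own. Kraken2 can write them out directly with its `--unclassified-out` option; alternatively, the read IDs can be pulled from the standard per-read output, where the first tab-separated column is `C` or `U` and the second is the read ID. Below is a minimal sketch of the second route. The function name and the example lines (read names, taxon IDs, k-mer strings) are made up for illustration and are not from my data.

```python
from typing import Iterable, List


def unclassified_read_ids(lines: Iterable[str]) -> List[str]:
    """Collect read IDs flagged 'U' in Kraken2 standard per-read output."""
    ids = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] == "U":
            ids.append(fields[1])
    return ids


# Hypothetical example lines: status, read ID, taxon ID, length, k-mer mapping.
example = [
    "C\tread_1\t1234\t124|124\t1234:90 |:| 1234:90",
    "U\tread_2\t0\t124|124\t0:90 |:| 0:90",
    "C\tread_3\t1234\t124|124\t1234:90 |:| 1234:90",
    "U\tread_4\t0\t124|124\t0:90 |:| 0:90",
]

print(unclassified_read_ids(example))
```

With a real file, the same function can be given an open file handle, for example `unclassified_read_ids(open("kraken2_output.txt"))`, where the file name is a placeholder.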

Perhaps someone having more experience than I do on environmental sampling of bacteria for genome sequencing could help?

Thank you for giving me the opportunity to clarify this hard-to-read post of mine a bit more.

Regards

cecilio11

4.9 years ago by cecilio11 ▴ 110