Bacterial sequence reads supporting two nucleotides at same location
7.2 years ago
ambi1999 ▴ 50

Hi,

We have sequenced four different mutated forms of a bacterium (Pseudomonas aeruginosa), and the task is to find the SNPs that are exclusive to each sample. As a first attempt, we aligned the reads to a reference genome. We then built a de novo assembly with SPAdes and aligned the reads to that assembly. In both cases, at quite a few locations, two different nucleotides are each supported by many reads. The reads never support more than two nucleotides at any of these locations. For example, below is the read count from IGV for one such location.

CP000744.1:54,471
Total count: 439
A      : 265  (60%,     121+,   144- )
C      : 0
G      : 0
T      : 174  (40%,     98+,   76- )
N      : 0

How should we interpret this considering that bacteria are haploid?

The first interpretation could be that there was a sequencing error and the correct base is A. A sequencing error seems unlikely to me because the counts are so high (265 and 174) and because the same pattern is repeated at other locations as well.
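To put a rough number on "unlikely", here is a minimal sketch that asks how probable it would be to see at least 174 of 439 reads carrying the same wrong base if every read erred independently. The 1% per-base error rate is an assumption chosen to be generous, not a value from this dataset, and independence between reads is also an assumption.

```python
from math import exp, lgamma, log

def log_binom_pmf(n: int, k: int, p: float) -> float:
    """Natural log of P(X = k) for X ~ Binomial(n, p)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1.0 - p)

def log10_binom_tail(n: int, k: int, p: float) -> float:
    """log10 of P(X >= k), summed in log space to avoid float underflow."""
    terms = [log_binom_pmf(n, i, p) for i in range(k, n + 1)]
    peak = max(terms)
    return (peak + log(sum(exp(t - peak) for t in terms))) / log(10.0)

# Assumption: generous per-base error rate; not taken from the post.
ASSUMED_ERROR_RATE = 0.01

# 174 T reads out of 439 total at CP000744.1:54,471 (counts from IGV above).
print(log10_binom_tail(439, 174, ASSUMED_ERROR_RATE))
```

The result is reported as a log10 probability because the raw value is far too small to represent as an ordinary float.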

The second interpretation could be that there was contamination and more than one type of cell was present in the sample. This may be a possibility, but I first want to make sure that I am not missing some other reason.

Thanks for reading the post, Ambi.

Did you pick from single colonies in each case for the genome preps? Or from a smear?

Hi jrj.healey,

Thanks for your reply. Due to the unstable nature of the small colony variants, we had to use more than one colony for each isolate. So even though we used a pre-culture, there might be some clonal variation within that pure culture because several colonies went into it. Intra-sample diversity is therefore one possibility, but I am not sure how to interpret the results regarding SNPs exclusive to a sample. As an example, below are the read counts at one location for the four samples. The reference at this location is T. My task is to find the SNPs that are exclusive to each sample.

Ref at location 54453 is T

SAMPLE 1: Wild type
CP000744.1:54,453
Total count: 294
A : 0
C : 131 (45%, 53+, 78- )
G : 1 (0%, 0+, 1- )
T : 162 (55%, 95+, 67- )
N : 0

SAMPLE 2: MUTATED FORM 1
Total count: 440
A : 2 (0%, 2+, 0- )
C : 264 (60%, 119+, 145- )
G : 0
T : 174 (40%, 100+, 74- )
N : 0

SAMPLE 3: MUTATED FORM 2
Total count: 239
A : 0
C : 86 (36%, 42+, 44- )
G : 0
T : 153 (64%, 92+, 61- )
N : 0

SAMPLE 4: MUTATED FORM 3
Total count: 231
A : 0
C : 91 (39%, 48+, 43- )
G : 0
T : 140 (61%, 73+, 67- )
N : 0

In MUTATED FORM 1, C is 60% and T is 40%. In all other samples (wild type, mutated form 2 and mutated form 3), T is about 60% (55-64%) and C is about 40% (36-45%). How should these results be interpreted? Could we say that MUTATED FORM 1 has a SNP at this location while the other three samples do not?

Could the interpretation be that, at this location, a pure cell of MUTATED FORM 1 has C while the other samples have T? The T reads in MUTATED FORM 1 would then be contamination (meaning those cells actually belong to the other types, namely wild type, MUTATED FORM 2 or MUTATED FORM 3).
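One way to make the comparison concrete is to compute the C fraction per sample and test whether MUTATED FORM 1 differs from the other three pooled together. The sketch below uses a plain two-proportion z-test on the IGV counts listed above. Pooling the other three samples is an assumption made for illustration. A significant shift in allele fraction only says the mixture proportions differ; it does not by itself show a fixed SNP, since every sample here carries both alleles.

```python
from math import sqrt

# (C reads, T reads) at CP000744.1:54,453, copied from the IGV counts above.
counts = {
    "wild type": (131, 162),
    "mutated form 1": (264, 174),
    "mutated form 2": (86, 153),
    "mutated form 3": (91, 140),
}

def c_fraction(c_reads: int, t_reads: int) -> float:
    """Fraction of C among the reads supporting C or T."""
    return c_reads / (c_reads + t_reads)

def two_proportion_z(alt_a: int, n_a: int, alt_b: int, n_b: int) -> float:
    """z statistic for the difference between two proportions (pooled variance)."""
    pooled = (alt_a + alt_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (alt_a / n_a - alt_b / n_b) / std_err

for name, (c_reads, t_reads) in counts.items():
    print(f"{name}: C fraction = {c_fraction(c_reads, t_reads):.2f}")

# Compare mutated form 1 against the other three samples pooled (an assumption).
focus_c, focus_t = counts["mutated form 1"]
rest = [v for k, v in counts.items() if k != "mutated form 1"]
rest_c = sum(c for c, _ in rest)
rest_n = sum(c + t for c, t in rest)
z = two_proportion_z(focus_c, focus_c + focus_t, rest_c, rest_n)
print(f"z = {z:.2f}")
```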

Thx, Ambi.

"quite a few locations"

Can you quantify that statement further? It is certainly possible that you have two populations with slightly different SNPs.

I meant it's not a one-off scenario. I have not counted, but there are more than 100 such locations.

Ambi.

It's certainly possible that there are multiple genotypes in your sample. But I also see this a lot with supposedly pure samples, as a result of structural variations. There are a few SNPs that differ from the reference with extremely high confidence (like 98% of the reads indicating it); those are real. And then there are the weird ones, with perhaps 40% of the reads indicating a variant. Sometimes this is due to Illumina platform-specific sequencing error. You can often determine this by looking at the location in IGV and seeing that the variant is only present in reads on the plus or minus strand, so I highly recommend trying that. Otherwise, again in IGV, you may see that reads don't map very well to a position: some have mismatches and some have indels, but there's a poor consensus. That could indicate a replication event where all the calls are true, but I think it usually indicates a structural variation that can't be easily explained by the CIGAR strings from short-read mappers. To investigate, you would need an SV caller, or to do something like aligning the assembly to the reference with MUMmer.
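The strand check can also be done numerically from the per-strand counts that IGV reports. The following is a minimal sketch of a two-sided Fisher's exact test on the 2x2 table of allele by strand, written with only the standard library. It uses the counts from the first post (A: 121+, 144-; T: 98+, 76-), and the function name is illustrative rather than part of any tool.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(a + b + c + d, col1)

    def pmf(x: int) -> float:
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = pmf(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(pmf, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))

# Rows: allele (A, T); columns: strand (+, -). Counts from CP000744.1:54,471.
p_value = fisher_exact_two_sided(121, 144, 98, 76)
print(f"strand-bias p-value: {p_value:.3g}")
```

A variant seen almost exclusively on one strand would give a vanishingly small p-value here. A variant with plenty of supporting reads on both strands does not fit that artifact pattern, whatever the exact p-value.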

Hi Brian,

Thx. Any recommendation for an SV caller? The samples are mutated forms of a bacterium (Pseudomonas aeruginosa).

Ambi.

Sorry, but no, I have zero experience in writing or using SV callers.

Hi Brian,

"And then, there are the weird ones, with perhaps 40% of the reads indicating a variant. Sometimes this is due to Illumina platform-specific sequencing error. You can often determine this by looking at the location in IGV and seeing that the variant is only present in reads on the plus or minus strand, so I highly recommend trying that."

I tried checking whether the variant is present only on the plus or minus strand, but unfortunately that does not seem to be the case. Just to confirm that I am doing it correctly, I have pasted IGV images at two locations. For the first location, 401, C is the ref and T is the variant. As you can see, T is present in both forward and reverse reads. It's the same situation for the second location, 942.

[Image: IGV, location 401, forward reads]

[Image: IGV, location 401, reverse reads]

[Image: IGV, location 942, forward reads]

[Image: IGV, location 942, reverse reads]

I am now trying an SV caller, but before that I just wanted to rule out the possibility of Illumina platform-specific sequencing error.

Cheers, Ambi.

Those do not look like platform-specific error. Given the lack of other variants nearby, it also does not look like an SV/misassembly (other than a large replication/collapsed repeat), although you are zoomed in too far to be confident of that.

So it's probably a duplication that you may identify with the SV caller, or else mixed strains.

Thx Brian.

Here are the zoomed out versions.

[Image: IGV, zoomed-out view of location 401]

[Image: IGV, zoomed-out view of location 942]

There are nearby variants, which are listed below. So I think an SV/misassembly is still a possibility? Please note that I generated a de novo assembly containing multiple scaffolds, and it is being used as the reference.

SNPs in node 26 scaffold:
Location  Ref  Alt
     401  C    T
     444  A    G
     615  G    A
     713  T    C
     942  C    A
    1152  G    T
    1255  G    A
    1256  T    C
    1323  T    G
    1665  G    A
    1680  A    G
    1683  C    T

SNPs in node 2 scaffold:
Location  Ref  Alt
  571710  C    G
  571713  C    G
  571716  C    G
  571728  C    T
  571732  G    T

Thanks for all your input.

Cheers, Ambi.

You have several nearby SNPs, within read length, which is convenient; and they all seem to share phase. Is it like that throughout the genome (SNPs every couple hundred bp), or just around these locations? What's the overall heterozygous SNP rate? If it looks like this everywhere, you definitely have two strains mixed together; if most of the genome is completely homozygous, it's more likely a duplication event (or misassembled inexact repeat).
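As a rough illustration of "within read length", the sketch below takes the SNP positions listed for the two scaffolds and counts how many adjacent pairs sit close enough for a single read to span both. Those are the pairs where one can check in IGV whether the variant bases co-occur on the same reads, which is what sharing phase means here. The 150 bp read length is an assumption, since the thread does not state it.

```python
# Assumption: read length is not given in the thread; 150 bp is a placeholder.
ASSUMED_READ_LENGTH = 150

# SNP positions as listed for each scaffold of the de novo assembly.
node26 = [401, 444, 615, 713, 942, 1152, 1255, 1256, 1323, 1665, 1680, 1683]
node2 = [571710, 571713, 571716, 571728, 571732]

def adjacent_gaps(positions):
    """Distances in bp between consecutive SNPs."""
    ordered = sorted(positions)
    return [right - left for left, right in zip(ordered, ordered[1:])]

def pairs_within_read(positions, read_length=ASSUMED_READ_LENGTH):
    """Count adjacent SNP pairs that a single read of this length can cover."""
    return sum(1 for gap in adjacent_gaps(positions) if gap < read_length)

print(adjacent_gaps(node26))
print(pairs_within_read(node26), pairs_within_read(node2))
```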

Hi Brian,

"You have several nearby SNPs, within read length, which is convenient; and they all seem to share phase."

What do you mean by share phase? Do you mean these are on the same scaffold? Why does it matter if the SNPs are on the same scaffold? My understanding is that phased vs. unphased is relevant only if there are two copies of the same chromosome, as in the human genome.

"Is it like that throughout the genome (SNPs every couple hundred bp), or just around these locations? What's the overall heterozygous SNP rate? If it looks like this everywhere, you definitely have two strains mixed together; if most of the genome is completely homozygous, it's more likely a duplication event (or misassembled inexact repeat)."

There are only a few SNPs (fewer than 100), and the rest of the genome is homozygous. So I think two strains mixed together can be ruled out.
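To show how uneven that is, here is a small sketch comparing the genome-wide SNP rate with the local rate on the node 26 scaffold. The genome size of roughly 6.6 Mb is an assumption about a typical P. aeruginosa genome and was not measured from this assembly, and 100 is used as an upper bound on the SNP count.

```python
# Assumption: approximate P. aeruginosa genome size, not taken from the assembly.
ASSUMED_GENOME_BP = 6_600_000
# "Fewer than 100" mixed sites genome-wide, treated here as an upper bound.
SNP_UPPER_BOUND = 100

# SNP positions listed for the node 26 scaffold.
node26 = [401, 444, 615, 713, 942, 1152, 1255, 1256, 1323, 1665, 1680, 1683]

def snps_per_kb(n_snps: int, span_bp: int) -> float:
    """SNP density expressed per 1000 bp."""
    return 1000.0 * n_snps / span_bp

genome_rate = snps_per_kb(SNP_UPPER_BOUND, ASSUMED_GENOME_BP)
local_span = max(node26) - min(node26) + 1
local_rate = snps_per_kb(len(node26), local_span)
enrichment = local_rate / genome_rate

print(f"genome-wide: {genome_rate:.4f} SNPs/kb")
print(f"node 26 window ({local_span} bp): {local_rate:.2f} SNPs/kb")
print(f"local enrichment: {enrichment:.0f}x")
```

Mixed sites packed into a short window while the rest of the genome is clean fit a localized cause such as a duplication or collapsed repeat better than a genome-wide mixture of two strains.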

I will explore the possibility of duplication event or misassembled inexact repeat. Any recommendations/tools for it?

Thanks again for all your help.

Cheers, Ambi.
