Question

What is the ideal coverage for an assembled putative eukaryotic plasmid?

0

8.1 years ago

jigarnt ▴ 30

Hi All,

I have sequenced a putative plasmid on an Illumina HiSeq using a 125 bp paired-end library. The putative plasmid runs at around 3 kb on the gel, and I got around 10 contigs of roughly that size when I assembled it with SPAdes. If it is a plasmid, which I think it is, I should get very high coverage. In that case, what could I do next to find out whether it is a plasmid or not?

Assembly sequence plasmid spades • 2.6k views

ADD COMMENT • link updated 8.1 years ago by Chris Fields ★ 2.2k • written 8.1 years ago by jigarnt ▴ 30

1

Have you compared the 10 contigs to each other to see how similar they are and if they could be collapsed into a smaller set? They may be related to each other. Was the data generated from isolated "putative" plasmid DNA or did the sample have other DNA? Do eukaryotic plasmids have an identifiable origin of replication that you could look for (just thinking out aloud)?

ADD REPLY • link 8.1 years ago by GenoMax 142k

0

Hi Genomax2,

I gel-extracted my putative plasmid from the genomic DNA, so there is no question of contamination in it. The 10 contigs I got range in size from 2.5 kb to 6.6 kb, with coverage ranging from 2 to 66. I ran a nucleotide BLAST and I am getting hits to E. coli plasmids for most of my contigs. As I don't know what the ideal coverage of a plasmid should be, I am unsure which contig to select. Do prokaryotic and eukaryotic plasmids have similar origins of replication?

ADD REPLY • link 8.1 years ago by jigarnt ▴ 30

1

Have you blasted the contigs against each other? That would be one way to judge their similarity. You could also use Mauve and try to align them to each other.

In any case you probably have coverage that is much deeper than necessary to do this assembly. Try the options suggested by @Chris below.
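As a rough first pass before running BLAST or Mauve, the contig-versus-contig comparison can be sketched with shared k-mers. This is not what BLAST or Mauve do internally; it is a swapped-in, much cruder technique, and the function names and the default k of 21 are assumptions made for illustration. A value near 1.0 suggests one contig is largely contained in the other (in either orientation), so the pair might be collapsible.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown bases become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def canonical_kmers(seq, k=21):
    """Set of k-mers, each stored as the smaller of itself and its reverse complement."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def containment(seq_a, seq_b, k=21):
    """Shared canonical k-mers divided by the size of the smaller k-mer set."""
    kmers_a = canonical_kmers(seq_a, k)
    kmers_b = canonical_kmers(seq_b, k)
    if not kmers_a or not kmers_b:
        return 0.0  # one sequence is shorter than k
    return len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))
```

Running `containment` over every pair of the 10 contigs gives a small matrix to eyeball; pairs that score high are the ones worth aligning properly.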

ADD REPLY • link 8.1 years ago by GenoMax 142k

score 1 · Answer 1 · 2016-03-17

1

8.1 years ago

Chris Fields ★ 2.2k

SPAdes should indicate the coverage of the scaffolds generated in the scaffold name. You can also determine this by mapping the reads back and assessing average coverage for the scaffolds.
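To pull those values out of the assembly, a minimal sketch could look like the following. It assumes the usual SPAdes naming pattern `NODE_<id>_length_<len>_cov_<cov>`; the helper names are hypothetical. As far as I know the `cov` value is k-mer coverage for the largest k used rather than read depth, so treat it as a relative measure and map the reads back if you need true depth.

```python
import re

# Assumed SPAdes header pattern, e.g. ">NODE_1_length_6612_cov_66.25"
HEADER_RE = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")


def parse_spades_header(header):
    """Return (node_id, length, kmer_coverage), or None if the header does not match."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def coverage_table(fasta_lines):
    """Collect (node_id, length, kmer_coverage) for every matching FASTA header line."""
    rows = []
    for line in fasta_lines:
        if line.startswith(">"):
            parsed = parse_spades_header(line)
            if parsed is not None:
                rows.append(parsed)
    return rows
```

Passing an open `scaffolds.fasta` handle to `coverage_table` and sorting the rows by coverage makes it easy to see which contigs stand out from the rest.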

If you have very high coverage (if this is the only data in a HiSeq lane and the size is ~3 kb, it's likely extremely high unless you have other low-coverage material in there, such as genomic sequence), it may be worth either directly downsampling the data or filtering out sequences with low-abundance k-mers followed by normalization (khmer can do this), then retrying the assembly. Sometimes it helps. We did this for plant chloroplast genomes with low-coverage WGS data, and it worked a charm.
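For the plain downsampling option, a minimal sketch is below. It is not khmer and does no abundance filtering or normalization; it only keeps a random fraction of read pairs. The function names, the fixed seed, and the assumption that R1 and R2 list the pairs in the same order as 4-line FASTQ records are all mine.

```python
import random


def read_fastq(handle):
    """Yield 4-line FASTQ records as tuples of (header, sequence, plus, quality)."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield tuple(line.rstrip("\n") for line in (header, seq, plus, qual))


def downsample_pairs(r1_handle, r2_handle, fraction, seed=42):
    """Keep each read pair with probability `fraction`, so R1 and R2 stay in sync."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = []
    for rec1, rec2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        if rng.random() < fraction:
            kept.append((rec1, rec2))
    return kept
```

The decision is made once per pair rather than once per file, which is what keeps the two mates together. The kept pairs are held in memory, which is fine for a small fraction but would need streaming output for a large one.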

ADD COMMENT • link 8.1 years ago by Chris Fields ★ 2.2k

0

Hi Chris,

It was the only data in my HiSeq lane, so I want to know how much counts as extremely high coverage. The coverage of my contigs ranges from a high of 66 to a low of 2. I set my k-mer values to -k 21,33,55,77, which I assume are the defaults? What values would you recommend?

ADD REPLY • link 8.1 years ago by jigarnt ▴ 30

0

I generally suggest around 100-200x max, so 60x is fine, but the coverage you mention doesn't make much sense in the context of how much a typical HiSeq run yields (~400M reads). Is this low-pass WGS sequencing? By my (back-of-the-napkin) calculation you'd have ~15 million-fold coverage for a simple plasmid: ~45-50 Gb of data from a typical HiSeq paired-end lane over a 3,000 nt genome.
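The napkin calculation is just total sequenced bases divided by target length. A small sketch, with hypothetical function names and the round numbers quoted above as inputs:

```python
def bases_from_reads(n_reads, read_length):
    """Total sequenced bases, assuming every read has the same length."""
    return n_reads * read_length


def fold_coverage(total_bases, target_length):
    """Average fold coverage if all bases came from a target of this length."""
    return total_bases / target_length


# 45 Gb spread over a 3,000 nt target
low_estimate = fold_coverage(45_000_000_000, 3000)  # 15,000,000-fold

# ~400M reads of 125 bp over the same target
high_estimate = fold_coverage(bases_from_reads(400_000_000, 125), 3000)
```

This assumes every read comes from the plasmid, which is exactly the assumption that reported contig coverages of 2 to 66 call into question.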

Re: k-mer distribution, I mean using a tool like khmer or Jellyfish to generate a kmer distribution graph. khmer can also filter the data you have based on abundance.

ADD REPLY • link 8.1 years ago by Chris Fields ★ 2.2k

0

Hi Chris,

I do not know whether it is low-pass WGS or not. I received 2 files (R1 & R2) of 700 MB each for my plasmid. So, as you said, it could be low-pass NGS, as the file size differs drastically from what you described.

ADD REPLY • link 8.1 years ago by jigarnt ▴ 30

0

Your sample may have been multiplexed with others to save you money. If you ran FastQC, how many reads did it report, and how many cycles of sequencing did you do?

ADD REPLY • link 8.1 years ago by GenoMax 142k

0

Hi Genomax2,

I outsourced my sequencing, so I do not really know about the sequencing cycles, but I received ~4 million reads.

ADD REPLY • link 8.1 years ago by jigarnt ▴ 30

0

Is that 4 mil total (R1+R2) or each? How long (bp) are the individual original reads?

ADD REPLY • link 8.1 years ago by GenoMax 142k

0

Individual reads are 125 bp in length. Each file has 2.3 million reads, so R1 & R2 combined would be 4.6 million.

Also, I tried aligning the contigs to the reference genome, and some of them align. How could I find out its identity?
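Plugging these read counts in gives a feel for the numbers. The sketch below assumes, purely for illustration, that all 4.6 million reads derive from a single ~3 kb molecule, which the contig coverages of 2 to 66 suggest is not the case; the function names are hypothetical.

```python
def expected_depth(n_reads, read_length, target_length):
    """Average depth if every read came from a target of this length."""
    return n_reads * read_length / target_length


def keep_fraction(target_depth, current_depth):
    """Fraction of reads to keep to reach target_depth (capped at 1.0)."""
    if current_depth <= 0:
        raise ValueError("current_depth must be positive")
    return min(1.0, target_depth / current_depth)


# 4.6 million reads of 125 bp over a 3 kb target: roughly 190,000x
depth = expected_depth(4_600_000, 125, 3000)

# Fraction of reads to keep for ~200x under the same assumption: about 0.1%
fraction = keep_fraction(200, depth)
```

If the real depth on the plasmid contigs is far lower than this estimate, most of the reads are coming from something else in the sample.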

ADD REPLY • link 8.1 years ago by jigarnt ▴ 30

1

You have a reference genome for the plasmid? If you do then clearly the ones that are aligning must be the correct contigs.

ADD REPLY • link 8.1 years ago by GenoMax 142k

0

Hi,

That was the reference genome of the organism from which I gel-extracted the plasmid. When I BLASTed my contigs, I got matches to my organism and also to an E. coli plasmid. I wonder why it showed a match against an E. coli plasmid (a prokaryotic plasmid). The identity of my putative plasmid still remains unknown.

ADD REPLY • link 8.1 years ago by jigarnt ▴ 30

0

Hi,

Is there any possibility that a dsRNA could resist degradation by RNase and get sequenced on an Illumina platform with a paired-end library meant for DNA?

The reason I am asking is the possibility that there is no plasmid and we instead ended up sequencing a dsRNA (mycovirus).

ADD REPLY • link 8.1 years ago by jigarnt ▴ 30