Question: How to get the plasmid or construct sequence from a FASTQ file in C. elegans whole-genome sequencing?
5 months ago by
giovannaventola3es910 wrote:


I have a FASTQ file from whole-genome sequencing of C. elegans samples, and I want to find the site (or sites) where a construct/plasmid has integrated into the genome by comparing my reads with the published reference genome in Ensembl (WBcel235).

I have a C. elegans strain with a construct integrated into the genome. From the construct design, I expect it to have integrated into chromosome III. Using a typical SNP and indel workflow with VarScan, I managed to identify variants in all samples. I also analyzed CNVs with the BreakDancer tool, on chr III only.

But my question is: how can I find where the construct is integrated? I would like its position (start and end) relative to the reference genome. Is that possible?

Thanks in advance.

sequencing sequence R • 312 views
ADD COMMENTlink modified 5 months ago by jrj.healey9.1k • written 5 months ago by giovannaventola3es910

If I were you, I think I would do something like the following.

  1. Extract the sequences flanking the plasmid from the whole-genome sequencing FASTQ files.
  2. BLAT the flanking sequences against the genome.
ADD REPLYlink written 5 months ago by mbk0asis390

Sorry, but I didn't understand... I have the construct sequence in FASTA format, but I would like to find out how this sequence is integrated in several sample FASTQ files. I don't have any flanking sequences, so what can I do?

ADD REPLYlink written 5 months ago by giovannaventola3es910

If the plasmids were integrated into the genome and you ran whole genome sequencing, certain reads will have hybrid sequences (part of plasmid + part of genomic sequence).

Then go through the reads that contain plasmid sequence; some of them will be hybrids, and their genomic part is the "flanking sequence" I mentioned.

Extract those reads and remove the plasmid portion. Then align the remaining flanking sequences to the genome to obtain the genomic coordinates.
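
As a rough illustration of the "remove the plasmid portion" step, here is a minimal Python sketch. It assumes exact matches, reads on the same strand as the construct FASTA, and a construct that integrated intact up to its first and last base. The function name `genomic_flank` and the `min_anchor`/`min_flank` thresholds are made up for this example; real data with mismatches or truncated construct ends would be better handled by an aligner.

```python
def genomic_flank(read, construct, min_anchor=12, min_flank=15):
    """Split a hybrid read into its genomic part (hypothetical helper).

    Returns ("left", flank) if the read ends with the start of the construct,
    ("right", flank) if the read starts with the end of the construct,
    or None if no junction with at least `min_anchor` construct bases and
    `min_flank` genomic bases is found. Exact matching only.
    """
    n = len(read)
    # k = number of construct bases in the read; try the longest overlap first.
    for k in range(n - min_flank, min_anchor - 1, -1):
        # Left junction: [genomic flank][first k bases of construct]
        if read[n - k:] == construct[:k]:
            return ("left", read[:n - k])
    for k in range(n - min_flank, min_anchor - 1, -1):
        # Right junction: [last k bases of construct][genomic flank]
        if read[:k] == construct[-k:]:
            return ("right", read[k:])
    return None
```

The returned flank is what you would then align to the genome. Reads from the opposite strand would need to be reverse-complemented before calling this.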

ADD REPLYlink written 5 months ago by mbk0asis390

Hi mbk0asis, so to sum up, you advise me to: 1) map the FASTQ files to the construct and extract the unmapped paired-end reads; 2) remap those unmapped reads to the C. elegans genome, which will give me the genomic coordinates of the unmapped reads? Thanks

ADD REPLYlink written 5 months ago by giovannaventola3es910

Map the FASTQ reads to the construct.

Extract the mapped reads.

Trim off the plasmid sequences.

Map the remaining genomic part of each read to the reference genome.

That's my rough idea.
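
One possible way to do the trimming, sketched in Python: if the reads were mapped to the construct with an aligner that soft-clips the non-matching part of a read (local aligners such as BWA-MEM do this), the soft-clipped bases of a mapped read are candidate genomic flanks. The sketch below parses plain SAM text lines; the function name `soft_clipped_flanks` and the `min_clip` threshold are assumptions for illustration, not part of any tool.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_flanks(sam_line, min_clip=15):
    """Return soft-clipped ends of one SAM alignment line (hypothetical helper).

    Output is a list of (side, sequence) tuples, where side is "left" or
    "right" relative to the read as stored in the SAM record.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar, seq = int(fields[1]), fields[5], fields[9]
    if flag & 4 or cigar == "*":
        return []  # unmapped read or no alignment information
    # Hard clips (H) are not present in SEQ, so ignore them.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    flanks = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        flanks.append(("left", seq[:ops[0][0]]))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        flanks.append(("right", seq[-ops[-1][0]:]))
    return flanks
```

The clipped sequences can then be written out as FASTA and mapped to the reference genome. Very short clips are skipped because they will rarely map uniquely.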

ADD REPLYlink written 5 months ago by mbk0asis390
5 months ago by
United Kingdom
jrj.healey9.1k wrote:

Here's an approach which might work:

  1. Align all your reads in your fastq to your known plasmid sequence (you might need to experiment with some stringency).

  2. De novo assemble the reads retained from step 1 to see if you regenerate the complete plasmid sequence.

  3. Hopefully, the reads which span the very edges of the plasmid in the genome will be retained in your new assembly (assuming they weren't thrown out by the mapping step if it was too stringent).

  4. Take the flanks of your new assembly, which with any luck will be a nice single contig containing a small amount of joining sequence.

  5. BLAST (or similar) your flanking sequences back against the reference/target genome, which will give you the positions of insertion.

The only problem I can foresee (other than not enough reads being retained after alignment, as I mentioned) is if the flanking sequences are quite repetitive, in which case you might end up identifying multiple places within the genome.
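
To get a feel for the repetitiveness issue, a quick check like the following Python sketch can count how many times a flank occurs in the reference. It is only an illustration under simplifying assumptions: exact matches only (BLAST also tolerates mismatches and gaps), and chromosomes already loaded into a dict of name to sequence. The names `flank_hits` and `find_all` are invented for this example.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def find_all(haystack, needle):
    """0-based start positions of every (possibly overlapping) exact match."""
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits

def flank_hits(flank, chromosomes):
    """List (chromosome, start, end, strand) for exact matches of `flank`.

    `chromosomes` maps a name to its sequence; coordinates are 1-based,
    inclusive, on the forward strand of the reference.
    """
    rc = revcomp(flank)
    hits = []
    for name, seq in chromosomes.items():
        for p in find_all(seq, flank):
            hits.append((name, p + 1, p + len(flank), "+"))
        if rc != flank:  # avoid double-counting palindromic flanks
            for p in find_all(seq, rc):
                hits.append((name, p + 1, p + len(flank), "-"))
    return hits
```

A flank with exactly one hit gives an unambiguous position; several hits mean the flank is repetitive and a longer flank (or the mate read) is needed to decide.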

Now here's what I would actually have done:

Design some primers internal to your plasmid pointing out along the genome (if you know how the plasmid integrates), then just send the DNA+Primer for Sanger sequencing for about $3. Basically end-sequence your joins, and you'd get more than enough sequence back that way to be certain of where the plasmid is.

ADD COMMENTlink modified 5 months ago • written 5 months ago by jrj.healey9.1k
5 months ago by
Lisa Ha50 wrote:

You can get the flanking sequences by extracting the reads that contain the beginning or end of the construct. Then you can search for the flanking sequences in the genome. To verify the position, you can do a PCR on the strain containing the construct.

ADD COMMENTlink written 5 months ago by Lisa Ha50

Hi Lisa Ha, thank you for your answer, but I would like a bioinformatics tool to find the position of the construct relative to the genome... I do not know if this is possible...

ADD REPLYlink written 5 months ago by giovannaventola3es910

You are unlikely to find a tool 'ready made' for this. You're going to have to get your hands dirty.

ADD REPLYlink written 5 months ago by jrj.healey9.1k

Of course... that is clear! In fact, I am not looking for a single tool, but at least an idea for a strategy.

ADD REPLYlink written 5 months ago by giovannaventola3es910

I doubt there is a single tool that does exactly what you want. You're going to have to put in a bit of effort. You can use grep on the command line to extract the reads that contain parts of your construct. Then map these to the genome and look at where the reads align (and stop aligning) with a genome viewer, something like IGV or the online Ensembl browser.
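
If grep gets awkward (for example, because reads from the opposite strand contain the reverse complement of the construct), the same idea can be sketched in Python. This is only a minimal example: it assumes a standard four-line-per-record FASTQ, exact matching, and short "probe" sequences taken from the ends of the construct. The names `reads_matching` and `fastq_records` are made up for illustration.

```python
from itertools import islice

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def fastq_records(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if len(rec) < 4:
            return
        yield rec[0].strip(), rec[1].strip(), rec[3].strip()

def reads_matching(lines, probes):
    """Yield FASTQ records whose sequence contains any probe on either strand."""
    patterns = set()
    for probe in probes:
        probe = probe.upper()
        patterns.add(probe)
        patterns.add(revcomp(probe))
    for name, seq, qual in fastq_records(lines):
        upper = seq.upper()
        if any(p in upper for p in patterns):
            yield name, seq, qual
```

`lines` can be an open file handle, for example `open("sample.fastq")` (a placeholder filename). The selected reads can then be written back out as FASTQ and mapped to the genome as described above.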

ADD REPLYlink written 5 months ago by Lisa Ha50