Question

Whole Genome Sequencing reference vs de-novo based genome assembly

21 months ago

FrankStarling ▴ 50

I am planning a whole genome sequencing experiment and will be performing the analyses for the first time. I have studied how WGS data are analysed, but most tutorials and workflows begin by indexing a reference genome from an online database. I am seeking advice on whether I should assemble a de novo genome from an experimental control, or align to a reference genome and compare the conditions to the control after the analysis. The sequencing is of a human podocyte cell line with three experimental conditions. The podocyte cell line was GFP-tagged via CRISPR and sorted, which gives two cell lines and/or conditions: GFP+ and GFP-. The third condition is a CRISPR knockout of a close homolog of the GFP-tagged gene. The untreated podocyte cell line from which all other cells were generated is what I am considering using as my experimental control. Should I build a de novo genome with the reads from this control or use a reference genome?

Would greatly appreciate any advice, thanks.

WGS genome assembly • 803 views


score 2 · Accepted Answer · 2022-08-03

Admittedly, cell lines can exhibit quite mangled genomes, yet I would clearly refrain from building your own de novo genome: it's a huge hassle, and with a custom assembly you won't be able to use existing annotations such as gene and transcript positions.

I know that there are papers out there that specifically address genome-wide off-target effects of CRISPR, so it is best to check those papers for recommended pipelines. However, my gut feeling is that any variant calling pipeline such as Sarek should pick up most off-target effects of your CRISPR experiment.

If you would like to check for random integrations of GFP, I suggest running the reads through a Bloom filter built from the sequence of your construct (e.g. with BBTools). This should retain only the reads that possibly map to your GFP tag. Then align only this subset to the reference genome with quite lenient settings. If the mapping fails, you can trim the GFP part prior to alignment. Since most filtered reads will span the GFP as well as a section of the integration site, they should still (partially) map to the reference genome. Where they accumulate, the GFP tag was integrated. Possibly, a default ChIP-seq pipeline will work once you have pre-filtered the reads with the Bloom filter and trimmed the GFP parts.
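To make the pre-filtering idea more concrete, here is a minimal Python sketch of the two steps: keep reads that share k-mers with the construct, then count where the aligned subset piles up. It is only an illustration under several assumptions. An exact in-memory `set` of k-mers stands in for the Bloom filter (fine for a short construct, and unlike a Bloom filter it gives no false positives). The construct and read sequences are toy placeholders, and the function names, `k=12`, `min_hits` and the 1 kb bin size are all made up for the example. For real data you would use a dedicated tool and a proper aligner in between the two steps.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_construct_index(construct: str, k: int) -> Set[str]:
    """Index the construct on both strands, since reads come from either."""
    return kmers(construct, k) | kmers(reverse_complement(construct), k)


def filter_reads(reads: Iterable[Tuple[str, str]], index: Set[str],
                 k: int, min_hits: int = 1) -> List[Tuple[str, str]]:
    """Keep (name, sequence) reads with at least min_hits construct k-mers."""
    kept = []
    for name, seq in reads:
        s = seq.upper()
        hits = sum(1 for i in range(len(s) - k + 1) if s[i:i + k] in index)
        if hits >= min_hits:
            kept.append((name, seq))
    return kept


def count_by_bin(positions: Iterable[Tuple[str, int]],
                 bin_size: int = 1000) -> List[Tuple[Tuple[str, int], int]]:
    """Count aligned reads per (chromosome, bin start), most covered first."""
    counts = Counter((chrom, pos // bin_size * bin_size)
                     for chrom, pos in positions)
    return counts.most_common()


# Toy placeholder data, NOT a real construct or real reads.
K = 12
construct = "ATGGTGAGCAAGGGCGAGGAGCTGTTCACC"
reads = [
    ("junction", "TTACGGATCCA" + construct[:16]),   # genomic flank + tag
    ("genomic_only", "TTACGGATCCATTGACCTAGGCTAACT"),  # no tag sequence
    ("construct_rc", reverse_complement(construct[5:25])),  # opposite strand
]

index = build_construct_index(construct, K)
kept = filter_reads(reads, index, K)
kept_names = [name for name, _ in kept]

# Hypothetical alignment positions of the kept reads, as (chromosome, position)
# pairs that an aligner would report for the filtered subset.
aligned = [("chr2", 1500), ("chr2", 1720), ("chr2", 1990), ("chr7", 40)]
hotspots = count_by_bin(aligned, bin_size=1000)
```

In this toy run the junction read and the reverse-strand read are kept while the purely genomic read is dropped, and the bin with the most aligned reads is the candidate integration site. Reads such as `junction` are the informative ones, since their genomic flank is what anchors them to the reference.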