Question

Whole genome sequencing and structural variation identification in humans

7.2 years ago

aneekbiotech ▴ 10

Dear all,

I have two queries regarding structural variation identification using whole genome sequencing in humans.

First, to identify structural variation (SV) such as balanced chromosomal abnormalities (BCA) by whole genome sequencing (WGS), which technology would be better: Illumina 2 x 300 bp read length technology (fragment analysis) or mate-pair technology? What are the advantages and disadvantages of each? Would Illumina 2 x 300 bp reads be good enough for this purpose instead of mate-pair, and which option is more cost-effective?

Also, are there any recommended/standard best practices guidelines/pipelines for BCA breakpoint identification and/or any other SV identification analysis? How should the data be analyzed?

Thanks & regards, Aneek

next-gen sequence genome • 3.3k views

ADD COMMENT • link 7.2 years ago by aneekbiotech ▴ 10

@d-cameron

Thank you very much for the information. I will analyze and compare these SV callers. Please let me know once you publish the paper on bioRxiv.

Thanks & regards, Aneek

ADD REPLY • link 7.2 years ago by aneekbiotech ▴ 10

score 2 · Answer 1 · 2017-02-01

Detecting large-scale copy-neutral events purely from short read sequencing data is difficult. Your entire signal is in the breakpoints, so the strength of that signal depends on the mappable coverage you have across the breakpoints. Events in extremely repetitive sequence (e.g. centromeres) are unlikely to be found by either short or long read sequencing. In such cases you'd have better luck with cytogenetic methods (e.g. karyotyping or FISH).

As for your question, it depends on your library fragment size. If you do 2x300bp sequencing with a mean library fragment size of 500bp, your signal will be much weaker than the same sequencing on a library with 1,000bp fragments, since fewer fragments span each breakpoint at the same depth. Mate-pair libraries have much longer fragment lengths, and thus require lower coverage for the same signal strength. If you don't need exact breakpoint reconstruction, you get more value per sequenced base from shorter reads (since more fragments span the breakpoint). If you do need exact breakpoint sequence reconstruction, longer reads are preferable because of their better mappability.
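To put rough numbers on the fragment-size argument, here is a back-of-the-envelope sketch in Python. It is an idealisation, not a power calculation: it assumes uniform coverage, a single fixed fragment length, a homozygous event, and perfect mappability, and it ignores the minimum anchor length a split read needs. The function name `breakpoint_support` is made up for illustration.

```python
def breakpoint_support(seq_coverage, read_len, fragment_len):
    """Expected fragments supporting one breakpoint (idealised model).

    seq_coverage : sequenced bases / genome size (e.g. 30 for "30x")
    read_len     : length of each read in the pair, in bp
    fragment_len : library fragment (insert) length, in bp
    """
    pair_bases = 2 * read_len
    # Fragments whose span covers the breakpoint ("physical coverage").
    spanning = seq_coverage * fragment_len / pair_bases
    # Fragments where the breakpoint falls inside a sequenced read.
    # If the two reads overlap (fragment shorter than 2 x read length),
    # every spanning fragment has a read crossing the breakpoint.
    read_crossing = seq_coverage * min(pair_bases, fragment_len) / pair_bases
    # Fragments where the breakpoint sits in the unsequenced middle,
    # visible only as a discordant read pair.
    pair_only = spanning - read_crossing
    return {
        "spanning": spanning,
        "read_crossing": read_crossing,
        "pair_only": pair_only,
    }


if __name__ == "__main__":
    for label, args in [
        ("2x300bp, 500bp fragments", (30, 300, 500)),
        ("2x300bp, 1000bp fragments", (30, 300, 1000)),
        ("2x100bp, 500bp fragments", (30, 100, 500)),
    ]:
        print(label, breakpoint_support(*args))
```

Under these assumptions, 30x of 2x300bp gives about 25 spanning fragments on a 500bp library and about 50 on a 1,000bp library, while the same 30x as 2x100bp on 500bp fragments gives about 75. Real libraries will do worse than this near repeats, and a heterozygous event halves every number.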

"any recommended/standard best practices guidelines/pipelines for BCA breakpoint identification and/or any other SV identification analysis. How to analyze the data?"

SV calling is still well behind SNV and small indel calling in terms of both specificity and sensitivity, and hasn't yet matured to the point where you can swap out callers and get results that are mostly the same. For the 2x300bp data, I would recommend GRIDSS as the SV caller and StructuralVariantAnnotation for downstream analysis, but LUMPY and manta are other good options. For mate-pair, you will have a smaller selection of tools, as most SV callers don't support mate-pair libraries. As you're looking for large-scale events, you can't filter on event size, so you're going to get a lot of false positive calls due to sequence homology.
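On the "how to analyze the data" part: callers such as GRIDSS report breakpoints as breakend (BND) records in VCF, with the partner location encoded in the ALT field using the bracket notation from the VCF specification. For BCA candidates, a reasonable first pass is to pull out breakends whose partner lies on a different chromosome. A minimal sketch of that in plain Python follows. The function names are mine, the ALT strings in the usage note are illustrative rather than real calls, and orientation is deliberately ignored here.

```python
import re

# Breakend ALT forms in the VCF spec, where p is "chrom:pos" and t is sequence:
#   t[p[   t]p]   ]p]t   [p[t
# Both brackets in one ALT always point the same way, hence the backreference.
_BND_ALT = re.compile(r"^[^\[\]]*([\[\]])(.+):(\d+)\1[^\[\]]*$")


def parse_bnd_alt(alt):
    """Return (mate_chrom, mate_pos) for a breakend ALT, or None otherwise."""
    match = _BND_ALT.match(alt)
    if match is None:
        return None  # symbolic alleles, plain sequence, single breakends, ...
    return match.group(2), int(match.group(3))


def is_interchromosomal(chrom, alt):
    """True if the record at `chrom` is a breakend joined to another contig."""
    mate = parse_bnd_alt(alt)
    return mate is not None and mate[0] != chrom


def interchromosomal_records(vcf_lines):
    """Yield (chrom, pos, mate_chrom, mate_pos) from raw VCF text lines."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alts = fields[0], int(fields[1]), fields[4]
        for alt in alts.split(","):
            if is_interchromosomal(chrom, alt):
                mate_chrom, mate_pos = parse_bnd_alt(alt)
                yield chrom, pos, mate_chrom, mate_pos
```

For example, `is_interchromosomal("2", "G]17:198982]")` is true, while a symbolic allele such as `<DEL>` is skipped. Expect the raw list to contain many homology-driven false positives, so treat this as a starting point for filtering rather than a result.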

Are you open to alternate approaches? Molecular barcoding approaches such as the 10X Genomics Chromium are showing some really impressive early results (eg http://biorxiv.org/content/early/2016/09/10/074518) but the analysis techniques are immature.

score 0 · Answer 2 · 2017-02-01

7.2 years ago

aneekbiotech ▴ 10

@d-cameron

Hi,

Thank you very much for the useful information. As you said, I think 2 x 300 bp reads on larger fragments/inserts (1-2 kb) would be ideal for SV identification. As we are not experts in data analysis, mate-pair sequencing may not be suitable for us because few tools support it. Please correct me if I am wrong.

10X Genomics Chromium is an advanced newly arrived technology and not affordable for our laboratory at present.

For analysis, there is a Python-based program named SVfinder; details can be found at https://github.com/cauyrd/SVfinder. I have found it very easy to use, but I am not sure about its specificity. It would be helpful to know your opinion of the program.

Another thing I would like to ask: where can I get raw Illumina paired-end FASTQ files from whole genome sequencing in which short reads on larger fragments were used for SV identification?

Thanks & regards, Aneek

ADD COMMENT • link 7.2 years ago by aneekbiotech ▴ 10

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. e.g. this post belongs against @d-cameron's answer above.

ADD REPLY • link 7.2 years ago by GenoMax 141k

SVFinder is by no means the only general purpose structural variant breakpoint detection software available. Alternatives include VariationHunter, GASV, Pindel, Breakdancer, HYDRA, VariationHunter-CR, SVDetect, SVMerge, SOAPsv, SRiC, CREST, SVseq, CommonLAW, ClipCrop, GASVPro, SVseq2, SVM2, PRISM, DELLY, CLEVER, SVMiner, cortex_var, BreakPointer, SV-M, PeSV-Fisher, Bellerophon, SoftSearch, Socrates, breseq, LUMPY-sv, Gustaf, TIGRA-ext, laSV, AsmVar, RAPTR-SV, MetaSV, SoftSV, Hydra-Multi, SV-Bay, GRIDSS, and Manta, although not all of them are capable of detecting the events you are interested in.

SVFinder appears to use greedy discordant read pair clustering and no other signal in its variant calling. I expect its results to be broadly similar to those of other read pair methods. Such methods are poorly suited to 2x300bp data and are generally outperformed by multi-signal callers (such as DELLY, LUMPY, manta, and GRIDSS) even on 2x100bp data sets. I actually have a comprehensive SV caller benchmarking paper almost ready for submission; I should be able to get it up on bioRxiv in the next week or so.
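For anyone unfamiliar with what greedy discordant read pair clustering means, here is a generic sketch in Python. This is not SVFinder's code; it is a simplified illustration of the general idea, and the function name, the tuple layout, and the `max_dist` and `min_support` defaults are all assumptions of mine. It also ignores read orientation, which real callers do use.

```python
def greedy_cluster(pairs, max_dist=1000, min_support=3):
    """Greedily group discordant read pairs that point at the same junction.

    pairs : iterable of (chrom1, pos1, chrom2, pos2), one per discordant pair
    A pair joins the first existing cluster whose most recently added member
    has both ends on the same chromosomes and within max_dist bp; otherwise
    it starts a new cluster. Clusters with fewer than min_support pairs are
    dropped.
    """
    clusters = []
    for pair in sorted(pairs):
        chrom1, pos1, chrom2, pos2 = pair
        for cluster in clusters:
            last_chrom1, last_pos1, last_chrom2, last_pos2 = cluster[-1]
            if (
                chrom1 == last_chrom1
                and chrom2 == last_chrom2
                and abs(pos1 - last_pos1) <= max_dist
                and abs(pos2 - last_pos2) <= max_dist
            ):
                cluster.append(pair)
                break
        else:
            # No existing cluster accepted this pair.
            clusters.append([pair])
    return [c for c in clusters if len(c) >= min_support]


if __name__ == "__main__":
    example = [
        ("1", 1000, "5", 50000),
        ("1", 1100, "5", 50200),
        ("1", 1250, "5", 49900),
        ("1", 900000, "2", 100),  # lone pair, filtered out by min_support
    ]
    for cluster in greedy_cluster(example):
        print(len(cluster), "pairs support a junction near", cluster[0])
```

Because the only evidence used is the read pair positions, a method like this gets nothing from reads that cross the breakpoint itself, which is where most of the signal sits when the reads are long relative to the fragment.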

"10X Genomics Chromium is an advanced newly arrived technology and not affordable for our laboratory at present."

If you can halve your sequencing depth, your per sample cost might not actually go up. That said, if you're not interested in methods development you'll be very limited in your tool selection.

ADD REPLY • link 7.2 years ago by d-cameron ★ 2.9k