Question

identifying transgene insertion site in WGS

5.0 years ago

Assa Yeroslaviz ★ 1.8k

I would like to ask for your opinions. I have a mouse WGS data set in which a transgene was inserted at an unidentified location, causing a specific, unexpected phenotype.

We would like to identify the insertion position(s).

I was thinking about trying a de novo assembly (SOAPdenovo), but I'm not sure if this is the correct approach. With a de novo assembly I was hoping to identify the contigs containing the inserted sequence (it is ~6.2 kb in size) and then work out where it was lodged in the genome (using mouse as the reference organism).

Do you think this can be a good solution?

Can anyone recommend a better approach or tool for this kind of analysis?

WGS insertion site transgene de-novo soap • 2.4k views

updated 5.0 years ago by d-cameron ★ 2.9k • written 5.0 years ago by Assa Yeroslaviz ★ 1.8k

score 2 · Accepted Answer · 2019-04-12

5.0 years ago

d-cameron ★ 2.9k

I've had success using GRIDSS to do this. I even included an example of doing this in the GRIDSS paper.

In short, you:

- Add the transgene to your mm10 reference
- (Optional, depending on the transgene sequence) mask (replace with Ns) the mouse homolog of your gene of interest
- Align reads to mm10+transgene
- Call SVs (using GRIDSS)
- Look for SVs to/from your transgene (ignoring those that go to your mouse homolog)
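
As a rough illustration of the first two steps, here is a minimal Python sketch that masks a region with Ns and appends the transgene as an extra contig. It is not part of GRIDSS; the function names (`mask_region`, `build_reference`, `to_fasta`) and the toy sequences are made up for this example, and holding sequences in a dict is only sensible for small tests. For a full mm10 FASTA an existing tool such as `bedtools maskfasta` is likely the more practical route.

```python
from typing import Dict, Iterable, Tuple


def mask_region(seq: str, start: int, end: int) -> str:
    """Replace the 1-based, inclusive interval [start, end] with Ns."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("region lies outside the sequence")
    return seq[:start - 1] + "N" * (end - start + 1) + seq[end:]


def build_reference(
    genome: Dict[str, str],
    transgene_name: str,
    transgene_seq: str,
    masks: Iterable[Tuple[str, int, int]] = (),
) -> Dict[str, str]:
    """Return a copy of `genome` with masked homolog regions and the
    transgene added as its own contig (hypothetical helper)."""
    if transgene_name in genome:
        raise ValueError("transgene contig name clashes with the genome")
    combined = dict(genome)
    for contig, start, end in masks:
        combined[contig] = mask_region(combined[contig], start, end)
    combined[transgene_name] = transgene_seq
    return combined


def to_fasta(reference: Dict[str, str], width: int = 60) -> str:
    """Serialise the contigs as FASTA text with fixed-width lines."""
    lines = []
    for name, seq in reference.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy data only: a 10 bp "chromosome" and a 6 bp "transgene".
    ref = build_reference({"7": "ACGTACGTAC"}, "transgene", "GGGCCC", [("7", 3, 6)])
    print(to_fasta(ref, width=4), end="")
```

The combined FASTA is what you would then index and align against. Masking is only worth doing when the transgene shares sequence with an endogenous mouse locus.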

Edit: if you're trying to identify an _unknown_ transgene, then you'll need to do de novo assembly to reconstruct it. I'd still recommend running GRIDSS (v2.2 or later) against mm10, as it will report the insertion site and ~400 bp of the inserted sequence in VCF single breakend notation.
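
For readers who have not met single breakend notation: as I understand the VCF specification, a single breakend has an ALT made of the anchoring base plus any assembled sequence, with a `.` on the side where the partner is unknown (for example `G.` or `.G`). The sketch below pulls out the assembled sequence from such an ALT. The function name and the example ALT strings are invented for illustration, and it assumes a single anchoring base, so check it against your own GRIDSS output.

```python
from typing import Optional, Tuple


def parse_single_breakend(alt: str) -> Optional[Tuple[str, str]]:
    """Return (side, sequence) for a single breakend ALT, else None.

    side is "after" when the unplaced sequence follows the anchoring base
    (ALT ends with '.') and "before" when it precedes it (ALT starts with
    '.'). Assumes exactly one anchoring base in the ALT string.
    """
    if "[" in alt or "]" in alt:
        return None  # bracketed ALT: a breakpoint with a known partner
    if len(alt) < 2:
        return None  # "." alone is a missing value, not a breakend
    if alt.endswith("."):
        return ("after", alt[1:-1])   # drop anchoring base and the dot
    if alt.startswith("."):
        return ("before", alt[1:-1])  # drop the dot and anchoring base
    return None  # ordinary SNV/indel allele


if __name__ == "__main__":
    # Hypothetical ALT values, not taken from any real call set.
    for alt in ["GACGT.", ".ACGTG", "G]7:36210528]", "."]:
        print(alt, parse_single_breakend(alt))
```

The returned sequence is the part you could then align against candidate constructs or a contaminant database to work out what was actually inserted.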

written 5.0 years ago by d-cameron ★ 2.9k

Hi Cameron,

I have tried GRIDSS before (it was still version 1.5.1 back then) and have had a good, if mixed, experience with it. We got some nice results that pointed to one specific insertion site (or possibly two different ones; we couldn't quite work this out from the results). Do you think I should try the new version (v. 2.2.0) again? We did exactly what you listed above (merging the genomes, masking the regions in the mouse chromosomes, alignment, SV calling -> VCF file).

I was thinking the de novo assembly would give me more straightforward results. Or maybe I could even use your own tool, Socrates, to look for exactly that.

written 5.0 years ago by Assa Yeroslaviz ★ 1.8k

Sorry for the delayed response.

Do you think I should try the new version (v. 2.2.0) again?

I do. V2.0 added single breakend reporting, which can be quite helpful in this sort of analysis. Whilst my collaborators supply an expected construct when engaging me, I've yet to have a project where the construct I've been given has been correct. One transgene included a PhiX component that they forgot to tell me about; another collaborator sent me the full sequence of the human gene they'd inserted, so I had to trace through all the exon-to-exon SVs to validate that it was the correct transcript; and so on.

Although single breakend calls have an intrinsically higher FDR than breakpoint calls, they're extremely useful in determining a) whether you're missing bits of your construct, and b) whether you have an insertion site in repetitive sequence.

I was thinking the de novo assembly would give me more straightforward results.

You'll still need to do the post-assembly steps of identifying the contigs containing the construct and aligning those contigs back to the reference. If you have multiple insertion sites, this will result in branches in the assembly graph, which will split your contigs at the insertion sites, putting you right back where you started.

written 5.0 years ago by d-cameron ★ 2.9k

Hi Daniel, sorry for the late reply and thanks for answering me. I have done the analysis with the new version and got a VCF file, but the results don't differ much from the older run:

grep transgene cleaned.combined.masked.sorted.sv.vcf
##contig=<ID=transgene,length=6252>
7       28985942        gridss102_9557o C       [transgene:3[C  448.35  LOW_QUAL        AS=1;ASC=1X1N1X94M;ASQ=178.54;ASRP=9;ASSR=7;BA=0;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm102-75365,asm273-1852;BEIDH=0,95;BEIDL=402,0;BQ=76.37;BSC=3;BSCQ=48.62;BUM=1;BUMQ=27.75;BVF=1;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=0,2;CQ=448.35;EVENT=gridss102_9557;HOMLEN=2;HOMSEQ=TC;IC=0;IHOMPOS=-2,0;IQ=0.00;PARID=gridss102_9557h;RAS=1;RASQ=122.09;REF=24;REFPAIR=16;RP=6;RPQ=106.76;SB=0.5833333;SC=1X1N1X373M;SR=2;SRQ=40.96;SVTYPE=BND;VF=12        GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF   .:178.54:9:7:0:0.00:0:0.00:0.00:0:0:76.37:3:48.62:1:27.75:1:0.00:0:0.00:448.35:122.09:24:16:6:106.76:2:40.96:12
7       36210528        gridss103_615o  G       GAGGAATTCGGGAGCTTGAAGT]transgene:6232]  1356.35 PASS    AS=2;ASC=471M1X;ASQ=605.72;ASRP=27;ASSR=15;BA=0;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm103-15267,asm103-15296,asm273-1023;BEIDH=491,22,0;BEIDL=0,0,9;BQ=46.51;BSC=3;BSCQ=46.51;BUM=0;BUMQ=0.00;BVF=0;CAS=0;CASQ=0.00;CQ=1333.58;EVENT=gridss103_615;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss103_615h;RAS=1;RASQ=177.93;REF=32;REFPAIR=11;RP=17;RPQ=302.48;SB=0.5;SC=605M1X;SR=13;SRQ=270.22;SVTYPE=BND;VF=23   GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF   .:605.72:27:15:0:0.00:0:0.00:0.00:0:0:46.51:3:46.51:0:0.00:0:0.00:0:0.00:1356.35:177.93:32:11:17:302.48:13:270.22:23
transgene       3       gridss102_9557h G       [7:28985942[G   448.35  LOW_QUAL        AS=1;ASC=1X1N1X399M;ASQ=122.09;ASRP=9;ASSR=7;BA=0;BANRP=4;BANRPQ=71.17;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm102-75365,asm273-1852;BEIDH=402,0;BEIDL=0,95;BQ=27.75;BSC=0;BSCQ=0.00;BUM=1;BUMQ=27.75;BVF=0;CAS=0;CASQ=0.00;CIPOS=0,2;CIRPOS=-2,0;CQ=448.35;EVENT=gridss102_9557;HOMLEN=2;HOMSEQ=GA;IC=0;IHOMPOS=0,2;IQ=0.00;PARID=gridss102_9557o;RAS=1;RASQ=178.54;REF=0;REFPAIR=0;RP=6;RPQ=106.76;SB=0.5555556;SC=1X1N1X399M;SR=2;SRQ=40.96;SVTYPE=BND;VF=12  GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF   .:122.09:9:7:4:71.17:0:0.00:0.00:0:0:27.75:0:0.00:1:27.75:0:0.00:0:0.00:448.35:178.54:0:0:6:106.76:2:40.96:12
transgene       6232    gridss103_615h  G       GACTTCAAGCTCCCGAATTCCT]7:36210528]      1356.35 PASS    AS=1;ASC=686M1X;ASQ=177.93;ASRP=27;ASSR=15;BA=0;BANRP=6;BANRPQ=106.76;BANSR=13;BANSRQ=270.22;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm103-15267,asm103-15296,asm273-1023;BEIDH=0,0,9;BEIDL=491,22,0;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;BVF=0;CAS=0;CASQ=0.00;CQ=1333.58;EVENT=gridss103_615;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss103_615o;RAS=2;RASQ=605.72;REF=1;REFPAIR=0;RP=17;RPQ=302.48;SB=0.51724136;SC=686M1X;SR=13;SRQ=270.22;SVTYPE=BND;VF=23   GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF   .:177.93:27:15:6:106.76:13:270.22:0.00:0:0:0.00:0:0.00:0:0.00:0:0.00:0:0.00:1356.35:605.72:1:0:17:302.48:13:270.22:23

I removed all the rows with results from endogenous chromosomal regions and left only the two possible insertion positions on chromosome 7. These, though, point to complex insertion behavior. The LOW_QUAL rows might be the result of low coverage in this region. The results seem to hint at a complex insertion of the transgene combined with duplication of several genomic parts.
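
The filtering described above can also be done by parsing the ALT column instead of grepping. The sketch below is my own illustration, not GRIDSS code: the names `Mate`, `parse_breakpoint_alt` and `transgene_partners` are hypothetical, and it assumes tab-separated VCF records (the lines pasted above have lost their tabs). It follows the standard four bracket forms of VCF breakend notation (`t[p[`, `t]p]`, `]p]t`, `[p[t`).

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Tuple

# One bracket pair around "contig:position", with local bases on one side.
BND = re.compile(
    r"^(?P<pre>[^\[\]]*)(?P<b1>[\[\]])(?P<chrom>[^\[\]]+):(?P<pos>\d+)"
    r"(?P<b2>[\[\]])(?P<post>[^\[\]]*)$"
)


class Mate(NamedTuple):
    chrom: str       # partner contig
    pos: int         # partner position
    bracket: str     # "[": partner extends right of pos, "]": extends left
    ref_first: bool  # True if local bases come before the bracket (t[p[ or t]p])
    inserted: str    # untemplated bases between the two breakends


def parse_breakpoint_alt(alt: str) -> Optional[Mate]:
    """Parse a bracketed breakend ALT; return None for anything else."""
    m = BND.match(alt)
    if m is None or m["b1"] != m["b2"]:
        return None
    pre, post = m["pre"], m["post"]
    if bool(pre) == bool(post):
        return None  # exactly one side must carry the local base(s)
    ref_first = bool(pre)
    # Strip the anchoring base (assumed to be a single base).
    inserted = pre[1:] if ref_first else post[:-1]
    return Mate(m["chrom"], int(m["pos"]), m["b1"], ref_first, inserted)


def transgene_partners(
    vcf_lines: Iterable[str], transgene: str = "transgene"
) -> List[Tuple[str, int, Mate]]:
    """Collect genome-side records whose partner lies on the transgene contig."""
    hits = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        mate = parse_breakpoint_alt(alt)
        if mate is not None and mate.chrom == transgene and chrom != transgene:
            hits.append((chrom, pos, mate))
    return hits
```

Applied to the two chromosome 7 records above, this reports a partner at transgene:3 for 7:28985942 and at transgene:6232 for 7:36210528, the latter with 21 untemplated bases between the breakends. Reading the brackets this way makes the orientation of each junction explicit, which helps when reasoning about the rearranged structure.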

Unfortunately we can't really identify the correct structure after the insertion.

written 4.7 years ago by Assa Yeroslaviz ★ 1.8k

Unfortunately we can't really identify the correct structure after the insertion.

That does seem unusual. If you don't have any other compensating breakpoints, those calls indicate that the transgene is inserted on a double minute containing it and 7:28985942-36210528. The next steps I'd take would be to:

- manually inspect the VCF around the insertion sites (7:28985942 and 7:36210528), looking for the compensating breakpoints
  - use the input.sv.bam and assembly.sv.bam files in the gridss.working subdirectories to inspect the reads in IGV
- look at the copy number profile across chr7. Are there any CN deviations from diploid that would help explain what happened?
- check whether you can explain it as two insertion sites on chr7. Recent versions of GRIDSS report single breakend calls, which will help with this.
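
To make the first of these steps less manual, one option is to list every VCF record that falls close to the two chr7 positions. This is only a sketch under my own assumptions: the function name `records_near`, the 1000 bp default window and the `skip_ids` parameter are arbitrary choices for illustration, and the input is assumed to be tab-separated VCF lines.

```python
from typing import Collection, Iterable, List, Sequence, Tuple


def records_near(
    vcf_lines: Iterable[str],
    sites: Sequence[Tuple[str, int]],
    window: int = 1000,
    skip_ids: Collection[str] = frozenset(),
) -> List[Tuple[str, int, str, str]]:
    """Return (chrom, pos, id, alt) for records within `window` bp of any site.

    `skip_ids` lets you leave out calls you already know about, so that
    only candidate compensating breakpoints remain.
    """
    found = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete record
        chrom, pos, rec_id, alt = fields[0], int(fields[1]), fields[2], fields[4]
        if rec_id in skip_ids:
            continue
        if any(chrom == c and abs(pos - p) <= window for c, p in sites):
            found.append((chrom, pos, rec_id, alt))
    return found


if __name__ == "__main__":
    sites = [("7", 28985942), ("7", 36210528)]
    known = {"gridss102_9557o", "gridss103_615o"}
    with open("cleaned.combined.masked.sorted.sv.vcf") as handle:
        for record in records_near(handle, sites, skip_ids=known):
            print(*record, sep="\t")
```

Anything this turns up (including LOW_QUAL and single breakend records) is a candidate to look at in IGV alongside the assembly BAM.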

written 4.7 years ago by d-cameron ★ 2.9k