Detecting insertion sites of a transgene in the mouse genome
3.6 years ago
Assa Yeroslaviz ★ 1.6k

We have a pilot project consisting of one WGS paired-end sample from mouse. A transgene of known sequence was inserted into the mouse. We would like to identify possible insertion sites for this transgene.

I was wondering which tool can help me do that.

My idea is to map the reads to the mouse genome (allowing soft-clipping but no mismatches) and then look for split reads. These reads would be split because only part of each read maps to the mouse genome, while the other part maps to the inserted transgene (see image: these would be the red-colored reads in the middle).

Another method would be to look for pairs with an insert size larger than expected, which might happen because the two reads span the insertion site.

A third method would be to look for orphan reads, extract them and their mates, and check whether they belong to the transgene (the blue-colored reads in the image).

While I know some tools for the second option (e.g. DELLY), I am not sure whether the first and third options would be better and more precise. Unfortunately, I'm not sure how to do these.

I would appreciate any ideas on how to do any of the above. Are there any tools for split-read analysis of this kind, or for identifying orphan reads?

Assa

transgene insertion blast WGS • 1.6k views
3.6 years ago

I would align to the foreign DNA first to identify the reads that align to the ends of the sequence. Extract all the soft-clipped reads and realign them to the host genome. Don't be too stringent when aligning, though.
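As a rough illustration of the extraction step, here is a minimal sketch in plain Python that pulls the soft-clipped portion out of SAM records (for example, reads aligned to the transgene) and writes it as FASTQ for realignment to the host genome. The function names, the `min_clip` threshold of 20 bases and the `_left`/`_right` read-name suffixes are my own assumptions for illustration, not part of any tool.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def clipped_parts(sam_line, min_clip=20):
    """Return (name, side, seq, qual) for each soft clip of at least min_clip bases."""
    f = sam_line.rstrip("\n").split("\t")
    name, flag, cigar, seq, qual = f[0], int(f[1]), f[5], f[9], f[10]
    if flag & 4 or cigar == "*":          # unmapped read: nothing to clip
        return []
    if qual == "*":                       # qualities missing: use a placeholder
        qual = "I" * len(seq)
    # Hard-clipped bases are absent from SEQ, so ignore H operations here.
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    out = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        n = ops[0][0]
        out.append((name, "left", seq[:n], qual[:n]))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        n = ops[-1][0]
        out.append((name, "right", seq[-n:], qual[-n:]))
    return out

def to_fastq(parts):
    """Format the clipped pieces as FASTQ records."""
    return "".join(f"@{name}_{side}\n{s}\n+\n{q}\n" for name, side, s, q in parts)
```

The clipped sequence is reported as stored in the SAM record, i.e. in reference orientation for reverse-strand alignments, which is fine as input for a second alignment round.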

Important questions: How many insertion sites do you expect? What is the size of the foreign DNA? What is the length of your reads? Are they paired-end?

--

Or you could align your reads to a custom genome in which the foreign DNA is included as an additional chromosome. Then use structural variant methods to detect the integration site, e.g. GRIDSS: https://genome.cshlp.org/content/early/2017/11/02/gr.222109.117.abstract
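Before running a full structural variant caller on such a combined reference, a quick sanity check is to list the read pairs in which one mate lands on the transgene and the other on a host chromosome. The sketch below does this on plain SAM text using the RNAME and RNEXT columns. It assumes the transgene contig is called `transgene` in the reference (a hypothetical name), and the function name is made up for illustration.

```python
def cross_contig_pairs(sam_lines, transgene="transgene"):
    """List pairs with one mate on the transgene and the other on a host contig.

    Returns (read_name, host_contig, host_pos, transgene_pos) tuples.
    Only the transgene-side record is inspected, so each pair is reported once.
    """
    hits = []
    for line in sam_lines:
        if line.startswith("@"):                  # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag, rname, pos = int(f[1]), f[2], int(f[3])
        rnext, pnext = f[6], int(f[7])
        if not flag & 1:                          # not a paired read
            continue
        if flag & (4 | 8 | 256 | 2048):           # unmapped, mate unmapped, secondary, supplementary
            continue
        if rnext in ("=", "*"):                   # mate on the same contig or unknown
            continue
        if rname == transgene and rnext != transgene:
            hits.append((f[0], rnext, pnext, pos))
    return hits
```

Host positions that come up again and again in the result are candidate neighbourhoods for an integration site. The exact breakpoint still has to come from split reads.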

Important questions: How many insertion sites do you expect?

I expect one or two sites, but it is a hypothesis, so there might be some more.

Important questions: What is the size of the foreign DNA? What is the length of your reads?

The foreign DNA is approx. 6-9 kb long, depending on which part of the transgene is considered.

Important questions: Are they paired-end?

Yes, as I mentioned in the question.

Does it make a difference what size the transgene is? I will take a look at the paper, it looks promising.

Do you have any ideas on how to increase the number of reads mapped to the ends of the transgene sequence when using bwa as the mapper? I have read the manual, but I am not sure how to decrease the stringency. Thanks.

3.6 years ago
Carambakaracho ★ 2.8k

I assume you have whole genome shotgun sequencing reads, which means relatively low coverage at the insertion site*.

In comparable projects I aligned the reads with BWA against a combined index of the genome plus the transgene (use the -Y option for easier visualisation later). Then filter your BAM file for split reads with one part in the genome and the other in the transgene. The easiest next step is to visualize the reads in something like IGV, display the soft-clipped sequence in the transgene, and see whether you can find a stack of reads with a sudden cutoff whose soft-clipped sequence matches a location in the genome. Find this location, perform the same analysis there, and cross-validate your presumed insertion. You can also give a structural variant caller a try, something like lumpy or Nicolas' GRIDSS. For example, lumpy can work with just the filtered split-read BAM. If you want to get fancy, you can extract the positions and soft-clipped sequences from the BAM file and align them to build a consensus sequence of the integration site, using thresholds on the fraction of reads pointing to the integration site, bearing in mind that your genome is diploid and the integration is almost certainly in only one of the two homologous chromosomes.
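The "stack with a sudden cutoff" can also be looked for without a browser. The sketch below is a rough, hypothetical helper (not part of BWA, IGV or any caller) that tallies, per contig, the reference coordinate at which clipped alignments stop matching. It reads plain SAM text, treats both soft (S) and hard (H) clips as clipping, and uses an arbitrary `min_clip` of 20 bases.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_OPS = set("MDN=X")   # CIGAR operations that consume reference bases

def _clip_len(ops):
    """Total length of the consecutive S/H operations at the start of ops."""
    n = 0
    for length, op in ops:
        if op not in "SH":
            break
        n += length
    return n

def clip_breakpoints(sam_lines, min_clip=20):
    """Count clipped alignments per (contig, 1-based coordinate, side)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        flag, rname, pos, cigar = int(f[1]), f[2], int(f[3]), f[5]
        if flag & 4 or cigar == "*":
            continue
        ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
        ref_len = sum(n for n, op in ops if op in REF_OPS)
        if _clip_len(ops) >= min_clip:            # clip before the first aligned base
            counts[(rname, pos, "left")] += 1
        if _clip_len(ops[::-1]) >= min_clip:      # clip after the last aligned base
            counts[(rname, pos + ref_len - 1, "right")] += 1
    return counts
```

`clip_breakpoints(lines).most_common(10)` then lists the coordinates with the most clipped reads. A real junction should show up on both the host contig and the transgene, which is the cross-validation described above. With one diploid WGS sample the counts will be low, so any cutoff on them is a judgment call.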

(*) There's also a specialized method for finding integration sites in large genomes described in this paper. My strategy was developed for this method which yields 1000x coverage and more at the insertion site, hence the "relatively low coverage" comment above.

Thanks a lot for the suggestion. How can I filter my BAM files for the split reads? Can BWA do soft-clipping?

With a small script that works on the SA tags in the optional SAM alignment fields. For a start, you could get by with something like grep SA: your.sam.
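Such a script could look roughly like the sketch below. It relies on the SA tag layout from the SAM specification (rname,pos,strand,CIGAR,mapQ,NM, with entries separated by semicolons) and keeps only reads whose two pieces sit on different sides, one on the transgene and one on a host contig. The contig name `transgene` and the function name are assumptions for illustration. Contig names containing commas would break the parsing.

```python
def split_reads(sam_lines, transgene="transgene"):
    """Find split reads with one part on the transgene and the other on the host.

    Returns (read_name, contig, pos, sa_contig, sa_pos, sa_strand, sa_cigar).
    Supplementary records (flag 2048) are skipped so each split is listed once,
    from the point of view of the representative alignment.
    """
    out = []
    for line in sam_lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        flag, rname, pos = int(f[1]), f[2], int(f[3])
        if flag & (4 | 2048):
            continue
        # Optional fields start at column 12; look for the SA:Z: tag.
        sa = next((t[5:] for t in f[11:] if t.startswith("SA:Z:")), None)
        if sa is None:
            continue
        for entry in sa.rstrip(";").split(";"):
            sa_rname, sa_pos, sa_strand, sa_cigar, _mapq, _nm = entry.split(",")
            # Keep the read only if exactly one of the two parts is on the transgene.
            if (rname == transgene) != (sa_rname == transgene):
                out.append((f[0], rname, pos, sa_rname, int(sa_pos), sa_strand, sa_cigar))
    return out
```

Each returned tuple already links the transgene part of a read to its partner location on the host, so sorting or counting by host contig and position gives a first list of candidate junctions.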

BWA does soft-clip. With the -Y option, even the supplementary part of a split read is soft-clipped rather than hard-clipped.

I am not really sure what happens after grepping for the SA: reads. As you mentioned, I can visualize them in (e.g.) IGV, but what then? I guess it won't be a one-to-one situation, but probably multiple positions. How can I join the reads mapped to the transgene insert with their partners on a different chromosome?

Regarding the paper mentioned in the answer: is there a way to get the workflow used there? Is it relevant only for targeted sequencing and TLA-prepared samples, or can I also reuse it for a "normal/standard" WGS experiment?