Question

How to find the integration loci of a transgene from whole-genome sequencing data

0

5.7 years ago

shashwat36 • 0

Hello, I want to find where a gene has randomly integrated into the genome. I am working with Pichia pastoris and have random integration of a transgene. I sequenced the whole genome on an Illumina MiniSeq and have 80x coverage of the genome. It sounds pretty straightforward, but I have been struggling. Here is what I have tried:

Align the paired-end reads to the wild-type Pichia genome using bwa
Combine the *.sai files into a BAM file
Sort and index the BAM file
Generate a consensus FASTA sequence from the BAM file using samtools pileup | bcftools | vcfutils.pl
bwa index the resulting FASTA
Align the transgene sequence against that index

When I do that, I end up with zero alignments. However, I am certain that the gene is there, as its presence has been confirmed by qPCR. Can someone please help?

Thanks

next-gen genome alignment • 1.3k views

ADD COMMENT • link 5.7 years ago by shashwat36 • 0

1

Never did that myself, but here are my thoughts (hope this is paired-end data):

Add the transgene sequence as a new chromosome to the reference genome
Index it with BWA, then align the reads against it with bwa mem
Extract all reads that overlap the transgene sequence (samtools view -b -o overlap.bam in.bam chrTR), where chrTR is the name that you gave your "new chromosome"; in.bam needs to be coordinate-sorted and indexed for this region query to work
Extract all soft-clipped reads
Align these reads against the original reference genome without the extra chromosome
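
For the soft-clip extraction step, here is a minimal Python sketch of one way it could be done; it is an illustration, not a tested pipeline. It reads plain SAM text (for example the output of samtools view overlap.bam), keeps reads aligned to the transgene contig, and writes only the clipped portion of each read as FASTQ so it can be re-aligned to the wild-type genome. The function names, the contig name chrTR, and the 20-base minimum clip length are assumptions made for this example.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clips(cigar):
    """Return (left_clip, right_clip) soft-clip lengths for a CIGAR string."""
    # Hard clips (H) can sit outside soft clips and are absent from SEQ, so drop them.
    core = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    if not core:
        return 0, 0
    left = core[0][0] if core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right

def clipped_fastq(sam_lines, contig="chrTR", min_clip=20):
    """Yield FASTQ records holding the soft-clipped part of reads on `contig`.

    `contig` and `min_clip` are assumed values; adjust them to your data.
    """
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 11:
            continue
        name, flag, rname, cigar, seq, qual = f[0], int(f[1]), f[2], f[5], f[9], f[10]
        if rname != contig or flag & 0x4 or cigar == "*":
            continue                      # wrong contig or unmapped
        if qual == "*":                   # qualities not stored: use a placeholder
            qual = "I" * len(seq)
        mate = "1" if flag & 0x40 else "2" if flag & 0x80 else "0"
        left, right = soft_clips(cigar)
        if left >= min_clip:
            yield f"@{name}_m{mate}_L\n{seq[:left]}\n+\n{qual[:left]}\n"
        if right >= min_clip:
            yield f"@{name}_m{mate}_R\n{seq[-right:]}\n+\n{qual[-right:]}\n"

# Possible use, reading SAM text from standard input:
#   import sys; sys.stdout.writelines(clipped_fastq(sys.stdin))
```

The resulting FASTQ can then be aligned to the reference without the extra chromosome. The _L and _R suffixes record which side of the alignment was clipped, which helps later when working out which end of the insert a junction belongs to.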

This should probably give you an idea where your insertion site(s) is/are. How sure are you that you have a single integration event and not multiple ones?
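
To get a rough feel for whether there are one or several candidate sites, one option is to group the positions where the re-aligned clipped reads land and count how many reads support each group. The sketch below is only an illustration of that idea: the function name, the 50 bp window, and the minimum of 3 supporting reads are assumptions, not established thresholds, and the input is assumed to be (chromosome, position) pairs taken from the re-alignment.

```python
from collections import defaultdict

def cluster_breakpoints(hits, window=50, min_support=3):
    """Group nearby alignment positions and report read support per group.

    hits: iterable of (chrom, pos) pairs from the re-aligned clipped reads.
    Returns a list of (chrom, start, end, n_reads) for groups with at least
    `min_support` reads. `window` and `min_support` are assumed values.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in hits:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        start = prev = positions[0]
        n = 1
        for pos in positions[1:]:
            if pos - prev <= window:      # still within the current group
                n += 1
            else:                         # gap too large: close the group
                if n >= min_support:
                    clusters.append((chrom, start, prev, n))
                start, n = pos, 1
            prev = pos
        if n >= min_support:              # close the last group
            clusters.append((chrom, start, prev, n))
    return clusters

# Hypothetical example positions, not real data:
hits = [("chr1", 1000), ("chr1", 1010), ("chr1", 1020),
        ("chr2", 500), ("chr1", 90000)]
print(cluster_breakpoints(hits))
```

Loci supported by only one or two reads are more likely to be noise or repeats than loci with many reads stacked at the same breakpoint, but the read counts alone are a hint rather than proof.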

ADD REPLY • link 5.7 years ago by ATpoint 81k

0

Thanks for your solution, it was really helpful. I ran the analysis as per your suggestion and I have two questions:

The vector contained a landing pad, so I know the expected site of integration, and I did see the soft-clipped reads aligning to that locus. However, I also saw reads aligning to another location on the chromosome. Does that mean that I have an additional random integration event, or is that just noise that should be ignored? I ask because we have other strains where we try random integration into the genome, and I would like to be able to tell false alignments from actual integration sites in those genomes.
I took the soft-clipped reads from the two ends of the linearized insertion vector and aligned them to the wild-type genome. Theoretically, reads from both ends should align to roughly the same location on the genome. However, at one alignment locus I see only the reads from one end of the vector. Does that mean that there is a partial integration there? And if so, is there any way to tell if the gene of interest is present there from this data alone, without having to run a PCR?

Thanks

ADD REPLY • link 5.7 years ago by shashwat36 • 0