5.7 years ago
shashwat36
Hello, I want to find the location of a random integration of a gene in the genome. I am working with Pichia pastoris and have a random integration of a transgene. I sequenced the whole genome on an Illumina MiniSeq and have 80x coverage of the genome. It sounds pretty straightforward, but I have been struggling. Here is what I have tried:
- align paired-end reads to the wild-type Pichia genome using bwa
- combine *.sai into a bam file
- sort and index the bam file
- generate consensus fasta sequence from the bam file using samtools pileup | bcftools | vcfutils.pl
- bwa index the resulting fasta
- align the transgene sequence against that index
When I do that, I end up with zero alignments. However, I am certain that the gene is there, and its presence has been confirmed by qPCR. Can someone please help?
Thanks
Never did that myself, but here are my thoughts (hope this is paired-end data):
samtools view -b -o overlap.bam in.bam chrTR
where chrTR is the name that you gave your "new chromosome". This should probably give you an idea where your insertion site(s) is/are. How sure are you that you have a single integration event and not multiple ones?
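To make the idea concrete, here is a minimal Python sketch of how junction positions could be read off soft-clipped alignments. It assumes SAM text lines (for example, as printed by `samtools view` on a BAM file) and an aligner that reports soft clips as `S` in the CIGAR string. The function name `clip_breakpoints`, the `min_clip` threshold, and the chromosome names in the usage note are made up for illustration, not part of any tool.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def clip_breakpoints(sam_line, min_clip=20):
    """Return (chrom, position, side) for each soft clip of at least min_clip bases.

    position is 1-based: the first aligned base for a left clip,
    the last aligned base for a right clip.
    min_clip=20 is an arbitrary illustrative threshold.
    """
    fields = sam_line.rstrip("\n").split("\t")
    chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
    if cigar == "*":  # unmapped read, no alignment to inspect
        return []
    # Hard clips (H) can flank soft clips; drop them before looking at the ends
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    found = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        found.append((chrom, pos, "left"))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        found.append((chrom, pos + ref_len - 1, "right"))
    return found
```

For a read aligned at position 1000 of a hypothetical `chr1` with CIGAR `30S70M`, this reports a left-clipped breakpoint at 1000; for a read at 931 with CIGAR `70M30S`, it reports a right-clipped breakpoint at 1000. Many reads agreeing on the same position is what you would hope to see at a real junction.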
Thanks for your solution, it was really helpful. I ran the analysis as per your suggestion and I have two questions:
The vector contained a landing pad, so I know the expected site of integration, and I did see the soft-clipped reads aligning to that locus. However, I also saw reads aligning to another location on the chromosome. Does that mean that I have an additional random integration event, or is that just noise that should be ignored? I ask because we have other strains where we try random integration in genomes, and I would like to be able to tell false alignments from actual integration sites in those genomes.
I took the soft-clipped reads from the two ends of the linearized insertion vector and aligned them to the wild-type genome. Theoretically, reads from both ends should align to roughly the same location on the genome. However, at one alignment locus I see only the reads from one end of the vector. Does that mean that there is a partial integration there? And if so, is there any way to tell if the gene of interest is present there from this data alone, without having to run a PCR?
Thanks
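One way to make both questions more quantitative is to group the soft-clip positions into loci and count the supporting reads at each. The Python sketch below is only an illustration under stated assumptions: the input is a list of `(chrom, position, side)` tuples, where `side` is `"left"` or `"right"` depending on which end of the read was clipped; the names `cluster_breakpoints` and `summarize`, the 10 bp `window`, and the `min_reads` cutoff of 5 are arbitrary choices, not established thresholds.

```python
from collections import Counter


def cluster_breakpoints(breakpoints, window=10):
    """Group (chrom, pos, side) tuples lying within `window` bp of the previous one."""
    clusters = []
    for chrom, pos, side in sorted(breakpoints):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and pos - last["end"] <= window:
            # Close enough to the current cluster: extend it
            last["end"] = pos
            last["sides"][side] += 1
        else:
            # Start a new cluster at this position
            clusters.append({"chrom": chrom, "start": pos, "end": pos,
                             "sides": Counter({side: 1})})
    return clusters


def summarize(clusters, min_reads=5):
    """Report read support per locus and whether both clip orientations occur."""
    rows = []
    for c in clusters:
        n = sum(c["sides"].values())
        rows.append({
            "locus": f'{c["chrom"]}:{c["start"]}-{c["end"]}',
            "reads": n,
            "well_supported": n >= min_reads,  # illustrative cutoff only
            "both_junctions": c["sides"]["left"] > 0 and c["sides"]["right"] > 0,
        })
    return rows
```

A locus supported by a single read is more likely to be noise than one with many reads clipped at the same base, and a locus showing only one clip orientation is one where only one junction was detected. Neither observation tells you on its own whether the gene of interest is intact at that site.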