Question

How to find an insertion sequence in a .bam file?

0


9.2 years ago

kimpole1017 • 0

I aligned my fastq files to the reference and obtained an aligned bam file. As the next step, I want to find an insertion sequence in a specific region of the bam file. I first checked the site by PCR, and there is an insertion there. However, when I looked at the bam file with bamview, no insertion sequence shows up in that region. What is wrong with my steps? I suspect that the default settings of bwa threw the insertion sequence away.

Steps:

bwa index ref.fa
bwa aln ref.fa read1.fq > r1.sai
bwa aln ref.fa read2.fq > r2.sai
bwa sampe ref.fa r1.sai r2.sai read1.fq read2.fq | samtools view -bSh -o out.bam -
Opened out.bam in bamview and compared it to ref.fa

bam bwa • 5.0k views

updated 2.6 years ago by Ram 45k • written 9.2 years ago by kimpole1017 • 0

1


Are you talking about a small insertion (a few bp) or a large one (>50 bp)? What is inserted?

9.2 years ago by WouterDeCoster 48k

0


The insertions are larger than 1000 bp; they are insertion sequences, that is, transposons. In the meantime I found a tool called "pindel" and am working with it. Thank you for your interest; I am always open to helpful information.

9.2 years ago by kimpole1017 • 0

score 1 · Answer 1 · 2016-05-10

If you read about NGS, you will notice that there is almost always an adjective next to "reads" and "indels", which is "short". The reason is that NGS reads are short (tens to a few hundred bases, depending on the technology chosen), and therefore the indels you can detect directly with them also have to be short.
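To make the size limit concrete, here is a minimal Python sketch that pulls the insertion lengths out of a CIGAR string (column 6 of a SAM record). The helper names `insertion_lengths` and `read_length` and the example CIGAR are made up for illustration. The point it shows: inserted bases are part of the read, so an insertion reported inside a single alignment can never be longer than the read itself.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def insertion_lengths(cigar):
    """Return the lengths of all insertion (I) operations in a CIGAR string."""
    return [int(n) for n, op in CIGAR_OP.findall(cigar) if op == "I"]


def read_length(cigar):
    """Count the read bases described by the CIGAR.

    M, I, S, = and X consume bases of the read; D, N, H and P do not.
    """
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")


# Hypothetical 100 bp read carrying a 5 bp insertion.
example = "30M5I65M"
print(insertion_lengths(example), read_length(example))
```

A 1000 bp transposon cannot fit into such a record, so it will never appear as an `I` operation in a short-read alignment, whatever the aligner settings.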

If you want to find an insertion (that is, a sequence not contained in the reference genome) by sequencing and mapping, you need reads that cover both part of the insertion and an anchor in the reference genome next to it. Reads that fall entirely inside the insertion contain only sequence that is absent from the reference, so they won't map at all. Even when a read has a substantial anchor, the inserted part looks like a long run of mismatches to the aligner, so if too much of the read is insertion it may still fail to map. Depending on the size of the insertion and the length of your reads, it is therefore quite possible that you won't see your insertion in the aligned bam file: you may well have sequenced it (it would be among the unaligned reads), but it is difficult to map.
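As a rough illustration of what to look for instead, here is a small Python sketch that sorts SAM records into unmapped reads, reads with a long soft clip, reads with a small in-read insertion, and plain alignments. It assumes the aligner soft-clips the non-matching end of a read that spans the breakpoint (local aligners such as bwa mem do this; whether your bwa aln/sampe run does is something to check). The function names, the 20-base `min_clip` threshold and the toy records are all made up for the example.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def classify(sam_line, min_clip=20):
    """Label one SAM alignment line (min_clip is an arbitrary threshold)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & UNMAPPED or cigar == "*":
        return "unmapped"
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if any(op == "S" and n >= min_clip for n, op in ops):
        return "soft_clipped"  # possible breakpoint of a large insertion
    if any(op == "I" for _, op in ops):
        return "small_insertion"
    return "plain"


def summarize(sam_lines):
    """Count the labels over all non-header, non-empty SAM lines."""
    return Counter(
        classify(line)
        for line in sam_lines
        if line.strip() and not line.startswith("@")
    )


# Toy records around a hypothetical breakpoint (SEQ and QUAL left as "*").
toy_sam = [
    "@SQ\tSN:chr1\tLN:5000",
    "r1\t99\tchr1\t1000\t60\t100M\t=\t1200\t300\t*\t*",
    "r2\t99\tchr1\t1040\t60\t60M40S\t=\t1250\t310\t*\t*",
    "r3\t163\tchr1\t1101\t60\t35S65M\t=\t1300\t300\t*\t*",
    "r4\t69\tchr1\t1090\t0\t*\t=\t1090\t0\t*\t*",
    "r5\t99\tchr1\t1050\t60\t48M2I50M\t=\t1260\t310\t*\t*",
]
print(summarize(toy_sam))
```

On real data the lines could come from the text output of `samtools view` on the region of interest. A pile-up of soft-clipped reads and unmapped mates at one position is the kind of signal that split-read tools such as pindel use to call large insertions.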