Split a bam file per read
1
0
Entering edit mode
7.8 years ago
s.vdrzeeuw ▴ 20

So I have a full genome PACbio BAM file and this has a average coverage of 3-6.

Now I want to split the bam file per read. Since readgroup is not working and sometimes still gives me multiple reads inside my BAM.

The reason why I want this is because i want to run MPILEUP with only 1 read per bam, so that I can identify insertions/deletions per read on a given location. Anyone knows a program to do this? Or another approach to do it?

Thanks!

BAM manipulation • 2.3k views
1
Entering edit mode
7.8 years ago

I wrote a tool named sam2tsv

It'll show you each base of each read for each position

$java -jar dist/sam2tsv.jar -A \ -r samtools-0.1.18/examples/toy.fa samtools-0.1.18/examples/toy.sam r001 163 ref 0 T . 7 T M r001 163 ref 1 T . 8 T M r001 163 ref 2 A . 9 A M r001 163 ref 3 G . 10 G M r001 163 ref 4 A . 11 A M r001 163 ref 5 T . 12 T M r001 163 ref 6 A . 13 A M r001 163 ref 7 A . 14 A M r001 163 ref 8 A . . . I r001 163 ref 9 G . . . I r001 163 ref 10 A . . . I r001 163 ref 11 G . . . I r001 163 ref 12 G . 15 G M r001 163 ref 13 A . 16 A M r001 163 ref 14 T . 17 T M r001 163 ref 15 A . 18 A M  UPDATE: and generating one mpileup per read... samtools view -F 4 ex1.bam | while read L; do (samtools view -H ex1.bam && echo "$L" ) | samtools view -bS - | samtools mpileup -f ex1.fa - ; done

0
Entering edit mode

@Pierre Thanks for your reply, can I also extract one mpileup per read with a filtering region II know try it like this:

while read LINE
do
chr=$( printf "$LINE" | cut -f1 )
start=$( printf "$LINE" | cut -f2 )
stop=$( printf "$LINE" | cut -f3 )
region=$chr:$start-$stop echo "Regions:"$region
echo "start:" $start echo "stop:"$stop
echo "line:" $LINE mkdir -p results/${region/-/_}

samtools view -F 4 $inBAM$region  | while read L; do (samtools view -H $inBAM && echo "$L" ) | samtools view -bS - | samtools mpileup -f $refGenome -r$region - > results/${region/-/_}/${FILE%.bam}Mpileup.txt; done
done < \$BED


unfortunately this fails. I need to have the mpileup files inside a different region directory so that I can distinguish them. Later on and i will not be left over with 1000 files if I have 200 regions in my bed file. Hope this is also possible thanks for you help already!

this is the error I got:

[bam_header_read] EOF marker is absent. The input is probably truncated.
[samopen] SAM header is present: 93 sequences.
[mpileup] fail to load index for -

0
Entering edit mode

run samtools index your.bam before running the loop.

0
Entering edit mode

I already have my bam file indexed. So this should not be the case or do you mean after splitting the bam per read.

0
Entering edit mode

There is no need to specify the region in mpileup. The reads are already extracted with view

0
Entering edit mode

Hmmm, maybe I should rephrase the question, I need to identify repeat regions for each read. Also I need to know there exact lengths and the lengths after counting of insertions and deletions. Therefore I want to have a separate bam file for each read on each repeat region. So that my follow up script can progress those files into a summary table like this:

Chr ::: startpos ::: norm_length ::: calc_length ::: longest repeat ::: shortest repeat ::: ins bases ::: deleted bases


Maybe this clarifies my needs more, still many thanks for investing your time ;)