Question

Count the numfer of reads supporting a pattern from a BAM file

0

Entering edit mode

2.9 years ago

User000 ▴ 690

Dear all,

I have a BAM file and I want to count how many reads have the following 4 patterns at chromosome region around 250bp:

AAAAAAAGAAAAA
AAAAAAAGAAAAAA
AAAAAAAAGAAAAAA
AAAAAAAAGAAAAA

I have downloaded ASCIIgenome software and tried something with samtools mpileup. But I don't know how to do what I am looking for.

samtools • 1.1k views

ADD COMMENT • link updated 2.8 years ago by jkbonfield ★ 1.2k • written 2.9 years ago by User000 ▴ 690

0

Entering edit mode

I would do it in this way:

1) extract your reads from this region Extract Reads From A Bam File That Fall Within A Given Region

2) transform BAM to FASTA

3) write a python script that mechanically counts these patterns from FASTA, should be 20 lines.

ADD REPLY • link 2.9 years ago by German.M.Demidov ★ 2.9k

0

Entering edit mode

support the following pattern at region around 250:

i don't understand that sentence.

ADD REPLY • link 2.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I edited the question.

ADD REPLY • link 2.9 years ago by User000 ▴ 690

0

Entering edit mode

that's still not clear. What is the 'pattern' ? how is it different from running

grep -E '(AAAAAAAGAAAAA|AAAAAAAGAAAAAA|AAAAAAAAGAAAAAA|AAAAAAAAGAAAAA)'

ADD REPLY • link 2.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I am looking for A indels near 250 bp of the chromosome region. So I want to count how many reads support each of the 4 patterns in order to identify how many do have indesl and so on.

ADD REPLY • link 2.9 years ago by User000 ▴ 690

0

Entering edit mode

Would something like this work?

samtools view in.bam | grep -E 'AAAAAAAGAAAAA' | wc -l

ADD REPLY • link 2.9 years ago by User000 ▴ 690

score 2 · Accepted Answer · 2021-06-14

Just use grep. Your pattern is "AAAAAAAGAAAAA" with an optional A either side, so AAAAAAAGAAAAA alone matches the other 3 also.

Hence either:

samtools view in.bam | grep -c AAAAAAAGAAAAA

Or with a modern samtools:

samtools view -c -e 'seq=~"AAAAAAAGAAAAA"' in.bam

If you wish to explicitly add a region, then that can be done on the command line too, either adding one arg at a time (eg 1:250-250 which will be reads overlapping that coord) or with a -L regs.bed option.

Edit: the question is poorly worded, but it dawns on me maybe you're looking for AAAAAAAGAAAAA and not flanking A? If so that's regexp [^A]AAAAAAAGAAAAA[^A]. Read the egrep man page is probably the best starting point.