Question

Nucleotide per-position from bam file and filter on quality

0

Entering edit mode

7.8 years ago

VHahaut ★ 1.2k

Hello!

I would like to extract from a bam file (or a series of bam) the number of ATCG at each position while filtering for the one that have a low phred score.

I found several approaching solutions (among them pysamstats) but not one that would do the job.

Can someone have a program/script that would solve this issue?

Thanks in advance!

SNP • 3.0k views

ADD COMMENT • link updated 3.7 years ago by khalilard • 0 • written 7.8 years ago by VHahaut ★ 1.2k

0

Entering edit mode

3.7 years ago

khalilard • 0

Hi,

I know this post is old but I was looking for a solution to the same problem (preferably as a one-line bash code) and after a lot of head-scratching and googling, I came up with the following. I'm leaving it here for whomever wants it (or for my future self when I inevitably forget what I did):

samtools mpileup bamfile.bam -f reference.fa -r regions -q 20 -Q 20 | awk 'BEGIN{RS="\n";OFS="\t";ORS="\t"}{split("", count); a=split($5,chars,""); print $1,$2,$3,$4; for (u=1; u<=a; u++) count[chars[u]]++i; for (u in count) {printf "%s   ",u; printf "%s   ", count[u]}; print "\n"}'

The -q filters for base phred-scores and -Q filters for mapping quality if needed. The output table is not very elegant, but it will print the composition of each base pair position and their number according to the mpileup formatting. An example:

9   133730382   T   3365    ]  2    ^  2    ,  1815    G  2    .  1548    $  1
9   133730383   T   3363    ]  3    ^  3    ,  1813    .  1550    $  2

Two positions from an example BAM file (chromosome 9, positions 133730382 and 133730383) have a T with a depth of 3365 and 3363 respectively and have 1815 reads with the T in reverse and 1548 in forward for the first for the first position. Hope someone finds this helpful.

ADD COMMENT • link 3.7 years ago by khalilard • 0

score 2 · Accepted Answer · 2016-07-13

2

Entering edit mode

7.8 years ago

Pierre Lindenbaum 161k

I wrote a SAM2tsv: https://github.com/lindenb/jvarkit/wiki/SAM2Tsv, filter on quality, group by contig/pos

ADD COMMENT • link 7.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

A all-in-one package would have been better but yours did a perfectly good job. It'll just require some digging in the 16,443,247 lines that it produced^^

Thanks!

ADD REPLY • link 7.8 years ago by VHahaut ★ 1.2k

0

Entering edit mode

If someone is interested by the next step:

GOAL: For each position of a chromosome, find the ATCGN content while filtering for the low phred score:

# Python:
import time

# Phred values (Illumina)
qual=list("""!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI""")
score=range(0,41,1)
phred=dict(zip(qual, score))

# Threshold phred score once converted
threshold=20

# Chr or sequence length
length=1000000

# Create a list, long as length chr, containing ATCGN information for each position
# Take as input the sam2Tsv file from https://github.com/lindenb/jvarkit/wiki/SAM2Tsv

chr=[dict(zip(('A','T','C','G','N'), (0,0,0,0,0))) for x in range(1,length)]
input='/where/is/the/inputfile.tsv'
output='/where/to/output/xxxxxxx.txt'

# Report per position the the amount of ATCGN
start_time = time.time()

with open(input) as f:
        lines=f.readlines()
        for line in lines:
            if line.split("\t")[6].isdigit():
                if phred[line.split("\t")[5]] > threshold:
                    # If phred score is > than threshold, increase count for this base at this position
                    chr[int(line.split("\t")[6])][line.split("\t")[4]]=chr[int(line.split("\t")[6])][line.split("\t")[4]] + 1

print("--- %s seconds ---" % (time.time() - start_time))

with open(output, "w") as output:
    output.write('#ChrPos'+'\t'+'A'+'\t'+'T'+'\t'+'C'+'\t'+'G'+'\t'+'N'+'\n')
    for i in range(1,len(chr)):
        output.write(str(i)+'\t'+str(chr[i].values()).strip('[]').replace(', ','\t')+'\n')

ADD REPLY • link 7.8 years ago by VHahaut ★ 1.2k