Extract Base Based On Position From Bam File
5
2
Entering edit mode
12.4 years ago
Empyrean ▴ 170

I have a bam file and i wanted to extract the following information.

For a read in a bamfile, can i extract the base based on the position? For example, if i say i need the base at 100th position and it should output whether its A/T/G/C. Can i get this using samtools or bamtools??

mpileup samtools • 27k views
ADD COMMENT
4
Entering edit mode

Here's a one-liner that will do it:

samtools view your.bam | awk '{print substr( $10, 100, 1)}'
ADD REPLY
0
Entering edit mode

thank you chris, your solution is perfect for what i have asked but when i saw the output, i realized that the position i need is not with respect to read, but it is based on reference. i have commented to Arun above in a bit detail.. can you suggest any solution for this?

ADD REPLY
2
Entering edit mode

I ran across a problem similar to yours, and when googling to find a solution I found this thread. Since Pierre's solution did not work for me I wrote a GATKWalker of my own to solve the issue.

You run it like this:

java -jar GenomeAnalysisTK.jar -T GetBasesAtPosition --intervalToGet ref:39-41 -R [reference path] -I [bam path] -o [path to output file]

And it will generate output that looks like this:

read1      TTT
read2      TAT
read3       TTG
read4     TCT

It's available in my fork of gatk at github. Clone it from there, and then build it and you should be able to run it:

git clone https://github.com/johandahlberg/gatk.git -b devel gatk
cd gatk
ant

Is it something like this you were looking for? I realize that this a old thread, but I still think that I would leave this answer for completeness sake.

ADD REPLY
0
Entering edit mode

NOTE: This has now been fixed, and the <> are no longer an issues, so this comment may be disregarded - @RamRS on 22-Aug-2020

You have to remove the "<>" around the git url, Biostars seems to add it automatically, and I don't know how to remove them.

ADD REPLY
0
Entering edit mode

Thank you for your reply. But i am not looking to get for a interval. you can read my comment from above.. where i am not just looking for bases from read. I am looking for bases after mapping to reference i.e it has to take gaps in to consideration and get as output.. I have some specific snp positions and like to get bases on reads for those positions.. hope i havent confused you..

ADD REPLY
0
Entering edit mode

I got it that you wanted the bases relative to the reference, and that is what this will give you. so if you specify --intervalToGet chr1:1-3, you would get the first three bases from chromosome 1. However, it won't be able to handle gaps. I don't think that it should be to much work to check if it is a gap on that position by looking at the cigar string, but I don't have the time for it at the moment. Feel free to fork the code and do it your self how ever. :) Hope you find a solution.

ADD REPLY
0
Entering edit mode

thank you johan, also can i give multiple locations? like instead of chr1:1-3, i wanted the bases for following positions on chr:1 - 1, 10, 15 , 20 , 25 , 30. how can i get this information?

ADD REPLY
0
Entering edit mode

It's not possible in the current implementation, I'm afraid. But I might fix that later on, since I think I might need it myself. But right now you have to run it once for each locus, which isn't very practical if you have a lot of loci.

ADD REPLY
0
Entering edit mode

I've been looking at this walker since I needed exactly what is described here; however I'm slightly worried about the correctness of the output.

Can you indicate if your walker handles orientation and indels correctly?

If I count the base frequencies from pileup or IGV and compare these to the output of the walker, the latter values can be way off.

ADD REPLY
0
Entering edit mode

I'm aware of the problem and have sent you a PM.

ADD REPLY
0
Entering edit mode

Hi Johan,

I have cloned the link to the command line but the syntax you've written doesn't work. Do I need to clone this first (GenomeAnalysisTK.jar) into the command line?

ADD REPLY
1
Entering edit mode

Did you managed to find a way to recover specific bases from your reads based on position from bam file ? I have the same problem ! Thx

ADD REPLY
0
Entering edit mode
samtools view my.bam chr1:100
ADD REPLY
0
Entering edit mode

thank you for the answer. But this doesnt help. because when i give this command as input, it again prints the bam file at that positions but not the bases. For example, i have few positions where i got the SNPS, lets say 5, 10 ,15 20 and not all reads will have this variation right !! so for read 1 at 5, 10,15,20 - ATGC , Read 2 : ATCG , Read3 : AACG, Read4: AACC .. like that.. i have some 15- 20 positions and i have to extract all the bases on all reads for those positions.

ADD REPLY
0
Entering edit mode

Its not very simple as it looks like. Using this command, i will get the bases on the reads just by giving that position, but i want it according to reference. please read my comments under my post above. that might give you better idea of what i am looking for and help me if you figure out a way to do it.. so in simple if i have an indel in a read for example from pos 5-10 (5bases) and i wanted to extract the 11th base.

Technically using the above command i will get 11th base on read but when it maps to reference, the 11th base on reference will be 11+5 i.e 16th base pair on the read as we have 5bp indel in between the read. So it should take gaps, mismatches, indels in to account for calculating that base.

Hope this is clear.. i have been searching for this and to my surprise no program is able to achieve this :(

ADD REPLY
0
Entering edit mode

did you find a solution ? I have the same problem !

ADD REPLY
0
Entering edit mode

Chris' solution is neat! However, could you explain why you'd want to extract bases at the same location from all the reads, given that they have different start and end mapped positions?

ADD REPLY
0
Entering edit mode

thank you but this not what i need. I guess i should have made my question clear..sorry about that.. So once the read maps to the reference, i wanted to give input a position based on the reference say 100th position on reference, i need all bases from the reads which are mapping at that particular position including the readname. if i use the solution below, its printing the 100th base on the reads which is not equal to what i need because if there are indels in reference or reads, then the position of bases changes completely. So is there anyway to get what i need? i tried with mpileup. Its printing out all the bases at that positions but not the readnames.

ADD REPLY
0
Entering edit mode

Can you explain why you need the read name and not just the list of bases via mpileup?

ADD REPLY
0
Entering edit mode

ok sure.. I am working on a polyploid organism which has two genomes in it. goal is to seperate them by identifying the markers. So when i identify the snps between them i could not find good heterozygous snps between them. So i wanted to get a haplotype for the read. for that, i use the snps information and build this. As i have known the snp positions, i would like to get bases on other reads as well . this is amplicon sequencing so the read length is around 380bp. this is one part to seperate two genomes in a genotype. Also i would like to seperate two genotypes by using this. I rename reads according to specific genotype and once i get the haplotype information for the reads, i can seperate two genotypes easily based on read names.

i tried several haplotype callers but they doesnt seem to solve my problem. So i thought of starting from base i.e bam file to get such information out.. hope i am clear with what i need. Can you suggest a better approach for this?

ADD REPLY
0
Entering edit mode

How to do this in R?

ADD REPLY
4
Entering edit mode
12.0 years ago

There is a much simpler solution than all the ones showed here, which uses GATK's UnifiedGenotyper to do the base calling in all the wanted sites with the aid of the EMIT_ALL_SITES capability. the wanted sites should be previously defined in a bed file with the following format:

chrXXX  PositionOfInterest-1  PositionOfInterest

the final command would be something like this:

java -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper \
  -R human_hg19.fa \
  -D dbsnp_135.hg19.vcf \
  -I input.bam \
  -o out.vcf \
  --intervals positions.bed \
  --output_mode EMIT_ALL_SITES

Entering the dbsnp file is completely optional, but it will output the rs code if a particular position has it, which is something very useful for us. I've just tested this command very recently because we are about to start using a 27 SNPs Sequenom multplex as quality control and sample identification for all the samples to be run in our sequencer, and it takes only a couple of seconds to run on an entire 10-15GB BAM exome file.

ADD COMMENT
1
Entering edit mode
12.4 years ago

Ok here is an ugly (global variables...) C program that will output the information you need. You need to walk over the CIGAR string for each read. Please, not that I'm not sure about to handle the cigar operations involving a 'CLIP', feel free to correct my code.

Compilation:

gcc prog.c -I ./samtools-0.1.18 -L samtools-0.1.18 -lbam -lz

Execute:

$ ./a.out  toy.bam ref:7-10 
7   r001[1] T
8   r001[2] T
9   r001[3] A
9   r002[4] A
9   r003[6] A
10  r001[4] G
10  r002[5] G
10  r003[7] N
11  r001[5] A
11  r002[6] A
11  r003[8] N
12  r001[6] T
12  r002[7] T
12  r003[9] N
13  r001[7] A
13  r002[8] A
13  r003[10]    N
14  r001[8] A
14  r002[9] A
14  r003[11]    N
15  r001[13]    G
15  r002[14]    T
16  r001[14]    A
16  r002[15]    A
17  r001[15]    T
17  r002[16]    A
18  r001[16]    A
18  r002[17]    A
20  r001[17]    C
21  r001[18]    T
22  r001[19]    G

Source:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "bam.h"
#include "sam.h"

static bamFile in=0;
static bam_header_t* header=0;
static bam_index_t *idx=0;

// callback for bam_fetch()
static int fetch_func(const bam1_t *b, void *data)
    {
    bam_plbuf_t *buf = (bam_plbuf_t*)data;
    bam_plbuf_push(b, buf);
    return 0;
    }

static int  scan_all_genome_func(uint32_t tid, uint32_t pos, int depth, const bam_pileup1_t *pl, void *data)
        {    
        int i;   
        for( i=0;i<depth ;++i)
        {
        const bam_pileup1_t *p=&pl[i];
        bam1_t* aln=p->b;
        uint32_t *cigar = bam1_cigar(aln);
        //const uint8_t* seq=bam1_seq(aln);
        int read_seq_index=0;
        uint32_t genomic_pos=aln->core.pos;
        uint32_t  cigaridx;
        for (cigaridx = 0; cigaridx < aln->core.n_cigar; ++cigaridx)
            {
            int k;
            int op = cigar[cigaridx] & BAM_CIGAR_MASK;
            int n_op= cigar[cigaridx] >> BAM_CIGAR_SHIFT;
            for(k=0;k< n_op;++k)
            {
            switch(op)
                {
                case BAM_CMATCH:
                {
                if(genomic_pos==pos)
                    {
                    char base=bam_nt16_rev_table[bam1_seqi(bam1_seq(aln),read_seq_index)];
                    fprintf(stdout,"%d\t%s[%d]\t%c\n",
                        genomic_pos+1,
                        bam1_qname(aln),
                        read_seq_index+1,
                        base

                        );
                    }
                ++read_seq_index;
                ++genomic_pos;
                break;
                }
                //deletion from the reference
                case BAM_CDEL:
                {
                ++genomic_pos;
                break;
                }
                //insertion to the reference
                case BAM_CHARD_CLIP://?
                case BAM_CPAD://?
                case BAM_CSOFT_CLIP:
                case BAM_CINS:
                {
                ++read_seq_index;
                break;
                }
                default:
                {
                fprintf(stderr,"#cigar::op not handled: %d\n",op) ;
                break;
                }
                }
            }
            }
        }
        return 0;
        }

int main(int argc,char** argv)
    {
    int ref,beg,end;
    bam_plbuf_t* buf;
    in= bam_open(argv[1],"rb");
    if(in==NULL) exit(-1); 

    header= bam_header_read(in);
    if(header==NULL) exit(-1);
    bam_init_header_hash(header);
    idx = bam_index_load(argv[1]);
    if(idx==NULL) exit(-1);

    bam_parse_region(header, argv[2], &ref, &beg, &end); // parse the region
    if (ref < 0) exit(-1);
    buf= bam_plbuf_init( scan_all_genome_func,NULL);
    if(buf==NULL) exit(-1);
    bam_fetch(in, idx, ref, beg, end, buf, fetch_func);
    bam_plbuf_push(0, buf); // finalize pileup
    bam_plbuf_destroy(buf);
    bam_header_destroy(header);
    bam_index_destroy(idx);
    bam_close(in);
    return EXIT_SUCCESS;
    }
ADD COMMENT
2
Entering edit mode
ADD REPLY
0
Entering edit mode

ah, and my code doesn't print the gaps.

ADD REPLY
1
Entering edit mode
4.0 years ago
nhaus ▴ 430

I know this post is really old, but in case someone is still looking, there is a tool called bam-readcount which should do exactly what you want.

Simply run bam-readcount your_bam -l site_list where site_list contains the regions of interest and has the format chr:start-stop (1 based and tab seperated)

ADD COMMENT
0
Entering edit mode
4.2 years ago

As with almost every life problem, this is one that can [surprisingly] be handled by mpileup:

cat rsBED/rs2857845.bed 
chr4    3026385 3026386

bcftools mpileup --regions-file rsBED/rs2857845.bed \
  --annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
  input.bam \
  --fasta-ref refdata-GRCh38-2.1.0/fasta/genome.fa

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  CH01848
chr4    3026386 .   A   T,<*>   0   .   DP=20;ADF=6,6,0;ADR=2,4,0;AD=8,10,0;I16=6,2,6,4,295,11299,365,13851,480,28800,600,36000,183,4439,203,4917;QS=0.44697,0.55303,0;VDB=0.558487;SGB=-0.670168;RPB=0.906029;MQB=1;MQSB=1;BQB=0.991158;MQ0F=0 PL:DP:SP:ADF:ADR:AD 224,0,174,248,204,255:18:2:6,6,0:2,4,0:8,10,0

Let's pipe it into bcftools query to tidy it up nicely:

bcftools mpileup --regions-file rsBED/rs2857845.bed \
  --annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
  input.bam \
  --fasta-ref refdata-GRCh38-2.1.0/fasta/genome.fa |\
  bcftools query --format '%CHROM\t%POS\t%REF\t%ALT[\t%AD]\n'

chr4    3026386 A   T,<*>   8,10,0

We don't even need the [really] biased Buffalo New York reference genome - just add --no-reference to bcftools mpileup

bcftools mpileup --regions-file rsBED/rs2857845.bed \
  --annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
  input.bam \
  --no-reference |\
  bcftools query --format '%CHROM\t%POS\t%ALT[\t%AD]\n'

chr4    3026386 T,A,<*> 0,11,8,0

Be also weary of filter flags, such as:

-q, --min-MQ INT        skip alignments with mapQ smaller than INT [0]
-Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]

Kevin

ADD COMMENT
0
Entering edit mode

Like always, thanks for the heads-up, Kevin. I've just come across your post, and I have it working. However, would you mind explaining the output a bit further? For troubleshooting, I set the minimum base quality to 0. For my formatting I use '%CHROM\t%POS\t%REF\t%DP\t%ALT[\t%AD]\n', and I end up with the following:

chr1 18945 A 1279 G,T,<*> 0,1278,1,0

How would I interpret this? Naively, it makes sense to me that I have 1278 reads supporting G, and 1 read supporting T. What, then, are the two flanking zeroes? (0,1278,1,0)

Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1879 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6