Question: Extract Base Based On Position From Bam File
gravatar for Empyrean
7.7 years ago by
Empyrean160 wrote:

I have a bam file and i wanted to extract the following information.

For a read in a bamfile, can i extract the base based on the position? For example, if i say i need the base at 100th position and it should output whether its A/T/G/C. Can i get this using samtools or bamtools??

samtools mpileup • 12k views
ADD COMMENTlink modified 3.4 years ago by stephanieP10 • written 7.7 years ago by Empyrean160

Did you managed to find a way to recover specific bases from your reads based on position from bam file ? I have the same problem ! Thx

ADD REPLYlink written 3.4 years ago by stephanieP10

Chris' solution is neat! However, could you explain why you'd want to extract bases at the same location from all the reads, given that they have different start and end mapped positions?

ADD REPLYlink written 7.7 years ago by Arun2.3k

thank you but this not what i need. I guess i should have made my question clear..sorry about that.. So once the read maps to the reference, i wanted to give input a position based on the reference say 100th position on reference, i need all bases from the reads which are mapping at that particular position including the readname. if i use the solution below, its printing the 100th base on the reads which is not equal to what i need because if there are indels in reference or reads, then the position of bases changes completely. So is there anyway to get what i need? i tried with mpileup. Its printing out all the bases at that positions but not the readnames.

ADD REPLYlink written 7.7 years ago by Empyrean160

Can you explain why you need the read name and not just the list of bases via mpileup?

ADD REPLYlink written 7.7 years ago by Chris Miller21k

ok sure.. I am working on a polyploid organism which has two genomes in it. goal is to seperate them by identifying the markers. So when i identify the snps between them i could not find good heterozygous snps between them. So i wanted to get a haplotype for the read. for that, i use the snps information and build this. As i have known the snp positions, i would like to get bases on other reads as well . this is amplicon sequencing so the read length is around 380bp. this is one part to seperate two genomes in a genotype. Also i would like to seperate two genotypes by using this. I rename reads according to specific genotype and once i get the haplotype information for the reads, i can seperate two genotypes easily based on read names.

i tried several haplotype callers but they doesnt seem to solve my problem. So i thought of starting from base i.e bam file to get such information out.. hope i am clear with what i need. Can you suggest a better approach for this?

ADD REPLYlink written 7.7 years ago by Empyrean160
gravatar for Chris Miller
7.7 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

Here's a one-liner that will do it:

samtools view your.bam | awk '{print substr( $10, 100, 1)}'
ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by Chris Miller21k

thank you chris, your solution is perfect for what i have asked but when i saw the output, i realized that the position i need is not with respect to read, but it is based on reference. i have commented to Arun above in a bit detail.. can you suggest any solution for this?

ADD REPLYlink written 7.7 years ago by Empyrean160
gravatar for Johan
7.5 years ago by
Johan870 wrote:

I ran across a problem similar to yours, and when googling to find a solution I found this thread. Since Pierre's solution did not work for me I wrote a GATKWalker of my own to solve the issue.

You run it like this:

java -jar GenomeAnalysisTK.jar -T GetBasesAtPosition --intervalToGet ref:39-41 -R [reference path] -I [bam path] -o [path to output file]

And it will generate output that looks like this:

read1      TTT
read2      TAT
read3       TTG
read4     TCT

It's available in my fork of gatk at github. Clone it from there, and then build it and you should be able to run it:

git clone <> -b devel gatk
cd gatk

Is it something like this you were looking for? I realize that this a old thread, but I still think that I would leave this answer for completeness sake.

ADD COMMENTlink modified 7.5 years ago by Istvan Albert ♦♦ 82k • written 7.5 years ago by Johan870

You have to remove the "<>" around the git url, Biostars seems to add it automatically, and I don't know how to remove them.

ADD REPLYlink written 7.5 years ago by Johan870

Thank you for your reply. But i am not looking to get for a interval. you can read my comment from above.. where i am not just looking for bases from read. I am looking for bases after mapping to reference i.e it has to take gaps in to consideration and get as output.. I have some specific snp positions and like to get bases on reads for those positions.. hope i havent confused you..

ADD REPLYlink modified 7.5 years ago • written 7.5 years ago by Empyrean160

I got it that you wanted the bases relative to the reference, and that is what this will give you. so if you specify --intervalToGet chr1:1-3, you would get the first three bases from chromosome 1. However, it won't be able to handle gaps. I don't think that it should be to much work to check if it is a gap on that position by looking at the cigar string, but I don't have the time for it at the moment. Feel free to fork the code and do it your self how ever. :) Hope you find a solution.

ADD REPLYlink written 7.5 years ago by Johan870

thank you johan, also can i give multiple locations? like instead of chr1:1-3, i wanted the bases for following positions on chr:1 - 1, 10, 15 , 20 , 25 , 30. how can i get this information?

ADD REPLYlink written 7.5 years ago by Empyrean160

It's not possible in the current implementation, I'm afraid. But I might fix that later on, since I think I might need it myself. But right now you have to run it once for each locus, which isn't very practical if you have a lot of loci.

ADD REPLYlink written 7.5 years ago by Johan870

I've been looking at this walker since I needed exactly what is described here; however I'm slightly worried about the correctness of the output.

Can you indicate if your walker handles orientation and indels correctly?

If I count the base frequencies from pileup or IGV and compare these to the output of the walker, the latter values can be way off.

ADD REPLYlink written 6.7 years ago by i.j.nijman0

I'm aware of the problem and have sent you a PM.

ADD REPLYlink written 6.7 years ago by Johan870
gravatar for Jorge Amigo
7.3 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

there is a much simpler solution than all the ones showed here, which uses GATK's UnifiedGenotiper to do the base calling in all the wanted sites with the aid of the EMIT_ALL_SITES capability. the wanted sites should be previously defined in a bed file with the following format:

chrXXX  PositionOfInterest-1  PositionOfInterest

the final command would be something like this:

java -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper \
  -R human_hg19.fa \
  -D dbsnp_135.hg19.vcf \
  -I input.bam \
  -o out.vcf \
  --intervals positions.bed \
  --output_mode EMIT_ALL_SITES

entering the dbsnp file is completely optional, but it will output the rs code if a particular position has it, which is something very useful for us. I've just tested this command very recently because we are about to start using a 27 SNPs Sequenom multplex as quality control and sample identification for all the samples to be run in our sequencer, and it takes only a couple of seconds to run on an entire 10-15GB BAM exome file.

ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Jorge Amigo11k
gravatar for Pierre Lindenbaum
7.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum126k wrote:

Ok here is an ugly (global variables...) C program that owill utput the information you need. You need to walk over the CIGAR string for each read. Please, not that I'm not sure about to handle the cigar operations involving a 'CLIP', feel free to correct my code. Compilation:

gcc prog.c -I ./samtools-0.1.18 -L samtools-0.1.18 -lbam -lz


$ ./a.out  toy.bam ref:7-10 
7   r001[1] T
8   r001[2] T
9   r001[3] A
9   r002[4] A
9   r003[6] A
10  r001[4] G
10  r002[5] G
10  r003[7] N
11  r001[5] A
11  r002[6] A
11  r003[8] N
12  r001[6] T
12  r002[7] T
12  r003[9] N
13  r001[7] A
13  r002[8] A
13  r003[10]    N
14  r001[8] A
14  r002[9] A
14  r003[11]    N
15  r001[13]    G
15  r002[14]    T
16  r001[14]    A
16  r002[15]    A
17  r001[15]    T
17  r002[16]    A
18  r001[16]    A
18  r002[17]    A
20  r001[17]    C
21  r001[18]    T
22  r001[19]    G

Source: #include <stdio.h> #include <stdlib.h> #include <string.h> #include "bam.h" #include "sam.h"

static bamFile in=0;
static bam_header_t* header=0;
static bam_index_t *idx=0;

// callback for bam_fetch()
static int fetch_func(const bam1_t *b, void *data)
    bam_plbuf_t *buf = (bam_plbuf_t*)data;
    bam_plbuf_push(b, buf);
    return 0;

static int  scan_all_genome_func(uint32_t tid, uint32_t pos, int depth, const bam_pileup1_t *pl, void *data)
        int i;   
        for( i=0;i<depth ;++i)
        const bam_pileup1_t *p=&pl[i];
        bam1_t* aln=p->b;
        uint32_t *cigar = bam1_cigar(aln);
        //const uint8_t* seq=bam1_seq(aln);
        int read_seq_index=0;
        uint32_t genomic_pos=aln->core.pos;
        uint32_t  cigaridx;
        for (cigaridx = 0; cigaridx < aln->core.n_cigar; ++cigaridx)
            int k;
            int op = cigar[cigaridx] & BAM_CIGAR_MASK;
            int n_op= cigar[cigaridx] >> BAM_CIGAR_SHIFT;
            for(k=0;k< n_op;++k)
                case BAM_CMATCH:
                    char base=bam_nt16_rev_table[bam1_seqi(bam1_seq(aln),read_seq_index)];

                //deletion from the reference
                case BAM_CDEL:
                //insertion to the reference
                case BAM_CHARD_CLIP://?
                case BAM_CPAD://?
                case BAM_CSOFT_CLIP:
                case BAM_CINS:
                fprintf(stderr,"#cigar::op not handled: %d\n",op) ;
        return 0;

int main(int argc,char** argv)
    int ref,beg,end;
    bam_plbuf_t* buf;
    in= bam_open(argv[1],"rb");
    if(in==NULL) exit(-1); 

    header= bam_header_read(in);
    if(header==NULL) exit(-1);
    idx = bam_index_load(argv[1]);
    if(idx==NULL) exit(-1);

    bam_parse_region(header, argv[2], &ref,  &beg, &end); // parse the region
    if (ref < 0) exit(-1);
    buf= bam_plbuf_init( scan_all_genome_func,NULL);
    if(buf==NULL) exit(-1);
    bam_fetch(in, idx, ref, beg, end, buf, fetch_func);
    bam_plbuf_push(0, buf); // finalize pileup
    return EXIT_SUCCESS;
ADD COMMENTlink written 7.7 years ago by Pierre Lindenbaum126k

5 years later:

ADD REPLYlink written 3.4 years ago by Pierre Lindenbaum126k

ah, and my code doesn't print the gaps.

ADD REPLYlink written 7.7 years ago by Pierre Lindenbaum126k
gravatar for Zev.Kronenberg
7.7 years ago by
United States
Zev.Kronenberg11k wrote:

samtools view my.bam chr1:100

ADD COMMENTlink written 7.7 years ago by Zev.Kronenberg11k

thank you for the answer. But this doesnt help. because when i give this command as input, it again prints the bam file at that positions but not the bases. For example, i have few positions where i got the SNPS, lets say 5, 10 ,15 20 and not all reads will have this variation right !! so for read 1 at 5, 10,15,20 - ATGC , Read 2 : ATCG , Read3 : AACG, Read4: AACC .. like that.. i have some 15- 20 positions and i have to extract all the bases on all reads for those positions.

ADD REPLYlink written 7.7 years ago by Empyrean160

Its not very simple as it looks like. Using this command, i will get the bases on the reads just by giving that position, but i want it according to reference. please read my comments under my post above. that might give you better idea of what i am looking for and help me if you figure out a way to do it.. so in simple if i have an indel in a read for example from pos 5-10 (5bases) and i wanted to extract the 11th base.

Technically using the above command i will get 11th base on read but when it maps to reference, the 11th base on reference will be 11+5 i.e 16th base pair on the read as we have 5bp indel in between the read. So it should take gaps, mismatches, indels in to account for calculating that base.

Hope this is clear.. i have been searching for this and to my surprise no program is able to achieve this :(

ADD REPLYlink modified 7.5 years ago • written 7.5 years ago by Empyrean160

did you find a solution ? I have the same problem !

ADD REPLYlink written 3.4 years ago by stephanieP10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2151 users visited in the last hour