Question: Extracting Reads Containing A Specific Variant From A Bam File
1
gravatar for Nick Stoler
3.9 years ago by
Nick Stoler60
Penn State
Nick Stoler60 wrote:

Perhaps my Google-fu is failing me, because this doesn't seem like an uncommon need, but I can't find any answer here.

Say I have a SAM/BAM file, and a variant at a known location. I'd like to extract all the alignments from the BAM file which contain that variant. That is, I'd like the ALT, not the REF allele. I do not want all the reads at that location. Is there an existing tool or workflow out there that can accomplish this?

I would honestly be surprised if there wasn't, since I would imagine it's a common situation to be interested in a certain variant and investigate the reads supporting it. Specifically, I'd like to make sure there's no bias in where along the read it falls, or the strand of the read. I should note that my coverage is very high, so it's not feasible to extract all the reads covering the site and manually sort them out.

I've already started a script that does this by parsing CIGAR strings, but I wanted to check that it doesn't already exist before going further down this road.

cigar format sam bam • 4.8k views
ADD COMMENTlink modified 10 days ago by Biostar ♦♦ 20 • written 3.9 years ago by Nick Stoler60

Maybe something like this:

samtools view foo.bam | grep -E 'pattern'

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Phil S.630
1

@PhilS: your solution won't help parsing the cigar string.

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum99k

Sry I thought you need the whole information of the read given by a bam file. If you need the cigar string only you can do it like:

samtools view foo.bam1:1660404-1660404 | awk '{print $6"\t"$10'} | grep 'pattern' | cut -f 1

this will give you the cigar strings of all reads containing 'pattern'

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Phil S.630
1

no, you have to walk over the cigar string to see if the read contains or not the variation. You cannot use a regex for this.

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum99k
1
gravatar for Pierre Lindenbaum
3.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum99k wrote:

My previous answer might help: position of mismatches per read from a sam/bam file : you'll get the content of each read (as XML) for each base.

ADD COMMENTlink written 3.9 years ago by Pierre Lindenbaum99k

Thanks! So it looks like that can get the information I need. It's a custom script like I've been writing (except certainly much nicer and faster), instead of an existing tool, so I'll take that as an answer to whether I'm reinventing the wheel.

ADD REPLYlink written 3.9 years ago by Nick Stoler60
1
gravatar for Rm
3.9 years ago by
Rm7.5k
Danville, PA
Rm7.5k wrote:

See if this oneliner helps: For example: at given position say "1:1660404" get reads with Variant from the bam file:

samtools view -b INPUT.bam 1:1660404-1660404 | samtools fillmd -e - REFERENCE.fasta | grep -v "^@"| awk -v pos="1660404" 'BEGIN {OFS = FS = "\t" } ; {n=split($10,a,"") ; if(a[(pos-$4)+1] != "=" ) print pos,(pos-$4)+1, a[(pos-$4)+1], $1, $4, $10 }'

Potential output:

Position Distance_from_read_start Variant READ_NAME Read_Start READ

    1660404 31      G       HI-SW1050R:121:D2C53BCXX:8:2211:2278:38508      1660374 ==============================G===================================================
    1660404 30      G       HI-SW1050R:121:D2C53BCXX:8:2308:3759:48204      1660375 =============================G====================================================
    1660404 25      G       HI-SW1050R:121:D2C53BCXX:2:2208:13607:73456     1660380 =======C================G=========================================================
    1660404 25      G       HI-SW1050R:121:D2C53BCXX:6:2201:3216:84210      1660380 =======C================G=========================================================
    1660404 25      G       HI-SW1050R:121:D2C53BCXX:8:1316:1264:91116      1660380 =======C================G======N==N===================N=N=========================
    1660404 16      G       HI-SW1050R:121:D2C53BCXX:2:1301:15715:89711     1660389 ===============G==================================================================

Note: READ base changed to '=' if identical to reference base: it doesn't show indels and soft-clipped bases

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Rm7.5k

That's good to know about fillmd's -e option, but unfortunately it seems it doesn't show indels. Also, soft-clipped bases appear just like SNV's.

ADD REPLYlink written 3.9 years ago by Nick Stoler60

@Nick : thats True, I will edit it

ADD REPLYlink written 3.9 years ago by Rm7.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1443 users visited in the last hour