Question

Extract reads from BAM file with known variant

0

21 months ago

pablo ▴ 300

Hello,

I have a BAM file from which I'd like to extract the reads that contain one specific variant at a specific position. I used:

samtools mpileup -uf ref.fa my.bam -v -r chr7:151490948-151490948

In IGV, I can see 1681 reads carrying the "A" variant (cf. IGV image below). I would like to extract the reads containing this variant. The command above gives back:

#CHROM POS      ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  my.bam
chr7      151490948       .       T       A,C,<*> 0       .       DP=8074;I16=1395,5186,333,1160,1.67816e+06,4.2793e+08,380715,9.70823e+07,45921,2.31924e+06,9838,499526,164525,4.11312e+06,37325,933125;QS=0.82102,0.178825,0.000155298,0;VDB=1.02295e-43;SGB=-0.693147;RPB=0.78422;MQB=0.997427;MQSB=1.49326e-36;BQB=1;MQ0F=0.00606886  PL      0,99,255,255,255,255,255,255,255,255

I can see the A variants but also the C variants. I know the option --output-QNAME extracts the read names. In my case, I want to extract only the reads with the "A" variant, not those with the "C" variant.

I tried :

samtools mpileup --output-QNAME  -f ref.fa my.bam  -r CM000669.2:151490948-151490948 |awk '{print $NF}' | sed 's/,/\n/g' | wc -l

I get 8157 matching reads. Do these reads correspond to the reference and the variant (A or C) alleles?

Any help to extract my 1681 variant "A" reads?

bam variant samtools • 4.1k views

ADD COMMENT • link 21 months ago by pablo ▴ 300

0

How did you find the variant? Show us the IGV screenshot.

ADD REPLY • link 21 months ago by Pierre Lindenbaum 161k

0

You could use Python for this task. Here are some simple steps: open the file, read it line by line, add an "if" condition for the variant you need, store the matching records in a list, and print the list. That's all. I hope this helps.
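As a rough sketch of that line-by-line idea applied to this question: the output of samtools mpileup --output-QNAME can be parsed so that each base call is paired with its read name, and only the reads showing the wanted allele are kept. The function names below (split_pileup_calls, reads_with_allele) are made up for illustration. The sketch assumes a single-sample pileup line produced with -f (so reference matches appear as '.' or ','), with the comma-separated read names in the last column and no commas inside read names.

```python
import re

_DIGITS = re.compile(r"\d+")

def split_pileup_calls(bases):
    """Return one base symbol per read from an mpileup bases column."""
    calls = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # read-start marker, followed by one mapping-quality character
            i += 2
        elif c == "$":
            # read-end marker
            i += 1
        elif c in "+-":
            # indel: sign, length digits, then that many inserted/deleted bases
            m = _DIGITS.match(bases, i + 1)
            i = m.end() + int(m.group())
        else:
            # '.', ',', a mismatch letter, or a deletion/skip placeholder
            calls.append(c)
            i += 1
    return calls

def reads_with_allele(pileup_line, allele):
    """Names of reads whose base at this position equals `allele`."""
    fields = pileup_line.rstrip("\n").split("\t")
    calls = split_pileup_calls(fields[4])
    qnames = fields[-1].split(",")
    if len(calls) != len(qnames):
        raise ValueError("base calls and read names do not line up")
    # mismatches are upper case on the forward strand, lower case on the reverse
    return [q for c, q in zip(calls, qnames) if c.upper() == allele.upper()]

# toy pileup line (invented data): five reads, two of which show an A
line = "chr7\t151490948\tT\t5\t.,A^]a-2ACc$\tIIIII\tr1,r2,r3,r4,r5"
print(reads_with_allele(line, "A"))  # ['r3', 'r4']
```

On real data, each line of the mpileup output would be passed to reads_with_allele(line, "A"). mpileup applies its own depth and base-quality filters, so the number of reads it reports may differ from what IGV displays.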

ADD REPLY • link 21 months ago by Ernest Bonat ▴ 10

1

this doesn't explain why OP cannot find a read carrying the variant with samtools.

ADD REPLY • link 21 months ago by Pierre Lindenbaum 161k

0

This user seems to be pushing python on multiple threads.

ADD REPLY • link 21 months ago by Ram 43k

0

"but there is any output, whereas"

what does that mean ?

are you sure it's 'Chr19' and not 'chr19' or '19' ?

ADD REPLY • link 21 months ago by Pierre Lindenbaum 161k

0

[Image: IGV screenshot]

I expect a T/A variant (with 1681 reads matching the A variant), as you can see in the IGV window. I would like to extract the IDs of those 1681 reads.

ADD REPLY • link 21 months ago by pablo ▴ 300

0

I am sure of the nomenclature. I added an IGV window for another locus.

ADD REPLY • link 21 months ago by pablo ▴ 300

0

"for another locus"

???

ADD REPLY • link 21 months ago by Pierre Lindenbaum 161k

0

The IGV window I showed is for another locus, which is Chr7:151490948-151490948 (not Chr9:151490946-151490946 as mentioned at first).

ADD REPLY • link 21 months ago by pablo ▴ 300

0

You are not answering the questions: a) show the locus that is relevant, b) confirm that "Chr7" rather than "chr7" is correct.

ADD REPLY • link 21 months ago by ATpoint 82k

0

You're right. It is because I always try to modify my posts a bit so as not to divulge what I work on. Anyway, I edited the post.

ADD REPLY • link 21 months ago by pablo ▴ 300

1

It just makes it harder to help. Do you really think people could threaten your project by seeing a random screenshot, not even knowing who you are and who you work for? Whatever, your choice.

ADD REPLY • link 21 months ago by ATpoint 82k

0

Just right-click the read in IGV and click "copy read details to clipboard", then grep for the read name using something like:

samtools view your_file.bam | grep 'M00119:58:000000000-JV6N9:1:2116:17863:22678'

ADD REPLY • link 21 months ago by benformatics 3.9k

0

This works only for one read right? I need to do it for the 1681 reads..

ADD REPLY • link 21 months ago by pablo ▴ 300

0

start with one read... show us that this read is present in the samtools view output...

ADD REPLY • link 21 months ago by Pierre Lindenbaum 161k

0

In IGV, as I showed with the screenshot, I have 1681 reads with the "A" variant. But when I scroll through all the reads, I can't spot any "A" variant. Is that normal? As a result, I cannot find these reads to use "copy read details to clipboard" on them.

ADD REPLY • link 21 months ago by pablo ▴ 300

0

"But when I scroll all the reads, I can't spot any "A" variant . Is that normal?"

see IGV downsampling reads params https://software.broadinstitute.org/software/igv/Preferences .

ADD REPLY • link 21 months ago by Pierre Lindenbaum 161k

0

Thanks for the doc. I tried adjusting the "Preferences" parameters (I also unchecked "Downsample reads"), but I still can't see the "A" variant. Other mutations are, however, visible, as you can see in the screenshot.

[Image: IGV screenshot]

ADD REPLY • link 21 months ago by pablo ▴ 300

0

Why don't you copy the sequence of the read surrounding the A like +/-10 bp and then grep for the sequence instead of the read name e.g. 'AGAAGATA' etc... make sure it's long enough to be unique

ADD REPLY • link 21 months ago by benformatics 3.9k

0

I already tried. When I grep +/- 10 bp around the variant, I only get 39 BAM sequences, which is far lower than the 1681 reads expected from the IGV screenshot.

ADD REPLY • link 21 months ago by pablo ▴ 300

0

I installed the latest version of IGV and I am now able to spot the "A" variant on the reads.

[Image: IGV screenshot]

As for "copy read details to clipboard", I can grep the read name in my BAM file. It works for one read:

samtools view my.bam | grep "m64071_220512_054244/103417836/ccs"

ADD REPLY • link 21 months ago by pablo ▴ 300

0

ok i can write you a script to identify these if that's all you need... are you able to use R?

ADD REPLY • link 21 months ago by benformatics 3.9k

0

Indeed, I am able to use R.

ADD REPLY • link 21 months ago by pablo ▴ 300

score 0 · Answer 1 · 2022-07-28

0

21 months ago

benformatics 3.9k

This should extract reads from an indexed BAM file at a location of interest and write them to a valid SAM file. Modify the variables in the script to match your specific variant.

library(GenomicAlignments)
## your bam file
mybam <- 'your_file.bam'
## the variant you want to extract reads for
##location of variant chromosome:position (make sure to check if your genome is 1 or chr1 style)
varposition <- '10:89720633'
## allele of variant
var <- 'T'
## convert to GRanges
var.gr <- GRanges(varposition)

## read your BAM file
aln <- readGAlignments(mybam,param=ScanBamParam(which=var.gr,what=c('qname','strand','seq')))
## find variants at location! WATCH OUT MAKE SURE YOUR VARIANT IS BASED ON THE plus strand
aln.seq <- stackStringsFromGAlignments(aln,region=var.gr)
## extract read names with variant
reads <- mcols(aln[aln.seq == var])$qname
## print them
print(reads)
## save to file - you can change the name of this file if you want it will be written to your current working dir
write(reads,'pattern_file.txt')

## now you can do this on the command line to make a subset sam file
## header copy
print(paste0('samtools view -H ',normalizePath(mybam),' > ',gsub('\\.bam$','_subset.sam',mybam)))
## extract relevant reads
print(paste0("samtools view ",normalizePath(mybam)," '",varposition,"' | grep -f pattern_file.txt - >> ",gsub("\\.bam$","_subset.sam",mybam)))
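To make the lookup performed by the script above more concrete, here is a minimal Python sketch of the underlying idea: walk a read's CIGAR string to find which base of the stored sequence is aligned to a given reference position. It is only an illustration, not how GenomicAlignments implements it, and the function name read_base_at is hypothetical. SEQ in SAM/BAM is stored relative to the forward reference strand, which is why the allele has to be given on the plus strand.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def read_base_at(pos, cigar, seq, target):
    """Base of the read aligned to 1-based reference position `target`.

    pos is the 1-based leftmost aligned position (SAM POS field).
    Returns '-' if the read has a deletion or skip there, and None if
    the read does not cover the position at all.
    """
    ref = pos  # next reference position to be consumed
    qry = 0    # next index into seq
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":      # consumes both read and reference
            if ref <= target < ref + length:
                return seq[qry + (target - ref)]
            ref += length
            qry += length
        elif op in "IS":     # consumes read only
            qry += length
        elif op in "DN":     # consumes reference only
            if ref <= target < ref + length:
                return "-"
            ref += length
        # H and P consume neither read nor reference
    return None

# toy records (invented data)
print(read_base_at(100, "5M", "ACGTA", 102))           # G
print(read_base_at(100, "2S3M1I2M", "TTACGTAC", 103))  # A
print(read_base_at(100, "2M2D2M", "ACGT", 102))        # -
```

A read would then be kept when read_base_at(...) returns the allele of interest. Records without a stored sequence (SEQ given as '*') cannot be checked this way and would need to be skipped first.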

ADD COMMENT • link 21 months ago by benformatics 3.9k

0

Here is the output of the last two print statements for me:

samtools view -H /home/biostars/my.bam > my_subset.sam
samtools view /home/biostars/my.bam '10:89720633' | grep -f pattern_file.txt - >> my_subset.sam
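One caveat with grep -f: each line of pattern_file.txt is treated as a pattern that may match anywhere in a SAM line, so a read name that is a prefix of another read name could pull in extra records. A hedged alternative is to compare the QNAME column exactly, sketched below in Python on SAM text lines. The function name filter_sam_by_qname is hypothetical and the records are invented. Recent samtools releases also document a --qname-file (-N) option for samtools view; check the manual of your installed version before relying on it.

```python
def filter_sam_by_qname(sam_lines, wanted):
    """Yield header lines plus records whose QNAME is exactly in `wanted`."""
    wanted = set(wanted)
    for line in sam_lines:
        if line.startswith("@"):
            # header lines are passed through unchanged
            yield line
        elif line.split("\t", 1)[0] in wanted:
            # first tab-separated field of a SAM record is the read name
            yield line

# toy SAM content (invented data): 'read/1' is a prefix of 'read/10'
sam = [
    "@HD\tVN:1.6\n",
    "read/1\t0\tchr7\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read/10\t0\tchr7\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
]
kept = list(filter_sam_by_qname(sam, ["read/1"]))
print(len(kept))  # 2: the header line and the 'read/1' record
```

A plain substring match on "read/1" would also have returned the "read/10" record.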

ADD REPLY • link 21 months ago by benformatics 3.9k

0

Also the strand param I had originally added isn't really used but you could probably fix the function up to take the actual variant (in the mRNA) and flip the sequence/strand for the reads when needed instead of being forced to use the '+' strand variant definition.

ADD REPLY • link 21 months ago by benformatics 3.9k

0

Thanks a lot for your reply. But I got this error message :

> aln.seq <- stackStringsFromGAlignments(aln,region=var.gr)
   Error in .normarg_at2(at, x) :
   some ranges in 'at' are off-limits with respect to their corresponding
   sequence in 'x'

Could it be because I work with long PacBio reads? My BAM file is OK, I guess.

ADD REPLY • link 21 months ago by pablo ▴ 300

0

what does this give you? width(mcols(aln)$seq)

ADD REPLY • link 21 months ago by benformatics 3.9k

0

This gives me :

width(mcols(aln)$seq)

 [1] 1689    0    0 1563 1613 1601 1602 1614 1614 1614 1614 1614 1745 1476
 [15] 1478 1476 1408 1407 1405 1405  927  926  927  976  622  612  758 2030
 [29] 2030 2021 2081 1967 1967 1163 1692 1135 1135 1369 1369 1369  918  921
 [43]  684  972  924 1248 1265 1253 1254 1251 1254 1248 1249    0    0    0
 [57]    0    0    0    0    0    0    0    0    0    0    0    0    0    0
 [71]    0    0    0    0    0  746  747  746  746  746 2406 2129 1927 1922
 [85] 1918 1923 1919 1925 1928 1922 1976 1941 2029 2070 2071 1915 1943 2061
 [99] 1944 2576 2049 1916 2041 1941 1918 1942 1947 1941 1823 2048 2092 1923
 [113] 2046 1823 2053 1946 2048 2104 2089 1827 1825 1736 1699 1698 2167 1983
 [127] 1701 1802 2162 1701 1977 1980 2164 1978 1980 1976 2167 1701 2170 1738
    ...
 [19979]  971 1234  728  733  804  703  506  825  757  474 1128  204  203  611
 [19993]  611  854    0    0    0    0  637  637  627    0    0    0    0    0
 [20007]    0    0  629    0    0    0    0    0  626  653    0  576  421    0
 [20021]    0  580  589  363    0    0  509  615    0  616    0  613  601

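I removed the records for which width(mcols(aln)$seq) returns 0. From my initial BAM file, I removed those alignments with the command below (note that -q 1 filters on mapping quality, keeping alignments with MAPQ >= 1, rather than on sequence length):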

samtools view -hbq 1 my.bam  > filtered.my.bam

ADD REPLY • link 21 months ago by pablo ▴ 300

0

So did it work after that? You must have done aln <- aln[width(mcols(aln)$seq) > 0]

ADD REPLY • link 21 months ago by benformatics 3.9k

1

Both ways work: either with aln <- aln[width(mcols(aln)$seq) > 0] or with samtools view -hbq 1 my.bam > filtered.my.bam
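A possible explanation for why both approaches behave the same here: a width of 0 corresponds to a record whose SEQ field is not stored ('*' in SAM text). Some aligners write secondary alignments this way, and such records often have MAPQ 0, so a MAPQ filter happens to drop them too. That is an assumption about how this BAM was produced, not something confirmed in the thread. The Python sketch below (hypothetical function name, invented records) separates records with and without a stored sequence so that their flags and mapping qualities can be inspected.

```python
def split_by_stored_seq(sam_lines):
    """Split SAM records into (with_seq, without_seq) lists of fields."""
    with_seq, without_seq = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        # column 10 (index 9) is SEQ; '*' means the sequence is not stored
        if fields[9] == "*":
            without_seq.append(fields)
        else:
            with_seq.append(fields)
    return with_seq, without_seq

# toy SAM content (invented data): one primary and one secondary alignment
sam = [
    "@HD\tVN:1.6\n",
    "readA\t0\tchr7\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "readA\t256\tchr7\t500\t0\t4M\t*\t0\t0\t*\t*\n",
]
with_seq, without_seq = split_by_stored_seq(sam)
for f in without_seq:
    is_secondary = bool(int(f[1]) & 0x100)  # FLAG bit 0x100 marks secondary
    print(f[0], "secondary:", is_secondary, "MAPQ:", f[4])
```

If the records without a sequence turn out not to be MAPQ 0 in a given file, filtering on the sequence width directly (as in the R one-liner above) is the safer of the two options.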

ADD REPLY • link 21 months ago by pablo ▴ 300