Question

Get sequences from a BAM file as strings spaced out as described by cigar

0

Entering edit mode

6.7 years ago

yesitsjess • 0

Hi all. This is a very over simplified example but hopefully someone will be able to help.

I have a file amp.fa (e.g. >amp\nGAACATAAGCC) which I've run bwa index and samtools faidx on.

I've then aligned all my fastq read files (${f}) to this index (${a}):

bwa mem ${a} ${f}_L001_R1_001.fastq ${f}_L001_R2_001.fastq > ${f}.sam
samtools view -bS ${f}.sam > ${f}.bam
samtools sort ${f}.bam > ${f}.sorted.bam
samtools index ${f}.sorted.bam
samtools mpileup -f ${a} ${f}.sorted.bam > ${f}.mpileup

and basically all I want to do now is, with respect to my reference (GAACATAAGCC) calculate the percentages of different reads mapping the the middle section (CATAA) when that central T is either a T or a C. There are other changes and I want to see if they co occur with the T->C change and what % do.

From visual inspection in IGV I can see when reads are still wildtype (T) then there's a good chance there's a insertion just before the CATAA section. When the reads are mutant (C) the C and A of CATAA are often, but not always, also mutated.

Ideally I'd like to read a string aligned with regards to the read's cigar into R and work with it in there. If I could convert the sequence to a matrix of ncol=maximum read length and nrow=number of reads, where position 6 always means that base aligned to GAACATAAGCC then I could definitely figure out the rest myself.

I've been looking at the GenomicAlignments package in R and hoped that stackStringsFromBam or sequenceLayer functions would help to do this, but I think I must've misunderstood the documentation. Does anyone know how to do this or have suggestions of how to do it in a different way? Perhaps MSA using BAM?

TLDR: I'm interested as to whether a particular SNP cooccurs or occurs independently of various other genomic changes (two further SNPs, two indels) for each read in my fastq files

alignment bam r genomicalignments cigar • 2.1k views

ADD COMMENT • link updated 6.7 years ago by Matt Shirley 10k • written 6.7 years ago by yesitsjess • 0

score 2 · Answer 1 · 2018-11-06

2

Entering edit mode

6.7 years ago

Matt Shirley 10k

You're looking for a gapped representation of the alignment. I'm not sure how to accomplish this in R, but I addressed this issue in the python simplesam package, where you can call the gapped method of a Sam object to get the sequence represented as it's aligned to the reference, including removing insertions, adding gap characters for deletions, and removing hard or soft clipped regions.

You might also try working with the "pileup" output from samtools. This will give you a reference-aligned, row and column representation of the base calls at any genomic position.

ADD COMMENT • link 6.7 years ago by Matt Shirley 10k

0

Entering edit mode

Thank you Matt. I'm not massively comfortable with python but I'll take a look at simplesam

Regarding pileup, that would be absolutely perfect if the read results column were ordered by unique read. In that case I could split the string into a matrix with rows=unique positions and columns=unique reads. However since the read results entries all have different lengths and there's no character to mean "not covered" I don't think this can be the case. But please let me know if I'm wrong :)

ADD REPLY • link 6.7 years ago by yesitsjess • 0