I am fairly new to bioinformatics and currently have issues with a task that seems like it should be fairly simple. The situation is the following:
I have a library of DNA molecules with a variable region surrounded by fixed sequences. What I want is to a) get statistics regarding the distribution of nucleotides in the variable sequence (e.g. calculate a sequence logo) b) possibly find families of sequences in the variable region via clustering.
The library was sequenced with Nanopore sequencing. Aligning the reads was straightforward and gave me a sam file. What I am stuck with is how to extract sequence information from the aligned reads for the specific part of the reference I am interested in.
What I would like to have is the sequences of all individual reads in the region of interest. To illustrate this:
If the reference and the reads are (region of interest is bold)
CAGGCTGATG**ACTCAAAC[del]A**TAGCAAG[del]AG CAGUCTTATG**AATATATTTT**TAGCAAGCAG CCGGCTTATG**GCCCGCG[ins]TTA**TAGCAA[ins]GCAG
then I would like to extract
ACTCAAAC[del]A AATATATTTT GCCCGCG[ins]TTA
I can use "samtools view" to get the reads that cover (part of) the region of interest. This is not helpful, as I only care about the subsequence of the reads that is aligned the region of interest. "samtools mpileup" comes fairly close, but I can only get statistics about individual nucleotides (or do I misunderstand how this works?), not entire read subsequences.
Is there a tool that does what I want?