How can I extract part of the DNA sequences that align to a specific region in the reference from a .bam file?
2
0
Entering edit mode
3 months ago
LDT • 0

Dear all,

I have a bam file from which I want to extract the exact part of DNA sequences that map within the following region 54-78 to my reference .

I have tried so far unsuccessfully bedtools, samtools, custom scripts and I was wondering if there is a quick way to do it with RSamtools or any other tool in R.

samtools command:

samtools faidx Reference_barcodes.fasta "Reference_barcodes:54-78" | grep -v '^>' | tr -d '\n' && echo " REF" && java -Xmx200g -jar /data/biosoftware/sam4weblogo/jvarkit/dist/sam4weblogo.jar  --interval "Reference_barcodes:54-78" -F tabular test_4_seqs_sorted.bam

bedmap --echo --fraction-map 1 <(bam2bed < test_4_seqs_sorted.bam) <(echo -e "Reference_barcodes\t54\t78") > answer.bed


Here is a link to my data which is tiny, only 4 sequences, https://1drv.ms/u/s!ArOzUuixE-mggfskZPX_Y0cpyQH4RA?e=hAa0W3 and this is the desired result that I want to take out

TACGGTTATATTGACAGACCGAGGG
TAATTCGAAACCATAAGGCTCGCAA
TACGGTTATATTGACAGACCGAGGG
TAATTCGAAACCATAAGGCTCGCAA


I am at a desperate point so any serious help and guidance is more than appreciated and I am happy to tip for it if possibly though "Buy me a coffee" or patreon format

bedtools alignment samtools bam sequences • 548 views
1
Entering edit mode

What you are trying to do is not trivial, and I don't think you'll find a simple command to do that (but maybe I'm wrong). If I had to do that, I would write my own script using some library for parsing bam files. You'll need to consider the alignment start position and the CIGAR string, so you can extract matches within the required range. Sorry this is not really an answer, but more of a general direction. Good luck.

1
Entering edit mode
10 weeks ago
LDT • 0

After so long time and many battles I found a solution that works for me Thanks to the Bioconductor people and especially @hpages

library(GenomicAlignments)
stack <- stackStringsFromBam("test_4_seqs_sorted.bam", param=GRanges("Reference_barcodes:54-78"))

0
Entering edit mode
3 months ago
Vincent Laufer ★ 1.8k

Ms., or Mr., LDT - please see: https://pangenome.github.io/

VCF files separate genetic variants from the surrounding sequence in which they are found.

However, these are not the only way to represent and view genomes. Genome graphs ( <---google this) do not do this. It is not a direct answer to your question, but I am guessing that you will love these as much as I do. If, after looking them over, you decide they don't help, let me know and I will addend this answer.

0
Entering edit mode

0
Entering edit mode

OK. ill write one.

what genome build ? 1-based or 0-based?

0
Entering edit mode

Dear Vincent, sorry for my late reply! Forgive me for my ignorance what does genome build mean? 0 or 1

I did a google but I did not get it. Here is an example of my bam file https://1drv.ms/u/s!ArOzUuixE-mggfskZPX_Y0cpyQH4RA?e=hAa0W3 which comes after mapping amplicons to a reference. The organism I am working with is bacteria