Extract from a .bam file the reads that map to specific area in the reference genome
2
0
Entering edit mode
4 months ago
LDT • 0

Dear all,

I have a reference that looks like this

AAAAATTTTTTTNNNNNNGGGGGGCCCCCC

and on this reference reads have been mapped

ref:               AAAAATTTTTTT-**NNNNNN-**GGGGGGCCCCCC

read1:             AAAAATTTTTTT-**ATGCAA**-GGGGGGCCCCCC
....
read2:             AAAAATTTTTTT-**CCGTAA**-GGGGGGCCCCCC

and then I extracted the bam file. I want to extract the sequences that have been mapped the reference in the positions 13:18. For this I have used the following command

samtools view -b -h file.sorted.bam Reference_barcodes:13-18 > test.bam

Now I want from this bam to isolate in a fasta file the exact sequences that have mapped the 13-18 and my data to look like this

ref:  NNNNNN
read1:ATGCAA 
read2:CCGTAA
alignment sequences bam extract • 567 views
ADD COMMENT
0
Entering edit mode

Have you looked at samtools ampliconclip: http://www.htslib.org/doc/samtools-ampliconclip.html You will need a newer version of samtools.

ADD REPLY
0
Entering edit mode

samtools faidx can extract regions of a fasta file

samtools faidx chrom:start-end

http://www.htslib.org/doc/samtools-faidx.html

the extracted file will be also in FASTA so a bit of postprocessing may be needed to fix that (see the concept "linearize FASTA file")

ADD REPLY
1
Entering edit mode
10 weeks ago
LDT • 0

Finally only this gives the answer I want:

library(GenomicAlignments)
stack <- stackStringsFromBam("test_4_seqs_sorted.bam", param=GRanges("Reference_barcodes:54-78"))
ADD COMMENT
1
Entering edit mode
4 months ago

I wrote sam4weblogo see; Sequence Logo For Different Alleles Or Generated From Sam/Bam

$ samtools faidx src/test/resources/rotavirus_rf.fa "RF05:100-200" | grep -v '^>' | tr -d '\n' && echo " REF" && java -jar dist/sam4weblogo.jar --interval "RF05:100-200" -F tabular src/test/resources/S1.bam 
AACTAGGAGCAAATGATGAATGGAGGCCAGCACCAATGACAAAATATAAAGGATGGTGTTTAGATTGTTGTCAATATACAAATTTGACATATTGCAGAGGG REF
100       110       120       130       140       150       160       170       180       190         RF05:100-200
AACTAGGAGCAAATGATGCATGGAGGCCAGCAC-------------------------------------------------------------------- RF05_63_612_2:0:0_0:0:0_16/1 (99) S1
CACTAGGAGCAAATGATGACTGGAGGCCAGCACCA------------------------------------------------------------------ RF05_65_581_3:0:0_3:0:0_22/2 (163) S1
AACTAGGAGCAAATGATGAATGGAGGCCAGCACCAAT---------------------------------------------------------------- RF05_67_593_0:0:0_1:0:0_19/1 (99) S1
AACTAGGAGCAAATGATGAATGGAGGCCAGCACCAATGACAA----------------------------------------------------------- RF05_72_589_0:0:0_2:0:0_1c/1 (99) S1
AACTAGGAGCAAATGATGAATGGAGGCCAGCACGAATGACAACA--------------------------------------------------------- RF05_74_640_2:0:0_1:0:0_3e/1 (99) S1
AACTAGGAGCAAATGATGAATGGAGGCCAGCACCAATGACAAAAAA------------------------------------------------------- RF05_76_583_1:0:0_5:0:0_27/1 (99) S1
----AGGAGCAAATGATGAATGGAGGCCAGCACCAATGACAAAATATCAAGGATGGTGTTTAGATTGTTGTCAA--------------------------- RF05_104_625_1:0:0_0:0:0_24/2 (163) S1
---------------ATGAATGGAGGCCAGCACCAATGACAAAATATAAAGGATGGTGTTTAGATTGTTGTCAATATACAAATTT---------------- RF05_115_585_0:0:0_0:0:0_1/1 (99) S1
--------------------------------CCAATGACAACATATAAAGGATGGTGTTTAGATTGTTGTCAATATACAAATTTGACATATTGCAGAGGG RF05_132_611_1:0:0_2:0:0_3f/2 (163) S1
---------------------------------CCATGACAAAATAAAAAGGATGGTGTTTAGATTGTTGTCAATATACAAATTTGACATATTGCAGAGGG RF05_133_555_2:0:0_0:0:0_2e/1 (99) S1
-------------------------------------GACAAAATATAAAGGATGGTGTTTAGATTGTTGTCAAAATACAAATTTGACATATTGCAGAGGG RF05_137_611_1:0:0_1:0:0_4b/1 (99) S1
---------------------------------------CAAAATATAAAGGATGGTGTTTAGATTGTAGTCAATATACAAATTTGACATATTGCAGAGGG RF05_139_728_1:0:0_0:0:0_12/1 (99) S1
-------------------------------------------ATATAAAGGATGGTGTTTAGATTGTTGTCAATATACAAATTTGACATATTGCAGAGGT RF05_143_731_1:0:0_4:0:0_1e/1 (99) S1
-------------------------------------------------AGGATGGTGTTTAGATTGTTGTCAATATACAAATTTGACATAATGCAGCGGG RF05_149_654_2:0:0_2:0:0_48/1 (99) S1
----------------------------------------------------ATGGTGTTTAGATTGTTGTCAATATACAAATTTGACATATTGCAGAGGG RF05_152_584_0:0:0_0:0:0_2b/1 (99) S1
-----------------------------------------------------------TTAGATTGTTGTCAATATACAAATTAGACATATTGCAGAGGG RF05_159_603_1:0:0_0:0:0_3b/2 (163) S1
---------------------------------------------------------------ATTGTTGTCAATATACAAATTTGACATATTGCAGAGGG RF05_163_622_0:0:0_1:0:0_4d/1 (99) S1
----------------------------------------------------------------------TCAATATACAAATTTGACATATTGCAGAGGG RF05_170_620_0:0:0_1:0:0_45/1 (99) S1
---------------------------------------------------------------------------ATACAAATTTGACATATTGCAGAGGG RF05_175_728_1:0:0_0:0:0_17/1 (99) S1
-------------------------------------------------------------------------------AAATTTGACATATTGCAGAGGG RF05_179_662_3:0:0_0:0:0_13/2 (163) S1
ADD COMMENT
0
Entering edit mode

Dear Pierre,

We have tried this, and it does NOT work as we want. I guess you are very busy to have a look. Here is a link with only four sequences of our data if you have time https://1drv.ms/u/s!ArOzUuixE-mggfskZPX_Y0cpyQH4RA?e=I3vFzc. It's straightforward what we want to do. We want to extract the part of the sequences that map between 54-78 in the reference.

This is the result that we get from your program

enter image description here

and these are the sequences that we need:

TACGGTTATATTGACAGACCGAGGG
TAATTCGAAACCATAAGGCTCGCAA
TACGGTTATATTGACAGACCGAGGG
TAATTCGAAACCATAAGGCTCGCAA

I isolated them using R. It works for sequences but not for 20 million in R.

I guess you are very busy, so my apologies TACGGTTATATTGACAGACCGAGGG

ADD REPLY

Login before adding your answer.

Traffic: 1801 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6