Question

Create smaller bam file based on the list of sequences

0

Entering edit mode

9.7 years ago

Biomonika (Noolean) 3.2k

I have a list of sequences of interest and a huge bam file. I would like to create new bam file that would contain only reads mapping to one of those sequences as a reference. So far I was filtering SAM file with grep -f to achieve this, but I am looking for quicker and more elegant solution, something like seqtk subseq <in.fa> <name.list>

Thanks

filter bam • 5.4k views

ADD COMMENT • link updated 2.1 years ago by O.rka ▴ 710 • written 9.7 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

You can blat those sequences against the reference genome and find out their chromosomal positions. Then you can use a list of those regions and samtools to extract sequences from the bam file.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Did you ever find a suitable answer? I wanna do the same thing.

ADD REPLY • link 2.1 years ago by O.rka ▴ 710

Ram · Answer 1 · 2014-08-20

1

Entering edit mode

9.7 years ago

matted 7.8k

The methods that traverse the entire file are inefficient, especially if your sequences of interest are only a small fraction of the total. It's better to use the bam index to only include the reads you want. You can do this with samtools view by making a trivial bed file that consists of your chromosome names, 0, and a large number, in order to select all reads on that chromosome:

awk '{print $1"\t0\t1000000000"}' references.txt | samtools view -F 4 -b -L /dev/stdin input.bam > output.bam

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by matted 7.8k

1

Entering edit mode

I like your solution, but it might not be the most efficient. I think -L option makes samtools search the file once per input region. See this benchmark:

cat /tmp/a.bed LmjF.01 LmjF.05 LmjF.10 LmjF.20 LmjF.36

With -L option:

time awk '{print $1"\t0\t1000000000"}' /tmp/a.bed \
> | samtools view -L /dev/stdin $bam | wc -l
5015

real    0m1.114s
user    0m1.031s
sys    0m0.057s

Passing individual regions:

time samtools view $bam LmjF.01 LmjF.05 LmjF.10 LmjF.20 LmjF.36 | wc -l
5015

real    0m0.052s
user    0m0.026s
sys    0m0.018s

ADD REPLY • link 9.7 years ago by dariober 14k

4

Entering edit mode

Ah, looks like you're right. I forgot about passing multiple regions to samtools view alone; that's better.

I guess to convert it to the same form I had before, your better solution would be:

cat references.txt | xargs samtools view -F 4 -b input.bam > output.bam

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by matted 7.8k

0

Entering edit mode

Excellent. This is exactly what I had in mind.

ADD REPLY • link 9.7 years ago by Biomonika (Noolean) 3.2k

score 1 · Answer 2 · 2014-08-21

1

Entering edit mode

9.7 years ago

dariober 14k

Wouldn't this suffice assuming the bam file is sorted & indexed?

samtools view -b myaln.bam contig3 contig5 contig7 > subset.bam

ADD COMMENT • link 9.7 years ago by dariober 14k

score 0 · Answer 3 · 2014-08-20

0

Entering edit mode

9.7 years ago

RafaelMP ▴ 120

awk '$3 != "*"' file.sam > newfile.sam

ADD COMMENT • link 9.7 years ago by RafaelMP ▴ 120

0

Entering edit mode

I don't want all reads that are mapped. I want just those that map to contig3, contig5 and contig7 in case that's on my list. This is significantly smaller list than list of all reference sequences. Sorry for not describing this precisely.

ADD REPLY • link 9.7 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

Do you want to extract reads that map to a contig or is it some short sequence ? I am just curious because you can easily extract reads that map to a particular contig if your reference fasta file has a separate fasta entry for that contig.

ADD REPLY • link 9.7 years ago by Ashutosh Pandey 12k