Question: Create smaller bam file based on the list of sequences
0
gravatar for Biomonika (Noolean)
5.3 years ago by
State College, PA, USA
Biomonika (Noolean)3.1k wrote:

I have a list of sequences of interest and a huge bam file. I would like to create new bam file that would contain only reads mapping to one of those sequences as a reference. So far I was filtering SAM file with grep -f to achieve this, but I am looking for quicker and more elegant solution, something like seqtk subseq <in.fa> <name.list>

Thanks.

 

bam filter • 2.5k views
ADD COMMENTlink modified 5.3 years ago by dariober10k • written 5.3 years ago by Biomonika (Noolean)3.1k

You can blat those sequences against the reference genome and find out their chromosomal positions. Then you can use a list of those regions and samtools to extract sequences from the bam file. 

ADD REPLYlink written 5.3 years ago by Ashutosh Pandey11k
1
gravatar for matted
5.3 years ago by
matted7.2k
Boston, United States
matted7.2k wrote:

The methods that traverse the entire file are inefficient, especially if your sequences of interest are only a small fraction of the total.  It's better to use the bam index to only include the reads you want.  You can do this with samtools view by making a trivial bed file that consists of your chromosome names, 0, and a large number, in order to select all reads on that chromosome:

awk '{print $1"\t0\t1000000000"}' references.txt | samtools view -F 4 -b -L /dev/stdin input.bam > output.bam

 

 

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by matted7.2k
1

I like your solution, but it might not be the most efficient. I think -L option makes samtools search the file once per input region. See this benchmark:

cat /tmp/a.bed
LmjF.01
LmjF.05
LmjF.10
LmjF.20
LmjF.36

With -L option:

time awk '{print $1"\t0\t1000000000"}' /tmp/a.bed \
> | samtools view -L /dev/stdin $bam | wc -l
5015

real    0m1.114s
user    0m1.031s
sys    0m0.057s

Passing individual regions:

time samtools view $bam LmjF.01 LmjF.05 LmjF.10 LmjF.20 LmjF.36 | wc -l
5015

real    0m0.052s
user    0m0.026s
sys    0m0.018s
ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by dariober10k
4

Ah, looks like you're right.  I forgot about passing multiple regions to samtools view alone; that's better.

I guess to convert it to the same form I had before, your better solution would be:

cat references.txt | xargs samtools view -F 4 -b input.bam > output.bam
ADD REPLYlink written 5.3 years ago by matted7.2k

Excellent. This is exactly what I had in mind.

ADD REPLYlink written 5.3 years ago by Biomonika (Noolean)3.1k
1
gravatar for dariober
5.3 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Wouldn't this suffice assuming the bam file is sorted & indexed?

samtools view -b myaln.bam contig3 contig5 contig7 > subset.bam
ADD COMMENTlink written 5.3 years ago by dariober10k
0
gravatar for RafaelMP
5.3 years ago by
RafaelMP110
Brazil
RafaelMP110 wrote:

awk '$3 != "*"' file.sam > newfile.sam

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by RafaelMP110

I don't want all reads that are mapped. I want just those that map to contig3, contig5 and contig7 in case that's on my list. This is significantly smaller list than list of all reference sequences. Sorry for not describing this precisely.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Biomonika (Noolean)3.1k

Do you want to extract reads that map to a contig or is it some short sequence ? I am just curious because you can easily extract reads that map to a particular contig if your reference fasta file has a separate fasta entry for that contig.

ADD REPLYlink written 5.3 years ago by Ashutosh Pandey11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2098 users visited in the last hour