Faster extraction of reads mapped to a region
1
0
Entering edit mode
7.3 years ago
novice ★ 1.1k

So I think I've just read every single post about this on the internet.

I simply want to extract reads mapped to a certain region. This is the pipeline I'm currently using:

samtools view sorted.bam "chr3:1000-5000" | cut -f1 > ids
LC_ALL=C grep -wFf ids < original.sam > subset.sam  # this is the bottleneck
echo -e "$(samtools view -H original.sam)\n$(cat subset.sam)" > subset.sam
samtools view -b subset.sam | samtools fastq -1 reads1.fq -2 reads2.fq -

This is not too bad for a single run, but I actually have to do this for 1000+ regions. Is there possibly a more efficient way to extract the reads, or is this as fast as it gets?

Edit: added updating-the-header step
Edit: Help

samtools alignment • 2.0k views
ADD COMMENT
0
Entering edit mode

Could you clarify what the desired output is? Do you want fastq files, read IDs, a subset bam or something else?

ADD REPLY
0
Entering edit mode

The fastq files (paired end).

ADD REPLY
0
Entering edit mode
7.3 years ago
venu 7.1k

I would do something like this

cat <(samtools view -H merged.mdup.bam) <(samtools view merged.mdup.bam "chr3:20000-200000") | samtools view -bS - > subset.bam

You can loop the above command on every region giving region as name to the output file instead of subset.bam.

BTW, are you trying to generate fastq from your subset.bam ? Because I did not find fastq argument in my version (samtools-1.2), instead I found bam2fq.

ADD COMMENT
0
Entering edit mode

samtools fastq is in samtools v.1.3.1

ADD REPLY
0
Entering edit mode

fastq is the new bam2fq.

The problem with this solution is that, if I pipe it into fastq, I get discordant fastq files. I need to make sure to extract the pairs of reads as well.

ADD REPLY

Login before adding your answer.

Traffic: 3383 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6