Question: Extracting reads mapped to any of multiple references
gravatar for novice
2.8 years ago by
United States
novice910 wrote:

I can extract reads mapped to a single reference sequence by first aligning the reads using, say, BWA, and processing with SAMtools:

samtools view -b -f 0x2 alignment.sam | samtools fastq - -1 mapped_1.fastq -2 mapped_2.fastq

Is there a similar pipeline I can use to extract reads mapped to multiple references? That is, if a read was found in any of the given references, it should be included in the output.

Here's an idea I have:

  1. align the reads to each reference separately (producing multiple SAM files)
  2. extract the reads mapped to each reference using the above pipeline
  3. sort and uniq the names of the reads from all output files, storing the names in file.txt
  4. grep the names in file.txt from one of the SAM files and store in a new SAM file
  5. convert SAM to FASTQ

I'm guessing this is probably not the most efficient way to solve the problem. It also doesn't catch when a read might be mapped to one of the references without its pair. In this case, I'd like to extract both reads.

samtools alignment • 1.7k views
ADD COMMENTlink modified 2.8 years ago by thackl2.6k • written 2.8 years ago by novice910
gravatar for Brian Bushnell
2.8 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

The most efficient way would be to map to all the references at once. You can do this with any aligner if you just concatenate the references. Alternatively you could map to all at once using BBSplit, which can produce one fastq output file per reference, containing all the reads (or pairs) that best map to that reference. But if you simply want all reads that map to any reference in a single fastq file, along with their unmapped mates, you can do this:

1) Concatenate the references with cat.
2) Map the reads with this command: ref=references.fa in1=read1.fq.gz in2=read2.fq.gz outm=mapped.fq.gz outu=unmapped.fq.gz

If one or both reads in a pair map to anything, both will go to "outm"; if neither map, both will go to "outu". The output files will be interleaved, but if you use "outm1" and "outm2" (and etc.) you can get the output in dual files.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Brian Bushnell16k

Concatenating the references into one fasta file worked just fine. Not the first time I spend way more time thinking and typing than is necessary...

ADD REPLYlink written 2.8 years ago by novice910
gravatar for thackl
2.8 years ago by
thackl2.6k wrote:

If you store your bwa output in bam, rather than sam, you an use samtools view -b -f 0x2 alignment.bam REGION, with REGION being a single or multiple reference sequence IDs

 samtools view -b -f 0x2 alignment.bam ref1 ref2 refx | samtools fastq ...
ADD COMMENTlink written 2.8 years ago by thackl2.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1097 users visited in the last hour