Extracting reads mapped to any of multiple references
2
2
Entering edit mode
7.8 years ago
novice ★ 1.1k

I can extract reads mapped to a single reference sequence by first aligning the reads using, say, BWA, and processing with SAMtools:

samtools view -b -f 0x2 alignment.sam | samtools fastq - -1 mapped_1.fastq -2 mapped_2.fastq

Is there a similar pipeline I can use to extract reads mapped to multiple references? That is, if a read was found in any of the given references, it should be included in the output.

Here's an idea I have:

  1. align the reads to each reference separately (producing multiple SAM files)
  2. extract the reads mapped to each reference using the above pipeline
  3. sort and uniq the names of the reads from all output files, storing the names in file.txt
  4. grep the names in file.txt from one of the SAM files and store in a new SAM file
  5. convert SAM to FASTQ

I'm guessing this is probably not the most efficient way to solve the problem. It also doesn't catch when a read might be mapped to one of the references without its pair. In this case, I'd like to extract both reads.

samtools alignment • 5.2k views
ADD COMMENT
3
Entering edit mode
7.8 years ago

The most efficient way would be to map to all the references at once. You can do this with any aligner if you just concatenate the references. Alternatively you could map to all at once using BBSplit, which can produce one fastq output file per reference, containing all the reads (or pairs) that best map to that reference. But if you simply want all reads that map to any reference in a single fastq file, along with their unmapped mates, you can do this:

1) Concatenate the references with cat.
2) Map the reads with this command:
bbmap.sh ref=references.fa in1=read1.fq.gz in2=read2.fq.gz outm=mapped.fq.gz outu=unmapped.fq.gz

If one or both reads in a pair map to anything, both will go to "outm"; if neither map, both will go to "outu". The output files will be interleaved, but if you use "outm1" and "outm2" (and etc.) you can get the output in dual files.

ADD COMMENT
0
Entering edit mode

Concatenating the references into one fasta file worked just fine. Not the first time I spend way more time thinking and typing than is necessary...

ADD REPLY
0
Entering edit mode
7.8 years ago
thackl ★ 3.0k

If you store your bwa output in bam, rather than sam, you an use samtools view -b -f 0x2 alignment.bam REGION, with REGION being a single or multiple reference sequence IDs

 samtools view -b -f 0x2 alignment.bam ref1 ref2 refx | samtools fastq ...
ADD COMMENT

Login before adding your answer.

Traffic: 2072 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6