Hello! I obtained a list of unmapped reads IDs from my BAM file and I want to remap only the unmapped reads with other parameters. How can I extract the subset of unmapped reads from my original fastq file? Thank you in advance, Luke
I also wrote a program for this purpose, distributed with BBMap. Usage:
filterbyname.sh in=reads.fq out=filtered.fq names=names.txt include=t
The "include" flag will toggle between including or excluding the names in names.txt (which can, alternately, be another fastq or fasta file). This also supports paired input/output, and names being substrings or superstrings of read IDs.
First you need a file with the FASTQ sequence names you are interested in - or IDs if you like - one per line. And then:
read_fastq -i in.fastq | grab -E ids.txt | write_fastq -xo out.fastq
check out grab for details.
It is simpler to go back to the original .bam, and just pull out the .bam entries that are unmapped. samtools view -f4 should do it. Then, you can use something like Picard's SamToFastq to go back to fastq format, if you need to. (Some software, like velvet, is fine with using .bam as input)
I've found a quick solution with cdbfasta and cdbyank tools. First you have to index your fastq with cdbfasta, then you can search for the IDs in fastq with cdbyank. For more info http://sourceforge.net/projects/cdbfasta/ Thank you, Luke