I am working with Illumina paired end reads. I am trying to use Blast to find read pairs in which one pair aligns to the nucleus and the other pair aligns to the mitochondria. Any reference genome I could align the reads to would not have the gene transfers I'm looking for, so the unaligned fastq files are my best bet.
Basically what I've done so far is turn the fastq files into fasta files and blasted them to the mitochondrial genome, then I've taken the read names from the blast output and used sed to change the ".1's" to ".2's" so that I have the name of that read's pair.
So I have a text file with 100000 read names, and I need to get the sequences of these reads. Originally I used grep to find the names and print the next line (
grep -f -A1) however, because my files are so large, this took too much RAM and I had to cancel the task.
I found that using:
while read line; do grep -wF -A1 "$line" query.file; done < subject.file
gives me the output I need and doesn't crash our server, but it is also EXTREMELY slow.
Any advice on how to more efficiently perform this task? Even using grep -m 1 will take over a day to finish each sample.