I have sequences from two separate extractions for the same individual. We are working on a highly polymorphic region and only want to keep reads that are found in both FASTQ files (from separate Illumina lanes). I've already paired the sequences, if that changes anything.
The ideal tool would match sequences from two separate FASTQ files (or FASTA files, if that's the required input) and output only the matching sequences. I thought BBMap might have something, but I couldn't find anything appropriate.
I've also looked at jMHC, even though I'm not using MHC data, but I'm struggling to even install that on the computing cluster.
Thanks.
BBSplit?
I might not fully understand your issue. Why don't you just map each of the FASTQ files, get the matching reads, and join those in a new FASTQ file?
So if a read differs by one base from a read in the other fastq, you want to throw it away?
Yes. Because we used two separate extractions for each individual, sequenced in different lanes, we are trying to be careful not to incorporate erroneous alleles into our data set. We estimate that the genes are nearly as polymorphic as the MHC region, hence the caution.
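If an exact match really is the criterion, one option is a small script rather than a dedicated tool. Below is a minimal sketch, not a tested pipeline: it assumes plain four-line FASTQ records (no wrapped sequences), compares the sequence line only (not the read name or qualities), and does not consider reverse complements. The function names and file names are made up for illustration.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if not record:
                return
            if len(record) < 4:
                raise ValueError(f"Truncated FASTQ record in {path}")
            yield tuple(line.rstrip("\n") for line in record)


def shared_reads(path_a, path_b):
    """Yield records from path_a whose exact sequence also occurs in path_b."""
    # Assumption: the set of sequences from one lane fits in memory.
    seqs_b = {seq for _, seq, _, _ in read_fastq(path_b)}
    for record in read_fastq(path_a):
        if record[1] in seqs_b:
            yield record


def write_fastq(records, out_path):
    """Write records back out as 4-line FASTQ; returns the number written."""
    count = 0
    with open(out_path, "w") as out:
        for record in records:
            out.write("\n".join(record) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical file names; replace with your own.
    n = write_fastq(shared_reads("lane1.fastq", "lane2.fastq"), "lane1.shared.fastq")
    print(f"Kept {n} reads from lane1 that also occur in lane2")
```

Running it a second time with the two inputs swapped gives the matching reads from the other lane. If the reads are still in separate R1/R2 files, you would probably need to key on both mates together, which this sketch does not do.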