Keeping only matching reads in fastq files
1
0
Entering edit mode
4.8 years ago
dthorbur ★ 1.9k

I have sequences from 2 separate extractions for the same individual. We are working on a highly polymorphic region, and only want to keep reads that are found in both fastq files (from separate Illumina lanes). I've already paired the sequences if that changes anything.

The ideal tool would match sequences from two separate fastq (or fasta files if that's the input), and output only the matching sequences. I thought BBMap might have something, but I couldn't find anything appropriate.

I've also looked at jMHC, even though I'm not using MHC data, but I'm struggling to even install that on the computing cluster.

Thanks.

fastq • 1.7k views
ADD COMMENT
0
Entering edit mode

BBsplit ?

I might as well not fully understand your issue. Why don't you just map each of the fast files, get the matching reads and join those in a new fastq file?

ADD REPLY
0
Entering edit mode

So if a read differs by one base from a read in the other fastq, you want to throw it away?

ADD REPLY
0
Entering edit mode

Yes, because we used 2 separate extractions for each individual that were sequenced in different lanes, we are trying to be careful to not incorporate erroneous alleles into our data set. Our estimations are that the genes are nearly as polymorphic as the MHC region, hence being so careful.

ADD REPLY
1
Entering edit mode
4.8 years ago
GenoMax 141k

I would suggest using clumpify.sh to identify sequences that are unique in both sets. Use the addcount= option to count the copies of reads. If you want exact sequence replicates only then be sure to adjust subs=0.

Then do a second clumpify.sh run on the two individual data files to leave a single copy of each sequence. At this point you would need to do some custom work (perhaps use filterbyname.sh to reextract those reads that get removed from the final set.

Take a look at clumpify.sh guide here.

ADD COMMENT
0
Entering edit mode

I'll have a look. Thanks. I may have found a solution using uclust though. If you use the second fasta as the database, it will match all reads.

ADD REPLY
0
Entering edit mode

Does that keep only those reads that are common to both files? Did you post process the results to only keep those reads from both files that were a hit?

ADD REPLY
0
Entering edit mode

Exactly, and it is as simple as taking the columns of header names and only keep those entries. I haven't done that just yet, but I'll write the script soon.

ADD REPLY

Login before adding your answer.

Traffic: 2670 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6