Question

Keeping only matching reads in fastq files

0

Entering edit mode

6.3 years ago

dthorbur ★ 3.1k

I have sequences from 2 separate extractions for the same individual. We are working on a highly polymorphic region, and only want to keep reads that are found in both fastq files (from separate Illumina lanes). I've already paired the sequences if that changes anything.

The ideal tool would match sequences from two separate fastq (or fasta files if that's the input), and output only the matching sequences. I thought BBMap might have something, but I couldn't find anything appropriate.

I've also looked at jMHC, even though I'm not using MHC data, but I'm struggling to even install that on the computing cluster.

Thanks.

fastq • 2.4k views

ADD COMMENT • link updated 6.3 years ago by GenoMax 154k • written 6.3 years ago by dthorbur ★ 3.1k

0

Entering edit mode

BBsplit ?

I might as well not fully understand your issue. Why don't you just map each of the fast files, get the matching reads and join those in a new fastq file?

ADD REPLY • link 6.3 years ago by lieven.sterck 15k

0

Entering edit mode

So if a read differs by one base from a read in the other fastq, you want to throw it away?

ADD REPLY • link 6.3 years ago by swbarnes2 15k

0

Entering edit mode

Yes, because we used 2 separate extractions for each individual that were sequenced in different lanes, we are trying to be careful to not incorporate erroneous alleles into our data set. Our estimations are that the genes are nearly as polymorphic as the MHC region, hence being so careful.

ADD REPLY • link 6.3 years ago by dthorbur ★ 3.1k

score 1 · Answer 1 · 2019-06-14

1

Entering edit mode

6.3 years ago

GenoMax 154k

I would suggest using clumpify.sh to identify sequences that are unique in both sets. Use the addcount= option to count the copies of reads. If you want exact sequence replicates only then be sure to adjust subs=0.

Then do a second clumpify.sh run on the two individual data files to leave a single copy of each sequence. At this point you would need to do some custom work (perhaps use filterbyname.sh to reextract those reads that get removed from the final set.

Take a look at clumpify.sh guide here.

ADD COMMENT • link 6.3 years ago by GenoMax 154k

0

Entering edit mode

I'll have a look. Thanks. I may have found a solution using uclust though. If you use the second fasta as the database, it will match all reads.

ADD REPLY • link 6.3 years ago by dthorbur ★ 3.1k

0

Entering edit mode

Does that keep only those reads that are common to both files? Did you post process the results to only keep those reads from both files that were a hit?

ADD REPLY • link 6.3 years ago by GenoMax 154k

0

Entering edit mode

Exactly, and it is as simple as taking the columns of header names and only keep those entries. I haven't done that just yet, but I'll write the script soon.

ADD REPLY • link 6.3 years ago by dthorbur ★ 3.1k