Question: Keeping only matching reads in fastq files
0
gravatar for miles.thorburn
5 weeks ago by
miles.thorburn80 wrote:

I have sequences from 2 separate extractions for the same individual. We are working on a highly polymorphic region, and only want to keep reads that are found in both fastq files (from separate Illumina lanes). I've already paired the sequences if that changes anything.

The ideal tool would match sequences from two separate fastq (or fasta files if that's the input), and output only the matching sequences. I thought BBMap might have something, but I couldn't find anything appropriate.

I've also looked at jMHC, even though I'm not using MHC data, but I'm struggling to even install that on the computing cluster.

Thanks.

fastq • 183 views
ADD COMMENTlink modified 5 weeks ago by genomax69k • written 5 weeks ago by miles.thorburn80

BBsplit ?

I might as well not fully understand your issue. Why don't you just map each of the fast files, get the matching reads and join those in a new fastq file?

ADD REPLYlink written 5 weeks ago by lieven.sterck5.5k

So if a read differs by one base from a read in the other fastq, you want to throw it away?

ADD REPLYlink written 5 weeks ago by swbarnes26.0k

Yes, because we used 2 separate extractions for each individual that were sequenced in different lanes, we are trying to be careful to not incorporate erroneous alleles into our data set. Our estimations are that the genes are nearly as polymorphic as the MHC region, hence being so careful.

ADD REPLYlink written 5 weeks ago by miles.thorburn80
1
gravatar for genomax
5 weeks ago by
genomax69k
United States
genomax69k wrote:

I would suggest using clumpify.sh to identify sequences that are unique in both sets. Use the addcount= option to count the copies of reads. If you want exact sequence replicates only then be sure to adjust subs=0.

Then do a second clumpify.sh run on the two individual data files to leave a single copy of each sequence. At this point you would need to do some custom work (perhaps use filterbyname.sh to reextract those reads that get removed from the final set.

Take a look at clumpify.sh guide here.

ADD COMMENTlink written 5 weeks ago by genomax69k

I'll have a look. Thanks. I may have found a solution using uclust though. If you use the second fasta as the database, it will match all reads.

ADD REPLYlink written 5 weeks ago by miles.thorburn80

Does that keep only those reads that are common to both files? Did you post process the results to only keep those reads from both files that were a hit?

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by genomax69k

Exactly, and it is as simple as taking the columns of header names and only keep those entries. I haven't done that just yet, but I'll write the script soon.

ADD REPLYlink written 5 weeks ago by miles.thorburn80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 743 users visited in the last hour