Question: Tools for generating unique reads from fastqs
genya35 wrote, 28 days ago:

Hello,

I would like to generate a fasta file containing the unique reads, and a count of how many times each occurs, from two Illumina fastq files (forward and reverse). The next step is to blast the unique reads and group them together based on the results of the blast search. Could someone please suggest a tool that could accomplish this?

Thanks

Tags: next-gen
JC wrote, 28 days ago:

Are you sure you know what you are asking? It sounds like you want to extract unique reads and then count them. Please explain what you want to do and whether you have specific questions.
genomax wrote, 28 days ago:

You need clumpify.sh from the BBMap suite to remove/count duplicate reads. See this: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files
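A minimal sketch of that deduplication step (file names here are placeholders; subs=0 requires exact matches, so only truly identical reads are collapsed):

clumpify.sh in=reads.fq.gz out=unique.fq.gz dedupe=t addcount=t subs=0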

Once you have your set of unique fastq reads, they can easily be converted to fasta format using reformat.sh from the same BBMap suite. I am reasonably certain that the read headers containing the count numbers are retained in the following conversion.

reformat.sh in=your.fq.gz out=your.fa

genya35 replied, 25 days ago:

@genomax At what point do you recommend combining the two fastqs into one? Thanks

genomax replied, 25 days ago:

If these are paired-end reads, you should process them together with the in1= and in2= directives and capture the results with out1= and out2=. Only reads where both R1 and R2 are identical will be considered duplicates. Remember to use addcount=t subs=0. Depending on the size of your data files, clumpify.sh may need a significant amount of memory.
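A sketch of the paired-end invocation just described (file names are placeholders; the -Xmx flag, which sets the Java heap size, is an optional addition and should be adjusted to the memory available):

clumpify.sh -Xmx16g in1=sample_R1.fq.gz in2=sample_R2.fq.gz out1=unique_R1.fq.gz out2=unique_R2.fq.gz dedupe=t addcount=t subs=0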

genya35 replied, 25 days ago:

My goal is to come up with a list of unique reads with counts for the sample. In the next step I will use IgBLAST to assign V-J usage, and later group and count them. At what point should I combine the unique reads from the two files? Thanks

genomax replied, 25 days ago:

If you use clumpify.sh as intended, it will keep only the single best copy of each duplicated read and add a count number to the header showing how many copies there were. So you do not need to do any combining.
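If you then want a single fasta from the two deduplicated files, reformat.sh can read both pair files and write one interleaved output; a sketch, assuming (as above) that the count tags added by clumpify survive in the headers (file names are placeholders):

reformat.sh in1=unique_R1.fq.gz in2=unique_R2.fq.gz out=unique.fa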

genya35 replied, 25 days ago:

Is there an easy way to sort the fasta output from the most common to the least common read? I ran fastp at default settings to post-process the fastqs before I used clumpify. Do you recommend any additional post-processing? Thanks

genomax replied, 25 days ago:

You should use clumpify.sh on the original, un-processed data. Trimming your data may result in the loss of some valuable information about duplication.
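On the sorting question above: assuming addcount=t writes a copies=N tag into each read header, one possible shell sketch to order fasta records from most to least common (it also assumes headers contain no tab characters):

# flatten each fasta record to one tab-separated line prefixed with its copies= count,
# sort numerically in descending order, then rebuild the records
awk 'BEGIN{RS=">"; FS="\n"} NR>1 {
    hdr=$1; seq="";
    for (i=2; i<=NF; i++) seq = seq $i;        # join multi-line sequences
    n=0;
    if (match(hdr, /copies=[0-9]+/))           # pull the count out of the header
        n = substr(hdr, RSTART+7, RLENGTH-7);
    print n "\t" hdr "\t" seq
}' unique.fa | sort -k1,1nr | awk -F'\t' '{print ">" $2 "\n" $3}' > unique_sorted.fa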