Question: How to produce a table with the number of copies per unique sequence? (possible?)
12 months ago, angelaparody50 wrote:


First of all, I am not a bioinformatician or computational person; I am a molecular biologist. I have some fastq.gz files, and the FastQC report I generated says that I have a lot of overrepresented sequences (> 50%!). This is probably due to the nature of the genome: there is no reference genome, and my guess is that the restriction enzyme used has favoured the sequencing of repetitive elements. What I am interested in is knowing how many unique sequences/reads have a certain number of copies (coverage), so I can see how many unique reads have a moderate copy number (20-50 copies). Any idea how to get this information? Would it be possible through a command? Would it be possible to produce a txt file with two columns, one with the number of unique sequences and a second with the number of copies of those unique sequences? In other words, what I am trying to get is the distribution of copy numbers across unique sequences.
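For a quick first look, this distribution can also be tabulated with standard Unix tools alone, without de-duplication software. A minimal sketch, assuming a standard four-line-per-record fastq and a placeholder file name `file.fastq.gz`:

```shell
# Copy-number distribution of unique read sequences in a fastq.gz file.
# Output: two tab-separated columns, copy number and the number of
# unique sequences that occur that many times.
zcat file.fastq.gz |
  awk 'NR % 4 == 2' |   # keep only the sequence line of each 4-line record
  sort | uniq -c |      # count identical sequences
  awk '{print $1}' |    # keep just the per-sequence copy counts
  sort -n | uniq -c |   # count how many sequences share each copy count
  awk '{print $2 "\t" $1}' > copy_distribution.txt
```

Note that the first `sort` holds every read sequence, so on large files it can use substantial temporary disk space.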

Thanks in advance,


Tags: fastq.gz count

Please take a look at these blog posts from the authors of FastQC:

Duplication
Positional bias

You could de-duplicate this data if you want to count copies of reads with identical sequences (see: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files). You would use the program like this: clumpify.sh in=file.fq out=deduped.fq dedupe addcount

Reads that absorbed duplicates will have "copies=X" appended to the end of the fastq header to indicate how many reads they represent (including themselves, so the minimum value you would see is 2).
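Those annotations can then be turned into the two-column table asked for above. A sketch, assuming the de-duplicated file is named `deduped.fq` (a placeholder) and that each annotated header ends in its `copies=N` tag:

```shell
# Tabulate the "copies=N" annotations in fastq headers into a
# copy-number distribution (copy number <TAB> number of unique reads).
# Headers without a copies= tag are reads that absorbed no duplicates,
# so they are counted under copy number 1.
awk 'NR % 4 == 1 {
       if (/copies=/) { n = $0; sub(/.*copies=/, "", n); counts[n]++ }
       else           { counts[1]++ }
     }
     END { for (c in counts) print c "\t" counts[c] }' deduped.fq |
  sort -n > copy_distribution.txt
```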

written 12 months ago by genomax83k


Powered by Biostar version 2.3.0