Question: How to produce a table with the number of copies per unique sequence? (possible?)
24 days ago by
angelaparody30 wrote:


First of all, I am not a bioinformatician or computational person; I am a molecular biologist. I have some fastq.gz files, and the FastQC report I generated says that I have a lot of overrepresented sequences (> 50%!). This is probably due to the nature of the genome (there is no reference genome, and my guess is that the restriction enzyme used has favoured the sequencing of repetitive elements?). What I am interested in is knowing how many unique sequences/reads have a certain number of copies (coverage), to see how many unique reads have a moderate number of copies (20-50). Any idea how to get this information? Would it be possible through a command? Would it be possible to produce a txt file with two columns, one with the number of unique sequences and a second with the number of copies of those unique sequences? In other words, what I am trying to get is the distribution of copy numbers across unique sequences.
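One way to produce exactly this table with standard Unix tools is sketched below; the file name reads.fastq.gz and the function name copy_dist are placeholders, not from the original post:

```shell
# Copy-number distribution of unique read sequences in a fastq.gz file.
# Output columns: copies <TAB> number of unique sequences seen that many times.
copy_dist() {
  gzip -dc "$1" |
    awk 'NR % 4 == 2' |   # keep only the sequence line of each 4-line record
    sort | uniq -c |      # count copies of every unique sequence
    awk '{print $1}' |    # keep just the copy counts
    sort -n | uniq -c |   # how many unique sequences share each copy count
    awk '{printf "%s\t%s\n", $2, $1}'
}
```

Running `copy_dist reads.fastq.gz > copy_distribution.txt` would then give the two-column txt file described above. Note that this counts only reads whose full sequences are identical; reads that overlap the same repeat but start at different positions are counted separately.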

Thanks in advance,


fastq.gz count

Please take a look at these blog posts from the authors of FastQC.

Duplication
Positional bias

You could de-duplicate this data if you want to count copies of reads with identical sequences (see: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files). You would use the program like this: clumpify.sh in=file.fq out=deduped.fq dedupe addcount

Reads that absorbed duplicates will have "copies=X" appended to the end of the fastq header to indicate how many reads they represent (including themselves, so the minimum you would see is 2).
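Those copies=X tags can then be tabulated into the two-column distribution the question asks for. A rough sketch (the function name count_copies is invented for illustration; reads without a copies= tag are counted as single-copy reads, consistent with the minimum of 2 above):

```shell
# Tabulate clumpify's "copies=X" header tags into a copy-number distribution.
# Output columns: copies <TAB> number of unique sequences with that many copies.
count_copies() {
  gzip -dc "$1" |
    awk 'NR % 4 == 1 {
           n = 1                                      # no tag: single-copy read
           if (match($0, /copies=[0-9]+/))
             n = substr($0, RSTART + 7, RLENGTH - 7)  # digits after "copies="
           hist[n]++
         }
         END { for (c in hist) printf "%s\t%s\n", c, hist[c] }' |
    sort -n
}
```

For example, `count_copies deduped.fq.gz > copy_distribution.txt` would write the distribution, from which you can read off how many unique sequences fall in the 20-50 copy range.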

modified 24 days ago • written 24 days ago by genomax68k

