First of all, I am not a bioinformatician or computational person; I am a molecular biologist. I have some fastq.gz files, and the FastQC report I generated says that I have a lot of overrepresented sequences (> 50%!). This is probably due to the nature of the genome: there is no reference genome, and my guess is that the restriction enzyme used has favoured the sequencing of repetitive elements (??). What I am interested in is knowing how many unique sequences/reads have a certain number of copies (coverage), so that I can see how many unique reads have a moderate number of copies (20-50 copies). Any idea how to get this information? Would it be possible with a command? Would it be possible to produce a txt file with two columns, one with the number of unique sequences and a second with the number of copies of those unique sequences? In other words, what I am trying to get is the distribution of copy numbers across unique sequences.
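To make the request concrete, here is a rough sketch in Python of the kind of two-column table described above. It is only an illustration under some assumptions: each FASTQ record occupies exactly four lines (no wrapped sequences), two reads count as the same sequence only if they are identical character for character, and all unique sequences fit in memory. The function names `copy_number_distribution` and `write_distribution` and the file names are hypothetical.

```python
import gzip
from collections import Counter


def copy_number_distribution(fastq_gz_path):
    """Return a mapping {copies: number of unique sequences with that many copies}."""
    seq_counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Assumption: 4 lines per record, so the sequence is
            # the 2nd line of each record (index 1, 5, 9, ...).
            if line_number % 4 == 1:
                seq_counts[line.strip()] += 1
    # Count how many unique sequences share each copy number.
    return Counter(seq_counts.values())


def write_distribution(distribution, out_path):
    """Write a tab-separated table: copies, then number of unique sequences."""
    with open(out_path, "w") as out:
        out.write("copies\tunique_sequences\n")
        for copies in sorted(distribution):
            out.write(f"{copies}\t{distribution[copies]}\n")
```

Calling `write_distribution(copy_number_distribution("reads.fastq.gz"), "copy_distribution.txt")` would write one row per copy number, so the rows for 20 to 50 copies give the number of unique sequences in the moderate range.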
Thanks in advance,