I am looking to compare the overrepresented sequences across a series of FastQC reports. All 16 of my FastQC reports have failed the overrepresented sequences module, and I am looking for a way to extract this information from the .txt file, between the >>Overrepresented sequences and >>END_MODULE lines, and then visualise and compare the data to see whether it is contaminated. Is there any way to do this using Python or R?
I would like to know the top overrepresented sequences across all files, see whether there is any link between them, and if so BLAST them.
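One way to pull that module out in Python is sketched below. It assumes each report has been unzipped so that its fastqc_data.txt file is available, and that the rows between >>Overrepresented sequences and >>END_MODULE are tab-separated in the order sequence, count, percentage, possible source. The function names parse_overrepresented and parse_report are made up for this example, and the glob pattern in the usage comment is an assumption about the directory layout.

```python
from pathlib import Path


def parse_overrepresented(lines):
    """Return the rows of the 'Overrepresented sequences' module as dicts.

    `lines` is any iterable of text lines from a fastqc_data.txt file.
    """
    rows = []
    in_module = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not in_module:
            # The module header looks like ">>Overrepresented sequences<TAB>fail".
            if line.startswith(">>Overrepresented sequences"):
                in_module = True
            continue
        if line.startswith(">>END_MODULE"):
            break
        # Skip the "#Sequence ..." column header and any blank lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a data row in the assumed layout
        rows.append({
            "sequence": fields[0],
            "count": int(float(fields[1])),
            "percentage": float(fields[2]),
            "source": fields[3] if len(fields) > 3 else "",
        })
    return rows


def parse_report(path):
    """Parse one fastqc_data.txt file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_overrepresented(handle)


# Example usage (assumed layout: one unzipped "<sample>_fastqc" folder per report):
# reports = {
#     p.parent.name: parse_report(p)
#     for p in Path(".").glob("*_fastqc/fastqc_data.txt")
# }
```

The result is a plain list of dicts per report, which can be handed to pandas or written to a table for plotting.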
Usually fastqc also produces an .html file where the "Overrepresented sequences" section reports the top overrepresented sequences with their frequency of occurrence in your file. So you can simply copy them and paste them into BLAST.
Unless you post an example file, it is not clear what you want to extract. Please post example input and expected output. Alternatively, you could use MultiQC to collate multiple FastQC reports and extract the information from the MultiQC output.
I have sequences from two different datasets. I am looking to compare the overrepresented sequences in both to see if there has been any cross contamination.
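For the comparison itself, a minimal sketch is to count how many samples report each overrepresented sequence and keep the ones seen in more than one. It assumes you already have, for each sample, a mapping from sequence to its percentage; the names shared_sequences and to_fasta and the FASTA header format are invented for this example. Matching here is exact-string only, so sequences that differ by a base or are offset from one another will not be linked.

```python
from collections import defaultdict


def shared_sequences(per_sample, min_samples=2):
    """Rank sequences reported as overrepresented in several samples.

    per_sample: dict of sample name -> dict of sequence -> percentage.
    Returns a list of (sequence, {sample: percentage}) pairs, sorted by
    number of samples (descending), then by highest percentage (descending).
    """
    seen = defaultdict(dict)
    for sample, seqs in per_sample.items():
        for seq, pct in seqs.items():
            seen[seq][sample] = pct
    shared = [(seq, hits) for seq, hits in seen.items() if len(hits) >= min_samples]
    shared.sort(key=lambda item: (-len(item[1]), -max(item[1].values())))
    return shared


def to_fasta(ranked):
    """Format ranked sequences as FASTA text that can be pasted into BLAST."""
    return "".join(
        f">overrep_{i}_n{len(hits)}\n{seq}\n"
        for i, (seq, hits) in enumerate(ranked, start=1)
    )


# Toy input for illustration only (not real data).
example = {
    "datasetA_s1": {"ACGTACGT": 1.5, "TTTTGGGG": 0.2},
    "datasetB_s1": {"ACGTACGT": 0.9},
    "datasetB_s2": {"ACGTACGT": 0.3, "TTTTGGGG": 0.4},
}
ranked = shared_sequences(example)
fasta_text = to_fasta(ranked)
```

A sequence shared between the two datasets is not by itself evidence of cross contamination, since adapters and rRNA commonly show up in unrelated libraries, so the BLAST hits still need to be checked.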