Hello,
I received a dataset of RNA-seq data (paired-end Illumina FASTQ files) from a member of a lab that collaborated with my PI, and was instructed to run modules to remove primers (cutadapt) and verify the quality of the data (FastQC) before further analysis. My problem is that I don't know if these FASTQ files contain primers, and I believe that if cutadapt does not have the proper primer sequences, it will remove those sequences. If the dataset does contain primers, ideally I would like to identify which primers were used for sequencing, so that I could then use cutadapt to properly remove those sequences.
A FastQC report of the FASTQ files I received shows multiple overrepresented sequences. I BLASTed a few of these sequences, but they did not return any matches. Is this the right way to go about looking for these primers?
I am very new to this, but trying my best. Feel free to point me in the right direction, or suggest supplemental material on the subject, as I would like to learn how to use these tools more effectively.
Programs like fastp are able to automatically identify adapter sequences. If you have a collaboration with the lab that generated the data, write to them and ask which protocol/kit was used to prepare the libraries. That should remove any ambiguity. As ATPoint noted below, your data may already have been cleaned by the submitters (if the sequences are not all the same length, that is almost certainly the case).
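If you want a quick way to check the "same length" point yourself, something like the following could work. It is only a rough sketch: it assumes plain 4-line FASTQ records (no wrapped sequence lines), and the function names and the tiny example reads are made up for illustration.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_length_counts(lines, max_reads=100_000):
    """Tally read lengths over the first max_reads records.

    Assumes each record is exactly 4 lines: header, sequence, '+', qualities.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i // 4 >= max_reads:
            break
        if i % 4 == 1:  # second line of each record is the sequence
            counts[len(line.strip())] += 1
    return counts


# Tiny in-memory example with made-up reads of two different lengths.
example = [
    "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@read2", "ACGTAC", "+", "IIIIII",
]
counts = read_length_counts(example)
print(sorted(counts.items()))  # [(6, 1), (10, 1)]
print("mixed lengths:", len(counts) > 1)
```

On a real file you would pass the open handle instead of the list, e.g. `with open_fastq("sample_R1.fastq.gz") as fh: counts = read_length_counts(fh)` (the file name here is hypothetical). More than one length in the output suggests the reads were already trimmed; a single length suggests they are probably still raw.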
Even if some residual primer sequences remain, they will be "soft-clipped" by the aligner you use in the next step.
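To get a feel for how much adapter is actually left in the reads, you could count how many reads contain a known adapter sequence. The sketch below uses AGATCGGAAGAGC, the common start of the standard Illumina TruSeq adapters, as the default. That is an assumption on my part, since the kit used for this dataset is unknown, so swap in the real sequence once the lab confirms it. The function name and example reads are invented for illustration, and it again assumes 4-line FASTQ records.

```python
def adapter_fraction(lines, adapter="AGATCGGAAGAGC", max_reads=100_000):
    """Return the fraction of reads that contain `adapter` as an exact substring.

    The default adapter is the common Illumina TruSeq prefix (an assumption;
    replace it with the sequence for the kit that was actually used).
    Assumes each record is exactly 4 lines.
    """
    total = 0
    hits = 0
    for i, line in enumerate(lines):
        if i // 4 >= max_reads:
            break
        if i % 4 == 1:  # sequence line
            total += 1
            if adapter in line.strip().upper():
                hits += 1
    return hits / total if total else 0.0


# Made-up reads: the first one runs into the adapter, the second does not.
example_reads = [
    "@read1", "ACGTACGTAGATCGGAAGAGCACAC", "+", "I" * 25,
    "@read2", "ACGTACGTACGTACGT", "+", "I" * 16,
]
print(adapter_fraction(example_reads))  # 0.5
```

This only finds exact, full-length matches, so it will undercount partial adapters at the very end of a read and adapters with sequencing errors. Treat it as a quick sanity check rather than a replacement for cutadapt or fastp.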