Identifying primers in RNA-seq data
7 weeks ago
mtsrn • 0

Hello,

I received an RNA-seq dataset (paired-end Illumina FASTQ files) from a member of a lab that collaborated with my PI, and was instructed to run modules to remove primers (cutadapt) and verify the quality of the data (FastQC) before further analysis. My problem is that I don't know whether these FASTQ files contain primers, and I believe that if cutadapt is not given the proper primer sequences, it will trim whatever sequences it is given instead. If the dataset does contain primers, ideally I would like to identify which primers were used for sequencing, so that I can then use cutadapt to remove those sequences properly.

A FastQC report of the FASTQ files I received shows multiple overrepresented sequences. I BLASTed a few of these sequences, but they did not return any matches. Is this the right way to go about looking for these primers?

I am very new to this, but trying my best. Feel free to point me in the right direction, or suggest supplemental material on the subject, as I would like to learn how to use these tools more effectively.

cutadapt transcriptomics fastqc rna-seq primers • 665 views

a member of a lab that collaborated with my PI

Programs like fastp (LINK) are able to automatically identify adapter sequences. If you have a collaboration with the lab that generated the data then write to them and ask about the protocol/kit that was used for the preparation of the libraries. That should remove any ambiguity.

As ATpoint noted below, your data may have already been cleaned by the submitters (if the sequences are not all the same length, that is almost certainly the case).

Even if some residual primer sequences remain, they will typically be "soft-clipped" by the aligner you use in the next step.
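The read-length check mentioned above can be sketched in a few lines of Python. This is only an illustration, not part of any of the tools discussed: the function names are made up for this example, and it assumes standard 4-line FASTQ records (no wrapped sequence lines).

```python
import gzip
from collections import Counter


def length_counts(fastq_lines):
    """Count reads per sequence length from an iterable of FASTQ lines.

    Assumes standard 4-line records (header, sequence, '+', qualities).
    """
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # the sequence is the second line of each record
            counts[len(line.strip())] += 1
    return counts


def length_counts_from_file(path):
    """Open a plain or gzipped FASTQ file and count read lengths."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return length_counts(handle)


# Tiny in-memory example: two 8-base reads and one 5-base read.
example = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "ACGTACGT", "+", "IIIIIIII",
    "@r3", "ACGTA", "+", "IIIII",
]
example_counts = length_counts(example)
print(sorted(example_counts.items()))
```

If a real file yields more than one distinct length, the reads have very likely been trimmed already. A single length does not prove the opposite, but it is consistent with untrimmed data.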

7 weeks ago
ATpoint 89k

The term is adapters, not primers. In typical RNA-seq, you would simply run FastQC and check whether it identifies adapter contamination. Most protocols use the Illumina universal adapter, which you would subsequently trim using cutadapt or another trimmer. You can find its sequence online. Run FastQC first, and then feel free to post the output. If it finds no adapters then there are none in the read. That happens if the insert size of the DNA fragment is longer than the read length, and is often the case, so finding no adapters in the read is often expected.
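As a rough cross-check of what the adapter-content check looks for, one can count how many reads contain the start of the Illumina universal adapter. The sketch below is an assumption-laden illustration: it uses the 13-base prefix AGATCGGAAGAGC, does exact substring matching only (no mismatches, no partial matches at the read end), and the function name is invented for this example.

```python
# First 13 bases of the Illumina universal adapter (assumed here to be
# the relevant adapter; other kits use different sequences).
ILLUMINA_UNIVERSAL_PREFIX = "AGATCGGAAGAGC"


def fraction_with_adapter(sequences, adapter=ILLUMINA_UNIVERSAL_PREFIX):
    """Return the fraction of sequences containing an exact adapter match."""
    total = 0
    hits = 0
    for sequence in sequences:
        total += 1
        if adapter in sequence.upper():
            hits += 1
    return hits / total if total else 0.0


example_reads = [
    "ACGTACGTACGTAGATCGGAAGAGCACACG",  # insert followed by adapter
    "ACGTACGTACGTACGTACGTACGTACGTAC",  # insert only
]
example_fraction = fraction_with_adapter(example_reads)
print(example_fraction)
```

A fraction close to zero on real reads agrees with a clean FastQC adapter plot. Dedicated trimmers are more sensitive than this exact-match count, so treat it as a sanity check only.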


Thank you for the prompt reply! I ran FastQC on the FASTQ files and it appears that no adapter sequences were detected in the files.

[Image: FastQC adapter content]

if it finds no adapters then there are none in the read. That happens if the insert size of the DNA fragment is longer than the read length, and is often the case, so finding no adapters in the read is often expected.

Let me know if this sounds right: you are saying that if the DNA fragment (the part we are trying to sequence) is longer than the read length (150 bp based on the FastQC report), we will not see the adapter (which is attached to the flow cell and covalently bonded to the DNA fragment).
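The insert-versus-read-length reasoning can be illustrated with a toy model. This is a deliberate simplification: the template is modelled as insert followed by a 13-base adapter prefix, the read is just the first 150 bases of that template, and all names are hypothetical.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumed adapter prefix, for illustration only
READ_LENGTH = 150          # read length reported by FastQC in this thread


def simulate_read(insert, adapter=ADAPTER, read_length=READ_LENGTH):
    """Return the bases read from the start of insert + adapter."""
    template = insert + adapter
    return template[:read_length]


def adapter_start(read, adapter=ADAPTER):
    """Return the position where the adapter begins, or -1 if absent."""
    return read.find(adapter)


long_insert = "ACGT" * 50   # 200 bases, longer than the read
short_insert = "ACGT" * 25  # 100 bases, shorter than the read

long_result = adapter_start(simulate_read(long_insert))
short_result = adapter_start(simulate_read(short_insert))
print(long_result, short_result)
```

In this model the 200-base insert fills the whole read, so no adapter appears, while the 100-base insert is read through and the adapter starts at position 100. Adapter sequence shows up in a read only when the insert is shorter than the read length.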

In this case, based on the given data, is it correct that I should not need to use cutadapt to trim adapters (as there appear to be none in the dataset)?

Here is an example of the cutadapt script I was given to run, and its output for one of the sequences. Based on the --overlap=3 flag, I think this is overly permissive and is resulting in the actual sample being truncated. Would you agree with that assessment?

[Image: cutadapt output]
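To put a number on how permissive a 3-base overlap is, here is a back-of-the-envelope estimate of how many reads would end in a given number of adapter-matching bases purely by chance. It assumes uniformly random, independent bases and exact matching, which real reads do not strictly satisfy, and the function name is invented for this sketch. Note that 3 is also cutadapt's documented default minimum overlap.

```python
def expected_random_matches(n_reads, overlap_length):
    """Expected number of reads whose last `overlap_length` bases match the
    first `overlap_length` adapter bases by chance (uniform random bases)."""
    return n_reads * 0.25 ** overlap_length


n_reads = 1_000_000  # hypothetical library size
for overlap in (3, 5, 10):
    print(overlap, expected_random_matches(n_reads, overlap))
```

Under these assumptions, about 1 read in 64 (roughly 1.6%) loses 3 bases to a chance match, whereas a 10-base chance match is expected less than once per million reads. Short trims at low overlap are therefore mostly noise rather than adapter, but each one removes only a few bases from the end of the read.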


Here, there is no adapter contamination, so no need to trim, indeed.
