RNA-seq data: Trim away oligo with known sequence
0
1
Entering edit mode
6.5 years ago
arvchi ▴ 60

I have RNA-seq data for 24 samples that were prepared with smart-seq2 library prep. I have noticed in fastQC that I have many overrepresented reads that are identical, or closely resemble, parts of my oligo primer that I used in reverse transcription (oligo poly-T sequence).

Is there any program for aligning these reads with the oligo sequence, and trim them away if they align well with the oligo sequence? Have googled a bit and asked around but no one has heard of this, which surprises me.

Thankful for any advice.

RNA-Seq • 2.5k views
ADD COMMENT
0
Entering edit mode

If I understood your question correctly you could use Trimmomatic, to which you can add specific adapters to trim against.

ADD REPLY
0
Entering edit mode

It's not an adapter though, it's a long oligonucleotide. I have used TrimGalore to trim adapters, but will it work on this additional sequence too? This oligo sequence is longer than the reads, and there could be mismatches, so the reads are not identical to the oligo sequence. It needs some sort of aligning algorithm and trimming based on probability.

ADD REPLY
0
Entering edit mode

I suggest that you not get distracted by the over-represented sequences in FastQC. If you have scanned/trimmed the adapters (both Illumina and your kit specific ones) then go on to the actual alignment and analysis of the data. IF there is a downstream problem noted (e.g. with alignments etc) then backtrack and look at this some more.

ADD REPLY
0
Entering edit mode

There has been downstream problems (relatively poor alignment, but more worrisome there are a lot of counts assigned as "no feature", which is odd because this is mRNA).

I am not sure if this is related to the oligo being sequenced. But anyway, the oligo is overrepresented and surely doesn't do anything good, so it seems better to trim it away. Moreover, seeing the percentage of trimmed away reads might provide insight too.

ADD REPLY
0
Entering edit mode

If things are not aligning then trimming that oligo may not make a significant difference. You should take a sample of the reads that do not align and then do blast with their sequence at NCBI to see if you have a problem of contamination or some other basic issue. If you have more counts assigned to no feature (have you examined alignments in IGV?) and if the aligned reads mapping there are uniformly scattered across the genome then you could potentially have DNA contamination in your original RNA. Also check to see if you have rRNA contamination.

Like I said below. You can easily trim thereads by providing sequence of the oligo with the command below and then everything to the right of that sequence can be trimmed by adding ktrim=r option for bbduk.sh. An example below

bbduk.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=clean_R1.fq.gz out2=clean_R2.fq.gz  literal=sequence_of_oligo ktrim=r k=15

Adjust value of k= depending on length of the sequence you are providing for the oligo.

ADD REPLY
0
Entering edit mode

Thanks. Alignment is around 80 %, with the rest being mainly multi mapping (STAR report). However, when counting features, only a few percent is assigned to a feature. 40-70 % is unassigned due to multi-mapping, and the rest is "no feature".

I am not too concerned about the oligo any more, but still have some questions. 1) How come the multi-mapping proportion is higher when counting features with featurecounts compared to the STAR alignment report? 2) Is there any standard way of checking for rRNA and/or genomic DNA? For genomic DNA, I have checked alignments in IGV to look for a uniform distribution, but I don't feel this visual approach is conclusive to me and IGV constantly crashed on my Mac.

Thanks for all the help

ADD REPLY
0
Entering edit mode

See this recent thread with my comments for answers to your rRNA questions: high % of reads mapped to multiple loci after STAR mapping If you don't work with human data then get the rDNA repeat for your organism.

ADD REPLY
0
Entering edit mode

the op in that thread has only 20 % uniquely mapped reads in STAR, I have around 80 %. Still featurecounts reports 40-70% multimapping. How come there can be such a big discrepancy between what STAR and fc reports?

ADD REPLY
0
Entering edit mode

Are you counting the STAR generated BAM file with featureCounts? What options are you using with that?

ADD REPLY
0
Entering edit mode

Yes, I count on the STAR generated BAM files. No additional options really, featureCounts -T 16 -a $gtf -g gene_id -o [path to output] [path-to-BAM].

For instance, in one of my samples, multimapping is 13.5 % in STAR, but 37.4 % in FeatureCounts.

ADD REPLY
0
Entering edit mode

Unless you add -M option multi-mapping reads are not counted by featureCounts. Are you referring to multi-overlapping reads perhaps? You would also want to add -p if you are using a PE dataset so fragments are counted instead of reads.

Having said that you should not be counting multi-mapping alignments.

ADD REPLY
0
Entering edit mode

They are not counted, you are right, but featurecounts reports percentages of reads that were not assigned due to multi-mapping, and this percentage is not consistent with what is reported by STAR.

For instance, in one sample FeatureCounts reports that 37 % of the reads were not assigned to any feature because of multimapping, but STAR reports that 13.5 % of the reads were mapped to multiple loci.

ADD REPLY
0
Entering edit mode

bbduk.sh from BBMap suite using the literal=sequence1,sequence2 etc.

ADD REPLY

Login before adding your answer.

Traffic: 1392 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6