FastQC & trimming - RNA-seq data
3.8 years ago by dietrima

Hi,

I am new to transcriptome work - I have 150 bp PE data from two strains, 3 biological reps of each from a HiSeq4000 run. We are planning on doing genome-guided assembly and ultimately DGE analysis.

My data passed all of the FastQC modules except for failures (red) in ‘per base sequence content’, ‘sequence duplication levels’, & ‘kmer content’ – all of which I have read are not applicable to RNA-seq data. Some level of adapter content is present (flagged yellow). The only difference between samples is in overrepresented sequences in some samples; I’ve blasted them against the genome, & they are indeed found in the genome.

1) The more I read about interpreting FastQC reports, the more I think I may not need much (or any?) quality trimming based on the raw-read reports. The quality looks good for the modules that are applicable to RNA-seq data. Would you agree? I've read that it is better to keep quality trimming to a minimum if possible.

2) I do have adapters present in all reads, which means some of the library inserts are short, so I will need to trim to remove adapters. I've seen papers saying this isn't necessary, though. Or maybe it depends on the downstream analyses?

3) Last question – how do I decide on a minimum length cut-off? Clearly some of the inserts are short, but I am not sure how to choose a minlen. What is appropriate for 150 bp PE data? Nothing in the QC reports seems to help me decide. Is that true? Is there a commonly accepted minlen for 150 bp reads?

Thanks!

rna-seq next-gen

Quality-based trimming should not be needed for most data of recent vintage; kits and prep methods are now mature enough.

If your data has some adapter contamination, most aligners will manage it by soft-clipping the adapters during alignment. If you need to do any de novo assembly work, you should scan/trim your data.

Ideally you should not have inserts smaller than 150 bp in standard RNA-seq libraries, but if you do, you can decide what minimum length to keep (40-50 bp is reasonable). Remember that shorter reads will have trouble aligning uniquely and would likely not be counted if they multi-map in downstream processing.
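
If you do decide to scan/trim, here's a minimal sketch with Cutadapt (file names are placeholders, and AGATCGGAAGAGC is the common Illumina TruSeq adapter prefix - substitute whatever FastQC reports for your libraries):

    # Trim adapters from both mates and drop pairs where either read
    # ends up shorter than 50 bp after trimming.
    cutadapt \
        -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
        --minimum-length 50 \
        -o sample_R1.trimmed.fastq.gz -p sample_R2.trimmed.fastq.gz \
        sample_R1.fastq.gz sample_R2.fastq.gz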


Aligners will soft-clip adapters while mapping, but might they mess up assembly?


I did say in my comment above:

If you need to do any de novo assembly work, you should scan/trim your data.


Adapters are present in all reads? Are all of your inserts so short that the reads run through into the adapter on the other side?


No, not all sequences. Assuming the Y-axis on the FastQC adapter content plot is % of sequences, at most around 10% of reads have adapters.

3.8 years ago by Shalu Jhanwar

For differential gene expression (DGE) analysis, I'd recommend performing adapter trimming. Trim Galore (a wrapper around Cutadapt) can be used for trimming; by default it discards reads or read pairs that end up shorter than 20 bp.

You can find the documentation here: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
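
A minimal sketch of a paired-end Trim Galore run (file names are placeholders; --length 20 is the default and is shown only to make the cut-off explicit):

    # Auto-detect and trim adapters, discard pairs where either read
    # drops below 20 bp, and re-run FastQC on the trimmed output.
    trim_galore --paired --length 20 --fastqc \
        sample_R1.fastq.gz sample_R2.fastq.gz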

3.8 years ago by Arindam Ghosh

Minimal quality trimming can be done to remove low-quality bases (Phred < 20) and adapters, if present.

For RNA-seq data I usually follow a rule of thumb of setting the minimum length to 80% of the read length (120 bp for 150 bp reads). I haven't come across any standard for the minimum length to set, but longer reads are better for alignment.

For expression analysis, reads of 50 bp and longer are fine (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4531809/).
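
As a sketch of those settings with Trim Galore from the answer above (file names are placeholders; 120 bp is 80% of the 150 bp read length in question):

    # Quality-trim 3' ends at Phred 20, remove adapters, and keep only
    # pairs where both reads are still at least 120 bp long.
    trim_galore --paired --quality 20 --length 120 \
        sample_R1.fastq.gz sample_R2.fastq.gz

Note that with --paired, both mates are discarded together if either one falls below the length cut-off, unless unpaired reads are explicitly retained.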
