Question

Per sequence GC content fail linked to rRNA ?

2

Entering edit mode

7.6 years ago

cjmb3 ▴ 30

Hi everyone,

I'm new to RNA-seq and I'm struggling with the QC of my data. I performed Illumina paired-end stranded RNA-seq on human cells (ran on HiSeq). I did FASTQC on some samples and I have a high GC content and fail the "Per sequence GC content" module, with several peaks around 60 or 80% GC :

GC content

It looks like most of the over-represented sequences are rRNA and I've read that these usually have high GC content and could explain the failure of this module. I'm just wondering how it will affect my downstream analysis (since I don't know how much of my reads are made of these rRNA) and if there is a way to asses the amount/remove these rRNA sequences before aligning to the genome, to see if it improves the quality of my data ?

Many thanks for your help. Christophe

RNA-Seq • 6.4k views

ADD COMMENT • link updated 7.6 years ago by Ron ★ 1.2k • written 7.6 years ago by cjmb3 ▴ 30

1

Entering edit mode

I'd suggest you check out MultiQC and run it over all your fastqc reports to see if this is a sample level event, or an experiment level event. GC content distributions that aren't normally distributed, can be indicative of sample contamination (see here). I think attempting to remove the rRNA sequences, as @Antonio suggested would be a useful stepping stone.

ADD REPLY • link 7.6 years ago by andrew.j.skelton73 6.6k

score 1 · Answer 1 · 2016-10-07

1

Entering edit mode

7.6 years ago

Antonio R. Franco ★ 5.1k

Some suggestions..

Use BBSplit to deplete your sequences of the rRNA in your fastq files. Here you have information on how to do it. Then you can repeat your analysis
You don't give information about the quality of your sequences. Sometimes a trimming for quality fixes some if not all of these problems, because at least in my experiences, most of the bad quality sequences are assigned to G+C

ADD COMMENT • link 7.6 years ago by Antonio R. Franco ★ 5.1k

0

Entering edit mode

Hi Antonio, thanks for your answer, I'm gonna check BBSplit! The quality of my sequences is actually very good. I did 100bp long sequencing and I didn't have to trim at all because of quality, for this particular sample for example I have Q(phred33)>32. In FASTQC I only fail the modules for duplication levels and bp/GC content.

ADD REPLY • link 7.6 years ago by cjmb3 ▴ 30

2

Entering edit mode

Get the human rDNA repeat unit from the link in this post to use with BBSplit: entire human rDNA

ADD REPLY • link 7.6 years ago by GenoMax 142k

0

Entering edit mode

Thank you all for your help, the QC is much better now that I have removed the rRNA reads! I lost quite a lot of reads for some samples but at least what's left is "clean".

ADD REPLY • link 7.6 years ago by cjmb3 ▴ 30

0

Entering edit mode

Hi again! After looking in details at my new fastq files (depleted from rRNA), I wonder if I've not removed too much reads since for some samples I depleted up to 50% of the reads ! Could you please tell me if what I've done seems accurate to you ?

I first indexed the human rDNA repeat unit (from the link given by genomax2) and then ran BBSplit keeping all the default parameters :

bbmap.sh ref=humanrRNAseq.fasta

bbsplit.sh in1=R1.fq.gz in2=R2.fq.gz ref=humanrRNAseq.fasta basename=out_%.fq outu1=clean_R1.fq outu2=clean_R2.fq (R1 and R2 being the paired reads of 1 sample)

But since I want to remove rRNA from my reads and keep the rest for downstream analysis, I guess I have to set BBSplit to be highly precise to avoid false positive. Which makes me wonder whether the default parameters are accurate in my case... I found this comment on the BBMap guide online : To map quickly with very high precision and lower sensitivity, as when removing contaminant reads specific to a genome without risking false-positives: bbmap.sh minratio=0.9 maxindel=3 bwr=0.16 bw=12 fast minhits=2 qtrim=r trimq=10 untrim idtag printunmappedcount kfilter=25 maxsites=1 k=14

Do you know if this code would work fine with bbsplit.sh and if it makes sense to run like this instead of running in default mode ?

Thanks a lot for your help.

ADD REPLY • link 7.6 years ago by cjmb3 ▴ 30

0

Entering edit mode

If the samples were not properly depleted then it is not unreasonable to see a large % of reads being lost (after all rRNA is 90%+ of total RNA). I will tag @Brian separately so he is notified about this thread. He may have specific suggestions o what parameters you could try.

ADD REPLY • link 7.6 years ago by GenoMax 142k

0

Entering edit mode

Tagging: Brian Bushnell

ADD REPLY • link 7.6 years ago by GenoMax 142k

1

Entering edit mode

Hi, Thanks for this comment and tag, I actually tried new parameters on a couple of samples and got a different result (less reads binned) but I'm not sure how it worked, since the command I typed and the report I got don't really match (I'm sorry if there was an obvious mistake, I'm a real beginner!).

So I typed this :

bbsplit.sh minratio=0.9 maxindel=3 bwr=0.16 bw=12 fast minhits=2 qtrim=r trimq=10 untrim idtag printunmappedcount kfilter=25 maxsites=1 k=14 in1=read1.fq.gz in2=read2.fq.gz ref=humanrRNAseq.fasta basename=out_%.fq outu1=clean1.fq outu2=clean2.fq

And I got this report :

Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, minratio=0.9, maxindel=3, bwr=0.16, bw=12, fast, minhits=2, qtrim=r, trimq=10, untrim, idtag, printunmappedcount, kfilter=25, maxsites=1, k=14, in1=read1.fq.gz, in2=read2.fq.gz, ref=humanrRNAseq.fasta, basename=out_%.fq, outu1=clean1.fq, outu2=clean2.fq]

Converted arguments to [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, minratio=0.9, maxindel=3, bwr=0.16, bw=12, fast, minhits=2, qtrim=r, trimq=10, untrim, idtag, printunmappedcount, kfilter=25, maxsites=1, k=14, in1=read1.fq.gz, in2=read2.fq.gz, basename=out_%.fq, outu1=clean1.fq, outu2=clean2.fq, ref_humanrRNAseq=humanrRNAseq.fasta] Merged reference file ref/genome/1/merged_ref_528718303.fa.gz already exists; skipping merge. Ref merge time: 0.096 seconds. Executing align2.BBMap [tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100, ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, minratio=0.9, maxindel=3, bwr=0.16, bw=12, minhits=2, qtrim=r, trimq=10, untrim, idtag, printunmappedcount, kfilter=25, maxsites=1, k=14, in1=read1.fq.gz, in2=read2.fq.gz, outu1=clean1.fq, outu2=clean2.fq, ref=ref/genome/1/merged_ref_528718303.fa.gz, out_humanrRNAseq=out_humanrRNAseq.fq]

BBMap version 36.38 Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.650 Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560 Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.900 Retaining first best site only for ambiguous mappings. NOTE: Ignoring reference file because it already appears to have been processed. NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt Set genome to 1

Loaded Reference: 0.035 seconds. Loading index for chunk 1-1, build 1 No index available; generating from reference genome: /home/cjmb3/RNAseqDC/Trimgalore-Merged/ref/index/1/chr1_index_k14_c15_b1.block Indexing threads started for block 0-1 Indexing threads finished for block 0-1 Generated Index: 4.709 seconds. Analyzed Index: 13.623 seconds. Started output stream: 0.118 seconds. Cleared Memory: 0.231 seconds. Processing reads in paired-ended mode. Started read stream. Started 32 mapping threads.

I mostly wonder whether the code I used, which supposedly work with BBMap, works with BBSplit as well, and why some parameters indicated in the report are not matching those I entered ? It did work and gave me a different output than my first try with BBSplit default parameters, but I'm not sure I understand why.

Thanks again for your help!

ADD REPLY • link 7.6 years ago by cjmb3 ▴ 30

score 0 · Answer 2 · 2016-10-07

0

Entering edit mode

7.6 years ago

Ron ★ 1.2k

You can use Split_bam.py from RseQc to get the number of reads that belong to rRNA http://rseqc.sourceforge.net/#spilt-bam-py

This will tell the number of ribosomal reads in the already aligned bam file by splitting the bam file into two files(one for rRNA and other without rRNA). You can use the bam file without rRNA for downstream analysis.

ADD COMMENT • link 7.6 years ago by Ron ★ 1.2k

1

Entering edit mode

This program requires a mapped BAM file and a gene list in BED format. BBSplit does the job without these requirements

ADD REPLY • link 7.6 years ago by Antonio R. Franco ★ 5.1k