ribosomeProfilingQC help
1
0
Entering edit mode
26 days ago
tommy ▴ 40

Hello, I am new to ribo-seq analysis. Here is my working flow:

  1. Read quality control was performed with FastQC.
  2. The reads were trimmed with cutadapt v5.0, removing short reads (<20 nt) and adapter sequences.
  3. FastQC after adaptor trimming
  4. Deduplication was performed using a seqkit (v2.9.0).
  5. Read mapping was performed to the hg38 genome assembly with the GENCODE v29 basic genome annotation using STAR v2.7.6a
  6. The alignment results were stored in the Aligned.sortedByCoord.out.bam file.

I found tutorial for RibosomeProfilingQC and I analyzed the data downloaded from journal. After removing rRNA, tRNA, snRNA, snoRNA, misc_RNA, and RepeatMasker annotations, I obtained a file named fil.bam.

My questions are:

  1. Since my input BAM file is already mapped (to hg38), do I need to map the fil.bam file again to hg38 using STAR as the tutorial suggests?
  2. I used hg38 and GENCODE v29 GTF files downloaded from Ensembl for mapping with STAR. However, the tutorial uses UCSC references. Could this cause problems with the downstream analysis?

Additionally, I’ve been receiving warning messages in my downstream analysis, and the QC shows that the data quality is not good. Could this be related to the mapping or reference files used?

> estimatePsite(bamfile, CDS, genome)
[1] 13
Warning message:
In .merge_two_Seqinfo_objects(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': GL000008.2, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000208.1, GL000213.1, GL000214.1, GL000216.2, GL000218.1, GL000219.1, GL000220.1, GL000221.1, GL000224.1, GL000225.1, GL000226.1, KI270302.1, KI270303.1, KI270304.1, KI270305.1, KI270310.1, KI270311.1, KI270312.1, KI270315.1, KI270316.1, KI270317.1, KI270320.1, KI270322.1, KI270329.1, KI270330.1, KI270333.1, KI270334.1, KI270335.1, KI270336.1, KI270337.1, KI270338.1, KI270340.1, KI270362.1, KI270363.1, KI270364.1, KI270366.1, KI270371.1, KI270372.1, KI270373.1, KI270374.1, KI270375.1, KI270376.1, KI270378.1, KI270379.1, KI270381.1, KI270382.1, KI270383.1, KI270384.1, KI270385.1, KI270386.1, KI270387.1, KI270388.1, KI270389.1, KI270390.1, KI270391.1, KI270392.1, KI270393.1, KI270394.1, KI270395.1, KI270396.1, KI270411.1, KI270412.1, KI270414.1, KI270417.1, KI270418.1, KI270419.1, KI270420.1, KI270422.1, KI270423.1, KI270424.1, KI270425.1, KI27042 [... truncated]

Thanks for helping out.

ribo-seq • 410 views
ADD COMMENT
0
Entering edit mode
26 days ago
Jack Tierney ▴ 410

I apologise that this is so long.

TLDR:

  • If you used sensible alignment parameters you would not need to realign but I suspect you would not have as the STAR defaults do not make sense for RIbo-Seq. Re-align and see if the quality improves
  • Using Ensembl will not impact data quality
  • Ensure the sequences that are not shared between the two SeqInfo objects are not any of the 24 chromosomes and you will be okay but ideally also ensure the annotation and references match.

Long Winded Answer

I think it is important to understand the logic behind each step taken in such a workflow so I am glad you came here to ask.

The primary goal when processing this data type is to ensure what we end up with at the end are the alignmnets for the set of high confidence ribosome protected fragments (RPFs). The initial length exclusion (no reads < 20 nt) looks to exclude RNA fragments that, in theory, could not come about via ribosome protection simply because the ribosome is so big it would cover far more than 20 bases. There is evidence that there are populations of true RPFs with shorter read lengths but the conservative approach you are following is perfectly reasonable.

The exclusion of reads that uniquely align to these non coding RNAs is another of the common steps utilised to identify and exclude low confidence RPFs. rRNA for example should not and cannot be translated so any reads that align there are considered contaminating our set of clean RPFs. So at the point of the workflow that you reached you now have a higher confidence set of RPFs that you ready to analyse.

I don't believe you need to re-align if the initial alignment parameters are sensible. Basically you do not want to allow multiple mappings of reads (especially not 10 which is the STAR default) and if you are confident in the removal of adapters and UMIs I have anecdotally seen better results when soft-clipping is not allowed. I suspect the authors of this tutorial used Bowtie to do a quicker mapping to exclude these ncRNA reads and then TopHat for better handling of indels and splice sites but I cannot be sure.

Using Ensembl vs UCSC annotations should not cause issues for the analysis itself.

The warning message you show suggests your reference and annotation (I am guessing that is x and y in your script) do not exactly match and as a result contain some differing sequences.

Poor data quality could be due to the mapping params but realistically many people struggle to produce high quality Ribo-Seq data on first attempt so it equally could just be aperiodic.

ADD COMMENT
0
Entering edit mode

Hello, Jack. Thanks for taking the time to answer my question. I believe my dataset should be valid, as I downloaded it directly from journal. Yesterday, I reran my STAR program with the parameter --outFilterMultimapNmax 1, and by reviewing the mapping results, I believe the mapping is overall good.

However, I discovered that the error was caused by the BAM file. When I examined the header of the BAM file, I noticed that, instead of normal characters, I found unusual entries that match the names mentioned in the warning message.

I ultimately use samtools to remove all the chromosomes that start with GL and KI, and only kept the conventional chromosomes. I do find some posts talking about this, but do you think filtering them all out is a good choice?

After filering out, there are no warning messages.

                                     Started job on |   Jan 24 09:36:52
                             Started mapping on |   Jan 24 09:41:13
                                    Finished on |   Jan 24 09:41:41
       Mapping speed, Million of reads per hour |   380.78

                          Number of input reads |   2961604
                      Average input read length |   30
                                    UNIQUE READS:
                   Uniquely mapped reads number |   2188127
                        Uniquely mapped reads % |   73.88%
                          Average mapped length |   29.67
                       Number of splices: Total |   326556
            Number of splices: Annotated (sjdb) |   319970
                       Number of splices: GT/AG |   322839
                       Number of splices: GC/AG |   2676
                       Number of splices: AT/AC |   404
               Number of splices: Non-canonical |   637
                      Mismatch rate per base, % |   0.33%
                         Deletion rate per base |   0.00%
                        Deletion average length |   1.25
                        Insertion rate per base |   0.00%
                       Insertion average length |   1.16
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   0
             % of reads mapped to multiple loci |   0.00%
        Number of reads mapped to too many loci |   599412
             % of reads mapped to too many loci |   20.24%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   167142
                 % of reads unmapped: too short |   5.64%
                Number of reads unmapped: other |   6923
                     % of reads unmapped: other |   0.23%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%
names(bamHeader[[1]]$targets)
  [1] "chr1"       "chr2"       "chr3"       "chr4"       "chr5"      
  [6] "chr6"       "chr7"       "chr8"       "chr9"       "chr10"     
 [11] "chr11"      "chr12"      "chr13"      "chr14"      "chr15"     
 [16] "chr16"      "chr17"      "chr18"      "chr19"      "chr20"     
 [21] "chr21"      "chr22"      "chrX"       "chrY"       "chrM"      
 [26] "GL000008.2" "GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2"
 [31] "GL000208.1" "GL000213.1" "GL000214.1" "GL000216.2" "GL000218.1"
 [36] "GL000219.1" "GL000220.1" "GL000221.1" "GL000224.1" "GL000225.1"
 [41] "GL000226.1" "KI270302.1" "KI270303.1" "KI270304.1" "KI270305.1"
 [46] "KI270310.1" "KI270311.1" "KI270312.1" "KI270315.1" "KI270316.1"
 [51] "KI270317.1" "KI270320.1" "KI270322.1" "KI270329.1" "KI270330.1"
 [56] "KI270333.1" "KI270334.1" "KI270335.1" "KI270336.1" "KI270337.1"
 [61] "KI270338.1" "KI270340.1" "KI270362.1" "KI270363.1" "KI270364.1"
 [66] "KI270366.1" "KI270371.1" "KI270372.1" "KI270373.1" "KI270374.1"
 [71] "KI270375.1" "KI270376.1" "KI270378.1" "KI270379.1" "KI270381.1"
 [76] "KI270382.1" "KI270383.1" "KI270384.1" "KI270385.1" "KI270386.1"
 [81] "KI270387.1" "KI270388.1" "KI270389.1" "KI270390.1" "KI270391.1"
 [86] "KI270392.1" "KI270393.1" "KI270394.1" "KI270395.1" "KI270396.1"
 [91] "KI270411.1" "KI270412.1" "KI270414.1" "KI270417.1" "KI270418.1"
 [96] "KI270419.1" "KI270420.1" "KI270422.1" "KI270423.1" "KI270424.1"
[101] "KI270425.1" "KI270429.1" "KI270435.1" "KI270438.1" "KI270442.1"
[106] "KI270448.1" "KI270465.1" "KI270466.1" "KI270467.1" "KI270468.1"
[111] "KI270507.1" "KI270508.1" "KI270509.1" "KI270510.1" "KI270511.1"
[116] "KI270512.1" "KI270515.1" "KI270516.1" "KI270517.1" "KI270518.1"
[121] "KI270519.1" "KI270521.1" "KI270522.1" "KI270528.1" "KI270529.1"
[126] "KI270530.1" "KI270538.1" "KI270539.1" "KI270544.1" "KI270548.1"
[131] "KI270579.1" "KI270580.1" "KI270581.1" "KI270582.1" "KI270583.1"
[136] "KI270584.1" "KI270587.1" "KI270588.1" "KI270589.1" "KI270590.1"
[141] "KI270591.1" "KI270593.1" "KI270706.1" "KI270707.1" "KI270708.1"
[146] "KI270709.1" "KI270710.1" "KI270711.1" "KI270712.1" "KI270713.1"
[151] "KI270714.1" "KI270715.1" "KI270716.1" "KI270717.1" "KI270718.1"
[156] "KI270719.1" "KI270720.1" "KI270721.1" "KI270722.1" "KI270723.1"
[161] "KI270724.1" "KI270725.1" "KI270726.1" "KI270727.1" "KI270728.1"
[166] "KI270729.1" "KI270730.1" "KI270731.1" "KI270732.1" "KI270733.1"
[171] "KI270734.1" "KI270735.1" "KI270736.1" "KI270737.1" "KI270738.1"
[176] "KI270739.1" "KI270740.1" "KI270741.1" "KI270742.1" "KI270743.1"
[181] "KI270744.1" "KI270745.1" "KI270746.1" "KI270747.1" "KI270748.1"
[186] "KI270749.1" "KI270750.1" "KI270751.1" "KI270752.1" "KI270753.1"
[191] "KI270754.1" "KI270755.1" "KI270756.1" "KI270757.1"

my code of building index and mapping

genome_index=/ribo_seq/ref_star/
fasta_file=/ref_star/GRCh38.primary_assembly.genome.fa
gtf_file=/ref_star/gencode.v29.annotation.gtf

# Create genome index
$star_dir --runThreadN 8 \
    --runMode genomeGenerate \
    --genomeDir $genome_index \
    --genomeFastaFiles $fasta_file \
    --sjdbGTFfile $gtf_file \
    --sjdbOverhang 42

# Run STAR mapping
$star_dir --runThreadN 8 \
    --genomeDir $genome_index \
    --readFilesIn $file \
    --outFileNamePrefix ${output_dir}${base_name}_ \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode TranscriptomeSAM GeneCounts \
    --outFilterMultimapNmax 1
ADD REPLY
0
Entering edit mode

Hi Tommy, It will depend on the analysis you are trying to do but if you are comparing individual genes across conditions or looking for novel translated regions then there should be no issue with what you did in my opinion.

ADD REPLY
0
Entering edit mode

Thanks, Jack, for your help! :)

ADD REPLY

Login before adding your answer.

Traffic: 4583 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6