How to build ribosomal interval files to use in CollectRnaSeqMetrics for hg38.p14
1
0
Entering edit mode
27 days ago

Hi,

I have been following this post to build my own ribosomal intervals for hg38.p14

Referenc Genome: GRCh38.p5 Ensembl release 84
#
# 
# 1. Prepare chromosome sizes file  from fasta sequence if needed.
#
#     ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
#     samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
#     cut -f1,2 Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai > sizes.genome
#
# 2. Make an interval_list file suitable for CollectRnaSeqMetrics.jar.
#
# Ensembl genes:
#
#   ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
#
#
#
# Picard Tools CollectRnaSeqMetrics.jar:
#
#   https://broadinstitute.github.io/picard/command-line-overview.html#CollectRnaSeqMetrics

chrom_sizes=sizes.genome

# rRNA interval_list file -------------------------------------------------

# Genes from Ensembl.

genes=/home/dell/Documents/Arindam/Work/ReferenceGenome/Human_84/Annotation/gtf/Homo_sapiens.GRCh38.84.gtf


# Output file suitable for Picard CollectRnaSeqMetrics.jar.

rRNA=GRCh38.p5.rRNA.interval_list

# Sequence names and lengths. (Must be tab-delimited.)
perl -lane 'print "\@SQ\tSN:$F[0]\tLN:$F[1]\tAS:GRCh38"' $chrom_sizes | \
    grep -v _ \ >> "$rRNA"

# Intervals for rRNA transcripts.
grep 'gene_biotype "rRNA"' $genes | \
    awk '$3 == "gene"' | \
    cut -f1,4,5,7,9 | \
    perl -lane '
        /gene_id "([^"]+)"/ or die "no gene_id on $.";
        print join "\t", (@F[0,1,2,3], $1)
    ' | \
    sort -k1V -k2n -k3n \ >> "$rRNA"

But, when I ran the command, I received an error message telling that 'GRCh38.p14.rRNA.interval_list does not contain intervals', what should I do?

I really stuck right now, really clueless what to do next. Any help and guidance are highly appreciated.

hg38.p14 ribosomal-intervals • 430 views
ADD COMMENT
0
Entering edit mode
27 days ago
noodle ▴ 570

Edited

IMO if you want to use rRNA intervals for QC you should just stick with something published, like this;

Look for the file GRCh38_rRNA.bed from RSeQC: An RNA-seq Quality Control Package

And then create the interval list using picard (assumes you've already created the dictionary for the genome);

picard BedToIntervalList -I GRCh38_rRNA.bed -O GRCh38_rRNA.bed.intervalList.txt --SEQUENCE_DICTIONARY GRCh38.dict

BUT

If you want the specific files you reference above to work you can try the below (depends on ucsctools);

wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz 
wget ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip *
picard CreateSequenceDictionary R=Homo_sapiens.GRCh38.dna.primary_assembly.fa O=Homo_sapiens.GRCh38.dna.primary_assembly.fa.dict
grep 'rRNA' Homo_sapiens.GRCh38.84.gtf  > Homo_sapiens.GRCh38.84.rRNA.gtf 
gtfToGenePred -genePredExt -geneNameAsName2 Homo_sapiens.GRCh38.84.rRNA.gtf Homo_sapiens.GRCh38.84.rRNA.gtf.genePred
awk 'BEGIN{OFS="\t"} {print $2, $4, $5, $3, $12}' Homo_sapiens.GRCh38.84.rRNA.gtf.genePred > Homo_sapiens.GRCh38.84.rRNA.gtf.genePred.shortlist
cat Homo_sapiens.GRCh38.dna.primary_assembly.fa.dict Homo_sapiens.GRCh38.84.rRNA.gtf.genePred.shortlist > Homo_sapiens.GRCh38.84.rRNA.intervals.list
ADD COMMENT
0
Entering edit mode

Thank you for your help. I created the rRNA interval list using yours, however when I ran CollectRnaSeqMetrics using this command

java -jar $PICARD_JAR/picard.jar CollectRnaSeqMetrics \
      I=/mnt/iusers01/jw01/c02544na/training/dummy/AB1_gencode44_trimmed_TruSeq3_readcountAligned.sortedByCoord.out.bam \
      O=AB1_output.RNA_Metrics \
      REF_FLAT=/mnt/iusers01/jw01/c02544na/refFlat.txt \
      STRAND=SECOND_READ_TRANSCRIPTION_STRAND \
      RIBOSOMAL_INTERVALS=/mnt/iusers01/jw01/c02544na/training/dummy/psa_transcriptomic_training/Homo_sapiens.GRCh38.111.rRNA.intervals.list 

and, I received this error message

To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException: Invalid interval record contains 1 fields: 1 9437668 9437778 - RNA5SP40
    at htsjdk.tribble.IntervalList.IntervalListCodec.parseIntervalString(IntervalListCodec.java:73)
    at htsjdk.tribble.IntervalList.IntervalListCodec.decode(IntervalListCodec.java:130)
    at htsjdk.samtools.util.IntervalList.fromReader(IntervalList.java:529)
    at htsjdk.samtools.util.IntervalList.fromPath(IntervalList.java:458)
    at htsjdk.samtools.util.IntervalList.fromFile(IntervalList.java:447)
    at picard.analysis.directed.RnaSeqMetricsCollector.makeOverlapDetector(RnaSeqMetricsCollector.java:79)
    at picard.analysis.CollectRnaSeqMetrics.setup(CollectRnaSeqMetrics.java:168)
    at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:145)
    at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:94)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:280)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:105)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:115)

What should I do to fix this?

Many thanks

ADD REPLY
0
Entering edit mode

Ah, I think the awk is printed space-separated instead of tab-separated. I updated my awk comment adding BEGIN{OFS="\t"}

ADD REPLY
0
Entering edit mode

Thanks a lot. It works perfectly :)

ADD REPLY
0
Entering edit mode

Just a final comment - it looks like the only rRNAs in the gtf are 5S and depending on how you do the library prep these might be lost during size selection. You should make sure that the larger rRNAs are included in this type of analysis, otherwise it won't be accurate.

ADD REPLY

Login before adding your answer.

Traffic: 2594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6