RSEM with ERCC spike-ins formatting issues, ERCC sequences never make it into the reference
1
0
Entering edit mode
4.4 years ago

I've been stuck on this for a few days now, and have talked with ThermoFisher, posted on the Google Group for RSEM, and have been unable to figure out what's happening. I have RNA seq data with ERCC spike-ins, and we want to align using STAR and quantify expression using RSEM. I've manually built the fasta and gtf files using the sequence file from thermo, as well as used the official fasta file provided by thermo. I've also tried building the reference with and without the STAR flag added. Every way I've built the reference with STAR has worked, I can look at the reference files, see the ERCC chromosomes, and I can find plenty of reads aligned to ERCC, and even ran featureCounts on my BAM files and gotten quantification of the ERCC transcripts. I'm pushing to switch to a different quantification tool, but I'd also like to figure out why RSEM isn't working.

Long wall of text, but the main question is, what should the header lines of ERCC chromosomes look like to work in RSEM properly?

To prepare the reference:

# downloads ercc fasta files
wget "https://tools.thermofisher.com/content/sfs/manuals/cms_095047.txt"
# converts windows returns to unix returns
tr -d '\r' <cms_095047.txt> cms_095047.tsv

# make ercc fasta file
cat cms_095047.tsv | awk -F "\t" -v OFS="" 'NR>1{print ">",$1," ",$2," ",$3," ",$4,"\n",$5}' > ercc.fa # change ERCC- to ERCC_ sed -i 's/ERCC-/ERCC_/g' ercc.fa # append to genome cat /fdb/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa ercc.fa \ > grch37-ensembl-ercc.fa head grch37-ensembl-ercc.fa ## >1 ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  Looks like a fasta file (the ## are artifacts, they don't exist in the actual file) Do a tail to make the appended ERCC chromosomes look correct. tail grch37-ensembl-ercc.fa ## >ERCC_00164 DQ516779 Ac03459963_a1 Ac03460059_a1 ## AGGAGCTCCAGTAGTTTTCCCCTCAAAAATTCCTGATAAGATTTCAACTTTATCCTCTTCTTTTCTTGGTGTTGAGAAGATGCTCTGCCCTGGTCTTCTCCTGTCAAGCTCTTTTTGGATATCCTCTTCAGATAAAGGCAGATTAGTTGGACATCCATCAACAACTGCTCCAACAGCCTTTCCATGACTTTCTCCAAAAACTGTAACTCTAAACATATCCCCATAGGTGTTCATTAATGTCACCAAAAATTTTTAATTGCTTAGTTTTACATTTAAAATAAAAATTAAAATAGTCAAAAAATAAAAAAGGTTTATCTGTAGAGAACATCCAAGTGTGCTGGTTCCTTAACTTTAACTTTCTTTTTCTCCATAATCTTCTCAACTGCCTTTCTAAAGTCATCCATTGTTACATAGTCCCTTAACTCCCTAATTGCATTCATCCCTGCCTCTGTGCAGATTGCCTTTAACTCAGCCCCTACACATCCTTCAGTCATCTTAGCTATTTCTTCTAAATTGACATCTTCCGCTAAATTCATCTTTCTTGTATGAATCTTCAATATCTCCAATCTACCCTTCTCATCAGGAGCTGGGACTTCTATGATTCTATCAAATCTTCCAGGTCTTAATATTGCAGGGTCTAAAATGTCAGGCCTGTTTGCGGCCCCAATTATCTTAACATCTCCCCTTGCATCGAATCCATCCATCTCTGCCAACAACTGCATTAATGTTCTCTGAACTTCCCTATCTCCACCAGTTAAAGCGTCTGTTCTCTTTGCTGCAATAGCATCAATCTCATCTATGAATATGATTGAAGGAGCTTTTTCTTTAGCCAATTTGAATATATCTTTAACTAACGGAGCCCCCTCTCCAATAAACTTCTTAACCAATTCAGAACCAACAACTCTTATAAAGGTAGCATTTGTTTCTGTAGCAACAGCTTTAGCTAATAATGTCTTTCCAGTTCCTGGTGGCCCGTAAAGAGAATACCTTTTGGTGGTTAAAAAAAAAAAAAAAAAAAAAAA ## >ERCC_00165 DQ668363 Ac03459964_a1 Ac03460060_a1 ## GATATGCGTTACGTGAGTCTGATAGCAGTTCACTACCTGGATATCTGATCCACTAGCTCGATCATGCTCACCCATAGTTTATCTGCATCACTCGTACTGAAATGCTCACATCGCAGGTAGAGCAGCATCGTAGAGCGTCAAGCTGCATCCTAGCGTCATGAGTCATAGTACCTCATGCTCACGTGATCTACCCTAGCTGACCGCTAATGACGGCAGTGCAACCTGAGATACCGACGGCATACTGTCGTCAACGTCAGGCAATGTGTCCGAACGGCGAGCTACGTCGCCTCACGGAGTAATCGCGTCCCTCTAGGTATAGTGCCGTCGGTTCAGGTCATATGTCGCGGGTTCTGCACATATCACGGACGTATCGCTATCAGACGGACGCTCTCGGACCTAAACCGTAGCTCTCGGCAAGATCGTCCTCGTCTCGAATATAGCGCCCTAGTGCTGCAAATGTCACCGCTATCTCGTAAGGGGTCCGTCTGTTGAGTTAGGCCTCCTCTCGTTGGATGTGAGCTCGGTTGCTTGGATGGTGCAGCTTACTTCGCGTACCTGCTGTTTGCATCAGTCCTCTGCATCTATAATCGCGTATCTCTCTCTAGTAGACCATATAGCCATCTAAGCGCTCGATATTCCACCTAAGTGGCGCCTATTGAACTAAGTGGCAGCCGAATGGACTATCGCTCCTCGATATGTACGGATAGGCCACGGCATGTACGAGCATAAGCCGAACTGCACGAGCATACCCGACACTGATCTGAGAGTCGCTTAAATCATCTGCGTGTCTTAGAGCTTATCGCCATGTCTGTCAACTGTACTGTCATCCTGTAACTGTAGCGTATGTGAAAAAAAAAAAAAAAAAAAAAAAA ## >ERCC_00168 DQ516776 Ac03459965_a1 Ac03460061_a1 ## CCAATGAACTCAGCTATTCTTCTTAACAAATAACTTTCTCCAGAAAATTCAAATGTATCATCTTCAGGATTCCACCTAAAAACATCATGTAATATAATATCATCAATTTTTGGGTCGTATTCAACAATCTCAGTTATACTCTCAGTTCTTCTAACAAATCTTCCTTTATAAATCAATCTAACCTGCATACATATGGCATTTAGTTGTTCAAGCATAATCTTTGGAATGTTCATTGGTTCAGCATTCAACCTCCTTATAACTGCCTCTGGGGATTTTGCGTGTATCGTTGATAACGCCAAATGTCCTGTAGTTATTGCTTGAAATAATATCTTCGCCTCCTCACCTCTAACCTCTCCAACAATTAAATAATCTGGTCTTTGCCTTAAAGCCGCTTTTAATAAATCCATCATAGTTATTTCATATTCTTCTCCACCGAATCCACTTCTTGTAGTTCCAGCAATCCAGTTTTCATGATACAACCTAATTTCTGGAGTATCCTCAATAGATACGATTTTCATTTGAGGAAGGATGAAAAGAGAGAATGCATTTAAAAGGGTGGTTTTTCCAGTAGCTACCTCTCCAGCAACCATAATAGAATTTTTATATTCAATGAGTAACCAAAGATATGCAAGCATCTCTGGAGAAATACTCCCATATCTTATTAAATCTGTTGGCAATATAGGAGTGTGTGTGAATTTTCTTATTGTAAATGTTGAACCATATCTTGAGATATCCCTTCCAAGGGTTACATTTAGCCTGCTACCATCTGGGAGAGAACCATCCACTATTGGATTAGCCAATGTTAAAGATTTTCCACACCTTTGGGCTAAGGATATACAAAACGAGTCTAATTCTTCATCAGTTTCAAATTTTATATTTGTCTTTAAATGTTCGTATTTTCTATGAAACACATACACTGGCTTTCCAACACCTGTGCAACTGATATCCTCCAAATTCTCATCTTTCATAAGAGCATCTATTTCCCATATCCAATGAGGTAAAAAAAAAAAAAAAAAAAAAAAAA ## >ERCC_00170 DQ516773 Ac03459966_a1 Ac03460062_a1 ## TATTGGTGGAGGGGCACAAGTTGCTGAAGTTGCGAGAGGGGCGATAAGTGAGGCAGACAGGCATAATATAAGAGGGGAGAGAATTAGCGTAGATACTCTTCCAATAGTTGGTGAAGAAAATTTATATGAGGCTGTTAAAGCTGTAGCAACTCTTCCACGAGTAGGAATTTTAGTTTTAGCTGGCTCTTTAATGGGAGGGAAGATAACTGAAGCAGTTAAAGAATTAAAGGAAAAGACTGGCATTCCCGTGATAAGCTTAAAGATGTTTGGCTCTGTTCCTAAGGTTGCTGATTTGGTTGTTGGAGACCCATTGCAGGCAGGGGTTTTAGCTGTTATGGCTATTGCTGAAACAGCAAAATTTGATATAAATAAGGTTAAAGGTAGGGTGCTATAAAGATAATTTAATAATTTTTGATGAAACCGAAGCGTTAGCTTTGGGTTATGAAACTCCATGATTTTCATTTAATTTTTTCCTATTAATTTTCTCCTAAAAAGTTTCTTTAACATAAATAAGGTTAAAGGGAGAGCTCTATGATTGTCTTCAAAAATACAAAGATTATTGATGTATATACTGGAGAGGTTGTTAAAGGAAATGTTGCAGTTGAGAGGGATAAAATATCCTTTGTGGATTTAAATGATGAAATTGATAAGATAATTGAAAAAATAAAGGAGGATGTTAAAGTTATTGACTTAAAAGGAAAATATTTATCTCCAACATTTATAGATGGGCATATACATATAGAATCTTCCCATCTCATCCCATCAGAGTTTGAGAAATTTGTATTAAAAAGCGGAGTTAGCAAAGTAGTTATAGACCCGCATGAAATAGCAAATATTGCTGGAAAAGAAGGAATTTTGTTTATGTTGAATGATGCCAAAATTTTAGATGTCTATGTTATGCTTCCTTCCTGTGTTCCAGCTACAAACTTAGAAACAAGTGGAGCTGAGATTACAGCAGAGAATATTGAAGAACTCATTCTTTAGATAATGTCTTAGGTTAAAAAAAAAAAAAAAAAAAAAAAA ## >ERCC_00171 DQ854994 Ac03459967_a1 Ac03460063_a1 ## CTGGAGATTGTCTCGTACGGTTAAGAGCCTCCGCCCGTCTCTGGGACTATGGACGGGCACGCTCATATCAGGCTATATTTGGTCCGGGTTATTATCGTCGCGGTTACCGTAATACTTCAGATCAGTTAAGTAGGGCCATATGCCTCGGGAATAAGCTGACGGTGACAAGGTTTCCCCCTAATCGAGACGCTGCAATAACACAGGGGCATACAGTAACCAGGCAAGAGTTCAATCGCTTAGTTTCGTGGCGGGATTTGAGGAAAACTGCGACTGTTCTTTAACCAAACATCCGTGCGATTCGTGCCACTCGTAGACGGCATCTCACAGTCACTGAAGGCTATTAAAGAGTTAGCACCCACCATTGGATGAAGCCCAGGATAAGTGACCCCCCCGGACCTTGGAGTTTCATGCTAATCAAAGAAGAGCTAATCCGACGTAAAGTTGCGGCGTTGATTACGCAGGATTGCGACCAAAGAACGAGAAAAAAAAAAAAAAAAAAAAAAAA  I've tried multiple variations of headers including changing - to _, and only including the ERCC-00171 and omitting the rest of the header. None of these have worked. I also prepared a GTF and appended it to the file. If you want to see the full process for that, there's a html document on the google groups RSEM post I linked above, and that goes over it. But, the head of the gtf is: head grch37-ensembl-ercc.gtf ## 1 processed_transcript exon 11869 12227 . + . exon_id "ENSE00002234944"; exon_number "1"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; transcript_name "DDX11L1-002"; transcript_source "havana"; tss_id "TSS15133"; ## 1 processed_transcript transcript 11869 14409 . + . gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; transcript_name "DDX11L1-002"; transcript_source "havana"; tss_id "TSS15133"; ## 1 transcribed_unprocessed_pseudogene exon 11872 12227 . + . exon_id "ENSE00002234632"; exon_number "1"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000515242"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; tss_id "TSS178604"; ## 1 transcribed_unprocessed_pseudogene transcript 11872 14412 . + . gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000515242"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; tss_id "TSS178604"; ## 1 transcribed_unprocessed_pseudogene exon 11874 12227 . + . exon_id "ENSE00002269724"; exon_number "1"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000518655"; transcript_name "DDX11L1-202"; transcript_source "ensembl"; tss_id "TSS192941"; ## 1 transcribed_unprocessed_pseudogene transcript 11874 14409 . + . gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000518655"; transcript_name "DDX11L1-202"; transcript_source "ensembl"; tss_id "TSS192941"; ## 1 transcribed_unprocessed_pseudogene exon 12010 12057 . + . exon_id "ENSE00001948541"; exon_number "1"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000450305"; transcript_name "DDX11L1-001"; transcript_source "havana"; tss_id "TSS138561"; ## 1 transcribed_unprocessed_pseudogene transcript 12010 13670 . + . gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000450305"; transcript_name "DDX11L1-001"; transcript_source "havana"; tss_id "TSS138561"; ## 1 transcribed_unprocessed_pseudogene exon 12179 12227 . + . exon_id "ENSE00001671638"; exon_number "2"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000450305"; transcript_name "DDX11L1-001"; transcript_source "havana"; tss_id "TSS138561"; ## 1 transcribed_unprocessed_pseudogene exon 12595 12721 . + . exon_id "ENSE00002270865"; exon_number "2"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000518655"; transcript_name "DDX11L1-202"; transcript_source "ensembl"; tss_id "TSS192941";  And the tail of the gtf, where the ERCC annotations were added: tail grch37-ensembl-ercc.gtf ## ERCC_00157 ercc gene 1 1019 . + . gene_id "GERCC_00157"; gene_version "1"; gene_name "ERCC_00157"; gene_source "ercc"; gene_biotype "ercc"; ## ERCC_00158 ercc gene 1 1027 . + . gene_id "GERCC_00158"; gene_version "1"; gene_name "ERCC_00158"; gene_source "ercc"; gene_biotype "ercc"; ## ERCC_00160 ercc gene 1 743 . + . gene_id "GERCC_00160"; gene_version "1"; gene_name "ERCC_00160"; gene_source "ercc"; gene_biotype "ercc"; ## ERCC_00162 ercc gene 1 523 . + . gene_id "GERCC_00162"; gene_version "1"; gene_name "ERCC_00162"; gene_source "ercc"; gene_biotype "ercc"; ## ERCC_00163 ercc gene 1 543 . + . gene_id "GERCC_00163"; gene_version "1"; gene_name "ERCC_00163"; gene_source "ercc"; gene_biotype "ercc"; ## ERCC_00164 ercc gene 1 1022 . + . gene_id "GERCC_00164"; gene_version "1"; gene_name "ERCC_00164"; gene_source "ercc"; gene_biotype "ercc"; ## ERCC_00165 ercc gene 1 872 . + . gene_id "GERCC_00165"; gene_version "1"; gene_name "ERCC_00165"; gene_source "ercc"; gene_biotype "ercc"; ## ERCC_00168 ercc gene 1 1024 . + . gene_id "GERCC_00168"; gene_version "1"; gene_name "ERCC_00168"; gene_source "ercc"; gene_biotype "ercc"; ## ERCC_00170 ercc gene 1 1023 . + . gene_id "GERCC_00170"; gene_version "1"; gene_name "ERCC_00170"; gene_source "ercc"; gene_biotype "ercc"; ## ERCC_00171 ercc gene 1 505 . + . gene_id "GERCC_00171"; gene_version "1"; gene_name "ERCC_00171"; gene_source "ercc"; gene_biotype "ercc";  As I mentioned above, these files work fine in STAR and featureCounts, but not for RSEM. Am I forgetting a flag in the command below that is necessary because of the slight variation in chromosome naming? rsem-prepare-reference \ --gtf /data/$USER/genomes/grch37-ensembl-ercc.gtf \
/data/\$USER/genomes/grch37-ensembl-ercc.fa ref/human_ens_ercc


Any help would be greatly appreciated.

RNA-Seq ERCC RSEM DEG • 2.3k views
0
Entering edit mode

what kind of error you get form rsem?

0
Entering edit mode

So RSEM prepares the reference and quantifies expression just fine, it just doesn't quantify ERCC transcripts. When I prepare the reference I get the following output:

[+] Loading gcc  7.2.0  ...

Warning: Cannot extract transcript ENST00000411692's sequence since the chromosome it locates, HSCHR6_MHC_MCF, is absent!


This repeats several times for various transcripts, but never has ERCC in it anywhere, then after it stops repeating variations of this:

Warning: 18816 transcripts are failed to extract because their chromosome sequences are absent.

Parsed 200000 lines

Parsed 400000 lines

Parsed 600000 lines

Parsed 800000 lines

Parsed 1000000 lines

Parsed 1200000 lines

Parsed 1400000 lines

Parsed 1600000 lines

Parsed 1800000 lines

Parsed 2000000 lines

Parsed 2200000 lines

Parsed 2400000 lines

Parsed 2600000 lines

Parsing gtf File is done!

/data/capaldobj/genomes/grch37-ensembl-ercc.fa is processed!

196354 transcripts are extracted.

Extracting sequences is done!

Group File is generated!

Transcript Information File is generated!

Chromosome List File is generated!

Extracted Sequences File is generated!

rsem-preref ref/human_ens_ercc.transcripts.fa 1 ref/human_ens_ercc

Refs.makeRefs finished!

Refs.saveRefs finished!

ref/human_ens_ercc.idx.fa is generated!

ref/human_ens_ercc.n2g.idx.fa is generated!

0
Entering edit mode

Can you check if RSEM output has records for example to this gene_id GERCC_00157? I suspect that does RSEM calcualte spike-ins, but output their names as gene_ids in GTF file.

0
Entering edit mode

It does not, it doesn't even include ERCC chromosomes in the .chrlist. RSEM doesn't even through any warnings about not seeing the ERCC chromosomes.

0
Entering edit mode

Maybe try to prepare the reference only for spike-ins, and then if it works prepare it with spike-ins + some other fasta file. Maybe the problem is in the human assembly (unlikely though).

0
Entering edit mode

Good call, a meaningful error message! But that's all I got, it doesn't tell me what it needs in the formatting to consider something a transcript.

rsem-extract-reference-transcripts ref_thermo/thermo_ercc 0 /data/capaldobj/genomes/ercc.gtf None 0 /data/capaldobj/genomes/ercc.fa
The reference contains no transcripts!
"rsem-extract-reference-transcripts ref_thermo/thermo_ercc 0 /data/capaldobj/genomes/ercc.gtf None 0 /data/capaldobj/genomes/ercc.fa" failed! Plase check if you provide correct parameters/options for the pipeline!

0
Entering edit mode

Try to put everything is one directory, probably you locate wrong files. And check files once again, I think this is something very simple goes wrong.

0
Entering edit mode

Everything is in the same directory, and is being called from that directory. Definitely something simple, because STAR index build works, but not RSEM

0
Entering edit mode

Can you try the same with this file of ERCC https://ufile.io/cepge? I do the analysis with them and everything works. Could be that something goes wrong when you parse the initial file.

0
Entering edit mode

Sorry I didn't respond sooner, but my account is too new and I ran out of posts for the day. That did not work either.

0
Entering edit mode

Sounds weird... than maybe try reinstalling rsem. Or just use salmon or kallisto on your fastq files :)

3
Entering edit mode
4.4 years ago

Figured it out. GTF File, field 3 needed to be exon and not gene.

Traffic: 1894 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.