I've been stuck on this for a few days now, and have talked with ThermoFisher, posted on the Google Group for RSEM, and have been unable to figure out what's happening. I have RNA seq data with ERCC spike-ins, and we want to align using STAR and quantify expression using RSEM. I've manually built the fasta and gtf files using the sequence file from thermo, as well as used the official fasta file provided by thermo. I've also tried building the reference with and without the STAR flag added. Every way I've built the reference with STAR has worked, I can look at the reference files, see the ERCC chromosomes, and I can find plenty of reads aligned to ERCC, and even ran featureCounts on my BAM files and gotten quantification of the ERCC transcripts. I'm pushing to switch to a different quantification tool, but I'd also like to figure out why RSEM isn't working.
Long wall of text, but the main question is, what should the header lines of ERCC chromosomes look like to work in RSEM properly?
To prepare the reference:
# downloads ercc fasta files
wget "https://tools.thermofisher.com/content/sfs/manuals/cms_095047.txt"
# converts windows returns to unix returns
tr -d '\r' <cms_095047.txt> cms_095047.tsv
# make ercc fasta file
cat cms_095047.tsv | awk -F "\t" -v OFS="" 'NR>1{print ">",$1," ",$2," ",$3," ",$4,"\n",$5}' > ercc.fa
# change ERCC- to ERCC_
sed -i 's/ERCC-/ERCC_/g' ercc.fa
# append to genome
cat /fdb/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa ercc.fa \
> grch37-ensembl-ercc.fa
head grch37-ensembl-ercc.fa
## >1
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Looks like a fasta file (the ## are artifacts, they don't exist in the actual file)
Do a tail to make the appended ERCC chromosomes look correct.
tail grch37-ensembl-ercc.fa
## >ERCC_00164 DQ516779 Ac03459963_a1 Ac03460059_a1
## AGGAGCTCCAGTAGTTTTCCCCTCAAAAATTCCTGATAAGATTTCAACTTTATCCTCTTCTTTTCTTGGTGTTGAGAAGATGCTCTGCCCTGGTCTTCTCCTGTCAAGCTCTTTTTGGATATCCTCTTCAGATAAAGGCAGATTAGTTGGACATCCATCAACAACTGCTCCAACAGCCTTTCCATGACTTTCTCCAAAAACTGTAACTCTAAACATATCCCCATAGGTGTTCATTAATGTCACCAAAAATTTTTAATTGCTTAGTTTTACATTTAAAATAAAAATTAAAATAGTCAAAAAATAAAAAAGGTTTATCTGTAGAGAACATCCAAGTGTGCTGGTTCCTTAACTTTAACTTTCTTTTTCTCCATAATCTTCTCAACTGCCTTTCTAAAGTCATCCATTGTTACATAGTCCCTTAACTCCCTAATTGCATTCATCCCTGCCTCTGTGCAGATTGCCTTTAACTCAGCCCCTACACATCCTTCAGTCATCTTAGCTATTTCTTCTAAATTGACATCTTCCGCTAAATTCATCTTTCTTGTATGAATCTTCAATATCTCCAATCTACCCTTCTCATCAGGAGCTGGGACTTCTATGATTCTATCAAATCTTCCAGGTCTTAATATTGCAGGGTCTAAAATGTCAGGCCTGTTTGCGGCCCCAATTATCTTAACATCTCCCCTTGCATCGAATCCATCCATCTCTGCCAACAACTGCATTAATGTTCTCTGAACTTCCCTATCTCCACCAGTTAAAGCGTCTGTTCTCTTTGCTGCAATAGCATCAATCTCATCTATGAATATGATTGAAGGAGCTTTTTCTTTAGCCAATTTGAATATATCTTTAACTAACGGAGCCCCCTCTCCAATAAACTTCTTAACCAATTCAGAACCAACAACTCTTATAAAGGTAGCATTTGTTTCTGTAGCAACAGCTTTAGCTAATAATGTCTTTCCAGTTCCTGGTGGCCCGTAAAGAGAATACCTTTTGGTGGTTAAAAAAAAAAAAAAAAAAAAAAA
## >ERCC_00165 DQ668363 Ac03459964_a1 Ac03460060_a1
## GATATGCGTTACGTGAGTCTGATAGCAGTTCACTACCTGGATATCTGATCCACTAGCTCGATCATGCTCACCCATAGTTTATCTGCATCACTCGTACTGAAATGCTCACATCGCAGGTAGAGCAGCATCGTAGAGCGTCAAGCTGCATCCTAGCGTCATGAGTCATAGTACCTCATGCTCACGTGATCTACCCTAGCTGACCGCTAATGACGGCAGTGCAACCTGAGATACCGACGGCATACTGTCGTCAACGTCAGGCAATGTGTCCGAACGGCGAGCTACGTCGCCTCACGGAGTAATCGCGTCCCTCTAGGTATAGTGCCGTCGGTTCAGGTCATATGTCGCGGGTTCTGCACATATCACGGACGTATCGCTATCAGACGGACGCTCTCGGACCTAAACCGTAGCTCTCGGCAAGATCGTCCTCGTCTCGAATATAGCGCCCTAGTGCTGCAAATGTCACCGCTATCTCGTAAGGGGTCCGTCTGTTGAGTTAGGCCTCCTCTCGTTGGATGTGAGCTCGGTTGCTTGGATGGTGCAGCTTACTTCGCGTACCTGCTGTTTGCATCAGTCCTCTGCATCTATAATCGCGTATCTCTCTCTAGTAGACCATATAGCCATCTAAGCGCTCGATATTCCACCTAAGTGGCGCCTATTGAACTAAGTGGCAGCCGAATGGACTATCGCTCCTCGATATGTACGGATAGGCCACGGCATGTACGAGCATAAGCCGAACTGCACGAGCATACCCGACACTGATCTGAGAGTCGCTTAAATCATCTGCGTGTCTTAGAGCTTATCGCCATGTCTGTCAACTGTACTGTCATCCTGTAACTGTAGCGTATGTGAAAAAAAAAAAAAAAAAAAAAAAA
## >ERCC_00168 DQ516776 Ac03459965_a1 Ac03460061_a1
## CCAATGAACTCAGCTATTCTTCTTAACAAATAACTTTCTCCAGAAAATTCAAATGTATCATCTTCAGGATTCCACCTAAAAACATCATGTAATATAATATCATCAATTTTTGGGTCGTATTCAACAATCTCAGTTATACTCTCAGTTCTTCTAACAAATCTTCCTTTATAAATCAATCTAACCTGCATACATATGGCATTTAGTTGTTCAAGCATAATCTTTGGAATGTTCATTGGTTCAGCATTCAACCTCCTTATAACTGCCTCTGGGGATTTTGCGTGTATCGTTGATAACGCCAAATGTCCTGTAGTTATTGCTTGAAATAATATCTTCGCCTCCTCACCTCTAACCTCTCCAACAATTAAATAATCTGGTCTTTGCCTTAAAGCCGCTTTTAATAAATCCATCATAGTTATTTCATATTCTTCTCCACCGAATCCACTTCTTGTAGTTCCAGCAATCCAGTTTTCATGATACAACCTAATTTCTGGAGTATCCTCAATAGATACGATTTTCATTTGAGGAAGGATGAAAAGAGAGAATGCATTTAAAAGGGTGGTTTTTCCAGTAGCTACCTCTCCAGCAACCATAATAGAATTTTTATATTCAATGAGTAACCAAAGATATGCAAGCATCTCTGGAGAAATACTCCCATATCTTATTAAATCTGTTGGCAATATAGGAGTGTGTGTGAATTTTCTTATTGTAAATGTTGAACCATATCTTGAGATATCCCTTCCAAGGGTTACATTTAGCCTGCTACCATCTGGGAGAGAACCATCCACTATTGGATTAGCCAATGTTAAAGATTTTCCACACCTTTGGGCTAAGGATATACAAAACGAGTCTAATTCTTCATCAGTTTCAAATTTTATATTTGTCTTTAAATGTTCGTATTTTCTATGAAACACATACACTGGCTTTCCAACACCTGTGCAACTGATATCCTCCAAATTCTCATCTTTCATAAGAGCATCTATTTCCCATATCCAATGAGGTAAAAAAAAAAAAAAAAAAAAAAAAA
## >ERCC_00170 DQ516773 Ac03459966_a1 Ac03460062_a1
## TATTGGTGGAGGGGCACAAGTTGCTGAAGTTGCGAGAGGGGCGATAAGTGAGGCAGACAGGCATAATATAAGAGGGGAGAGAATTAGCGTAGATACTCTTCCAATAGTTGGTGAAGAAAATTTATATGAGGCTGTTAAAGCTGTAGCAACTCTTCCACGAGTAGGAATTTTAGTTTTAGCTGGCTCTTTAATGGGAGGGAAGATAACTGAAGCAGTTAAAGAATTAAAGGAAAAGACTGGCATTCCCGTGATAAGCTTAAAGATGTTTGGCTCTGTTCCTAAGGTTGCTGATTTGGTTGTTGGAGACCCATTGCAGGCAGGGGTTTTAGCTGTTATGGCTATTGCTGAAACAGCAAAATTTGATATAAATAAGGTTAAAGGTAGGGTGCTATAAAGATAATTTAATAATTTTTGATGAAACCGAAGCGTTAGCTTTGGGTTATGAAACTCCATGATTTTCATTTAATTTTTTCCTATTAATTTTCTCCTAAAAAGTTTCTTTAACATAAATAAGGTTAAAGGGAGAGCTCTATGATTGTCTTCAAAAATACAAAGATTATTGATGTATATACTGGAGAGGTTGTTAAAGGAAATGTTGCAGTTGAGAGGGATAAAATATCCTTTGTGGATTTAAATGATGAAATTGATAAGATAATTGAAAAAATAAAGGAGGATGTTAAAGTTATTGACTTAAAAGGAAAATATTTATCTCCAACATTTATAGATGGGCATATACATATAGAATCTTCCCATCTCATCCCATCAGAGTTTGAGAAATTTGTATTAAAAAGCGGAGTTAGCAAAGTAGTTATAGACCCGCATGAAATAGCAAATATTGCTGGAAAAGAAGGAATTTTGTTTATGTTGAATGATGCCAAAATTTTAGATGTCTATGTTATGCTTCCTTCCTGTGTTCCAGCTACAAACTTAGAAACAAGTGGAGCTGAGATTACAGCAGAGAATATTGAAGAACTCATTCTTTAGATAATGTCTTAGGTTAAAAAAAAAAAAAAAAAAAAAAAA
## >ERCC_00171 DQ854994 Ac03459967_a1 Ac03460063_a1
## CTGGAGATTGTCTCGTACGGTTAAGAGCCTCCGCCCGTCTCTGGGACTATGGACGGGCACGCTCATATCAGGCTATATTTGGTCCGGGTTATTATCGTCGCGGTTACCGTAATACTTCAGATCAGTTAAGTAGGGCCATATGCCTCGGGAATAAGCTGACGGTGACAAGGTTTCCCCCTAATCGAGACGCTGCAATAACACAGGGGCATACAGTAACCAGGCAAGAGTTCAATCGCTTAGTTTCGTGGCGGGATTTGAGGAAAACTGCGACTGTTCTTTAACCAAACATCCGTGCGATTCGTGCCACTCGTAGACGGCATCTCACAGTCACTGAAGGCTATTAAAGAGTTAGCACCCACCATTGGATGAAGCCCAGGATAAGTGACCCCCCCGGACCTTGGAGTTTCATGCTAATCAAAGAAGAGCTAATCCGACGTAAAGTTGCGGCGTTGATTACGCAGGATTGCGACCAAAGAACGAGAAAAAAAAAAAAAAAAAAAAAAAA
I've tried multiple variations of headers including changing -
to _
, and only including the ERCC-00171
and omitting the rest of the header. None of these have worked. I also prepared a GTF and appended it to the file. If you want to see the full process for that, there's a html document on the google groups RSEM post I linked above, and that goes over it. But, the head of the gtf is:
head grch37-ensembl-ercc.gtf
## 1 processed_transcript exon 11869 12227 . + . exon_id "ENSE00002234944"; exon_number "1"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; transcript_name "DDX11L1-002"; transcript_source "havana"; tss_id "TSS15133";
## 1 processed_transcript transcript 11869 14409 . + . gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; transcript_name "DDX11L1-002"; transcript_source "havana"; tss_id "TSS15133";
## 1 transcribed_unprocessed_pseudogene exon 11872 12227 . + . exon_id "ENSE00002234632"; exon_number "1"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000515242"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; tss_id "TSS178604";
## 1 transcribed_unprocessed_pseudogene transcript 11872 14412 . + . gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000515242"; transcript_name "DDX11L1-201"; transcript_source "ensembl"; tss_id "TSS178604";
## 1 transcribed_unprocessed_pseudogene exon 11874 12227 . + . exon_id "ENSE00002269724"; exon_number "1"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000518655"; transcript_name "DDX11L1-202"; transcript_source "ensembl"; tss_id "TSS192941";
## 1 transcribed_unprocessed_pseudogene transcript 11874 14409 . + . gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000518655"; transcript_name "DDX11L1-202"; transcript_source "ensembl"; tss_id "TSS192941";
## 1 transcribed_unprocessed_pseudogene exon 12010 12057 . + . exon_id "ENSE00001948541"; exon_number "1"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000450305"; transcript_name "DDX11L1-001"; transcript_source "havana"; tss_id "TSS138561";
## 1 transcribed_unprocessed_pseudogene transcript 12010 13670 . + . gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000450305"; transcript_name "DDX11L1-001"; transcript_source "havana"; tss_id "TSS138561";
## 1 transcribed_unprocessed_pseudogene exon 12179 12227 . + . exon_id "ENSE00001671638"; exon_number "2"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000450305"; transcript_name "DDX11L1-001"; transcript_source "havana"; tss_id "TSS138561";
## 1 transcribed_unprocessed_pseudogene exon 12595 12721 . + . exon_id "ENSE00002270865"; exon_number "2"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; transcript_id "ENST00000518655"; transcript_name "DDX11L1-202"; transcript_source "ensembl"; tss_id "TSS192941";
And the tail of the gtf, where the ERCC annotations were added:
tail grch37-ensembl-ercc.gtf
## ERCC_00157 ercc gene 1 1019 . + . gene_id "GERCC_00157"; gene_version "1"; gene_name "ERCC_00157"; gene_source "ercc"; gene_biotype "ercc";
## ERCC_00158 ercc gene 1 1027 . + . gene_id "GERCC_00158"; gene_version "1"; gene_name "ERCC_00158"; gene_source "ercc"; gene_biotype "ercc";
## ERCC_00160 ercc gene 1 743 . + . gene_id "GERCC_00160"; gene_version "1"; gene_name "ERCC_00160"; gene_source "ercc"; gene_biotype "ercc";
## ERCC_00162 ercc gene 1 523 . + . gene_id "GERCC_00162"; gene_version "1"; gene_name "ERCC_00162"; gene_source "ercc"; gene_biotype "ercc";
## ERCC_00163 ercc gene 1 543 . + . gene_id "GERCC_00163"; gene_version "1"; gene_name "ERCC_00163"; gene_source "ercc"; gene_biotype "ercc";
## ERCC_00164 ercc gene 1 1022 . + . gene_id "GERCC_00164"; gene_version "1"; gene_name "ERCC_00164"; gene_source "ercc"; gene_biotype "ercc";
## ERCC_00165 ercc gene 1 872 . + . gene_id "GERCC_00165"; gene_version "1"; gene_name "ERCC_00165"; gene_source "ercc"; gene_biotype "ercc";
## ERCC_00168 ercc gene 1 1024 . + . gene_id "GERCC_00168"; gene_version "1"; gene_name "ERCC_00168"; gene_source "ercc"; gene_biotype "ercc";
## ERCC_00170 ercc gene 1 1023 . + . gene_id "GERCC_00170"; gene_version "1"; gene_name "ERCC_00170"; gene_source "ercc"; gene_biotype "ercc";
## ERCC_00171 ercc gene 1 505 . + . gene_id "GERCC_00171"; gene_version "1"; gene_name "ERCC_00171"; gene_source "ercc"; gene_biotype "ercc";
As I mentioned above, these files work fine in STAR and featureCounts, but not for RSEM. Am I forgetting a flag in the command below that is necessary because of the slight variation in chromosome naming?
rsem-prepare-reference \
--gtf /data/$USER/genomes/grch37-ensembl-ercc.gtf \
/data/$USER/genomes/grch37-ensembl-ercc.fa ref/human_ens_ercc
Any help would be greatly appreciated.
what kind of error you get form rsem?
So RSEM prepares the reference and quantifies expression just fine, it just doesn't quantify ERCC transcripts. When I prepare the reference I get the following output:
This repeats several times for various transcripts, but never has ERCC in it anywhere, then after it stops repeating variations of this:
Can you check if RSEM output has records for example to this gene_id GERCC_00157? I suspect that does RSEM calcualte spike-ins, but output their names as gene_ids in GTF file.
It does not, it doesn't even include ERCC chromosomes in the .chrlist. RSEM doesn't even through any warnings about not seeing the ERCC chromosomes.
Maybe try to prepare the reference only for spike-ins, and then if it works prepare it with spike-ins + some other fasta file. Maybe the problem is in the human assembly (unlikely though).
Good call, a meaningful error message! But that's all I got, it doesn't tell me what it needs in the formatting to consider something a transcript.
Try to put everything is one directory, probably you locate wrong files. And check files once again, I think this is something very simple goes wrong.
Everything is in the same directory, and is being called from that directory. Definitely something simple, because STAR index build works, but not RSEM
Can you try the same with this file of ERCC https://ufile.io/cepge? I do the analysis with them and everything works. Could be that something goes wrong when you parse the initial file.
Sorry I didn't respond sooner, but my account is too new and I ran out of posts for the day. That did not work either.
Sounds weird... than maybe try reinstalling rsem. Or just use
salmon
orkallisto
on your fastq files :)