Entering edit mode
3.4 years ago
crispin.hiley
▴
10
Hi,
I would be very grateful for help. I have a problem similar to this:
RSEM: The reference contains no transcripts!
I am trying to run RSEM-prepare-reference with my own custom gtf file.:
SCRATCH_DIR=/camp/project/scratch/hileyc
REFERENCE_DIR="${SCRATCH_DIR}/reference"
ASSETS_DIR="${SCRATCH_DIR}/RNA"
RSEM_DIR="${SCRATCH_DIR}/rsem/RNA_output"
singularity run -B "${SCRATCH_DIR}:${SCRATCH_DIR}" -W "${RSEM_DIR}" \
docker://slab/rsem \
rsem-prepare-reference \
--gtf "${ASSETS_DIR}/RNA.gtf" \
--star \
--star-sjdboverhang 74 \
--num-threads 8 \
"${REFERENCE_DIR}/hg19.fa" \
RNA
but it get the following error:
rsem-extract-reference-transcripts eRNA 0 /camp/project/proj-tracerx-lung/txscratch/hileyc/eRNA/eRNA.gtf None 0 /camp/project/proj-tracerx-lung/txscratch/hileyc/reference/hg19.fa
Parsed 200000 lines
Parsed 400000 lines
Parsed 600000 lines
The reference contains no transcripts!
"rsem-extract-reference-transcripts RNA 0 /camp/project/scratch/hileyc/RNA/RNA.gtf None 0 /camp/project/scratch/hileyc/reference/hg19.fa" failed! Plase check if you provide correct parameters/options for the pipeline!
this is what the first few lines of my gtf look like:
chr1 L_etal CDS 751480 751481 . + . gene_id "RNA751480"; transcript_id "RNA751480+";
chr1 L_etal CDS 751690 751691 . + . gene_id "RNA751690"; transcript_id "RNA751690+";
chr1 L_etal CDS 752240 752241 . + . gene_id "RNA752240"; transcript_id "RNA752240+";
I have tried changing CDS to 'transcript' and other terms. The chromosome labels are the same.
Im sure its an incompatibly between by gft and the reference fasta file but i cant figure it out.
any ideas appreciated.
I am guessing that the GTF is the issue - it is missing transcript-to-gene information. There is detailed info here: https://deweylab.github.io/RSEM/README.html#built
A complete record could look something like
I have intentionally added spaces. Here, the gene is OR4G11P, and it has 2 transcripts, the first with 3 exons and the second with a single exon.
For your [assuming] eRNAs, you may simply need an extra line with 'gene' in place of 'CDS' ?
thanks Kevin that is very kind of you to reply,
yes eRNA
so for each eRNA (as there can be expression from both strands something like this:
dont think this is the solution unfortunately. is the problem that I am trying to align to single base pair?
I had not noticed that. How is it that these are 1 bp? They may fail some filter for minimum transcript length?