I am trying to generate an index of ncRNAs for use with the STAR aligner. I downloaded the long ncRNA GTF and FASTA files from GENCODE here: https://www.gencodegenes.org/human/
When I run STAR in genomeGenerate mode:
STAR --runThreadN 10 --runMode genomeGenerate --genomeDir . --genomeFastaFiles gencode.v36.lncRNA_transcripts.fa --sjdbGTFfile gencode.v36.long_noncoding_RNAs.gff3 --sjdbOverhang 100 --limitGenomeGenerateRAM 34068260906 --sjdbGTFtagExonParentTranscript transcript_id
I get this error:
Fatal INPUT FILE error, no valid exon lines in the GTF file: gencode.v36.long_noncoding_RNAs.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.
I then tried the ncRNA GTF and FASTA files from Ensembl, here: https://www.ensembl.org/info/data/ftp/index.html
I ran STAR in genomeGenerate mode:
STAR --runThreadN 10 --runMode genomeGenerate --genomeDir . --genomeFastaFiles Homo_sapiens.GRCh38.ncrna.fa --sjdbGTFfile Homo_sapiens.GRCh38.102.gtf sjdbOverhang 100 --limitGenomeGenerateRAM 40764467242
And got the same error: Fatal INPUT FILE error, no valid exon lines in the GTF file: Homo_sapiens.GRCh38.102.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.
Obviously I am missing something in the way I need to create this index, but after going through the STAR documentation and searching the error I am coming up completely empty. My end goal is to remove all reads in the sequencing file that map to ncRNAs, so if I'm going at this in completely the wrong way please let me know.
Did you check that (the chromosome naming in the GTF versus the FASTA file)?
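A quick way to run that check is to list the sequence names in the FASTA headers and in column 1 of the annotation, then count the names they share. The sketch below is only an illustration: the function name `count_shared_names` is hypothetical, and it assumes STAR takes the sequence name to be the header text up to the first whitespace.

```shell
# count_shared_names FASTA ANNOTATION   (hypothetical helper, not part of STAR)
# Prints the number of sequence names that occur both in the FASTA headers
# and in column 1 of the GTF/GFF3 file.
count_shared_names() {
    fasta_names=$(mktemp)
    annot_names=$(mktemp)

    # FASTA side: drop the leading '>' and keep the text up to the first whitespace.
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | LC_ALL=C sort -u > "$fasta_names"

    # Annotation side: skip '#' comment lines and keep column 1 (the seqname).
    grep -v '^#' "$2" | cut -f1 | LC_ALL=C sort -u > "$annot_names"

    # comm -12 keeps only the lines present in both sorted lists.
    LC_ALL=C comm -12 "$fasta_names" "$annot_names" | wc -l | tr -d ' '

    rm -f "$fasta_names" "$annot_names"
}
```

For example, `count_shared_names gencode.v36.lncRNA_transcripts.fa gencode.v36.long_noncoding_RNAs.gff3`. A result of 0 would mean that no annotation line refers to a sequence present in the FASTA, which is the situation the error message describes.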
The FASTA files do not have any chromosomes on them, just
>ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|
for the first entry in the GENCODE FASTA, for example. But I have searched each database for ncRNA FASTA files with chromosome information and haven't found any, so I'm not sure how one would go about building an index using these files.
STAR should have no advantage over other aligners for ncRNA. You may want to try a simpler aligner instead.
I wanted to use STAR because that's what I'm going to be using for more downstream analysis of the reads I'm not throwing out, and I figured it would be better to use the same aligner for every step. Would there be any pitfalls in using Bowtie2 for the quality-control steps (looking for ncRNA reads to throw out, PhiX contamination, etc.) and then STAR for the kept reads? I have Bowtie2 indices available for both of those, so it would save me quite a bit of time.
You should use the correct tool for the job at hand. That said, what is the rest of the data supposed to be for? Are you sure there are ncRNAs in your data and that you have a specific need to remove them? Are they going to affect your analysis if not removed?
I am using raw public RNC-seq, mRNA-seq, and Ribo-seq datasets to explore the effect of sequence and codon choice on protein phase separation behavior. In the lab we've observed some differences in phase-separation behavior that seem to be related to which codons are used for the same amino acid. It makes no sense, but we've been unable to eliminate the apparent effect despite trying many different bench techniques and QC methods. So I'm investigating ribosome movement and dynamics to see if there's a pattern among various WT proteins that exhibit the same phase separation behaviors we're observing. Because I'm going to be looking at the translation dynamics at codon resolution, and the datasets are from different labs all over the world with (I'm assuming) folks preparing the libraries with different levels of skill, etc., I'm trying to make sure I'm only looking at the RNAs of interest, which are protein-coding.
Then you could simply use counts for the coding entries from the GTF file after the STAR alignment. That should save you additional work. Assuming counts are what you are after?
No, I need sequence and codon resolution, not just counts. I'm identifying exactly which codons the A and P sites are bound to.