Creating a non-coding RNA index for STAR
16 months ago
plberry ▴ 30

I am trying to generate an index of ncRNAs for use with STAR aligner. I downloaded the long nc RNA GTF and FASTA files from GENCODE here: https://www.gencodegenes.org/human/

When I run STAR in genomeGenerate mode:

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir . --genomeFastaFiles gencode.v36.lncRNA_transcripts.fa --sjdbGTFfile gencode.v36.long_noncoding_RNAs.gff3 --sjdbOverhang 100 --limitGenomeGenerateRAM 34068260906 --sjdbGTFtagExonParentTranscript transcript_id

I get this error:

Fatal INPUT FILE error, no valid exon lines in the GTF file: gencode.v36.long_noncoding_RNAs.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.

I then tried the ncRNA GTF and FASTA files from Ensembl, downloaded here: https://www.ensembl.org/info/data/ftp/index.html

I ran STAR in genomeGenerate mode:

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir . --genomeFastaFiles Homo_sapiens.GRCh38.ncrna.fa --sjdbGTFfile Homo_sapiens.GRCh38.102.gtf --sjdbOverhang 100 --limitGenomeGenerateRAM 40764467242

And got the same error:

Fatal INPUT FILE error, no valid exon lines in the GTF file: Homo_sapiens.GRCh38.102.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.

Obviously I am missing something in the way I need to create this index, but after going through the STAR documentation and searching the error I am coming up completely empty. My end goal is to remove all reads in the sequencing file that map to ncRNAs, so if I'm going at this in completely the wrong way please let me know.

STAR RNA-Seq • 926 views

Most likely cause is the difference in chromosome naming between GTF and FASTA file.

Did you check that?


The FASTA headers do not contain any chromosome names. The first entry in the GENCODE FASTA, for example, is just >ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|. I have searched each database for ncRNA FASTA files that carry chromosome information and haven't found any, so I'm not sure how one would go about building an index using these files.
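For anyone hitting the same error, here is a minimal sketch of how the naming mismatch could be checked directly. It assumes STAR takes the sequence name from the FASTA header up to the first whitespace and matches it against column 1 of the annotation; the function names are made up for illustration.

```shell
# Sequence names as the aligner would see them: header text after '>'
# up to the first whitespace.
fasta_names() {
    awk '/^>/ { print substr($1, 2) }' "$1" | LC_ALL=C sort -u
}

# Sequence names used by the annotation: column 1 of non-comment lines.
# Works for GTF and GFF3 alike, since both keep the sequence name there.
gtf_seqnames() {
    awk -F'\t' '!/^#/ && NF > 1 { print $1 }' "$1" | LC_ALL=C sort -u
}

# Names present in both files. Empty output means no annotation line
# can be placed on any sequence in the FASTA.
shared_names() {
    names_tmp=$(mktemp)
    fasta_names "$1" > "$names_tmp"
    gtf_seqnames "$2" | grep -Fxf "$names_tmp" || true
    rm -f "$names_tmp"
}
```

Running `shared_names gencode.v36.lncRNA_transcripts.fa gencode.v36.long_noncoding_RNAs.gff3` should print nothing if the FASTA names are transcript IDs and the annotation uses chromosome names, which would be consistent with the "no valid exon lines" error.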


STAR should have no advantage over other aligners for ncRNA. You may want to try a simpler aligner instead.


I wanted to use STAR because that's what I'm going to be using for downstream analysis of the reads I'm not throwing out, and I figured it would be better to use the same aligner for every step. Would there be any pitfalls in using Bowtie2 for the quality-control steps (looking for ncRNA reads to throw out, PhiX contamination, etc.) and then STAR for the kept reads? I have Bowtie2 indices available for both of those, so it would save me quite a bit of time.


You should use the correct tool for the job at hand. That said, what is the rest of the data supposed to be for? Are you sure there are ncRNAs in your data and that you have a specific need to remove them? Are they going to affect your analysis if not removed?


I am using raw public RNC-seq, mRNA-seq, and Ribo-seq datasets to explore the effect of sequence and codon choice on protein phase-separation behavior. In the lab we've observed some differences in phase-separation behavior that seem to be related to which codons are used for the same amino acid. It makes no sense, but we've been unable to eliminate the apparent effect despite trying many different bench techniques and QC methods. So I'm investigating ribosome movement and dynamics to see if there's a pattern among various WT proteins that exhibit the same phase-separation behaviors we're observing. Because I'm going to be looking at translation dynamics at codon resolution, and the datasets are from different labs all over the world with (I'm assuming) folks preparing the libraries with different levels of skill, I'm trying to make sure I'm only looking at the RNAs of interest, which are protein-coding.


Then you could simply use the counts for the coding entries in the GTF file after the STAR alignment. That should save you additional work, assuming counts are what you are after?


No, I need sequence and codon resolution, not just counts. I'm identifying exactly which codons the A and P sites are bound to.

16 months ago
GenoMax 115k

My end goal is to remove all reads in the sequencing file that map to ncRNAs

In that case I suggest that you filter your reads against the ncRNA FASTA you have using bbduk.sh from the BBMap suite. A guide is available.
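As a rough sketch of what that filtering step could look like (not taken from the guide): the wrapper name, the file arguments, and the choice of k=25 are all assumptions for illustration. The k value was picked only so that it stays below a typical ribosome-footprint read length, and would need tuning for real data.

```shell
# Hypothetical wrapper around bbduk.sh; all file names are supplied by the caller.
# BBDUK may point at a specific bbduk.sh install (defaults to the one on PATH).
filter_ncrna_reads() {
    reads=$1      # input FASTQ
    ncrna_fa=$2   # ncRNA transcript FASTA used as the k-mer reference
    kept=$3       # reads with no k-mer match to the reference
    removed=$4    # reads that matched the reference
    # out= receives non-matching reads, outm= receives matching reads.
    # k=25 and hdist=0 are illustrative assumptions, not recommended settings.
    "${BBDUK:-bbduk.sh}" in="$reads" ref="$ncrna_fa" \
        out="$kept" outm="$removed" \
        k=25 hdist=0 stats="$kept.stats.txt"
}
```

A call might look like `filter_ncrna_reads sample.fastq.gz gencode.v36.lncRNA_transcripts.fa sample.clean.fastq.gz sample.ncrna.fastq.gz`, where the sample file names are placeholders. Keeping the matched reads in a separate file makes it possible to inspect what was thrown out.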


Thanks for the tip on the BBMap suite, that looks very promising.


Marking this as the answer. It took a little work to get BBMap up and running because of some permission problems with its dependencies, but it was 2 hours well spent! Worked a treat, and I'm definitely going to use this again when I need a quick and dirty way to separate out reads from large alignments. It churned through over 200 million reads in less than 5 minutes on my work's server.


BBMap is available via conda, so you should consider using that route next time.

3 months ago

Hi, when mapping to a transcriptome, you do not need the GTF file, since all the information about the transcripts is already contained in the sequences of the transcriptome file.
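A minimal sketch of that idea, assuming the transcript FASTA is used as the only reference and --sjdbGTFfile is simply left out. The helper names are hypothetical. The --genomeSAindexNbases value follows the scaling rule the STAR manual gives for small references, min(14, log2(GenomeLength)/2 - 1).

```shell
# Suggest --genomeSAindexNbases as min(14, log2(total_bases)/2 - 1),
# rounded down, from the total sequence length in a FASTA file.
suggest_sa_index_nbases() {
    awk '!/^>/ { total += length($0) }
         END {
             if (total == 0) { print "no sequence found" > "/dev/stderr"; exit 1 }
             n = int(log(total) / log(2) / 2 - 1)
             if (n > 14) n = 14
             if (n < 1) n = 1
             print n
         }' "$1"
}

# Hypothetical wrapper: build a STAR index from a transcript FASTA with no GTF.
# STAR_BIN may point at a specific STAR binary (defaults to STAR on PATH).
build_transcriptome_index() {
    fasta=$1
    outdir=$2
    threads=${3:-4}
    nbases=$(suggest_sa_index_nbases "$fasta") || return 1
    mkdir -p "$outdir"
    "${STAR_BIN:-STAR}" --runMode genomeGenerate --runThreadN "$threads" \
        --genomeDir "$outdir" --genomeFastaFiles "$fasta" \
        --genomeSAindexNbases "$nbases"
}
```

For example, `build_transcriptome_index gencode.v36.lncRNA_transcripts.fa lncRNA_index 10`, where the output directory name is a placeholder. The STAR manual also recommends lowering --genomeChrBinNbits when the reference contains a very large number of sequences; that is not handled in this sketch.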