Question: Creating a non-coding RNA index for STAR
6 weeks ago by
plberry30
Kansas City, Missouri, USA
plberry30 wrote:

I am trying to generate an index of ncRNAs for use with the STAR aligner. I downloaded the long non-coding RNA (lncRNA) GTF and FASTA files from GENCODE here: https://www.gencodegenes.org/human/

When I run STAR in genomeGenerate mode:

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir . --genomeFastaFiles gencode.v36.lncRNA_transcripts.fa --sjdbGTFfile gencode.v36.long_noncoding_RNAs.gff3 --sjdbOverhang 100 --limitGenomeGenerateRAM 34068260906 --sjdbGTFtagExonParentTranscript transcript_id

I get this error:

Fatal INPUT FILE error, no valid exon lines in the GTF file: gencode.v36.long_noncoding_RNAs.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.

I then tried the ncRNA GTF and FASTA files from Ensembl, available here: https://www.ensembl.org/info/data/ftp/index.html

I ran STAR in genomeGenerate mode:

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir . --genomeFastaFiles Homo_sapiens.GRCh38.ncrna.fa --sjdbGTFfile Homo_sapiens.GRCh38.102.gtf --sjdbOverhang 100 --limitGenomeGenerateRAM 40764467242

And got the same error:

Fatal INPUT FILE error, no valid exon lines in the GTF file: Homo_sapiens.GRCh38.102.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.

Obviously I am missing something in the way I need to create this index, but after going through the STAR documentation and searching the error I am coming up completely empty. My end goal is to remove all reads in the sequencing file that map to ncRNAs, so if I'm going at this in completely the wrong way please let me know.

rna-seq star • 186 views
modified 6 weeks ago by GenoMax96k • written 6 weeks ago by plberry30

Most likely cause is the difference in chromosome naming between GTF and FASTA file.

Did you check that?

written 6 weeks ago by Pierre Lindenbaum134k

The FASTA files do not have any chromosome names in them. The first entry in the GENCODE FASTA, for example, is just >ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|. I have searched each database for ncRNA FASTA files with chromosome information and haven't found any, so I'm not sure how one would go about building an index using these files.

written 6 weeks ago by plberry30
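
The naming mismatch suggested by the error message can be checked directly. Below is a minimal Python sketch, not part of STAR, that compares the sequence names in a FASTA file with column 1 of a GTF/GFF3 file. The function names are hypothetical, and it assumes the sequence name is the header text up to the first whitespace (the usual FASTA convention). If the two sets share no names, as with transcript-ID headers against chromosome-named annotation lines, that would be consistent with the "no valid exon lines" error.

```python
from typing import Iterable, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence names: header text after '>' up to the first whitespace."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def annotation_seqnames(lines: Iterable[str]) -> Set[str]:
    """Collect column 1 (seqname) from GTF/GFF3 lines, skipping comments and blanks."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def shared_names(fasta_lines: Iterable[str], annotation_lines: Iterable[str]) -> Set[str]:
    """Names present both as FASTA sequences and as annotation seqnames."""
    return fasta_names(fasta_lines) & annotation_seqnames(annotation_lines)


# Illustrative input: the GENCODE header quoted in this thread, and a made-up
# annotation line (not copied from a real file) with a chromosome in column 1.
fasta_example = [
    ">ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959|"
    "OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|",
    "ACGT",
]
annotation_example = [
    'chr1\tHAVANA\texon\t1\t4\t.\t+\t.\ttranscript_id "ENST00000473358.1";',
]
print(sorted(shared_names(fasta_example, annotation_example)))  # prints []
```

For real files, the same functions can be given open file handles, e.g. `shared_names(open(fasta_path), open(gtf_path))`, where both paths are whatever files are being passed to STAR.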

STAR should have no advantage over other aligners for ncRNA. You may want to try a simpler aligner instead.

written 6 weeks ago by GenoMax96k

I wanted to use STAR because that's what I'm going to be using for downstream analysis of the reads I'm not throwing out, and I figured it would be better to use the same aligner for every step. Would there be any pitfalls in using Bowtie2 for the quality-control steps (looking for ncRNA reads to throw out, PhiX contamination, etc.) and then STAR for the kept reads? I have Bowtie2 indices available for both of those, so it would save me quite a bit of time.

written 6 weeks ago by plberry30

You should use the correct tool for the job at hand. That said, what is the rest of the data supposed to be for? Are you sure there are ncRNAs in your data and that you have a specific need to remove them? Are they going to affect your analysis if not removed?

modified 6 weeks ago • written 6 weeks ago by GenoMax96k

I am using raw public RNC-seq, mRNA-seq, and Ribo-seq datasets to explore the effect of sequence and codon choice on protein phase-separation behavior. In the lab we've observed some differences in phase-separation behavior that seem to be related to which codons are used for the same amino acid. It makes no sense, but we've been unable to eliminate the apparent effect despite trying many different bench techniques and QC methods. So I'm investigating ribosome movement and dynamics to see if there's a pattern among various WT proteins that exhibit the same phase-separation behaviors we're observing. Because I'm going to be looking at translation dynamics at codon resolution, and the datasets are from different labs all over the world with (I'm assuming) folks preparing the libraries with different levels of skill, I'm trying to make sure I'm only looking at the RNAs of interest, which are protein-coding.

written 6 weeks ago by plberry30

Then you could simply use counts for the coding entries from the GTF file after the STAR alignment. That should save you additional work. Assuming counts are what you are after?

written 6 weeks ago by GenoMax96k
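
To illustrate what "the coding entries from the GTF file" could mean in practice, here is a rough Python sketch that keeps only annotation lines whose gene biotype is protein_coding. It assumes the biotype is stored in column 9 under `gene_type` (GENCODE style) or `gene_biotype` (Ensembl style); the function name is hypothetical and this is not a replacement for a proper counting tool.

```python
import re
from typing import Iterable, Iterator

# Matches gene_type "..." (GENCODE) or gene_biotype "..." (Ensembl) in column 9.
_BIOTYPE = re.compile(r'\b(?:gene_type|gene_biotype) "([^"]+)"')


def protein_coding_lines(gtf_lines: Iterable[str]) -> Iterator[str]:
    """Yield GTF lines whose gene biotype attribute is 'protein_coding'."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines carry no attributes
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GTF record
        match = _BIOTYPE.search(fields[8])
        if match and match.group(1) == "protein_coding":
            yield line
```

Usage would be along the lines of `for line in protein_coding_lines(open(gtf_path)): ...`, writing the kept lines to a new GTF that is then used for counting or for selecting alignments.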

No, I need sequence and codon resolution, not just counts. I'm identifying exactly which codons the A and P sites are bound to.

written 6 weeks ago by plberry30
6 weeks ago by
GenoMax96k
United States
GenoMax96k wrote:

My end goal is to remove all reads in the sequencing file that map to ncRNAs

In that case I suggest that you filter your reads against the ncRNA FASTA you have, using bbduk.sh from the BBMap suite. A guide is available.

modified 6 weeks ago • written 6 weeks ago by GenoMax96k
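
As a rough illustration of the filtering step suggested above, the sketch below assembles a bbduk.sh command line in Python. The `in`, `ref`, `outu`, `outm` and `k` parameters are BBDuk's own (unmatched reads go to `outu`, reads matching the reference go to `outm`). The helper names, the read and output file names, and the choice of k=31 are assumptions for illustration, not settings taken from this thread.

```python
import shutil
import subprocess
from typing import List


def bbduk_filter_command(reads: str, reference: str, kept: str,
                         removed: str, k: int = 31) -> List[str]:
    """Build a bbduk.sh call that splits reads by k-mer match to the reference."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31 for BBDuk")
    return [
        "bbduk.sh",
        f"in={reads}",        # input reads
        f"ref={reference}",   # ncRNA sequences to filter against
        f"outu={kept}",       # reads with no match to the reference
        f"outm={removed}",    # reads that matched the reference
        f"k={k}",             # k-mer length used for matching
    ]


def run_if_available(command: List[str]) -> bool:
    """Run the command only if its executable is on PATH; report whether it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# Hypothetical file names; only the reference FASTA name comes from the thread.
command = bbduk_filter_command(
    reads="reads.fastq.gz",
    reference="Homo_sapiens.GRCh38.ncrna.fa",
    kept="reads.no_ncrna.fastq.gz",
    removed="reads.ncrna.fastq.gz",
)
print(" ".join(command))
```

Building the argument list separately from running it makes it easy to inspect the command before anything is executed; `run_if_available(command)` would then launch it where BBMap is installed.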

Thanks for the tip on the BBMap suite, that looks very promising.

written 6 weeks ago by plberry30

Marking this as the answer - it took a little work to get BBMap up and running due to some dependency and permissions issues, but it was 2 hours well spent! Worked a treat, and I'm definitely going to use this again when I need a quick and dirty way to separate out reads from large alignments. It churned through over 200 million reads in less than 5 minutes on my work's server.

written 6 weeks ago by plberry30

BBMap is available via conda, so you should consider using that route next time.

written 6 weeks ago by GenoMax96k