Question

Reads not aligned with STAR, only spike-ins

1

Entering edit mode

7.0 years ago

gasta88 ▴ 50

Hello everybody,

I'm fairly new in the RNA-seq analysis sector and I'm facing some problems with STAR aligner. I'm trying to align a set of reads on the Saccharomyces cerevisiae genome (Ensembl release 74) using spike-in (ERCC92) as QC method.

I've downloaded the .fa file from Ensembl and appended the spike-in sequences at the end (file: https://www.dropbox.com/s/ge18ellc8dcxnit/SCerevisiae.cdna.all.fa?dl=0 )

I've also created the GTF with reads and spike-in annotation in a similar fashion (file: https://www.dropbox.com/s/d3zcv4gemddfmbp/Saccharomyces_cerevisiae.R64-1-1.87.gtf?dl=0 ) The ERCC92 annotation where downloaded from here: https://github.com/NCBI-Hackathons/HASSL_Homogeneous_Analysis_of_SRA_rnaSequencing_Libraries/blob/master/Spike-Ins/ERCC92/ERCC92.gtf

The genome index creation was performed with the following command:

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir /homes/fgastaldello/STAR_test/S.Cer_db --genomeFastaFiles /homes/fgastaldello/SCerevisiae.cdna.all.fa

The mapping command was as follow:

STAR --sjdbGTFfile /homes/fgastaldello/Saccharomyces_cerevisiae.R64-1-1.87.gtf --quantMode GeneCounts --runThreadN 10 --genomeDir /homes/fgastaldello/STAR_test/S.Cer_db --genomeLoad NoSharedMemory --readFilesCommand gunzip -c --readFilesIn /homes/fgastaldello/dyeswap_data/Snf2/Snf2-${i}/${fastqfiles[0]} /homes/fgastaldello/dyeswap_data/Snf2/Snf2-${i}/${fastqfiles[1]} --outSAMmode Full --outSJfilterIntronMaxVsReadN 50 100 150 200 --outFilterType BySJout --outFilterMultimapNmax 2 --outFilterMismatchNmax 5 --outSAMunmapped Within

The bash script that runs this command iterates in 6 folders and substitute the variables ${fastqfiles[N]} with a sample files containing pair-end reads.

The command runs fine but the final result of the entire computation map only the spike-in. All the other genes are missing (file: https://www.dropbox.com/s/0e529ku9ub9z77y/ReadsPerGene.out.tab?dl=0 ).

Since I'm new I'm probably doing something silly but I can't get my head around it.

Thanks for any help.

RNA-Seq alignment genome • 3.7k views

ADD COMMENT • link 7.0 years ago by gasta88 ▴ 50

0

Entering edit mode

Grab a few reads from your file and check them using BLAST at NCBI to confirm they hit yeast genome. Hopefully there are no surprises (e.g. data you were given does not belong to you).

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

I've done as you suggested and they map to the yeast genome.

ADD REPLY • link 7.0 years ago by gasta88 ▴ 50

0

Entering edit mode

Then I suggest that you start with simplest STAR options to see if that restores mapping. Perhaps you are using a filter that is removing the mapping (--outFilterMultimapNmax 2 this may be one of those culprits).

You could also give bbmap.sh from BBMap suite a try.

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

Try to count with featureCounts, the star bam file (you may have to sort it) and your gtf annotation.

edit: also, check STAR log output, it should have information about %reads mapping to the genome, properly paired, multi-mappers, etc.

ADD REPLY • link 7.0 years ago by h.mon 35k

0

Entering edit mode

For future ref: featureCounts does not require a pre-sorted file.

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

I did not remember that, although it is an amazing software, I have been using only STAR for a while now.

ADD REPLY • link 7.0 years ago by h.mon 35k

0

Entering edit mode

The log files from STAR are showing an average of 82 % unique mapped reads and low (less than 0.1%) unmapped reads for each sample.

I'm struggling to make featureCounts work since I operate on two Windows machines.

I'm quite incline to think that the GTF is not properly formatted. For example:

...

Mito ensembl start_codon 85554 85556 . + 0 gene_id "Q0297"; transcript_id "Q0297"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; trans$

Mito ensembl stop_codon 85707 85709 . + 0 gene_id "Q0297"; transcript_id "Q0297"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; trans$ ERCC-00002 ERCC exon 1 1061 0.000000 + . gene_id "ERCC-00002"; transcript_id "DQ459430";

ERCC-00003 ERCC exon 1 1023 0.000000 + . gene_id "ERCC-00003"; transcript_id "DQ516784";

...

There is a clear difference between the two styles. What is the correct way to append the genome GTF and the ERCC one? is there a preferred source for the latter?

ADD REPLY • link 7.0 years ago by gasta88 ▴ 50

score 2 · Answer 1 · 2017-04-11

Solved the problem.

The issue was due to the genome indexes generation. I had a look at the chrName.txt file and I saw that each gene was mentioned inside. I've compared it with another STAR genome indexes for yeast and I saw that the chrName.txt was different, mentioning only the chromosomes.

Mistakenly I've used the cDNA FASTA file from ENSEMBL instead of this one ( ftp://ftp.ensemblgenomes.org/pub/release-22/fungi/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.22.dna.toplevel.fa.gz ).

By re-creating the genome indexes with the right file, the mapping phase gave the right outputs.

Thanks again for the replies, they helped me to get deeper into the problem!