Question: Reads not aligned with STAR, only spike-ins
1
gravatar for gasta88
2.6 years ago by
gasta8830
gasta8830 wrote:

Hello everybody,

I'm fairly new in the RNA-seq analysis sector and I'm facing some problems with STAR aligner. I'm trying to align a set of reads on the Saccharomyces cerevisiae genome (Ensembl release 74) using spike-in (ERCC92) as QC method.

I've downloaded the .fa file from Ensembl and appended the spike-in sequences at the end (file: https://www.dropbox.com/s/ge18ellc8dcxnit/SCerevisiae.cdna.all.fa?dl=0 )

I've also created the GTF with reads and spike-in annotation in a similar fashion (file: https://www.dropbox.com/s/d3zcv4gemddfmbp/Saccharomyces_cerevisiae.R64-1-1.87.gtf?dl=0 ) The ERCC92 annotation where downloaded from here: https://github.com/NCBI-Hackathons/HASSL_Homogeneous_Analysis_of_SRA_rnaSequencing_Libraries/blob/master/Spike-Ins/ERCC92/ERCC92.gtf

The genome index creation was performed with the following command:

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir /homes/fgastaldello/STAR_test/S.Cer_db --genomeFastaFiles /homes/fgastaldello/SCerevisiae.cdna.all.fa

The mapping command was as follow:

STAR --sjdbGTFfile /homes/fgastaldello/Saccharomyces_cerevisiae.R64-1-1.87.gtf --quantMode GeneCounts --runThreadN 10 --genomeDir /homes/fgastaldello/STAR_test/S.Cer_db --genomeLoad NoSharedMemory --readFilesCommand gunzip -c --readFilesIn /homes/fgastaldello/dyeswap_data/Snf2/Snf2-${i}/${fastqfiles[0]} /homes/fgastaldello/dyeswap_data/Snf2/Snf2-${i}/${fastqfiles[1]} --outSAMmode Full --outSJfilterIntronMaxVsReadN 50 100 150 200 --outFilterType BySJout --outFilterMultimapNmax 2 --outFilterMismatchNmax 5 --outSAMunmapped Within

The bash script that runs this command iterates in 6 folders and substitute the variables ${fastqfiles[N]} with a sample files containing pair-end reads.

The command runs fine but the final result of the entire computation map only the spike-in. All the other genes are missing (file: https://www.dropbox.com/s/0e529ku9ub9z77y/ReadsPerGene.out.tab?dl=0 ).

Since I'm new I'm probably doing something silly but I can't get my head around it.

Thanks for any help.

rna-seq alignment genome • 1.4k views
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by gasta8830

Grab a few reads from your file and check them using BLAST at NCBI to confirm they hit yeast genome. Hopefully there are no surprises (e.g. data you were given does not belong to you).

ADD REPLYlink written 2.6 years ago by genomax74k

I've done as you suggested and they map to the yeast genome.

ADD REPLYlink written 2.6 years ago by gasta8830

Then I suggest that you start with simplest STAR options to see if that restores mapping. Perhaps you are using a filter that is removing the mapping (--outFilterMultimapNmax 2 this may be one of those culprits).

You could also give bbmap.sh from BBMap suite a try.

ADD REPLYlink written 2.6 years ago by genomax74k

Try to count with featureCounts, the star bam file (you may have to sort it) and your gtf annotation.

edit: also, check STAR log output, it should have information about %reads mapping to the genome, properly paired, multi-mappers, etc.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by h.mon28k

For future ref: featureCounts does not require a pre-sorted file.

ADD REPLYlink written 2.6 years ago by genomax74k

I did not remember that, although it is an amazing software, I have been using only STAR for a while now.

ADD REPLYlink written 2.6 years ago by h.mon28k

The log files from STAR are showing an average of 82 % unique mapped reads and low (less than 0.1%) unmapped reads for each sample.

I'm struggling to make featureCounts work since I operate on two Windows machines.

I'm quite incline to think that the GTF is not properly formatted. For example:

...

Mito ensembl start_codon 85554 85556 . + 0 gene_id "Q0297"; transcript_id "Q0297"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; trans$

Mito ensembl stop_codon 85707 85709 . + 0 gene_id "Q0297"; transcript_id "Q0297"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; trans$ ERCC-00002 ERCC exon 1 1061 0.000000 + . gene_id "ERCC-00002"; transcript_id "DQ459430";

ERCC-00003 ERCC exon 1 1023 0.000000 + . gene_id "ERCC-00003"; transcript_id "DQ516784";

...

There is a clear difference between the two styles. What is the correct way to append the genome GTF and the ERCC one? is there a preferred source for the latter?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by gasta8830
1
gravatar for gasta88
2.6 years ago by
gasta8830
gasta8830 wrote:

Solved the problem.

The issue was due to the genome indexes generation. I had a look at the chrName.txt file and I saw that each gene was mentioned inside. I've compared it with another STAR genome indexes for yeast and I saw that the chrName.txt was different, mentioning only the chromosomes.

Mistakenly I've used the cDNA FASTA file from ENSEMBL instead of this one ( ftp://ftp.ensemblgenomes.org/pub/release-22/fungi/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.22.dna.toplevel.fa.gz ).

By re-creating the genome indexes with the right file, the mapping phase gave the right outputs.

Thanks again for the replies, they helped me to get deeper into the problem!

ADD COMMENTlink written 2.6 years ago by gasta8830
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1965 users visited in the last hour