Low mapping frequency on STAR
1
0
Entering edit mode
11 weeks ago
a_bis ▴ 20

EDIT: problem seems to be that 88% of reads seem to be 'too short' but not sure how to resolve this -- more details in my comment below.

Hi all, I've been trying to re-map some RNA-seq fasta files to mm39 using STAR. I was told by the sequencing facility who ran the requencing for me that when they mapped the reads onto mm10 using BWA MEM their mapping frequency for each file was around 95%. However when I do the alignment, my unique reads frequency is a little under 10%. Judging by the mapping frequency achieved using BWA MEM I doubt that rRNA or other contamination is a problem. Do you have any suggestions as to why I'm coming up against this problem? Could I be using incorrect genome/annotation files for the genome indexing?

For reference, I used the following code to generate my genome index (with genome and annotation files from Ensembl):

STAR --runMode genomeGenerate --genomeDir mm39starindex/index/ \
--genomeFastaFiles /fullpath/Mus_musculus.GRCm39.dna.primary_assembly.fasta \
--sjdbGTFfile /fullpath/Mus_musculus.GRCm39.104.gtf  \


And the format of the code I'm using for the alignment:

#!/bin/bash
#SBATCH --job-name=star
#SBATCH --time=0-64:00:00
#SBATCH --output=outfile.%j
#SBATCH --error=errfile.%j

STAR --runThreadN 16 --genomeDir mm39starindex/mm39starindex/index \
--outSAMtype BAM SortedByCoordinate --quantMode GeneCounts \
--outFileNamePrefix alignments/filename


Any advice will be much appreciated!

star rna-seq • 829 views
1
Entering edit mode

0
Entering edit mode

UPDATE: I extracted 25,000 reads from one of my fastq files and ran STAR just on that. I examined the output log and got the following:

According to this, 88% of my reads are not mapping because they are 'too short' (no clue how short that is). However, as far as I can understand STAR doesn't have a minimum requirement for read length. Nonetheless, I ran it again with --outFilterMatchNmin 20, and I got pretty much the same proportions of mapped vs. unmapped reads. This is confusing because my average input read length is 151 bases, which is very close to the average mapped length of 148 bases, which makes me think that the majority of reads are, in fact, of that approximate length, and should therefore be mappable. Again, any ideas are very welcome! Thanks :)

0
Entering edit mode

Can you verify the following is not happening with your data? (LINK)

I encountered this issue when in the two paired-end input FASTQ files mates are out-of-order, i.e. mates are not found at the same line of the two files. This leads to a lot of not properly mapped read pairs that STAR throws into the "too short" bucket.

i.e. your R1/R2 reads are possibly not in sync. This could happen if you trimmed them individually.

0
Entering edit mode

"Too short" just means they didn't map. It has nothing to do with read length.

0
Entering edit mode

Thank you -- is there any reason that BWA MEM would be so good at mapping the fastq files (90+%) whereas STAR would perform much worse on the same files?

0
Entering edit mode

Well, are you sure that it's not paired reads getting out of sync? You've confirmed that you get the same bad results when aligning one end at a time?

0
Entering edit mode

Hi all, thank you for your many helpful suggestions. I have not been able to test some of these out yet as the server I have been using to run the alignments is currently down. Hopefully, though, it should be back up soon and I will be able to confirm my working hypothesis, which is the following: I re-downloaded the full mm39 genome assembly from Ensembl, and discovered that the file was about 30 times as large as the one I previously used to generate a STAR genome index. I re-generated an index with the new (apparently intact) file, and this time round it took me much longer (25 minutes) than it had previously taken me. Based on that, I am assuming that I used a partial/corrupted/incomplete genome assembly file to generate an index, which meant that most of my reads weren't aligning. I will test this out as soon as I can by attempting to re-map my fastq files with the new index and confirm whether mapping rates this time round are higher.

Honestly though I have no idea how the genome assembly file "shrank" -- I downloaded it a couple of weeks ago and other than copying it to my institutional server to run the alignment there, to my knowledge I didn't otherwise manipulate it.

0
Entering edit mode

full mm39 genome assembly from Ensembl, and discovered that the file was about 30 times as large as the one I previously used to generate a STAR genome index.

You likely downloaded the top level assembly which includes haplotypes etc. For most analyses one does not need this (not for RNAseq anyway). You should download primary assembly which includes the main chromosomes and re-index this. You will find that this would be the correct size ~2.7G.

0
Entering edit mode

Thank you very much -- as far as I could tell, the primary file was the one I went for initially, but I will keep this in mind for the future.

2
Entering edit mode
10 weeks ago
a_bis ▴ 20

Hi all, I can confirm that my initial botched-up index was the cause of the very low mapping rate. Proper index re-generation and re-alignment is now giving me 84% mapping rate. Also, the alignment speed is much higher now (the alignment was done within 30 minutes with --runThreadN 16), so I assume that my incomplete initial index was also to blame for that. Thanks everyone for your input and suggestions, I've learned a lot in the process.

1
Entering edit mode

Since this fixed your problem I am moving this comment to an answer. You can accept it (green check mark) to provide closure to this thread.