Question

>99% unmapped reads

0

Entering edit mode

4.8 years ago

RiccPicc ▴ 10

Hi everyone,

I am using STAR to map trimmed pair-end reads of D. melanogaster to its reference genome downloaded from NCBI. Also gff annotations file was downloaded there.

From log.final.out file 99.9% of reads were unmapped classified as "too short". I attach command lines used to index and map.

STAR --runThreadN 4 --runMode genomeGenerate \
    --genomeDir dmel_genome/index \
    --genomeFastaFiles dmel_genome/dmel_chr2L.fasta \
                       dmel_genome/dmel_chr2R.fasta \
                       dmel_genome/dmel_chr3L.fasta \
                       dmel_genome/dmel_chr3R.fasta \
                       dmel_genome/dmel_chr4.fasta \
                       dmel_genome/dmel_chrX.fasta \
                       dmel_genome/dmel_chrY.fasta \
                       dmel_genome/dmel_chrMT.fasta \
    --sjdbGTFfile dmel_genome/dmel_anno.gff \
    --sjdbGTFtagExonParentTranscript Parent \
    --sjdbOverhang 99


STAR --runThreadN 4 --genomeDir dmel_genome/index/ \
    --readFilesIn trimmed_reads/Dmel_paired_trimmo_A_R1.fq \
                  trimmed_reads/Dmel_paired_trimmo_A_R2.fq \
    --outFileNamePrefix results/STAR/dmel_ \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within \
    --outSAMattributes Standard

I found a possible error in --sjdbOverhang 99 which has to be 142 and not 99 since the max length is 143, but could this error be the one that cause these high number of unmapped reads? Should I map the paired ends individually to avoid this problem?

STAR RNA-Seq • 1.7k views

ADD COMMENT • link updated 4.8 years ago by swbarnes2 14k • written 4.8 years ago by RiccPicc ▴ 10

1

Entering edit mode

Maybe the sequencing is not from Drosophila melanogaster? You can check a few reads with blast to check if you really have D. melanogaster reads, or you can quickly check the sequencing run with BBTools SendSketch.

ADD REPLY • link 4.8 years ago by h.mon 35k

0

Entering edit mode

Good point, could be contamination. You can paste 100 reads into the NCBI blast or a local blast to get a quick impression too.

ADD REPLY • link 4.8 years ago by colindaven 6.4k

0

Entering edit mode

what is the length distribution of your reads after trimming. IMO, the max length is irrelevant when the problem is 99.9% of reads were unmapped classified as "too short"

ADD REPLY • link 4.8 years ago by Carambakaracho ★ 3.2k

0

Entering edit mode

1/3 of the reads length is 143. The lengths range from 50 to 143.

ADD REPLY • link 4.8 years ago by RiccPicc ▴ 10

score 2 · Answer 1 · 2019-07-12

2

Entering edit mode

4.8 years ago

colindaven 6.4k

Could be an error with a truncated reference sequence. Does the FASTA file have all chromosomes ?

Try the Ensembl Drosophila as well, I have had better experience with the annotations from there as opposed to NCBI. iGenomes is a further good source.

Also I don't use the sjdbOverhang to create genomes

STAR --runMode genomeGenerate --runThreadN 8 --genomeDir mm10_star  --genomeFastaFiles mm10.fa &

Finally, @carambak, STAR iteratively trims reads during alignment, and any unaligned reads are then "too short". Weird but true.

ADD COMMENT • link 4.8 years ago by colindaven 6.4k

0

Entering edit mode

Thanks for the note Colin. I realize should enhance my RNAseq knowledge a bit before commenting more on the subject

ADD REPLY • link 4.8 years ago by Carambakaracho ★ 3.2k

1

Entering edit mode

STAR iteratively trims reads during alignment, and any unaligned reads are then "too short". Weird but true.

Thanks again, this really was useful. While I'm more of a microbiology / WGS bioinformatics guy, it never harms to know some things beyond my core areas. However, this is what I call a hidden feature.

I tried to figure where Alex Dobin's github manual mentions trimming, and the 49 pages don't seem to cover it - or I missed it. Neither does their original publication from 2013 mention it explicitly, with your comment I realize trimming is what figure 1c and its explication is about. Only the second publication in 2015 mentioned it explicitly - though you still have to figure out that "too short" also means "too short after trimming"

ADD REPLY • link 4.8 years ago by Carambakaracho ★ 3.2k

score 1 · Answer 2 · 2019-07-12

1

Entering edit mode

4.8 years ago

swbarnes2 14k

I found a possible error in --sjdbOverhang 99 which has to be 142 and not 99 since the max length is 143, but could this error be the one that cause these high number of unmapped reads?

The STAR manual says that the alignment is not sensitive to this parameter.

You understand that "too short" really means "didn't map"? Are you sure that you have the right organism? Are you sure that your trimmed reads still look sane?

ADD COMMENT • link 4.8 years ago by swbarnes2 14k

0

Entering edit mode

Yes, I do know too short == did not map. I am on the same organism (D. melanogaster reads against D melanogaster reference genome). I saw the quality of trimmed reads is high,what should I also check?

ADD REPLY • link 4.8 years ago by RiccPicc ▴ 10