Question: >99% unmapped reads
0
gravatar for RPuni
9 days ago by
RPuni0
Italy
RPuni0 wrote:

Hi everyone,

I am using STAR to map trimmed pair-end reads of D. melanogaster to its reference genome downloaded from NCBI. Also gff annotations file was downloaded there.

From log.final.out file 99.9% of reads were unmapped classified as "too short". I attach command lines used to index and map.

STAR --runThreadN 4 --runMode genomeGenerate \
    --genomeDir dmel_genome/index \
    --genomeFastaFiles dmel_genome/dmel_chr2L.fasta \
                       dmel_genome/dmel_chr2R.fasta \
                       dmel_genome/dmel_chr3L.fasta \
                       dmel_genome/dmel_chr3R.fasta \
                       dmel_genome/dmel_chr4.fasta \
                       dmel_genome/dmel_chrX.fasta \
                       dmel_genome/dmel_chrY.fasta \
                       dmel_genome/dmel_chrMT.fasta \
    --sjdbGTFfile dmel_genome/dmel_anno.gff \
    --sjdbGTFtagExonParentTranscript Parent \
    --sjdbOverhang 99


STAR --runThreadN 4 --genomeDir dmel_genome/index/ \
    --readFilesIn trimmed_reads/Dmel_paired_trimmo_A_R1.fq \
                  trimmed_reads/Dmel_paired_trimmo_A_R2.fq \
    --outFileNamePrefix results/STAR/dmel_ \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within \
    --outSAMattributes Standard

I found a possible error in --sjdbOverhang 99 which has to be 142 and not 99 since the max length is 143, but could this error be the one that cause these high number of unmapped reads? Should I map the paired ends individually to avoid this problem?

rna-seq star • 178 views
ADD COMMENTlink modified 9 days ago by swbarnes26.0k • written 9 days ago by RPuni0
1

Maybe the sequencing is not from Drosophila melanogaster? You can check a few reads with blast to check if you really have D. melanogaster reads, or you can quickly check the sequencing run with BBTools SendSketch.

ADD REPLYlink written 9 days ago by h.mon26k

Good point, could be contamination. You can paste 100 reads into the NCBI blast or a local blast to get a quick impression too.

ADD REPLYlink written 6 days ago by colindaven1.6k

what is the length distribution of your reads after trimming. IMO, the max length is irrelevant when the problem is 99.9% of reads were unmapped classified as "too short"

ADD REPLYlink written 9 days ago by Carambakaracho1.4k

1/3 of the reads length is 143. The lengths range from 50 to 143.

ADD REPLYlink written 9 days ago by RPuni0
2
gravatar for colindaven
9 days ago by
colindaven1.6k
Hannover Medical School
colindaven1.6k wrote:

Could be an error with a truncated reference sequence. Does the FASTA file have all chromosomes ?

Try the Ensembl Drosophila as well, I have had better experience with the annotations from there as opposed to NCBI. iGenomes is a further good source.

Also I don't use the sjdbOverhang to create genomes

STAR --runMode genomeGenerate --runThreadN 8 --genomeDir mm10_star  --genomeFastaFiles mm10.fa &

Finally, @carambak, STAR iteratively trims reads during alignment, and any unaligned reads are then "too short". Weird but true.

ADD COMMENTlink written 9 days ago by colindaven1.6k

Thanks for the note Colin. I realize should enhance my RNAseq knowledge a bit before commenting more on the subject

ADD REPLYlink written 9 days ago by Carambakaracho1.4k
1

STAR iteratively trims reads during alignment, and any unaligned reads are then "too short". Weird but true.

Thanks again, this really was useful. While I'm more of a microbiology / WGS bioinformatics guy, it never harms to know some things beyond my core areas. However, this is what I call a hidden feature.

I tried to figure where Alex Dobin's github manual mentions trimming, and the 49 pages don't seem to cover it - or I missed it. Neither does their original publication from 2013 mention it explicitly, with your comment I realize trimming is what figure 1c and its explication is about. Only the second publication in 2015 mentioned it explicitly - though you still have to figure out that "too short" also means "too short after trimming"

ADD REPLYlink modified 5 days ago • written 5 days ago by Carambakaracho1.4k
1
gravatar for swbarnes2
9 days ago by
swbarnes26.0k
United States
swbarnes26.0k wrote:

I found a possible error in --sjdbOverhang 99 which has to be 142 and not 99 since the max length is 143, but could this error be the one that cause these high number of unmapped reads?

The STAR manual says that the alignment is not sensitive to this parameter.

You understand that "too short" really means "didn't map"? Are you sure that you have the right organism? Are you sure that your trimmed reads still look sane?

ADD COMMENTlink written 9 days ago by swbarnes26.0k

Yes, I do know too short == did not map. I am on the same organism (D. melanogaster reads against D melanogaster reference genome). I saw the quality of trimmed reads is high,what should I also check?

ADD REPLYlink written 6 days ago by RPuni0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 958 users visited in the last hour