Hi everyone,
I am using STAR to map trimmed paired-end reads of D. melanogaster to its reference genome, which I downloaded from NCBI. The GFF annotation file was downloaded from there as well.
According to the Log.final.out file, 99.9% of reads were unmapped and classified as "too short".
Here are the command lines I used to build the index and to map the reads:
STAR --runThreadN 4 --runMode genomeGenerate \
--genomeDir dmel_genome/index \
--genomeFastaFiles dmel_genome/dmel_chr2L.fasta \
dmel_genome/dmel_chr2R.fasta \
dmel_genome/dmel_chr3L.fasta \
dmel_genome/dmel_chr3R.fasta \
dmel_genome/dmel_chr4.fasta \
dmel_genome/dmel_chrX.fasta \
dmel_genome/dmel_chrY.fasta \
dmel_genome/dmel_chrMT.fasta \
--sjdbGTFfile dmel_genome/dmel_anno.gff \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 99
STAR --runThreadN 4 --genomeDir dmel_genome/index/ \
--readFilesIn trimmed_reads/Dmel_paired_trimmo_A_R1.fq \
trimmed_reads/Dmel_paired_trimmo_A_R2.fq \
--outFileNamePrefix results/STAR/dmel_ \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--outSAMattributes Standard
I found a possible error: --sjdbOverhang is set to 99, but it should be 142, since the maximum read length is 143. Could this error be what causes such a high number of unmapped reads?
Should I map the paired ends individually to avoid this problem?
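Before mapping the mates separately, it may be worth ruling out that R1 and R2 have fallen out of step, since STAR applies its "too short" filter to the pair as a whole (the threshold is relative to the sum of the mates' lengths). This is only a guess at one possible cause; nothing in the thread confirms it. A minimal sketch, assuming standard 4-line FASTQ records and read IDs that match once a trailing /1 or /2 and any text after the first space are removed. The function name is made up for illustration.

```shell
# Count R1 records whose read ID does not match the record at the same
# position in R2. Assumes 4 lines per FASTQ record in both files.
# Extra records at the end of R2 are not detected by this sketch.
count_mismatched_pairs() {
    awk -v r2="$2" '
        # Reduce a header line to a bare ID: drop the comment and any /1 or /2.
        function id(s) { sub(/[ \t].*$/, "", s); sub(/\/[12]$/, "", s); return s }
        NR % 4 == 1 {
            if ((getline mate < r2) <= 0) { bad++; next }   # R2 ran out of records
            if (id($0) != id(mate)) bad++
            # Skip the sequence, "+" and quality lines of the R2 record.
            getline x < r2; getline x < r2; getline x < r2
        }
        END { print bad + 0 }' "$1"
}
```

Usage: count_mismatched_pairs trimmed_reads/Dmel_paired_trimmo_A_R1.fq trimmed_reads/Dmel_paired_trimmo_A_R2.fq. A result of 0 suggests the mates are in the same order, in which case the cause is probably elsewhere.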
Maybe the reads are not from Drosophila melanogaster? You can BLAST a few reads to see whether you really have D. melanogaster reads, or you can quickly check the sequencing run with BBTools SendSketch.
Good point, it could be contamination. You can also paste 100 reads into NCBI BLAST or a local BLAST to get a quick impression.
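To get reads into a form that can be pasted into BLAST, something like the following could work. It is a rough sketch assuming standard 4-line FASTQ records, and the function name is made up for illustration. It takes the first N reads rather than a random sample, so the result may not be representative of the whole file.

```shell
# Convert the first N reads of a FASTQ file to FASTA (default N = 100).
# Assumes 4 lines per FASTQ record.
fastq_head_to_fasta() {
    awk -v n="${2:-100}" '
        NR % 4 == 1 { print ">" substr($0, 2) }   # header line: replace leading @ with >
        NR % 4 == 2 { print }                      # sequence line
        NR >= 4 * n { exit }' "$1"
}
```

Usage: fastq_head_to_fasta trimmed_reads/Dmel_paired_trimmo_A_R1.fq 100 > first100.fa (the output file name is just an example).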
What is the length distribution of your reads after trimming? IMO, the maximum length is irrelevant when the problem is that 99.9% of reads were unmapped and classified as "too short".
About a third of the reads have a length of 143. The lengths range from 50 to 143.
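For anyone wanting to reproduce a length summary like the one above, here is a small sketch. It assumes standard 4-line FASTQ records, and the function name is made up for illustration.

```shell
# Tabulate read lengths in a FASTQ file (4 lines per record assumed).
# Prints one "length<TAB>count" line per distinct length, sorted by length.
fastq_length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (l in n) print l "\t" n[l] }' "$1" | sort -n
}
```

Usage: fastq_length_hist trimmed_reads/Dmel_paired_trimmo_A_R1.fq. Running it on both R1 and R2 shows whether one mate was trimmed much harder than the other.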