7 months ago
klsywd
Hello, my output file is ending up with each chromosome listed twice, and each read is also being mapped twice to the same location. Any idea what could be wrong? Here is an example. These are Illumina 150 bp paired-end reads from SRA (downloaded via fastq-dump --split-files), mapped to an unannotated, unpublished genome.
SRR31736706.1 99 Chromosome23 7398268 3 150M = 7398439 321 GNCATCACCATCGGTAACGAGAGGTTCCGTTGCCCTGAGGCTCTCTTCCAGCCTTCCTTCTTGGGTATGGAATCGTGCGGTATCCACGAGACCGTGTACAACTCCATCATGAAGTGCGACGTTGACATCCGTAAGGACCTGTACGCCAAC I#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NH:i:2 HI:i:1 AS:i:295 nM:i:1
SRR31736706.1 147 Chromosome23 7398439 3 150M = 7398268 -321 ACCATGTACCCCGGTATCGCCGACAGGATGCAGAAGGAGATCACCGCCCTCGCTCCCTCCACCATCAAGATCAAGAGCATCGCTCCCCCCGAGAGGAAGTACTCCGTATGGATCGGTGGATCCATCCTGGCTTCCCTCTCCACCTTCCAG IIII99IIII9I-IIIIIIII-999-I99-I9999--I99-II9I-999-9-9-9I9-99-II--9----I-II----I--III-II9IIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NH:i:2 HI:i:1 AS:i:295 nM:i:1
SRR31736706.1 355 Chromosome23 7398268 3 150M = 7398439 321 GNCATCACCATCGGTAACGAGAGGTTCCGTTGCCCTGAGGCTCTCTTCCAGCCTTCCTTCTTGGGTATGGAATCGTGCGGTATCCACGAGACCGTGTACAACTCCATCATGAAGTGCGACGTTGACATCCGTAAGGACCTGTACGCCAAC I#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NH:i:2 HI:i:2 AS:i:295 nM:i:1
SRR31736706.1 403 Chromosome23 7398439 3 150M = 7398268 -321 ACCATGTACCCCGGTATCGCCGACAGGATGCAGAAGGAGATCACCGCCCTCGCTCCCTCCACCATCAAGATCAAGAGCATCGCTCCCCCCGAGAGGAAGTACTCCGTATGGATCGGTGGATCCATCCTGGCTTCCCTCTCCACCTTCCAG IIII99IIII9I-IIIIIIII-999-I99-I9999--I99-II9I-999-9-9-9I9-99-II--9----I-II----I--III-II9IIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NH:i:2 HI:i:2 AS:i:295 nM:i:1
You can see that for the same read ID, the exact same read is mapped to the same location twice, but the second time around it's flagged 355 or 403, i.e. 'not primary alignment'.
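For anyone comparing the flags, here is a minimal Python sketch that decodes the four FLAG values from the records above into their bits, using the bit definitions from the SAM specification. The helper name `decode_flag` and the short bit labels are made up for illustration only.

```python
# SAM FLAG bits as defined in the SAM specification.
# The short labels are illustrative names, not official identifiers.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


# The four FLAG values seen in the example records.
for flag in (99, 147, 355, 403):
    print(flag, decode_flag(flag))
```

Decoded this way, 355 is 99 plus the 0x100 bit and 403 is 147 plus the 0x100 bit, so the second pair of records differs from the first only in being marked secondary (not primary).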
command:
STAR --genomeDir [path to genome directory] --genomeFastaFiles [path to genome.fasta] --readFilesIn SRR31736706_1.fastq SRR31736706_2.fastq
Long shot, but it's not like you have double entries in your FASTA file, right?
No, there's only one entry per chromosome/scaffold in the genome fasta file and in the genomeDir
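One way to double-check this is to look not only for repeated sequence names but also for identical sequence content stored under different names. The sketch below is only an illustration: the function name `find_fasta_duplicates` and the toy records are hypothetical, and it assumes a plain-text (uncompressed) FASTA.

```python
import hashlib
from collections import Counter, defaultdict


def find_fasta_duplicates(lines):
    """Scan FASTA lines; return (repeated names, groups of names with identical sequence)."""
    records = []  # one (name, running sha256 hasher) tuple per FASTA entry
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the first whitespace-delimited token of the header as the name.
            records.append((parts[0] if parts else "", hashlib.sha256()))
        elif records:
            # Hash sequence case-insensitively, ignoring line wrapping.
            records[-1][1].update(line.upper().encode())

    name_counts = Counter(name for name, _ in records)
    repeated_names = sorted(n for n, c in name_counts.items() if c > 1)

    by_digest = defaultdict(list)
    for name, hasher in records:
        by_digest[hasher.hexdigest()].append(name)
    identical = [names for names in by_digest.values() if len(names) > 1]
    return repeated_names, identical


# Toy example (hypothetical data): chrA and chrB carry the same sequence.
example = [">chrA", "ACGT", "acgt", ">chrB some description", "ACGTAC", "GT", ">chrC", "TTTT"]
print(find_fasta_duplicates(example))
```

For a real genome, the same function can be fed an open file handle, e.g. `find_fasta_duplicates(open(path))`, where `path` points to the genome FASTA.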
OK, and the read is also not present twice in your input fastq file?
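A quick way to test this is to count read IDs in each FASTQ file. The following is a minimal sketch, assuming standard 4-line FASTQ records (no wrapped sequence lines) and that the first whitespace-delimited token of the header is the read ID; the function name `duplicated_read_ids` and the toy records are hypothetical. It keeps every ID in memory, so it may be heavy for very large files.

```python
from collections import Counter
from itertools import islice


def duplicated_read_ids(lines):
    """Return {read_id: count} for read IDs that occur more than once in a 4-line FASTQ."""
    counts = Counter()
    it = iter(lines)
    while True:
        record = list(islice(it, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break
        tokens = record[0].strip()[1:].split()  # drop the leading '@'
        counts[tokens[0] if tokens else ""] += 1
    return {rid: n for rid, n in counts.items() if n > 1}


# Toy example (hypothetical data): read r1 appears twice.
example = [
    "@r1 1 length=4", "ACGT", "+", "IIII",
    "@r2 2 length=4", "TTTT", "+", "IIII",
    "@r1 1 length=4", "ACGT", "+", "IIII",
]
print(duplicated_read_ids(example))
```

On real data this would be run once per mate file, e.g. `duplicated_read_ids(open("SRR31736706_1.fastq"))`; an empty result means no read ID is repeated in that file.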
Without knowing the exact cause, you can perhaps limit the output of STAR by using:
or