Question: problem indexing genome and alignment with STAR aligner
21 months ago by
pr.khavari0
Iran, Tehran, University of tehran
pr.khavari0 wrote:

Hi everyone,
I am trying to generate genome indexes with STAR to align my RNA-seq data, using this command:

 /data/software/STAR/source/STAR --runThreadN 16 --runMode genomeGenerate --genomeDir star_genome3/ --genomeFastaFiles Pvulgaris_442_v2.0.fa --sjdbGTFfile phavu.G19833.gnm2.ann1.PB8d.gene_exons.gff3 --sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --genomeChrBinNbits 18 --sjdbOverhang 100

But it finishes after about 5 minutes, and I suspect something is wrong because it runs so quickly.

The output files are:

Genome SAindex chrName.txt chrStart.txt exonInfo.tab genomeParameters.txt sjdbList.fromGTF.out.tab transcriptInfo.tab SA chrLength.txt chrNameLength.txt exonGeTrInfo.tab geneInfo.tab sjdbInfo.txt sjdbList.out.tab

Then I changed the run mode to alignment with this command:

 /data/software/STAR/source/STAR --runMode alignReads --genomeDir /data/mshoorooei/star_genome4/ --runThreadN 16 --outFilterMismatchNmax 2 --readFilesIn PE_27_F.fq.gz PE_27_R.fq.gz --readFilesCommand gunzip -c --outFileNamePrefix 27_ --outReadsUnmapped unmapped_27 --outSAMtype BAM SortedByCoordinate

Output is here:

27_Aligned.sortedByCoord.out.bam
27_Log.final.out
27_Log.out
27_Log.progress.out
27_SJ.out.tab

Unfortunately, this shows the same problem: it also finishes very quickly.

Do you have any idea? Thanks for your suggestions.

rna-seq alignment genome • 1.5k views
ADD COMMENTlink modified 7 months ago by h.mon27k • written 21 months ago by pr.khavari0
21 months ago by
Bergen, Norway
Michael Dondrup46k wrote:

This looks like a normal run, and your genome is probably rather small. Check the contents of Log.progress.out and Log.final.out. The run is just faster than you expected; there isn't anything wrong.


Edit: I see a smaller issue. You should use:

--outReadsUnmapped Fastx

not --outReadsUnmapped some_filename. That may be why you don't get the Unmapped.out.mate1/2 files.
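Applied to the command in the question, the fix could look like the sketch below. Only --outReadsUnmapped is changed; the binary path, file names and other options are copied from the question. The function name run_star_align and the DRY_RUN variable are hypothetical helpers added for illustration (with DRY_RUN set, the command is printed instead of run).

```shell
# Corrected version of the alignment command from the question.
# $1 = genome index directory, $2 = sample number (e.g. 27).
# DRY_RUN is a hypothetical helper variable: when it is set and non-empty,
# the command line is echoed instead of executed.
run_star_align() {
    ${DRY_RUN:+echo} /data/software/STAR/source/STAR \
        --runMode alignReads \
        --genomeDir "$1" \
        --runThreadN 16 \
        --outFilterMismatchNmax 2 \
        --readFilesIn "PE_${2}_F.fq.gz" "PE_${2}_R.fq.gz" \
        --readFilesCommand gunzip -c \
        --outFileNamePrefix "${2}_" \
        --outReadsUnmapped Fastx \
        --outSAMtype BAM SortedByCoordinate
}
```

For the sample in the question this would be called as: run_star_align /data/mshoorooei/star_genome4/ 27. With the 27_ prefix, the unmapped reads should then end up in 27_Unmapped.out.mate1 and 27_Unmapped.out.mate2.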

ADD COMMENTlink modified 21 months ago • written 21 months ago by Michael Dondrup46k

Thanks for your comment. My genome is nearly 600 Mb. How can I tell whether Log.progress.out and Log.final.out look right?

ADD REPLYlink written 21 months ago by pr.khavari0

You should watch the output of STAR while it is running. During genome generation it should print something like:

Nov 22 10:01:37 ..... started STAR run
Nov 22 10:01:37 ... starting to generate Genome files
Nov 22 10:02:34 ... starting to sort Suffix Array. This may take a long time...
Nov 22 10:03:00 ... sorting Suffix Array chunks and saving them to disk...
Nov 22 10:06:54 ... loading chunks from disk, packing SA...
Nov 22 10:07:24 ... finished generating suffix array
Nov 22 10:07:24 ... generating Suffix Array index
Nov 22 10:09:48 ... completed Suffix Array index
Nov 22 10:09:48 ..... processing annotations GTF
Nov 22 10:09:48 ..... inserting junctions into the genome indices
Nov 22 10:10:26 ... writing Genome to disk ...
Nov 22 10:12:04 ... writing Suffix Array to disk ...
Nov 22 10:13:04 ... writing SAindex to disk
Nov 22 10:13:26 ..... finished successfully

This was for a 680 Mbase genome in 33,000 scaffolds, using 120 CPUs, but I don't think multiple cores help much during genome generation.

During alignment it should output something like:

Jun 06 21:12:47 ..... Started STAR run
Jun 06 21:12:47 ..... Loading genome
Jun 06 21:12:47 ..... Started mapping
Jun 06 21:14:09 ..... Finished successfully
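If you would rather check for that final status line with a script than by eye, a small shell helper can do it. This is only a sketch: it assumes you saved STAR's console output to a file yourself (for example by appending "> star_stdout.log" to the command; that file name is hypothetical), and the function name star_finished_ok is made up for illustration. It matches case-insensitively because the capitalisation of the message differs between the logs quoted in this thread.

```shell
# Exit status 0 if the last non-empty line of a saved STAR console log
# reports a successful finish, non-zero otherwise.
# Assumption: $1 is a file the user created by redirecting STAR's console output.
star_finished_ok() {
    grep -v '^[[:space:]]*$' "$1" | tail -n 1 | grep -qi 'finished successfully'
}
```

Usage would be something like: star_finished_ok star_stdout.log && echo "STAR run completed".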

Using Log.final.out, you can then compare the number of input reads with the number of sequences in your input file (they should be the same, of course) and check the mapping rate (90%+ is common for good data).
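That comparison can be scripted roughly as follows. Treat it as a sketch under a few assumptions: Log.final.out is taken to have one "label |<tab>value" entry per line, the FASTQ files are assumed to have exactly 4 lines per record, and the function names star_log_value and fastq_reads are made up for illustration. For paired-end data, as far as I know, STAR counts read pairs as input reads, so the number to compare against is the record count of one mate file, not both.

```shell
# Print the value for one label from a STAR Log.final.out.
# $1 = exact label text (e.g. "Number of input reads"), $2 = path to Log.final.out
star_log_value() {
    awk -F'|' -v label="$1" '
        {
            key = $1; val = $2
            # trim leading and trailing blanks/tabs around label and value
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == label) { print val; exit }
        }' "$2"
}

# Count records in a gzipped FASTQ file, assuming 4 lines per record.
fastq_reads() {
    gunzip -c "$1" | awk 'END { printf "%d\n", NR / 4 }'
}
```

For the sample in this thread, star_log_value "Number of input reads" 27_Log.final.out should print the same number as fastq_reads PE_27_F.fq.gz, and star_log_value "Uniquely mapped reads %" 27_Log.final.out gives the unique mapping rate.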

ADD REPLYlink modified 21 months ago • written 21 months ago by Michael Dondrup46k

Here is the output during genome generation. It looks the same.

Dec 02 09:58:02 ..... started STAR run
Dec 02 09:58:02 ... starting to generate Genome files
Dec 02 09:58:10 ... starting to sort Suffix Array. This may take a long time...
Dec 02 09:58:13 ... sorting Suffix Array chunks and saving them to disk...
Dec 02 10:00:26 ... loading chunks from disk, packing SA...
Dec 02 10:00:40 ... finished generating suffix array
Dec 02 10:00:40 ... generating Suffix Array index
Dec 02 10:01:58 ... completed Suffix Array index
Dec 02 10:01:58 ... writing Genome to disk ...
Dec 02 10:01:58 ... writing Suffix Array to disk ...
Dec 02 10:02:00 ... writing SAindex to disk
Dec 02 10:02:01 ..... finished successfully

And this is the STAR output during alignment:

Dec 02 10:28:24 ..... started STAR run
Dec 02 10:28:24 ..... loading genome
Dec 02 10:28:26 ..... started mapping
Dec 02 10:32:17 ..... started sorting BAM
Dec 02 10:33:38 ..... finished successfully
ADD REPLYlink modified 21 months ago • written 21 months ago by pr.khavari0

So there is no obvious error. Your genome generation is faster than ours, but that is probably I/O-related.

ADD REPLYlink written 21 months ago by Michael Dondrup46k

Okay, thanks so much for your help.

ADD REPLYlink written 21 months ago by pr.khavari0