Question

Build repeat genome index using STAR

5.6 years ago

yancychy ▴ 10

Hi, I downloaded the repeat sequence and GTF (RepeatMasker) files from the UCSC Table Browser. I want to build a repeat genome index so that I can remove reads that may be spurious artifacts from rRNA and other repetitive sequences. However, the job always fails because it exceeds the memory limit, even after I raised the memory from 30 GB to 120 GB.
The repeat genome file is 2.1 GB and the GTF file is 552 MB.

Output:

Nov 18 17:58:19 ..... started STAR run
Nov 18 17:58:19 ... starting to generate Genome files
slurmstepd: Job 11091167 exceeded memory limit (123675052 > 122880000), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: * JOB 11091167 CANCELLED AT 2019-11-18T13:20:19 * on node311

Script:

/home/ychen10/STAR-2.7.3a/bin/Linux_x86_64/STAR \
       --runThreadN 4 \
       --runMode genomeGenerate \
       --genomeDir index \
       --genomeFastaFiles repeatSeq.fa \
       --sjdbGTFfile repeatSeq.gtf \
       --sjdbOverhang 99 \
       --genomeChrBinNbits 16 \
       --genomeSAindexNbases 10 \
       --genomeSAsparseD 4

I am not sure whether the problem is caused by the repeat genome files or by the memory setting. Thanks.

STAR repeat index • 2.3k views

ADD COMMENT • link 5.6 years ago by yancychy ▴ 10

Thanks. I tried --limitGenomeGenerateRAM. It produced the same error.

ADD REPLY • link 5.6 years ago by yancychy ▴ 10

Comments are for answers; please use the reply button (yeah, it's a bit strange, but it makes finding things much easier!).

Is it the same error from Slurm? If so, something is going wrong, because STAR shouldn't use more than the limit specified. Can you try requesting, say, 50 GB of memory for the job while limiting STAR to 40 GB?
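A minimal sketch of that test, assuming `--limitGenomeGenerateRAM` is given in bytes (as the STAR manual describes) and that the Slurm job itself is submitted with a larger allocation such as `--mem=50G`. The helper names `ram_limit_bytes` and `genome_generate_args` are hypothetical and only build the argument list; they do not run STAR.

```python
def ram_limit_bytes(gigabytes):
    """Convert decimal gigabytes to the byte count passed to STAR."""
    return int(gigabytes * 1_000_000_000)


def genome_generate_args(genome_dir, fasta, ram_gb, threads=4):
    """Build a STAR genomeGenerate argument list with an explicit RAM cap.

    Only flags that already appear in this thread are used; the paths are
    the ones from the original script.
    """
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--limitGenomeGenerateRAM", str(ram_limit_bytes(ram_gb)),
    ]


# Cap STAR at 40 GB while the scheduler allocation is set higher (e.g. 50 GB).
args = genome_generate_args("index", "repeatSeq.fa", ram_gb=40)
print(" ".join(args))
```

If the job is still killed with the cap well below the allocation, that points at something other than the cap itself, such as the shape of the input files.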

ADD REPLY • link 5.6 years ago by Mark ★ 1.7k

Thanks. I tried limiting STAR to 40 GB. The error is the same. I think the problem may be caused by the input files.

repeatSeq.fa

>hg38_rmsk_L1P5 range=chr1:67108754-67109046 5'pad=0 3'pad=0 strand=+ repeatMasking=none
AACAAATAATCCCATCAAAAAGTAGGCAAAGGATATGAATAGATAATTTT
CAAAATAAGATATACAAATGAAAAAATGCTCAACATCACTAATTATCAGG
GAAATGCAAATTAAAACCACAATGAGATACTGCCTTATTCCTGAAAGAAT
GGCCATAATTTAAAAATTTTTTAAAAAATAGACCTTGGCATGGATGTGGT
AAAAAGGGAACACTTTTACACTGTTGGTGGGAATGTAAACTAGTATAAAC
ACTATGGAAAACAGTATGAAAATACCTTAAAGAATTAAAAGTA

>hg38_rmsk_AluY range=chr1:8388316-8388618 5'pad=0 3'pad=0 strand=- repeatMasking=none
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAA
GGCGGGCGGATCATGAGGTCAGGAGATCGAGACCATCCTGGCTAACAAGG
TGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGCGGTGGC
GGGCGCCTGTAGTCCCAGCTACTCAGGAGGCTGAGGCAGGAGAATGGCGT
GAACCCGGGAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTC
CGCAGTCCAGCCTGGGCGACAGAGTGAGACTCCGTCTCAAAAAAAAAAAA
AGA

repeatSeq.gtf (output of head -5 repeatSeq.gtf)

chr1    hg38_rmsk       exon    67108754        67109046        1892.000000     +       .       gene_id "L1P5"; transcript_id "L1P5";
chr1    hg38_rmsk       exon    8388316 8388618 2582.000000     -       .       gene_id "AluY"; transcript_id "AluY";
chr1    hg38_rmsk       exon    25165804        25166380        4085.000000     +       .       gene_id "L1MB5"; transcript_id "L1MB5";
chr1    hg38_rmsk       exon    33554186        33554483        2285.000000     -       .       gene_id "AluSc"; transcript_id "AluSc";
chr1    hg38_rmsk       exon    41942895        41943205        2451.000000     -       .       gene_id "AluY"; transcript_id "AluY_dup1";

ADD REPLY • link updated 5.6 years ago by h.mon 35k • written 5.6 years ago by yancychy ▴ 10

This is beyond me, I'm sorry. I suggest posting an issue on the STAR GitHub page. The maintainer is excellent at troubleshooting unusual cases.

ADD REPLY • link 5.6 years ago by Mark ★ 1.7k

Yes. Thanks very much.

ADD REPLY • link 5.6 years ago by yancychy ▴ 10

Why not remove the reads that map to RepeatMasker regions after the alignment?
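As a rough illustration of that idea, here is a small sketch that checks whether an aligned read overlaps any RepeatMasker interval. It assumes 1-based closed coordinates as in the GTF shown earlier in the thread; the function names `build_index` and `overlaps_repeat` are hypothetical, and in practice a dedicated tool such as bedtools intersect would normally do this on the BAM file directly.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(repeats):
    """Group (chrom, start, end) repeat intervals by chromosome and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in repeats:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlaps the previous interval: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        # Parallel sorted lists of starts and ends for binary search.
        index[chrom] = ([m[0] for m in merged], [m[1] for m in merged])
    return index


def overlaps_repeat(index, chrom, start, end):
    """Return True if the closed interval [start, end] overlaps a repeat on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged repeat that starts at or before the read's end.
    i = bisect_right(starts, end) - 1
    return i >= 0 and ends[i] >= start


# The two chr1 intervals below are taken from the GTF lines posted above.
index = build_index([
    ("chr1", 67108754, 67109046),  # L1P5
    ("chr1", 8388316, 8388618),    # AluY
])
print(overlaps_repeat(index, "chr1", 8388600, 8388700))  # read overlapping AluY
print(overlaps_repeat(index, "chr1", 9000000, 9000100))  # read outside both repeats
```

Reads for which `overlaps_repeat` returns True would be dropped; everything else is kept. Merging the intervals first is what makes a single binary search per read sufficient.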

ADD REPLY • link 5.6 years ago by Shicheng Guo ★ 9.6k

Thanks. I will try it.

ADD REPLY • link 4.8 years ago by yancychy ▴ 10

score 1 · Answer 1 · 2019-11-18

5.6 years ago

Mark ★ 1.7k

Add --limitGenomeGenerateRAM and see how you go. For whatever reason, the indexing job is using a large amount of RAM.
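One possible explanation, offered as an assumption rather than a diagnosis confirmed in this thread: for genomes with a very large number of references, the STAR manual recommends scaling `--genomeChrBinNbits` as min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))) and `--genomeSAindexNbases` as min(14, log2(GenomeLength)/2 - 1). Each reference occupies a whole number of bins of 2^genomeChrBinNbits bases, so millions of short repeat sequences with a bin size of 2^16 could inflate the stored genome enormously. The sketch below works through that arithmetic. The sequence count of 5,000,000 is hypothetical, the genome length is approximated from the 2.1 GB file size, the read length of 100 is inferred from `--sjdbOverhang 99`, and rounding down to an integer is my choice, not something the manual specifies.

```python
import math


def suggested_chr_bin_nbits(genome_length, n_references, read_length):
    """Manual's scaling rule for --genomeChrBinNbits, rounded down (assumption)."""
    return min(18, int(math.log2(max(genome_length / n_references, read_length))))


def suggested_sa_index_nbases(genome_length):
    """Manual's scaling rule for --genomeSAindexNbases, rounded down (assumption)."""
    return min(14, int(math.log2(genome_length) / 2 - 1))


def padded_genome_bases(n_references, chr_bin_nbits):
    """Lower bound on stored bases when every reference takes at least one bin."""
    return n_references * 2 ** chr_bin_nbits


GENOME_LENGTH = 2_100_000_000  # approximated from the 2.1 GB FASTA (assumption)
N_REFERENCES = 5_000_000       # hypothetical; count the '>' lines to get the real value
READ_LENGTH = 100              # inferred from --sjdbOverhang 99 (assumption)

print("suggested genomeChrBinNbits:",
      suggested_chr_bin_nbits(GENOME_LENGTH, N_REFERENCES, READ_LENGTH))
print("suggested genomeSAindexNbases:",
      suggested_sa_index_nbases(GENOME_LENGTH))
print("padded bases at 16 bits:", padded_genome_bases(N_REFERENCES, 16))
```

Under these assumed numbers, the padded genome at `--genomeChrBinNbits 16` would be on the order of hundreds of gigabases, far above the 120 GB allocation. Running `grep -c '>' repeatSeq.fa` gives the actual number of sequences to plug in.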

ADD COMMENT • link 5.6 years ago by Mark ★ 1.7k