Question: Build repeat genome index using STAR
yancychy wrote, 12 months ago:

Hi, I downloaded the repeat genome and GTF (RepeatMasker) files from the UCSC Table Browser. I want to build a repeat genome index so that I can remove reads that may be spurious artifacts from rRNA and other repetitive sequences. However, the job always fails because it exceeds the memory limit. I have tried memory settings from 30 GB to 120 GB.
The repeat genome file is 2.1 GB and the GTF file is 552 MB.

Output:

Nov 18 17:58:19 ..... started STAR run
Nov 18 17:58:19 ... starting to generate Genome files
slurmstepd: Job 11091167 exceeded memory limit (123675052 > 122880000), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: * JOB 11091167 CANCELLED AT 2019-11-18T13:20:19 * on node311
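The two numbers in the slurmstepd message are easier to read after a unit conversion. This is a minimal sketch, assuming they are reported in KiB (the usual unit for this message, but not stated in the log). `kib_to_gib` is a hypothetical helper name.

```shell
# Hypothetical helper: convert a KiB count to GiB with two decimals.
kib_to_gib() {
    awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / (1024 * 1024) }'
}

kib_to_gib 123675052   # memory in use when the job was killed -> 117.95
kib_to_gib 122880000   # job memory limit                      -> 117.19
```

Under that assumption the limit corresponds to a request of 120000 MiB. The overshoot looks small only because the job is killed as soon as it crosses the limit, so it says little about how much memory the run would eventually have needed.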

Script:

/home/ychen10/STAR-2.7.3a/bin/Linux_x86_64/STAR \
       --runThreadN 4 \
       --runMode genomeGenerate \
       --genomeDir index \
       --genomeFastaFiles repeatSeq.fa \
       --sjdbGTFfile repeatSeq.gtf \
       --sjdbOverhang 99 \
       --genomeChrBinNbits 16 \
       --genomeSAindexNbases 10 \
       --genomeSAsparseD 4

I am not sure whether the problem is caused by the repeat genome files or by the memory setting. Thanks.


Thanks. I tried --limitGenomeGenerateRAM. It produced the same error.

written 12 months ago by yancychy

Comments are for answers; please use the reply button (yes, it's a bit strange, but it makes things much easier to find).

Is it the same error from Slurm? If so, something is going wrong, because STAR shouldn't use more than the specified limit. Can you try giving the job, say, 50 GB of memory while limiting STAR to 40 GB?
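The suggestion above could look roughly like the following. It is a dry-run sketch, not a tested recipe: it only prints the commands so the values can be checked first. `--limitGenomeGenerateRAM` takes a value in bytes. The job script name `build_index.sh` and the helper `gb_to_bytes` are hypothetical, and the STAR options are abbreviated from the script in the question.

```shell
# Hypothetical helper: convert decimal gigabytes to bytes.
gb_to_bytes() {
    echo $(( $1 * 1000000000 ))
}

star_limit_gb=40   # cap passed to STAR via --limitGenomeGenerateRAM
job_mem_gb=50      # memory requested from Slurm, deliberately larger than the cap

# Print the commands instead of running them.
echo "sbatch --mem=${job_mem_gb}G build_index.sh"
echo "STAR --runMode genomeGenerate --genomeDir index" \
     "--genomeFastaFiles repeatSeq.fa" \
     "--limitGenomeGenerateRAM $(gb_to_bytes "$star_limit_gb")"
```

Leaving headroom between the two values makes it easier to tell whether STAR is respecting its own cap or whether Slurm is killing the job for some other reason.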

modified 12 months ago • written 12 months ago by Mark

Thanks. I tried limiting STAR to 40 GB. The error is the same. I think the problem may be caused by the input files.

repeatSeq.fa

>hg38_rmsk_L1P5 range=chr1:67108754-67109046 5'pad=0 3'pad=0 strand=+ repeatMasking=none
AACAAATAATCCCATCAAAAAGTAGGCAAAGGATATGAATAGATAATTTT
CAAAATAAGATATACAAATGAAAAAATGCTCAACATCACTAATTATCAGG
GAAATGCAAATTAAAACCACAATGAGATACTGCCTTATTCCTGAAAGAAT
GGCCATAATTTAAAAATTTTTTAAAAAATAGACCTTGGCATGGATGTGGT
AAAAAGGGAACACTTTTACACTGTTGGTGGGAATGTAAACTAGTATAAAC
ACTATGGAAAACAGTATGAAAATACCTTAAAGAATTAAAAGTA

>hg38_rmsk_AluY range=chr1:8388316-8388618 5'pad=0 3'pad=0 strand=- repeatMasking=none
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAA
GGCGGGCGGATCATGAGGTCAGGAGATCGAGACCATCCTGGCTAACAAGG
TGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGCGGTGGC
GGGCGCCTGTAGTCCCAGCTACTCAGGAGGCTGAGGCAGGAGAATGGCGT
GAACCCGGGAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTC
CGCAGTCCAGCCTGGGCGACAGAGTGAGACTCCGTCTCAAAAAAAAAAAA
AGA

repeatSeq.gtf (output of head -5 repeatSeq.gtf)

chr1    hg38_rmsk       exon    67108754        67109046        1892.000000     +       .       gene_id "L1P5"; transcript_id "L1P5";
chr1    hg38_rmsk       exon    8388316 8388618 2582.000000     -       .       gene_id "AluY"; transcript_id "AluY";
chr1    hg38_rmsk       exon    25165804        25166380        4085.000000     +       .       gene_id "L1MB5"; transcript_id "L1MB5";
chr1    hg38_rmsk       exon    33554186        33554483        2285.000000     -       .       gene_id "AluSc"; transcript_id "AluSc";
chr1    hg38_rmsk       exon    41942895        41943205        2451.000000     -       .       gene_id "AluY"; transcript_id "AluY_dup1";
modified 12 months ago by h.mon • written 12 months ago by yancychy
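One thing that can be checked in input like this is the number of FASTA records, since the excerpt shows many short records of roughly 300 bases each. The STAR manual notes that each reference sequence occupies a whole number of bins of size 2^genomeChrBinNbits and, for genomes with a large number of references, suggests scaling the parameter as min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). Whether this is what drives the memory use here is an assumption and is not confirmed anywhere in this thread. The sketch below only counts records and prints the value that formula would give. The function name `suggest_chr_bin_nbits` and the default read length of 100 are illustrative choices.

```shell
# Hypothetical helper: suggest a --genomeChrBinNbits value for a FASTA file.
# Usage: suggest_chr_bin_nbits FASTA [READ_LENGTH]
suggest_chr_bin_nbits() {
    awk -v readlen="${2:-100}" '
        /^>/ { nseq++; next }        # header line: one more reference sequence
        { total += length($0) }      # sequence line: accumulate bases
        END {
            if (nseq == 0) { print "no sequences found" > "/dev/stderr"; exit 1 }
            mean = total / nseq
            if (mean < readlen) mean = readlen
            nbits = int(log(mean) / log(2))   # floor of log2
            if (nbits > 18) nbits = 18
            printf "sequences=%d bases=%.0f genomeChrBinNbits=%d\n", nseq, total, nbits
        }' "$1"
}

# Example call (file name taken from the question):
# suggest_chr_bin_nbits repeatSeq.fa 100
```

For records of about 300 bases the formula gives 8, well below the 16 used in the original command, so comparing the two values on the real file may be a useful first check.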

This is beyond me, I'm sorry. I suggest posting an issue on STAR's GitHub page. The maintainer is excellent at troubleshooting unusual cases.

written 12 months ago by Mark

Yes. Thanks very much

written 12 months ago by yancychy

Why not remove the reads that map to RepeatMasker regions after the alignment?

written 12 months ago by Shicheng Guo
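A first step toward filtering after alignment is to turn the RepeatMasker GTF into a BED file of regions. The sketch below assumes the GTF layout shown earlier in the thread (gene_id as the first attribute) and is not a general GTF parser. `gtf_to_bed` is a hypothetical helper name.

```shell
# Hypothetical helper: convert GTF records to 6-column BED.
# GTF coordinates are 1-based inclusive; BED starts are 0-based, hence start - 1.
gtf_to_bed() {
    awk 'BEGIN { OFS = "\t" }
         !/^#/ && NF >= 10 {
             name = $10               # value of the first attribute (gene_id)
             gsub(/[";]/, "", name)   # strip quotes and the trailing semicolon
             print $1, $4 - 1, $5, name, 0, $7
         }' "$1"
}

# Example call (file names are assumptions):
# gtf_to_bed repeatSeq.gtf > repeats.bed
```

With the regions in BED form, a tool such as bedtools could then drop overlapping alignments, for example with `bedtools intersect -v -abam aligned.bam -b repeats.bed > filtered.bam`, where `aligned.bam` is a placeholder for the alignment against the normal genome.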

Thanks. I will try it.

written 8 weeks ago by yancychy
Mark wrote, 12 months ago:

Add --limitGenomeGenerateRAM and see how you go. For whatever reason, the indexing job is using a large amount of RAM.
