Question: Where to get >200Gb to build Bovine genome index in HISAT2?
0
6 months ago by
Hernán110
Argentina
Hernán110 wrote:

We are trying to index the UMD 3.1.1 Bovine genome using the HISAT2 software.

The problem is that hisat2-build needs more than 200 GB of memory, and we were unable to get enough hardware resources in the Argentine scientific computing network. We assume more than 200 GB because of this note in the HISAT2 manual:

If you use --snp, --ss, and/or --exon, hisat2-build will need about 200GB RAM for the human genome size as index building involves a graph construction.

Do you know of any facility, preferably free of charge, where we could run the indexer? I have already written a script that automates all the steps.

To download and run the script:

git clone https://github.com/hernanmd/hisat2_bovine.git
cd hisat2_bovine
./make_bgumd31.sh
ADD COMMENTlink modified 6 months ago • written 6 months ago by Hernán110

You could send a request to Daehwan Kim (one of the HISAT2 authors) to see if his lab can build one for you.

ADD REPLYlink written 6 months ago by genomax55k

I wrote an email to hisat2.genomics@gmail.com some days ago but unfortunately have not received an answer yet. Maybe writing to him directly would help?

ADD REPLYlink written 6 months ago by Hernán110

I would think so. He has his own lab now (linked above).

ADD REPLYlink modified 6 months ago • written 6 months ago by genomax55k
2
6 months ago by
ATpoint7.5k
Germany
ATpoint7.5k wrote:

I will see if I can get a time slot on one of our high-memory nodes in the next few days; then I could build it for you.

ADD COMMENTlink written 6 months ago by ATpoint7.5k

I would be very grateful if you could get a slot in your facility. Please let me know if you run into any issues.

ADD REPLYlink written 6 months ago by Hernán110

I have just submitted the job, and I will come back to you once it is finished (or in case of errors =) ).

ADD REPLYlink modified 6 months ago • written 6 months ago by ATpoint7.5k

Here is the output of your script.

ADD REPLYlink modified 6 months ago • written 6 months ago by ATpoint7.5k

That looks rather small for a genome supposed to need 200G RAM.

ADD REPLYlink written 6 months ago by genomax55k

I know, I expected something similar to the human index sizes. Still, I ran exactly the script that was provided.

ADD REPLYlink written 6 months ago by ATpoint7.5k

You are right, the output is wrong because I did not provide the input chromosome list in the comma-delimited format required by HISAT2. I had done this correctly before but somehow uploaded old files to GitHub.

Now I've re-uploaded the files in (hopefully) the correct format. However, I don't want to abuse your kindness, so let me know if you can still spare some minutes.
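
For reference, hisat2-build takes its reference FASTA files as a single comma-separated argument. Below is a minimal sketch of a helper that turns a text file of FASTA names into that form, stripping any whitespace along the way. The function name normalize_list is made up for illustration and is not part of HISAT2 or of the script in the repository.

```shell
# normalize_list FILE (hypothetical helper): print the FASTA names listed in
# FILE as one comma-separated string with no whitespace.
normalize_list() {
    # Turn commas, carriage returns and newlines into spaces, then let awk
    # split on whitespace and re-join the names with single commas.
    tr ',\r\n' '   ' < "$1" |
        awk '{ for (i = 1; i <= NF; i++) printf "%s%s", (n++ ? "," : ""), $i }
             END { print "" }'
}
```

It could then be used along the lines of: hisat2-build "$(normalize_list GCF_AC.txt)" UMD3.1.1.AC.idx (plus whatever --ss/--exon options the build needs).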

ADD REPLYlink written 6 months ago by Hernán110

I started your script. Let's see who's first... ;-)

(Server was idle anyway)

ADD REPLYlink modified 6 months ago • written 6 months ago by WouterDeCoster32k

Can you check if this is okay?

The tar contains the following files:

97M Mar  6 09:43 indices/AC_000159.1.fa,.1.ht2
38M Mar  6 09:43 indices/AC_000159.1.fa,.2.ht2
35K Mar  6 09:39 indices/AC_000159.1.fa,.3.ht2
38M Mar  6 09:39 indices/AC_000159.1.fa,.4.ht2
80M Mar  6 09:44 indices/AC_000159.1.fa,.5.ht2
39M Mar  6 09:44 indices/AC_000159.1.fa,.6.ht2
351K Mar  6 09:39 indices/AC_000159.1.fa,.7.ht2
72K Mar  6 09:39 indices/AC_000159.1.fa,.8.ht2
ADD REPLYlink written 6 months ago by WouterDeCoster32k

Hi Wouter,

Apparently HISAT2 cannot parse spaces after commas in the input chromosomes list? I removed spaces and re-uploaded the GCF_AC.txt and GCF_ACNW.txt. The output should be eight files named like these:

UMD3.1.1.AC.idx.1.ht2, UMD3.1.1.AC.idx.2.ht2, UMD3.1.1.AC.idx.3.ht2, etc

(Hopefully this will be useful for someone sometime :)
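
A quick way to tell a correct build from a misparsed one is to check the file names. This is only a sketch, assuming the index base name UMD3.1.1.AC.idx and the eight .ht2 files mentioned above; check_index is a hypothetical helper name, not a HISAT2 tool.

```shell
# check_index DIR BASE (hypothetical helper): succeed only if DIR contains
# BASE.1.ht2 through BASE.8.ht2 and none of them is empty.
check_index() {
    ci_dir=$1
    ci_base=$2
    for ci_i in 1 2 3 4 5 6 7 8; do
        ci_file="$ci_dir/$ci_base.$ci_i.ht2"
        # -s is true only for files that exist and have a size above zero.
        if [ ! -s "$ci_file" ]; then
            echo "missing or empty: $ci_file" >&2
            return 1
        fi
    done
    echo "ok: 8 index files for $ci_base in $ci_dir"
}
```

For example, check_index indices UMD3.1.1.AC.idx would fail on an archive whose files are named after a FASTA file, which is the symptom of a chromosome list that was not parsed as intended.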

ADD REPLYlink written 6 months ago by Hernán110

Yes the spaces surprised me as well. But do you think the files are correct? Or should I rerun?

ADD REPLYlink written 6 months ago by WouterDeCoster32k

Yes please. I started a run on a server to see the initial output:

Settings:
  Output files: "UMD3.1.1.AC.idx.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  AC_000158.1.fa AC_000159.1.fa AC_000160.1.fa AC_000161.1.fa
  AC_000162.1.fa AC_000163.1.fa AC_000164.1.fa AC_000165.1.fa
  AC_000166.1.fa AC_000167.1.fa AC_000168.1.fa AC_000169.1.fa
  AC_000170.1.fa AC_000171.1.fa AC_000172.1.fa AC_000173.1.fa
  AC_000174.1.fa AC_000175.1.fa AC_000176.1.fa AC_000177.1.fa
  AC_000178.1.fa AC_000179.1.fa AC_000180.1.fa AC_000181.1.fa
  AC_000182.1.fa AC_000183.1.fa AC_000184.1.fa AC_000185.1.fa
  AC_000186.1.fa AC_000187.1.fa
Reading reference sizes
  Time reading reference sizes: 00:00:21
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:13
  Time to read SNPs and splice sites: 00:00:04

ADD REPLYlink written 6 months ago by Hernán110

Is that an error? I can rebuild it later.

ADD REPLYlink written 6 months ago by WouterDeCoster32k

No, that should be the beginning of the output when the build is successful.

ADD REPLYlink written 6 months ago by Hernán110

Just curious if you ever tried building the indexes locally (and failed) or were put off by the stiff RAM requirement?

ADD REPLYlink written 6 months ago by genomax55k

Yes, I tried to build the index locally, but it failed with an out-of-memory message.
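
To avoid starting a long build on a machine that cannot finish it, one can compare the installed RAM against the roughly 200 GB the manual mentions before launching hisat2-build. The sketch below is Linux-only and assumes the usual MemTotal line in /proc/meminfo (reported in kB). The helper name has_enough_ram is hypothetical, and the 200 figure is the manual's estimate for a human-sized genome, not a measured value for the bovine one.

```shell
# has_enough_ram NEED_GB [MEMINFO] (hypothetical helper): succeed if the
# MemTotal value in MEMINFO (default /proc/meminfo) is at least NEED_GB.
has_enough_ram() {
    awk -v need="$1" '
        # MemTotal is given in kB; divide twice by 1024 to get GB (GiB).
        $1 == "MemTotal:" { gb = $2 / 1024 / 1024 }
        END {
            printf "%.0f GB total, %s GB needed\n", gb, need
            exit !(gb >= need)
        }' "${2:-/proc/meminfo}"
}
```

A wrapper script could then run something like: has_enough_ram 200 && ./make_bgumd31.sh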

ADD REPLYlink written 6 months ago by Hernán110

I have a 500 GB server available (yes, I'm spoiled). Let me know if you need any more help.

ADD REPLYlink written 6 months ago by WouterDeCoster32k
0
6 months ago by
Hernán110
Argentina
Hernán110 wrote:

I've managed to build an index. This is the output of hisat2-inspect -s:

Index version   2.1.0
Flags   1
2.0-compatible  0
SA-Sample       1 in 16
FTab-Chars      10
Sequence-1      AC_000158.1     158337067
Sequence-2      AC_000159.1     137060424
Sequence-3      AC_000160.1     121430405
Sequence-4      AC_000161.1     120829699
Sequence-5      AC_000162.1     121191424
Sequence-6      AC_000163.1     119458736
Sequence-7      AC_000164.1     112638659
Sequence-8      AC_000165.1     113384836
Sequence-9      AC_000166.1     105708250
Sequence-10     AC_000167.1     104305016
Sequence-11     AC_000168.1     107310763
Sequence-12     AC_000169.1     91163125
Sequence-13     AC_000170.1     84240350
Sequence-14     AC_000171.1     84648390
Sequence-15     AC_000172.1     85296676
Sequence-16     AC_000173.1     81724687
Sequence-17     AC_000174.1     75158596
Sequence-18     AC_000175.1     66004023
Sequence-19     AC_000176.1     64057457
Sequence-20     AC_000177.1     72042655
Sequence-21     AC_000178.1     71599096
Sequence-22     AC_000179.1     61435874
Sequence-23     AC_000180.1     52530062
Sequence-24     AC_000181.1     62714930
Sequence-25     AC_000182.1     42904170
Sequence-26     AC_000183.1     51681464
Sequence-27     AC_000184.1     45407902
Sequence-28     AC_000185.1     46312546
Sequence-29     AC_000186.1     51505224
Sequence-30     AC_000187.1     148823899
Num. SNPs: 0
Num. Splice Sites: 206098
Num. Exons: 233435

If anyone needs to download it (3 GB compressed), please let me know.
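
For anyone who wants to sanity-check a downloaded copy, the summary above can be condensed to a sequence count and total length. This is a minimal sketch that assumes the three-column Sequence-N lines shown in the output; summarize_inspect is a hypothetical helper name.

```shell
# summarize_inspect (hypothetical helper): read `hisat2-inspect -s` output on
# stdin and print how many sequences it lists and their combined length.
summarize_inspect() {
    awk '
        # Sequence lines look like: Sequence-1  AC_000158.1  158337067
        $1 ~ /^Sequence-[0-9]+$/ { n++; total += $NF }
        # %.0f avoids integer overflow in awk variants with 32-bit %d.
        END { printf "%d sequences, %.0f bases\n", n, total }'
}
```

Usage would be along the lines of: hisat2-inspect -s UMD3.1.1.AC.idx | summarize_inspect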

ADD COMMENTlink written 6 months ago by Hernán110