Where to get >200 GB of RAM to build a bovine genome index with HISAT2?
3.6 years ago
Hernán ▴ 200

We are trying to index the UMD 3.1.1 Bovine genome using the HISAT2 software.

The problem is that the hisat2-build script needs more than 200 GB of memory, and we were unable to get enough hardware resources in the Argentine scientific computing network. We assume more than 200 GB because of this note in the HISAT2 manual:

If you use --snp, --ss, and/or --exon, hisat2-build will need about 200GB RAM for the human genome size as index building involves a graph construction.
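
Before submitting a build with these options, it can help to check how much RAM a node actually has. This is only a minimal sketch for Linux: the function name has_enough_ram is made up for illustration, it reads MemTotal from /proc/meminfo, and the 200 GB threshold is the manual's figure for a human-sized genome, not a measured requirement for bovine.

```shell
# Hypothetical pre-flight check (Linux only): succeed if total RAM >= N GB.
# The meminfo path is a parameter so the function can be tried on a sample file.
has_enough_ram() {
    required_gb=$1
    meminfo=${2:-/proc/meminfo}
    [ -r "$meminfo" ] || return 2
    # MemTotal is reported in kB
    total_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
    [ "$(( ${total_kb:-0} / 1024 / 1024 ))" -ge "$required_gb" ]
}

if has_enough_ram 200; then
    echo "enough RAM to try the graph index build"
else
    echo "less than 200 GB RAM here; consider a high-memory node"
fi
```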

Do you know of any facility, preferably free of charge, where we could run the indexer? I have already written a script that automates all the steps.

To download and run the script:

git clone https://github.com/hernanmd/hisat2_bovine.git
cd hisat2_bovine
./make_bgumd31.sh
RNA-Seq next-gen genome indexing HISAT2 • 1.5k views

You could send a request to Daehwan Kim (one of the HISAT2 authors) to see if his lab can build one for you.


I wrote an email to hisat2.genomics@gmail.com some days ago but unfortunately have not received an answer yet. Maybe writing to him directly would help?


I would think so. He has his own lab now (linked above).

3.6 years ago
ATpoint 54k

I will see if I can get a time slot on one of our high-memory nodes in the next few days; then I could build it for you.


I would be very grateful if you could get a slot in your facility. Please let me know if you run into any issues.


I have just submitted the job, and I will get back to you once it is finished (or in case of errors =) ).


Here is the output of your script.


That looks rather small for a genome that is supposed to need 200 GB of RAM.


I know, I expected something similar to the human index sizes. Still, I ran exactly the script that was provided.


You are right, the output is wrong because the input chromosome list I provided was not in the comma-delimited format required by HISAT2. I had fixed this before, but somehow uploaded old files to GitHub.

I have now re-uploaded the files in (hopefully) the correct format. However, I don't want to abuse your kindness, so let me know if you can still spare some time.


I started your script. Let's see who's first... ;-)

(Server was idle anyway)


Can you check if this is okay?

The tar contains the following files:

97M Mar  6 09:43 indices/AC_000159.1.fa,.1.ht2
38M Mar  6 09:43 indices/AC_000159.1.fa,.2.ht2
35K Mar  6 09:39 indices/AC_000159.1.fa,.3.ht2
38M Mar  6 09:39 indices/AC_000159.1.fa,.4.ht2
80M Mar  6 09:44 indices/AC_000159.1.fa,.5.ht2
39M Mar  6 09:44 indices/AC_000159.1.fa,.6.ht2
351K Mar  6 09:39 indices/AC_000159.1.fa,.7.ht2
72K Mar  6 09:39 indices/AC_000159.1.fa,.8.ht2
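
The listing above has eight .ht2 files, but the base name contains a FASTA file name and a comma rather than the intended index name. A quick way to catch this is to test for the eight expected files under the intended base name. The sketch below is only an illustration: check_ht2_index is a made-up helper, and indices/UMD3.1.1.AC.idx is assumed to be the intended base name.

```shell
# Hypothetical check: confirm that all eight HISAT2 index files exist
# and are non-empty for a given base name.
check_ht2_index() {
    base=$1
    missing=0
    for i in 1 2 3 4 5 6 7 8; do
        if [ ! -s "${base}.${i}.ht2" ]; then
            echo "missing or empty: ${base}.${i}.ht2" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Assumed base name; adjust to wherever the index was written.
if check_ht2_index indices/UMD3.1.1.AC.idx; then
    echo "index looks complete"
else
    echo "index incomplete or written under a different base name"
fi
```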

Hi Wouter,

Apparently HISAT2 cannot parse spaces after the commas in the input chromosome list. I removed the spaces and re-uploaded GCF_AC.txt and GCF_ACNW.txt. The output should be eight files named like these:

UMD3.1.1.AC.idx.1.ht2, UMD3.1.1.AC.idx.2.ht2, UMD3.1.1.AC.idx.3.ht2, etc

(Hopefully this will be useful for someone sometime :)
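
For anyone hitting the same problem, the list can be normalized before it is passed to hisat2-build. This is a minimal sketch, not part of the original script: make_ref_list is a made-up helper that accepts names separated by commas, spaces, tabs, or newlines and prints them joined by commas with no spaces.

```shell
# Hypothetical helper: turn a loosely formatted list of FASTA names into the
# comma-separated form (no spaces) that hisat2-build expects.
make_ref_list() {
    # split on commas, spaces, and tabs; drop empty lines; rejoin with commas
    tr ', \t' '\n\n\n' | grep -v '^$' | paste -sd, -
}

# Example with spaces after the commas, as in the first version of the list
printf 'AC_000158.1.fa, AC_000159.1.fa, AC_000160.1.fa\n' | make_ref_list
```

With a list file such as GCF_AC.txt, the result could be captured with something like refs=$(make_ref_list < GCF_AC.txt) and then used as the reference argument.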


Yes, the spaces surprised me as well. But do you think the files are correct, or should I rerun?


Yes please. I started a run on a server to see the initial output:

Settings:
  Output files: "UMD3.1.1.AC.idx.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  AC_000158.1.fa AC_000159.1.fa AC_000160.1.fa AC_000161.1.fa
  AC_000162.1.fa AC_000163.1.fa AC_000164.1.fa AC_000165.1.fa
  AC_000166.1.fa AC_000167.1.fa AC_000168.1.fa AC_000169.1.fa
  AC_000170.1.fa AC_000171.1.fa AC_000172.1.fa AC_000173.1.fa
  AC_000174.1.fa AC_000175.1.fa AC_000176.1.fa AC_000177.1.fa
  AC_000178.1.fa AC_000179.1.fa AC_000180.1.fa AC_000181.1.fa
  AC_000182.1.fa AC_000183.1.fa AC_000184.1.fa AC_000185.1.fa
  AC_000186.1.fa AC_000187.1.fa
Reading reference sizes
  Time reading reference sizes: 00:00:21
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:13
  Time to read SNPs and splice sites: 00:00:04


Is that an error? I can rebuild it later.


No, that should be the beginning of the output when the build is successful.


Just curious: did you ever try building the indexes locally (and fail), or were you put off by the stiff RAM requirement?


Yes, I tried to build the index locally, but it failed with an out-of-memory message.


I have a 500 GB server available (yes, I'm spoiled). Let me know if you need any more help.

3.6 years ago
Hernán ▴ 200

I've managed to build an index; this is the output of hisat2-inspect -s:

Index version   2.1.0
Flags   1
2.0-compatible  0
SA-Sample       1 in 16
FTab-Chars      10
Sequence-1      AC_000158.1     158337067
Sequence-2      AC_000159.1     137060424
Sequence-3      AC_000160.1     121430405
Sequence-4      AC_000161.1     120829699
Sequence-5      AC_000162.1     121191424
Sequence-6      AC_000163.1     119458736
Sequence-7      AC_000164.1     112638659
Sequence-8      AC_000165.1     113384836
Sequence-9      AC_000166.1     105708250
Sequence-10     AC_000167.1     104305016
Sequence-11     AC_000168.1     107310763
Sequence-12     AC_000169.1     91163125
Sequence-13     AC_000170.1     84240350
Sequence-14     AC_000171.1     84648390
Sequence-15     AC_000172.1     85296676
Sequence-16     AC_000173.1     81724687
Sequence-17     AC_000174.1     75158596
Sequence-18     AC_000175.1     66004023
Sequence-19     AC_000176.1     64057457
Sequence-20     AC_000177.1     72042655
Sequence-21     AC_000178.1     71599096
Sequence-22     AC_000179.1     61435874
Sequence-23     AC_000180.1     52530062
Sequence-24     AC_000181.1     62714930
Sequence-25     AC_000182.1     42904170
Sequence-26     AC_000183.1     51681464
Sequence-27     AC_000184.1     45407902
Sequence-28     AC_000185.1     46312546
Sequence-29     AC_000186.1     51505224
Sequence-30     AC_000187.1     148823899
Num. SNPs: 0
Num. Splice Sites: 206098
Num. Exons: 233435
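
As a sanity check, the sequence lengths in a summary like this can be added up and compared with the expected assembly size. The sketch below assumes the summary format shown above (lines starting with "Sequence-N" and the length in the last column); total_ref_length is a made-up helper, and the sample input is just the first two sequences from the listing.

```shell
# Hypothetical helper: sum the lengths on the "Sequence-N" lines of a
# hisat2-inspect -s style summary read from stdin.
total_ref_length() {
    # %.0f avoids 32-bit integer limits in some awk implementations
    awk '/^Sequence-[0-9]+/ { total += $NF } END { printf "%.0f\n", total }'
}

# Sample input: two Sequence lines plus a line that must be ignored
printf 'Sequence-1\tAC_000158.1\t158337067\nSequence-2\tAC_000159.1\t137060424\nNum. SNPs: 0\n' \
    | total_ref_length

# With a real index (base name assumed from this thread):
#   hisat2-inspect -s UMD3.1.1.AC.idx | total_ref_length
```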

If anyone needs to download it (3 GB compressed), please let me know.


Hi, would you still have this available? Thanks.


Hi Hernán, is the indexed genome still available for download?


I had to search for it on some servers. Do you still need it? Please let me know.
