Handling large genome size in BLAST+

I am trying to create a local BLAST database for a crop with a genome size of 13 Gb. The genome sequence is unpublished; it is currently being assembled in my group. I have the FASTA file for the genome sequence, and the crop has six chromosomes of about 2 Gb each. The current version of ncbi-blast+ is installed on the Linux server I am working on, and I have successfully created the database using the "makeblastdb" command.
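For reference, the database was built with a command along these lines (file and database names here are placeholders, not the actual ones):

    # build a nucleotide BLAST database from the assembly FASTA
    makeblastdb -in genome.fa -dbtype nucl -out genome_db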

There are thousands of sequences that I would like to align to the local database, and I have formatted the input file accordingly. However, when I run the "blastn" command for the alignment, the process gets "KILLED" on the server. I am sure the problem is not related to the script itself, however.

I tried running this on my Windows laptop using Git Bash; it ran for over 45 minutes, producing a little output, but then crashed the system after using up all the memory. A colleague suggested that I run the task as an "sbatch" script on the server, which I did, requesting plenty of memory (up to 500 GB) for the run. Unfortunately, after some time it returned an error file saying "Segmentation fault (core dumped) due to out-of-memory event....". Colleagues of mine have worked with genomes of a similar size without running into this problem, though not with chromosomes as large as these. Could this be related to the chromosome size? Please, can anyone suggest how I can handle this?
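For context, the sbatch script was shaped roughly like this (a sketch; the job parameters and file names are placeholders, not the exact script used):

    #!/bin/bash
    #SBATCH --job-name=blastn_large_genome
    #SBATCH --mem=500G               # memory requested for the run
    #SBATCH --cpus-per-task=8

    # align the query sequences against the local database
    blastn -db genome_db -query queries.fa -out hits.tsv \
        -outfmt 6 -num_threads 8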

database large genome blast local

Sorry this isn't a solution, but I am guessing it is a configuration issue, because 500 GB of memory for a 13 GB database sounds more than sufficient.

I would focus on the "Segmentation fault" error. Two things come to mind:

  • Are the resource limits configured on your system insufficient for BLAST? i.e., try ulimit -f -n in Git Bash and consider setting the file size to unlimited; see the sketch after this list. Open files can also be an issue, but I'm guessing it is set to a number higher than the number of DB files for your 13 GB database. There's a bit more info on this in the "Memory Usage" section of the BLAST help.
  • Are you writing results to a location you have write access to?
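For example, checking and raising the limits for the current shell could look like this (a sketch; whether these limits are your actual bottleneck is an assumption):

    # show the current file-size (-f) and open-files (-n) limits
    ulimit -f -n
    # set the maximum file size to unlimited for this session
    ulimit -f unlimited
    # raise the open-files limit for this session (the value is just an example)
    ulimit -n 4096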

Good luck!

Thank you for your suggestion. I thought the 500 GB of memory was sufficient too, but I now think the problem might be associated with the reference genome. I tried different things to figure out why it was not working.

First, I took a single chromosome out of the reference genome file, created a database with it alone, and aligned the sequences to this database. I ran into the same problem as before, so I suspected the problem was probably the chromosome size. I then wrote a script (sketched below) to split each chromosome into two sequences (a and b) and concatenate them into a single FASTA file for creating a database. In this case, the reference file contains the chromosomes as chr1a, chr1b, chr2a, chr2b, chr3a, chr3b, chr4a, chr4b, chr5a, chr5b, chr6a, chr6b. With this new reference file, I created a database and aligned the query sequences. This time there was no error, so I got an output. However, when I checked the output file, I discovered that some of the query sequences were trimmed (shorter than in the query file) and, in some cases, had 100% identity hits when the trimmed region included the SNP.
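The splitting step was along these lines (a minimal sketch, assuming samtools is available; file names are placeholders and the actual script differed):

    # index the assembly, then write each chromosome out as two halves,
    # renaming the headers to chrNa / chrNb
    samtools faidx genome.fa
    while read -r name len _; do
        half=$(( len / 2 ))
        samtools faidx genome.fa "${name}:1-${half}" \
            | sed "1s/.*/>${name}a/" >> genome_split.fa
        samtools faidx genome.fa "${name}:$(( half + 1 ))-${len}" \
            | sed "1s/.*/>${name}b/" >> genome_split.fa
    done < genome.fa.fai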

I was not sure why this was happening, so I went back to working with the original reference genome file. I have about 60K sequences in the query file. After trying several query files (reducing the number of query sequences per file), <50 query sequences seems to work. Unfortunately, in that output the query sequences were trimmed too, some down to 28 nucleotides instead of 71.

Any idea why this happens or how to fix it?

PS: the query sequences are from the same species.

I'm wondering if you have solved the problem by reducing the reference file size, but, by the nature of the query/reference, you are not getting back that many results. For the reduced split reference files, i.e. chr1a, chr1b...:

...<50 query sequences seems to work

  • Do the other sequences that don't work return any results at all? I'm wondering if your setup is fine but BLAST just isn't finding any matches - could you clarify what the other sequences return?

...in some cases, had 100% identity hits when the trimmed region included the SNP

  • This is very interesting - the trimmed query and the reference are NOT identical, but BLAST is reporting them as 100% identical? Is it possible the trimmed query is aligning exactly to another part of the reference (rather than the one the query came from)? One way to check is sketched below.
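Since BLAST alignments are local, 100% identity can describe only the aligned part of the query, so it helps to report query length and coverage alongside identity (a sketch; database and file names are placeholders):

    # add query length and coverage columns to see how much of each
    # query is actually covered by the reported alignment
    blastn -db genome_db -query queries.fa \
        -outfmt "6 qseqid sseqid pident length qlen qstart qend qcovs" \
        -out hits.tsv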

How big are the files for each chromosome, by the way? For reference, the preformatted BLAST DB files for ref_euk_rep_genomes are ~2.5-3 GB each. Are the chr1 files greater than this and the chr1a/chr1b files less than this?

And one final thing to check on the reference: although it's been reported to be a bit inconsistent, there is a tool in the downloaded blast+ package that "checks" the database being used. Maybe try it to see if it reports anything of concern about the reference?

./ncbi-blast-2.12.0+/bin/blastdbcheck -db <your_database_name>

good luck

Reducing the reference size solved the problem of the run hitting a memory limit (the oom-kill event). I think this was because the chromosome sequences were too large and the alignment required too much memory. Using fewer query sequences (<50) also avoided this error with the original reference file. The challenge I am facing now is the poor alignment.

Do the other sequences that don't work return any results at all? I'm wondering if your setup is fine but BLAST just isn't finding any matches - could you clarify what the other sequences return?

All the query sequences work, but I have to run them in batches when working with the original reference; otherwise it hits an oom-kill error. With the reduced reference file, however, I can run all 60K query sequences at once.
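The batch runs are along these lines (a sketch; the chunk size, file names, and output format are placeholders):

    # split the query FASTA into chunks of 50 records, then run blastn per chunk
    awk -v size=50 '/^>/ { if (n % size == 0) { if (file) close(file); file = sprintf("chunk_%04d.fa", n / size) } n++ } { print > file }' queries.fa
    for chunk in chunk_*.fa; do
        blastn -db genome_db -query "$chunk" \
            -out "${chunk%.fa}.tsv" -outfmt 6
    done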

100% identity hits w/ the SNPs could happen if the query coverage is not 100% and the regions excluded from the query are the regions with the SNPs - is this the case?

Yes, the regions excluded contain the SNP and the coverage is not 100%.

How big are the files for each chromosome, by the way? For reference, the preformatted BLAST DB files for ref_euk_rep_genomes are ~2.5-3 GB each. Are the chr1 files greater than this and the chr1a/chr1b files less than this?

Each chromosome is ~1.9 GB.

And one final thing to check on the reference: although it's been reported to be a bit inconsistent, there is a tool in the downloaded blast+ package that "checks" the database being used. Maybe try it to see if it reports anything of concern about the reference?

I tried this and got the report below:

    Writing messages to <stdout> at verbosity (Summary)
    ISAM testing is ENABLED.
    Legacy testing is DISABLED.
    TaxID testing is DISABLED.
    By default, testing 200 randomly sampled OIDs.
    Testing 4 volume(s).
    /../ MetaData: [ERROR] caught exception.
    /../ MetaData: [ERROR] caught exception.
    /../ MetaData: [ERROR] caught exception.
    /../ MetaData: [ERROR] caught exception.
    Result=FAILURE. 4 errors reported in 4 volume(s).
    Testing 1 alias(es).
    Result=SUCCESS. No errors reported for 1 alias(es).
    Total errors: 4

Is there an issue with the database?

I'm not sure if there's an issue with the database or if the tool just hit an unexpected exception. The BLAST support team might appreciate a bug report, though. Maybe reach out to them with the issue and attach the relevant error output files. They provide an email here.

Okay, thank you for the useful tips.

no problem, good luck!
