DiscoSnp Segmentation fault
0
0
Entering edit mode
7.0 years ago
Hans ▴ 130

Hello

I am trying for the first time to use discoSnp for snp discovery and  calling. I have de-multiplexed data of 100 samples from a GBS project. In this project, to reduce complexity, DNA was cut with  a restriction enzyme and only reads starting with the right cut site are included.   The read files in fq.gz format are listed in the gzlist file with one file per line.  The pipeline is run on a linux (Ubuntu 14.04) severer with 16 cores and 128 GB of RAM. This the the stdout I get:

hanan@icci:~/stacks/samp$~/disco/run_discoSnp++.sh -r gzlist -T -u 12 use read set: gzlist use at most 12 cores Binaries in /storage/hanan/disco/build/ Running discoSnp++ 2.2.1, in directory /storage/hanan/stacks/samp with following parameters: read_sets=gzlist prefix=discoRes_k_31_c_auto c=auto C=2147483647 k=31 b=0 d=1 D=100 starting date=Wed Aug 26 12:51:36 IDT 2015 ############################################################ #################### GRAPH CREATION ####################### ############################################################ /storage/hanan/disco/build//ext/gatb-core/bin/dbgh5 -in gzlist_removemeplease -out discoRes_k_31_c_auto -kmer-size 31 -abundance-min auto -abundance-max 2147483647 -solidity-kind one -nb-cores 12 /storage/hanan/disco/run_discoSnp++.sh: line 326: 26144 Segmentation fault$DISCO_BUILD_PATH/ext/gatb-core/bin/dbgh5 -in ${read_sets}_removemeplease -out$h5prefix -kmer-size $k -abundance-min$c_dbgh5 -abundance-max $C -solidity-kind one$option_cores_gatb
there was a problem with graph construction

Thank you

Hanan

snp discosnp • 2.8k views
2
Entering edit mode

Hi Hanan,

As the program crashes early (before any message is displayed), one may suspect an input file format problem. Would you mind double-checking the gzlist file of files (and/or send it to me) and verifying that each pointed fastq/fasta file really exists?

2
Entering edit mode

Hello Pierre

Thank you for the response. The problem was that some of the .fq.gz files were empty, so the program crashed. Deleting these files from the  file of flies list  solved the problem.    Please send me the new version if you think it is safe to use. If not, I will wait.

Thank you

Hanan

0
Entering edit mode

Just tried the same command  on a subset of 20 samples and so far so good.

0
Entering edit mode

Hi

When I used the branching option -b 0 with 20 samples, I got only ~4400 SNP. When I used -b 1 ,  The memory usage went up to maximum which is 128GB  in the  KISSREADS MODULE and after a while the program was killed.   I should mention that the row data comes from a bit more than one lane of Hiseq200.

0
Entering edit mode

Hi Hanan,

I start to answer the easiest question: this memory problem in kissreads is fixed in an incoming new version that will be released this week or the next one. I may send this new version to you if you desire, but it is not 100% tested yet.

0
Entering edit mode

hello Pierre, is the new version available yet? thank you, Hanan

0
Entering edit mode

Hi.

It's ready. I'm waiting a few test results. Depending on their results, the new version will be available this week.

Pierre

0
Entering edit mode

Hi,

Thanks for this feed back, we should improve the output messages in this case.

Pierre

0
Entering edit mode

Hi All,

DiscoSnp++ 2.2.1 was released.

You may find it from here.

Pierre

0
Entering edit mode

Hi I am sorry but running the new version did not solve the memory problem. It have reached the maximum of my RAM and then crashed after an hour or so. I am using a 16 core Ubuntu 14.04 server with 128GB RAM the raw data is from 1.5 lanes of hiseq2000. I do no know if this should be a problem, but this is a GBS data where the genome is cut by restriction enzyme and all the reads are restricted to the cut sites and do not cover the whole genome. I have used this command:

run_discoSnp++.sh -r gzlist -u 14 -b 1


And this is the final output I got

/storage/hanan/disco/build//tools/kissreads2/kissreads2 -predictions discoRes_k_31_c_auto_D_100_P_1_b_1.fa -reads  gzlist -co discoRes_k_31_c_auto_D_100_P_1_b_1\_coherent -unco discoRes_k_31_c_auto_D_100_P_1_b_1\_uncoherent -k 31 -size_seeds 26 -index_stride 4 -hamming 1  -genotype -nb-cores 14
[Mapping read set 0                      ]  120  %   elapsed:  90 min 27 sec   remaining:   0 min 0  sec   cpu: 1082.6 %   mem: [106835, 106835, 106843] MB
[Mapping read set 1                      ]  110  %   elapsed: 147 min 44 sec   remaining:   0 min 0  sec   cpu: 894.1 %   mem: [126197, 126901, 126950] MB /storage/hanan/disco/run_discoSnp++.sh: line 374: 16003 Killed                  $DISCO_BUILD_PATH/tools/kissreads2/kissreads2 -predictions$kissprefix.fa -reads $read_sets -co$kissprefix\_coherent -unco $kissprefix\_uncoherent -k$k -size_seeds $smallk -index_stride$i -hamming $d$genotyping option_cores_gatb there was a problem with kissreads2:  Thank you Hanan ADD REPLY 0 Entering edit mode Hi Hanan Could you indicate the number of predictions in the discoRes_k_31_c_auto_D_100_P_1_b_1.fa file? Pierre ADD REPLY 0 Entering edit mode Hi Pierre discoRes_k_31_c_auto_D_100_P_1_b_1.fa have 144696 sequences. Thank you Hanan ADD REPLY 0 Entering edit mode This memory consumption is crazy. Are you 100% sure, the kissreads version comes from the 2.2.1 release? (your bug looks like a bug fixed a few weeks ago in kissreads). If this is the case, could you try to run the same command limiting the number of cores to 2 (-nb-cores 2)? ADD REPLY 0 Entering edit mode Hello Pierre Now the software version 2.2.1 runs without memory problems but I get this output at the end: /storage/hanan/disco/run_VCF_creator.sh -p discoRes_k_31_c_auto_D_100_P_1_b_1\_coherent.fa -o discoRes_k_31_c_auto_D_100_P_1_b_1\_coherent.vcf ##use disco SNPS : discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.fa ##output : discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.vcf ...Ghost mode... ...Creation of a vcf without alignment... Traceback (most recent call last): File "/storage/hanan/disco/tools/VCF_creator.py", line 155, in <module> if variant_object.CheckCoupleVariantID()==1: File "/storage/hanan/disco/tools/ClassVCF_creator.py", line 335, in CheckCoupleVariantID bitwiseFlag=int(self.upper_path.listSam[1]) IndexError: list index out of range ... Creation of the vcf file : done ...==> discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.vcf Vcf creation time in seconds: 0 ############################################################### #################### DISCOSNP++ FINISHED ###################### ############################################################### DiscoSnp++ total time in seconds: 6152 ################################################################################################################ fasta of predicted variant is "discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.fa" Ghost VCF file (1-based) is "discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.vcf" Thanks for using discoSnp++ - http://colibread.inria.fr/discoSnp/ - Forum: http://www.biostars.org/t/discoSnp/  The size of the output files is like this: -rw-r--r-- 1 hanan hanan 363937072 Nov 17 10:28 discoRes_k_31_c_auto.h5 -rw-r--r-- 1 hanan hanan 36264 Nov 17 10:31 discoRes_k_31_c_auto_cov.h5 -rw-r--r-- 1 hanan hanan 84984535 Nov 17 11:40 discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.fa -rw-r--r-- 1 hanan hanan 1298719 Nov 17 11:40 discoRes_k_31_c_auto_D_100_P_1_b_1_uncoherent.fa -rw-r--r-- 1 hanan hanan 1413 Nov 17 11:40 discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.vcf  and the vcf file contains only the headers: ##fileformat=VCFv4.1 ##filedate=20151117 ##source=VCF_creator ##SAMPLE=file://discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.fa ##REF=<ID=REF,Number=1,Type=String,Description="Allele of the path Disco aligned with the least mismatches"> ##FILTER=<ID=MULTIPLE,Description="Mapping type : PASS or MULTIPLE or ."> ##INFO=<ID=Ty,Number=1,Type=String,Description="SNP, INS, DEL or ."> ##INFO=<ID=Rk,Number=1,Type=Float,Description="SNP rank"> ##INFO=<ID=UL,Number=1,Type=Integer,Description="length of the unitig left"> ##INFO=<ID=UR,Number=1,Type=Integer,Description="length of the unitig right"> ##INFO=<ID=CL,Number=1,Type=Integer,Description="length of the contig left"> ##INFO=<ID=CR,Number=1,Type=Integer,Description="length of the contig right"> ##INFO=<ID=Genome,Number=1,Type=String,Description="Allele of the reference;for indel reference is "." "> ##INFO=<ID=Sd,Number=1,Type=Integer,Description="Reverse (-1) or Forward (1) Alignement"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Cumulated depth accross samples (sum)"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled Genotype Likelihoods"> ##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Depth of each allele by sample"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 G14 G15 G16 G17 G18 G19 G20 G21 G22 G23 G24 G25 G26 G27 G28  Thank you for your help and hope for more peaceful times Hanan ADD REPLY 0 Entering edit mode Hi, Sorry for this bug and thank you for your message and its peaceful signature, we need this... The bug is fixed in a new release (2.2.2). This release should be quickly made publicly available. We still need to fix an issue when compiling with mac. If you can't wait for a few days just tell me I'll send you the new version personally so you can use the new VCF_creator.sh Best regards, Pierre ADD REPLY 0 Entering edit mode Hi Again, The new release 2.2.3 which correct this issue is online; http://colibread.inria.fr/software/discosnp/ Best regards, Pierre ADD REPLY 0 Entering edit mode Hello Pierre Running 2.2.3 I got up to here:  ############################################################# #################### KISSREADS MODULE ####################### ############################################################# /storage/hanan/disco/run_discoSnp++.sh: line 398: ((: >31 : syntax error: operand expected (error token is ">31 ") /storage/hanan/disco/build//tools/kissreads2/kissreads2 -predictions discoRes_k_31_c_auto_D_100_P_1_b_1.fa -reads gzlist -co discoRes_k_31_c_auto_D_100_P_1_b_1_coherent -unco discoRes_k_31_c_auto_D_100_P_1_b_1_uncoherent -k 31 -size_seeds -5 -index_stride 6 -hamming 1 -genotype -coverage_file discoRes_k_31_c_auto_cov.h5 -nb-cores 14 Indexing bank discoRes_k_31_c_auto_D_100_P_1_b_1.fa Mapping of 657398279 reads [Mapping read set 0 ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [ 210, 210, 210] MB  Thank you Hanan ADD REPLY 0 Entering edit mode Arg... Thanks. Indeed there is a problem with the size of the seed (5). I'll check this as soon as possible. Pierre ADD REPLY 0 Entering edit mode Hi, Indeed, there is a small bug in the run_disco script. You may either: • Add line 397 of run_discoSnp++.sh smallk=k

• Use version 2.2.4 which fixes this bug.

If you re-run discoSnp, don't hesitate to use the -g option, in order to avoid to recompute the graph.

Pierre

1
Entering edit mode

Using version 2.2.4 with u=2 b=1 , everything is OK . I think it will work also with u=14. Thank you, Hanan