Question

BBMap: Number of reads out does not add up to number of reads in

0

Entering edit mode

8.2 years ago

Sinji ★ 3.2k

I am running a Nextflow pipeline on our cluster (SLURM) using BBTools to trim adapters and map RNA-seq files. I receive the following error every time that my RNA-seq workflow gets to the 'mapping' stage and i'm not entirely sure if this is a BBMap problem, or a problem with our hardware and whether I should switch over to HISAT2 if this isn't an easy fix.

Any ideas? I've tried setting threads=1 in both the BBDuk and BBMap steps, but this doesn't seem to fix the issue. Could it be FASTA related?

ERROR ~ Error executing process > 'mapping (1)'

Caused by:
  Process `mapping (1)` terminated with an error exit status (1)

Command executed:

  bbmap.sh in=NT_Day1_Rep2_1_postTrimmed.fastq.gz in2=NT_Day1_Rep2_2_postTrimmed.fastq.gz outm=NT_Day1_Rep2.mapped.bam outu=NT_Day1_Rep2.unmapped.bam keepnames=t trd sam=1.3 intronlen=20 maxindel=200k ambig=random statsfile=NT_Day1_Rep2.alignmentReport.txt minid=0.76 threads=1
  samtools sort -T NT_Day1_Rep2-@ 1 -o NT_Day1_Rep2.sorted.mapped.bam NT_Day1_Rep2.mapped.bam
  samtools index -@ 1 NT_Day1_Rep2.sorted.mapped.bam

Command exit status:
  1

Command output:
  (empty)

Command error:
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_036686'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_036687'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_027000'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_027000'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003668'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_046343'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001282478'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001282479'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001037668'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001040705'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001040703'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_152250'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001040704'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_152251'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001040702'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_080389'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_058203'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_018661'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001081551'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001037804'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NM_001195257'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_024044'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_049875'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_049876'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_106988'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003594'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003594'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003594'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003594'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003594'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003594'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003594'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003594'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_003594'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_106988'
  [W::sam_hdr_parse] duplicated sequence 'hg38_refGene_NR_031636'

  Genome:                   1
  Key Length:               13
  Max Indel:                200000
  Minimum Score Ratio:      0.55888
  Mapping Mode:             normal
  Exception in thread "main" java.lang.AssertionError: 
  The number of reads out does not add up to the number of reads in.
  This may indicate that a mapping thread crashed.
  If you submit a bug report, include the entire console output, not just this error message.
  5+10+0+14+0 = 29 != 30
    at align2.AbstractMapper.printOutput(AbstractMapper.java:1874)
    at align2.BBMap.testSpeed(BBMap.java:490)
    at align2.BBMap.main(BBMap.java:34)

Work dir:
  /project/SCCC/DOrso_lab/s418665/work/e3/912b57b2e499791478a55a0b864356

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Workflow completed on: Mon May 15 10:05:42 CDT 2017
Execution status: Failed
Workflow Duration: 13m 20s

bbmap • 3.9k views

ADD COMMENT • link 8.2 years ago by Sinji ★ 3.2k

0

Entering edit mode

I notice that you are not assigning memory (-Xmx30g) with your bbmap.sh mapping command. I am not sure how much memory your cluster allows by default but for human genome alignment you are going to need at least 30G (add -Xmx30g). You would need to add corresponding --mem option on SLURM side to match.

Do you have duplicate sequence headers in your genome fasta?

ADD REPLY • link 8.2 years ago by GenoMax 152k

0

Entering edit mode

The problem appears to be due to a corrupted FASTA file. Using a new reference genome file seemed to fix the problem.

ADD REPLY • link 8.2 years ago by Sinji ★ 3.2k

score 0 · Answer 1 · 2017-05-15

Hi Sinji,

I've never used Nextflow so I'm not sure if the problem is related to that or not. But given all of the error messages coming from samtools, it appears that your fasta file is invalid and contains multiple sequences with identical headers. Every sequence needs to have a unique name for a mapping program (or samtools) to function correctly. If the reference file was named "ref.fasta" you should be able to run something like:

grep hg38_refGene_NR_036686 ref.fasta

and see multiple records. If so, you need to fix the fasta file before indexing. Dedupe does have a flag for this purpose:

dedupe.sh in=ref.fasta out=deduped.fasta uniquenames

...but it's important to figure out why this happened in the first place, as the file may be corrupt.