Question

bbmap read length from fastq

0

Entering edit mode

7 weeks ago

marco.barr ▴ 80

Hello everyone, I'm aligning a Nanopore sequence fastq file using bbmap (I'm trying to use bbmap instead of minimap2 because with the latter I have losses of aligned sequences). I'm encountering this error for many reads and I can't figure out how to solve it. Probably my sequences are too long, but even using the mapPacBio.sh script, I get a similar error. What can I do? Below are my commands and an example of the error to keep it concise. Thanks to all

 ./bbmap.sh in=/home/input.fastq out=/home/input_1.sam ref=/home/reference_pi4ka/PI4KA.fasta maxindel=5800    

  Exception in thread "Thread-4" java.lang.AssertionError: Read 0, length 1597, 
  exceeds the limit of 600 You can map the reads in chunks by reformatting to fasta,
  then mapping with the setting 'fastareadlen=600

fastq bbmap • 735 views

ADD COMMENT • link updated 6 weeks ago by GenoMax 141k • written 7 weeks ago by marco.barr ▴ 80

0

Entering edit mode

bbmap.sh is designed for Illumina (or similar-length) reads; as GenoMax stated, mapPacBio.sh is the better option here. However, in general, you should use a long-read aligner like minimap2 for very long reads over ~6kbp if you want the long-range data.

The error you are mentioning should never occur in mapPacBio so it would help if you could give the command line and error message for that. Also, I don't understand what the problem is with minimap2; perhaps you could elaborate? Are not enough reads getting assigned to anything?

ADD REPLY • link 7 weeks ago by Brian Bushnell 20k

0

Entering edit mode

Thanks for the explanation Brian. In my initial experiment, I aligned my cDNA sequence from ONT MinION using minimap2. Then I built the bam file. I noticed that the alignment focused only on a specific region of the reference with a 10X coverage and even lower, that is, the IGV display cut off some sequences that were present anyway, as if it lost some sequences in alignment. In fact, with samtools flagstats I only noticed about 70% alignment. So I thought about extracting the reads with maximum length to try to recover something in the visualization but in reality this new fastq generated 0% alignment with minimap2. Surely there is something in the input sequences but I don't understand what. I hope it was clear.

ADD REPLY • link 6 weeks ago by marco.barr ▴ 80

0

Entering edit mode

It is possible that minimap2 is soft-clipping parts of read that are not aligning. There is an option in IGV to show the clipped parts. You could turn that on to confirm. This is going to go back to the quality of your data. It is not likely that minimap2 is making an error.

ADD REPLY • link 6 weeks ago by GenoMax 141k

0

Entering edit mode

Thanks GenoMax, I 've select the "Alignments" tab of preferences and check "Show soft-clipped bases" from IGV and this is my result enter image description here

I tried to extract the maximum length read from the fastq file and noticed the presence of a bacterial contaminant in this sequence. Can this have an impact? How can I remove the contaminant in order to realign the clean sequence?

ADD REPLY • link 6 weeks ago by marco.barr ▴ 80

0

Entering edit mode

Based on the header example you posted in your other thread it looks like these are barcoded samples. Do you know if the barcodes were trimmed from the data?

What we are seeing above could be remnants of adapter/barcodes. You could use something like Remove Soft Clipped Bases to remove those softclipped bases.

What is your aim in this experiment?

ADD REPLY • link 6 weeks ago by GenoMax 141k

1

Entering edit mode

Roughly speaking, the aim was to identify the presence of a deletion in a transcript related to the PI4KA gene, where we know the deletion is present in a patient, to support DNA analysis. In my alignment, I would have expected the reads to align at the beginning of the sequence, then encounter a gap for the deletion, and finally continue in the visualization with IGV. I'm not sure if I've been clear. I've used both Biostar from the link you sent me and Chopper to try to clean up any remaining residues, including the contaminant, but I get the same result as before, this time without clipped bases, but the alignment remains localized in the central part of the reference approximately.

enter image description here

Perhaps I could try extracting the unmapped reads and see on IGV? Or reads fail o unclassified?

ADD REPLY • link 6 weeks ago by marco.barr ▴ 80

0

Entering edit mode

In my alignment, I would have expected the reads to align at the beginning of the sequence,

Based on the data that is not happening. So perhaps you need to investigate why that is so. Is this direct RNA sequencing or how is the amplicon being produced?

ADD REPLY • link 6 weeks ago by GenoMax 141k

0

Entering edit mode

I went back through the analysis and saw that the amplification bands weren't exactly perfect but I could see something. Maybe a primer went wrong? I don't know, we're evaluating. Sequencing was performed on the amplification product

ADD REPLY • link 6 weeks ago by marco.barr ▴ 80

score 1 · Answer 1 · 2024-03-08

1

Entering edit mode

7 weeks ago

GenoMax 141k

If you are having issues with minimap2 then bbmap is not likely going to cure the alignment problems. Problem may lie with your data. That said, you could try following.

$ mapPacBio.sh -Xmx20g k=7 in=reads.fastq ref=reference.fa maxlen=1000 minlen=200 idtag ow int=f qin=33 out=mapped1.sam minratio=0.15 ignorequality slow ordered

Here you are shredding the reads to 1000 bp (maxlen=1000). While BBMap can take 6 kb reads try this first and see if you are able to align.

ADD COMMENT • link 7 weeks ago by GenoMax 141k

0

Entering edit mode

I had noticed the presence of nonspecific bacterial sequences in my reads. I managed to clean them up by trimming the foreign sequences. I have already followed the documentation you provided, but I am still encountering the same error.

ADD REPLY • link 7 weeks ago by marco.barr ▴ 80

1

Entering edit mode

I just tested mapPacBio.sh with some nanopore data (with a 2kb chunk limit) and was able to get a SAM file without any issues. Do not use bbmap.sh use mapPacBio.sh instead as shown above.