Question

Problem with ONT sequencing pipeline. Probably in the Samtools part

0

Entering edit mode

7 months ago

Emilio • 0

Hi everyone. I have two sets of .fastq files from two ONT sequencing runs and one of them isn't working after generating .bam files.

EDIT: .sam files are generated with minimap2 and then Samtools is used to create the .bam files

The .bam files are generated and recognised by samtools, and they have reasonable sizes given the number of sequences they have, but they seem to not be mapped to the custom amplicon sequence I'm using as a reference, even though samtools tells me both sets of files are mapped to the same reference.

I ran samtools view -H on files from either experiment and both have the same reference (799yT32), plus they both seem to not have any other differences in versions nor upstream scripts.

Despite that, when I run samtools view -c on them to see the number of reads, it seems to stop recognising the reference sequence, though it still shows the file as having 100k reads in it.

I'm guessing the problem stems from not recognising the custom reference sequence name, but I don't know how can it work with one set of reads and not the other.

I also tried using IGV, which shows no reads and a "Junctions" row on the .bam file from experiment 2 which hadn't appeared previously in other .bam files I opened before.

The .fastq files come from two different experiments with different chemistries (experiment 1 was done with r9.4.1 while experiment 2 was done with r10), but look functionally the same. I don't think that can cause any problems but I point it out in case it might be relevant.

Do you know anything about this problem? Is there any way to solve this?

Thanks in advance.

bam samtools ONT • 875 views

ADD COMMENT • link updated 7 months ago by GenoMax 145k • written 7 months ago by Emilio • 0

0

Entering edit mode

I have two sets of .fastq files from two ONT sequencing runs and one of them isn't working after generating .bam files.

And how did you generate the BAM files from both data? Using minimap2 or some other aligner + samtools?

Sounds like you may have corrupted file.

ADD REPLY • link 7 months ago by GenoMax 145k

0

Entering edit mode

I used minimap2 and samtools, yes. I'll add it to the post.

I checked the files and they can be read by samtools, even saying how many number of reads they have when running samtools view -c ./file.bam. Specifying the region makes it not work for one of the two datasets, though. For example:

samtools view -c ./file_from_exp1.bam 799yT32
2944
samtools view -c ./file_from_exp2.bam
114145
samtools view -c ./file_from_exp2.bam 799yT32
0

Sometimes it returns region "779yT32" specifies an invalid region or unknown reference. Continue anyway. before 0 when specifying the region (779yT32). That one may be due to a corrupted file or memory overflow on one of the bigger files, but I reran the pipeline and the error kept happening for the same one of the two datasets, so I don't think it's a corrupted file.

ADD REPLY • link 7 months ago by Emilio • 0

0

Entering edit mode

Specifying the region makes it not work for one of the two datasets

When you say region I guess you are referring to the entire chromosome because that is what seems to be in your example.

but I reran the pipeline and the error kept happening for the same one of the two datasets, so I don't think it's a corrupted file.

Is there a possibility that the input fastq file may be corrupted at a specific location?

ADD REPLY • link 7 months ago by GenoMax 145k

0

Entering edit mode

When you say region I guess you are referring to the entire chromosome because that is what seems to be in your example.

Yes. I used 799yT32, which is the only "chromosome" that appears when running samtools view -H on all bam files which used that as reference (I also used another reference file, which leads to the same problem).

Is there a possibility that the input fastq file may be corrupted at a specific location?

There are 24 fastq files per experiment, the sam and bam files are created correctly with the expected number of reads for two different reference sequences, and all of the 48 resulting sets of sam/bam files have the same problem of not finding reads when looking for the 779yT32 reference sequence. I don't think that can be due to fastq file corruption, but I don't know enough about Samtools to be sure if it's impossible for something in the fastq files to be causing this issue.

ADD REPLY • link 7 months ago by Emilio • 0

0

Entering edit mode

all of the 48 resulting sets of sam/bam files have the same problem of not finding reads when looking for the 779yT32 reference sequence

Wait so you are just saying that no reads are aligned to 779yT32 reference? And you expect that should not be the case?

That said minimap2 should happily accept <(cat *.fq.gz) as a single input for all 24 files. You can also minimap2 -a <(cat *.fa) <(cat *.fq.gz) | samtools sort --write-index -o one.bam - to get a single sorted/indexed BAM in one shot.

ADD REPLY • link 7 months ago by GenoMax 145k

0

Entering edit mode

Wait so you are just saying that no reads are aligned to 779yT32 reference? And you expect that should not be the case?

When running samtools view -H I get an output that suggests that all of my .bam files are properly aligned to their references:

$ samtools view -H ./file_from_exp1.bam
@HD     VN:1.6  SO:coordinate
@SQ     SN:799yT32      LN:8546
@PG     ID:minimap2     PN:minimap2     VN:2.26-r1175   CL:minimap2 -ax map-ont -t 12 /[path]/ref_seq/779yT32.fa /[path]/barcode03_concat.fastq
@PG     ID:samtools     PN:samtools     PP:minimap2     VN:1.10 CL:samtools view -H ./file_from_exp1.bam
$ samtools view -H ./file_from_exp2.bam
@HD     VN:1.6  SO:coordinate
@SQ     SN:799yT32      LN:8546
@PG     ID:minimap2     PN:minimap2     VN:2.26-r1175   CL:minimap2 -ax map-ont -t 12 /[path]/ref_seq/779yT32.fa /[path]/barcode06_concat.fastq
@PG     ID:samtools     PN:samtools     PP:minimap2     VN:1.10 CL:samtools view -H ./file_from_exp2.bam

But, when trying to see those alignments with IGV or count them with samtools view -c (as shown in one of my previous replies) it doesn't seem to find any read that's aligned to that reference sequence.

That said minimap2 should happily accept <(cat *.fq.gz) as a single input for all 24 files.

But I don't need to generate a single file. My sequences are separated by patient so I need to keep them on different files.

ADD REPLY • link 7 months ago by Emilio • 0

1

Entering edit mode

My sequences are separated by patient so I need to keep them on different files.

Ok. Just wanted to make sure you were not using independent files since ONT can make more than one for each sample.

Just seeing headers is not telling us much. Are you able to see output from samtools idxstats file_from_exp1.bam? You should see number of reads aligned to (each) reference when you do this.

ADD REPLY • link 7 months ago by GenoMax 145k

0

Entering edit mode

Thanks. Trying with idxstats with both files I got the following output:

$ samtools idxstats file_from_exp2.bam
799yT32 8546    0       0
*       0       0       114145
$ samtools idxstats file_from_exp1.bam
799yT32 8546    2944    0
*       0       0       5364

As a reminder: exp2 was the one with problems. Is that asterisk marking unaligned reads? Because if that's the case, it sounds like alignment is a problem and issues in fastq files could be the reason.

ADD REPLY • link 7 months ago by Emilio • 0

0

Entering edit mode

Yes the * is unaligned reads. So the issue seems to be with the alignment.

ADD REPLY • link 7 months ago by GenoMax 145k