Hi,
I am working with Nanopore sequencing data that was base called and demultiplexed with guppy (version unknown, but likely rather recent). Alignment was done using bwa tool.
Now I want to sub-sample an existing bam so that I end up with a new bam file which only contains reads that were sequenced up hour 5 into the sequencing process. I did so by first extracting all read_id
values that full fill this criteria.
awk '$10 <= 1800 {print $5}' 20230503_1351_MC-112161_AIB461_29a045a6.sequencing_summary.txt > f31583_5h_parent_ID_5h.txt
Then I use the resulting file as input for samtools view.
samtools view -b -N f31583_5h_parent_ID_5h.txt REF_aln_trim.bam > sub_sampled.bam
However, after a closer investigation I discovered that not all qname values in REF_aln_trim.bam
have a matching value in 20230503_1351_MC-112161_AIB461_29a045a6.sequencing_summary.txt
. This is counter intuitive for me as I assumed that xxx.sequencing_summary.txt contains information about every read that was produced during sequencing. Or is the read_id
column in xxx.sequencing_summary.txt
not equivalent to the qname
tag in a bam file. I would appreciate any input or suggestions how to do the aimed sub-sampling better.
Cheers!
You could use one of the solutions here to filter reads with specific time stamps: Extract fastq sequences based on date/time (which is in the header)
Thanks so much for your fast reply. Do you know why it would work better sampling from fastq files? Base on the rest of my pipeline it would be much more convenient to sample from the bam files. Is there a bullet proof way to do it?
I see that you are retrieving
read_id
. I am not totally sure how it is related toparent_read_id
in column 4 in sequencing summary file. It is possible thatqnames
in your BAM are actuallyparent_read_id
. I have asked another mod who is ONT expert to comment.