How to calculate percent coverage of each read for nanopore
1
0
Entering edit mode
14 days ago

Hello, I am assigned to investigate the read the cover the reference more than 80%. I know that samtools can be used to calculate coverage and depth for the interested regions but what about each read? For example, I have read 1-10 that cover region A, I want to know their percent coverage for all of each read. Expect output: read 1 70% cover, leftmost position 1 (like sam format column) Is there any tools or script that can be used for this? Thanks in advance.

alignment nanopore sequencing • 382 views
1
Entering edit mode

It seems like several concepts may be confused. "Reads" don't have a coverage, they're just one sequenced molecule. Instead, mapped reads can be characterized by their "length" and "mismatch rate". So are you trying to get the coverage of the sequence, get reads where >80% match the reference, get reads that cover 80% of the reference (read length), or something else entirely?

0
Entering edit mode

It seems like several concepts may be confused. "Reads" don't have a coverage, they're just one sequenced molecule. Instead, mapped reads can be characterized by their "length" and "mismatch rate". So are you trying to get the coverage of the sequence, get reads where >80% match the reference, get reads that cover 80% of the reference (read length), or something else entirely?

0
Entering edit mode

Yes, I want the reads that cover 80% of my interested region.

1
Entering edit mode
13 days ago
colindaven ★ 2.8k

Have a look at something like sambamba depth to obtain coverage stats (for a bam mapped to a reference genome)

From a bash script I use frequently:

    input=$i sec_input=${input%%.bam}

window=100000
overlap=50000
covMax=999999999


sambamba depth window -t $threads --max-coverage=$covMax --window-size=$window --overlap$overlap -c 0.00001 ${sec_input}.bam >${sec_input}_cov_window.txt &

0
Entering edit mode

Thanks, but what is the input file?

1
Entering edit mode

0
Entering edit mode

If all this is not clear I'd recommend some basics:

Galaxy - try the M. tuberculosis Variant Analysis tutorial.

https://training.galaxyproject.org/training-material/topics/variant-analysis/

And or there is lots of stuff on youtube on basic NGS bioinformatics.

0
Entering edit mode

I have tried with my sorted bam file but the message show that sambamba-depth: All files must be coordinate-sorted. Processing reference #3317 (tufA) Processing reference #3318 (fusA) Then it generated empty file, only the header column show in file.

0
Entering edit mode

It looks like you aligned to genes individually ? The standard approach is to align the reads to the complete genome.

0
Entering edit mode

Oh, I see. the region that I interested is the gene. Anyways thanks.

1
Entering edit mode

You could use samtools idxstats your.bam to get number of reads that are aligned to each gene in your reference.

1
Entering edit mode

I think you need to adjust your question. What did you do ? Which organism ? What are your research questions. Normally, align to the genome, then view the coordinate sorted bam in IGV.

For gene positions, get a GFF3 or GTF file and view this together with your reads. Good luck

0
Entering edit mode

Thanks for advice! I have question about mapping to reference genome (FASTA) with gene annotation file (GTF or GFF) what is difference of output that they generate in SAM file when compare to mapping with mapping to reference genome (FASTA) only. Thanks in advance!