Question: How to find total size of regions NOT in bigWig file
gravatar for Rubal
9 months ago by
Rubal270 wrote:

Hello All,

I would like to calculate the 'callable genome'. I have a reference genome in fasta format and a bed file with a list of genome coordinates that were excluded from variant calling. Could anybody advise on how to combine these two files to calculate how many sites in the reference genome were NOT excluded from analyses. So number of bases in the fasta file minus number of bases in regions listed in the bed file? I have found some solutions that would involved converting into bigWig format but I was wondering if there is a simpler tool for this.

Thanks in advance for your help.

Example of bed file regions:

contig1  1588    1589
contig1  3424    3428
contig2  0       401
genome bed fasta • 288 views
ADD COMMENTlink modified 9 months ago by ATpoint30k • written 9 months ago by Rubal270
gravatar for ATpoint
9 months ago by
ATpoint30k wrote:

Maybe I get this wrong, but can't you simply count the bp in the fasta as:

seqtk comp your.fasta | awk '{sum += $2} END {print sum}'

and from this subtract the bp in the BED, where bp in BED is:

awk '{sum += ($3-$2)} END {print sum}'  your.bed

Requires seqtk.

ADD COMMENTlink written 9 months ago by ATpoint30k

As an alternative to seqtk could I also use: grep -v ">" file.fasta | wc | awk '{print $3-$1}'

ADD REPLYlink modified 9 months ago by ATpoint30k • written 9 months ago by Rubal270
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 688 users visited in the last hour