Question: How to find total size of regions NOT in bigWig file
0
gravatar for Rubal
8 days ago by
Rubal220
Germany
Rubal220 wrote:

Hello All,

I would like to calculate the 'callable genome'. I have a reference genome in fasta format and a bed file with a list of genome coordinates that were excluded from variant calling. Could anybody advise on how to combine these two files to calculate how many sites in the reference genome were NOT excluded from analyses. So number of bases in the fasta file minus number of bases in regions listed in the bed file? I have found some solutions that would involved converting into bigWig format but I was wondering if there is a simpler tool for this.

Thanks in advance for your help.

Example of bed file regions:

contig1  1588    1589
contig1  3424    3428
contig2  0       401
genome bed fasta • 63 views
ADD COMMENTlink modified 8 days ago by ATpoint16k • written 8 days ago by Rubal220
4
gravatar for ATpoint
8 days ago by
ATpoint16k
Germany
ATpoint16k wrote:

Maybe I get this wrong, but can't you simply count the bp in the fasta as:

seqtk comp your.fasta | awk '{sum += $2} END {print sum}'

and from this subtract the bp in the BED, where bp in BED is:

awk '{sum += ($3-$2)} END {print sum}'  your.bed

Requires seqtk.

ADD COMMENTlink written 8 days ago by ATpoint16k
1

As an alternative to seqtk could I also use: grep -v ">" file.fasta | wc | awk '{print $3-$1}'

ADD REPLYlink modified 8 days ago by ATpoint16k • written 8 days ago by Rubal220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1653 users visited in the last hour