CNVKit does not output all the accessible regions in the targets bed file
1
0
Entering edit mode
16 months ago
ashenflower ▴ 30

Hello everybody,

I am using CNVkit on my data using hg38 as reference. The command that I am using is the following:

cnvkit.py batch sample.bam  -n control.bam -m wgs -f reference.fasta  --target-avg-size 1000 --output-dir results/

So, on the CNVkit documentation page is reported that

The batch --method wgs option uses the given reference genome’s sequencing-accessible regions (“access” BED) as the “targets” – these will be calculated on the fly if not provided.

Problem is, when it calculates accesibile regions from the reference, there are a lot of them listed in the log file that are not present in the final .bed output, and I cannot figure out why. Those are the outputs I get on a short example:

Log file:

WGS protocol: recommend '--annotate' option (e.g. refFlat.txt) to help locate genes in output files.
NC_000001.10: Scanning for accessible regions
        Accessible region NC_000001.10:10000-177417 (size 167417)
        Accessible region NC_000001.10:227417-267719 (size 40302)
        Accessible region NC_000001.10:317719-471368 (size 153649)
        Accessible region NC_000001.10:521368-2634220 (size 2112852)
        Accessible region NC_000001.10:2684220-3845268 (size 1161048)
        Accessible region NC_000001.10:3995268-13052998 (size 9057730)
        Accessible region NC_000001.10:13102998-13219912 (size 116914)
        Accessible region NC_000001.10:13319912-13557162 (size 237250)
        Accessible region NC_000001.10:13607162-17125658 (size 3518496)
        Accessible region NC_000001.10:17175658-29878082 (size 12702424)
        Accessible region NC_000001.10:30028082-103863906 (size 73835824)
        Accessible region NC_000001.10:103913906-120697156 (size 16783250)
        Accessible region NC_000001.10:120747156-120936695 (size 189539)
        Accessible region NC_000001.10:121086695-121485434 (size 398739)
        Accessible region NC_000001.10:142535434-142731022 (size 195588)
        Accessible region NC_000001.10:142781022-142967761 (size 186739)
        Accessible region NC_000001.10:143117761-143292816 (size 175055)
        Accessible region NC_000001.10:143342816-143544525 (size 201709)
        Accessible region NC_000001.10:143644525-143771002 (size 126477)
        Accessible region NC_000001.10:143871002-144095783 (size 224781)
        Accessible region NC_000001.10:144145783-144224481 (size 78698)
        Accessible region NC_000001.10:144274481-144401744 (size 127263)
        Accessible region NC_000001.10:144451744-144622413 (size 170669)
        Accessible region NC_000001.10:144672413-144710724 (size 38311)
        Accessible region NC_000001.10:144810724-145833118 (size 1022394)
        Accessible region NC_000001.10:145883118-146164650 (size 281532)
        Accessible region NC_000001.10:146214650-146253299 (size 38649)
        Accessible region NC_000001.10:146303299-148026038 (size 1722739)
        Accessible region NC_000001.10:148176038-148361358 (size 185320)
        Accessible region NC_000001.10:148511358-148684147 (size 172789)
        Accessible region NC_000001.10:148734147-148954460 (size 220313)
        Accessible region NC_000001.10:149004460-149459645 (size 455185)
        Accessible region NC_000001.10:149509645-205922707 (size 56413062)
        Accessible region NC_000001.10:206072707-206332221 (size 259514)
        Accessible region NC_000001.10:206482221-223747846 (size 17265625)
        Accessible region NC_000001.10:223797846-235192211 (size 11394365)
        Accessible region NC_000001.10:235242211-248908210 (size 13665999)
        Accessible region NC_000001.10:249058210-249240621 (size 182411)
NT_113878.1: Scanning for accessible regions
        Accessible region NT_113878.1:0-106433 (size 106433)
NT_167207.1: Scanning for accessible regions
        Accessible region NT_167207.1:0-547496 (size 547496)
NC_000002.11: Scanning for accessible regions
        Accessible region NC_000002.11:10000-3529312 (size 3519312)
        Accessible region NC_000002.11:3579312-5018788 (size 1439476)
        Accessible region NC_000002.11:5118788-16279724 (size 11160936)
        Accessible region NC_000002.11:16329724-21153113 (size 4823389)
        Accessible region NC_000002.11:21178113-31705550 (size 10527437)
        Accessible region NC_000002.11:31705551-31725939 (size 20388)
        Accessible region NC_000002.11:31726790-31816827 (size 90037)
        Accessible region NC_000002.11:31816828-31816854 (size 26)
        Accessible region NC_000002.11:31816855-31816858 (size 3)
        Accessible region NC_000002.11:31816859-33092197 (size 1275338)
        Accessible region NC_000002.11:33093197-33141692 (size 48495)
        Accessible region NC_000002.11:33142692-87668206 (size 54525514)
        Accessible region NC_000002.11:87718206-89630436 (size 1912230)
        Accessible region NC_000002.11:89830436-90321525 (size 491089)
        Accessible region NC_000002.11:90371525-90545103 (size 173578)
        Accessible region NC_000002.11:91595103-92326171 (size 731068)
        Accessible region NC_000002.11:95326171-110109337 (size 14783166)
        Accessible region NC_000002.11:110251337-149690582 (size 39439245)
        Accessible region NC_000002.11:149790582-234003741 (size 84213159)
        Accessible region NC_000002.11:234053741-239801978 (size 5748237)
        Accessible region NC_000002.11:239831978-240784132 (size 952154)
        Accessible region NC_000002.11:240809132-243102476 (size 2293344)
        Accessible region NC_000002.11:243152476-243189373 (size 36897)
NC_000003.11: Scanning for accessible regions
        Accessible region NC_000003.11:60000-17137943 (size 17077943)
NT_113878.1: Joining over small gaps
NT_167207.1: Joining over small gaps
Wrote GCF_000001405.25_GRCh37.p13_genomic.bed with 2 regions

So, multuple accessible regions are present for sequences NC_000001.10 , NC_000002.11 and NC_000003.11, but only regions for NT_113878.1 and NT_167207.1 are reported:

NT_113878.1     0       106433
NT_167207.1     0       547496

Why is that so? I cannot really figure out.

EDIT:

Following I report the .bam file header and the reference header, respectively:

.bam

@HD     VN:1.6  SO:coordinate
@SQ     SN:NC_000001.10 LN:249250621
@SQ     SN:NT_113878.1  LN:106433
@SQ     SN:NT_167207.1  LN:547496
@SQ     SN:NC_000002.11 LN:243199373
@SQ     SN:NC_000003.11 LN:17137943

ref.fasta

>NC_000001.10
>NT_113878.1
>NT_167207.1
>NC_000002.11
>NC_000003.11
CNVkit cnv bed target wgs • 1.8k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
2
Entering edit mode
16 months ago
barslmn ★ 2.1k

Could you post the chromosome names in the bam file and your bed file? They're probably not matching.

ADD COMMENT
0
Entering edit mode

Hello, my bed file is created by CNVkit directly from the reference I am providing, so the names (of the few regions which are reported in the final bed file given in output) are the same. (Or maybe I didn't correctly understood your question).

ADD REPLY
0
Entering edit mode

Is the chromosome names are the same in both reference fasta and input bam? Could you post the sample.bam header and reference.fasta identifier lines?

ADD REPLY
0
Entering edit mode

I edited the question and reported them there, they look the same. But I produced the bam file my mapping my reads to that reference, so I would be a bit surprised if they wouldn't

ADD REPLY
0
Entering edit mode

That kind of mismatch is the usual suspect. What is the result if you manually create the bed file and supply it with --targets argument?

NC_000001.10 0 249250621
NT_113878.1  0 106433
NT_167207.1  0 547496
NC_000002.11 0 243199373
NC_000003.11 0 17137943
ADD REPLY
0
Entering edit mode

If I do like this, it works (apparently) correctly. That's also what I have done in the end on my extended data, but I wanted to understand if it were a problem of the software, or maybe I was doing something wrong

ADD REPLY
1
Entering edit mode

TLDR:

My suggestion would be to use chromosome with UCSC names like "chr1 chr2 ...", Which you can download from here: https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml

Next to this text:

Do you want files preformatted for use in analysis pipelines?

In depth

  1. One option is to ask to author.

  2. Another option is looking at the source code.

The CNVkit lives at github: https://github.com/etal/cnvkit

You're calling the batch command so good place to start is here:

enter image description here

The part you're interested in is here. At line 36 it checks if a target bed file is given and at 38 if an access bed file is given and at 40 if a reference sequence is given. In your case since neither bed files are given and reference fasta is given block under the elif fasta is evaluated. Here it calls a function access.do_access. This function creates your bed file. We should check that out.

enter image description here

This function has a check at line 18 skip_noncanonical which is set to True by default in function definition at line 15. Which mean drop_noncanonical_contigs at line 19 is evaluated. This is the function below which imports is_canonical_contig_name from antitarget file.

enter image description here

Here they have a regex for filtering out non canonical chromosome names. Here they skip the chromosome names starting with NC. They should be filtering out chromosomes starting with NT too since those are also not canonical.

enter image description here

ADD REPLY
0
Entering edit mode

Thank you so much for your reply! I got misleaded by the fact that the NT chromosomes were in the output, so I didn't think the names could be involved. I wrote on github as a first move, but I didn't receive an answer yet. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1488 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6