Hello,
I'm trying to find the genes that overlap my CNV calls. I downloaded the gene annotations for hg18 (Mar. 2006, NCBI build 36) from UCSC:
[knownGene.txt.gz]
[kgXref.txt.gz]
and did the same for the refGene annotation, as explained on the PennCNV website.
But when I run the 'scan_region.pl' command, an error occurs:
    C:\penncnv>scan_region.pl sample.rawcnv hg18_refGene.txt -refgene -reflink hg18_refLink.txt > sample.cnv.rg18
    Error: invalid record in template-location-file hg18_refGene.txt (expecting 16 or 10 tab-delimited fields in refGene file): <1410,2804,5917067  N525,1506,525,15824069132,140691R_02,,  873     7974,   215506,5254,2,1,,       218281,,
    238422,,        23-1,6,525,05784525,1392,       6913282406918345,,      87372251586LIS995,      37974586,1,-    85544155       8,   0 CEP68,2,30,15,2061314048,88390,33,0,21066480,21066480,21066480488390,33,,,291384439717  -8,,2106335,883909781,,2913XR1   9717    4695,210664805392OC1924750493576081593549121593> 
at C:\penncnv\scan_region.pl line 540 main::scanUCSCGene('sample.rawcnv', 'hg18_refGene.txt', 0, 'refgene', undef, undef) called at C:\penncnv\scan_region.pl line 108
Something seems to be broken in the annotation file. How can I avoid or fix this? I'm a biologist, not a computer scientist, so please be kind. ;)
Thank you
Can you show how hg18_refGene.txt looks?
It's a tab-delimited text file. When I open it in Excel there are 16 columns, but from line 900 onward the format seems to be broken. So I think I've found the problem: I suspect something went wrong when extracting the .gz archive.
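For anyone who wants to confirm this kind of damage without opening the file in a spreadsheet, here is a minimal Python sketch. It assumes only what the error message states, namely that every line of the refGene file should have 16 or 10 tab-delimited fields. The function name `find_bad_lines` and its parameters are made up for illustration and are not part of PennCNV.

```python
def find_bad_lines(path, allowed=(10, 16), limit=5):
    """Return up to `limit` (line_number, field_count) pairs for malformed lines.

    `allowed` holds the field counts named in the scan_region.pl error message.
    """
    bad = []
    # Read as bytes so that corrupted content cannot cause a decoding error.
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            # Strip the line ending (Unix or Windows) before counting fields.
            fields = line.rstrip(b"\r\n").split(b"\t")
            if len(fields) not in allowed:
                bad.append((number, len(fields)))
                if len(bad) >= limit:
                    break
    return bad
```

Calling `find_bad_lines("hg18_refGene.txt")` on an intact file should return an empty list. A non-empty result gives the line numbers where the format first breaks.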
Update: Yes, it was an extraction problem with PowerArchiver. Extracting the archive with WinRAR works!
You should put that in the answer and then accept it, in case someone else has the same problem.
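If no reliable archive tool is at hand, the .gz file can also be unpacked with Python's standard `gzip` module. This is only a sketch of an alternative, not the method used in this thread. The helper name `gunzip` is made up, and the file names in the usage note are assumptions based on the names in the question.

```python
import gzip
import shutil

def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst` and return the bytes written."""
    # Binary mode on both sides, so line endings and tabs are copied unchanged.
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)
        return out.tell()
```

For example, `gunzip("refGene.txt.gz", "hg18_refGene.txt")` would write the decompressed table under the name used in the command above. If the download itself is truncated or corrupted, the `gzip` module will typically raise an error instead of silently producing a damaged file.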