I am trying to build a cellranger arc genome for the canine canFam4 genome build. Everything was going quite smoothly until I got this error:
mkref has FAILED Error building reference package Invalid gene annotation input: in GTF records for gene_id RPL10A are not contiguous in the file
So I used grep to pull out every line containing "RPL10A" from the GTF file, and there was indeed a gap for that gene, starting at line 3941 and ending at line 5535. Here is where I am confused: the RPL10A records in the 3900s are on chromosome 11, and the RPL10A records in the 5530s are on chromosome 12. I have never really built a genome beyond following basic tutorials, so I do not know whether this is an error, whether I should remove one of these (and how do I decide which?), or what it really means. I have never manipulated GTF files before, but I need to know what is going on because this does not make sense to me. Thanks
Where did you get this GTF file? Is it a direct download from a reference genome website such as GENCODE?
I downloaded the GTF file from the UCSC genomes Database
Please share the link to the file.
This seems to be an entry unique to UCSC - both EnsEMBL and NCBI have the RPL10A gene on chr12, not chr11. In fact, all 4 breeds available on EnsEMBL have the gene on chr12. See screenshot below. The NM_ identifier is also made up by UCSC - there is no NM_001252145_2 in NCBI. I prefer to never use UCSC resource files, as EnsEMBL > NCBI > UCSC as far as standardization goes.
I'd recommend going with EnsEMBL's GTF for the generic C. familiaris or for a more specific breed if you'd prefer that (you can navigate to the parent directory and pick a different folder to match your breed requirement).
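If you want to check any GTF for the same problem before running mkref, here is a rough Python sketch. It reports every gene_id whose records are split into more than one block in the file, along with the chromosome and line range of each block. The function name find_noncontiguous_genes is made up for illustration, and I am assuming that "not contiguous" simply means the same gene_id reappears after a different gene_id has been seen. The exact check mkref performs may differ.

```python
import re
import sys
from collections import defaultdict

# Pulls the gene_id value out of the GTF attribute column (column 9).
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def find_noncontiguous_genes(lines):
    """Return {gene_id: [(chrom, first_line, last_line), ...]} for every
    gene_id whose records form more than one block in the input.

    Assumption: a block ends as soon as a record with a different gene_id
    appears. This is an illustrative check, not mkref's own logic.
    """
    blocks = defaultdict(list)  # gene_id -> list of [chrom, first, last]
    previous = None
    for lineno, line in enumerate(lines, start=1):
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        if gene_id == previous:
            # Same gene as the previous record: extend the current block.
            blocks[gene_id][-1][2] = lineno
        else:
            # New block for this gene, remembering which chromosome it is on.
            blocks[gene_id].append([fields[0], lineno, lineno])
        previous = gene_id
    return {
        gene: [tuple(block) for block in gene_blocks]
        for gene, gene_blocks in blocks.items()
        if len(gene_blocks) > 1
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_gtf.py annotation.gtf  (file names are placeholders)
    with open(sys.argv[1]) as handle:
        for gene, gene_blocks in find_noncontiguous_genes(handle).items():
            print(gene, gene_blocks)
```

A gene_id that shows up with blocks on two different chromosomes, like RPL10A here, is a good candidate for comparing against the EnsEMBL or NCBI annotation before deciding which records to keep.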