Question

PopGenome - VCF, fasta, GTF and codons still missing

0

Entering edit mode

3.5 years ago

Maciek_Konopinski • 0

Dear All, I have been struggling with PopGenome for a while and I ran out of ideas. I have data in VCF format with SNPs from several fragments of a few genes, a reference fasta with the same names as in VCF, and GTF like:

1 VCF CDS 68 124 . + 2 gene_id "Gene.100265";
2 VCF CDS 126 405 . + 1 gene_id "Gene.100265";
3 VCF CDS 447 820 . + 1 gene_id "Gene.100265";
4 VCF CDS 864 1078 . + 1 gene_id "Gene.100265";

The genes are sequenced in a few hundred individuals. Each gene was sequenced in a few fragments and the fragments are the same for all individuals. I do not know to make PopGenome include information on coding sequence. I used: PGfile <- PopGenome::readData("./vcf/",gffpath = "./gtf/", format = "VCF", include.unknown = TRUE) PGfile <- set.synnonsyn(PGfile, ref.chr=paste0("./FASTA/references_uncoded_246_.fasta"))

PGfile@region.data@CodingSNPS are all TRUE, but PGfile@region.data@ExonSNPS are all FALSE. For some reason I get only a fraction of information from PGfile@region.data@codons and PGfile@region.data@n.nucleotides which anyway is NULL.

I guess this might be a problem about gtf file format but found no clue on how it should be formatted for PopGenome. I used "Gene", "Exon" and "CDS" (as above) for the third column, but nothing has changed.

Does anyone have idea what I did wrong?

Cheers,

Maciek Konopiński

The session info is as follows (all necessary packages are loaded):

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

PopGenome_2.7.5

EDIT I have added a lot of redundant information to gff file and now it looks as follows:

Gene.138885 . Region 1 369 . + . ID=Gene.138885;Name=Gene.138885;chromosome=Gene.138885;genome=chromosome;mol_type=genomic DNA

Gene.138885 . mRNA 1 369 . + . ID=Gene.138885_mRNA;Name=Gene.138885;chromosome=Gene.138885;genome=chromosome;mol_type=genomic DNA

Gene.138885 . gene 1 369 . + . ID=Gene.138885_gene;Name=Gene.138885;chromosome=Gene.138885;genome=chromosome;mol_type=genomic DNA

Gene.138885 . exon 264 343 . + . ID=Gene.1388851exon

Gene.138885 . CDS 264 343 . + 1 ID=Gene.1388851CDS

I reached another level of problems - set.synnonsyn returns error: Error in START[!REV, 3] : incorrect number of dimensions This error affects regions where only one CDS is present. In set.synnonsyn START contains a vector that has no dimensions while the function requires two-dimensional table. I will try to fix it manually, but I am a bit affraid it will have further consequences...

PopGenome VCF • 955 views

ADD COMMENT • link updated 3.1 years ago by JOrd • 0 • written 3.5 years ago by Maciek_Konopinski • 0

score 0 · Answer 1 · 2021-08-13

Dear Maciek

Hopefully you were able to solve these problems already.

I cannot comment on the main set of issues you reported. However, I also encountered the error: Error in START[!REV, 3] : incorrect number of dimensions following certain instances of set.synnonsyn which I also noticed occurred for genes which have only a single CDS listed in the gtf file. There is a 'dirty' way to fix this which is to just 'copy and paste' any lone CDS in the gtf file, thereby adding a dummy CDS. As each dummy is a duplicate and therefore covers the same coordinates, it should not affect the number of coding SNPs detected. Of course, this is not an elegant solution, but it enabled set.synnonsyn proceed without error for single-CDS genes.

First I read in the gtf and make a list of all ENSEMBL IDs of genes which have only one CDS:

library(ape)
library(dplyr)

gtf <- paste("stickleback_v5_ensembl_genes_reformatted.gtf")
allgenes <- read.gff(gtf, na.strings = c(".", "?"), GFF3 = FALSE)
allgenes<-subset(allgenes,feature=="CDS")[9]
allgenes$attributes<-gsub("ENS_gene=","",allgenes$attributes)
allgenes$attributes<-gsub(";.*","",allgenes$attributes)
colnames(allgenes)[1]<-"ensembl_gene_id"
allgenes <- add_count(allgenes,ensembl_gene_id,name="n_CDS")
singles <- subset(allgenes,n_CDS==1)[1]

Then I get a subset of the gtf containing only these single CDS genes:

allgenes <- read.gff(gtf, na.strings = c(".", "?"), GFF3 = FALSE)
allgenes$ID<-gsub("ENS_gene=","",allgenes$attributes)
allgenes$ID<-gsub(";.*","",allgenes$ID)
colnames(allgenes)[10]<-"ensembl_gene_id"
singlesgtf <- subset(allgenes,ensembl_gene_id %in% singles$ensembl_gene_id)

Then I rbind the CDS lines in the 'singles' gtf onto the initial gtf, resulting in a final gtf with the dummy CDS added

singlesgtf2<-rbind(allgenes,subset(singlesgtf,feature=="CDS"))[-c(10)]
singlesgtf2 <- singlesgtf2[order(singlesgtf2$seqname, singlesgtf2$start),]
write.table(singlesgtf2,file="stickleback_v5_ensembl_genes_reformatted_dummyCDS.gtf",sep="\t",quote=FALSE,row.names = FALSE,col.names = FALSE)

PopGenome no longer throws the above mentioned error when using this amended gtf.