I would like to compare genes coordinates between 2 GTF/GFF files. I tried using Gffcompare but the classification codes depends on the introns chains. I thought of something simpler, that would compare gene lengths and say if genes are equal, longer in 3' or longer in 5'. Let's take an example with longer in 3':
> cat A.gff3 chr1 ncbiRefSeq gene 5000 8000 . + . ID=gene1;Name=gene1 chr1 ncbiRefSeq transcript 5000 8000 . + . ID=XM_gene1;Parent=gene1 chr1 ncbiRefSeq exon 5000 5524 . + . Parent=XM_gene1 chr1 ncbiRefSeq exon 6755 6829 . + . Parent=XM_gene1 chr1 ncbiRefSeq exon 7800 8000 . + . Parent=XM_gene1 > cat B.gff3 chr1 ncbiRefSeq gene 5000 10000 . + . ID=gene1;Name=gene1 chr1 ncbiRefSeq transcript 5000 10000 . + . ID=XM_gene1;Parent=gene1 chr1 ncbiRefSeq exon 5000 5524 . + . Parent=XM_gene1 chr1 ncbiRefSeq exon 6755 6829 . + . Parent=XM_gene1 chr1 ncbiRefSeq exon 7800 10000 . + . Parent=XM_gene1
The comparison of these two GFF3 files should show that gene1 in B.gff3 is longer in 3' than A.gff3. I thought also of using
bedtools overlap but this wouldn't give enough details (would just return that the genes overlap but we don't know how). If the gene in B started upstream (e.g. 4000), we should get the information that this gene is longer in 5' and in 3' in B compared to A.
Using gffcompare, the 2 genes are considered as equal because they have the same intron chains:
> gffcompare -r A.gff3 B.gff3 > cat gffcmp.B.gff3.tmap ref_gene_id ref_id class_code qry_gene_id qry_id num_exons FPKM TPM cov len major_iso_id ref_match_len gene1 XM_gene1 = gene1 XM_gene1 3 0.000000 0.000000 0.000000 2801 XM_gene1 801
Thanks for the help.
Have you tried
Yes, but it just gives genes that overlap between the 2 inputs, with no details on the way they overlap.