Question: extract common CDS from two gff3 files (using confidence interval)
gravatar for mahomarr
4.4 years ago by
mahomarr0 wrote:

Is there any script/program available to compare two gff3 files and find common matches? If possible, with a confidence interval, I would like to accept as equivalent a +2 confidence interval for "start" column, and -2 for "end" column.  


first gff3 file:

gene-1    GeneMark.hmm    CDS    2    181    -231.755620    +    0    ID=.
gene-2    GeneMark.hmm    CDS    2    295    -377.586623    -    0    ID=.
gene-3    GeneMark.hmm    CDS    39    278    -307.845998    +    0    ID=.
gene-4    GeneMark.hmm    CDS    2    133    -167.844314    -    0    ID=.
gene-5    GeneMark.hmm    CDS    140    298    -206.922922    -    0    ID=.

second gff3 file:

gene-1    mga    CDS    1    181    1.332138    +    2    ID=.
gene-2    mga    CDS    1    300    1.314849    -    2    ID=.
gene-3    mga    CDS    39    300    1.318473    +    3    ID=.
gene-4    mga    CDS    1    133    1.315638    -    2    ID=.
gene-4    mga    CDS    146    300    1.339628    -    2    ID=.
gene-5    mga    CDS    1    121    1.373858    -    3    ID=.
gene-5    mga    CDS    140    300    1.318199    -    2    ID=.

As you can see gene-1 from both files are equal (inside the intervals), gene-2 is outside due the end column (295-300, more than -2 difference, 298 would being acceptable).

And the second gff3 file found one more CDS in gene-4 and gene-5 than first file, so both of them must be compared with single gene-4 and gene-5 from first gff3 file.

Then the output should be the common CDS:


Thanks in advance.

ADD COMMENTlink modified 4.4 years ago by RamRS21k • written 4.4 years ago by mahomarr0

Quick answer: bedtools intersect can do what you want ( However, do you think that 2bp criteria is correct when some genes might be 10kb in length and some only 100bp (see bedtools intersect -f -r options)?

Have in mind that bedtools compares genomic intervals, but not genes. If you need to compare gene sequences use blast.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by PoGibas4.8k

yes, in this case 2bp criteria is correct. Thanks for your help.

ADD REPLYlink written 4.4 years ago by mahomarr0

You need to define a bit better exactly how you want to handle the GFF3 file, as Pgibas mentioned. For example, do you only care about coding regions and not UTRs? What if only one CDS of a gene with 100 matches, do you still want that output?

In general, bedtools, bedops, or even just GenomicRanges in R could be used to do this sort of thing. For the latter, you'd make a two ranges of starts, with one extended by the 2bp window (this is not a confidence interval...don't use that phrase in this context as it has no meaning) on each side, and then findOverlaps. You'd then do the same with the end positions and intersect the results. That would be relatively efficient, though bedtools and bedops are usually faster.

ADD REPLYlink written 4.4 years ago by Devon Ryan89k

copy that, thanks you very much.

ADD REPLYlink written 4.4 years ago by mahomarr0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1396 users visited in the last hour