Question: Combining two databases in a GTF file, bedtools gives weird result
8 months ago by
Farmington, CT
marina.v.yurieva (480) wrote:

I'm trying to put genes from ENCODE and NONCODE into one GTF file and filter out the NONCODE genes that are already in ENCODE. My bedtools command is:

bedtools intersect -wa -a NONCODE.gtf -b Gencode_exon.gtf -v -s -f 0.9

(I only take "exon" features from Gencode, otherwise it removes more transcripts from NONCODE)

This works okay for the most part but if I have overlapping transcripts in my NONCODE file, I get only one exon in the output:

[Image not preserved]

Does anybody have any idea why this happens and how to fix it? I can use another tool; I just need to make sure I get all the Gencode genes and only those NONCODE transcripts that don't almost completely (90%) overlap with Gencode. It has to be strand-specific too.
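A possible explanation, offered as a sketch rather than a confirmed diagnosis: bedtools intersect evaluates every GTF line as an independent interval, so with -v -s -f 0.9 it drops only those individual NONCODE lines that are at least 90% covered by a single same-strand Gencode line, and the remaining exons of the same transcript are still reported. The toy Python below mimics that per-line logic. The function names, coordinates, and transcript ID are invented for illustration, and half-open coordinates are used for simplicity.

```python
# Toy model of per-line filtering, roughly what `intersect -v -s -f 0.9` does.
# All coordinates and IDs are made up; intervals are half-open (start, end, strand).

def covered_fraction(a, b):
    """Fraction of interval a covered by interval b; 0 if strands differ."""
    if a[2] != b[2]:
        return 0.0
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    return max(overlap, 0) / (a[1] - a[0])

def per_line_filter(a_features, b_features, min_frac=0.9):
    """Keep each A line unless a single B line covers >= min_frac of it."""
    kept = []
    for transcript_id, interval in a_features:
        if not any(covered_fraction(interval, b) >= min_frac for b in b_features):
            kept.append((transcript_id, interval))
    return kept

noncode_exons = [
    ("NONCODE_T1", (100, 200, "+")),  # exon 1: fully inside a Gencode exon
    ("NONCODE_T1", (300, 400, "+")),  # exon 2: no Gencode overlap
]
gencode_exons = [(90, 210, "+")]

kept = per_line_filter(noncode_exons, gencode_exons)
print(kept)  # only exon 2 of NONCODE_T1 survives, so the transcript is truncated
```

In this model the transcript is neither kept whole nor removed whole, which matches the symptom of exons going missing from otherwise retained transcripts.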


If you want the GENCODE transcript where there's overlap, then why are you using -wa with -a being your NONCODE list?

-a and -b should be flipped, no?

I also question the use of -v...

To get what you want, I think that you need -wao, with -a GENCODE and -b NONCODE

written 8 months ago by Kevin Blighe (39k)

Thank you, Kevin! I tried that, but it doesn't really work the way I need. I guess that when bedtools works with a GTF file, it doesn't pay attention to the "gene", "transcript", and "exon" feature types and doesn't see the relations between them. If I use the ENCODE file with "gene" and "transcript" features, bedtools overlaps against the whole span, ignoring the fact that exons don't cover the whole gene and have gaps between them. If I use only "exon" features, then I'll have to write another script that calculates the overlap of all the exons in a gene and, if they overlap by more than 90%, excludes the whole gene_ID from the original file. I was wondering if there is already a tool that does all that...
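The script described above could look roughly like the following. This is a minimal sketch under stated assumptions, not an existing tool: exon records are assumed to be already parsed from the two GTF files into tuples, coverage is measured against the strand-specific union of all Gencode exons (one possible reading of "overlap with Gencode"), coordinates are half-open, and every function name and ID is hypothetical.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals into a sorted union."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_bases(exon, merged):
    """Number of bases of one exon covered by a list of disjoint intervals."""
    start, end = exon
    return sum(max(0, min(end, m_end) - max(start, m_start))
               for m_start, m_end in merged)

def filter_transcripts(noncode, gencode, min_frac=0.9):
    """Return IDs of NONCODE transcripts whose exonic bases are < min_frac covered.

    noncode: iterable of (transcript_id, strand, start, end) exon records
    gencode: iterable of (strand, start, end) exon records
    """
    by_strand = defaultdict(list)
    for strand, start, end in gencode:
        by_strand[strand].append((start, end))
    merged = {s: merge_intervals(iv) for s, iv in by_strand.items()}

    total = defaultdict(int)    # exonic bases per transcript
    covered = defaultdict(int)  # of those, bases covered by same-strand Gencode exons
    for tid, strand, start, end in noncode:
        total[tid] += end - start
        covered[tid] += covered_bases((start, end), merged.get(strand, []))
    return sorted(t for t in total if covered[t] / total[t] < min_frac)

# Invented example records
noncode = [
    ("T1", "+", 100, 200), ("T1", "+", 300, 400),  # fully covered: dropped
    ("T2", "+", 100, 200), ("T2", "+", 500, 600),  # half covered: kept
    ("T3", "-", 100, 200),                         # opposite strand: kept
]
gencode = [("+", 90, 210), ("+", 150, 410)]

keep = filter_transcripts(noncode, gencode)
print(keep)
```

The returned IDs can then be used to copy every original line of those transcripts from the NONCODE file unmodified, so retained transcripts keep all of their exons.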

written 8 months ago by marina.v.yurieva (480)

Some example data and expected output would help to understand and resolve the issue better @OP

written 8 months ago by cpad0112 (11k)

Sorry, I'll try to explain again. I need to add the transcripts from NONCODE to the Gencode database but have to make sure that the ones which overlap with Gencode by more than 90% are not included (the transcripts in the circle should be excluded). The closest I've gotten to this is by using

bedtools intersect -wa -a NONCODE.gtf -b Gencode_exon.gtf -v -s -f 0.9

But that option excludes some original exons from the NONCODE transcripts (see exons with the arrows), and I need to add the original full transcripts, not modified.

[Figure: Gencode_NONCODE overlap]

I tried the suggested option, bedtools intersect -wao -a Gencode.gtf -b NONCODE.gtf -s, but it reports the overlap of every feature with every other feature (genes with exons, transcripts with exons, etc.) and would require a lot of downstream parsing. I don't know whether any other software can compute an overlap across the exons of one transcript, but bedtools doesn't seem to do that.

written 8 months ago by marina.v.yurieva (480)