Question: Filter Gtf With Another Gtf
0
gravatar for Nicolas Rosewick
6.8 years ago by
Belgium, Brussels
Nicolas Rosewick8.6k wrote:

Hi,

I've a two gtf annotation files and I want to filter one of them using the other one. The idea is to remove all the transcript that overlaps with a transcript contained in the second file. It has to be sens specific. I thought about bedtools but I don't know if such a thing is possible with it. Otherwise I will write a little script to do that but I don't want to reinvent the wheel..

Thanks

N.

gtf filter • 2.2k views
ADD COMMENTlink modified 4.9 years ago by Biostar ♦♦ 20 • written 6.8 years ago by Nicolas Rosewick8.6k
2
gravatar for Alex Reynolds
6.8 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

You could use BEDOPS gtf2bed to losslessly convert your two GTF files to BED, and then BEDOPS bedops --not-element-of to return elements from one file which do not overlap the second file:

$ gtf2bed < firstSet.gtf > firstSet.bed
$ gtf2bed < secondSet.gtf > secondSet.bed
$ bedops --not-element-of -1 firstSet.bed secondSet.bed > answer.bed

At the end, you have answer.bed, a BED file that can be converted back to GTF via awk, e.g.:

$ awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' answer.bed > filteredFirstSet.gtf

If you need to separate your GTF files by strand, you could do this with awk, e.g.:

$ gtf2bed < firstSet.gtf \
    | awk '$6=="+"' - \
    > firstSetSense.bed

$ gtf2bed < firstSet.gtf \
    | awk '$6=="-"' - \
    > firstSetAntisense.bed

Repeat for secondSet.gtf.

Run bedops --not-element-of on the two *Sense.bed files:

$ bedops --not-element-of -1 firstSetSense.bed secondSetSense.bed > answerSense.bed

Repeat for the two *Antisense.bed files:

$ bedops --not-element-of -1 firstSetAntisense.bed secondSetAntisense.bed > answerAntisense.bed

Then take the multiset union of the two answer*.bed files:

$ bedops --everything answerSense.bed answerAntisense.bed > answer.bed

Then awk that merged result back to GTF, if needed.

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by Alex Reynolds29k

Great ! Is it possible to specify a minimum overlap fraction ?

ADD REPLYlink written 6.8 years ago by Nicolas Rosewick8.6k

Yes, you can specify either bases or percentage ("fraction"). Take a look at the documentation or bedops --help.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Alex Reynolds29k

It seems that bedops do not take into account the strand of the feature.

A.bed : chr1 100 200 A 100 + 100 200 0 3 20,30,20 1,30,80
chr1 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80

B.bed chr1 90 195 A 100 + 90 195 0 3 25,30,20 5,35,90

bedops --not-element-of -1 A.bed B.bed gives me nothing but it should give me the second line of A.bed because the feature is on the negative strand and do not overlap the actual feature in B.bed (encoded on the + strand)

but when I change the chromosome of the second feature in A.bed : chr1 100 200 A 100 + 100 200 0 3 20,30,20 1,30,80
chr2 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80

bedops --not-element-of -1 A.bed B.bed gives me : chr2 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80

ADD REPLYlink written 6.8 years ago by Nicolas Rosewick8.6k

Perhaps you could use awk to split the BED files by strand and run bedops on each strand-specific subset. You can merge (multiset union) the results at the end with bedops --everything, running the BED-to-GTF conversion step on that merged result. See the edited answer for one way to approach this.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Alex Reynolds29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1840 users visited in the last hour