subset gtf file based on coordinates
2
0
Entering edit mode
6.2 years ago
Assa Yeroslaviz ★ 1.8k

Hi, I have a gtf file based on Ensembl chromosomes and gene IDs. I would like to extract the genes in it based on a second bed file. This Bed file though comes from a different source (lncRNA DB - www.noncode.org).

I would like to extract all the genes from the gtf file which falls within the coordinates of the bed files completely or partially.

I would appreciate any advices on how to do it

thanks

Assa

RNA-Seq bed gtf subset genomic ranges • 4.9k views
ADD COMMENT
2
Entering edit mode
6.2 years ago
ATpoint 81k

Use bedtools:

# Get the GTF entries that overlap with -b (-wa = return entire range of -a interval)
bedtools intersect -wa -a file.gtf -b second_file.bed

Using it without -wa will return only the interval part of -a that partially matches -b.

ADD COMMENT
0
Entering edit mode

this seems to work thanks

ADD REPLY
1
Entering edit mode
6.2 years ago

I'd load the GTF file into R and make a GRangesList out of it (split it by gene name). Then you can subsetByOverlaps() that with the BED file (you'll need to load it too). That should keep all of the genes intact. You can then write the resulting GRangesList back to a GTF file.

ADD COMMENT
0
Entering edit mode

Thanks for the suggestion, but I think I am not doing it right. :-)

I load the two files

# load the gtf file
gtf.gr <- import(con ="~/genomes/Caenorhabditis_elegans/Ensembl/WBcel235.2017/Annotation/Genes/genes.gtf", format = "gtf" )
gtf.gr <- sortSeqlevelsgtf.gr) # make sure it is sorted
gtf.gr <- sortgtf.gr)

# load the bed file
bed.NONCODE <- import("NONCODEv5_ce10.lncAndGene.sorted.ensembl.bed", format="bed")
bed.NONCODE <- sortSeqlevels(bed.NONCODE) # make sure it is sorted
bed.NONCODE <- sort(bed.NONCODE)

and split the gtf based on the geneIDs.

#split by genes to grangeslist
gtf.grl <- splitgtf.gr, mcolsgtf.gr)$gene_id)

Than I try to subset by overlaps, but i don't think it is done correctly. Do I need to subset it with the GRangesList object of with the GRanges onbject?

# subset by overlap
overlap.subset.gr <- subsetByOverlaps(x=gtf.gr, ranges = bed.NONCODE)
overlap.subset.grl <- subsetByOverlaps(x=gtf.grl, ranges = bed.NONCODE)

Does it make a different?

Exporting the GRanges objects to a gtf file using export

#export the file to a gtf format
rtracklayer::export(object = overlap.subset.gr, con = "overlap.subset.gr.gtf", format = "gtf")
rtracklayer::export(object = unlist(overlap.subset.grl), con = "overlap.subset.grl.gtf", format = "gtf")

But I get two different results here with a different number of rows. Which of the two (if any) is correct than?

thnaks Assa

ADD REPLY
0
Entering edit mode

You'll want to subset the GRangesList object, that way the gene as a whole will get selected. You shouldn't need to sort things (OK, maybe the seqlevels, I always forget about them). Make sure to export the .grl one.

ADD REPLY

Login before adding your answer.

Traffic: 1877 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6