Question

parsing gff3 file

1

5.8 years ago

hye140817 ▴ 20

Hello,

I'm trying to map IDs to gene names using a GFF3 file. I've searched a lot on this topic, but nothing I found was exactly what I'm looking for (or the tools just didn't work well). Could anyone help me with this? Here's an example of the GFF3 annotation that I have. I replaced tabs with ' | ' for readability:

chr18 | SE | gene | 25175343 | 25203976 | . | + | . | Name=chr18:25175343:25175485:+@chr18:25182055:25182149:+@chr18:25203861:25203976:+;gid=chr18:25175343:25175485:+@chr18:25182055:25182149:+@chr18:25203861:25203976:+;refseq_id=NA;ensg_id=ENSMUSG00000033632;gsymbol=AW554918;ID=chr18:25175343:25175485:+@chr18:25182055:25182149:+@chr18:25203861:25203976:+
chr18 | SE | mRNA | 25175343 | 25203976 | . | + | . | gid=chr18:25175343:25175485:+@chr18:25182055:25182149:+@chr18:25203861:25203976:+;ID=chr18:25175343:25175485:+@chr18:25182055:25182149:+@chr18:25203861:25203976:+.A;Parent=chr18:25175343:25175485:+@chr18:25182055:25182149:+@chr18:25203861:25203976:+

I also have another file with gids, and would like to map each gid to its gsymbol, for example. Here's what the file looks like:

event_name | chrom | strand | mRNA_starts | mRNA_ends
chr18:25175343:25175485:+@chr18:25182055:25182149:+@chr18:25203861:25203976:+ | chr18 | + | 25175343,25175343 | 25203976,25203976

Then, I would like to look up the gid (chr18:25175343:25175485:+@chr18:25182055:25182149:+@chr18:25203861:25203976:+) in the GFF3 file and print it out with its gsymbol (AW554918). The output I'm trying to get is something like:

event_name | chrom | strand | mRNA_starts | mRNA_ends | gsymbol
chr18:25175343:25175485:+@chr18:25182055:25182149:+@chr18:25203861:25203976:+ | chr18 | + | 25175343,25175343 | 25203976,25203976 | AW554918

I think I need to parse the attributes in the GFF3 file and map the gid in the second file to the gsymbol. Could you show me how to parse the attributes into multiple columns? Python or R would be a bit easier for me to understand. Or is there a simpler way to do this? Any suggestions would be really appreciated. Thank you.

RNA-Seq • 2.2k views

score 4 · Accepted Answer · 2018-07-27

5.8 years ago

cmdcolin ★ 3.8k

This is one of those cases where the GFF format really does not shine: the attributes are all lumped together in column 9, which makes them hard to parse. I have come to like BED format, which puts arbitrary information into individual columns (with autoSql to describe them), but GFF seems to be more common.

Anyway, if you want to use a full GFF3 parser, this one from our team is new and JavaScript-based, but it should be performant: https://github.com/GMOD/gff-js

To use it, install it from npm:

mkdir test
cd test
npm install @gmod/gff

Then make a new file parse_attributes.js like this

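const gff = require('@gmod/gff').default
const fs = require('fs')

// Stream the GFF3 file through the parser. Each parsed item arrives as an
// array of records, hence the forEach below.
fs.createReadStream('yourfile.gff')
  .pipe(gff.parseStream())
  .on('data', data => {
    data.forEach(record => {
      // Print gid and gsymbol as two tab-separated columns
      console.log(record.attributes.gid + '\t' + record.attributes.gsymbol)
    })
  })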

Then run

node parse_attributes.js > attributes.txt

That will create a two-column file, attributes.txt, with gid and gsymbol. Note that this parser reconstructs parent-subfeature relationships, so it will only output data for the top-level gene features, but I assume that is what you want anyway.

Then you can simply use the Unix command join to combine the two text files, but make sure they are sorted:

sort attributes.txt > attributes.sorted.txt
sort events.txt > events.sorted.txt
join -t $'\t' -1 1 -2 1 attributes.sorted.txt events.sorted.txt

Here attributes.txt is the output of the Node.js script and events.txt is the other text file you refer to.
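If sorting is inconvenient (for example because events.txt starts with a header line), the same lookup can be sketched in Node with a Map instead of join. This is only an illustration: the function name addSymbols, the 'NA' placeholder for unmatched events, and the shortened sample IDs are my own assumptions, and in practice you would read the lines from attributes.txt and events.txt rather than hard-coding them.

```javascript
// Append a gsymbol column to each event line by looking up its first column.
// attributeLines: "gid<TAB>gsymbol" lines; eventLines: header line first, then events.
// addSymbols is a hypothetical helper name, not part of any library.
function addSymbols(attributeLines, eventLines) {
  const symbolByGid = new Map()
  for (const line of attributeLines) {
    const [gid, gsymbol] = line.split('\t')
    symbolByGid.set(gid, gsymbol)
  }

  const [header, ...rows] = eventLines
  const out = [header + '\tgsymbol']
  for (const row of rows) {
    const gid = row.split('\t')[0]
    // 'NA' marks events whose gid has no match (an arbitrary choice here)
    const gsymbol = symbolByGid.has(gid) ? symbolByGid.get(gid) : 'NA'
    out.push(row + '\t' + gsymbol)
  }
  return out
}

// Sample data with shortened IDs, for illustration only
const sampleAttributes = ['chr18:25175343:25175485:+\tAW554918']
const sampleEvents = [
  'event_name\tchrom\tstrand\tmRNA_starts\tmRNA_ends',
  'chr18:25175343:25175485:+\tchr18\t+\t25175343,25175343\t25203976,25203976',
]
console.log(addSymbols(sampleAttributes, sampleEvents).join('\n'))
```

Unlike join, this does not depend on input order, and it puts gsymbol in the last column, as in the output you describe.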

ADD COMMENT • link 5.8 years ago by cmdcolin ★ 3.8k

1

Note: I realize that you could probably do this same thing using some more lightweight text chopping tools but hopefully using the actual gff3 parser and nodejs is not too much of an added pain point.
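For what it's worth, a lightweight version might look roughly like the sketch below, which splits column 9 by hand instead of using the parser. It assumes plain key=value pairs separated by ';' and does not handle percent-encoded or multi-valued attributes; the function names parseAttributes and gidAndSymbol are made up for this example, and the sample line uses a shortened gid.

```javascript
// Split a GFF3 column-9 string into a plain key -> value object.
// Assumes simple key=value pairs separated by ';' (no percent-decoding,
// no special handling of comma-separated multi-values).
function parseAttributes(column9) {
  const attributes = {}
  for (const pair of column9.split(';')) {
    const eq = pair.indexOf('=')
    if (eq === -1) continue // skip empty or malformed entries
    attributes[pair.slice(0, eq)] = pair.slice(eq + 1)
  }
  return attributes
}

// Return [gid, gsymbol] for a 'gene' line, or null for any other line.
function gidAndSymbol(line) {
  if (line.startsWith('#')) return null // comments and directives
  const columns = line.split('\t')
  if (columns.length < 9 || columns[2] !== 'gene') return null
  const attrs = parseAttributes(columns[8])
  return [attrs.gid, attrs.gsymbol]
}

// Sample gene line with a shortened gid, for illustration only
const example =
  'chr18\tSE\tgene\t25175343\t25203976\t.\t+\t.\t' +
  'gid=chr18:25175343:25175485:+;refseq_id=NA;gsymbol=AW554918'

// Prints the gid and AW554918 separated by a tab
console.log(gidAndSymbol(example).join('\t'))
```

Filtering on the gene lines mirrors what the parser-based script outputs, since only those lines carry gsymbol in your example.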

ADD REPLY • link 5.8 years ago by cmdcolin ★ 3.8k

1

Thanks cmdcolin! Good to know about this tool, and your example script helped me a lot! :)))

ADD REPLY • link 5.8 years ago by hye140817 ▴ 20

1

Sure thing. Also note that the Unix command join requires sorted text files; I missed that step in my explanation. The join command is great for combining two different text files based on columns, but it requires that they are sorted, or the output may be incomplete.

ADD REPLY • link 5.8 years ago by cmdcolin ★ 3.8k

0

Very helpful. Thanks a lot!!

ADD REPLY • link 5.8 years ago by hye140817 ▴ 20
