Question

Subsetting GFF3 files for gene models using Bash

0

Entering edit mode

5.4 years ago

geneticatt ▴ 140

Hi all,

I've been wondering about best practices in pattern matching in bash. On sizable files and with lots of strings to search for, the issue becomes computationally burdensome quickly. Here's an example to illustrate a problem I've had. Hoping to learn about alternative solutions here.

I generated a set of genome annotations using BRAKER2 and then ran through a set of filters to whittle away at the false positive models. I arrived at final gene model sets that only included ~ 30-40% of the BRAKER2 gene models, which is in line with the expected number of genes for the taxa I'm working with. Now that I have the final set of models, I want to subset the full GFF3s that BRAKER2 produced to exclude the gene models that I filtered out. I have a python script that can use a list of accepted genes to subset GFF3, but I really want to perform the same task in a bash command. I thought I'd try using grep, since it's -f flag to grep whole lists (like my gene list) seemed like the tool for the job.

Here's what the GFF3 looks like. It has 9 fields total, but only the 9th contains any useful info for this subsetting.

scaffold3       AUGUSTUS        gene    3808    4464    0.48    -       .       ID=g1;
scaffold3       AUGUSTUS        mRNA    3808    4464    0.48    -       .       ID=g1.t1;Parent=g1
scaffold3       AUGUSTUS        stop_codon      3808    3810    .       -       0       Parent=g1.t1;
scaffold3       AUGUSTUS        CDS     3808    3976    0.48    -       1       ID=g1.t1.CDS1;Parent=g1.t1
scaffold3       AUGUSTUS        exon    3808    3976    .       -       .       ID=g1.t1.exon1;Parent=g1.t1;
scaffold3       AUGUSTUS        intron  3977    4306    0.48    -       .       Parent=g1.t1;
scaffold3       AUGUSTUS        CDS     4307    4464    0.57    -       0       ID=g1.t1.CDS2;Parent=g1.t1
scaffold3       AUGUSTUS        exon    4307    4464    .       -       .       ID=g1.t1.exon2;Parent=g1.t1;
scaffold3       AUGUSTUS        start_codon     4462    4464    .       -       0       Parent=g1.t1;
scaffold3       AUGUSTUS        gene    12096   12527   0.82    -       .       ID=g2;
scaffold3       AUGUSTUS        mRNA    12096   12527   0.82    -       .       ID=g2.t1;Parent=g2
scaffold3       AUGUSTUS        stop_codon      12096   12098   .       -       0       Parent=g2.t1;
scaffold3       AUGUSTUS        CDS     12096   12527   0.82    -       0       ID=g2.t1.CDS1;Parent=g2.t1
scaffold3       AUGUSTUS        exon    12096   12527   .       -       .       ID=g2.t1.exon1;Parent=g2.t1;
scaffold3       AUGUSTUS        start_codon     12525   12527   .       -       0       Parent=g2.t1;

There are a few difficulties that this formatting poses for subsetting

Problem: I didn't always accept all of the isoforms presented under the same "Parent" gene, so I can't just search for the parent (e.g. 'g1234'). Instead, I need to indicate the isoform in the search as well (e.g. 'g1234.t2'). This is problematic, because to preserve GFF3 format, I also need to pull the row that indicated gene, which is ambiguous to isoform. `scaffold3 AUGUSTUS gene 3808 4464 0.48 - . ID=g1;

Solution: By duplicating the gene model name 'g1234.t2' and excluding the isoform call in the duplicated name then appending a ';' at the back side, I can target that gene row that would have otherwise been excluded. The semicolon happens to also resolve the issue of accidentally matching with a longer gene name when searching for a shorter name that is contained within it (e.g. matching 'g12345' when looking for 'g123').

Here's a look at the gene list formatting, for clarity:

g33215;

g33215.t1

g37800;

g37800.t1

g22539;

g22539.t1

I ended up wrapping grep in a shell script and submitting it to the cluster that way, since (I thought) job time restrictions in interactive mode were killing grep, but it turns out grep was being killed anyways.

command: grep -f genemodelnames.txt Jure.gff3 > subset/Jure.gff3

Looking at the error files made it pretty clear that 8GB of memory wasn't enough for this job. Unsubsetted GFF3 files are ~75MB each & the lists of genes are ~40,000 each. Its obvious to me now that grep does not scale well. I should have stuck to python.

Question: If you had to do this in bash, how would you have done it?

genome bash annotation gff3 BRAKER2 • 3.1k views

ADD COMMENT • link updated 4.7 years ago by zhanglong_6 • 0 • written 5.4 years ago by geneticatt ▴ 140

score 0 · Answer 1 · 2020-02-17

if you consider awk as being bash enough, this little bash script should work:

for i in `cat <id list file>`
do
 base=${i%.*}
 awk -v base="=$base;" -v name="=$i;" '($3 == "gene" && $9 ~ base) || $9 ~ name' <gff3 file>
done

this takes as input your original geneID list (so the full name) and replace <gff3 file> with your gff input file.

this being said: there are likely better options for filtering gff files.

score 0 · Answer 2 · 2020-10-18

0

Entering edit mode

4.7 years ago

zhanglong_6 • 0

maybe

grep -w will work

ADD COMMENT • link 4.7 years ago by zhanglong_6 • 0