Question: Parsing gff files using python and awk
0
gravatar for Afagh
5 months ago by
Afagh20
United states
Afagh20 wrote:

I have two text file that have lines like this: NC_013520.1 RefSeq gene 2229 3341 . + . ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002

I want to extract gene ids of all genes, and then compare it with another file that I have. I wrote this script but it doesn't consider the white spaces!

awk FS= "{\t ;:}" '($3 == "gene") && ($11=="Dbxref=GeneID") {print $1,$4,$5,$7,$12} ($3=="gene") && ($12=="Dbxref=GeneID") {print $1,$4,$5,$7,$13}' gff.gff

Output:

NC_013520.1 2131416 2131550 - 8637363

I can still use the output, but I want to learn how to also consider spaces as delimiter!

Also when I have this output ready and a similar output I want to compare their fields for example to find similar GeneIDs in both files.

awk • 252 views
ADD COMMENTlink modified 5 months ago by WouterDeCoster20k • written 5 months ago by Afagh20
1

I think FS= "{\t ;:}" will not compile, it should be -F"[\t ;:]" or -v FS="[\t ;:]" (I don't know if your script will do what you expect after this).

Also, if you use awk you assume that each line has the same number of fields in the same order. This is fine for the first 8 columns of the gtf, but for the "attribute" column there is no guarantee that lines have the same format (maybe in your case it's ok). I would prefer to explicitly look for the field you are interested, I think this is what shenwei356's perl option does.

ADD REPLYlink modified 5 months ago • written 5 months ago by dariober8.0k
2
gravatar for shenwei356
5 months ago by
shenwei3563.2k
China
shenwei3563.2k wrote:

How about retrieving the GeneID as a new column and discard it after downstream analysis.

I'm not familiar with awk, but you can try csvtk.

Sample data:

$ cat 1.gff 
NC_013520.1     RefSeq  gene    2229    3341    .       +       .       ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002

Retrieve the GeneID using regular expression as a new column (10th column).

$ cat 1.gff |  csvtk mutate -H -t -f 9 -p 'GeneID:(\d+)'
NC_013520.1     RefSeq  gene    2229    3341    .       +       .       ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002    8635442

Select columns you want:

$ cat 1.gff |  csvtk mutate -H -t -f 9 -p 'GeneID:(\d+)' | csvtk cut -H -t -f 1,4,5,7,10
NC_013520.1     2229    3341    +       8635442

Do somethings....

Remove the last (10th) column:

$ cat 10-columns | csvtk -H -t cut -f -10

You may need use csvtk inter to find common GeneID of two or more files.

$ cat 1.gff |  csvtk mutate -H -t -f 9 -p 'GeneID:(\d+)' > 1.m.gff
$ csvtk inter -H -t -f 10 1.m.gff 2.m.gff
8635442
ADD COMMENTlink modified 5 months ago • written 5 months ago by shenwei3563.2k
2
gravatar for shenwei356
5 months ago by
shenwei3563.2k
China
shenwei3563.2k wrote:

It's not easy using awk, try Perl:

$ cat 1.gff | perl -ane '$F[8]=~/GeneID:(\d+)/; print "$F[0]\t$F[3]\t$F[4]\t$F[6]\t$1\n";'
NC_013520.1     2229    3341    +       8635442
ADD COMMENTlink written 5 months ago by shenwei3563.2k

Thanks, what if I want to extract the gene name? I tried $F[8]=~/Name:(\c+)/ didn't work.

ADD REPLYlink written 5 months ago by Afagh20
1

shoud be \w not \c for letters or numbers. learn more about the regular expression

ADD REPLYlink written 5 months ago by shenwei3563.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1357 users visited in the last hour