Python pandas transforming int to float in gff subsetting
12 weeks ago

Hey guys,

I've written this Python code:

import pandas as pd
from Bio import SeqIO
import argparse
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("-h", "--help", action="help", default=argparse.SUPPRESS, help="Get partial gff given a pattern on Names field")
parser.add_argument("-g", help="-g: gff file", required=True)
parser.add_argument("-l", help="-l: list of patterns to search on Names gff field", required=True)
parser.add_argument("-o", help="-o: output file", required=True)

args = parser.parse_args()

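#make a list with the IDs
terms = []
with open(args.l) as f:
    for l in f:
        #skip blank lines so the joined pattern never contains an empty alternative
        if l.strip():
            terms.append(l.strip())
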
#open gff file
m_names=('seqname','source','feature','start','end', 'score','strand','frame','Names')
df4=pd.read_csv(args.g, names=m_names, sep="\t", skiprows=35)

#save partial gff file
df5 = df4[df4['Names'].str.contains('|'.join(terms), na=False)]
df5.to_csv(args.o, index=False, header=None, sep="\t")

I basically filter a large gff based on a list of patterns found in the last field, which I call 'Names'.

The funny thing is, if I have a list of IDs such as:
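One side note on the filter itself: `str.contains` treats the joined string as a regular expression by default, so the dots in accession IDs match any character. The sketch below uses made-up 'Names' values (not taken from the real gff) to show how escaping each term keeps the match literal.

```python
import re
import pandas as pd

# Hypothetical 'Names' values; only the first one contains the exact ID.
df = pd.DataFrame({"Names": ["Name=XP_037652843.1", "Name=XP_03765284311", None]})
terms = ["XP_037652843.1"]

# Unescaped: '.' is a regex wildcard, so the second row matches too.
loose = df[df["Names"].str.contains("|".join(terms), na=False)]

# Escaped: re.escape turns '.' into a literal dot.
pattern = "|".join(re.escape(t) for t in terms)
strict = df[df["Names"].str.contains(pattern, na=False)]

print(len(loose), len(strict))
```

Whether this matters depends on how similar the IDs in the file are; for most accession lists the unescaped version happens to give the same rows.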

['XP_037652843.1', 'XP_037652864.1']

And my original gff is (note columns 4 and 5):

NC_051307.1 Gnomon CDS 111202 111993 . + 0 ID=cds-XP_037660565.1;Parent=rna-XM_037804637.1;Dbxref=GeneID:119510355,Genbank:XP_037660565.1;Name=XP_037660565.1;gbkey=CDS;gene=LOC119510355;product=vegetative cell wall protein gp1-like;protein_id=XP_037660565.1

Once I run the code, I get a ".0" appended to the fourth and fifth columns, like this (note columns 4 and 5):

NC_051307.1 Gnomon CDS 111202.0 111993.0 . + 0 ID=cds-XP_037660565.1;Parent=rna-XM_037804637.1;Dbxref=GeneID:119510355,Genbank:XP_037660565.1;Name=XP_037660565.1;gbkey=CDS;gene=LOC119510355;product=vegetative cell wall protein gp1-like;protein_id=XP_037660565.1

So it seems that as soon as I run the code, Python converts my ints into floats. Does anyone know why, and what I should do to keep that from happening? Thank you so much! =)

Based on this:

I guess what you could do to handle NaN and integer values in the same column is to use pandas' nullable integer dtype:

df4 = pd.read_csv(args.g, names=m_names, sep="\t", skiprows=35, dtype={'start': 'Int32', 'end': 'Int32'})

I have not tried this one, but it seems to do the trick.
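For anyone who wants to check it first, here is a minimal sketch on a made-up three-column table (not the real gff) in which one 'start' field is empty. It compares default inference with the nullable 'Int32' dtype.

```python
import io
import pandas as pd

# Made-up tab-separated rows; the second row has an empty 'start' field.
text = "chr1\t100\t200\nchr1\t\t300\n"
cols = ["seqname", "start", "end"]

# Default inference: the missing value forces 'start' to float64.
inferred = pd.read_csv(io.StringIO(text), sep="\t", names=cols)

# Nullable integer dtype keeps whole numbers and stores the gap as <NA>.
typed = pd.read_csv(io.StringIO(text), sep="\t", names=cols,
                    dtype={"start": "Int32", "end": "Int32"})

print(inferred.dtypes["start"], typed.dtypes["start"])
print(typed.to_csv(sep="\t", index=False, header=False))
```

With the nullable dtype the coordinates are written back without ".0", and the missing value comes out as an empty field.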

Best, Armin

12 weeks ago
Shred ▴ 870

Pandas is a good package, but it can't do magic. The type inferred for a column can be changed by a single value anywhere in that column, or by a stray whitespace character. You can control it with the dtype parameter, as explained in the docs:

dtype : Type name or dict of column -> type, optional
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
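The "use str" route from that docs entry can also help track down which rows upset the inference. This is a rough sketch on made-up rows (the column names and values are illustrative only): the coordinates are read as plain strings, and any 'start' value that is not purely digits is listed for inspection.

```python
import io
import pandas as pd

# Made-up rows: one empty 'start' and one with a leading space.
text = "chr1\t100\t200\nchr1\t\t300\nchr1\t 400\t500\n"
cols = ["seqname", "start", "end"]

# Read coordinates as strings and keep empty fields as "" instead of NaN.
raw = pd.read_csv(io.StringIO(text), sep="\t", names=cols,
                  dtype={"start": str, "end": str}, keep_default_na=False)

# Rows whose 'start' is not made only of digits are worth inspecting.
bad = raw[~raw["start"].str.fullmatch(r"\d+")]
print(bad)
```

Because the values stay as text, writing `raw` back out with `to_csv` reproduces the coordinates exactly as they appeared in the input.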

Yeah, it was guessing those columns as floats. Setting up dtype did work. Thank you Shred.
