Question

Python pandas transforming int to float in gff subsetting

0


21 months ago

marcela.uliano ▴ 90

Hey guys,

I've written this python code.

import pandas as pd
from Bio import SeqIO
import argparse
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("-h", "--help", action="help", default=argparse.SUPPRESS, help="Get partial gff given a pattern on Names field")
# required expects a boolean; the string "True" only worked because it is truthy
parser.add_argument("-g", help="-g: gff file", required=True)
parser.add_argument("-l", help="-l: list of patterns to search on Names gff field", required=True)
parser.add_argument("-o", help="-o: output file", required=True)

args = parser.parse_args()

#make a list with the IDs
terms =[]
with open(args.l) as f:
    for l in f:
        terms.append(l.rstrip("\n"))

#open gff file
m_names=('seqname','source','feature','start','end', 'score','strand','frame','Names')
df4=pd.read_csv(args.g, names=m_names, sep="\t", skiprows=35)

#save partial gff file
df5 = df4[df4['Names'].str.contains('|'.join(terms), na=False)]
df5.to_csv(args.o, index=False, header=None, sep="\t")

I basically filter a large gff based on a list of patterns found in the last field, which I call 'Names'.

The funny thing is, if I have a list of IDs such as:

['XP_037652843.1', 'XP_037652864.1']

And my original gff is (note columns 4 and 5):

NC_051307.1 Gnomon CDS 111202 111993 . + 0 ID=cds-XP_037660565.1;Parent=rna-XM_037804637.1;Dbxref=GeneID:119510355,Genbank:XP_037660565.1;Name=XP_037660565.1;gbkey=CDS;gene=LOC119510355;product=vegetative cell wall protein gp1-like;protein_id=XP_037660565.1

Once I run the code, I get a ".0" appended to the values in the fourth and fifth columns, like this (note columns 4 and 5):

NC_051307.1 Gnomon CDS 111202.0 111993.0 . + 0 ID=cds-XP_037660565.1;Parent=rna-XM_037804637.1;Dbxref=GeneID:119510355,Genbank:XP_037660565.1;Name=XP_037660565.1;gbkey=CDS;gene=LOC119510355;product=vegetative cell wall protein gp1-like;protein_id=XP_037660565.1

So it seems that as soon as I run the code, pandas converts my ints into floats. Does anyone know why, and what I should do to prevent it? Thank you so much! =)

python gff pandas • 810 views


2


Hi,

Based on this:

https://stackoverflow.com/questions/39666308/pd-read-csv-by-default-treats-integers-like-floats

I guess what you could do to handle NaN and integer values in the same column is to use one of pandas' nullable integer dtypes:

df4 = pd.read_csv(args.g, names=m_names, sep="\t", skiprows=35, dtype={'start': 'Int32', 'end': 'Int32'})

I have not tried this one, but it seems to do the trick.

Best, Armin
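To make the cause concrete, here is a minimal sketch that is not taken from the thread. It uses a made-up GFF fragment and assumes the real file has at least one line with no start/end values after the skipped header, such as a `###` directive. The sample rows and the names `gff_text`, `default_df` and `nullable_df` are illustrative only. By default an integer column with a missing value is read as float64, while the nullable `Int64` dtype keeps the integers.

```python
import io

import pandas as pd

# Made-up GFF-like rows. The "###" line stands in for any line that
# leaves the start/end fields empty (an assumption about the real file).
gff_text = (
    "chr1\tGnomon\texon\t100\t400\t.\t+\t.\tID=exon-A\n"
    "chr1\tGnomon\tCDS\t111202\t111993\t.\t+\t0\tID=cds-A\n"
    "###\n"
    "chr1\tGnomon\tCDS\t200\t300\t.\t+\t0\tID=cds-B\n"
)
cols = ("seqname", "source", "feature", "start", "end",
        "score", "strand", "frame", "Names")

# Default inference: the missing value forces start/end to float64.
default_df = pd.read_csv(io.StringIO(gff_text), names=cols, sep="\t")
print(default_df["start"].dtype)

# Nullable integer dtype: integers stay integers, the gap becomes <NA>.
nullable_df = pd.read_csv(
    io.StringIO(gff_text), names=cols, sep="\t",
    dtype={"start": "Int64", "end": "Int64"},
)
print(nullable_df["start"].dtype)

# Same filter-and-write step as in the question, on both frames.
mask = nullable_df["Names"].str.contains("cds-A", na=False)
float_out = default_df[mask.to_numpy()].to_csv(index=False, header=False, sep="\t")
int_out = nullable_df[mask].to_csv(index=False, header=False, sep="\t")
print(float_out)
print(int_out)
```

With the default read, the written coordinates carry the ".0" suffix; with `Int64` they are written as plain integers.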

21 months ago by dadrasarmin ▴ 20

score 2 · Answer 1 · 2022-07-10

2


21 months ago

Shred ★ 1.4k

Pandas is a good package, but it can't do magic. Type inference for a column can be thrown off by a single value anywhere in that column (for example a missing value, which forces an integer column to float) or by a stray whitespace character. You can control this with the dtype parameter, as explained in the docs:

dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
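As a rough illustration of the "use str or object" option in that excerpt, the sketch below reads every column as text, so the coordinates are written back exactly as they appeared. It was not posted in the thread; the sample rows and the names `sample`, `columns`, `text_df` and `written` are made up for the example. This only suits a pure filter-and-rewrite script like the one in the question, since no arithmetic can be done on string columns.

```python
import io

import pandas as pd

# Made-up GFF-like rows, including one line without start/end values.
sample = (
    "chr1\tGnomon\tCDS\t111202\t111993\t.\t+\t0\tID=cds-A\n"
    "###\n"
    "chr1\tGnomon\tCDS\t200\t300\t.\t+\t0\tID=cds-B\n"
)
columns = ("seqname", "source", "feature", "start", "end",
           "score", "strand", "frame", "Names")

# dtype=str turns off numeric inference for all columns.
text_df = pd.read_csv(io.StringIO(sample), names=columns, sep="\t", dtype=str)

# Filter on the last field and write the subset as tab-separated text.
keep = text_df["Names"].str.contains("cds-A", na=False)
written = text_df[keep].to_csv(index=False, header=False, sep="\t")
print(written)
```

The trade-off against the nullable integer dtype is that start and end stay strings, so sorting or comparing coordinates would need an explicit conversion first.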


0


Yeah, it was guessing those columns as floats. Setting up dtype did work. Thank you Shred.

21 months ago by marcela.uliano ▴ 90