Find consecutive duplicate strings in rows from df
21 months ago
pramirez ▴ 10

I have a list of annotated protein sequences with their corresponding IDs. I am trying to write a function that detects consecutive duplicate entries in the first column (protein ID) and returns True or False. I tried this:

df = pd.read_csv('taxonomy.tsv', sep='\t', decimal='.')
value = df.iloc[:, 1].diff().lt(0)
print (value)

I obtain the following error:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Do you know how I can fix it?

Thank you.

python metagenomics pandas • 1.0k views
21 months ago
raphael.B ▴ 520
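ids = list(df.iloc[:, 1])  # same positional column as in the question
is_repeat = [False]  # the first row has no predecessor to compare with
for k in range(1, len(ids)):
    # True when this entry equals the one directly above it
    is_repeat.append(ids[k] == ids[k - 1])
print(is_repeat)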

This should do the trick.
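
The same comparison can probably be done without an explicit loop by comparing the column with a copy of itself shifted down one row. This is only a minimal sketch: the small DataFrame below is a made-up stand-in for taxonomy.tsv, and its column names are assumptions, not the real file layout.

```python
import pandas as pd

# Hypothetical stand-in for taxonomy.tsv; the real file and column names may differ
df = pd.DataFrame({
    "row": [0, 1, 2, 3, 4],
    "protein ID": ["P1", "P1", "P2", "P1", "P1"],
})

ids = df.iloc[:, 1]  # same positional column as in the question
# shift() moves every value down one row, so each ID is compared with the one above it;
# the first row is compared with a missing value and therefore comes out False
is_repeat = ids.eq(ids.shift())
print(is_repeat.tolist())
```

Unlike diff(), which subtracts neighbouring values and so fails on strings, eq() only tests equality, so it works on text columns.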


Hi! Thanks! I tried your method and obtained the following error: TypeError: '(slice(None, None, None), 1)' is an invalid key


Sorry, I forgot the iloc.

21 months ago
zorbax ▴ 610

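This returns the protein ID of every row whose ID appears more than once in the column. Note that it flags duplicates anywhere in the column, not only consecutive ones: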

df[df.duplicated(['protein ID'], keep=False)]['protein ID']
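
To make the difference from the consecutive check concrete, here is a small sketch on made-up data. The 'protein ID' column name follows the line above; the values are hypothetical.

```python
import pandas as pd

# Hypothetical IDs: P1 repeats back to back (rows 0-1) and again later (row 3)
df = pd.DataFrame({"protein ID": ["P1", "P1", "P2", "P1", "P3"]})

# Flags every row whose ID occurs more than once anywhere in the column
anywhere = df.duplicated(["protein ID"], keep=False)

# Flags only rows whose ID equals the ID in the row directly above
consecutive = df["protein ID"].eq(df["protein ID"].shift())

# Marks both members of each back-to-back pair, not just the second one
in_run = consecutive | consecutive.shift(-1, fill_value=False)

print(anywhere.tolist())
print(consecutive.tolist())
print(in_run.tolist())
```

Row 3 is flagged by duplicated() but not by the consecutive check, because the row above it holds a different ID.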