Question

Dataset cleaning through python

0

2.5 years ago

anasjamshed ▴ 120

I have a dataset in a TSV file that contains gene information. I first loaded it into a pandas DataFrame, and now I want to remove all missense mutations from the data using the 'mutation somatic status' column.

My code:

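import pandas as pd

# Read the large TSV in chunks of 1,000,000 rows
chunks = pd.read_csv("CosmicGenomeScreensMutantExport.tsv", chunksize=1000000, sep='\t')
dfList = []

for df in chunks:
    dfList.append(df)

# Combine all chunks into a single DataFrame
df = pd.concat(dfList, sort=False)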

After removing the missense mutations, I want to keep only those records that contain gene P23.

Can anyone help me with this?

python pandas • 1.0k views

updated 17 months ago by Ram 43k • written 2.5 years ago by anasjamshed ▴ 120

score 1 · Answer 1 · 2021-10-30

1

2.5 years ago

cpad0112 21k

If your df looks like the one in the posted images, try one of the two options below:

print(df[(df['Mutation Description'] != '<missense_SO_Term>') & (df['Gene name'] == '<Gene_symbol>')])
print(df.query('(`Mutation Description` != "<missense_SO_Term>") & (`Gene name` == "<Gene_symbol>")'))

Please replace <missense_SO_Term> with the text your data uses for missense mutations, and <Gene_symbol> with the name used for P23 in your data.
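Since the file is already being read in chunks, the same filter can be applied to each chunk before concatenating, so only the matching rows are kept in memory. This is a minimal sketch on a toy in-memory table, not the real COSMIC export: the column names, the label "Substitution - Missense", and the gene symbol "TP53" are assumptions for illustration and should be checked against your own file.

```python
import io

import pandas as pd

# Toy stand-in for the real TSV; column names and values are assumptions.
tsv = io.StringIO(
    "Gene name\tMutation Description\n"
    "TP53\tSubstitution - Missense\n"
    "TP53\tSubstitution - Nonsense\n"
    "KRAS\tSubstitution - Missense\n"
    "TP53\tDeletion - Frameshift\n"
)

MISSENSE = "Substitution - Missense"  # assumed label; check your data
GENE = "TP53"                         # assumed gene symbol; check your data

kept = []
# chunksize=2 only to exercise the loop on this tiny table
for chunk in pd.read_csv(tsv, sep="\t", chunksize=2):
    # Keep rows that are not missense AND belong to the gene of interest
    mask = (chunk["Mutation Description"] != MISSENSE) & (chunk["Gene name"] == GENE)
    kept.append(chunk[mask])

# Combine the filtered chunks into one DataFrame
result = pd.concat(kept, ignore_index=True)
print(result)
```

With a real file, the io.StringIO object would be replaced by the file path and chunksize set back to a large value such as 1000000.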

2.5 years ago by cpad0112 21k

0

I also want to remove N/A, null, and duplicated values from my dataset.

2.5 years ago by anasjamshed ▴ 120

0

This code: print(df[(df.'mutation somatic status' !="missense") & (df.'Gene name'=="TP53")]) is giving me a syntax error.

2.5 years ago by anasjamshed ▴ 120

0

I have updated the code with the appropriate column names. Please replace the SO term and gene name with the appropriate values, and also check the column names.
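On the syntax error: attribute access with a quoted name, as in df.'Gene name', is not valid Python, so column names containing spaces need bracket indexing. The sketch below shows the bracket form and also one way to handle the missing and duplicated values asked about earlier, using dropna and drop_duplicates. The toy data, column names, and labels are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are assumptions, not the real export.
df = pd.DataFrame({
    "Gene name": ["TP53", "TP53", "TP53", None, "KRAS"],
    "Mutation Description": [
        "Substitution - Nonsense",
        "Substitution - Nonsense",   # exact duplicate of the row above
        "Substitution - Missense",
        "Substitution - Nonsense",
        np.nan,
    ],
})

# Bracket indexing works for column names with spaces; df.'Gene name' does not.
subset = df[
    (df["Mutation Description"] != "Substitution - Missense")
    & (df["Gene name"] == "TP53")
]

# Drop rows with any missing value, then drop exact duplicate rows.
cleaned = df.dropna().drop_duplicates().reset_index(drop=True)

print(subset)
print(cleaned)
```

Note that dropna() with no arguments removes a row if any column is missing; its subset parameter restricts the check to chosen columns. If the file marks missing values with a placeholder that pandas does not recognise by default, the na_values argument of read_csv can be used to declare it.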

2.5 years ago by cpad0112 21k