Dataset cleaning with Python
2.5 years ago
anasjamshed ▴ 120

I have a dataset in a TSV file that contains gene information. I first loaded it into a pandas DataFrame, and now I want to remove all missense mutations from the data using the 'mutation somatic status' column.

My code:

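import pandas as pd

# Read the TSV in chunks of 1,000,000 rows
chunks = pd.read_csv("CosmicGenomeScreensMutantExport.tsv", chunksize=1000000, sep='\t')
dfList = []

for df in chunks:
    dfList.append(df)

# Combine all chunks into a single DataFrame
df = pd.concat(dfList, sort=False)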

After removing the missense mutations, I want to isolate only those records that contain gene P23.

Can anyone help me with this?

python pandas • 1.0k views
2.5 years ago

If your df matches the posted images, try one of the two options below:

print(df[(df['Mutation Description'] != '<missense_SO_Term>') & (df['Gene name'] == '<Gene_symbol>')])
print(df.query('(`Mutation Description` != "<missense_SO_Term>") & (`Gene name` == "<Gene_symbol>")'))

Please replace <missense_SO_Term> with appropriate text for missense mutations, and <Gene_symbol> with appropriate name for P23 in your data.
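
To find the exact text to substitute for the placeholders, it can help to list the distinct values in the column first. The following is only a minimal sketch on a small made-up DataFrame: the column names follow the answer above, but the values ('Substitution - Missense', 'TP53', and so on) are illustrative assumptions and may not match your COSMIC export.

```python
import pandas as pd

# Toy data standing in for the real export; the values are assumptions for illustration
df = pd.DataFrame({
    "Gene name": ["TP53", "TP53", "KRAS", "TP53"],
    "Mutation Description": [
        "Substitution - Missense",
        "Substitution - Nonsense",
        "Substitution - Missense",
        "Deletion - Frameshift",
    ],
})

# List the distinct labels so the exact missense term can be copied from the data
print(df["Mutation Description"].value_counts())

# Keep rows that are not missense and that belong to the gene of interest
is_missense = df["Mutation Description"].str.contains("missense", case=False, na=False)
filtered = df[~is_missense & (df["Gene name"] == "TP53")]
print(filtered)
```

The substring match with str.contains is a swapped-in alternative to the exact `!=` comparison above; it is more forgiving if the label carries extra text, but it would also match any other label that happens to contain "missense".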


I also want to remove NA, null, and duplicated values from my dataset.
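
One possible way to do this, sketched on made-up data (the values below are placeholders, not taken from the real file), is to combine dropna and drop_duplicates:

```python
import numpy as np
import pandas as pd

# Toy data with one duplicated row and two rows containing missing values
df = pd.DataFrame({
    "Gene name": ["TP53", "TP53", "KRAS", None],
    "Mutation Description": ["A", "A", np.nan, "B"],
})

# Drop rows that have a missing value in any column
cleaned = df.dropna()

# Drop rows that are exact duplicates of an earlier row
cleaned = cleaned.drop_duplicates()
print(cleaned)
```

If missing entries are stored as literal text such as 'N.A', pandas may not treat them as missing by default; in that case they could be declared on load through the na_values argument of pd.read_csv. To drop rows only when specific columns are missing, dropna also accepts a subset argument.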


This code: print(df[(df.'mutation somatic status' !="missense") & (df.'Gene name'=="TP53")]) is giving me a syntax error.
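
The syntax error most likely comes from df.'column name': attribute access with a dot cannot take a quoted string, so column names that contain spaces need bracket indexing instead. A minimal sketch on made-up data, reusing the column names and values from the comment above as assumptions:

```python
import pandas as pd

# Toy data; column names and values are assumptions copied from the comment above
df = pd.DataFrame({
    "Gene name": ["TP53", "KRAS", "TP53"],
    "mutation somatic status": ["missense", "missense", "other"],
})

# df.'Gene name' is invalid Python; index columns with df['...'] instead
result = df[(df["mutation somatic status"] != "missense") & (df["Gene name"] == "TP53")]
print(result)
```

The column names must match the header of the file exactly, including capitalization; printing df.columns is a quick way to check them.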


I have updated the code with appropriate column names. Please replace SO term and Gene name with appropriate values and also check the column names.
