Compare columns to find similarities and differences
3.9 years ago

I have the following large file, and I would like to check whether the variants reported for the same gene are identical.

If the gnomGene column names a gene (for example TP53), the code should check whether the COSMGene column names the same gene. When the two gene columns match, the two variant columns (gnomAD and COSMIC) should be compared. Identical and differing variants should then be written to separate columns (Same and Different).

My original file

**gnomAD     gnomGene     COSMGene  COSMIC  Variant_ID1 Variant_ID1**

   p.K38Q     TP53             TP53        p.K38R      rs_NO1  rs_NO6 

   p.L83I     TP53             TP53        p.L83P      rs_NO2  rs_NO7

   p.D86N     MAD2             MAD2        p.D86E       rs_NO3  rsNO8

   p.Y116N    MAD2             MAD2        p.Y116S      rs_NO4  rsNO9

   p.V117A    HARS             HARS        p.V117G      rs_NO5  rsNO10

Final file

**gnomAD  gnomGene  COSMGene    COSMIC  Variant_ID1 Variant_ID1  Same Different**

p.K38Q  TP53    TP53    p.K38R  rs_NO1  rs_NO6     p.K38Q

p.L83I  TP53    TP53    p.L83I  rs_NO2 rs_NO7  p.L83I

p.D86N  MAD2    MAD2    p.D86E  rs_NO3  rsNO8  p.D86N

p.Y116N  MAD2   MAD2    p.Y116N rs_NO4  rsNO9  p.Y116N

p.V117A  HARS   HARS    p.V117A rs_NO5   rsNO10  p.V117A

I tried to write some code, but it did not work as I hoped.

results [ ]
with open(in_file, 'r') as var_file:
    for line in ar_file:
        if var_file["gnomGene"] == var_file["COSMGene"]:
            if var_file["gnomAD "] == var_file["COSMIC"]:
                results[entry].append(line)

I have started to learn Python but still could not figure out how to get this done. Any help is highly appreciated.

snp genome sequence gene • 759 views

I don't fully understand your problem, but this looks like a job for awk. Something like:

awk -F '\t' '{printf("%s\t%s\n",$0,($2==$3 && $1==$4?"TRUE":"FALSE"));}' input.tsv
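For anyone without awk at hand, roughly the same check can be sketched in Python. This is only an illustration under some assumptions: the function name `flag_rows` and the in-memory sample are made up for the example, and the input is taken to be tab-separated with at least four columns in the order gnomAD, gnomGene, COSMGene, COSMIC.

```python
import csv
import io

def flag_rows(lines):
    """Yield each tab-separated row with a trailing TRUE/FALSE flag."""
    for row in csv.reader(lines, delimiter="\t"):
        # Columns are 0-indexed here: awk's $1..$4 are row[0]..row[3].
        same = len(row) >= 4 and row[1] == row[2] and row[0] == row[3]
        yield row + ["TRUE" if same else "FALSE"]

# Small in-memory sample standing in for input.tsv (two rows taken from the question).
sample = io.StringIO(
    "p.K38Q\tTP53\tTP53\tp.K38R\trs_NO1\trs_NO6\n"
    "p.L83I\tTP53\tTP53\tp.L83I\trs_NO2\trs_NO7\n"
)
flagged = list(flag_rows(sample))
for row in flagged:
    print("\t".join(row))
```

Like the awk one-liner, this flags every line, including a header line if one is present. To run it on a real file, pass an open file object instead of the sample.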


Dear Pierre,

Thanks for the feedback. Let me explain further. I have a file with 6 columns holding gene names, variants, and variant IDs. I want to compare the variant columns and find similarities and differences: first find the rows where the two gene columns name the same gene, then compare the variants within that gene. I'm new to Python and got confused when I tried to compare the columns. I don't actually have a Linux system, otherwise I would follow your suggestion. Thanks so much for your kind help.


Based on corrected input data:

$ awk -v OFS="\t" 'NR==1 {print $0,"Match","NoMatch"}; NR>1 {if($2==$3 && $1==$4) $7=$1; else $8=$1;print}' test.txt

or

$ awk -v OFS="\t" 'NR==1{print $0,"match","nomatch"};NR>1 {($2==$3 && $1==$4?$7=$1:$8=$1);print}' test.txt
gnomAD  gnomGene    COSMGene    COSMIC  Variant_ID1 Variant_ID1 Match   NoMatch
p.K38Q  TP53    TP53    p.K38R  rs_NO1  COSM1067639     p.K38Q
p.L83I  TP53    TP53    p.L83I  rs_NO2  COSM3854997 p.L83I
p.D86N  MAD2    MAD2    p.D86E  rs_NO3  rs1188531884        p.D86N
p.Y116N MAD2    MAD2    p.Y116N rs_NO4  COSM2687732 p.Y116N
p.V117A HARS    HARS    p.V117A rs_NO5  rs1481996636    p.V117A

The assumption is that the OP wants to compare columns 2 and 3 (the gene names) and, where they are identical, put variants that are the same in both sources into the match column and variants that differ into the nomatch column. In both cases the value written is the one from the gnomAD column.

If rows whose columns 2 and 3 do not match should be left out of the variant comparison entirely, use the following code instead:

awk -v OFS="\t" -F"\t" 'NR==1 {print $0,"Match","NoMatch"}; NR>1 {if($2==$3){if($1==$4) {$7=$1} else {$8=$1}} else {};print}' test.txt
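Since the OP mentioned working in Python without a Linux system, here is a rough Python sketch of the same nested logic. Treat it as an assumption-laden illustration rather than a drop-in solution: `add_match_columns` is a hypothetical helper name, the input is assumed to be tab-separated with a header line, and the third sample row is invented purely to show the case where the gene names differ.

```python
import csv
import io

def add_match_columns(reader, writer):
    """Copy rows from reader to writer, appending Match and NoMatch columns."""
    header = next(reader)
    writer.writerow(header + ["Match", "NoMatch"])
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or truncated lines
        gnomad, gnom_gene, cosm_gene, cosmic = row[:4]
        match, no_match = "", ""
        if gnom_gene == cosm_gene:      # only compare variants within the same gene
            if gnomad == cosmic:
                match = gnomad          # identical variant
            else:
                no_match = gnomad       # same gene, different variant
        writer.writerow(row + [match, no_match])

sample = (
    "gnomAD\tgnomGene\tCOSMGene\tCOSMIC\tVariant_ID1\tVariant_ID1\n"
    "p.K38Q\tTP53\tTP53\tp.K38R\trs_NO1\tCOSM1067639\n"
    "p.L83I\tTP53\tTP53\tp.L83I\trs_NO2\tCOSM3854997\n"
    "p.D86N\tMAD2\tHARS\tp.D86N\trs_NO3\trs_NO8\n"  # hypothetical row: genes differ
)
out = io.StringIO()
add_match_columns(
    csv.reader(io.StringIO(sample), delimiter="\t"),
    csv.writer(out, delimiter="\t", lineterminator="\n"),
)
print(out.getvalue(), end="")
```

One difference from the awk version: this sketch always writes both extra columns, leaving them empty when they do not apply. For a real file such as test.txt, open the input and output with `newline=""` and wrap them in `csv.reader` and `csv.writer` in the same way.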

@OP: Both your input and your output are incorrect. Please correct them.


Thank you very much for your help. Sorry, I am new to the site and this was my first time posting input data. I have corrected the input.


No problem. I was there too. Please close the thread if it addressed the issue.
