Can someone help with this?
This question was posted on Stack Overflow, but I didn't get any answers, so I am bringing it to Biostars. Posts that don't receive any answer will be deleted or updated as required.
I want to merge tables that were generated from a VCF file. The merge is context dependent. I am at a beginner level in Python programming; I am able to do some data extraction and use some conditional statements. I am looking for a solution specifically in Python. Note: I have looked into pandas and SciPy, but I am not able to apply the conditional statement as I desire. This problem really needs input from some experts here. Thanks much in advance!
Problem description: I have two separate text files which need to be read first. Each file has several lines of data in 7 columns. The context-dependent merging involves reading the values from the first two columns of each text file and merging the lines if both values are the same.
A few lines from text01.txt
contig  pos       id  ref_al-My    alt-al-My  ref-freq-My  alt-freq-My
2       15801571  .   G            A          0.667        0.333
2       15801604  .   CAAAAACAAAA  C          0.583        0.417
2       15801610  .   C            CA,CAAA    0.5          0.25,0.25
2       15803330  .   C            T          0.333        0.667
2       15803398  .   G            A          0.667        0.333
2       15803529  .   ATGC         A          0.667        0.333
Similarly some lines from text_02.txt:
contig  pos       id  ref_al-Sp  alt-al-Sp  ref-freq-Sp  alt-freq-Sp
2       15801610  .   CAAAAA     C          0.0          1.0
2       15801618  .   A          G          0.0          1.0
2       15802052  .   C          T          0.1          0.9
2       15803398  .   A          G          0.9          0.1
2       15803477  .   G          A          0.1          0.9
2       15803542  .   A          C          0.1          0.9
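A minimal sketch of the reading step, assuming both files are whitespace-delimited with a single header line and that each (contig, pos) pair occurs at most once per file. The helper name `read_table` is hypothetical, and the in-memory sample stands in for a real file.

```python
import io

def read_table(handle):
    """Return (header, rows), where rows maps (contig, pos) -> {column: value}."""
    header = handle.readline().split()
    rows = {}
    for line in handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        record = dict(zip(header, fields))
        rows[(record["contig"], record["pos"])] = record
    return header, rows

# In-memory stand-in for the first file; with a real file, pass open(path).
sample = io.StringIO(
    "contig pos id ref_al-My alt-al-My ref-freq-My alt-freq-My\n"
    "2 15801571 . G A 0.667 0.333\n"
    "2 15801610 . C CA,CAAA 0.5 0.25,0.25\n"
)
header, rows = read_table(sample)
print(rows[("2", "15801610")]["alt-al-My"])
```

All values are kept as strings, so multi-allelic fields such as `CA,CAAA` and `0.25,0.25` pass through untouched.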
Context dependent merging:
Both text files have 7 columns, of which the first three (contig, pos, id) have the same names.
The context-dependent merging involves reading the values in the first two columns (contig and pos) from both text files.
If both the contig and pos values match, the lines are merged, with new columns added, and written to output_text.txt.
E.g., in the given text files, two lines have matching contig and pos values:
contig  pos       id  ref_al-My  alt-al-My  ref-freq-My  alt-freq-My
2       15801610  .   C          CA,CAAA    0.5          0.25,0.25
2       15803398  .   G          A          0.667        0.333

contig  pos       id  ref_al-Sp  alt-al-Sp  ref-freq-Sp  alt-freq-Sp
2       15801610  .   CAAAAA     C          0.0          1.0
2       15803398  .   A          G          0.9          0.1
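The matching step can be sketched as a set intersection on (contig, pos) keys. The keys below are copied from the sample lines of the two files; the variable names are illustrative only.

```python
# (contig, pos) keys taken from the sample lines of each file.
my_keys = {("2", "15801571"), ("2", "15801604"), ("2", "15801610"),
           ("2", "15803330"), ("2", "15803398"), ("2", "15803529")}
sp_keys = {("2", "15801610"), ("2", "15801618"), ("2", "15802052"),
           ("2", "15803398"), ("2", "15803477"), ("2", "15803542")}

# Keys present in both files mark the lines to merge; sort positions numerically.
shared = sorted(my_keys & sp_keys, key=lambda k: (k[0], int(k[1])))

# Keys present in only one file mark the lines that pass through unmerged.
only_my = my_keys - sp_keys
only_sp = sp_keys - my_keys

print(shared)
```

Keeping pos as a string in the key avoids any reformatting on output; it is converted to `int` only for sorting.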
So we append several columns and add one new column, all_ref, in which the two reference alleles are joined with a comma:
all_ref = ref_al-My + "," + ref_al-Sp
Now, output_text.txt should contain the following data:
contig  pos       id  all_ref   alt-al-My  alt-al-Sp  ref-freq-My  ref-freq-Sp  alt-freq-My  alt-freq-Sp
2       15801610  .   C,CAAAAA  CA,CAAA    C          0.5          0.0          0.25,0.25    1.0
2       15803398  .   G,A       A          G          0.667        0.9          0.333        0.1
For the other lines, we simply place their values in the respective columns, with null fields written as periods:
contig  pos       id  all_ref  alt-al-My  alt-al-Sp  ref-freq-My  ref-freq-Sp  alt-freq-My  alt-freq-Sp
2       15801571  .   G        A          .          0.667        .            0.333        .
2       15803477  .   G        .          A          .            0.1          .            0.9
- The data values for My should come before those for Sp in the newly added column (all_ref).
I understand this is a long question, but any input is appreciated.
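One possible way to put the whole merge together, using only the standard library. This is a sketch under several assumptions: the files are whitespace-delimited, contig and pos are the first two columns, each (contig, pos) pair is unique within a file, merged and unmerged lines go into one output sorted by position, and the output is tab-separated. The names `read_table`, `merge_tables`, `MISSING` and `OUT_HEADER` are hypothetical.

```python
import io

MISSING = "."
OUT_HEADER = ["contig", "pos", "id", "all_ref",
              "alt-al-My", "alt-al-Sp",
              "ref-freq-My", "ref-freq-Sp",
              "alt-freq-My", "alt-freq-Sp"]

def read_table(handle):
    """Map (contig, pos) -> {column: value} for a whitespace-delimited table."""
    header = handle.readline().split()
    rows = {}
    for line in handle:
        fields = line.split()
        if fields:
            rows[(fields[0], fields[1])] = dict(zip(header, fields))
    return rows

def merge_tables(my_rows, sp_rows):
    """Return output lines (header first) for the union of both tables."""
    keys = sorted(set(my_rows) | set(sp_rows), key=lambda k: (k[0], int(k[1])))
    lines = ["\t".join(OUT_HEADER)]
    for key in keys:
        my = my_rows.get(key, {})
        sp = sp_rows.get(key, {})
        # My values win for the shared columns (contig, pos, id).
        record = {**sp, **my}
        # My reference allele first, then Sp; an absent side is skipped.
        refs = [row[col] for row, col in ((my, "ref_al-My"), (sp, "ref_al-Sp"))
                if col in row]
        record["all_ref"] = ",".join(refs)
        # Columns missing from this record are written as periods.
        lines.append("\t".join(record.get(col, MISSING) for col in OUT_HEADER))
    return lines

# In-memory stand-ins for the two files, using a few of the sample lines.
my_rows = read_table(io.StringIO(
    "contig pos id ref_al-My alt-al-My ref-freq-My alt-freq-My\n"
    "2 15801571 . G A 0.667 0.333\n"
    "2 15801610 . C CA,CAAA 0.5 0.25,0.25\n"
    "2 15803398 . G A 0.667 0.333\n"
))
sp_rows = read_table(io.StringIO(
    "contig pos id ref_al-Sp alt-al-Sp ref-freq-Sp alt-freq-Sp\n"
    "2 15801610 . CAAAAA C 0.0 1.0\n"
    "2 15803398 . A G 0.9 0.1\n"
    "2 15803477 . G A 0.1 0.9\n"
))
lines = merge_tables(my_rows, sp_rows)
print("\n".join(lines))
```

With real files, the two `io.StringIO(...)` objects would be replaced by opened file handles, and the lines could be written out with `open("output_text.txt", "w")`. Contigs are sorted as strings here, which may not give the desired order for names like `10` versus `2`.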
Thanks, - K