Question: Compare two columns of same file as paired rows
0
gravatar for waqasnayab
14 days ago by
waqasnayab120
Pakistan
waqasnayab120 wrote:

Hi,

Dear Community,

I have a file like:

CHR POS AACHANGE    EFFECT
chr1    11779914    R88W    _
chr1    11779915    R88Q    _
chr1    19694500    R478    H
chr1    19694501    R478H   H
chr1    23875429    Q63     R
chr1    23875430    Q63R    R
chr1    84574386    R10     L
chr1    84574387    R10L    L
chr1    89186388    E551G   R
chr1    89186389    E551K   R
chr19   57506385    A97E    E
chr19   57506386    A97     E

As you can see, each row is paired on the basis of its chromosomal position. First and second entry (11779914 and 11779915), third and fourth entry (19694500 and 19694501) and so on...,,,,!!!! I want to compare column three and four. If any of the entry in column four matches with any of the entry in column in three, than remove both paired lines. So, I want output something like this:

    CHR POS AACHANGE    EFFECT
    chr1    11779914    R88W    _
    chr1    11779915    R88Q    _
    chr1    89186388    E551G   R
    chr1    89186389    E551K   R

Since entry three and four of column four has H that matched with the column three (entry four of column three - R478H), both lines are removed (entry three and four).

Any command line solution, or python etc is appreciated. Mostly comparing columns of two different files are available but of the same file paired rows, I failed to find my solution.

Thanks,

Waqas.

snp next-gen genome • 152 views
ADD COMMENTlink modified 14 days ago by Kevin Blighe1.1k • written 14 days ago by waqasnayab120
3
gravatar for Kevin Blighe
14 days ago by
Kevin Blighe1.1k
Republic of Ireland
Kevin Blighe1.1k wrote:

I assume that the match between column 4 and 3 is only a partial (and not exact) match?, i.e., you are searching for, in column 3, the amino acid residue defined in column 4?

This looks like a reasonably easy job in R Programming Language but here are 2 BASH commands that will also work. It is made a bit more difficult by the fact that the matching pairs are not always in the same order (e.g., sometimes the match between 3 and 4 is in the second pair, whilst in others it is the first in the pair).

awk 'index($3, $4) {print}' test.tsv | cut -f3 | sed s'/.$//' > matches.list

This first records the ones that do match and saves the value in column 3 (less the final matching amino acid residue).

awk '!index($3, $4) { print }' test.tsv | grep -v -f matches.list

This then searches for the ones that do not match, and then also eliminates any value recorded in matches.list

The assumptions that I make: - your data is tab-delimited and is contained in test.tsv - the matching amino acid residue is always on the right-hand-side of the amino-acid change defined in column 3

Kevin

ADD COMMENTlink written 14 days ago by Kevin Blighe1.1k

Hi Kevin,

You exactly understand the situation. I tried with the dummy file and then with my original file. I got exactly the output which I want. I cross-checked manually as well.

Great Thanks,

Waqas.

ADD REPLYlink written 14 days ago by waqasnayab120

Great - happy to help Waqas.

Kevin

ADD REPLYlink written 14 days ago by Kevin Blighe1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1387 users visited in the last hour