Question: Compare specific parts of two columns in a text file in Linux
bobbyle0210 wrote (7 months ago):

I have a text file with several columns separated by tab characters, as below:

1    ATGCCCAGA  AS:i:10   XS:i:10  
2    ATGCTTGA   AS:i:10   XS:i:5  
3    ATGGGGGA   AS:i:10   XS:i:1  
4    ATCCCCGA   AS:i:20   XS:i:20

I now want to compare the last two columns, AS:i:(n1) and XS:i:(n2), and keep only the lines where n1 differs from n2. So, my desired output would be:

2    ATGCTTGA   AS:i:10   XS:i:5  
3    ATGGGGGA   AS:i:10   XS:i:1

Could you suggest some ways to compare n1 and n2 and print the output? Thanks in advance.


AS:i:(n1) and XS:i:(n2) to obtain only lines with identical n1 and n2.

In your "desired output", n1 is not identical to n2.

Anyway, you're looking for awk. Look at https://www.unixtutorial.org/awk-delimiter and https://www.unix.com/shell-programming-and-scripting/274247-how-compare-two-column-using-awk.html

Reply written 7 months ago by Pierre Lindenbaum
tim.booth wrote (7 months ago):

Hi,

The last answer from @cpad0112 looks right to me, in that it directly answers your question and extracts and compares n1 and n2 from the third and fourth columns.

However I note that the example file looks rather like a simplified version of SAM data, and you did ask for alternative approaches. If your original source of data really is a SAM/BAM file then a more robust approach is to use htslib to parse the whole file. In Python, the pysam library gives access to htslib as documented here:

https://pysam.readthedocs.io/en/latest/api.html

In the fourth example on that page, under the heading "You can also write to a AlignmentFile" there is a prototypical filter script. In the example the test is on read.is_paired but you could instead test on read.get_tag('AS') != read.get_tag('XS'). Other command-line tools like 'bamtools' and 'samtools' have various filter options but I'm not aware of any that can compare two tags.
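
As a rough illustration of that suggestion, the filter could look something like the sketch below. It assumes pysam is installed and that the input is a BAM file; the function names `tags_differ` and `filter_alignments` and the file paths are made up for this example and are not part of pysam. Reads that lack either tag are dropped here, which is an assumption (`get_tag` raises `KeyError` for a missing tag, hence the `has_tag` check); you may prefer to keep such reads.

```python
def tags_differ(read, tag_a="AS", tag_b="XS"):
    """Return True when both tags are present on the read and their values differ."""
    # Assumption: a read missing either tag is treated as "not different".
    if not (read.has_tag(tag_a) and read.has_tag(tag_b)):
        return False
    return read.get_tag(tag_a) != read.get_tag(tag_b)


def filter_alignments(in_path, out_path):
    """Copy reads whose AS and XS tags differ from in_path to out_path (BAM)."""
    import pysam  # third-party; imported here so tags_differ works without it

    with pysam.AlignmentFile(in_path, "rb") as infile, \
            pysam.AlignmentFile(out_path, "wb", template=infile) as outfile:
        for read in infile:
            if tags_differ(read):
                outfile.write(read)


# Hypothetical usage:
# filter_alignments("input.bam", "filtered.bam")
```

Keeping the comparison in its own small function makes it easy to change the rule (for example, to keep reads without an XS tag) without touching the file handling.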


This is probably the way to go if it is an alignment file. @bobbyle0210, please use a dedicated tool for such operations.

Reply written 7 months ago by cpad0112
cpad0112 wrote (7 months ago):

@bobbyle0210 Code that works with the example data:

$ awk 'a[$3]++' file.txt 
2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1

A variation of this would be:

$ cat file.txt 
1   ATGCCCAGA   AS:i:10 XS:i:10 
2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1  
4   ATCCCCGA    AS:i:20 XS:i:20
5   CTGATCGAT   AS:i:10 XS:i:10

$ awk '!a[$4]++ && a[$3]++' file.txt 
2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1

Hi, thank you for your help. Could you explain in detail how the command works? I am not very familiar with Linux commands. Thank you :)

Reply written 7 months ago by bobbyle0210

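The first command prints every row whose column 3 value has already appeared in an earlier row: a[$3]++ evaluates to 0 (false) the first time a value is seen and to a non-zero count (true) after that.

The second command prints every row whose column 4 value appears for the first time and whose column 3 value has appeared before.

Neither command compares column 3 with column 4 within a row, so both rely on the layout of the example data.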
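
To make the counting behaviour of the first command concrete, here is a rough Python emulation of awk 'a[$3]++'. It is only a sketch for illustration: the function name `rows_with_repeated_key` is made up, and rows with too few columns are simply skipped, which is a simplification of what awk does.

```python
from collections import defaultdict


def rows_with_repeated_key(lines, column=3):
    """Yield rows whose value in `column` was already seen in an earlier row."""
    seen = defaultdict(int)           # plays the role of the awk array `a`
    for line in lines:
        fields = line.split()         # like awk: split on runs of whitespace
        if len(fields) < column:
            continue                  # simplification: skip short rows
        key = fields[column - 1]
        count_before = seen[key]      # value of a[$3] before the post-increment
        seen[key] += 1
        if count_before:              # non-zero means "true", so the row is printed
            yield line


rows = [
    "1\tATGCCCAGA\tAS:i:10\tXS:i:10",
    "2\tATGCTTGA\tAS:i:10\tXS:i:5",
    "3\tATGGGGGA\tAS:i:10\tXS:i:1",
    "4\tATCCCCGA\tAS:i:20\tXS:i:20",
]
for row in rows_with_repeated_key(rows):
    print(row)                        # prints rows 2 and 3
```

Rows 2 and 3 are printed only because AS:i:10 already occurred in row 1, not because their XS values differ from their AS values.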

If the AS:i: and XS:i: prefixes are fixed, you can skip those five characters and compare the rest of column 3 with the rest of column 4. This prints the rows where the two values differ:

$ awk 'substr($3, 6) != substr($4, 6)' file.txt 

2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1

To print the rows where the value in column 3 equals the value in column 4 instead:

$ awk 'substr($3, 6) == substr($4, 6)' file.txt 

1   ATGCCCAGA   AS:i:10 XS:i:10 
4   ATCCCCGA    AS:i:20 XS:i:20

If the AS and XS prefixes are not fixed, but each of these columns consists of exactly three parts separated by ":" with the value last, you can also use:

$ awk '{split ($3,a,":"); split ($4,b,":"); if (a[3]!=b[3]) print}' file.txt 

2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1
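
For anyone more comfortable with Python, the same split-on-colon idea can be sketched roughly as below. The function names `tag_value` and `rows_with_different_values` are invented for this example, and converting the value to an integer assumes the tags are of type i, as in the example data.

```python
def tag_value(field):
    """Return the integer after the last colon of a field such as 'AS:i:10'."""
    # Assumption: the value is an integer (tag type "i").
    return int(field.strip().rsplit(":", 1)[1])


def rows_with_different_values(lines, col_a=3, col_b=4):
    """Return the rows whose values in columns col_a and col_b (1-based) differ."""
    kept = []
    for line in lines:
        row = line.rstrip("\n")
        fields = row.split("\t")
        if len(fields) < max(col_a, col_b):
            continue                  # skip blank or incomplete rows
        if tag_value(fields[col_a - 1]) != tag_value(fields[col_b - 1]):
            kept.append(row)
    return kept


with_example = [
    "1\tATGCCCAGA\tAS:i:10\tXS:i:10  ",
    "2\tATGCTTGA\tAS:i:10\tXS:i:5  ",
    "3\tATGGGGGA\tAS:i:10\tXS:i:1  ",
    "4\tATCCCCGA\tAS:i:20\tXS:i:20",
]
for row in rows_with_different_values(with_example):
    print(row)
```

Comparing integers rather than strings means values of different lengths, such as 5 and 10, are handled without any assumptions about field width.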
Reply written 7 months ago by cpad0112
bioinformatics2020 wrote (7 months ago):

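# Keep only the rows whose AS and XS fields (columns 3 and 4) differ.
with open("file.txt") as infile, open("output.txt", "w") as outfile:
    for line in infile:
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        # [3:] drops the two-letter tag and the first colon, leaving e.g. "i:10"
        if fields[2][3:] != fields[3][3:]:
            outfile.write("\t".join(fields) + "\n")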

Quick and dirty Python solution. Note that this compares i:n1 with i:n2. If you want to leave out the i: and compare only the numbers, change both [3:] slices in the for loop to [5:].
