Question

awk field with different separator

1

8.2 years ago

sacha ★ 2.4k

Hi,

Could you provide the fastest way to filter this data:

chr1    43  1000    gene_name=boby  gene_type=trucA
chr2    44  1000    gene_name=natt  gene_type=trucB  
chr3    45  1000    gene_name=alurika   gene_type=trucC

To :

chr1  43  1000 boby  trucA
chr1  44  1000 natt  trucB
chr1  45  1000  alurika  trucC

CORRECTION: The original data looks like this:

chr1    43  1000    TEST   gene_name=boby;gene_type=trucA;foo=34
chr2    44  1000    TRUC  gene_name=natt;gene_type=trucB;foo=34  
chr3    45  1000    PASS  gene_name=alurika;gene_type=trucC;foo=34

awk oneliner • 2.4k views

ADD COMMENT • link updated 8.0 years ago by Biostar 20 • written 8.2 years ago by sacha ★ 2.4k

1

What did you try so far? Is this really a bioinformatics question? What is the logic between input and output change? It looks rather random to me, e.g. "..trucB chr3 45 1000.." becomes "..trucB chr1 45 1000.."?

ADD REPLY • link 8.2 years ago by 5heikki 11k

0

I just answered it, but I agree with you. Regarding chr3 -> chr1, I think it's just a typo!

ADD REPLY • link 8.2 years ago by Sukhi Singh 11k

3

8.2 years ago

GenoMax 141k

Try

$ sed -e 's/;/\ /g' your_file | sed -e 's/=/\ /g' | awk -F " " '{print $1"\t"$2"\t"$3"\t"$6"\t"$8}'

ADD COMMENT • link 8.2 years ago by GenoMax 141k

2

Removing redundancy

$ sed -e 's/;/\ /g' -e 's/=/\ /g' your_file | awk -F " " '{print $1"\t"$2"\t"$3"\t"$6"\t"$8}'

ADD REPLY • link 8.2 years ago by GenoMax 141k

0

Perfect, I was about to write this. But in any case the OP should think about why such formatting is required. I believe this is a VCF file listing variants with their positions; in that case, pre-filtering should be done to keep only the rows whose filter column contains the string PASS. It should then be:

grep "PASS" file.txt | sed -e 's/;/\ /g' -e 's/=/\ /g' | awk -F " " '{print $1"\t"$2"\t"$3"\t"$4"\t"$6"\t"$8}' > file_flt.txt

Otherwise @genomax2 is correct about what you need.
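
One caveat with `grep "PASS"` is that it matches the string anywhere on the line, not only in the fourth column. Below is a minimal sketch that does the filter and the reformatting in a single awk call. It assumes the corrected layout from the question (whitespace-separated columns, attributes joined by `;` in column 5, with gene_name first and gene_type second). The function name `filter_pass` is made up for illustration.

```shell
# Hypothetical helper: keep rows whose 4th column is exactly PASS, then print
# chr, position, value, filter, gene_name value, gene_type value (tab-separated).
# Assumes column 5 starts with "gene_name=...;gene_type=...".
filter_pass() {
    awk 'BEGIN { OFS = "\t" }
         $4 == "PASS" {
             split($5, attr, ";")          # attr[1]="gene_name=...", attr[2]="gene_type=..."
             sub(/^[^=]*=/, "", attr[1])   # drop everything up to and including the first "="
             sub(/^[^=]*=/, "", attr[2])
             print $1, $2, $3, $4, attr[1], attr[2]
         }' "$@"
}

# Usage: filter_pass file.txt > file_flt.txt
```

Comparing `$4` directly avoids keeping a row just because some other column happens to contain the text PASS.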

ADD REPLY • link 8.2 years ago by ivivek_ngs ★ 5.2k

0

Can I just say, I love the new syntax highlighting going on here :D

(particularly how Istvan has coloured the popular bioinformatics program names red. Very nice touch)

ADD REPLY • link 8.2 years ago by John 13k

0

8.2 years ago

sacha ★ 2.4k

OK, but actually my example was not complete. There is more data:

chr1    43  1000    TEST   gene_name=boby;gene_type=trucA;foo=34
chr2    44  1000    TRUC  gene_name=natt;gene_type=trucB;foo=34  
chr3    45  1000    PASS  gene_name=alurika;gene_type=trucC;foo=34

How do I get:

chr1  43  1000 boby  trucA
chr1  44  1000 natt  trucB
chr1  45  1000  alurika  trucC

ADD COMMENT • link 8.2 years ago by sacha ★ 2.4k

1

You should really take this to heart: when you don't fully know how your data is formatted, never use a regex. Regexes are great for grabbing things. They are a really bad idea for data manipulation (not to mention they're also slow).
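
One way to act on this, sketched under assumptions: look the attributes up by key name instead of by position, using plain string functions (`split`, `index`, `substr`) and no regex substitution. The sketch assumes column 5 holds `key=value` pairs joined by `;`, as in the corrected example. The function name `extract_attrs` and the `.` placeholder for a missing key are choices made for illustration, not something from the thread.

```shell
# Hypothetical helper: print chr, position, value, then the values of the
# attributes listed in `keys`, looked up by name rather than by position.
extract_attrs() {
    awk -v keys="gene_name,gene_type" '
        BEGIN { OFS = "\t"; nk = split(keys, want, ",") }
        {
            split("", val)                    # clear attributes from the previous line
            n = split($5, pair, ";")          # pair[i] looks like "key=value"
            for (i = 1; i <= n; i++) {
                eq = index(pair[i], "=")      # position of the first "="
                if (eq > 0)
                    val[substr(pair[i], 1, eq - 1)] = substr(pair[i], eq + 1)
            }
            line = $1 OFS $2 OFS $3
            for (j = 1; j <= nk; j++)
                line = line OFS ((want[j] in val) ? val[want[j]] : ".")  # "." marks a missing key
            print line
        }' "$@"
}

# Usage: extract_attrs file.txt
```

Because the lookup is by name, the output stays correct if the attributes appear in a different order or if extra ones such as `foo=34` are added.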

ADD REPLY • link 8.2 years ago by John 13k

0

Yes, I understand the feeling, so I updated my comment with my understanding, asked the OP what they are ideally looking for, and modified the command line.

ADD REPLY • link 8.2 years ago by ivivek_ngs ★ 5.2k

0

With this?

sed 's/=/\t/g' file.txt | sed 's/;/\t/g' | awk '{print $1,$2,$3,$6,$8}'

One clarification: your input file has three chromosomes, but the output file has only one (chr1). Is that what you really need?

ADD REPLY • link 8.2 years ago by venu 7.1k

0

I believe it is a typo, since it does not make sense to change everything to chr1.

ADD REPLY • link 8.2 years ago by ivivek_ngs ★ 5.2k

score 5 · Accepted Answer · 2016-03-01

5

8.2 years ago

Sukhi Singh 11k

sed -e 's/gene_name=//g' -e 's/gene_type=//g' file > file2

ADD COMMENT • link 8.2 years ago by Sukhi Singh 11k

score 3 · Accepted Answer · 2016-03-01

3

8.2 years ago

sacha ★ 2.4k

Thanks for your reply !

But I just discovered that awk supports a regexp as the field separator. So this works too:

cat test | awk 'BEGIN{FS="\t|="} {print $1,$2,$3,$5,$7}'
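
For the corrected data, where the attributes are joined by `;` and a filter column sits in position 4, the same idea can be sketched with a wider separator. This assumes columns are separated by runs of spaces or tabs and that gene_name and gene_type are the first two attributes. The function name `split_fields` is made up for illustration.

```shell
# Hypothetical helper: split on runs of blanks, "=" or ";" in one pass.
# Fields become: chr, pos, value, filter, "gene_name", name, "gene_type", type, ...
split_fields() {
    awk 'BEGIN { FS = "[ \t]+|[=;]"; OFS = "\t" }
         { print $1, $2, $3, $6, $8 }' "$@"
}

# Usage: split_fields test
```

Trailing whitespace on a line only produces an extra empty field at the end, so it does not shift the fields printed here.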

ADD COMMENT • link 7.5 years ago by sacha ★ 2.4k

4

Double quotes just left hanging. That makes me sad.

ADD REPLY • link 8.2 years ago by John 13k

3

Let's make John happy :D

ADD REPLY • link 8.2 years ago by Sukhi Singh 11k

1

I think you still do not get the desired output with the awk command you are showing, but with sed you do get the desired output as given in your original question. And yes, the double quotes are not closed.

ADD REPLY • link 8.2 years ago by ivivek_ngs ★ 5.2k

0

Sure, this would take care of any pattern after a tab and before "=". My answer is valid if you only want to replace these two strings.

ADD REPLY • link 8.2 years ago by Sukhi Singh 11k