Question: awk field with different separators
sacha wrote:

Hi,

Could you suggest the fastest way to filter this data:

chr1    43  1000    gene_name=boby  gene_type=trucA
chr2    44  1000    gene_name=natt  gene_type=trucB  
chr3    45  1000    gene_name=alurika   gene_type=trucC

To:

chr1  43  1000 boby  trucA
chr1  44  1000 natt  trucB
chr1  45  1000  alurika  trucC

CORRECTION: The original data looks like this:

chr1    43  1000    TEST   gene_name=boby;gene_type=trucA;foo=34
chr2    44  1000    TRUC  gene_name=natt;gene_type=trucB;foo=34  
chr3    45  1000    PASS  gene_name=alurika;gene_type=trucC;foo=34
modified 2.6 years ago by Biostar • written 2.7 years ago by sacha

What did you try so far? Is this really a bioinformatics question? What is the logic between input and output change? It looks rather random to me, e.g. "..trucB chr3 45 1000.." becomes "..trucB chr1 45 1000.."?

modified 2.7 years ago • written 2.7 years ago by 5heikki

I just answered it, but I agree with you. Regarding chr3 -> chr1, I think it's just a typo!

written 2.7 years ago by Sukhdeep Singh

Sukhdeep Singh wrote:
sed -e 's/gene_name=//g' -e 's/gene_type=//g' file > file2
written 2.7 years ago by Sukhdeep Singh

sacha wrote:

Thanks for your reply!

But I just discovered that awk supports regular expressions for the field separator. So this works too:

cat test | awk 'BEGIN{FS="\t|="} {print $1,$2,$3,$5,$7}'
modified 2.1 years ago • written 2.7 years ago by sacha
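
As a minimal sketch of the regex field separator idea applied to the corrected data, the bracket expression below splits on tab, "=", or ";" in one pass. It assumes the input is tab-delimited and that gene_name and gene_type are always the first two attributes in column 5; the function name extract_fields is hypothetical.

```shell
# Hypothetical helper: split each line on tab, "=", or ";" in one pass.
# Assumes tab-delimited input with gene_name and gene_type first in column 5.
extract_fields() {
  awk 'BEGIN { FS = "[\t=;]"; OFS = "\t" } { print $1, $2, $3, $6, $8 }' "$@"
}

# Example: one row of the corrected data, read from stdin.
printf 'chr1\t43\t1000\tTEST\tgene_name=boby;gene_type=trucA;foo=34\n' | extract_fields
```

With this separator the two values land in fields 6 and 8, so the extra TEST/TRUC/PASS column is skipped without a separate sed step.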

Double quotes just left hanging. That makes me sad.

written 2.7 years ago by John

Let's make John happy :D

written 2.7 years ago by Sukhdeep Singh

I think you still do not get the desired output with the awk command you are showing, but with sed you do get the output you asked for in your original question. And yes, the double quotes are not closed.

written 2.7 years ago by vchris_ngs

Sure, this would take care of any pattern after a tab and before "="; my answer is valid if you only want to replace these two strings.

written 2.7 years ago by Sukhdeep Singh

genomax wrote:

Try

$ sed -e 's/;/\ /g' your_file | sed -e 's/=/\ /g' | awk -F " " '{print $1"\t"$2"\t"$3"\t"$6"\t"$8}'
written 2.7 years ago by genomax

Removing redundancy

$ sed -e 's/;/\ /g' -e 's/=/\ /g' your_file | awk -F " " '{print $1"\t"$2"\t"$3"\t"$6"\t"$8}'
written 2.7 years ago by genomax

Perfect, I was about to write this. But in any case the OP should think about why such formatting is required. I believe this is a VCF file showing variant names with positions; in that case, pre-filtering should be done to keep only the rows that have the string PASS in that column. In that case it should be:

grep "PASS" file.txt | sed -e 's/;/\ /g' -e 's/=/\ /g' | awk -F " " '{print $1"\t"$2"\t"$3"\t"$4"\t"$6"\t"$8}' > file_flt.txt

Otherwise @genomax2 is correct about what you need.

modified 2.7 years ago • written 2.7 years ago by vchris_ngs
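
Note that grep "PASS" matches the string anywhere on the line, not only in the fourth column. A hedged alternative is to test the column directly in awk. The sketch below assumes tab-delimited input with the filter value in column 4 and gene_name and gene_type as the first two attributes of column 5; filter_pass is a hypothetical name.

```shell
# Hypothetical helper: keep rows whose 4th tab-separated column is exactly PASS,
# then print columns 1-4 followed by the gene_name and gene_type values.
filter_pass() {
  awk 'BEGIN { FS = OFS = "\t" }
       $4 == "PASS" {
         # Split column 5 on "=" or ";": kv[2] and kv[4] hold the first two values.
         split($5, kv, "[=;]")
         print $1, $2, $3, $4, kv[2], kv[4]
       }' "$@"
}

# Example usage with a hypothetical file name:
#   filter_pass file.txt > file_flt.txt
```

Comparing the column with == avoids keeping a row just because a gene name or another attribute happens to contain the letters PASS.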

Can I just say, I love the new syntax highlighting going on here :D

(Particularly how Istavan has coloured the popular bioinformatics program names red. Very nice touch.)

modified 2.7 years ago • written 2.7 years ago by John

sacha wrote:

OK, but actually my example was not complete... There is more data:

chr1    43  1000    TEST   gene_name=boby;gene_type=trucA;foo=34
chr2    44  1000    TRUC  gene_name=natt;gene_type=trucB;foo=34  
chr3    45  1000    PASS  gene_name=alurika;gene_type=trucC;foo=34

How do I get:

chr1  43  1000 boby  trucA
chr1  44  1000 natt  trucB
chr1  45  1000  alurika  trucC
modified 2.7 years ago • written 2.7 years ago by sacha
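
One possible way to get this kind of output without depending on the order of the attributes is to look them up by key. The sketch below assumes tab-delimited input with the key=value pairs in column 5 separated by ";"; the function name extract_by_key is hypothetical. It prints the chromosome exactly as found in the input.

```shell
# Hypothetical helper: find gene_name and gene_type by key in column 5,
# so the result does not depend on the position of the attributes.
extract_by_key() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         gname = gtype = ""
         n = split($5, pairs, ";")
         for (i = 1; i <= n; i++) {
           eq = index(pairs[i], "=")          # position of the first "="
           if (eq == 0) continue              # skip anything that is not key=value
           key = substr(pairs[i], 1, eq - 1)
           val = substr(pairs[i], eq + 1)
           if (key == "gene_name") gname = val
           else if (key == "gene_type") gtype = val
         }
         print $1, $2, $3, gname, gtype
       }' "$@"
}

# Example: the attributes are deliberately reordered in this row.
printf 'chr2\t44\t1000\tTRUC\tfoo=34;gene_type=trucB;gene_name=natt\n' | extract_by_key
```

This is longer than a positional one-liner, but a row with extra or reordered attributes still yields the right two values, and a missing key yields an empty field instead of a wrong one.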

You should really take this to heart: when you don't fully know your data formatting, never use a regex. Regexes are great for grabbing things. They are a really bad idea for data manipulation (not to mention they're also slow).

written 2.7 years ago by John

Yes, I understand the feeling, so I updated my comment with my understanding, asked the OP what they should ideally be looking for, and modified the command line.

written 2.7 years ago by vchris_ngs

With this?

sed 's/=/\t/g' file.txt | sed 's/;/\t/g' | awk '{print $1,$2,$3,$6,$8}'

One clarification: your input file has 3 chromosomes, but the output file has only one chr. Is that what you really need?

modified 2.7 years ago • written 2.7 years ago by venu

It is a typo, I believe, since it does not make sense to change everything to chr1.

written 2.7 years ago by vchris_ngs