Question: awk field with different separator
sacha wrote:

Hi,

Could you suggest the fastest way to filter this data:

chr1    43  1000    gene_name=boby  gene_type=trucA
chr2    44  1000    gene_name=natt  gene_type=trucB  
chr3    45  1000    gene_name=alurika   gene_type=trucC

To :

chr1  43  1000 boby  trucA
chr1  44  1000 natt  trucB
chr1  45  1000  alurika  trucC

CORRECTION : Original text data looks like this :

chr1    43  1000    TEST   gene_name=boby;gene_type=trucA;foo=34
chr2    44  1000    TRUC  gene_name=natt;gene_type=trucB;foo=34  
chr3    45  1000    PASS  gene_name=alurika;gene_type=trucC;foo=34
awk oneliner • 690 views
modified 21 months ago by Biostar • written 22 months ago by sacha

What did you try so far? Is this really a bioinformatics question? What is the logic between input and output change? It looks rather random to me, e.g. "..trucB chr3 45 1000.." becomes "..trucB chr1 45 1000.."?

modified 22 months ago • written 22 months ago by 5heikki

I just answered it, but I agree with you. Regarding chr3 -> chr1, I think it's just a typo!

written 22 months ago by Sukhdeep Singh

Sukhdeep Singh wrote:
sed -e 's/gene_name=//g' -e 's/gene_type=//g' file > file2
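
A quick check of this command on the first sample from the question. This is only a sketch: the temporary file and the `sed_out` variable are illustrative, and the columns are assumed to be tab-separated.

```shell
# Recreate the three sample rows in a temporary file.
demo=$(mktemp)
printf 'chr1\t43\t1000\tgene_name=boby\tgene_type=trucA\n' >  "$demo"
printf 'chr2\t44\t1000\tgene_name=natt\tgene_type=trucB\n' >> "$demo"
printf 'chr3\t45\t1000\tgene_name=alurika\tgene_type=trucC\n' >> "$demo"

# Strip the two literal key prefixes; everything else is left untouched.
sed_out=$(sed -e 's/gene_name=//g' -e 's/gene_type=//g' "$demo")
printf '%s\n' "$sed_out"
rm -f "$demo"
```

Only these two prefixes are removed, so any other `key=` text would pass through unchanged.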
written 22 months ago by Sukhdeep Singh

sacha wrote:

Thanks for your reply !

But I just discovered that awk supports regular expressions for the field separator. So this works too:

cat test | awk 'BEGIN{FS="\t|="} {print $1,$2,$3,$5,$7}'
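
The same idea without `cat`, shown on one row as a sketch. It assumes the input is tab-separated and uses a bracket expression in place of `\t|=`, which should behave the same here. `OFS` is set so the output stays tab-delimited; the temporary file and the `awk_out` variable are illustrative only.

```shell
demo=$(mktemp)
printf 'chr1\t43\t1000\tgene_name=boby\tgene_type=trucA\n' > "$demo"

# An FS longer than one character is treated as a regular expression:
# here every tab or "=" ends a field, so the values land in $5 and $7.
awk_out=$(awk 'BEGIN { FS = "[\t=]"; OFS = "\t" } { print $1, $2, $3, $5, $7 }' "$demo")
printf '%s\n' "$awk_out"
rm -f "$demo"
```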
modified 15 months ago • written 22 months ago by sacha

Double quotes just left hanging. That makes me sad.

written 22 months ago by John

Let's make John happy :D

written 22 months ago by Sukhdeep Singh

I think you still do not get the desired output with the awk command you are showing, but with sed you do get the output given in your original question. And yes, the double quotes are not closed.

written 22 months ago by vchris_ngs

Sure, this would take care of any pattern after a tab and before "=", my answer is valid if you only want to replace these two strings.

written 22 months ago by Sukhdeep Singh

genomax wrote:

Try

$ sed -e 's/;/\ /g' your_file | sed -e 's/=/\ /g' | awk -F " " '{print $1"\t"$2"\t"$3"\t"$6"\t"$8}'
written 22 months ago by genomax

Removing redundancy

$ sed -e 's/;/\ /g' -e 's/=/\ /g' your_file | awk -F " " '{print $1"\t"$2"\t"$3"\t"$6"\t"$8}'
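
For comparison, the whole job can probably be done in a single awk call by listing every delimiter in the field separator. This is a sketch under the assumption that values never contain spaces, `;` or `=`; the temporary file holds two of the corrected rows from the question and the `one_awk_out` variable is illustrative.

```shell
demo=$(mktemp)
printf 'chr1\t43\t1000\tTEST\tgene_name=boby;gene_type=trucA;foo=34\n' >  "$demo"
printf 'chr2\t44\t1000\tTRUC\tgene_name=natt;gene_type=trucB;foo=34\n' >> "$demo"

# Split on any run of spaces, tabs, ";" or "=".
# Fields: 1-3 position, 4 status, 5/6 gene_name key/value, 7/8 gene_type key/value.
one_awk_out=$(awk 'BEGIN { FS = "[ \t;=]+"; OFS = "\t" } { print $1, $2, $3, $6, $8 }' "$demo")
printf '%s\n' "$one_awk_out"
rm -f "$demo"
```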
written 22 months ago by genomax

Perfect, I was about to write this. But in any case the OP should think about why such formatting is required. I believe this is a VCF file listing variant names with positions; in that case pre-filtering should be done to keep only the records whose column contains the string PASS. In that case it should be:

grep "PASS" file.txt | sed -e 's/;/\ /g' -e 's/=/\ /g' | awk -F " " '{print $1"\t"$2"\t"$3"\t"$4"\t"$6"\t"$8}' > file_flt.txt

Otherwise @genomax2 is correct about what you need.
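
One caveat with `grep "PASS"`: it keeps any line that contains PASS anywhere, including inside a gene name. A stricter sketch tests the fourth column only, assuming (as in the sample) that the status sits in column 4. The first row below, with its `BYPASS1` gene name, is made up to show the difference.

```shell
demo=$(mktemp)
# Hypothetical row: status is TEST, but the gene name contains "PASS".
printf 'chr1\t43\t1000\tTEST\tgene_name=BYPASS1;gene_type=trucA;foo=34\n' >  "$demo"
printf 'chr3\t45\t1000\tPASS\tgene_name=alurika;gene_type=trucC;foo=34\n' >> "$demo"

# Keep a record only when column 4 is exactly PASS, then print the
# position, the status, and the two attribute values.
pass_out=$(awk 'BEGIN { FS = "[ \t;=]+"; OFS = "\t" }
$4 == "PASS" { print $1, $2, $3, $4, $6, $8 }' "$demo")
printf '%s\n' "$pass_out"
rm -f "$demo"
```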

modified 22 months ago • written 22 months ago by vchris_ngs

Can i just say, i love the new syntax highlighting going on here :D

(particularly how Istavan has coloured the popular bioinformatics program names red. very nice touch)

modified 22 months ago • written 22 months ago by John

sacha wrote:

OK, but actually my example was not complete. There is more data:

chr1    43  1000    TEST   gene_name=boby;gene_type=trucA;foo=34
chr2    44  1000    TRUC  gene_name=natt;gene_type=trucB;foo=34  
chr3    45  1000    PASS  gene_name=alurika;gene_type=trucC;foo=34

How do I get:

chr1  43  1000 boby  trucA
chr1  44  1000 natt  trucB
chr1  45  1000  alurika  trucC
modified 22 months ago • written 22 months ago by sacha

You should really take this to heart: when you don't fully know your data formatting, never use a regex. Regexes are great for grabbing things. They are a really bad idea for data manipulation (not to mention they're also slow).
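
One way to follow this advice is to look attributes up by key instead of counting fields, so the result does not depend on their order. A minimal sketch, assuming the attributes are the last whitespace-separated column and are written as `key=value` pairs joined by `;`. The reordered sample row and the names `gname`, `gtype` and `by_key_out` are made up for the demonstration.

```shell
demo=$(mktemp)
# Same content as the first sample row, but with the attributes reordered.
printf 'chr1\t43\t1000\tTEST\tfoo=34;gene_type=trucA;gene_name=boby\n' > "$demo"

by_key_out=$(awk 'BEGIN { OFS = "\t" }
{
    gname = ""; gtype = ""
    n = split($NF, pairs, ";")          # last column: key=value pairs
    for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")       # literal search, no regex involved
        if (eq == 0) continue           # skip anything that is not key=value
        key = substr(pairs[i], 1, eq - 1)
        val = substr(pairs[i], eq + 1)
        if (key == "gene_name") gname = val
        if (key == "gene_type") gtype = val
    }
    print $1, $2, $3, gname, gtype
}' "$demo")
printf '%s\n' "$by_key_out"
rm -f "$demo"
```

A missing key simply leaves an empty column, which is easier to spot than a silently shifted field.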

written 22 months ago by John

Yes, I understand the feeling, so I updated my comment with my understanding, asked the OP what they are ideally looking for, and modified the command line.

written 22 months ago by vchris_ngs

With this?

sed 's/=/\t/g' file.txt | sed 's/;/\t/g' | awk '{print $1,$2,$3,$6,$8}'

Some clarification, in your input file there are 3 chromosomes but in output file only one chr? Is that what you really need?
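
A portability note on the sed command above: `\t` in a sed replacement is a GNU extension that POSIX does not specify, so other sed implementations may insert a literal `t`. A sketch of an alternative using `tr`, which does define `\t` (the temporary file and the `tr_out` variable are illustrative):

```shell
demo=$(mktemp)
printf 'chr1\t43\t1000\tTEST\tgene_name=boby;gene_type=trucA;foo=34\n' > "$demo"

# Turn every "=" and ";" into a tab, then pick the fields by position.
tr_out=$(tr '=;' '\t\t' < "$demo" | awk 'BEGIN { OFS = "\t" } { print $1, $2, $3, $6, $8 }')
printf '%s\n' "$tr_out"
rm -f "$demo"
```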

modified 22 months ago • written 22 months ago by venu

It is a typo, I believe, since it does not make sense to change everything to chr1.

written 22 months ago by vchris_ngs