Question: Problem when trying to extract column value using "awk" from txt table
Wet&DryImmunology (Japan) wrote 2.2 years ago:

Hi, I have regions of interest (ROIs) generated using ROSE (https://bitbucket.org/young_computation/rose). The ROI information was written to a text file, "H3K27acDP_peaks_AllEnhancers_ENHANCER_TO_GENE.txt".

The file looks like this (only the first 3 lines are shown here):

#H3K27acDP_peaks Enhancers  OVERLAP_GENES   PROXIMAL_GENES  CLOSEST_GENE    enhancerRank    isSuper
2_H3K27ac_WTDP_peak_8539_lociStitched   chr6    41482303    41510764    2   26841   100899.9372 3865.0038   1       Ephb6,Prss2 Prss2   1   1
12_H3K27ac_WTDP_peak_8627_lociStitched  chr6    71249202    71328945    12  47488   101791.9395 10342.6671  2   Cd8a,Cd8b1  Krcc1,Smyd1 Cd8b1   2   1

I wanted to extract columns from this table to generate a standard GTF file as input for DESeq2 (an R package for finding differentially enriched regions). For that purpose, I used:

awk '{OFS="\t"; print $2, "DP_enhancers","enhancer", $3, $4, "0.000000","-",".", $12}' H3K27acDP_peaks_AllEnhancers_ENHANCER_TO_GENE.txt > H3K27acDP_enhancers.gff &

But I did not get the GTF file I wanted. The first 4 rows are shown here:

chr6    DP_enhancers    enhancer    41482303    41510764    0.000000    -   .   1
chr6    DP_enhancers    enhancer    71249202    71328945    0.000000    -   .   Cd8b1
chr14   DP_enhancers    enhancer    54779797    54858773    0.000000    -   .   Dad1
chr17   DP_enhancers    enhancer    47640970    47694393    0.000000    -   .   Ccnd3

The problem is the first row: awk seems to fail to recognize that there is an empty value in the "OVERLAP_GENES" column, so instead of treating "Prss2" as $12, it extracts "1", which belongs to "enhancerRank", as $12. The other rows seem to be OK. If it were only the first row, I could extract $11 instead of $12, but that would break most of the other rows. If anyone has an idea how to solve this, please let me know.

Thank you very much in advance.

Asaf (Israel) wrote 2.2 years ago:

Try adding -F"\t" to the awk command, i.e. awk -F"\t" '{OFS....
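
For reference, here is a minimal sketch of the full command with the tab separator set, run on a two-line sample rebuilt from the rows in the question. The file names sample.txt and sample.gff are made up for this illustration, and the sample assumes the original file is tab-delimited with an empty OVERLAP_GENES field in the first row.

```shell
# Rebuild the two data rows from the question as a tab-delimited sample
# (hypothetical file name; row 1 has an empty OVERLAP_GENES field).
printf '2_H3K27ac_WTDP_peak_8539_lociStitched\tchr6\t41482303\t41510764\t2\t26841\t100899.9372\t3865.0038\t1\t\tEphb6,Prss2\tPrss2\t1\t1\n' > sample.txt
printf '12_H3K27ac_WTDP_peak_8627_lociStitched\tchr6\t71249202\t71328945\t12\t47488\t101791.9395\t10342.6671\t2\tCd8a,Cd8b1\tKrcc1,Smyd1\tCd8b1\t2\t1\n' >> sample.txt

# -F"\t" makes awk split on single tabs, so the empty field keeps its position
awk -F"\t" '{OFS="\t"; print $2, "DP_enhancers", "enhancer", $3, $4, "0.000000", "-", ".", $12}' sample.txt > sample.gff

cut -f 9 sample.gff   # last column: Prss2, then Cd8b1
```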


@Asaf, I don't know what magic you suggested, but it worked perfectly! What does -F"\t" do, and why does it solve the problem?

written 2.2 years ago by Wet&DryImmunology

By default, awk splits lines into columns on any whitespace, and consecutive whitespace characters are treated as a single delimiter. When you define the column separator to be a tab (with -F"\t"), each tab counts as its own separator, so two consecutive tabs delimit an empty field.
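
A small sketch of that difference, counting fields (NF) on a made-up line that has an empty middle column. The variable names are only for illustration.

```shell
# One line with an empty middle field: "a", "", "c" separated by tabs
line=$(printf 'a\t\tc')

# Default splitting: runs of whitespace collapse into one separator
default_nf=$(printf '%s\n' "$line" | awk '{print NF}')      # 2 fields
# Tab splitting: each tab is a separator, so the empty field is counted
tab_nf=$(printf '%s\n' "$line" | awk -F"\t" '{print NF}')   # 3 fields

echo "default: $default_nf fields, tab: $tab_nf fields"
```

With default splitting the empty column disappears and every later column shifts left by one, which matches what happened to $12 in the first row of the question.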

written 2.2 years ago by Asaf

I see. Thanks for the patient explanation!

written 2.2 years ago by Wet&DryImmunology

You could have also done:

awk '{FS=OFS="\t"; prin..

Field Separator equals Output Field Separator equals..

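In general, it's a good idea to put this kind of assignment into a BEGIN block so that it is executed before any input is read. Set inside the main rule, FS typically takes effect only from the next line, so the first line may still be split on whitespace: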

awk 'BEGIN{FS=OFS="\t"}{print..}'
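
To see why the BEGIN block matters, here is a small sketch on a made-up two-line file (demo.tsv is a hypothetical name) whose first line has an empty second field.

```shell
# Two tab-delimited lines; the first has an empty second field
printf 'a\t\tc\nd\te\tf\n' > demo.tsv   # hypothetical file name

# FS set in the main rule: the first line has already been split
# by the time the assignment runs
awk '{FS=OFS="\t"; print $3}' demo.tsv

# FS set in BEGIN: in effect before the first line is read
awk 'BEGIN{FS=OFS="\t"} {print $3}' demo.tsv   # prints c, then f
```

With a POSIX-conforming awk such as gawk, the first command should print an empty line for the first record, because that record was split on whitespace into only two fields. Some awk implementations are reported to behave differently here, which is one more reason to prefer BEGIN.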
written 2.2 years ago by 5heikki

I think it tells awk that the input file is tab-separated.

written 2.2 years ago by mbk0asis

I moved this to an answer so it can get accepted.

written 2.2 years ago by WouterDeCoster