Question: Awk- How to compare 2 files with common columns and then get the output file with columns from each file.
0
gravatar for bright.g.january1732
4.4 years ago by
bright.g.january17320 wrote:

Hi folks

There is two input files.(File formats is tsv, both)

File1: treated.bam.tsv File2: untreated.bam.tsv

Both has ‘same fields’ like below. (I numbered and delimited with ‘/’) – each file has 23 fields.

1chrom/2pos/3ref/4reads_all/5reads_pp/6matches/7matches_pp/8mismatches/9mismatches_pp/10deletions/11deletions_pp/12insertions/13insertions_pp/14A/15A_pp/16C/17C_pp/18T/19T_pp/20G/21G_pp/22N/23N_pp

If values in 1st and 2nd columns(chrom, pos) are same in both files, I want to extract some of fields in the records, then make a new output file like this below. Output file has 15 fields from 2 input files like below.

- From file1
1chrom
2pos
3ref
4reads_all
8mismatches
10deletions
12insertions
pct_file1(the values from file1: (8mismatches+10deletions+12insertions)/ 4reads_all) 
- From file2
3ref
4reads_all
8mismatches
10deletions
12insertions
pct_file2(the values from file2: (8mismatches+10deletions+12insertions)/ 4reads_all) 
- New colume from both files.
pct_sub(the values from pct_file1 - pct_file2: ((8mismatches+10deletions+12insertions)/ 4reads_all) - ((8mismatches+10deletions+12insertions)/ 4reads_all))

In the output file, First 8 columns come from File1. treated.bam.tsv (the 8th column is the value calculated with 8mismatches, 10deletions, 12insertions and 4reads_all from File1).

The rest of them come from File2, (File2. untreated.bam.tsv), and 13th column is also the value calculated with 8mismatches, 10deletions, 12insertions and 4reads_all from file2).

The last field, pct_sub is calculated from the subtraction value with fields from file1(((8mismatches+10deletions+12insertions)/ 4reads_all)) and file2 from ((8mismatches+10deletions+12insertions)/ 4reads_all)).

And how can I add new column names in header of output files, like pct_file1, pct_file2, pct_sub

This is what I made for the above output file.(input and output files have all same format. TSV)

awk 'FNR==NR{array[$1,$2]=$0;next} { if ( $1 $2 in array ) print $1, $2, array[$3], array[$4], array[$8], array[$10], array[$12], (array[$8]+array[$10]+array[$12])/array[$4], $3, $4, $8, $10, $12, ($8+$10+$12)/$4, ((array[$8]+array[$10]+array[$12])/array[$4])-(($8+$10+$12)/$4) > "awkoutput.bam.tsv" }' treated.bam.tsv untreated.bam.tsv

(Actually, $1, $2 are not problem from File1 or File2)

**FILE1(treated)**                                                                                      

chrom   pos ref reads_all   reads_pp    matches matches_pp  mismatches  mismatches_pp   deletions   deletions_pp    insertions  insertions_pp   A   A_pp    C   C_pp    T   T_pp    G   G_pp    N   N_pp
chrY    59363551    G   8   0   7   0   0   0   1   0   5   0   0   0   0   0   0   0   7   0   0   0
chrY    59363552    G   7   0   7   0   0   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0
chrY    59363553    T   7   0   7   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0   0   0
chrY    59363554    G   7   0   7   0   0   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0
chrY    59363555    T   7   0   7   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0   0   0



**FILE2(untreated)**    
chrom   pos ref reads_all   reads_pp    matches matches_pp  mismatches  mismatches_pp   deletions   deletions_pp    insertions  insertions_pp   A   A_pp    C   C_pp    T   T_pp    G   G_pp    N   N_pp
chrY    59363551    G   2   0   2   0   0   0   0   0   1   0   0   0   0   0   0   0   2   0   0   0
chrY    59363552    G   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
chrY    59363553    T   1   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0
chrY    59363554    G   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
chrY    59363555    T   1   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0

OUTPUT

chrom   pos ref reads_all   mismatches  deletions   insertions  pct_file1   ref reads_all   mismatches  deletions   insertions  pct_file2   pct_sub             
chrY    59363551    G   8   0   1   5   0.75    G   2   0   0   1   0.5 0.25                
chrY    59363552    G   7   0   0   0   0   G   1   0   0   0   0   0               
chrY    59363553    T   7   0   0   0   0   T   1   0   0   0   0   0               
chrY    59363554    G   7   0   0   0   0   G   1   0   0   0   0   0               
chrY    59363555    T   7   0   0   0   0   T   1   0   0   0   0   0

Thank you in advance.

-Jan

awk • 6.9k views
ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by bright.g.january17320
1

I have a rather simple solution with excel/R. Merge fields chr and pos by a delimiter like -, so you have one column. Then you can use the VLOOKUP function in excel or the merge() function in R.

ADD REPLYlink written 4.4 years ago by Adrian Pelin2.4k

@Adrian Pelin Thank you for your solution but I cannot do it with excel because it is over the limitation of row number in excel.

ADD REPLYlink written 4.4 years ago by bright.g.january17320
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2436 users visited in the last hour
_