Question: check if two columns match in 2 annotation files and print those lines to a new output file
0
gravatar for vineelagangalapudi
5.3 years ago by
United States
vineelagangalapudi0 wrote:

I have 2 gff3 files. i want to extract genes with same gene lengths in both the files. If they are same, i want to write them to a third file

example: file 1

C20775336       maker   gene    1895    2166    .       -       .       ID=gene1;Name=maker-C20
C20775336       maker   gene    895    1166    .       -       .       ID=mRNA1;Parent=gene1;N
C20775336       maker   gene    1895    1962    .       -       .       Parent=mRNA1
C20775336       maker   gene     2795    2962    .       -       2       ID=CDS1;Parent=mRNA1

file 2

C20775336       maker   gene    895    1166    .       -       .       ID=gene1;Name=maker-C20
C20775936       maker   gene    1895    2166    .       -       .       ID=mRNA1;Parent=gene1;N
C20775336       maker   gene    2795    2962    .       -       .       Parent=mRNA1
 

output file

C20775336       maker   gene    895    1166    .       -       .       ID=gene1;Name=maker-C20

C20775336       maker   gene    2795    2962    .       -       .       Parent=mRNA1

so if columns 4 and 5 in first file is equal to column 4 and 5 in second file, print that line from second file to a new output file.

I tried to work on awk, to achieve this, but i could not figure out.

Thanks in advance

 

 

sequence • 19k views
ADD COMMENTlink modified 5.3 years ago by venu6.7k • written 5.3 years ago by vineelagangalapudi0

Joining Multiple Fields Using Unix Join (from SO) gives very nice solutions using awk and join. If I have to combine multiple files in unix I use what Michael Mrozek suggested - combine fields using awk and then join files according one key field with join.

ADD REPLYlink written 5.3 years ago by PoGibas4.8k

I have same query also, but if i don't care about rest of first column, only first column should match and print number of match found in first file and number of match found in second file.

thanks

ADD REPLYlink written 4.9 years ago by Manoj60

For quicker response you should ask this general programming question on SO.

ADD REPLYlink written 4.9 years ago by PoGibas4.8k

Sorry...I did not get about SO

ADD REPLYlink written 4.9 years ago by Manoj60

stackoverflow

ADD REPLYlink written 4.9 years ago by PoGibas4.8k
6
gravatar for venu
5.3 years ago by
venu6.7k
Germany
venu6.7k wrote:
Is this what you need?

​awk 'FNR==NR{a[$4,$5]=$0;next}{if(b=a[$4,$5]){print b}}' file1 file2
ADD COMMENTlink written 5.3 years ago by venu6.7k

+1 for a pure Awk solution (and a creative usage of FNR). Would you care to explain what the command does for people who are not familiar with Awk?

ADD REPLYlink written 5.3 years ago by Frédéric Mahé3.0k

for some reason, the above command does not work for me, i get a error message "bash: ​awk: command not found.."

 

ADD REPLYlink written 5.3 years ago by vineelagangalapudi0

I tried to work on awk. This statement from your question letting us to assume you have used awk on your machine initially, if so the above oneliner should work. Type awk -V in terminal. 

ADD REPLYlink written 5.3 years ago by venu6.7k

@Frédéric Mahé   Two variables FNR and NR stores the number of records in the current file and total number of records respectively. $0 is to hold entire line. Rest is as simple as any other comparison, matching column 4 and 5 from file1 (a) with file2 (b), if matched, print entire line from file2 (b). 

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by venu6.7k

Hi, what if I need the inverse of this? I tried awk 'FNR==NR{a[$4,$5]=$0;next}{if(b!=a[$4,$5]){print b}}' file1 file2 but all I got was an empty file.

ADD REPLYlink written 2.1 years ago by sicat.paolo2030
0
gravatar for mark.ziemann
5.3 years ago by
mark.ziemann1.2k
Australia/Mebourne/Geelong/Deakin
mark.ziemann1.2k wrote:

Try this:

##First sort each of the input files on the 1st column

sort -k 1b,1 file2 > file2s

sort -k 1b,1 file1 > file1s

##Now they can be joined on the 1st field, followed by awk 1 line

join -1 1 -2 1 file1s file2s \

| awk '{OFS="\t"} $5-$4==$13-$12' > outputfile

 

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by mark.ziemann1.2k

Thank you, your solution works perfectly. However it joins the lines from both the files . Is there a way to skip that.

for example , it did some thing like this on my origina file

scaffold1732 baker gene 85677 87592 . + . ID=gene9;Name=maker-gnlscaffold1732-augustus-gene-0.6 maker gene 85677 87592 . + . ID=gene9;Name=maker-gnl|Hsal_3.3|scaffold1732-augustus-gene-0.6

instead can i get output lines like this

scaffold1732 baker gene 85677 87592 . + . ID=gene9;Name=maker-

ADD REPLYlink written 5.3 years ago by vineelagangalapudi0

With join you can specify wanted output (simple examples).

ADD REPLYlink written 5.3 years ago by PoGibas4.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 764 users visited in the last hour