Compare two VCF-like files and output the line for each unique position (from the first file) and the line for each duplicate position (from the second file)
8.4 years ago
amitgsir ▴ 60

I have two tab-separated files, for example:

File1.txt

chr1    894573  rs13303010  GG
chr2    18674   rs10195681  CC
chr3    104972  rs990284    AA  #<--- Unique Line
chr4    111487  rs17802159  AA
chr5    200868  rs4956994   GG
chr5    303686  rs6896163   AA  #<--- Unique Line
chrX    331033  rs4606239   TT
chrY    2893277 i4000106    GG
chrY    2897433 rs9786543   GG
chrM    57  i3002191    TT

File2.txt

chr1    894573  rs13303010  GG
chr2    18674   rs10195681  AT
chr4    111487  rs17802159  AA
chr5    200868  rs4956994   CC
chrX    331033  rs4606239   TT
chrY    2893277 i4000106    GA
chrY    2897433 rs9786543   GG
chrM    57  i3002191    TA

Desired Output:

Output.txt

chr1    894573  rs13303010  GG
chr2    18674   rs10195681  AT
chr3    104972  rs990284    AA  #<--Unique Line from File1.txt
chr4    111487  rs17802159  AA
chr5    200868  rs4956994   CC
chr5    303686  rs6896163   AA  #<--Unique Line from File1.txt
chrX    331033  rs4606239   TT
chrY    2893277 i4000106    GA
chrY    2897433 rs9786543   GG
chrM    57  i3002191    TA

File1.txt has 10 entries in total, while File2.txt has 8. I want to compare the two files using columns 1 and 2 (or, alternatively, the rsID in column 3).

If the values in the first two columns are the same in both files, the corresponding line from File2.txt should be printed to Output.txt.

When File1.txt has a unique combination (column 1:column 2 that is not present in File2.txt), the corresponding line from File1.txt should be printed to Output.txt.

I tried various awk and Perl combinations available online, but couldn't get the correct answer. Any suggestions would be helpful.

Thanks,
Amit

vcf • 2.7k views
8.4 years ago

Give this one a try:

grep -v -f <( cat file2.txt | tr -s ' ' | cut -f 3 -d ' ' ) file1.txt

Hi Sean,

This works for the small text file, but on the large files (~1 million lines in each) it runs indefinitely and produces no result.

Also, it only outputs the unique lines from File1.txt, whereas I also want to keep the lines for matching positions from File2.txt in the output file.

Thanks, Amit
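
One likely reason for the slowdown is that `grep -f` treats every line of the pattern file as a regular expression, which tends to scale poorly with around a million patterns. A minimal sketch of a variant that is typically much faster, assuming the files really are tab-separated and the rsID sits in column 3; the small inputs and the names `ids.txt` and `unique_to_file1.txt` are hypothetical and used only for illustration:

```shell
# Hypothetical sample inputs trimmed from the question (tab-separated).
printf 'chr1\t894573\trs13303010\tGG\nchr3\t104972\trs990284\tAA\nchr5\t303686\trs6896163\tAA\n' > File1.txt
printf 'chr1\t894573\trs13303010\tGG\n' > File2.txt

# Collect the rsIDs (column 3) of File2.txt, one per line.
cut -f3 File2.txt > ids.txt

# -F: treat each ID as a literal string instead of a regular expression
# -w: match whole words only, so rs123 does not also match rs1234
# -v: keep the File1.txt lines that match none of the IDs
grep -v -w -F -f ids.txt File1.txt > unique_to_file1.txt
```

This still yields only the lines unique to File1.txt; merging in the matching lines from File2.txt needs a separate step.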


Since your files appear to be genomic coordinates, you may want to convert them to a "standard" format such as BED and then apply tools like bedtools or bedops. This will offer you the performance that you want on very large files.
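
A rough sketch of what such a conversion could look like, assuming the positions in column 2 are 1-based (as in VCF) and the files are tab-separated. The output name `File1.bed`, the two sample lines, and the choice to carry the rsID and genotype along as extra columns are illustrative assumptions, not something specified in this thread:

```shell
# Hypothetical sample input trimmed from the question (tab-separated).
printf 'chr3\t104972\trs990284\tAA\nchr1\t894573\trs13303010\tGG\n' > File1.txt

# BED intervals are 0-based and half-open, so a 1-based position P becomes [P-1, P).
# Columns written: chrom, start, end, rsID, genotype.
awk -F'\t' 'BEGIN { OFS = "\t" } { print $1, $2 - 1, $2, $3, $4 }' File1.txt |
    sort -k1,1 -k2,2n > File1.bed

# Once File2.txt is converted the same way, a command along these lines would
# report File1 positions with no counterpart in File2 (requires bedtools; not run here):
#   bedtools intersect -v -a File1.bed -b File2.bed
```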


Thanks, Sean, for the suggestions! I tried some awk combinations and was able to get the output.


Did you want to answer your own question, then, so that we can see what you came up with?
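
The awk command that was eventually used is not shown in the thread. For reference, a minimal sketch of one way to implement the rules from the question, assuming tab-separated input and a key made of columns 1 and 2; the trimmed sample data below is hypothetical and only there to make the example self-contained:

```shell
# Hypothetical sample inputs trimmed from the question (tab-separated).
printf 'chr1\t894573\trs13303010\tGG\nchr2\t18674\trs10195681\tCC\nchr3\t104972\trs990284\tAA\n' > File1.txt
printf 'chr1\t894573\trs13303010\tGG\nchr2\t18674\trs10195681\tAT\n' > File2.txt

# First file read (NR == FNR): store every File2.txt line under its chrom/position key.
# Second file read: for each File1.txt line, print the stored File2.txt line if the
# key exists, otherwise print the File1.txt line unchanged.
awk -F'\t' '
    NR == FNR        { seen[$1, $2] = $0; next }
    ($1, $2) in seen { print seen[$1, $2]; next }
                     { print }
' File2.txt File1.txt > Output.txt
```

The output follows the line order of File1.txt, and only File2.txt is held in memory, so this should remain practical at around a million lines per file. To key on the rsID instead, `$3` could replace `$1, $2` in all three places.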
