Question: Compare two VCF-like files and output the File1 line for each unique position and the File2 line for each shared position
3.5 years ago by
amitgsir50
Incheon, South Korea
amitgsir50 wrote:

I have two tab-separated files, for example:

File1.txt

chr1    894573  rs13303010  GG
chr2    18674   rs10195681  CC
chr3    104972  rs990284    AA  <--- Unique Line
chr4    111487  rs17802159  AA
chr5    200868  rs4956994   **GG**
chr5    303686  rs6896163   AA  <--- Unique Line
chrX    331033  rs4606239   TT
chrY    2893277 i4000106    **GG**
chrY    2897433 rs9786543   GG
chrM    57  i3002191    **TT**


File2.txt

chr1    894573  rs13303010  GG
chr2    18674   rs10195681  AT
chr4    111487  rs17802159  AA
chr5    200868  rs4956994   CC
chrX    331033  rs4606239   TT
chrY    2893277 i4000106    GA
chrY    2897433 rs9786543   GG
chrM    57  i3002191    TA

 

Desired Output:

Output.txt

chr1    894573  rs13303010  GG
chr2    18674   rs10195681  AT
chr3    104972  rs990284    AA  <--Unique Line from File1.txt
chr4    111487  rs17802159  AA
chr5    200868  rs4956994   CC
chr5    303686  rs6896163   AA  <--Unique Line from File1.txt
chrX    331033  rs4606239   TT
chrY    2893277 i4000106    GA
chrY    2897433 rs9786543   GG
chrM    57  i3002191    TA

File1.txt has 10 entries in total, while File2.txt has 8. I want to compare the two files using columns 1 and 2 (alternatively, column 3, the rsID, could be used).

If the values in the first two columns are the same in both files, the corresponding line from File2.txt should be printed to Output.txt.

When File1.txt has a unique combination (column 1:column 2 that is not present in File2.txt), the corresponding line from File1.txt should be printed to Output.txt.

I tried various awk and perl combinations found online, but couldn't get the correct answer. Any suggestions would be helpful.

Thanks, Amit

vcf • 1.4k views
ADD COMMENTlink modified 3.5 years ago by Sean Davis25k • written 3.5 years ago by amitgsir50
3.5 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

Give this one a try:

grep -v -f <( cat file2.txt | tr -s ' ' | cut -f 3 -d ' ' ) file1.txt
ADD COMMENTlink written 3.5 years ago by Sean Davis25k

Hi Sean,

This works for the small text file, but on the large files (~1 million lines in each) it keeps running and produces no result.

Also, it only returns the unique lines from File1.txt, whereas I also want to keep the lines for matching positions from File2.txt in the output file.

Thanks, Amit

ADD REPLYlink written 3.5 years ago by amitgsir50

Since your files appear to be genomic coordinates, you may want to convert them to a "standard" format such as BED and then apply tools like bedtools or bedops.  This will offer you the performance that you want on very large files.

ADD REPLYlink written 3.5 years ago by Sean Davis25k
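
For anyone wanting to try the BED route suggested above, here is a minimal sketch of the conversion, not something posted in the thread. It assumes whitespace-separated input with columns chromosome, 1-based position, rsID, and genotype, as in the question. BED uses 0-based, half-open coordinates, so a single position P becomes start P-1 and end P. The function name `to_bed` is hypothetical, and packing the rsID and genotype into the name column with a `;` is just one illustrative choice.

```shell
# Hypothetical helper: convert "chrom pos rsid genotype" lines to 4-column BED.
# Reads the files given as arguments, or standard input if none are given.
to_bed() {
    awk 'BEGIN { OFS = "\t" }
         # BED start is 0-based and end is exclusive, so position P maps to [P-1, P).
         # The rsID and genotype are joined into the name column (an assumption).
         { print $1, $2 - 1, $2, $3 ";" $4 }' "$@"
}
```

It could be run as `to_bed File1.txt > File1.bed`. Many interval tools expect sorted input, so a sort step may be needed before comparing the resulting files.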

Thanks, Sean, for the suggestions! I tried some awk combinations and was able to get the output.

ADD REPLYlink written 3.5 years ago by amitgsir50

Did you want to answer your own question, then, so that we can see what you came up with?

ADD REPLYlink written 3.5 years ago by Sean Davis25k
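
The awk solution itself was never posted in this thread. As a sketch only, and not the original poster's actual command, the rule described in the question (print the File2.txt line when columns 1 and 2 match, otherwise print the File1.txt line) could be written as a two-file awk pass. The function name `merge_by_position` is hypothetical; the input is assumed to be whitespace-separated with the chromosome in column 1 and the position in column 2.

```shell
# Hypothetical helper. $1 = first file (File1.txt), $2 = second file (File2.txt).
merge_by_position() {
    awk '
        # While reading the second file (passed to awk first): store each line,
        # keyed by chromosome and position.
        FILENAME == ARGV[1] { second[$1, $2] = $0; next }
        # While reading the first file: print the stored File2 line when the
        # key is shared, otherwise print the File1 line unchanged.
        { print ((($1, $2) in second) ? second[$1, $2] : $0) }
    ' "$2" "$1"
}
```

It could be run as `merge_by_position File1.txt File2.txt > Output.txt`. The output follows the line order of File1.txt, and positions that occur only in File2.txt are not printed, which matches the desired output shown in the question. Because the second file is held in an in-memory array and each lookup is a hash access, this should cope with files of around a million lines, although that has not been measured here.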