How to compare lines in two files?
4.7 years ago
Kumar ▴ 170

I have two large tab-delimited files, each containing two columns: IDs and sequences. I want to generate an output file of the mismatched lines, i.e. the lines in each file that are not present in the other file.

Please see the example below:

File 1:

0   AAAAGTGTGTAAAGAAGGGTAAAAAAAAAAACCGGATGCGAGGCATCCGGT
1000004 TACCGGGGAGTCGCCTTTTGCAACAGCACGGCTCAG
1000001 TGGTCAGTTTATGGAACGTTACCGGGGAGTTACTTTTTGCAACAGCACGGCTCAGCGC
1000002 ACCGGGGCAACAGCACTGCGACCGCTAAAAAAG
1000003 ATCACCGGGGCAGGCATTCGCCAGCGCCAGTAGCTGG

File 2:

1000000 TTTTTACCGGGGAGTCGCCTTTTGCAACAGCGGACGGCTCAG
1000008 TACCGGGGAGTCGCCTTTTGCAACAGCACGGCTCAG
1000006 TGGTCAGTTTATGGAACGTTACCGGGGAGTTACTTTTTGCAACAGCACGGCTCAGCGC
1000005 ACCGGGGCAACAGCACTGCGACCGCTAAAAAAG
1000009 ATCACCGGGGCAGGCATTCGCCAGCGCCAGTAGCTGG

OUTPUT:

0   AAAAGTGTGTAAAGAAGGGTAAAAAAAAAAACCGGATGCGAGGCATCCGGT
1000000 TTTTTACCGGGGAGTCGCCTTTTGCAACAGCGGACGGCTCAG
alignment genome • 837 views

I don't follow: the description of the problem states you want to find lines that are different between the files, correct? However, there are no lines in common between the two example files, so all lines should be included in the output. Or did I get something wrong?


I have updated my question. I want to generate a file of the lines that differ between the two files.


There are many ways to do this with awk or grep; see here for the same example and solutions.
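As one illustration of the awk route, here is a minimal sketch. It assumes that "different" means the sequence in column 2 does not occur in the other file, which is what the example output suggests. The function names `only_in_second` and `diff_by_seq` are made up for this example.

```shell
# Print the lines of file "$2" whose sequence (column 2) never occurs in file "$1".
# Hypothetical helper; awk's default field splitting handles tabs and spaces alike.
only_in_second() {
    awk '
        NR == FNR { seen[$2]; next }   # first file: record every sequence
        !($2 in seen)                  # second file: print lines with an unseen sequence
    ' "$1" "$2"
}

# Lines unique to either file: first those only in "$1", then those only in "$2".
diff_by_seq() {
    only_in_second "$2" "$1"
    only_in_second "$1" "$2"
}
```

Calling `diff_by_seq file1.txt file2.txt > output.txt` would write the unmatched lines with their original ID-then-sequence layout. Each file is read twice, which is usually acceptable, but the set of sequences from one file has to fit in memory.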

4.7 years ago
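# Join on the sequence column (field 2) of both files and print only the unpaired lines
# (-v 1 -v 2); tabs are converted to single spaces and both inputs are sorted on field 2 first.
$ join -t ' ' -1 2 -2 2 -v 1 -v 2 \
     <(tr "\t" " " < file1.txt | tr -s " " | sort -t ' ' -k2,2) \
     <(tr "\t" " " < file2.txt | tr -s " " | sort -t ' ' -k2,2)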

AAAAGTGTGTAAAGAAGGGTAAAAAAAAAAACCGGATGCGAGGCATCCGGT 0
TTTTTACCGGGGAGTCGCCTTTTGCAACAGCGGACGGCTCAG 1000000
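
The output above has the sequence first because join prints the join field before the remaining fields. If the original ID-then-sequence, tab-delimited layout is wanted, the result can be piped through a small awk filter. This is only a sketch; the function name `swap_columns` is invented for illustration.

```shell
# Read two whitespace-separated columns on stdin and print them swapped, tab-delimited.
# Hypothetical helper meant to be appended to the join pipeline.
swap_columns() {
    awk '{ print $2 "\t" $1 }'
}
```

Appending `| swap_columns` to the join command should turn `AAAAGTG... 0` back into `0<TAB>AAAAGTG...`.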