Question: Unique lines among 200 files
User000380 wrote (6 months ago):

I have 200 txt files in which each line contains the name of a read. The command line below finds the intersection:

cat *.mapped.txt | sort | uniq -d > intersection.out

How do I find the reads that are unique among these 200 files?

My files are called:

accepted.name.mapped.txt
...

The reads are like this:

HISEQ1:105:C0A57ACXX:2:1105:12172:84568
HISEQ1:105:C0A57ACXX:2:1108:17762:41110
HISEQ1:105:C0A57ACXX:2:1204:3007:9349
HISEQ1:105:C0A57ACXX:2:1204:11087:160507
HISEQ1:105:C0A57ACXX:2:1301:18982:79651
HISEQ1:105:C0A57ACXX:2:1307:3766:23853
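For context, a minimal sketch of what the pipeline above does, using hypothetical tiny files (the names and contents are made up, following the `*.mapped.txt` pattern). Note that `uniq -d` prints each line that occurs more than once in the sorted stream, i.e. reads present in at least two files, not necessarily in all 200:

```shell
# Tiny demo of the intersection pipeline (hypothetical file names/contents)
mkdir -p demo_d && cd demo_d
printf 'readA\nreadB\n' > a.mapped.txt
printf 'readB\nreadC\n' > b.mapped.txt
printf 'readC\nreadD\n' > c.mapped.txt

# sort groups identical read names; uniq -d keeps every line that occurs
# more than once, i.e. reads that appear in at least two files
cat *.mapped.txt | sort | uniq -d > intersection.out
```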
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (6 months ago):
cat *.mapped.txt | sort | uniq -u

??
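A minimal sketch of this answer, again with hypothetical tiny files: `uniq -u` is the mirror image of `uniq -d` and keeps only lines that occur exactly once in the sorted stream, i.e. reads found in a single file (assuming no read is duplicated within one file):

```shell
# Tiny demo of the uniq -u approach (hypothetical file names/contents)
mkdir -p demo_u && cd demo_u
printf 'readA\nreadB\n' > a.mapped.txt
printf 'readB\nreadC\n' > b.mapped.txt
printf 'readC\nreadD\n' > c.mapped.txt

# uniq -u keeps only lines occurring exactly once in the sorted stream,
# i.e. reads present in just one of the files
cat *.mapped.txt | sort | uniq -u > uniques.txt
```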


User000380 replied:

Yes, I was thinking of this, but I wasn't sure. Will it be very slow on my 200 files of 800 MB each?
Pierre Lindenbaum replied:

This should be faster:

cat *.mapped.txt | LC_ALL=C sort -T .  --buffer-size=5G | LC_ALL=C uniq -u
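A scaled-down sketch of this variant (the original uses a 5G buffer; the demo below uses 8M so it runs anywhere, and the file names/contents are hypothetical). `LC_ALL=C` switches to plain byte-wise comparison, which is much faster than locale-aware sorting; `-T .` keeps sort's temporary spill files in the current directory; `--buffer-size` sets how much RAM sort may use before spilling to disk:

```shell
# Same pipeline with the speed-ups from the reply, scaled down for a demo
mkdir -p demo_fast && cd demo_fast
printf 'readA\nreadB\n' > a.mapped.txt
printf 'readB\nreadC\n' > b.mapped.txt

# LC_ALL=C      -> byte-wise comparison, faster than locale-aware sorting
# -T .          -> put sort's temporary files in the current directory
# --buffer-size -> RAM sort may use before spilling to disk (5G in the reply)
cat *.mapped.txt | LC_ALL=C sort -T . --buffer-size=8M | LC_ALL=C uniq -u > uniques.txt
```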

User000380 replied:

Maybe this one works as well:

sort -m *.mapped.txt | uniq -u > uniques.txt
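One caveat worth flagging here: `sort -m` only merges and assumes every input file is already sorted, so each file would need to be sorted first for this to be correct. A sketch under that assumption, with hypothetical tiny files:

```shell
# sort -m merges ALREADY-sorted inputs, so sort each file in place first
mkdir -p demo_m && cd demo_m
printf 'readB\nreadA\n' > x.mapped.txt
printf 'readC\nreadB\n' > y.mapped.txt

# sort each file in place (-o lets sort write back to its own input)
for f in *.mapped.txt; do sort -o "$f" "$f"; done

# now the merge is a valid globally sorted stream for uniq -u
sort -m *.mapped.txt | uniq -u > uniques.txt
```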