I have 10 FASTA files of sequenced reads with read sizes from 15 to 35 bp. I have combined the reads, collapsed them into unique reads, and filtered for unique reads 18 to 26 bp long. Now I want to count how many times each unique read appears in each of the FASTA files and build a table with sample names as columns and reads as rows. I tried using "grep -w "sequence" filename" to count the tags, but this takes a long time. Does anyone know how to do this faster?
Edited: Sorry for the confusion. Here are the input and output.
This is the query file containing the unique sequences. I have more than a million such unique sequences.
Query:

>tag1
TCGGA
>tag2
TCTCA
>tag3
TCTCGC
These are the files to search. For the example I am showing 3 files, but I have more than 20 such files, each containing more than 10 million sequences.
File1:

>file1_id1
TCGGA
>file1_id1
TCGGAT
>file1_id2
TCTCA
>file1_id3
TCTCA

File2:

>file2_id1
TCTCA
>file2_id2
TCTCA
>file2_id3
TCTCACTA
>file2_id4
TCTCGC
>file2_id5
TCTCGCCTAT
>file2_id6
TCTCGC

File3:

>file1_id1
TCGGA
>file1_id1
TCGGAT
>file2_id4
TCTCGC
>file2_id5
TCTCGCCTAT
>file2_id6
TCTCGC
I need the following output. The match has to be exact (whole sequence, not a substring) for the count.
Output:

sequence      file1  file2  file3
tag1 TCGGA      1      0      1
tag2 TCTCA      2      2      0
tag3 TCTCGC     0      2      2
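In case it helps to show what I mean by "faster": instead of one grep per query sequence (a million greps over 20 files), the whole job can be done in a single pass per file by loading the query sequences into a hash table and streaming each sample file once. Below is a minimal Python sketch of that idea; the file names and the assumption that each sequence sits on one line after the header are placeholders for illustration, not my actual data layout.

```python
from collections import defaultdict

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file.
    Assumes each sequence is on a single line after its '>' header."""
    with open(path) as fh:
        header = None
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
            elif header is not None and line:
                yield header, line
                header = None

def count_tags(query_path, sample_paths):
    """Count exact occurrences of each query sequence in each sample file."""
    # Map each unique query sequence to its tag name (loaded once).
    tags = {}
    for header, seq in read_fasta(query_path):
        tags[seq] = header
    # counts[sequence][sample] -> number of exact matches
    counts = defaultdict(lambda: defaultdict(int))
    for sample in sample_paths:
        for _, seq in read_fasta(sample):
            if seq in tags:  # exact whole-sequence match only
                counts[seq][sample] += 1
    return tags, counts

def print_table(tags, counts, sample_paths):
    """Print the tag/sequence rows with one count column per sample."""
    print("sequence\t" + "\t".join(sample_paths))
    for seq, tag in tags.items():
        row = [str(counts[seq][s]) for s in sample_paths]
        print(f"{tag} {seq}\t" + "\t".join(row))
```

Each sample file is read exactly once and every lookup is O(1), so this scales with the total number of reads rather than (number of queries) x (number of files).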