Extracting counts of reads into a matrix
2
0
Entering edit mode
3.6 years ago
rmash ▴ 20

I have a txt file with sequences

1   TACCCTGTAGAACCGAATTTGT  miRNA   mmu-mir-10b PM
2   GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT    tRNA    Mus_musculus_tRNA-Gly-GCC-4-1   PM
3   TACCCTGTAGATCCGAATTTGT  miRNA   mmu-mir-10a PM
4   GCATTGTGGTTCAGTGGTAGAATTCTCGCCT tRNA    Mus_musculus_tRNA-Gly-GCC-2-2   IM
5   ACCCTGTAGAACCGAATTTGT   other   other   NA
6   TACCCTGTAGAACCGAATTTG   other   other   NA
7   GCATTGGTTCAGTGGTAGAATTCTCGCCT   tRNA    Mus_musculus_tRNA-Gly-GCC-2-7   IM
8   GCATTTGTGGTTCAGTGGTAGAATTCTCGCCT    tRNA    Mus_musculus_tRNA-Gly-GCC-4-1   IM
9   TACCCTGTAGAACCGAATTTGTG miRNA   mmu-mir-10b PM
10  GGTGAATATAGTTTACAAAAAACATTAGACTGTGAATC  tRNA    tRNA-His    IM


I'd like a count matrix based on the 3rd value in each line such that I have something like. What's the best way to do this?

mmu-mir-10b 2
tRNA-His 1
other 2


etc

counts reads unix grep • 958 views
0
Entering edit mode

That means you have no repeated reads. I don't see any repeated read in given example.

0
Entering edit mode

The command did what it supposed to do. You have repeated substrings, but not repeated strings. For example, "ACCCTGTAGATCCGAATTTGT" is repeated 4 times in other sequences but as a substring.

0
Entering edit mode

Yes, I was doing something silly, but I've fixed it and there should now be repeats, as the 4th item in each line, how would I go about making a count matrix from this? Sorry, i'm a novice

0
Entering edit mode

Did you derive this strange file from your alignments in SAM/BAM format? Since you are dealing with miRNA (which should have no gaps) why not just count the instances of the miRNA a read is aligned to (in third field of your alignment file) to get the count matrix.

You are interested in miRNA counts and not read counts, is that correct?

0
Entering edit mode

yes thats what id like to do but dont know how to in unix

0
Entering edit mode

thats correct, have edited main post to make it clearer

0
Entering edit mode
3.6 years ago
rmash ▴ 20

This seems to work

awk '{seen[$4]++} END{for(x in seen) print x, seen[x]}' test.txt > test_counts.txt  Is there a better option? ADD COMMENT 0 Entering edit mode Better in what terms? If it gives you the right answer then this is fine. Just to be safe I suggest that you do something similar on your original SAM file so there are less chance of errors. ADD REPLY 0 Entering edit mode 3.6 years ago $ datamash -s -g 4 count 4 < test.list
Mus_musculus_tRNA-Gly-GCC-2-2   1
Mus_musculus_tRNA-Gly-GCC-2-7   1
Mus_musculus_tRNA-Gly-GCC-4-1   2
mmu-mir-10a 1
mmu-mir-10b 2
other   2
tRNA-His    1


test.list is data from OP and first column is serial numbers from 1 to 10 from OP.