1
1
Entering edit mode
11.2 years ago
molbioguy ▴ 20

I have an interesting (to me) problem that I'm not sure how to approach. I have a library of randomized short inserts (21 nt) that has been sequenced using the SOLiD platform, with 25 nt reads. The insert will be at the very start of the reads. I want to count the distinct insert sequences. The straightforward way appears to be to convert the reads to fastq, filter based on quality, and count in base space. I'm worried about errors, as I have no way of checking for them that I can see, other than the last 4 nt (22-25) which should be identical in all reads. Any suggestions or interesting approaches to accomplish this? I am not experienced with NGS projects, so sorry if this is a dumb question.

solid counts • 1.9k views
2
Entering edit mode
11.2 years ago
Farhat ★ 2.9k

I would suggest not converting to basespace. Follow the pipeline you suggested except skip the conversion to basespace part at the end. One way of counting the number of distinct inserts would be the following Python code.

import sys

inserts=set()
linenum=0
for line in open(sys.argv[1], 'r'):
linenum+=1
if linenum%4==2:

print len(inserts)


Save the code as inscount.py and run it as python inscount.py infile.fastq. Do the quality filter on the fastq file before counting the number of distinct inserts.

1
Entering edit mode

@Farhat Thanks. I always considered the difficulties of colorspace, but forgot that it could be useful to stay in there as long as possible. I need to count the frequencies of distinct sequence inserts (which I guess I wasn't so clear about in my question), rather than the total count of distinct sequences. But your answer was helpful in steering me away from premature conversion to basespace. And I can easily modify the set based approach to a hash based one to get what I want.