Counting Unaligned Distinct Colorspace Reads
1
1
Entering edit mode
12.7 years ago
molbioguy ▴ 20

I have an interesting (to me) problem that I'm not sure how to approach. I have a library of randomized short inserts (21 nt) that has been sequenced using the SOLiD platform, with 25 nt reads. The insert will be at the very start of the reads. I want to count the distinct insert sequences. The straightforward way appears to be to convert the reads to fastq, filter based on quality, and count in base space. I'm worried about errors, as I have no way of checking for them that I can see, other than the last 4 nt (22-25) which should be identical in all reads. Any suggestions or interesting approaches to accomplish this? I am not experienced with NGS projects, so sorry if this is a dumb question.

solid counts • 2.1k views
ADD COMMENT
2
Entering edit mode
12.7 years ago
Farhat ★ 2.9k

I would suggest not converting to basespace. Follow the pipeline you suggested except skip the conversion to basespace part at the end. One way of counting the number of distinct inserts would be the following Python code.

import sys

inserts=set()
linenum=0
for line in open(sys.argv[1], 'r'):
    linenum+=1
    if linenum%4==2:
        inserts.add(line[:21])

print len(inserts)

Save the code as inscount.py and run it as python inscount.py infile.fastq. Do the quality filter on the fastq file before counting the number of distinct inserts.

ADD COMMENT
1
Entering edit mode

@Farhat Thanks. I always considered the difficulties of colorspace, but forgot that it could be useful to stay in there as long as possible. I need to count the frequencies of distinct sequence inserts (which I guess I wasn't so clear about in my question), rather than the total count of distinct sequences. But your answer was helpful in steering me away from premature conversion to basespace. And I can easily modify the set based approach to a hash based one to get what I want.

ADD REPLY

Login before adding your answer.

Traffic: 2958 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6