Sorting And Binning Reads From A Sam File
2
1
Entering edit mode
10.6 years ago
zeropoint1 ▴ 10

Hello everyone,

I'm a chemist branching out into bioinformatics and I may use some of the incorrect nomenclature so I apologize in advance. The following questions concern NGS data from an in vitro selection of DNA aptamers.

What I have:

A library consisting of millions of 50 nt long DNA sequences in SAM format (the files is about 2 Gb). Each sequence has an identical 5 nt at the 3' end and the remaining 45 nt are semi-random. Some sequences should appear only once in the library and some sequences should appear hundreds of times.

What I need to do:

I first need to remove any sequences without the proper 5 nt at the 3' end (for some reason, some of the sequences were misread). I then need to collect identical sequences into groups (called binning, I believe), and then sort the groups by number of sequences. As stated previously, most of the groups will have only a single member but some should be quite large.

That's really it.

If someone could please suggest a method to perform this task, i would be very grateful. I would prefer software compatible with windows 7 but I could also do this on linux. I have access to matlab.

Your help is very much appreciated.

Thanks!

sam ngs sequencing sort • 2.9k views
ADD COMMENT
0
Entering edit mode

FYI if you're going to be doing a lot of NGS work as a bioinformatician, it will be helpful to learn R as many libraries exist that will help you with tasks you'll face. For you it will be easy to learn R if you already know MATLAB. I went from MATLAB to R and it was pretty straightforward.

ADD REPLY
2
Entering edit mode
10.6 years ago

You can use a combination of bioawk and shell scripts to do this like so:

cat data.sam | ~/bin/bioawk -c sam '{ print substr($seq, 5, 45) }' | sort | uniq -c | sort -rg

will give you a count for each sequence

ADD COMMENT
0
Entering edit mode

Badass one-liner

ADD REPLY
2
Entering edit mode
10.6 years ago
export LC_ALL=C
cut -f 10 data.sam | cut -c 5-50 | sort | uniq -c | sort -rg
ADD COMMENT
1
Entering edit mode

Unexpectedly your plain version is slower than the bioawk version.

$ time cat data.sam | ~/bin/bioawk -c sam '{ print substr($seq, 5, 45) }' > /dev/null 

real    0m1.046s
user    0m1.008s
sys    0m0.052s

$ time cut -f 10 data.sam | cut -c 5-50 > /dev/null

real    0m1.288s
user    0m1.764s
sys    0m0.044s

I only ran it to see how much slower the bioawk version was ;-) plus it has the advantage of not including the headers

ADD REPLY
0
Entering edit mode

strange :-)

ADD REPLY

Login before adding your answer.

Traffic: 1736 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6