Question: Couting number of similar reads in sam file and output in bed file?
1
gravatar for bobbyle0210
10 months ago by
bobbyle021010
bobbyle021010 wrote:

Hi all, Currently, after mapping, I got a sam file which contained approximately 30 million mapped reads. Then, by using bedtool bamtobed, I was able to obtain a bed file. However, there were so many many similar reads. Therefore, in order to reduce the size of the final bed file, is there any way that I can mention each read only once as well as its number of presentation in the sam file.


For instance:


var_1 0 15 ATGCATGCATGCCGTA


var_1 0 15 ATGCATGCATGCCGTA


var_1 0 15 ATGCATGCATGCCGTA


var_2 5 20 ATGCATGCGGGCCCC


Will become:


var_1 0 15 ATGCATGCATGCCGTA 3


var_2 5 20 ATGCATGCGGGCCCC 1

Thank you in advance!

rna-seq samtools bedtools • 199 views
ADD COMMENTlink modified 10 months ago by Carlo Yague5.0k • written 10 months ago by bobbyle021010
3
gravatar for Carlo Yague
10 months ago by
Carlo Yague5.0k
Canada
Carlo Yague5.0k wrote:

Use uniq -c your bed file.

uniq -c mybed.bed > uniqbed.bed

or

echo "var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_2 5 40 ATGCATGCGGGCCCC" | uniq -c

3 var_1 0 38 ATGCATGCATGCCGTA
1 var_2 5 40 ATGCATGCGGGCCCC
ADD COMMENTlink written 10 months ago by Carlo Yague5.0k

That's so cool, thank you. However, if I want the output in order like this:


var_1 0 15 ATGCATGCATGCCGTA 3


var_2 5 20 ATGCATGCGGGCCCC 1


Is there any command that I can use? Thanks.

ADD REPLYlink written 10 months ago by bobbyle021010
2

Add | awk '{print $2,$3,$4,$5,$1}' to end of original command.

$ echo "var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_2 5 40 ATGCATGCGGGCCCC" | uniq -c |  awk '{print $2,$3,$4,$5,$1}'

Results in

var_1 0 38 ATGCATGCATGCCGTA 3
var_2 5 40 ATGCATGCGGGCCCC 1
ADD REPLYlink written 10 months ago by genomax90k

Oh you beat me by a second :D

ADD REPLYlink written 10 months ago by Carlo Yague5.0k
1

sure, you can just pipe the results in awk, for instance:

echo "var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_1 0 38 ATGCATGCATGCCGTA
var_2 5 40 ATGCATGCGGGCCCC" | uniq -c | awk '{ print $2,$3,$4,$5,$1; }'
ADD REPLYlink written 10 months ago by Carlo Yague5.0k

Hi, I notice that for my bed file (which contains ~30 million reads and ~ 2.5 GB in size), I need to sort my file using sort myfile.bed before using uniq -c. Otherwise uniq -c only will not provide expected result. What could be the reason for that? When will I need to sort my file? Thanks.

ADD REPLYlink written 10 months ago by bobbyle021010

Indeed, uniq works only on sorted files because it only check for duplicate in contiguous rows. So basically, you always need to sort your files (unless they are already sorted).

ADD REPLYlink written 10 months ago by Carlo Yague5.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 960 users visited in the last hour