How get intervals of low coverage from a genome sequence?
1
2
Entering edit mode
9.1 years ago
fhsantanna ▴ 610

I have a file with the following format, that relates a determined position of the genome to coverage data:

Sequence position coverage

NC_006510    1    19
NC_006510    2    22
NC_006510    3    44
NC_006510    4    1
NC_006510    5    1
NC_006510    6    0
NC_006510    7    0

... (low coverage positions, under 6)

NC_006510    1000    0
NC_006510    1001   66
NC_006510    1002    66

I would like to get intervals larger than 800 pb with low coverage along a sequence. For example, a desirable output would be like that:

NC_006510 4..1000 #In this position interval, the sequence has low coverage (<6).

Do you know a clever way to do it?

Thanks in advance.

interval low-coverage script • 2.2k views
ADD COMMENT
0
Entering edit mode

What do you mean by 800pb ?

ADD REPLY
6
Entering edit mode
9.1 years ago

If you are looking at 800bp or more stretches with coverage less than or equal to 6X, some thing like following would be useful.

bedtools genomecov -ibam input_sorted.bam -g reference.fa.fai -bg | awk '{ if ($4<=6) print }' | bedtools merge -i - | awk '{ if (($3-$2)>=800 ) print }' | less

Anyway this is not an exact answer to your question, but different approach.

ADD COMMENT
0
Entering edit mode

Very clever, thank you very much!

ADD REPLY
0
Entering edit mode

-bga option is better because computes regions of coverage "0". Also in "bedtools merge" the parameter -i has to be written.

ADD REPLY
0
Entering edit mode

Updated.

ADD REPLY

Login before adding your answer.

Traffic: 2706 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6