Is there a more elegant way to print the list of all regions from a bed file?
Entering edit mode
11 months ago
Axzd ▴ 80

Hello, I have a bed file structured so

Chrom_3 0   20354777
Chrom_1 0   18146847
Chrom_5 0   16930519

I want this output (switch from 0based to 1-based)

Chrom_3 1
Chrom_3 2
Chrom_3 3
Chrom_3 4
Chrom_3 5
Chrom_3 6
Chrom_3 7
Chrom_3 8
Chrom_3 9

The solution I found is

bedtools makewindows -b vaga.bed -w 1 |awk '{print $1 "\t" $2}' > candidate_sites

It works but it seems to me awk alone might have been able to do it. I am very used to bedtools but I am sure there is an awkish way to do that. I just wonder how you deal with the "for each chromosome". Thanks

awk • 791 views
Entering edit mode

What is the third number in first block for? You want the output example go up to the number but the output stays two-column as shown?

Entering edit mode

yes, the first file is a 0 based chrom_x start end and I need a file with each line giving a single positional point
I know, it's not the sharpest bioinformatics question ^^'

Entering edit mode
11 months ago

What exactly do you want to achieve? Take the record Chrom_3 0 20354777, move it by 1 and split it up to 20354776 intervals of 1bp?

If so, I recommend to stick with bedtools. awk is a general purpose software that can do amazing things, but bedtools is aware that you are working with genomic intervals. For example, bedtools shift takes a reference genome file and ensures that you are not shifting your intervals to positions larger than your scaffold. Unlikely to happen when shifting just by 1bp, but otherwise handy.

But if you want to use awk, you can of course:

awk 'BEGIN{OFS="\t"} /Chrom_3/ {i = $2+1; do { print $1,i; ++i } while (i <= $3) }'

This example comprises some optional elements you might or might not want to use:

  • BEGIN{OFS="\t"} declares that all fields should be tab-separated in the output. Very handy when you want to print many columns you do not explicitly want to type the seperator for.
  • /Chrom_3/ matches only records on chromosome 3. Partial matches are also possible with e.g. $1 ~ "_3"
  • The remainder is just the loop. How you add the offset doesn't matter - directly to i or later when printing.
Entering edit mode

I think my real question should have been " is it better to make that operation with bedtools or pure awk"? And you answered beautifully.

EDIT: actually I would have needed for several chromosomes, Chrom_1 and 5 as in the example, so I was thinking about some kind of nested loop "for each Chromosom do {i = $2+1; do { print $1,i; ++i } while (i <= $3) " your solution would have also worked by rerunning the command with the different chromosomes and cat all the result files.

but it's not important now.


Login before adding your answer.

Traffic: 3073 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6