Question: awk one line command
1
gravatar for prasoonagar
10 months ago by
prasoonagar10
prasoonagar10 wrote:

I want to break my bed file into the chromosome-specific files containing the data of respective chromosome so I used the following command

 for chr in {1..22}; do awk -v chr=$chr '$1 == chr' data.txt > data_chr$chr.txt; done

The issue is that the files created do not contain any information.

My bed file has six columns and is having the following format

chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
chr1    10165   10215   382_570_1514    9   +
chr1    10170   10196   664_789_434 1   +
chr1    10171   10221   439_96_1067 10  +
chr1    10174   10223   378_525_47  5   +
chr1    10174   10224   94_430_1386 6   +
chr1    10217   10279   4_218_1107  5   +
chr1    10224   10267   138_1957_415    59  -
chr1    10242   10272   638_1023_1664   4   +

Any suggestions will be highly appreciated

awk • 703 views
ADD COMMENTlink modified 10 months ago by cpad01129.3k • written 10 months ago by prasoonagar10
6
gravatar for Alex Reynolds
10 months ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

You need to prefix the chr variable with the string literal chr, as right now you are asking awk to match the first field on numbers (1 through 22) and not chromosome names (chr1 through chr22).

$ for chr in {1..22}; do awk -v chr="chr${chr}" ...

But BEDOPS offers a much faster and simpler way of splitting BED files with bedextract, which uses a binary search approach on sorted BED files to jump to the start of a chromosome:

$ fn=intervals.bed
$ for chr in `bedextract --list-chr ${fn}`; do bedextract ${chr} ${fn} > ${fn}.${chr}.bed; done

With awk or grep, you need to re-read through the file linearly on every chromosome, which wastes time, especially on very large, whole-genome scale BED files. Use BEDOPS if your time is valuable to you.

ADD COMMENTlink modified 10 months ago • written 10 months ago by Alex Reynolds26k
2
gravatar for cpad0112
10 months ago by
cpad01129.3k
India
cpad01129.3k wrote:

Using awk (same thing can be done using parallel. Please look at the end of the post for parallel):

$ awk -F "\t" '{print >$1".bed"}' test.bed

I modified OP bed as follows to split: input:

$ cat test.bed
chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
chr1    10165   10215   382_570_1514    9   +
chr1    10170   10196   664_789_434 1   +
chr1    10171   10221   439_96_1067 10  +
chr2    10174   10223   378_525_47  5   +
chr2    10174   10224   94_430_1386 6   +
chr3    10217   10279   4_218_1107  5   +
chr3    10224   10267   138_1957_415    59  -
chr3    10242   10272   638_1023_1664   4   +

output:

$ ls chr*
chr1.bed  chr2.bed  chr3.bed

post splitting, chr1.bed:

$ cat chr1.bed 
chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
chr1    10165   10215   382_570_1514    9   +
chr1    10170   10196   664_789_434 1   +
chr1    10171   10221   439_96_1067 10  +

using parallel:

$ seq 22 | parallel 'grep  chr{} input.bed > chr{}.bed'
ADD COMMENTlink modified 10 months ago • written 10 months ago by cpad01129.3k
2

Note that grep chr1 input.bed will return records for chr1, chr10, chr11, etc. and grep chr2 input.bed will return records for chr2, chr20, and chr21, so the output for those search keys will contain more intervals than probably expected and is likely incorrect.

ADD REPLYlink modified 10 months ago • written 10 months ago by Alex Reynolds26k

That is correct. But it can be improved with -w option:

$ seq 22 | parallel "grep  -w chr{} test.bed > chr{}.bed"

Example bed:

 $ cat test.bed
chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
chr2    10165   10215   382_570_1514    9   +
chr2    10170   10196   664_789_434 1   +
chr3    10171   10221   439_96_1067 10  +
chr3    10174   10223   378_525_47  5   +
chr10   10174   10224   94_430_1386 6   +
chr10   10217   10279   4_218_1107  5   +
chr20   10224   10267   138_1957_415    59  -
chr20   10242   10272   638_1023_1664   4   +

After execution:

$ cat chr1.bed 
chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
$ cat chr10.bed
chr10   10174   10224   94_430_1386 6   +
chr10   10217   10279   4_218_1107  5   +
$ cat chr2
chr20.bed  chr21.bed  chr22.bed  chr2.bed   
$ cat chr2.bed 
chr2    10165   10215   382_570_1514    9   +
chr2    10170   10196   664_789_434 1   +
$ cat chr20.bed 
chr20   10224   10267   138_1957_415    59  -
chr20   10242   10272   638_1023_1664   4   +
ADD REPLYlink modified 10 months ago • written 10 months ago by cpad01129.3k

This perfectly working thanks but if there are multiple files then how can parallel be used where each file need to be labeled with original file name and the chromosome name.

ADD REPLYlink written 10 months ago by prasoonagar10

If original file name is "test.bed", and you want to append this to every chromosome file to be created, then command would be:

$ seq 22 | parallel "grep  -w chr{} test.bed > test.bed.chr{}.bed"

This would create multiple files with original name and the chromosome name.

ADD REPLYlink written 10 months ago by cpad01129.3k
1
gravatar for genomax
10 months ago by
genomax57k
United States
genomax57k wrote:

Have you tried grep instead of awk? grep -w "chr1" yourfile > chr1.bed and grep -w "chr11" yourfile > chr11.bed etc.

ADD COMMENTlink modified 10 months ago • written 10 months ago by genomax57k
1
gravatar for Charles Plessy
10 months ago by
Charles Plessy2.5k
Japan
Charles Plessy2.5k wrote:

To split the BED file in multiple files, with a single awk command, without for loops:

$ cat toto.bed 
chr1    1   2
chr1    1   3
chr2    1   4
chr3    4   5
$ awk '
  NR == FNR {chr[$1] ; next}
  {for (c in chr) {
    if ($1 == c) {
      print >> c".bed"
      close(c".bed")
      next
  }}}' toto.bed toto.bed
$ head chr*
==> chr1.bed <==
chr1    1   2
chr1    1   3

==> chr2.bed <==
chr2    1   4

==> chr3.bed <==
chr3    4   5

With this syntax, your file only needs to be read twice. Big thanks to https://stackoverflow.com/questions/8363718/grep-to-multiple-output-files/8363776#8363776.

ADD COMMENTlink modified 10 months ago • written 10 months ago by Charles Plessy2.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1804 users visited in the last hour