How to edit large BED files to keep only the peaks on a particular chromosome?
1
1
4.5 years ago

Hi,

I am using data sets from the ENCODE consortium for my package development. The actual peak files are rather big, so I can't ship them with my package: the package produced by R CMD build must be less than 4 MB on disk, so I need a rather small peak file as example data. In the ENCODE data sets, each peak file contains around 100,000 peaks. How can I edit these large BED files to keep only a particular chromosome? Are there any handy tools for editing peak files? Thanks in advance :)

Best regards :

Jurat

R ChIP-Seq genome peak encode • 1.2k views
2

You could provide data for one chromosome. Choose the one important for your application.
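
A minimal sketch of this suggestion, assuming a plain tab-separated BED file whose first column is the chromosome name. The file names and the choice of chr21 are placeholders, not from the thread; the snippet builds a tiny demo input so it runs on its own.

```shell
# Hypothetical demo input; replace it with a real ENCODE peak file.
workdir=$(mktemp -d)
printf 'chr1\t100\t200\tpeak1\nchr21\t300\t400\tpeak2\nchr21\t500\t600\tpeak3\nchr2\t50\t80\tpeak4\n' > "$workdir/input.bed"

# Assumption: chr21 is only an example; pick whichever chromosome suits the package.
chrom="chr21"

# Keep only the lines whose first column matches the chosen chromosome exactly.
awk -F'\t' -v chr="$chrom" '$1 == chr' "$workdir/input.bed" > "$workdir/input.$chrom.bed"

# Report how many peaks were kept.
wc -l < "$workdir/input.$chrom.bed"
```

The exact match on column 1 avoids accidentally keeping chr2 or chr21 when chr1-style prefix matching is used.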

0

@Goutham Atla: Thanks. The peak files were already constructed in a robust way and stored as BED files, so I don't think I need to pick an important chromosome; I think taking a sample could be an option. Should I take a sample from each chromosome? How can I do that? Could you elaborate on your answer, please? I'm sorry if my question is too simple.

0

When you say "sample from each chromosome", do you mean from a BAM file?

0

@Goutham Atla: I mean the BED file; all peaks are stored in BED format. Thanks.

0

I think it would be better to pick just one chromosome rather than sampling peaks from the whole genome. If you sample from the whole genome, you artificially increase the distance between peaks, which may or may not be a concern.

By the way, a ChIP-Seq file of 100,000 peaks is quite extreme; most of them should be on the order of a few thousand peaks (say 1,000 to 30,000). Are you sure you are looking at ChIP-Seq for transcription factors rather than FAIRE-Seq or nucleosomes?

0

@dariober : Yes, I am sure that I am looking at ChIP-Seq for TFBS. Thanks

2
4.5 years ago

If you have GNU Parallel installed, you can use this with BEDOPS bedextract to very quickly split a BED file by chromosome:

    $ bedextract --list-chr input.bed | parallel "bedextract {} input.bed > input.{}.bed"

You can then use my sample utility or GNU shuf to uniformly sample without replacement:

    $ sample -k ${SAMPLE_SIZE} input.chrN.bed > input.chrN.sample.bed

Or:

    $ shuf --head-count=${SAMPLE_SIZE} input.chrN.bed > input.chrN.sample.bed
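
If BEDOPS or GNU Parallel is not available, the per-chromosome split can be approximated with plain awk. This is only a sketch under assumptions, not the BEDOPS method: it assumes a tab-separated BED file with the chromosome in column 1, and the demo file and output prefix are made up for illustration.

```shell
# Hypothetical demo input; replace it with a real peak file.
workdir=$(mktemp -d)
printf 'chr1\t100\t200\nchr2\t50\t90\nchr1\t300\t400\nchrX\t10\t60\n' > "$workdir/input.bed"

# Write every line to a per-chromosome file named after column 1,
# e.g. input.chr1.bed, input.chr2.bed, input.chrX.bed.
awk -F'\t' -v prefix="$workdir/input." '{ out = prefix $1 ".bed"; print > out }' "$workdir/input.bed"

# List the files that were produced.
ls "$workdir"
```

This version does not depend on the order of the input lines, but it keeps one output file open per chromosome, so it is meant for ordinary genomes with a modest number of chromosome names.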

0

Dear Alex :

Thanks for the kind instructions. How can I easily use the BEDOPS tools on Windows? I intend to take a sample (around 1000 features) from each BED file and store these samples as BED files for further use. Could you show me how to use the BEDOPS tools to get this example data quickly? Thank you very much :)

Best regards :

Jurat
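
One possible way to take about 1000 features from each BED file, sketched with GNU shuf and sort only (for example under Cygwin or Linux). The directory names, file names, and generated demo peaks are all hypothetical; only the shuf call mirrors the answer above.

```shell
# Hypothetical layout: input peak files in peaks/, sampled files in examples/.
workdir=$(mktemp -d)
mkdir "$workdir/peaks" "$workdir/examples"

# Generate two demo peak files with 3000 made-up intervals each.
for name in sampleA sampleB; do
  awk -v OFS='\t' 'BEGIN { for (i = 1; i <= 3000; i++) print "chr" (i % 3 + 1), i * 500, i * 500 + 150 }' > "$workdir/peaks/$name.bed"
done

SAMPLE_SIZE=1000

# Sample without replacement from each file, then restore chromosome/start order.
for f in "$workdir"/peaks/*.bed; do
  out="$workdir/examples/$(basename "$f" .bed).sample.bed"
  shuf --head-count="$SAMPLE_SIZE" "$f" | sort -k1,1 -k2,2n > "$out"
done

# Show the number of features in each sampled file.
wc -l "$workdir"/examples/*.sample.bed
```

If a file has fewer lines than SAMPLE_SIZE, shuf simply returns all of them, so small files pass through unchanged apart from the sorting.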

0

@Alex Reynolds: I don't have the GNU tools and am not familiar with using the BEDOPS tools. For my problem, is there a list of commands that I could try directly on a Windows machine? It is a bit urgent for me to generate small example data. BEDOPS surely has a lot of features to learn. Is there any quick solution available? Thanks again for your kind help.

Best regards :

Jurat

0

If you want to run Unix tools on Windows, you might try running Cygwin or setting up VirtualBox with Linux.