Question

extracting parts of genome using R

0

Entering edit mode

9.3 years ago

Affan ▴ 300

Background: I downloaded TFBS data from the Riken database and created a PWM for the MEF2 transcription factor. Now I want to test the accuracy of my PWM.

My plan is to consider a 400bp (or maybe 1000bp) neighbourhood around each TFBS. And then run the PWM on this to see the accuracy (ie, the PWM should score the highest in the middle of the segment where the binding site in).

So, what I already have is data that looks like this:

>chr1:6585537-6585547
CTATAAATAG
>chr1:6767854-6767864
CTTTGTTTAG
>chr1:8686282-8686292
CTCTTAATAG

Now based on these locations, does there exist a R package that I can say ExtractDNA(start, stop, size)

Secondary question: once I have my genome segment, how exactly do I run my PWM on it?

Edit: I think the TFBS were from hg18

PWM genome • 2.3k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Affan ▴ 300

Ram · Answer 1 · 2014-12-24

If you're not stuck using R, you might consider using some command-line tools on your FASTA data (say, a file called tfbs.fa), to make a BED file that you can apply set operations on with bedops --range:

$ grep -e ">" tfbs.fa \
    | awk '{ print substr($0, 2); }' - \
    | awk -F ':|-' '{ print $1"\t"$2"\t"$3; }' \
    | sort-bed - \
    | bedops --everything --range 400 - \
    > query.bed

Briefly, this works by filtering the input FASTA file for lines that start with a > (headers that contain your coordinates of interest). This is passed to an awk statement that strips the > character and prints the rest. This result is passed to another awk statement that prints out the chromosome and start and stop positions to tab-delimited lines. This is a minimal BED file format, so we can sort it with sort-bed and symmetrically pad around the edges by 400 bases with bedops --range. Finally, this result is written to a file called query.bed.

Once you have a BED file (here called query.bed) you could convert that back to a FASTA file (e.g. called query.fa) using a tool like bed2fasta.py or with DAS queries to UCSC servers for your build of interest.

With your expanded FASTA file, you could use a tool like FIMO to search query.fa for motifs in the padded regions that match with your PWM.

Ram · Answer 2 · 2014-12-24

0

Entering edit mode

9.3 years ago

Ram 43k

You might wanna use UCSC's DAS server - that has an API to give you sequences given the co-ordinates.

ADD COMMENT • link 2.1 years ago by Ram 43k

0

Entering edit mode

Thanks. For anyone else wondering, here is a related thread: How To Get The Sequence Of A Genomic Region From Ucsc?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Affan ▴ 300

0

Entering edit mode

:) Sorry, I should've included that link. Guess my high carb lunch has induced quite the drowsiness!

ADD REPLY • link 2.1 years ago by Ram 43k