Question

How to retrieve sequences given a start and end site

10.6 years ago

Affan ▴ 310

I have a GFF file with information about MEF2 transcription factor binding sites. Given a start and end site, e.g. 19815641-19815654 on the + strand of chrX, where exactly do I get the sequence from?

I have 1800 lines in the GFF file, so I can't do it manually. I am looking for an R solution, so basically I'm asking whether something like the following function exists:

getSequence(start, end, strand, chr)
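To pin down what such a function has to do, here is a minimal sketch in Python rather than R. It assumes the chromosome sequences are already loaded into a dict (the `genome` argument and the toy sequence are hypothetical), and that coordinates follow the GFF convention: 1-based, with both start and end included. For the - strand it returns the reverse complement.

```python
# Complement table used to reverse-complement minus-strand features.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def get_sequence(genome, chrom, start, end, strand="+"):
    """Return the sequence for a GFF-style interval.

    genome     : dict mapping chromosome name -> sequence string
                 (assumed to be loaded beforehand, e.g. from a FASTA file)
    start, end : 1-based, inclusive coordinates, as in GFF
    strand     : "+" or "-"; "-" returns the reverse complement
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    seq = genome[chrom]
    if end > len(seq):
        raise ValueError("end lies beyond the chromosome")
    # Convert 1-based inclusive coordinates to a 0-based half-open slice.
    sub = seq[start - 1:end]
    if strand == "-":
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub


# Toy example; a real genome would be read from the hg18 FASTA.
toy_genome = {"chrX": "AACCGGTTAC"}
plus_site = get_sequence(toy_genome, "chrX", 1, 4, "+")
minus_site = get_sequence(toy_genome, "chrX", 1, 4, "-")
```

The off-by-one handling is the part that is easy to get wrong: a GFF interval 1-4 covers four bases, so the slice is `seq[0:4]`.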

The goal is to create a PWM, so my next question is: once I've retrieved my sequences, how do I go about aligning them? What is the best software to align 1800 short sequences?

Edit: It seems like the

bedtools getfasta -fi reference.fasta -bed gff.file -fo output.fasta

is what I need, but what's the easiest way to download the hg18 reference genome?

alignment sequence • 2.3k views

updated 3.4 years ago by Ram 45k • written 10.6 years ago by Affan ▴ 310

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/

You can use samtools faidx to retrieve sequences from a reference genome by coordinates.

samtools faidx <ref.fasta> [region1 [...]]

Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the RAZF format.
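samtools faidx takes regions written as chr:start-end, with 1-based inclusive coordinates like GFF, so the GFF rows can be turned into region strings with a few lines of script. A minimal Python sketch, assuming a standard 9-column tab-separated GFF; the function name and the example row are made up for illustration.

```python
def gff_to_regions(lines):
    """Yield 'chrom:start-end' region strings from GFF lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip rows too short to hold seqid, start and end.
        if len(fields) < 5:
            continue
        # GFF columns: 1 = seqid, 4 = start, 5 = end (1-based, inclusive).
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        yield f"{chrom}:{start}-{end}"


# Hypothetical GFF content; a real run would iterate over the open file.
example = [
    "##gff-version 3",
    "chrX\tsource\tbinding_site\t19815641\t19815654\t.\t+\t.\tID=site1",
]
regions = list(gff_to_regions(example))
```

The resulting strings can be given as the region arguments after the FASTA file name. Region strings carry no strand, so sites on the - strand would still need to be reverse-complemented afterwards.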

updated 3.4 years ago by Ram 45k • written 10.6 years ago by Ashutosh Pandey 12k