Question: How to retrieve sequences given a start and end site
6.1 years ago, Affan wrote:

I have a GFF file with information about MEF2 transcription factor binding sites. Given a start and end site, for example 19815641 - 19815654 on the + strand of ChrX, where exactly do I get the sequence from?

I have 1800 lines in the GFF file, so I can't do it manually. I am looking for an R solution, ideally a function like the following, if one exists:

getSequence(start, end, strand, chr)
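
To make the coordinate handling behind such a function concrete, here is a minimal sketch. It is written in Python rather than R, and it is only an illustration: `get_sequence`, `toy_genome`, and the in-memory dictionary of chromosome sequences are hypothetical names, not part of any existing package. It assumes GFF-style coordinates (1-based, inclusive) and returns the reverse complement for sites on the - strand.

```python
# Hypothetical helper for illustration; not an existing library function.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def get_sequence(genome, chrom, start, end, strand="+"):
    """Return bases start..end of chrom (1-based, inclusive, as in GFF)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    seq = genome[chrom]
    if end > len(seq):
        raise ValueError("end lies beyond the end of the chromosome")
    # GFF is 1-based inclusive; Python slices are 0-based and half-open.
    sub = seq[start - 1:end]
    if strand == "-":
        # Minus-strand sites are reported as the reverse complement.
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub


# Toy "genome" for demonstration only; a real one would be loaded from FASTA.
toy_genome = {"chrX": "ACGTTGCAAG"}
print(get_sequence(toy_genome, "chrX", 2, 5))       # CGTT
print(get_sequence(toy_genome, "chrX", 2, 5, "-"))  # AACG
```

With these conventions, the site 19815641 - 19815654 covers 14 bases, since both endpoints are included.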


The goal is to create a PWM, so my next question is: once I've retrieved my sequences, how do I go about aligning them? What is the best software for aligning 1800 short sequences?
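
As a rough illustration of the PWM step, the sketch below counts bases per position and converts the counts to log-odds scores. It assumes the retrieved sites all have the same length and are already in register, which may not hold for the real data. The function names, the pseudocount of 1, and the uniform background of 0.25 are choices made for this example only.

```python
import math
from collections import Counter


def position_frequency_matrix(sites, alphabet="ACGT"):
    """Count each base at each position of equal-length sites."""
    if not sites:
        raise ValueError("no sites given")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    columns = []
    for i in range(length):
        counts = Counter(s[i].upper() for s in sites)
        columns.append({base: counts.get(base, 0) for base in alphabet})
    return columns


def position_weight_matrix(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    pwm = []
    for col in pfm:
        total = sum(col.values()) + pseudocount * len(col)
        pwm.append({
            base: math.log2((n + pseudocount) / total / background)
            for base, n in col.items()
        })
    return pwm


# Toy sites for demonstration only.
toy_sites = ["ACGT", "ACGA", "ACTT", "TCGT"]
pfm = position_frequency_matrix(toy_sites)
pwm = position_weight_matrix(pfm)
print(pfm[0])  # {'A': 3, 'C': 0, 'G': 0, 'T': 1}
```

If the sites differ in length or are offset from one another, they would need to be aligned or trimmed first; this sketch does not address that.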

Edit: It seems that the command

bedtools getfasta -fi reference.fasta -bed gff.file -fo output.fasta

is what I need, but what's the easiest way to download the hg18 reference genome?


You can use samtools faidx to retrieve sequences from a reference genome using coordinates.

samtools faidx <ref.fasta> [region1 [...]]

Index a reference sequence in the FASTA format, or extract a subsequence from an indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the RAZF format.
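
Regions are passed to samtools faidx as strings of the form chr:start-end, with 1-based inclusive coordinates, which matches the GFF convention. The sketch below shows one way the 1800 GFF lines could be turned into such region strings. The function name `gff_to_regions` and the toy GFF line (including its source and type columns) are made up for this example; only the coordinates come from the question. The chromosome names must match the headers in the reference FASTA, and the extracted sequence is from the forward strand, so - strand sites would still need to be reverse complemented.

```python
def gff_to_regions(gff_lines):
    """Turn GFF lines into (region, strand) pairs, region as chr:start-end."""
    regions = []
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # not a complete GFF record
        # GFF columns: seqid, source, type, start, end, score, strand, ...
        chrom, start, end, strand = (
            fields[0], int(fields[3]), int(fields[4]), fields[6]
        )
        regions.append((f"{chrom}:{start}-{end}", strand))
    return regions


# Toy input for demonstration; source and type columns are invented.
toy_gff = [
    "##gff-version 3",
    "chrX\texample\tbinding_site\t19815641\t19815654\t.\t+\t.\tID=site1",
]
for region, strand in gff_to_regions(toy_gff):
    print(region, strand)  # chrX:19815641-19815654 +
```

Each region string can then be given as an argument after the reference file name in the samtools faidx call shown above.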

written 6.1 years ago by Ashutosh Pandey