Question: how to retreive sequences given a start and end site.
1
Affan • 290 wrote:
So I have a gff that has information about MEF2 transcription factor binding sites. So given a start and end site, 19815641 - 19815654 on the + strand on ChrX, where exactly do I get the sequence from?
I have 1800 lines in the gff file, so I cant do it manually. I am looking for a R solution, so basically if something like the following function exists
getSequence(start, end, strand, chr)
The goal is to create a PWM so my next question is that once I've retrieved my sequences, how do I go about aligning them? what is the best software to align 1800 short sequences?
Edit: It seems like the
bedtools getfasta -fi reference.fasta -bed gff.file -fo output.fasta
is what I need, but whats the easiest way to download hg18 reference genome?
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/
You can use samtools faidx to retrieve sequences from reference genome using coordinates.
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create<ref.fasta>.fai on the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the RAZF format.