From the complete genome sequence of E. coli, I am extracting the 100 nucleotides upstream and the 50 nucleotides downstream of each TSS. The TSS positions on the forward and reverse strands are available. The sequence from NCBI is the forward strand (5' to 3' from left to right), so retrieving the upstream (left of the TSS) and downstream (right of the TSS) sequences is straightforward. For the reverse strand, my reasoning is as follows: take the complement of the forward strand. It runs 3' to 5' from left to right, because positions do not change (i.e. the first nucleotide from the 5' end of the forward strand pairs with the first nucleotide from the 3' end of the complementary strand). So for upstream sequences we take the sequence to the right of the TSS, and for downstream sequences we take the sequence to the left of the TSS. *I need to know whether I am proceeding in the right way.*
Your strategy is right. There are tools that can help you achieve this, for example SeqKit. Here is a toy sequence to demonstrate:
    $ cat seq.fa
    >seq
    actgnACTGN
The GTF file; note that tss1 is on the negative strand.
    $ cat f.gtf
    seq test CDS 4 6 . + . gene_id "cds1"; transcript_id "cds1";
    seq test TSS 5 7 . - . gene_id "tss1"; transcript_id "tss1";
1) Retrieving TSS sequences
    $ ./seqkit subseq --gtf f.gtf --feature TSS seq.fa
    >seq_5-7:- tss1
    GTn
2) Retrieving TSS sequences along with upstream and/or downstream sequences
    $ ./seqkit subseq --gtf f.gtf --feature TSS --up-stream 3 --down-stream 2 seq.fa
    >seq_5-7:-_us:3_ds:2 tss1
    NCAGTnca
~~Here's a bug: the sequence header does not include the down-stream information ("ds"). I'll fix this soon.~~ Fixed in v0.4.2.
3) Retrieving only the upstream or only the downstream sequence
    $ ./seqkit subseq --gtf f.gtf --feature TSS --up-stream 3 --only-flank seq.fa
    >seq_5-7:-_usf:3 tss1
    NCA

    $ ./seqkit subseq --gtf f.gtf --feature TSS --down-stream 2 --only-flank seq.fa
    >seq_5-7:-_dsf:2 tss1
    ca
SeqKit also supports BED files, but only the chromosome, position and strand information are used.
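
If you would rather do the coordinate arithmetic yourself, the strategy from the question can be sketched in plain shell. This is only a rough illustration, not how SeqKit is implemented. It assumes a single 1-based TSS position, a linear sequence held in a variable (wrap-around on the circular E. coli chromosome is not handled, the window is simply clamped at position 1), and that `rev`, `cut` and `tr` are available. The names `revcomp` and `tss_flank` are made up for this example. Note that the window is taken on the forward strand and, for a minus-strand TSS, reverse-complemented at the end so the result reads 5' to 3'.

```shell
#!/usr/bin/env bash

# Reverse-complement stdin; N/n and any other symbols are left unchanged.
revcomp() { rev | tr 'ACGTacgt' 'TGCAtgca'; }

# tss_flank SEQ TSS STRAND UP DOWN   (hypothetical helper)
# Prints UP bases upstream, the TSS base, and DOWN bases downstream,
# always written 5' to 3' on the strand the TSS belongs to.
tss_flank() {
  local seq=$1 tss=$2 strand=$3 up=$4 down=$5 start end
  if [ "$strand" = "+" ]; then
    start=$((tss - up)); end=$((tss + down))   # upstream is to the left
  else
    start=$((tss - down)); end=$((tss + up))   # upstream is to the right
  fi
  if [ "$start" -lt 1 ]; then start=1; fi      # clamp; cut clamps the end
  if [ "$strand" = "+" ]; then
    printf '%s\n' "$seq" | cut -c"${start}-${end}"
  else
    printf '%s\n' "$seq" | cut -c"${start}-${end}" | revcomp
  fi
}

tss_flank actgnACTGN 6 + 3 2   # prints tgnACT
tss_flank actgnACTGN 6 - 3 2   # prints CAGTnc
```

For the real data you would call it with 100 and 50 instead of 3 and 2, once per TSS. For a whole genome, SeqKit as shown above is the more practical route.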