Question: Position-specific sequence retrieval from a whole genome sequence
3.9 years ago, psiwach29 wrote:

From the complete genome sequence of E. coli, I am extracting the 100 nucleotides upstream and the 50 nucleotides downstream of each TSS. The TSS positions on the forward and reverse strands are available. The sequence from NCBI is the forward strand (5' to 3', left to right), so retrieving the upstream (left of the TSS) and downstream (right of the TSS) sequences is straightforward. For the reverse strand, my reasoning is as follows: I build the complementary strand of the forward strand. It runs 3' to 5' from left to right, and the positions do not change (i.e. the first nucleotide of the forward strand from the 5' end is the first nucleotide of the complementary strand from the 3' end). So for upstream sequences we take the sequence to the right of the TSS, and for downstream sequences we take the sequence to the left of the TSS. *I need to know whether I am proceeding in the right way.*

3.9 years ago, shenwei356 wrote:

Your strategy is right. There are some tools that can help you achieve this.
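The coordinate logic of that strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of SeqKit: `tss_window` and `reverse_complement` are hypothetical helper names, the TSS position is assumed to be 1-based on forward-strand coordinates, the TSS base itself is included in the output, and the genome is treated as linear (no wrap-around for a circular chromosome). Note that the window for a reverse-strand TSS is reverse-complemented so that it reads 5' to 3', like the forward-strand windows.

```python
# Hypothetical helpers for illustration; not part of SeqKit or any library.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement so the result reads 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def tss_window(genome: str, tss: int, strand: str,
               upstream: int = 100, downstream: int = 50) -> str:
    """Extract upstream + TSS base + downstream around a TSS.

    genome: forward strand, 5' to 3' (as downloaded from NCBI).
    tss:    1-based position in forward-strand coordinates (assumption).
    strand: "+" or "-".
    """
    if strand == "+":
        # Upstream lies to the left, downstream to the right.
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # Mirrored: upstream lies to the right, downstream to the left.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")

    # Clip to the sequence bounds (assumption: linear, no circular wrap).
    start = max(start, 1)
    end = min(end, len(genome))

    window = genome[start - 1:end]  # 1-based inclusive -> Python slice
    return window if strand == "+" else reverse_complement(window)


genome = "AACCGGTTAC"
print(tss_window(genome, 5, "+", upstream=2, downstream=1))  # CCGG
print(tss_window(genome, 5, "-", upstream=2, downstream=1))  # ACCG
```

For the question as asked, the defaults (`upstream=100`, `downstream=50`) would be used with the real TSS positions and strands.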

Here is a solution using the subseq command (see usage) of SeqKit. SeqKit provides executable binaries for Windows, Linux, and Mac: just download the .tar.gz file, decompress it, and run.



$ cat seq.fa 

The GTF file; note that tss1 is on the negative strand.

$ cat f.gtf 
seq     test    CDS     4       6       .       +       .       gene_id "cds1"; transcript_id "cds1"; 
seq     test    TSS     5       7       .       -       .       gene_id "tss1"; transcript_id "tss1";

1) Retrieving TSS sequences

$ ./seqkit subseq --gtf f.gtf --feature TSS  seq.fa
>seq_5-7:- tss1

2) Retrieving TSS sequences along with up- and/or down-stream sequences

$ ./seqkit subseq --gtf f.gtf --feature TSS --up-stream 3 --down-stream 2 seq.fa 
>seq_5-7:-_us:3_ds:2 tss1

Note: an earlier bug, in which the sequence header did not include the down-stream information ("ds"), was fixed in v0.4.2.

3) Retrieving only the up- or down-stream sequence

$ ./seqkit subseq --gtf f.gtf --feature TSS --up-stream 3 --only-flank seq.fa 
>seq_5-7:-_usf:3 tss1

$ ./seqkit subseq --gtf f.gtf --feature TSS --down-stream 2 --only-flank seq.fa 
>seq_5-7:-_dsf:2 tss1

SeqKit also supports BED files, but only the chromosome, position, and strand information is used.
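If the TSS positions are available only as a list of coordinates rather than as a GTF file, one option is to write them out as BED lines first. The sketch below is an assumption-laden illustration: `tss_to_bed_line` is a hypothetical helper, the input positions are assumed to be 1-based, and the output follows the standard six-column BED layout, where the start is 0-based and the end is exclusive.

```python
# Hypothetical helper for illustration; not part of SeqKit.
def tss_to_bed_line(chrom: str, tss: int, strand: str, name: str = ".") -> str:
    """Format a single-base TSS (1-based position) as a 6-column BED line."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # BED intervals are 0-based, half-open: base p (1-based) is [p-1, p).
    fields = [chrom, str(tss - 1), str(tss), name, "0", strand]
    return "\t".join(fields)


print(tss_to_bed_line("seq", 5, "-", "tss1"))
```

The resulting file could then be passed to the subseq command through its BED option in place of the GTF file.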


Thanks a lot. It helped.

Reply written 3.9 years ago by psiwach29