Question: position specific sequence retrieval from whole genome sequence
psiwach29 wrote (2.3 years ago):

From the complete genome sequence of E. coli, I am extracting the 100 nucleotides upstream and the 50 nucleotides downstream of each TSS. The positions of the TSSs on the forward and reverse strands are available. The sequence from NCBI is the forward strand (5' to 3', left to right), so retrieving the upstream (left of the TSS) and downstream (right of the TSS) sequences is straightforward. For the reverse strand, my reasoning is as follows: take the complement of the forward strand. It runs 3' to 5' from left to right, because positions do not change (i.e. the first nucleotide from the 5' end of the forward strand pairs with the first nucleotide from the 3' end of the complementary strand). So for the upstream sequence we take the bases to the right of the TSS, and for the downstream sequence we take the bases to the left of the TSS. *I need to know whether I am proceeding in the right way.*

shenwei356 wrote (2.3 years ago):

Your strategy is right. There are some tools that can help you achieve this.
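To make the coordinate logic concrete, here is a minimal Python sketch of the strategy from the question. It is only an illustration: the function names `reverse_complement` and `tss_window` are made up for this example, the toy genome is invented, and including the TSS base itself in the window is an assumption. The one step worth spelling out is that a minus-strand region has to be reversed as well as complemented so that it reads 5' to 3'.

```python
# Complement table covering upper- and lower-case bases; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, i.e. the minus strand read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def tss_window(genome: str, tss: int, strand: str, up: int = 100, down: int = 50) -> str:
    """Return `up` bases upstream + the TSS base + `down` bases downstream.

    `tss` is a 1-based forward-strand coordinate. Including the TSS base
    is an assumption of this sketch. Windows are clipped at the sequence
    ends; a circular genome would need wrap-around handling instead.
    """
    if strand == "+":
        start, end = tss - up, tss + down   # upstream lies to the left
    elif strand == "-":
        start, end = tss - down, tss + up   # upstream lies to the right
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 1)
    end = min(end, len(genome))
    region = genome[start - 1:end]          # 1-based inclusive -> Python slice
    return region if strand == "+" else reverse_complement(region)


# Hypothetical toy genome, TSS at position 6, 3 bases up and 2 bases down.
genome = "AACCGGTTACGT"
print(tss_window(genome, 6, "+", up=3, down=2))  # CCGGTT
print(tss_window(genome, 6, "-", up=3, down=2))  # TAACCG
```

Since the E. coli chromosome is circular, a TSS within 100 nt of either end of the FASTA record would get a truncated window with this sketch.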

Here is a solution using the subseq command (see usage) of SeqKit, which provides executable binaries for Windows/Linux/Mac: just download the .tar.gz file, decompress it, and run.

Example:

Sequence:

$ cat seq.fa 
>seq
actgnACTGN

GTF file; note that tss1 is on the negative strand.

$ cat f.gtf 
seq     test    CDS     4       6       .       +       .       gene_id "cds1"; transcript_id "cds1"; 
seq     test    TSS     5       7       .       -       .       gene_id "tss1"; transcript_id "tss1";

1) Retrieving TSS sequences

$ ./seqkit subseq --gtf f.gtf --feature TSS  seq.fa
>seq_5-7:- tss1
GTn

2) Retrieving TSS sequences along with upstream and/or downstream sequences

$ ./seqkit subseq --gtf f.gtf --feature TSS --up-stream 3 --down-stream 2 seq.fa 
>seq_5-7:-_us:3_ds:2 tss1
NCAGTnca

~~Here's a bug: the sequence header does not include downstream information ("ds"). I'll fix this soon.~~ Fixed in v0.4.2.

3) Retrieving only the upstream or only the downstream sequence

$ ./seqkit subseq --gtf f.gtf --feature TSS --up-stream 3 --only-flank seq.fa 
>seq_5-7:-_usf:3 tss1
NCA

$ ./seqkit subseq --gtf f.gtf --feature TSS --down-stream 2 --only-flank seq.fa 
>seq_5-7:-_dsf:2 tss1
ca
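The outputs above can be cross-checked by hand. The following Python sketch is an independent re-derivation of the strand logic, not SeqKit's actual implementation; the function name `feature_with_flanks` and its parameters are hypothetical. It treats the feature as a 1-based inclusive interval, as in the GTF file.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse complement, preserving case."""
    return seq.translate(COMPLEMENT)[::-1]


def feature_with_flanks(seq, start, end, strand, up=0, down=0, only_flank=False):
    """Strand-aware slice of a 1-based inclusive feature [start, end].

    With only_flank=True, exactly one of `up`/`down` is expected to be
    non-zero and only that flank is returned.
    """
    # On the minus strand, upstream lies at higher forward coordinates.
    left, right = (up, down) if strand == "+" else (down, up)
    if only_flank:
        if left and right:
            raise ValueError("give only one of up/down with only_flank")
        if right:
            lo, hi = end + 1, end + right
        else:
            lo, hi = start - left, start - 1
    else:
        lo, hi = start - left, end + right
    lo, hi = max(lo, 1), min(hi, len(seq))   # clip at the sequence ends
    region = seq[lo - 1:hi]
    return region if strand == "+" else revcomp(region)


seq = "actgnACTGN"
print(feature_with_flanks(seq, 5, 7, "-"))                              # GTn
print(feature_with_flanks(seq, 5, 7, "-", up=3, down=2))                # NCAGTnca
print(feature_with_flanks(seq, 5, 7, "-", up=3, only_flank=True))       # NCA
print(feature_with_flanks(seq, 5, 7, "-", down=2, only_flank=True))     # ca
```

The four printed values match the four SeqKit outputs shown above for tss1.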

SeqKit also supports BED files, but only the chromosome, position, and strand information is used.
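If the TSS positions are only available as a list of 1-based coordinates, they first have to be written out as an annotation file. A small Python sketch for the BED case follows; it relies on the standard BED convention (0-based start, end-exclusive) and standard six-column layout, while the function name, the TSS names, and the positions are made up for illustration. The chromosome column must match the sequence ID in the FASTA header.

```python
def tss_to_bed_line(chrom, tss, strand, name="."):
    """Format one 1-based TSS position as a 1-bp, 6-column BED record.

    BED intervals are 0-based and end-exclusive, so position p becomes
    start = p - 1, end = p.
    """
    if tss < 1:
        raise ValueError("tss must be a 1-based position (>= 1)")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return "\t".join([chrom, str(tss - 1), str(tss), name, "0", strand])


# Hypothetical TSS list: (name, 1-based position, strand).
tss_list = [("tss1", 5, "-"), ("tss2", 8, "+")]
for name, pos, strand in tss_list:
    print(tss_to_bed_line("seq", pos, strand, name))
```

It is worth running the resulting file on a toy sequence first to confirm that the flanks come out as expected for a 1-bp feature.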

psiwach29 replied (2.3 years ago):

Thanks a lot. It helped.