filter extract sequences
0
0
Entering edit mode
3.9 years ago
ntechen • 0

Hi, I am a bioinformatics beginner and have very little command line knowledge. I have downloaded raw sequence reads and assembled them with megahit into contigs ('genome of interest'). I have aligned to this genome several thousand DNA fragments of 100 bases long (probes) using blastn:

blastn -db Genome -query 100bpprobes.fa -out outtabular.txt -outfmt 6

The output is a tabular output file (12 columns) with some two thousand lines.

1   2   3   4   5   6   7   8   9   10  11  12
Query id    Subject id  % identity  alignment length    mismatches  gap openings    q. start    q. end  s. start    s. end  e-value bit score
3-Probe_168 k119_164365 92.31   78  6   0   43  120 1676    1599    4.53E-24    111
3- Probe _210   k119_164365 89.08   119 12  1   1   118 1676    1558    1.24E-34    147
3- Probe_252    k119_164365 90.54   74  6   1   4   76  1631    1558    1.27E-19    97.1
9-Probe _164    k119_164365 93.42   76  5   0   45  120 1678    1603    1.26E-24    113
9-Probe_205 k119_164365 91.45   117 10  0   4   120 1678    1562    4.43E-39    161
9-Probe_246 k119_164365 92.50   80  6   0   4   83  1637    1558    3.50E-25    115
242-Probe_645   k119_126278 95.65   46  2   0   75  120 4000    4045    5.94E-13    75
15-Probe_600    k119_34986  92.04   113 9   0   6   118 175 287 1.59E-38    159

I want to extract the sequences that are 500 bp upstream from the s.start til 500 bp downstream from s.end. How can I manage that? Thank you for your time.

Assembly genome sequence • 831 views
ADD COMMENT
1
Entering edit mode

Create a BED format file using columns subject id, s.start, s.end (unix command cut for this). Then use bedtools slop (LINK) to add 500 bp at either end. Then you can use bedtools getfasta (LINK) to retrieve the sequences you need.

If you need more detailed instructions search for the two tools mentioned above using an external google search on biostars.

ADD REPLY
0
Entering edit mode

GenoMax, how do you extract a n size sequence from the 5' end side? slop allows you to decrease/increase the sequence size but what I want is to extract say the first nucleotide sequences from the start or end

ADD REPLY
0
Entering edit mode

Can you clarify what you mean by end to end? You could simply add n bases to the start column to get sequence on just 5'-end.

ADD REPLY
0
Entering edit mode

Does extraction include sequence in addition to upstream and downstream sequences? If you want only 500bp upstream and 500 bp downstream (without sequence between), use flank option from bedtools to create bed file and use getFasta from bedtools suite.

ADD REPLY

Login before adding your answer.

Traffic: 1974 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6