filter DNA sequences of defined length for certain bases at exact positions
3
0
Entering edit mode
4.5 years ago

Dear all,

I was wondering if there is a possibility to filter DNA sequences of defined length (lets say about 30 bp) forthe presence of a specific base at a defined positin.

Something like: Give me only the sequences which contain a G at position 5 and an A at position 20.

Maybe this tool is so trivial that it was never implemented in galaxy or I am not seraching with the right key words.

does anybody have an idea?

Thank you very much,

Phil

sequence • 851 views
0
Entering edit mode

What is an input file? FASTA?

0
Entering edit mode

dear all, thank you very much,

the input file would be a simple fasta and the length of the seuqences is always the same.

with my very limited understanding of programming (I have never done this before) I will try to get this working on a windows PC...

CHeers,

phil

0
Entering edit mode

0
Entering edit mode

philipp.rathert : Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

cpad0112 's solution may be the only one that will work natively on windows. Other two solution s will require python (@st.ph.n) and a virtual unix environment (@Pierre).

0
Entering edit mode
4.5 years ago

linearize and filter with awk;

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '(substr($2,5,1)=="G") && (substr($2,20,1)=="A")' |\
tr "\t" "\n"

0
Entering edit mode
4.5 years ago
st.ph.n ★ 2.6k

Assuming single-line fasta input, or linearize with Pierre's awk statement. Each sequence has to have G @ 5 and A @ 20, and have a length of 30bp. If you want to know how many has G+A, G only, and A only you should find it easy to edit this code to do so.

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
for line in f:
if line.startswith('>'):
h = line.strip()
s = next(f).strip()
if len(s) == 30:
if s[5] == 'G' and s[20] == 'A':
print h, '\n', s

0
Entering edit mode
4.5 years ago

my fasta:

>seq1
ACGTACTGCACC
>seq2
CTGATTTAGTCTATATGTATACT
>seq3
ATGACTATATCGTACGATCTAGTCCC


Requirements:

filter a sequence by length (12 bases in this example) and sequence should have G at the 3d position and C at the 12 position


Command:

$seqkit seq -M 12 test.fa | seqkit grep -s -R 3:3 -p G | seqkit grep -s -R 12:12 -p C  output: $ seqkit seq -M 12  test.fa | seqkit grep -s  -R 3:3 -p G | seqkit grep -s -R 12:12 -p C

>seq1
ACGTACTGCACC

0
Entering edit mode