Extract sequences from a fasta file with specific nucleotide repetition
2
0
Entering edit mode
2.5 years ago
shabbas12 ▴ 10

I have a fasta file name seqs.fa with multiple sequences i.e.,

>Seq1
GATAGAT**ATC**GAATG**ATC**

>Seq2
GATGATAG**ATC**GATGC

I want grep/extract only those sequences having ATC repeated exactly 2 times like in Seq1. How we can use grep/sed or {} method for this purpose?

linux command • 1.5k views
ADD COMMENT
0
Entering edit mode
2.5 years ago
Michael 54k

In principle, regular expressions do not count occurrences exactly (at least easily) internally, you might think of /(ATC.*ATC){1,1}/ but that does not work.

So it is easier to use a perl "one-liner", like so:

 perl -e ' $x =()= "ATCGGGGGGATC" =~ /ATC/ig; print "pattern found\n" if $x == 2 ' 

This code is just for demonstration, you need to add file parsing and output behavior depending on your requirements.

In case the pattern could overlap, like in find exactly 2x ATCATC and you want to include [ATC[ATC]ATC] as a valid hit, an additional check is required.

An alternative for making a regex that works in perl is to specify which pattern not to match in parts of the string, like so:

perl -e 'print "found\n" if  "ATCATATC" =~ /^(?:(?!ATC).)*ATC(?:(?!ATC).)*ATC(?:(?!ATC).)*$/;'

where (?:(?!ATC).)* matches everything except ATC.

ADD COMMENT
0
Entering edit mode
2.5 years ago

You may need to take the reverse strand into consideration.

Both strands:

$ seqkit locate -i -p atc seqs.fa | column -t 
seqID  patternName  pattern  strand  start  end  matched
Seq1   atc          atc      +       8      10   atc
Seq1   atc          atc      +       16     18   atc
Seq1   atc          atc      -       15     17   atc
Seq1   atc          atc      -       5      7    atc
Seq1   atc          atc      -       1      3    atc
Seq2   atc          atc      +       9      11   atc
Seq2   atc          atc      -       12     14   atc
Seq2   atc          atc      -       8      10   atc
Seq2   atc          atc      -       4      6    atc
Seq2   atc          atc      -       1      3    atc

$ seqkit locate -i -p atc seqs.fa \
    | csvtk freq -t -f seqID -nr
seqID   frequency
Seq1    5
Seq2    5

# no desired sequences

Only forward strand:

$ seqkit locate -i -p atc --only-positive-strand  seqs.fa | column -t 
seqID  patternName  pattern  strand  start  end  matched
Seq1   atc          atc      +       8      10   atc
Seq1   atc          atc      +       16     18   atc
Seq2   atc          atc      +       9      11   atc

$ seqkit locate -i -p atc --only-positive-strand  seqs.fa \
    | csvtk freq -t -f seqID -nr
seqID   frequency
Seq1    2
Seq2    1

$ seqkit locate -i -p atc -P  seqs.fa \
    | csvtk freq -t -f seqID -nr \
    | awk '$2 == 2' \
    | cut -f 1 \
    | tee list.txt
Seq1

$ seqkit grep -f list.txt seqs.fa 
[INFO] 1 patterns loaded from file
>Seq1
GATAGATATCGAATGATC
ADD COMMENT

Login before adding your answer.

Traffic: 2224 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6