grep -v remove fasta sequence
1
2
Entering edit mode
3.4 years ago
Morris_Chair ▴ 350

Hello everyone, I created a fasta file from a ChIP-seq_summit.bed where each sequence is about 500bp. I want to remove sequences containing a specific motif, so I thought to use this code:

grep -E ".[GC][AT]T[GC]AAA[GA]" ChIP-seq_summit.fa -v >ChIP-seq_summit_no_cons.fa

it's not clear to me wether this remove only the motif in the sequences or the entire line that belongs to.

I'm asking because by using the first file for finding motifs, MEME works fine but using this modified file it says that there are sequences shorter then the minimum allowed

Thank you :)

ChIP-Seq • 2.1k views
ADD COMMENT
1
Entering edit mode

grep is not right function to manipulate fasta esp multiline fasta. Use dedicated tools such as seqkit and grep function is offered by seqkit.

ADD REPLY
0
Entering edit mode

Hi cpad0112,

What is the seqkit way to extract or eliminate sequences containing a motif like in my case?

thanks

ADD REPLY
0
Entering edit mode

Input:

$ cat test.fa                                      

> seq1
CCCCCCCCCC
CCGAAGAAAG
GGGGGGGGG
GGATGAAAG
>seq2
CCGGCG

output with grep:

$ grep -E ".[GC][AT]T[GC]AAA[GA]" test.fa          

GGATGAAAG

output with seqkit to extract sequences with pattern:

$ seqkit grep -srp '.[GC][AT]T[GC]AAA[GA]' test.fa 
> seq1
CCCCCCCCCCCCGAAGAAAGGGGGGGGGGGGATGAAAG

output with seqkit to extract sequences without user pattern:

$ seqkit grep -vsrp '.[GC][AT]T[GC]AAA[GA]' test.fa 

>seq2
CCGGCG

Replace AAA with A{3}

s - search only sequences (not headers)

r- use regex

p - regex pattern

v - inverse

ADD REPLY
0
Entering edit mode

grep by default removes the line, but if your sequence is in Fasta, you are removing only the lines where it matches, example:

> seq1
CCCCCCCCCC
CCGAAGAAAG
GGGGGGGGG

will be filtered as:

> seq1
CCCCCCCCCC
GGGGGGGGG
ADD REPLY
0
Entering edit mode

Hi JC, thank you for the answer, do you know how can I remove the entire sequence?

thanks

ADD REPLY
1
Entering edit mode

Linearize the fasta sequence then use grep to remove the entire sequence and then reformat remaining lines back.

Code courtesy of @Pierre:

ADD REPLY
0
Entering edit mode

Hi Genomax

Do I have to do this sequentially? like

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

grep -E ".[GC][AT]T[GC]AAA[GA]" input.fa > input.fa

the format back to fasta is not clear to me :/

thanks for help

ADD REPLY
1
Entering edit mode

You can use a unix pipe to do the following operations

Linearize | grep -v | reformat back to fasta

If you use seqkit grep then you would not need to linearize/reformat back.

ADD REPLY
1
Entering edit mode
3.4 years ago

If you use seqkit, you can also use IUPAC compliant motifs in sequence search. Thanks to shenwei356 for pointing this on github. Following is the example code with OP motif:

input:

$ cat test.fa                           
>seq1
CCC
>seq2
G
> seq1
CCCCCCCCCC
CCGAAGAAAG
GGGGGGGGG
GGATGAAAG
>seq2
CCGGCG

output sequences with OP input motif:

$ seqkit grep -rsp  '.[GC][AT]T[GC]AAA[GA]' test.fa
> seq1
CCCCCCCCCCCCGAAGAAAGGGGGGGGGGGGATGAAAG

output sequences with OP input motif using IUPAC code:

$ seqkit grep -dsp '.SWTSA{3}R' test.fa           
> seq1
CCCCCCCCCCCCGAAGAAAGGGGGGGGGGGGATGAAAG

output sequences without OP input motif using IUPAC code:

$ seqkit grep -vdsp '.SWTSA{3}R' test.fa          
>seq1
CCC
>seq2
G
>seq2
CCGGCG
ADD COMMENT

Login before adding your answer.

Traffic: 2560 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6