Question

How to remove sequence reads contains more than 1 X from multifasta file?

0

Entering edit mode

5.2 years ago

k.kathirvel93 ▴ 310

I have 5000 protein sequences in one multifasta file. I found more reads have gaps as X in their reads. So, want to eliminate those reads completely (Whole protein seq) from the file. I am keeping filter criteria as if a read contains morethan 2 X ( continuesly or anywhere in the read) should be removed. Thanks in advance for your help.

The input sequence looks like this

>Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot2
ANSTVKKKKLLLYYYSSSEERXXXXXXXXXXXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot4
ANSTVKKKKLLLYYYSSSEEXFGHYFGHYFGHFYVXXFGFYVHCEDYHF

I want output Like this

 >Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
 >Prot3  
 ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF

sequence sequencing • 3.4k views

ADD COMMENT • link updated 5.2 years ago by manojkumarbioinfo ▴ 100 • written 5.2 years ago by k.kathirvel93 ▴ 310

2

Entering edit mode

It seems to me that you keep asking similar question without making any effort to solve any of them on your own. Here you asked how to remove sequences containing N characters. Well, removing sequences with Xs is the same as removing Ns - you only change one letter in the script. You should have plenty of material with all the scripts others have written for you to solve this kind of problem with minimal effort. Most people here are helpful and kind enough to do this for you, but you will help yourself in the long run if you actually learn how to do it. You know that quote about teaching a man to fish?

enter image description here

ADD REPLY • link 5.2 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

Thanks for the advice.

Here you asked how to remove sequences containing N characters ?

Still i didn't get any solution for this, which is why i came up with a new thread. i tried with googlong also, but i couldn't get any. If i am better in scripting i could've done by myself. Taking ideas for the first time from others and learning from that is also fishing.

ADD REPLY • link 5.2 years ago by k.kathirvel93 ▴ 310

1

Entering edit mode

For example, this solution from that thread:

import sys
from Bio import SeqIO
for record in SeqIO.parse(sys.argv[1], "fasta"):
    if record.seq.count('N') == 0:
        print(record.format("fasta"))

Change N to X, and ==0 to < 2 and there is your solution. This is at least your fourth fishing lesson.

ADD REPLY • link 5.2 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

Still i didn't get any solution for this, which is why i came up with a new thread.

When I look at that thread, I see at least three solutions. Maybe they are not to your exact liking, but it isn't true that you didn't get any solutions.

ADD REPLY • link 5.2 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

This type of task is a good learning opportunity. It could be done in about 5 or 6 lines of python for instance.

What have you attempted so far?

ADD REPLY • link 5.2 years ago by Joe 22k

0

Entering edit mode

If fasta headers don't have x, this should work with seqkit:

seqkit fx2tab test.fa | awk -F'X' 'NF<=2'  | seqkit tab2fx                                                                   
>Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF

$cat test.fa                                                                                                                  
>Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot2
ANSTVKKKKLLLYYYSSSEERXXXXXXXXXXXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot4
ANSTVKKKKLLLYYYSSSEEXFGHYFGHYFGHFYVXXFGFYVHCEDYHF

ADD REPLY • link 5.2 years ago by cpad0112 21k

1

Entering edit mode

5.2 years ago

Pierre Lindenbaum 166k

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{gsub("X","",$2);print}' |\
sort -t $'\t' -k2,2 | uniq -f 1 -u | tr "\t" "\n"

>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF

ADD COMMENT • link 5.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi Mr. Pierre Lindenbaum,

Thanks for the fast reply, your awk script is removing fine reads( with no X) also. can you help again. and also i made few changes on my question, can you help with this modified thread ?

ADD REPLY • link 5.2 years ago by k.kathirvel93 ▴ 310

score 3 · Accepted Answer · 2020-05-07

3

Entering edit mode

5.2 years ago

manojkumarbioinfo ▴ 100

awk -v RS='>' '!/XX/{printf $0RT}' test.fasta

Output

Prot1

ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF

Prot3

ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF

ADD COMMENT • link 5.2 years ago by manojkumarbioinfo ▴ 100

1

Entering edit mode

This doesn't work if sequence has scattered "X". OP requirement is "I am keeping filter criteria as if a read contains morethan 2 X ( continuesly or anywhere in the read) should be removed".

For eg.

output:

 $ awk -v RS='>' '!/XX/{printf $0RT}' file.fa                                                                              

>seq1
ATWDXGCX
>seq3
ATWDXGCC

with seqkit and awk:

$ seqkit fx2tab file.fa | awk -F'X' 'NF<=2'  | seqkit tab2fx                                                                   
>seq3
ATWDXGCC

Input:

$ cat file.fa                                                                                                                  
>seq1
ATWDXGCX
>seq2
ATWXXGCX
>seq3
ATWDXGCC

awk --version GNU Awk 5.0.1,

ADD REPLY • link 5.2 years ago by cpad0112 21k