Question: How to remove sequence reads contains more than 1 X from multifasta file?
0
gravatar for k.kathirvel93
19 days ago by
k.kathirvel93250
India
k.kathirvel93250 wrote:

I have 5000 protein sequences in one multifasta file. I found more reads have gaps as X in their reads. So, want to eliminate those reads completely (Whole protein seq) from the file. I am keeping filter criteria as if a read contains morethan 2 X ( continuesly or anywhere in the read) should be removed. Thanks in advance for your help.

The input sequence looks like this

>Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot2
ANSTVKKKKLLLYYYSSSEERXXXXXXXXXXXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot4
ANSTVKKKKLLLYYYSSSEEXFGHYFGHYFGHFYVXXFGFYVHCEDYHF

I want output Like this

 >Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
 >Prot3  
 ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
sequencing sequence • 126 views
ADD COMMENTlink modified 18 days ago by manojkumarbioinfo60 • written 19 days ago by k.kathirvel93250
2

It seems to me that you keep asking similar question without making any effort to solve any of them on your own. Here you asked how to remove sequences containing N characters. Well, removing sequences with Xs is the same as removing Ns - you only change one letter in the script. You should have plenty of material with all the scripts others have written for you to solve this kind of problem with minimal effort. Most people here are helpful and kind enough to do this for you, but you will help yourself in the long run if you actually learn how to do it. You know that quote about teaching a man to fish?

enter image description here

ADD REPLYlink written 18 days ago by Mensur Dlakic5.4k

Thanks for the advice.

Here you asked how to remove sequences containing N characters ?

Still i didn't get any solution for this, which is why i came up with a new thread. i tried with googlong also, but i couldn't get any. If i am better in scripting i could've done by myself. Taking ideas for the first time from others and learning from that is also fishing.

ADD REPLYlink modified 18 days ago • written 18 days ago by k.kathirvel93250
1

For example, this solution from that thread:

import sys
from Bio import SeqIO
for record in SeqIO.parse(sys.argv[1], "fasta"):
    if record.seq.count('N') == 0:
        print(record.format("fasta"))

Change N to X, and ==0 to < 2 and there is your solution. This is at least your fourth fishing lesson.

ADD REPLYlink modified 18 days ago • written 18 days ago by Mensur Dlakic5.4k

Still i didn't get any solution for this, which is why i came up with a new thread.

When I look at that thread, I see at least three solutions. Maybe they are not to your exact liking, but it isn't true that you didn't get any solutions.

ADD REPLYlink written 18 days ago by Mensur Dlakic5.4k

This type of task is a good learning opportunity. It could be done in about 5 or 6 lines of python for instance.

What have you attempted so far?

ADD REPLYlink written 19 days ago by Joe16k

If fasta headers don't have x, this should work with seqkit:

seqkit fx2tab test.fa | awk -F'X' 'NF<=2'  | seqkit tab2fx                                                                   
>Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF

$cat test.fa                                                                                                                  
>Prot1
 ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot2
ANSTVKKKKLLLYYYSSSEERXXXXXXXXXXXFGHYFGHYFGHFYVHFGFYVHCEDYHF 
>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
>Prot4
ANSTVKKKKLLLYYYSSSEEXFGHYFGHYFGHFYVXXFGFYVHCEDYHF
ADD REPLYlink modified 18 days ago • written 18 days ago by cpad011213k
3
gravatar for manojkumarbioinfo
18 days ago by
India
manojkumarbioinfo60 wrote:

awk -v RS='>' '!/XX/{printf $0RT}' test.fasta

Output

Prot1

ANSTVKKKKLLLYYYSSSEERXFGHYFGHYFGHFYVHFGFYVHCEDYHF

Prot3

ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF

ADD COMMENTlink modified 18 days ago • written 18 days ago by manojkumarbioinfo60
1

This doesn't work if sequence has scattered "X". OP requirement is "I am keeping filter criteria as if a read contains morethan 2 X ( continuesly or anywhere in the read) should be removed".

For eg.

output:

 $ awk -v RS='>' '!/XX/{printf $0RT}' file.fa                                                                              

>seq1
ATWDXGCX
>seq3
ATWDXGCC

with seqkit and awk:

$ seqkit fx2tab file.fa | awk -F'X' 'NF<=2'  | seqkit tab2fx                                                                   
>seq3
ATWDXGCC

Input:

$ cat file.fa                                                                                                                  
>seq1
ATWDXGCX
>seq2
ATWXXGCX
>seq3
ATWDXGCC

awk --version GNU Awk 5.0.1,

ADD REPLYlink modified 18 days ago • written 18 days ago by cpad011213k
1
gravatar for Pierre Lindenbaum
19 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum128k wrote:
 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{gsub("X","",$2);print}' |\
sort -t $'\t' -k2,2 | uniq -f 1 -u | tr "\t" "\n"

>Prot3
ANSTVKKKKLLLYYYSSSEERFGHYFGHYFGHFYVHFGFYVHCEDYHF
ADD COMMENTlink written 19 days ago by Pierre Lindenbaum128k

Hi Mr. Pierre Lindenbaum,

Thanks for the fast reply, your awk script is removing fine reads( with no X) also. can you help again. and also i made few changes on my question, can you help with this modified thread ?

ADD REPLYlink modified 18 days ago • written 18 days ago by k.kathirvel93250
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1394 users visited in the last hour