Question: How to extract only sequences that contain a specific percentage of Cysteines (C)
0
gravatar for kamel
5 weeks ago by
kamel10
kamel10 wrote:

Hello Biostar
I have a multifasta file that contains protein sequences (of different sizes), I want to extract only those sequences that contain at least 5% of cysteines (C).
Can you give me a little script so that I can extract these sequences
Thank you in advance

proteins sequence • 215 views
ADD COMMENTlink modified 5 weeks ago by cpad011211k • written 5 weeks ago by kamel10
1

See this updated post:

https://web.archive.org/web/20071027112709/http://python.genedrift.org/2007/10/10/alternative-methods-to-split-a-fasta-file/

Let's use Python:

make your sequences looks like 2 strings each:

> header
1-line sequence

def c_part_in_string(s): 
    countC = 0

    for i in range (0, len(s)):
        if s[i] == "C":
            countC += 1
    return countC/len(s)

    for line in my_list:
        if line.startswith('>'):
            print(line)
        else:
            if c_part_in_string(line) > 0.05: 
                print(line)

This is just an idea.

ADD REPLYlink modified 5 weeks ago by RamRS20k • written 5 weeks ago by natasha.sernova3.4k

Sorry, we cannot "give you a little script". Please tell us what you've tried so far and where you're facing difficulties, and we can help you with specifics.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by RamRS20k

I can quantify the number of C (cystein) by grep but in each sequence (ID). I have several protein sequences in a single multifasta file, I am looking for an idea to quantify the number of C in each sequence at a time, for extract the sequences which contains 5% cysteine.

ADD REPLYlink written 5 weeks ago by kamel10
2
gravatar for JC
5 weeks ago by
JC7.6k
Mexico
JC7.6k wrote:

Definitively you need some programming skills to do it by yourself. Here is some Perl code to do this:

#!/usr/bin/perl

use strict;
use warnings;
my $minC = 5; # Filter sequence below this percentage

$/ = "\n>"; # Read Fasta sequences in blocks
while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_); # Separate sequence id from the sequence lines
    my $seq = uc (join ("", @seq)); # Linearize the sequence in a single string, also using upper-case letters
    my $len = length $seq; # Get sequence size
    my $numC = $seq =~ tr/C/C/; # Quick way to count chars in a string
    my $perC = 100 * $numC / $len; # Calc percentage of Cs
    if ($perC >= $minC) { # Filter
         print ">$_";
    }
}

Usage: perl filterByC.pl < Fasta_In > Fasta_out

ADD COMMENTlink written 5 weeks ago by JC7.6k

Thanks for the help @JC

ADD REPLYlink written 4 weeks ago by kamel10
2
gravatar for finswimmer
5 weeks ago by
finswimmer11k
Germany
finswimmer11k wrote:

With awk:

$ awk -v FS="\n" -v RS=">" '$0 { seq=$0; $1="";  c=gsub("C", "_", $0); l=gsub(/[A-Z_]/, ".", $0); if(c/l>0.05) printf ">"seq}' input.fa
  • We set > as the record separator and \n as a the field separator.
  • If our record is not empty we save the whole record to seq.
  • We remove the id because we don't want to count the C there.
  • Then we count the replacement made from C to _ to get the number of C.
  • To get the total number of amino acid we count the replacements of each letter and a _.
  • If the fraction is greater 0.05 we print out the original sequence.
ADD COMMENTlink written 5 weeks ago by finswimmer11k

Thanks for the help @finswimmer this is very useful

ADD REPLYlink written 4 weeks ago by kamel10
0
gravatar for cpad0112
5 weeks ago by
cpad011211k
India
cpad011211k wrote:

with seqkit and awk:

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=10 {print ">"$1,$2}'
>a
atgccc
>c
ac
>d
c

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=100 {print ">"$1,$2}'
>d
c

input:

$cat test.txt 
>a
atgccc
>b
atc
>c
ac
>d
c
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by cpad011211k

Thanks for the help

ADD REPLYlink written 4 weeks ago by kamel10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1148 users visited in the last hour