Question: How to extract only sequences that contain a specific percentage of Cysteines (C)
0
gravatar for kamel
10 months ago by
kamel30
kamel30 wrote:

Hello Biostar
I have a multifasta file that contains protein sequences (of different sizes), I want to extract only those sequences that contain at least 5% of cysteines (C).
Can you give me a little script so that I can extract these sequences
Thank you in advance

proteins sequence • 376 views
ADD COMMENTlink modified 10 months ago by cpad011212k • written 10 months ago by kamel30
1

See this updated post:

https://web.archive.org/web/20071027112709/http://python.genedrift.org/2007/10/10/alternative-methods-to-split-a-fasta-file/

Let's use Python:

make your sequences looks like 2 strings each:

> header
1-line sequence

def c_part_in_string(s): 
    countC = 0

    for i in range (0, len(s)):
        if s[i] == "C":
            countC += 1
    return countC/len(s)

    for line in my_list:
        if line.startswith('>'):
            print(line)
        else:
            if c_part_in_string(line) > 0.05: 
                print(line)

This is just an idea.

ADD REPLYlink modified 10 months ago by RamRS25k • written 10 months ago by natasha.sernova3.7k

Sorry, we cannot "give you a little script". Please tell us what you've tried so far and where you're facing difficulties, and we can help you with specifics.

ADD REPLYlink modified 10 months ago • written 10 months ago by RamRS25k

I can quantify the number of C (cystein) by grep but in each sequence (ID). I have several protein sequences in a single multifasta file, I am looking for an idea to quantify the number of C in each sequence at a time, for extract the sequences which contains 5% cysteine.

ADD REPLYlink written 10 months ago by kamel30
2
gravatar for JC
10 months ago by
JC9.1k
Mexico
JC9.1k wrote:

Definitively you need some programming skills to do it by yourself. Here is some Perl code to do this:

#!/usr/bin/perl

use strict;
use warnings;
my $minC = 5; # Filter sequence below this percentage

$/ = "\n>"; # Read Fasta sequences in blocks
while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_); # Separate sequence id from the sequence lines
    my $seq = uc (join ("", @seq)); # Linearize the sequence in a single string, also using upper-case letters
    my $len = length $seq; # Get sequence size
    my $numC = $seq =~ tr/C/C/; # Quick way to count chars in a string
    my $perC = 100 * $numC / $len; # Calc percentage of Cs
    if ($perC >= $minC) { # Filter
         print ">$_";
    }
}

Usage: perl filterByC.pl < Fasta_In > Fasta_out

ADD COMMENTlink written 10 months ago by JC9.1k

Thanks for the help @JC

ADD REPLYlink written 9 months ago by kamel30
2
gravatar for finswimmer
10 months ago by
finswimmer13k
Germany
finswimmer13k wrote:

With awk:

$ awk -v FS="\n" -v RS=">" '$0 { seq=$0; $1="";  c=gsub("C", "_", $0); l=gsub(/[A-Z_]/, ".", $0); if(c/l>0.05) printf ">"seq}' input.fa
  • We set > as the record separator and \n as a the field separator.
  • If our record is not empty we save the whole record to seq.
  • We remove the id because we don't want to count the C there.
  • Then we count the replacement made from C to _ to get the number of C.
  • To get the total number of amino acid we count the replacements of each letter and a _.
  • If the fraction is greater 0.05 we print out the original sequence.
ADD COMMENTlink written 10 months ago by finswimmer13k

Thanks for the help @finswimmer this is very useful

ADD REPLYlink written 9 months ago by kamel30
0
gravatar for cpad0112
10 months ago by
cpad011212k
India
cpad011212k wrote:

with seqkit and awk:

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=10 {print ">"$1,$2}'
>a
atgccc
>c
ac
>d
c

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=100 {print ">"$1,$2}'
>d
c

input:

$cat test.txt 
>a
atgccc
>b
atc
>c
ac
>d
c
ADD COMMENTlink modified 10 months ago • written 10 months ago by cpad011212k

Thanks for the help

ADD REPLYlink written 9 months ago by kamel30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1929 users visited in the last hour