Question: Orthomcl clusters to fasta
0
gravatar for wocana
24 days ago by
wocana10
wocana10 wrote:

Any idea how to extract orthomcl Clusters to fasta?

I got 540 clusters i just want some of them on fasta.

The final output results like this:

cluster000: Ratn|NC_000067.6:7605211..7607094 Ratn|NC_000076.6:46984187..46986072 Ratn|NC_000072.6:126135324..126137210 Ratn|NC_000071.6:43059052..43060937 Ratn|NC_000069.6:92913971..92915859 Ratn|NC_000071.6:137999400..138001284 Ratn|NC_000084.6:3880095..3881981 Ratn|NC_000073.6:72629984..72631867 Ratn|NC_000073.6:14635959..14637845 Ratn|NC_000076.6:121566110..121567992 Ratn|NC_000086.7:51987005..51988889 Ratn|NC_000083.6:9587574..9589459 Ratn|NC_000073.6:82995991..82997877 Ratn|NC_000074.6:122158254..122160140 Ratn|NC_000074.6:4455815..4457701 Ratn|NC_000072.6:32035055..32036940
fasta cluster alignment mcl ortho • 157 views
ADD COMMENTlink modified 22 days ago by RamRS18k • written 24 days ago by wocana10
1

There is no need to SHOUT. I have removed the uppercase characters from your title.

ADD REPLYlink written 24 days ago by WouterDeCoster32k

Is that representation for one cluster cluster000? So you basically want to get the intervals represented there in one fasta sequence?

ADD REPLYlink modified 24 days ago • written 24 days ago by genomax57k

The cluster im interest of has 260 fasta IDs and i wanna extract them from the fasta file, first i organaized them 1 ID per line as you can see below "list.txt". Im using that script i modified it but its not working at all

ADD REPLYlink modified 24 days ago • written 24 days ago by wocana10

im using this script but im just getting 136/260 i dont know why.. my list.txt look like this

CoGr|NW_001584579.1:68053..71823
CoGr|NC_008804.1:48922615..48926373
CoGr|NC_008801.1:545795465..545799243 
CoGr|NC_008804.1:161966386..161970130 
CoGr|NC_008807.1:226381677..226385438 
CoGr|NC_008805.1:191827747..191831534
CoGr|NC_008802.1:270867733..270871532 
CoGr|NC_008801.1:240232298..240236072
CoGr|NC_008803.1:202573007..202576775

SCRIPT

#!/usr/bin/env perl
my $list_file = $ARGV[0];
my $fasta_in = $ARGV[1];
my $fasta_out = $ARGV[2];
open(LIST_FILE, "<", $list_file) or die "could not open '$list_file' : $! \n";
open(FASTA_IN, "<", $fasta_in) or die "could not open '$fasta_in' : $! \n";
open(FASTA_OUT, ">", $fasta_out) or die "could not open $fasta_out : $! \n";
my @headers = ();
while(<LIST_FILE>) {
    chomp;
    next if ( /^\s*$/ );
    push(@headers,_);
}
my $pat = join '|', map quotemeta, @headers;
$/ = ">";
while(<FASTA_IN>) {
    chomp;
    if ( /$pat/ ) { print FASTA_OUT ">$_"; }
}
close(LIST_FILE);
close(FASTA_IN);
close(FASTA_OUT);
ADD REPLYlink modified 24 days ago • written 24 days ago by wocana10

Try using Bioperl SeqIO and Bio::DB::Fasta, so you don't have to re-invent the wheels handling FASTA files...

ADD REPLYlink written 24 days ago by Vitis1.6k
1
gravatar for lieven.sterck
24 days ago by
lieven.sterck2.6k
Belgium, Ghent, VIB
lieven.sterck2.6k wrote:

Since this is an orthomcl output, you should have a blast database of the proteins you clustered. So select the orhto fam you want, parse out the proteinsIDs and get those from the blastDB.

Get the list ids you need:

grep 'cluster000:' <mcloutput file> | sed 's/cluster000://' | sed 's/ /\n/g' > ids2get.lst

get those ids from the blastdb and write to file

blastdbcmd -db <your protein blastDB> -entry_batch ids2get.lst > proteins.fasta
ADD COMMENTlink modified 21 days ago • written 24 days ago by lieven.sterck2.6k

Didnt work

Error: [blastdbcmd] CoGr|NW_001583259.1:530924..534690: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:39070112..39072248: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:104399108..104402881: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:127199820..127202817: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:130223934..130226919: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:120801884..120804770: OID not found
Error: [blastdbcmd] CoGr|NC_008805.1:235446572..235450134: OID not found
Error: [blastdbcmd] CoGr|NC_008802.1:315125658..315129411: OID not found
Error: [blastdbcmd] CoGr|NC_008804.1:399210142..399213935: OID not found
Error: [blastdbcmd] CoGr|NC_008802.1:352763833..352767598: OID not found
Error: [blastdbcmd] CoGr|NC_008803.1:129563892..129567698: OID not found
ADD REPLYlink modified 24 days ago by genomax57k • written 24 days ago by wocana10
1

are those the exact same IDs as you have in your input protein file? Can you provide the output of head on your input protein file?

You will at least have to remove the CoGR| part as that is added by orthomcl if I remember correctly

grep 'cluster000:' <mcloutput file> | cut -d ' ' -f2- | sed 's/CoGr|//g' |sed 's/ /\n/g' > ids2get.lst

I also replaced on of the sed cmds with cut (a bit neater)

ADD REPLYlink modified 21 days ago • written 24 days ago by lieven.sterck2.6k
1
gravatar for wocana
22 days ago by
wocana10
wocana10 wrote:

Thx you all at the end first i modified and then used lieven.sterck command

grep 'cluster000:' <mcloutput file> | sed 's/cluster000://' | sed 's/ /\n/g' > ids2get.lst

and then the script runned correctly

#!/usr/bin/env perl
my $list_file = $ARGV[0];
my $fasta_in = $ARGV[1];
my $fasta_out = $ARGV[2];
open(LIST_FILE, "<", $list_file) or die "could not open '$list_file' : $! \n";
open(FASTA_IN, "<", $fasta_in) or die "could not open '$fasta_in' : $! \n";
open(FASTA_OUT, ">", $fasta_out) or die "could not open $fasta_out : $! \n";
my @headers = ();
while(<LIST_FILE>) {
    chomp;
    next if ( /^\s*$/ );
    push(@headers,_);
}
my $pat = join '|', map quotemeta, @headers;
$/ = ">";
while(<FASTA_IN>) {
    chomp;
    if ( /$pat/ ) { print FASTA_OUT ">$_"; }
}
close(LIST_FILE);
close(FASTA_IN);
close(FASTA_OUT);
ADD COMMENTlink written 22 days ago by wocana10

Good to hear you managed, though the retrieve via the blastdbcmd is gonna be miles faster then this approach. but if it works and suits your needs then stick with it.

ADD REPLYlink written 21 days ago by lieven.sterck2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1581 users visited in the last hour