Question

How can I extract/separate bins (sequences containing contigs) from my metagenome fasta file?

0

Entering edit mode

5.6 years ago

mjoval • 0

Hi Everyone,

I am new to bioinformatics and I am trying to extract my bins (containing contigs) from a metagenome file. I have the following scripts but when I checked the fasta file for each bin, it doesn't match with the number and contigs IDs within the bin. Could you help me what's wrong with this script? It seems that it can read the contigs with 4 digits (example: Naka_1045) and the remaining contigs containing more than 4 were not included in the fasta file. I also notice that all that it included all the first 2 digits from my contig files (ex, all those that starts with 10 in the case of Naka_1045). I hope you could help me on this. Here's the script given to me:

#!/usr/bin/perl -w

$usage = "perl.exe $0 [filename.txt] [filename.fasta]\n";

unless (@ARGV == 2) {die $usage;}


print "This script puts contigs from a metagenome fasta file into buckets with entries of the same organism(-group).\n";




#13
open INFILE, "$ARGV[0]" or die "cannot open $ARGV[0]\n";

my %contig;
# my @organism;
my %bucket;
$n= 0;


while (<INFILE>)
{
  @parameters= split;

  $contig{$parameters[0]}= $parameters[1];
#  $organism[$n]= $parameters[1];
#28
  if (not exists $bucket{$contig{$parameters[0]}})
  {
    $bucket{$contig{$parameters[0]}}= 1; 
  }
  else
  {
    $bucket{$contig{$parameters[0]}}= $bucket{$contig{$parameters[0]}} + 1;  
  }

  $n ++;
}
#40
print "$n contigs are listed in $ARGV[0].\n";
print "The metagenome comprises the following groups:\n";
print "\n";
print "Group:\tEntries:\n";

#46
foreach (keys %bucket)
{
  print "$_\t$bucket{$_}\n";
}


close INFILE;


open INFILE, "$ARGV[1]" or die "cannot open $ARGV[1]\n";
#57
my @entry;
my %sequence;
$m= 0;

while (<INFILE>)
{
  @values= split;

  if (exists $values[0])
  {
    $z= $values[0];
#69
    if ($z=~ m/>/)
    {
      $entry[$m]= $z;
      $sequence{$entry[$m]}= "";
#      print "$entry[$m]\n";
      $m++;
    }
    else
    {
#79      print ".";
      if (exists $sequence{$entry[$m - 1]})
      {
        $sequence{$entry[$m - 1]}= "$sequence{$entry[$m - 1]}$z\n";
#        print "$sequence{$entry[$m - 1]}\n";
      }
      else
      {
        print "error\n";
      }
    }
#90
  }
}

close INFILE;

$extract= "";
$strain= "";
$matched= 0;

foreach (keys %bucket)
{
  $strain= $_;
  open OUTFILE, ">$strain-$ARGV[1]" or die "cannot open $strain-$ARGV[1]\n";
#104
  for ($a= 0; $a < $m; $a ++)
  {
    $extract= $entry[$a];
    $extract= substr($extract, 1, 13);
#109
    if (exists $contig{$extract})
    {
      if ($contig{$extract} eq $strain)
      {
        print OUTFILE "$entry[$a]\n";
        print OUTFILE "$sequence{$entry[$a]}";
        $matched ++;
      }
    }
  }

  close OUTFILE;
}

print "$ARGV[1] contains $m entries; $matched thereof match to $ARGV[0].\n";

sequence genome sequencing • 1.6k views

ADD COMMENT • link updated 5.6 years ago by Sej Modha 5.3k • written 5.6 years ago by mjoval • 0

0

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

ADD REPLY • link 5.6 years ago by Sej Modha 5.3k

0

Entering edit mode

Thank you for helping me. I just copied it from my file and pasted here. I apologize for not putting it correctly.

ADD REPLY • link 5.6 years ago by mjoval • 0

0

Entering edit mode

Hello mjoval and welcome to biostars,

it would be more helpful if you show us how your input files looks like and how your desired output should look like based on the example input you show us.

fin swimmer

ADD REPLY • link 5.6 years ago by finswimmer 16k

0

Entering edit mode

example of input is missing.

ADD REPLY • link 5.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

My input file in txt contains the different contigs in 1st column and the bin number in the second column like the following:

Contigs (Naka2016_contig#)          Bin#(#)
Naka2016_1  190
Naka2016_2  156
Naka2016_3  146
Naka2016_4  151
Naka2016_5  182
Naka2016_6  151
Naka2016_7  146
Naka2016_8  190
Naka2016_9  156
Naka2016_10 189
Naka2016_11 125
Naka2016_12 118
Naka2016_13 133
Naka2016_14 146
Naka2016_15 156
Naka2016_16 118
Naka2016_17 133
Naka2016_18 162
Naka2016_19 146
Naka2016_20 189

I would like to have a separate fasta files of the bin# containing the contigs that belong to that bin#. (Ex. Bin#151- containing Contig Naka2016_4, Naka2016_6; Bin#146-containing Naka2016_3,Naka2016_7,Naka2016_14,Naka2016_19). Using the scripts would gave me that fasta file but some of the contigs doesn't correspond to the input list that I have. Thank you in advance for your help. I hope i have given the sample input correctly.

ADD REPLY • link updated 5.6 years ago by finswimmer 16k • written 5.6 years ago by mjoval • 0