Question: extract subset of 21 nt. from csv file
0
gravatar for archana.bioinfo87
13 months ago by
archana.bioinfo87180 wrote:

Hi, I am trying to get 1-21nt, 2-22nt, 3-23nt etc till the end of sequences. For example, if I have sequences of 51 nt in length then I will get the 30 fragments with 21 nt in each fragment. I tried this script whcih looks to work fine but only for first 2 lines only.

fasta <- read.csv("trial.csv", header = TRUE, sep =",") 
dd1 <- as.character(fasta$Seq)
IDS <- as.character(fasta$Ids)
b <- numeric() 
final <- NULL
for (i in 1:length(dd1))
{
  for ( j in 1:nchar(dd1[i])) 
  {
  a <-dd1[i]
  b[j] <- substr(a, j, j+20)
  #print (nchar(b[j]))
  #print (nchar(dd1[i]))
  #print (length(dd1[i]))
  ifelse ((nchar(b[j]))==21 && (j<=(nchar(dd1[i])-21)),{
  print(b[j]) 
  next}, break())
  } 
final <- paste(IDS[i],b, sep = ",") 
write.table(final, paste(i,"final.csv")) 
}



  Example : trial.csv will look like this 
    Ids,Seq
    hsa_circ_0000013,GGCTCCAGGGAGCTTGGCTTCTGTAGAAGTTCTAAGGAAGCGGTACGAACTCCACGGCGGTGGGGCGCTAACTAGCAGGGACCCCTGCAAGTGTTGGTCGGGGGCCTCGAGCTGCCTGAGCTGACACGAGGGGAGGGGTCTGTGTAGCCAACAG    hsa_circ_0000026,CAATCCCACAGAGTATTGATGAGGAAACTGAAGTTTGGAGCGATCACATCATTTTCCCAAGAATTCCTAGAGGACCTGTGCAACAACCTCTTGAGGATCGAATCTTCACTCCCGCTGTCTCAGCAGTCTACAGCACGGTAACACAAGTGGCAAGACAGCCGGGAACCCCTACCCCATCCCCTTATTCAGCACATGAAATAAACAAGGGGCATCCAAATCTTGCGGCAACGCCCCCGGGACATGCATCGTCCCCTGGACTCTCTCAA
hsa_circ_0000037,ATACATTTGGGCCTGTCTACCTGCCTTTGGGGCAATTTGCAGGTTTTGAGAAGTAGAAATGAGGGTCTGGAGAGGGCATCTGTGAGCCTCTTCTGGGAACCCCTCCCTTGTAGGT
hsa_circ_0000050,GATGAATTCAAAAGACTATTTGCTAAATATGGAGAACCAGGAGAAGTTTTTATCAACAAAGGCAAAGGATTCGGATTTATTAAGCTTGTGAGTGTAATTGTTTGATTTTACGTAGAATTAAAAAGGGTGGGGGATTTTTTTGTCACTACAAACGCTGAAGGCTTGGTTTTTAAACTGGGGAGGATAAATTGATCTTTTAGATTTTTCACCATTCTTACAGGAAAAATGCTTGCGGTATAATGCATAATTGTTGCTACCTAAGAGAAAAGAGGGGTGGGGTGGGTAAACTAAGTGGTGTTAGTGGGTGCTGCCTAAAGGTAATGGTCGAACTGAGCTGAAGGAAGAAAGGGA

Any help is much appreiciated.

Thanks

R • 294 views
ADD COMMENTlink modified 13 months ago by JC12k • written 13 months ago by archana.bioinfo87180
1

If you are doing an assignment then by all means continue with your code.

If you just need to print all 21-mers from a file use kmercountexact.sh from BBMap suite like this.

kmercountexact.sh in=test.fa k=21 out=stdout.fa (to print to screen)
ADD REPLYlink written 13 months ago by genomax92k

Thanks. I am going to try this as well.

ADD REPLYlink written 13 months ago by archana.bioinfo87180
1
gravatar for JC
13 months ago by
JC12k
Mexico
JC12k wrote:

I am not sure about the output you want, but here is how I would do in perl:

#!/usr/bin/perl

use strict;
use warnings;

my $len = 21;
while (<>) {
    next if (/^Ids/); # skip header
    chomp;
    my ($id, $seq) = split(/,/, $_);
    print "$id";
    for (my $i = 0; $i<=(length $seq)-$len; $i++) {
        my $ss = substr($seq, $i, $len);
        print ",$ss";
    }
    print "\n";
}

Example:

$ perl split21.pl < trial.csv > final.csv
$ head final.csv 
hsa_circ_0000013,GGCTCCAGGGAGCTTGGCTTC,GCTCCAGGGAGCTTGGCTTCT,CTCCAGGGAGCTTGGCTTCTG,
ADD COMMENTlink written 13 months ago by JC12k

Thanks for your time and effort. I really appreciate it.

ADD REPLYlink written 13 months ago by archana.bioinfo87180

Hi,

I am getting extra space when gereratig the outfile from the perl script. I tried to mofiy it but space included in between in my sequences.

As, you know previously I was trying from 1st position to last-21 nt but now I am looking for 1st to last+21nt (from 1-21 start position)

Any help is much appreciated.

I tried this code but getting error`

 #!/usr/bin/perl

    use strict;
use warnings;
#while(my $row = <>)
#{
#   $row =~ 's/\s//g';
#}
my $len = 21;
while (<>) {
    next if (/^Ids/); # skip header
    chomp;
    my ($id, $seq) = split(/,/, $_);
 my $ss = substr($seq, 0, 21);   
#print "$ss";
my $merge= ($seq.$ss);
#print "$merge";
$merge=~ 's/\s+$//g';
    for (my $i = 0; $i<=(length $merge)-$len; $i++) {
        my $st = substr($merge, $i, $len);
# $st= ~ 's/\s+$//g';
#if(length $st < $len) {next};
       print "$id,$st";
print "\n";
    }
  }
ADD REPLYlink modified 12 months ago • written 12 months ago by archana.bioinfo87180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1944 users visited in the last hour