Question

How to break long contigs into smaller ones

0

Entering edit mode

7.8 years ago

wangli_lily • 0

Dear All,

I want to bin the contigs using the ESOM method. But the length of input sequences needs > 2kb and <5kb as the suggestion in the original paper. What kind of software can I use to split the contigs longer than 5 kb into smaller ones.

Thanks!

next-gen sequence • 2.4k views

ADD COMMENT • link updated 6.5 years ago by rasmus.kirkegaard • 0 • written 7.8 years ago by wangli_lily • 0

score 1 · Answer 1 · 2016-08-13

You can use SeqKit with subcommand sliding, usage, download

Example, for a 12 bp sequence

$ more t.fa
>seq
ACTGNactgnAC

$ seqkit stat t.fa 
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
t.fa  FASTA   DNA          1       12       12       12       12

Let's sliding with 5-bp window and step of 5-bp

$ seqkit sliding -W 5 -s 5 t.fa
>seq_sliding:1-5
ACTGN
>seq_sliding:6-10
actgn
>seq_sliding:11-15
AC

For your case, change the 5 to 4000, for example.

BTW, input and output of FASTQ format are also supported.

score 0 · Answer 2 · 2017-10-10

You can do it with R

# Load libraries
library("Biostrings")
library(seqinr)

# Variables
size_to_split<-99999
input_file<-"assembly.fasta"
# End of variables


assembly<-readDNAStringSet(filepath=input_file)

newframe<-data.frame(name_seq=character(0),size_seq=integer(0),seq_seq=character(0),stringsAsFactors = F)

name_seq = names(assembly)
seq_seq = paste(assembly)
size_seq<-nchar(seq_seq)
mydataframe <- data.frame(name_seq, size_seq, seq_seq, stringsAsFactors = F)

# For every contig 
for (i in which(size_seq>size_to_split)) {
  # Test if the contig is longer than the limit of size_to_split
  if (mydataframe[i,2]>size_to_split) {
    # Get the number of fragments needed for the contig
    fragments<-ceiling(mydataframe[i,2]/size_to_split)
    # For every fragment
    for (n in 1:fragments) {
      # Calculate the start position
      start<-(1+((n-1)*size_to_split))
      # Check if the fragment is the last fragment
      if (n==fragments) {
        # Set the stop position to the length tf the contig
        stop=nchar(mydataframe[i,3])
      } else {
        # Else calculate the stop position
        stop<- (((n)*size_to_split))
      }
      splitted<-substr(mydataframe[i,3], start = start, stop = stop)
      tmp_frame<-data.frame(name_seq=(paste0(i,"_",n)),size_seq=nchar(splitted),seq_seq=splitted,stringsAsFactors = F)
      newframe<-rbind(newframe,tmp_frame)
    }
  } else {
      newframe<-rbind(newframe,mydataframe[i,])
    }
}
newframe2<-rbind(newframe,
                 mydataframe[which((size_seq<=size_to_split)),])
# Write the splitted sequence to a new file
write.fasta(sequences = as.list(newframe2$seq_seq),names = newframe$name_seq,file.out = paste0("splitted_",input_file))