Question

How to cut fasta sequences in three equivalement parts ?

0

Entering edit mode

7.5 years ago

kevinm ▴ 40

Hello,

I have a list of sequences.fasta and i would like to cut all the sequences in this list in three parts (5'-1/3-1/3-1/3-3'), i think the gt command (GenomeTools) could help me but i have no clue on how use it in this particular case... Does someone have an idea ?

Thanks by advance

Kevin

sequence rna-seq split genometools • 2.5k views

ADD COMMENT • link updated 7.5 years ago by ssv.bio ▴ 200 • written 7.5 years ago by kevinm ▴ 40

0

Entering edit mode

Cut them into three new fasta sequences or else? Are all sequences the same length?

ADD REPLY • link 7.5 years ago by GenoMax 141k

score 4 · Accepted Answer · 2016-10-03

4

Entering edit mode

7.5 years ago

Pierre Lindenbaum 161k

linearize ( http://stackoverflow.com/documentation/bioinformatics/4194/ ), get the length of the DNA sequence, print each part with awk. something like:

awk '{L=length($2);L3=int(L/3);printf("%s 1/3\n%s\n%s 2/3\n%s\n%s 3/3\n%s\n",$1,substr($2,1,L3),$1,substr($2,L3+1,L3),$1,substr($2,1+2*L3));}

ADD COMMENT • link 7.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre ! I thank you so much ! your awk one-liner works perfectly ! I am not familiar with AWK language, i certainly will try to learn the basics of it, 'cause it do the job ! Thanks again !

ADD REPLY • link 7.5 years ago by kevinm ▴ 40

0

Entering edit mode

Hi Pierre,

On a separate note, what is the best resource to learn writing one liners like this one?

ADD REPLY • link 7.5 years ago by lakhujanivijay 5.8k

1

Entering edit mode

Internet/search engines :-)

On a serious note look for learning resources (here is one). There is an O'Reilly book on awk as well.

ADD REPLY • link 7.5 years ago by GenoMax 141k

score 3 · Accepted Answer · 2016-10-04

In R:

Example file with sequences:

> gene1 
ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
> gene2 
CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
> gene3 
ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA

1) Extract 1/3 of each sequence and store each 1/3 as individual sequence, in a single fasta file i.e new fasta file will have new 3 sequences, but with first 1/3 of each sequence . Source file is named genes.fa and new file is named subgenes.fa

$ library(Biostrings)
$ genes=readDNAStringSet("genes.fa", format = "fasta")
$ subfasta=subseq(genes,1,ceiling(width(genes)/3))
$ writeXStringSet(subfasta, "subgenes.fa")

2) Extract 1/3 of from each sequence, combine all first 1/3 sequences as a single sequence and store it as a single sequence in a fasta format . Source file is named genes.fa and new file is named "new_gene.fasta". New sequence will have an ID "new_gene".

$ library(Biostrings)
$ genes=readDNAStringSet("genes.fa", format = "fasta")
$ subfasta=subseq(genes,1,ceiling(width(genes)/3))
$ csubfasta=DNAStringSet(unlist(subfasta))
$ names(csubfasta)="new_gene"
$ writeXStringSet(DNAStringSet(csubfasta),"new_gene.fasta")