Question: Extract sequences consecutively from a large fasta files and save in new fasta files
0
gravatar for Dineshkumar K
8 months ago by
Kasaragod, Kerala, India
Dineshkumar K40 wrote:

I would like to extract sequences consecutively from a large fasta files and save in new fasta files. For example, I have two fasta files namely species1.fasta and species2.fasta as given below,

species1.fasta
>tag1 
TGCAAGAG
>amylase 
ATCGAGAT

species2.fasta
>prot8
ATGTTTGCGA
>lip1
TGATATAGAT

and I need to extract first positioned sequences from both species1.fasta and species2.fasta and save in new file namely group1.fasta. Same manner, the second positioned sequences to be extracted and saved in group2.fasta, and the expected output should be like this,

group1.fasta
>tag1 
TGCAAGAG
>prot8
ATGTTTGCGA

group2.fasta
>amylase 
ATCGAGAT
>lip1
TGATATAGAT

Like this I have to do for 3000 sequences, manual process is impossible, therefore, I have tried the following script, wherein I could not loop the process. Therefore, please help me in this regard.

my script
for file in *.fasta do cat "$file" \   | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} \  {printf("%s",$0);} \
 END {printf("\n");}' \   | awk '{if ( NR==1 ) print $0;}' \   | awk '{print $1"\n"$2}' | uniq > "new_${file}"

Thank you.

fasta shell R python perl • 317 views
ADD COMMENTlink modified 8 months ago by Chirag Parsania1.8k • written 8 months ago by Dineshkumar K40
3
  1. make a directory "output" in the same directory where all fasta files are located.
  2. run following command:

    $ parallel seqkit range -r {}:{} *.fasta -o output/group_{}.fasta ::: $(seq 1 $(grep -c ">" species1.fasta))

do not output fasta files in the same directory. Download seqkit tool from here: https://github.com/shenwei356/seqkit.

If you want to be careful, dry-run parallel command.

ADD REPLYlink modified 8 months ago • written 8 months ago by cpad011213k

Thank you @cpad0112. However, it shows error as follows,

ga@pls:~/Documents/data/xap95/ortho_cluster/orthogenes_dna/test$./seqkit range -r {}:{} *.fasta -o output/group_{}.fasta ::: $(seq 1 $(grep -c ">" species1.fasta))
[ERRO] invalid range: {}:{}. type "seqkit range -h" for more examples
ADD REPLYlink modified 8 months ago • written 8 months ago by Dineshkumar K40
1

@ Dineshkumar K

I guess you forgot to use parallel. Try installing parallel and use the below function:

 parallel  'seqkit range -r {}:{} *.fasta -o output/group_{}.fasta' ::: $(seq 1 $(grep -c ">" species1.fasta))
ADD REPLYlink modified 8 months ago • written 8 months ago by cpad011213k

@ cpad0112 I created a output folder and then ran the command as you mentioned, but there is no file has been created in output directory. The terminal printed as follows,

ga@pls:~/Desktop/int/tolc$ parallel --dry-run 'seqkit range -r {}:{} *.fasta -o output/group_{}.fasta' ::: $(seq 1 $(grep -c ">" species1.fasta))

Academic tradition requires you to cite works you base your article on. When using programs that use GNU Parallel to process data for publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT. If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

seqkit range -r 1:1 *.fasta -o output/group_1.fasta seqkit range -r 2:2 *.fasta -o output/group_2.fasta

I think I have not properly installed seqkit, even I kept seqkit executive file in the concerned folder where the fasta files lies, despite it is not working.

ADD REPLYlink modified 8 months ago • written 8 months ago by Dineshkumar K40
1

why did you append --dry-run to @cpad0112 s code? Could you retry running the command precisely as it was specified

ADD REPLYlink written 8 months ago by russhh5.4k

Thank you so much @cpad0112 and @russhh. It works awesome. I ran the following command,

ga@pls:~/Desktop/int/tolc$ parallel  './seqkit range -r {}:{} *.fasta -o output/group_{}.fasta' ::: $(seq 1 $(grep -c ">" species1.fasta))

and successfully got results in the output folder, The output is given below,

group1.fasta
>tag1 
TGCAAGAG
>prot8
ATGTTTGCGA

group2.fasta
>amylase 
ATCGAGAT
>lip1
TGATATAGAT
ADD REPLYlink modified 8 months ago • written 8 months ago by Dineshkumar K40

Can you be sure there will be the same number of sequences in the two input fasta files each time you run this? Which of the languages that you've labelled this question do you have the most experience with?

ADD REPLYlink written 8 months ago by russhh5.4k

@russhh, Yes, I have the same number of sequences in the two fasta file ( In reality, I have 30 fasta files and each fasta file consist of 3000 sequences in it. I need to extract position wise 1,2,3...3000 sequences from all the 30 fasta file and save in new files like group1.fasta, group2.fasta.....group3000.fasta). I am not having much exposure in progromming, but I googled many shell scripts and tried to compile and serve my purpose and ultimately failed. My shell script can extract sequences one after another from the 30 fasta files by changing the sequence position each time in (NR==1), But, manually I can not do this for 3000 sequence positions. Moreover, I need to save the same in new file consecutively, that is why I seek help in this regard.

ADD REPLYlink modified 8 months ago • written 8 months ago by Dineshkumar K40

It sounds like you understand the processes involved in doing this, but your code is way too complicated. Have a look at the biopython SeqIO webpage: https://biopython.org/wiki/SeqIO .

ADD REPLYlink written 8 months ago by russhh5.4k

Thank you russhh, for your suggestion. I would try to do the same.

ADD REPLYlink modified 8 months ago • written 8 months ago by Dineshkumar K40
2
gravatar for Joe
8 months ago by
Joe17k
United Kingdom
Joe17k wrote:

A fairly naive approach (code not tested as I'm not at a computer to do so at the moment, so it will likely be wrong in some places, but the general approach should work)

from Bio import SeqIO
import sys

for file in sys.argv[1:]:
    for i, rec in enumerate(SeqIO.parse(file, 'fasta')):
        file  = open('./sequences_' + str(i), 'a')
        file.write(rec.format('fasta'))
        file.close()

This should iterate all records in all files, and write a new file with the number. If the file doesnt yet exist, it will create it, if it does, it will append to it.

Run as:

python scriptname.py /path/to/files/*.fasta

ADD COMMENTlink modified 8 months ago • written 8 months ago by Joe17k
1

Thank you @Joe, It serve my purpose perfectly.

ADD REPLYlink written 8 months ago by Dineshkumar K40
1
gravatar for Chirag Parsania
8 months ago by
Chirag Parsania1.8k
University of Macau
Chirag Parsania1.8k wrote:

R way

sp1_fa  <- Biostrings::readBStringSet("species1.fasta")

sp2_fa  <- Biostrings::readBStringSet("species2.fasta")


purrr::map(seq_along(sp1_fa) ,  ~  Biostrings::writeXStringSet( filepath = paste("group" ,..1,".fasta",sep = "") ,
                                                                c(sp1_fa[..1] , sp2_fa[..1]) ) )

Created on 2019-11-14 by the reprex package (v0.3.0)

edit : general solution, can be applied to regardless of file numbers

fasta_files <- list.files(path = "." ,pattern = "*.fasta" ,full.names = T)
n_files = length(seq_list)

seq_list <- fasta_files %>% 
        tibble::enframe() %>% 
        dplyr::mutate(seqs = purrr::map(value, ~ Biostrings::readBStringSet(..1))) %>%
        dplyr::pull(seqs)


for(i in c(1:n_files)){

        seq_list %>% 
                purrr::map( ~ purrr::pluck(..1,function(x)x[i])) %>%  
                Biostrings::BStringSetList() %>% 
                unlist() %>% Biostrings::writeXStringSet(paste("group" ,i,".fasta",sep = ""))

}
ADD COMMENTlink modified 8 months ago • written 8 months ago by Chirag Parsania1.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1112 users visited in the last hour