Question

Fast way to extract specific sequences from large fasta

0

Entering edit mode

4.9 years ago

mnsp088 ▴ 100

Hi all!

I have ~2k text files, each with ~1k protein names (one protein name per line) and I need to extract the sequences of these proteins from a large master fasta file which contains ~5.5 million sequences. I wrote this code to do this task and it works but it is taking a really long time to process even 1 text file.

for file in `find text_files | grep .txt`;
do
echo $file
file2=$(echo $file | sed -re 's/^.+\///')
cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa
done

Any ideas on better ways to go about this? particularly to speed things up.

Thank you for your suggestions.

sequencing fasta • 2.7k views

ADD COMMENT • link updated 2.5 years ago by loicborcard01 ▴ 40 • written 4.9 years ago by mnsp088 ▴ 100

1

Entering edit mode

You should show examples of the protein names from the protein names files and from the fasta file.

Also, did you edit the find text_files | grep .txt part? I don't think this command would find anything at all.

Anyway, I think

samtools faidx master_fasta.fs $(cut -c 1- ${file}) > ./fastas/$file2.fa

would be faster than

cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa

edit: you can also do

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

edit 2: I didn't state clearly, but I believe what is slowing your command is xargs -n 1, which is not necessary anyway.

ADD REPLY • link 4.9 years ago by h.mon 35k

1

Entering edit mode

Thank you for your suggestions, and all suggestions below. I would like to report that I have run all of them, and your solution

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

was the quickest. faSomeRecords also sped it up a lot, but this solution was still quicker.

Thanks again!

ADD REPLY • link 4.9 years ago by mnsp088 ▴ 100

1

Entering edit mode

faSomeRecords utility from Jim Kent at UCSC should be a fast way to do this.

Additional options in: Extract reads from fasta file (specific read_names)and make a new fasta file

ADD REPLY • link 4.9 years ago by GenoMax 141k

score 3 · Answer 1 · 2021-11-10

I know this post is old but I would suggest this approach using seqkit and GNU parallel:

The key here is to split the big fasta file into several parts

For your particular case I would do something like that

seqkit split2 -p 100 big.fasta -O tmp # the splitted fasta in put into a folder called tmp

parallel -j 20 " zcat {1} | seqkit grep -n -f {2} >> new.fa" ::: tmp/*.fasta ::: folder_protein_files/*.txt # you might need to tweak it because I don't know your exact setup

Using this approach you first split your fasta file into several parts and process them in parallel using seqkit and GNU parallel. First use parallel --dryrun to test your code.

hope it helps,

Loïc

score 1 · Answer 2 · 2019-06-13

Indexing is almost certainly the right way to do this. Even with biopython which will typically be slower, its likely do-able using SeqIO's index() method ( https://biopython.org/wiki/SeqIO ):

from Bio import SeqIO

index = SeqIO.index('bigfasta.fa', 'fasta')
print(index['myfastaID'])

I don't have a massive fasta file to hand to actually test this performance though.

It will likely still be slower than faidx or faSomeRecords implemented in straight up faster languages.