Tool:Retrieve a subset of FASTA from large Illumina multi-FASTA file
3
5
Entering edit mode
6.8 years ago
hcwang ▴ 50

Over the past few days, I've tried many methods to extract subset of FASTA from a multi-FASTA file based on the header IDs. I've tried samtools, hpcgridrunner, biopython and various other fasta extractor tools. However, none of them worked very well or very efficiently. Using samtools, it took me days to retrieve a 2 million subset from a 55 million multi-fasta sequence file (8.4G in storage).

Today, I finally found a solution to the problem. I wish to share with you so you may help others who have the similar problems. I'm not sure if anyone has posted this solution before. If someone has, please link them to this post as well so they can help people along the way. Using the script below, it only takes about 10 minutes on my server to build a sorted fasta in one-liner format, and extracting a 2 million subset using fasta header IDs within 2 minutes.

The script was run in LINUX system.

The steps are as follows:

#1. Convert a multi-line multi-fasta file <$fasta> into a one-liner multi-fasta <$converted_fasta> and sort the one-liner in numerical order. **(replace $fasta and$converted_fasta with real file names)**

awk '/^>/ {printf("\n%s__",$0);next; } { printf("%s__",$0);}  END {printf("\n");}' $fasta | sort -n >$converted_fasta

#2. Sort and make unique your ID headers. (replace $GOOD_ID and$GOOD_ID_sorted with real file names)

sort -n $GOOD_ID | sort -u >$GOOD_ID_sorted

#3. Use the fixed-string fgrep combined with LC_ALL=C command to extract all fasta sequences matched to the headers.

LC_ALL=C fgrep -w -f $GOOD_ID_sorted$fasta | awk -F'__' '{for(i=1; i<NF; i+=1) {print $i}}' >$GOOD_READS.fasta

#4. If you wish to extract reads that were not matched to the header IDs, you may do so with fgrep -v command to grab the inverse matches (unmatched).

LC_ALL=C fgrep -v -w -f $GOOD_ID_sorted$fasta | awk -F'__' '{for(i=1; i<NF; i+=1) {print $i}}' >$UNMATCHED_READS.fasta


The total runtime for these procedures is about 10 minutes. I hope this post can help someone in the future. Please comment below or fix the code as I am a beginner in the field of computer science and bioinformatics.

Update: I updated the script so it now keeps the original fasta structure. It basically converts all line symbol "\n" into "__" and then in step 3, it converts "__" back to "\n". Please note if you do have symbol "__" in your fasta lines, please replace "__" symbol with something that do not exist in the lines.

1
Entering edit mode

faSomeRecords from Jim Kent's Utilities (UCSC).

$chmod u+x faSomeRecords ./faSomeRecords faSomeRecords - Extract multiple fa records usage: faSomeRecords in.fa listFile out.fa options: -exclude - output sequences not in the list file.  ADD REPLY 0 Entering edit mode This is a terrific and wonderful piece of software. So efficient and rapid, that sometimes you believe that it did not work, although it did its jobs actually ADD REPLY 1 Entering edit mode From BBMap: filterbyname.sh in=[file] out=[outfile] names=[string,string,string] include=[true or false] It works on fasta, fastq, and sam. The names can either be literal names (like "E.coli") or files containing one name per line. It also allows substring matching, such as "NC_000913.3" when all you know is the accession number and the full name looks like "ncbi|511145 NC_000913.3 Escherichia coli str. K-12 substr. MG1655". The include toggle lets you either keep only the names in the list, or keep everything else. ADD REPLY 0 Entering edit mode Does this one work with retrieval based on Query FASTA IDs? For example, I have a long list of query IDs from a Illumina sequencing file, does BBMap efficiently retrieve a subset of the Illumina read FASTA? ADD REPLY 1 Entering edit mode Yes, it does; filterbyname is extremely fast (except in substring mode). Just be sure that the query names are either in a proper sequence file (fasta or fastq) or else they are in a text file with one query per line, not starting with ">" or "@" symbols which are not part of the sequence names. Although, you can use the "ths" flag (truncateheadersymbol) to ignore a leading ">" or "@" if you provide a text file containing headers that include those symbols. ADD REPLY 0 Entering edit mode ADD REPLY 6 Entering edit mode 6.8 years ago lh3 33k seqtk subseq large.fa id.txt  or awk 'BEGIN{while((getline<"id.txt")>0)l[">"$1]=1}/^>/{f=l[$1]}f' large.fa  ADD COMMENT 0 Entering edit mode Download seqtk here. ADD REPLY 0 Entering edit mode This should be an (the) answer, as it's really the only way to go about it! If I had to re-format and re-write every file from which I extract random sequences, the storage array would have a fit! From a practical perspective, if you are actually storing data on a shared array with some kind of backup system in place, you should avoid doing this excessively in any case, as it endlessly complicates the archiving procedures. ADD REPLY 0 Entering edit mode I agree with you. I wrote my script only because I'm trying to get quick results with very limited memory resource. I believe the memory footprint for extracting reads in step 2 and 3 are very minimum (which is the key to the success of my run). The converted_fasta in step 1 of my script takes about 10 minutes to build for a 4GB multi-fasta read file and it does make a duplicate of the fasta. So, if you don't have any storage space, you should consider using seqtk. The extraction time in step 2 and 3 takes about 2 minutes to do for 2 million IDs. ADD REPLY 0 Entering edit mode Thank you for the reply! I wish I found this answer before so I can avoid hitting so many walls. ADD REPLY 2 Entering edit mode Purely in the interest of saving you time (next time you have a question such as this): If you had searched Biostars (via Google, built in search engine is not as good) with the title you used for this post you would have found these threads (and several others): ADD REPLY 0 Entering edit mode Thanks for the information! In fact, I tried almost all of the methods these threads provided. It might be because I'm not used to using other tools. samtools faidx took many hours to retrieve 10,000 fasta records from a 40M read file. Biopython simply took forever to load the 40M read file. So, I combined several simple unix commands and they seemed to work the best for me. I'm very aware that I am very new to the field and I probably did not set up the tools to their maximum efficiency. Again, thank you very much for your input. ADD REPLY 0 Entering edit mode Hi so I'm trying to do something very similar to hcwang, and installed seqtk, its in my directory on the server but it won't recognize the seqtk command. Do you have any ideas as to why this might be? ADD REPLY 0 Entering edit mode Did you compile seqtk first? You may need to add execute permissions before you can run it. chmod u+x seqtk OR /full_path_to/seqtk  ADD REPLY 0 Entering edit mode I did do that, and yet it still is telling me that the seqtk command is not found... ADD REPLY 0 Entering edit mode Are you using a real file path from your computer to where the seqtk executable is present? ADD REPLY 0 Entering edit mode I believe so. I can see the file seqtk in my directory with all the files I'm working with. I was thinking that maybe somehow it couldn't access the executable somehow, but I'm not sure how that would be either. I appreciate your patience, I'm very new to all of this. ADD REPLY 0 Entering edit mode If you go into the directory where you see seqtk and run ./seqtk what happens? ADD REPLY 0 Entering edit mode I get usage info, the version, and a list of commands to use with short definitions ADD REPLY 0 Entering edit mode Great. So the application is ready for use. Are you looking to extract sequences as shown by @lh3 at the top? ADD REPLY 0 Entering edit mode Yes I'm trying to extract a set of receptor sequences (whose names I have specified in a txt file, each unique ID occupying a single line) from the mouse refseq database. So I had thought that using the seqtk subseq command would work ADD REPLY 0 Entering edit mode Yes it should as long as your sequence ID's match exactly between the two files. You can't use a database (blast?) as one of the inputs. ADD REPLY 0 Entering edit mode It seems to be working, however the output file which should contain all the sequences I want is empty. My only idea is that the list of IDs only has the Ensembl gene names, but the sequences in the fasta file have both the gene and transcript IDs. Would that prevent seqtk from successfully searching and extracting the sequences? ADD REPLY 0 Entering edit mode If you are getting an empty output file then it is not working. Can you post a few lines of the ID's you are trying to extract. Also post 10 lines from output of this command grep "^>" your_sequence_file | head -5  ADD REPLY 0 Entering edit mode You should try seqkit, a cross-platform and efficient toolkit for FASTA/Q file manipulation. It's very fast, even faster than the famous seqtk in some cases. See benchmark result Searching by list of IDs (which are parts of the fasta headers in your case): seqkit grep -n -r -f ids.txt large.fa > result.fa  ADD REPLY 0 Entering edit mode So the code entered looks like this: cd seqtk export PATH=/workdir/spretus.vno/seqtk:$PATH
seqtk subseq mouse.refseq.fasta vmnr.txt > VRs.fasta


The VRs.fasta file is created but is empty. The list of IDs in the text file look like the list below but much longer:

ENSMUSG00000082353
ENSMUSG00000090303


However, in the mouse.refseq.fasta file each sequence is indicated as such: (Gene|Transcript) ENSMUSG00000090773|ENSMUST00000173382

Here are the first 5 lines:

>ENSMUSG00000020333|ENSMUST00000000145
>ENSMUSG00000000001|ENSMUST00000000001
>ENSMUSG00000000094|ENSMUST00000000096
>ENSMUSG00000061689|ENSMUST00000000094
>ENSMUSG00000020152|ENSMUST00000000137


I had hoped that seqtk would still search based on the first half, but that is not the case? It has to be the exact full length in which case I would need IDs for all the gene|transcripts

0
Entering edit mode

Correct. Most of these methods expect the ID's to be identical between the two files. In your case some other solution (code) is going to be needed.

Since you got that file from Ensembl (BioMart?) perhaps you can get a new copy with just the first ID. Then this method will work. Actually since you have Ensembl ID's you can use them with BioMart to just export that set of sequences.

0
Entering edit mode

Actually filterbyname.sh from BBMap allows sub-string matching which should work in your case (C: Retrieve a subset of FASTA from large multi-FASTA file ) if you don't want to re-download the file.

0
Entering edit mode

I’ve run into some problems using BBmap

My list in my txt file looks like:(isoseq.8.16.IDs.txt)

c10984/f1p0/1692
c8658/f1p0/1429
c3671/f1p0/1618
c11188/f1p0/1685


 >c81/f43p16/1853 isoform=c81;full_length_coverage=43;non_full_length_coverage=16;isoform_length=1853
>c85/f3p13/1912 isoform=c85;full_length_coverage=3;non_full_length_coverage=13;isoform_length=1912
>c87/f45p12/1996 isoform=c87;full_length_coverage=45;non_full_length_coverage=12;isoform_length=1996
>c97/f2p12/1832 isoform=c97;full_length_coverage=2;non_full_length_coverage=12;isoform_length=1832

filterbyname.sh in=polished_high_qv_consensus_isoforms.8.16.fasta out=practice_isoseq.fasta names=isoseq.8.16.IDs.txt include=true


But the out file (practice_isoseq.fasta) is also empty. Is there something I can do to improve the substring matching?

1
Entering edit mode

Reason that is not working is because of the (unfortunate) presence of "/" in the names. If you can replace the / with an "_" you will find that filterbyname does work as advertised.

@Brian may be along later to comment if it is possible to get around this observation.

0
Entering edit mode

That's interesting; I had no idea. Thanks for the observation! I'll fix it in the next release.

0
Entering edit mode

Would the presence of "/" in the names also cause a problem with seqtk, if I had the full names (such as the two example IDs below) in the txt file?

c13667/f1p2/1252 isoform=c13667;full_length_coverage=1;non_full_length_coverage=2;isoform_length=1252 c2819/f1p4/3296 isoform=c2819;full_length_coverage=1;non_full_length_coverage=4;isoform_length=3296

1
Entering edit mode

Possibly. "/" is used as a file system separator in unix and while it is possible to 'escape' it (not use it in that context), filterbyname.sh was not able to accommodate that (and Seqtk may also be in that category).

As long as the / are only present in the sequence names you could change them to a _ by using the code below (and saving to a new file in process). You would need to do this for both your ID's and the sequence file.

sed 's/\//_/g' your_file > new_file


Then filterbyname.sh (and perhaps seqtk) should work.

You can then change the "_" back to a "/" for the resulting files.

1
Entering edit mode

Thank you so much, really appreciate all the help! It works great using filterbyname.sh :)

0
Entering edit mode

I doubt it. It's probably a BBTools-specific bug. seqtk should work fine.

0
Entering edit mode

I know I now have to entire ID name and using seqtk is also producing emtpy files, so I thought maybe it was the same issue. If I find some other reason it's happening I'll let you know. Thanks for the help!

1
Entering edit mode
6.8 years ago
EagleEye 7.4k

FASTACMD

http://www.animalgenome.org/blast/doc/fastacmd.html

fastacmd -d Input.fasta -s Chromosome -L start,stop

0
Entering edit mode

Please comment on the efficiency of this program. Also, it seems like this one requires GI/Accession IDs to work. Does it work if I have a long list of query IDs?

0
Entering edit mode

You do not need GI/Accession IDs and it is fast, you have to provide query like this,

Example if you want to extract particular range of sequence from reference genome

Example, reference_genome.fa:

.>chr1

AATTGGCCGTAGGCAGGGCGATTTC

.>chr2

CGTAGGCAGGGCGATTTCGTCCAAG

Example query to retrieve chromosome 1 from 2-5 nt sequence,

fastacmd -d reference_genome.fa -s chr1 -L 2,5

This will give you sequence

ATTG

Complete documentation for more options:

http://nebc.nerc.ac.uk/bioinformatics/documentation/blast/fastacmd.html

1
Entering edit mode
6.8 years ago

You should try seqkit, a cross-platform and efficient toolkit for FASTA/Q file manipulation.

It's very fast, even faster than the famous seqtk in this case. See benchmark result

seqkit grep -n -f ids.txt large.fa > result.fa


And you can specify the ID by using custom regular expression to fulfill your specific requirements. Searching by default seq identifier list,

seqkit grep -f ids.txt large.fa > result.fa


You can even search using sequence motif (not just regular expression), please read the usage.

0
Entering edit mode

For extracting sequence by ID list, the latest version of seqkit is faster than seqtk now. You could try~

1
Entering edit mode

Please don't bump your own posts. Consider this a warning - it did lure me in this case, but do not do it again.

Anyway, I looked at your webpage, and it seems interesting. I do have a recommendation, though - perhaps you've heard this before? In English, "fakit" sounds like "Fake it", meaning "Intentionally produce false data". I highly recommend you change the name.

Also, I would be interested to see how it compares to the BBMap package, speed-wise. I wrote that in Java, but I'm sort of interested in Go.

In bioinformatics, speed is great! But, I have yet to meet a PI who will accept reduced accuracy as a condition for increased speed. So, in addition to the speed, I suggest you note the accuracy as well. For many commands like fq -> fa conversion it's irrelevant, but for others... I don't even know what some of them mean. Like, "Validate bases"? "Recognize RNA"? "Search by motifs"? "Subseq"? "Deduplicate seqs"? ...etc. Of course, I have an idea of what they mean, but it may be very different from your idea of what they mean. Those are important to clarify, and if accuracy is important for the operation (meaning it is non-lossless and nuanced), then it is essential to note the accuracy as well.

2
Entering edit mode

Hi Brian, sorry for "bumping my own posts". I should have edited the origional answer. Not gonna happen again.

For the accuracy, functional test script was used to ensure it. And for the benchmarks, accuracy is the basic requirement, and I've checked it by md5sum. It should be explicitly clarified.

I also tested the performance of "seqtk subseq", "bbmap filterbyname.sh" and "fakit grep".

Since "bbmap filterbyname.sh" wraps the output sequence, so I set seqtk and fakit to do so. If not, seqtk could be faster ( 21.967s -> 17.510s) and so is fakit (6.330s -> 6.1s).

#### summary of the test data
$fakit stat dataset_A.fa file seq_format seq_type num_seqs min_len avg_len max_len dataset_A.fa FASTA DNA 67,748 56 41,442.5 5,976,145$ du -h dataset_A.fa
2.7G    dataset_A.fa

#### extract ~50% full heads and shuffle them
#### full head. i.e. "HWI-D00524:112:C6AAGANXX:7:1101:1682:1998 1:N:0:GTTTCG"
$fakit sample -p 0.5 dataset_A.fa | fakit seq -n | shuf > ids_A.full.txt$ wc -l ids_A.full.txt
33814 ids_A.full.txt
$cat ids_A.full.txt | sort | uniq | wc -l 33814 #### seqtk memusg -t seqtk subseq -l 70 dataset_A.fa ids_A.full.txt > ids_A.txt.seqtk.fa elapsed time: 21.967s peak rss: 11.5 MB #### bbmap memusg -t filterbyname.sh substring=f in=dataset_A.fa names=ids_A.full.txt out=ids_A.txt.bbmap.fa overwrite=t Input is being processed as unpaired Time: 16.126 seconds. Reads Processed: 67748 4.20k reads/sec Bases Processed: 2807643808 174.10m bases/sec Reads Out: 33933 Bases Out: 1443437699 elapsed time: 16.344s peak rss: 472.47 MB #### fakit memusg -t fakit grep --by-name --pattern-file ids_A.full.txt dataset_A.fa --line-width 70 > ids_A.txt.fakit.fa elapsed time: 6.330s peak rss: 67.75 MB ### check results$ md5sum ids_A.txt.*.fa
5463555b4f4ffd0d2259c3bc339057ae  ids_A.txt.bbmap.fa
2269d3450148d75cf586aa38c4cea352  ids_A.txt.fakit.fa
2269d3450148d75cf586aa38c4cea352  ids_A.txt.seqtk.fa

$fakit stat ids_A.txt.*.fa file seq_format seq_type num_seqs min_len avg_len max_len ids_A.txt.bbmap.fa FASTA DNA 33,933 81 42,537.9 5,976,145 ids_A.txt.fakit.fa FASTA DNA 33,815 56 40,343.2 4,887,515 ids_A.txt.seqtk.fa FASTA DNA 33,815 56 40,343.2 4,887,515$ for f in ids_A.txt.*.fa; do echo $f; fakit seq -n$f | sort | uniq | wc -l ; done
ids_A.txt.bbmap.fa
33932
ids_A.txt.fakit.fa
33814
ids_A.txt.seqtk.fa
33814


Well, you may have noticed that, bbmap seems wrongly extracted more sequences, it should be checked. And here's a advice, command argument format like "arg=value in=file" is not convenient for "auto-complete" of shell.

0
Entering edit mode

Hi Shenwei,

"faskit" sounds fine to me. Thanks for doing that test! Your program seems to be extremely fast.

Is there somewhere I can manually download the input files you used, to see if I can determine what is causing the discrepancy?

Thanks, -Brian

1
Entering edit mode

I've changed the name from "fakit" to "seqkit", since both "fa kit" and "fas kit" did not show it could handle FASTQ files. So I use "seq kit", well, a little like "seqtk" :)

Here's the data: seqkit-benchmark-data.tar.gz

More details, benchmark scripts and plotting scripts are also available here, and results with other tools are here.

And I also noticed that filterbyname.sh does not support multi-line FASTQ.

1
Entering edit mode

Seqkit sounds even better than faskit. It sounds a little like seqtk, but there are only so many 5-6 letter words out there that indicate toolkits for processing sequence data, so it seems fine to me. And it strongly indicates what the software package does (unlike BBMap :) ).

JGI is currently interested in computational efficiency, so I will test Seqkit. If it outperforms BBMap in my tests, then of course, my first option will be to try to make my own tools better than Seqkit =) But if I can't, I will recommend it for use at JGI, because computation is a large portion of our budget.

As for multi-line fastq... no, none of the BBMap package supports it. Personally, I consider that to be an advantage, as I have never encountered a multi-line fastq file and cannot imagine where one would appear. Because fastq control symbols like + and @ can randomly occur in quality strings, I don't consider multi-line fastq to be a valid format; I think valid fastq files should strictly have 4 lines per record. So, I intentionally designed the BBMap package to crash if it detects unexpected input, because I would rather the program crash than use corrupt input data to produce bad output data. I have yet to see a situation where this caused a problem, other than quality score autodetection, which can be resolved by specifying the ASCII offset rather than relying on autodetection. Every other time a crash was reported when processing fastq files, it was because the input file actually was corrupt.

I'll get back to you after I test Seqkit.

-Brian

0
Entering edit mode

Brian, thanks you for testing and sharing. Thank you so much!

I used my lightweight bioinformatics library/package bio for FASTA/Q parsing. The time efficient is close to and even better (in one case) than kseq.h of @lh3 . I improved it a lot by many optimizations, see change log. Wish you improve bbmap soon.

0
Entering edit mode

The link in top post is now stale. You should update it (along with the new name).

0
Entering edit mode

Sorry. It's updated.