Question: Extracting specific sequences from a big fasta file using ids of the sequences to be excluded
3.6 years ago, hasche89 wrote:

I have a huge FASTA file, around 20 GB in size. I also have a text file listing some sequence IDs from that same FASTA file. I want to retrieve all the sequences whose IDs are not listed in the text file.

How shall I proceed? I use Ubuntu 12. I am a novice and have very little knowledge of bash, shell or perl. Any Linux or Samtools or Bioperl command will be helpful.

Thanks.

rna-seq bioperl samtools faidx perl • 2.4k views
3.6 years ago, thackl wrote:

This would work:

git clone https://github.com/BioInf-Wuerzburg/SeqFilter.git
cd SeqFilter
make  # just fetches some libraries, no root or anything required

bin/SeqFilter big.fasta --ids idx.txt --ids-exclude --out big-filtered.fasta
3.6 years ago, geek_y wrote:

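A simple way is to first build the list of IDs you want to keep, i.e. every ID in the FASTA file that is not listed in Ids.txt. This can be done with grep (-x and -F make it match whole IDs literally, so excluding seq1 does not also drop seq10):

grep "^>" input.fasta | sed 's/^>//; s/[[:space:]].*//' | grep -v -x -F -f Ids.txt > retreive_IDs.txt

Then you could use something like pyfaidx or samtools to fetch those sequences. Going through xargs passes the IDs in batches, which avoids the command-line length limit when the list is long:

xargs samtools faidx input.fasta < retreive_IDs.txt > output.fa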

faSomeRecords (from the UCSC Kent utilities) also works:

./faSomeRecords input.fa retreive_IDs.txt output.fa
Reply written 3.6 years ago by venu

Thanks for the commands.

I am a beginner in this field. Can you please tell me what each component of your command does?

Thanks.
 

Reply written 3.6 years ago by hasche89

Run each command on its own and look at its output; then you will easily see what each step is doing.

Reply written 3.6 years ago by venu
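
As a rough illustration of what each stage of the grep/sed/grep pipeline above does, here is a sketch on a tiny made-up input. The file names (toy.fasta, toy_ids.txt, toy_stage1.txt, toy_stage2.txt, toy_keep_ids.txt) and the sequences are hypothetical, and the -x and -F flags are an assumption added here to force exact, literal ID matches.

```shell
# Toy input (hypothetical): three records, and one ID to exclude.
printf '>seq1\nACGT\n>seq2 some description\nGGCC\n>seq10\nTTAA\n' > toy.fasta
printf 'seq1\n' > toy_ids.txt

# Stage 1: keep only the header lines (those starting with ">").
grep '^>' toy.fasta > toy_stage1.txt

# Stage 2: strip the leading ">" and everything after the first
# whitespace, leaving just the bare sequence ID on each line.
sed 's/^>//; s/[[:space:]].*//' toy_stage1.txt > toy_stage2.txt

# Stage 3: -f reads the patterns from toy_ids.txt, -v inverts the match
# (print lines that do NOT match), -F treats patterns as plain strings,
# and -x requires the whole line to match, so "seq1" leaves "seq10" alone.
grep -v -x -F -f toy_ids.txt toy_stage2.txt > toy_keep_ids.txt

# Show the IDs that survive the filter.
cat toy_keep_ids.txt
```

Looking at toy_stage1.txt and toy_stage2.txt after each step shows how the header lines are reduced to plain IDs before the exclusion list is applied.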
3.6 years ago, Brian Bushnell (Walnut Creek, USA) wrote:

Boy, this really comes up a lot.  Using the BBMap package:

filterbyname.sh in=file.fasta out=filtered.fasta names=names.txt include=f


Always important to keep busy ;)

Reply written 3.6 years ago by thackl
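
If none of the tools mentioned in this thread can be installed, the same exclusion can probably be done in one streaming pass with plain awk, which needs no index and holds only the ID list in memory. This is a minimal sketch, not something any poster proposed: the file names (toy2.fasta, toy2_ids.txt, toy2_filtered.fasta) are hypothetical, and it assumes one ID per line, Unix line endings, a non-empty ID file, and that a record's ID is the first word after ">".

```shell
# Toy input (hypothetical file names and sequences).
printf '>seq1\nACGT\nACGT\n>seq2 some description\nGGCC\n>seq10\nTTAA\n' > toy2.fasta
printf 'seq1\n' > toy2_ids.txt

# While reading the first file (NR == FNR), store each ID to exclude.
# While reading the FASTA file, each header line sets "keep" depending on
# whether its ID is in the exclusion table; the bare "keep" pattern then
# prints the header and all following sequence lines until the next header.
awk '
  NR == FNR { drop[$1] = 1; next }
  /^>/      { id = substr($1, 2); keep = !(id in drop) }
  keep
' toy2_ids.txt toy2.fasta > toy2_filtered.fasta

cat toy2_filtered.fasta
```

On real data the two toy files would be replaced by the ID list and the 20 GB FASTA file. Comparing `grep -c '^>'` on the input and the output is a quick check that the expected number of records was removed.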