Question: Extract sequences from a list of ID
0
gravatar for doinelpierrot
27 days ago by
doinelpierrot0 wrote:

Hello everybody,

Here i my issue, I have a Trinity.fasta file like this

>TRINITY_DN5631_c0_g2_i1 len=947 path=[0:0-946]
TACAACTTGAACATCAACAATGGTTGCGCAGCTATTGCCATCCGCGACGTTCGAGGACTGCGTGCGAA
>TRINITY_DN62279_c1_g1_i1 len=298 path=[0:0-297]
TATTACCATTATTATTATTATCATATTTATGTTCATTATTATCATTATCATAATCATTATCATCTTGATA
 ...

And I also have a list of id in a id.txt file :

TRINITY_DN16359_c0_g1_i4
TRINITY_DN62279_c1_g1_i1
...

I am trying to extract for my fasta file the sequences that have their id in my txt.file. I am using seqkit for that, but with no success :

 seqkit grep -n -f id.txt Trinity.fasta -o result.fa

Does anyone know how to fix it ?

sequence software error • 168 views
ADD COMMENTlink modified 27 days ago by hxj50 • written 27 days ago by doinelpierrot0

This has been asked multiple times. Here is one thread: Extract fasta sequnces not matching a list Answers include extracting both matching and non-matching sequences. If you want to keep using seqkit try: C: Extract specific sequence from FASTA file

ADD REPLYlink modified 27 days ago • written 27 days ago by genomax92k

Yes I have checked biostars first that's how I found the seqkit command I am trying to use. My goal is not to remove sequence from my fasta but to have a file with all the sequences that are in my list

ADD REPLYlink modified 27 days ago • written 27 days ago by doinelpierrot0
1

Did you check the links I included above? Second link has seqkit command, if you want to stay with that program. -nrif option mentioned there should cover your use case, I would think.

ADD REPLYlink modified 27 days ago • written 27 days ago by genomax92k

seqkit grep -nrif ids.txt test.fa totally did the job thank you very much @genomax !

ADD REPLYlink written 27 days ago by doinelpierrot0

finally it does not work so great since as shenwei says, " it may produce some unwanted results. For example, seq_1 matches seq_10 with -nri"

ADD REPLYlink written 27 days ago by doinelpierrot0

If your sequences are only one line you can use this command:

cat id.txt | while read line ; do grep -A 1 ${line} file.fasta >> outputfile.fasta ; done

If your sequences are multi-lines you can convert them to a one-liner fasta file first and then use the above command:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' <  Trinity.fasta >> file.fasta
ADD REPLYlink modified 27 days ago • written 27 days ago by Fatima820

Thank for your answer I have tried this but there is 6835 sequences in outputfile.fasta whereas I have a liste of 6750 ID. Is it possible to fix that ?

ADD REPLYlink written 27 days ago by doinelpierrot0

I usually use this for accession numbers that are unique, and ${line} works well for them, but maybe you can try "${line} ", because this one requires a space after ID, and 10 won't be picked when looking for 1.

Alternatively, you can use grep --perl-regex "${line}\s", but I haven't tried mixing this with -A1 or -A 1 option. You can try!

ADD REPLYlink modified 27 days ago • written 27 days ago by Fatima820

Option -w should be added to grep for exact matching.

ADD REPLYlink written 27 days ago by shenwei3565.6k
2
gravatar for shenwei356
27 days ago by
shenwei3565.6k
China
shenwei3565.6k wrote:

Referring to the usage, you should not switch on -n , just use seqkit grep -f id.txt seqs.fa.

-n, --by-name               match by full name instead of just id
-i, --ignore-case           ignore case
-r, --use-regexp            patterns are regular expression

For -nrif, it's partly matching full FASTA header by regular expression and case-ignored, with patterns from file. This may produce some unwanted results. For example, seq_1 matches seq_10 with -nri.

ADD COMMENTlink modified 27 days ago • written 27 days ago by shenwei3565.6k

Hmm unwanted results, it's exactly what I got since I ended up with 20 mysterious sequences more than expected (out of 7000) wit -nrif.

I have tried seqkit grep -f id.txt Trinity.fasta > contaminants.fasta but unsuccesfully ...

ADD REPLYlink modified 27 days ago • written 27 days ago by doinelpierrot0
-1

Hmm... I've tested with your sample data, everything is OK.

You may check the input data (check extra whitespace or tab at front/end of IDs, file encoding, even manually copy and search in the file with text editor.), and try with additional options, e.g., switch on -i to ignore cases.

ADD REPLYlink written 27 days ago by shenwei3565.6k

My bad it did the job, thanks again !

ADD REPLYlink written 26 days ago by doinelpierrot0
0
gravatar for Renesh
27 days ago by
Renesh1.9k
United States
Renesh1.9k wrote:

You can also try Python package bioinfokit (v1.0.2 or later) for extracting the sequences from the FASTA file (check how to install https://reneshbedre.github.io/blog/howtoinstall.html#how-to-install)

from bioinfokit.analys import fasta

fasta.extract_seq(file='Trinity.fasta', id='id.txt')

Check detailed usage at https://reneshbedre.github.io/blog/howtoinstall.html#extract-sequences-from-fasta-file

ADD COMMENTlink modified 27 days ago • written 27 days ago by Renesh1.9k
0
gravatar for hxj5
27 days ago by
hxj50
HK
hxj50 wrote:

Could also use seqtk or AWK from an old thread Tool: Retrieve a subset of FASTA from large Illumina multi-FASTA file:
seqtk subseq large.fa id.txt
or
awk 'BEGIN{while((getline<"id.txt")>0)l[">"$1]=1}/^>/{f=l[$1]}f' large.fa

ADD COMMENTlink modified 27 days ago • written 27 days ago by hxj50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2283 users visited in the last hour