Question

parse a file to seqkit tool

0

Entering edit mode

20 months ago

Mohamed • 0

i have a file named gene.txt contain sequence coordinates such as :

orf1=456-12512
orf2=2869-6898

another file contain multiple fasta sequence called project.fasta

i want to extract header with sequence for each one in a separate file, i tried the following seqkit command but it gives invalid std in

awk '{print $0}' gene.txt | seqkit grep  -r -n -p $orf1 project.fasta

command seqkit awk • 1.8k views

ADD COMMENT • link updated 20 months ago by shenwei356 8.4k • written 20 months ago by Mohamed • 0

0

Entering edit mode

Are you looking to pull out entire sequences or just the coordinate boundaries that are in the examples above? grep would not be the correct tool for the former.

You may want to use samtools faidx instead.

Usage: samtools faidx <file.fa|file.fa.gz> [<reg> [...]]
Option: 
 -r, --region-file FILE   File of regions.  Format is chr:from-to. One per line.

ADD REPLY • link 20 months ago by GenoMax 141k

GenoMax · Answer 1 · 2022-08-22

1

Entering edit mode

20 months ago

shenwei356 8.4k

Please use seqkit faidx which accepts a file containing a list of regions in the format of seqid:start-end similar to samtools faidx.

$ cat gene.txt 
orf1=5-7
orf2=1-10

$ cat project.fasta
>orf1
AAAAAACCCCCC
>orf2
TTTTTTGGGGGG


$ seqkit faidx project.fasta -l <(sed 's/=/:/g' gene.txt)
[INFO] 2 patterns loaded from file
>orf1:5-7
AAC
>orf2:1-10
TTTTTTGGGG

ADD COMMENT • link 20 months ago by shenwei356 8.4k

0

Entering edit mode

actually the multifasta file contain header like:

>NC_00001.1:456-12512
ATAGATAT
>NC_00001.1:2869-6898
TTTTTTGGGG

SO, it looks like something in the command line needs modification because it gives me [WARN] sequence not found

ADD REPLY • link updated 20 months ago by GenoMax 141k • written 20 months ago by Mohamed • 0

0

Entering edit mode

For this to work, fasta headers need to match in both files. e.g. orf1 part in @shenwei's example.

ADD REPLY • link 20 months ago by GenoMax 141k

0

Entering edit mode

when i type the content of gene.txt on terminal without streaming the file in the pipeline and then run the seqkit command line it works without changing the header of fasta file, but i need to pipe all content of gene.txt because it contain a lot of files it will be difficult to change header in fasta file to match both files

ADD REPLY • link 20 months ago by Mohamed • 0

0

Entering edit mode

Have you tried to use seqkit in the way @shenwei showed in his answer?

ADD REPLY • link 20 months ago by GenoMax 141k

0

Entering edit mode

yes for @shanwei showed this example it works well beacuse both header matched, but because of header in multifasta file isnot the same so it gives [WARN] sequence not found. as illustrated bellow

$ cat gene.txt 
orf1=5-7
orf2=1-10

$ cat project.fasta
>NC_00001.1:456-12512
ATAGATAT
>NC_00001.1:2869-6898
TTTTTTGGGG

so i need to pickup each sequence with header in separate file

ADD REPLY • link updated 20 months ago by GenoMax 141k • written 20 months ago by Mohamed • 0

1

Entering edit mode

but because of header in multifasta file isnot the same so it gives [WARN] sequence not found.

Basic requirement for @shenwei's solution is that gene.txt should have fasta headers that fully match what is in your fasta file. Otherwise there is no point in using his solution because it will not work.

so i need to pickup each sequence with header in separate file

I am not 100% certain what you are saying here. If you have more than one fasta file to pick these sequences from then you need to run the solution for each file (unless you want to cat the files into one giant source file).

ADD REPLY • link 20 months ago by GenoMax 141k

0

Entering edit mode

ok i will work on both files today and will inform you thanks so much

ADD REPLY • link 20 months ago by Mohamed • 0

0

Entering edit mode

Then delete the locations before using seqkit faidx.

$ seqkit seq -i --id-regexp '^(.+?):' project.fasta -o tmp.fasta

or

$ sed -E 's/:.+//' project.fasta -o tmp.fasta

ADD REPLY • link 20 months ago by shenwei356 8.4k

0

Entering edit mode

Ideally, just add --id-regexp '^(.+?):' to the seqkit faidx command.

seqkit faidx --id-regexp '^(.+?):' project.fasta -l <(sed 's/=/:/g' gene.txt)

I've just fixed the code, the --id-regexp did not work for faidx before. You can compile from the source or directly download the binary file

ADD REPLY • link 20 months ago by shenwei356 8.4k