Question

Extracting fasta fsequences based on ID headlists

0

Entering edit mode

22 months ago

cats123 • 0

I have multiple txt files containing headlists of multiple gene IDs. I have another file containing all the coding sequences along with the gene IDs. I am trying to loop over each file and extract the sequence for the IDs. For some reason, it is not working.

An ID file, of which there are ~500:

Gene.1070__Species1@61356__g.1070__m.1070 Gene.93__Species2@2041__g.93__m.93

The coding sequence file contains thousands of sequences, including:

Gene.1070__Species1@61356__g.1070__m.1070 ACATTC... Gene.93__Species2@2041__g.93__m.93 GGGG...

I have tried:

for file in *.txt; do
   seqkit grep -f "$file" cds.cds -o "$file.fa"
done

This returns empty files. I have successfully used seqkit in the past and I cant figure out why this time it does not work.

I have tried to use the perl script subset_fasta.pl but I get an error and this returns empty files also. I have checked and the gene names in the headlist match the gene names in the coding sequences file. Can anybody help me figure out what is going wrong?

I have also tried:

seqtk subseq in.fq name.lst > out.fq but I cant get this to work on a single file.

seqkit fasta grep script • 1.3k views

ADD COMMENT • link updated 22 months ago by lieven.sterck 15k • written 22 months ago by cats123 • 0

1

Entering edit mode

hi cats123,

Welcome to the forum.

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY • link 22 months ago by lieven.sterck 15k

0

Entering edit mode

Please paste several lines of the text file, and make sure you're using the latest version of seqkit.

Tips: > symbol is not part of the FASTA head.

ADD REPLY • link 22 months ago by shenwei356 8.4k

0

Entering edit mode

Please post what's inside both the files.

There is no error in the loop. Try looking at your input files cds.cds and txt files. Check if the headers are matching. Make sure that IDs in ID files do not have ">" before IDs. In addition, from your code, your output would have .txt.fa extension. Try this:

for file in *.txt; do echo seqkit grep -f $file cds.cds -o ${file/.txt/}.fa; done

if echo is okay, remove echo. You can also take a line from echo and see if it works fine.

You can avoid loop by using parallel:

$ parallel --plus --dry-run seqkit grep -f {} cds.cds -o {.}.fa ::: *.txt

Remove dry-run if dry-run is okay.

ADD REPLY • link 22 months ago by cpad0112 21k

score 0 · Answer 1 · 2022-06-10

0

Entering edit mode

22 months ago

lieven.sterck 15k

What you are doing in your script there is to loop over all files with extension .txt .

What you likely want to do is to loop over all those files but then use the content of each of those files as input to seqkit.

Is there a single ID or multiple in each of those files? If only one you can expand your loop to get for each file the first line for instance (which thus contains the single ID). If you have multiple IDs in each file you need to add a loop that will go over each of the entries in that file and pass that on to seqkit

ADD COMMENT • link 22 months ago by lieven.sterck 15k

0

Entering edit mode

There are multiple IDs in each file. How can I do a loop like that? A loop in a loop?

ADD REPLY • link 22 months ago by cats123 • 0

0

Entering edit mode

yes exactly. something like this: (pseudo-code)

for file in *.txt
  for ID in $( cat "$file" )
    seqkit ...
  done
done

ADD REPLY • link 22 months ago by lieven.sterck 15k

0

Entering edit mode

for file in *.fa_headlist.txt; do for line in $( cat "$file" ); do seqkit grep -f "$line" cds.cds -o "$file.fa"; done; done

This doesnt run and says no such file regarding each ID. Do you know what is wrong with this code? Thanks for all your help.

ADD REPLY • link 22 months ago by cats123 • 0

1

Entering edit mode

add echo statements in your loop to see what is in each variable. (eg. echo $file or echo $line )

ADD REPLY • link 22 months ago by lieven.sterck 15k