Question: Fasta file filtering
Janey (USA) wrote 20 months ago:

Hi my friends

Maybe my question is very simple, but I'm not familiar with programming. I use the following command to extract sequences from a FASTA file. How can I write a similar command to remove sequences from a FASTA file by their sequence IDs?

cut -c 2- ID.text | xargs -n 1 samtools faidx in.fasta > out.fasta

Thanks for your help

rna-seq • 1.6k views
modified 20 months ago by erwan.scaon • written 20 months ago by Janey

Are these FASTA files flattened, i.e. is the sequence on a single line after each ID? Then you can use: grep -A 1 -w <ID> input.fasta

Example output:

$ grep -A 1 -w 'cde' test.fa
>cde
atgcatgcNNN

input:

$ cat test.fa
>abc
agtgcNNNN
>cde
atgcatgcNNN
modified 20 months ago • written 20 months ago by cpad0112
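
If the sequences are wrapped over several lines, grep -A 1 returns only the first sequence line of a record. A minimal sketch for flattening such a file first with awk, so that each record becomes one header line plus one sequence line. The file names demo_multi.fa and demo_flat.fa are made up for this illustration. Note that the grep call extracts a record; it does not remove it.

```shell
# Toy multi-line FASTA (hypothetical file name, for illustration only)
cat > demo_multi.fa <<'EOF'
>abc
agtgc
NNNN
>cde
atgcatgc
NNN
EOF

# Join wrapped sequence lines: print each header on its own line,
# then concatenate all following sequence lines into a single line.
awk '/^>/ { if (NR > 1) printf "\n"; print; next }
     { printf "%s", $0 }
     END { printf "\n" }' demo_multi.fa > demo_flat.fa

# The grep approach now returns the complete sequence of the record
grep -A 1 -w 'cde' demo_flat.fa
```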

hi cpad0112

my fasta file is like this:

c50249_g1_i2

ATCAAAAATAGTTTCAGTTTGTGAAAAAATTCATTCTCTCAATCTCTTGTTTTACTTTTG AATTATAAACTCGAGGCAAAGAAAAATGTTCATTCAAGAAGATTGATACCCAGTGTGCTC ATAATGAAGAAGAAATTCGAAAGTAGAACATTCGGTATTTGGTCAAAGAAGAAATGTCGT CTTCTCAAGTTACAGCTGTTCCAACTTCACCACCAAAAACTATTCGTCATACTGCTGATT TCCATCCCACAATATGGGGAGATCAATTCCTCAAAGATACTTCTGAACTCAAGGTAATTA CACGTACACACTCTATATGAGATTAAATTTCTAAATGCACCCACCCTATGCATGCACATC ACAATAAACG

c50249_g1_i2

ATCAAAAATAGTTTCAGTTTGTGAAAAAATTCATTCTCTCAATCTCTTGTTTTACTTTTG AATTATAAACTCGAGGCAAAGAAAAATGTTCATTCAAGAAGATTGATACCCAGTGTGCTC ATAATGAAGAAGAAATTCGAAAGTAGAACATTCGGTATTTGGTCAAAGAAGAAATGTCGT CTTCTCAAGTTACAGCTGTTCCAACTTCACCACCAAAAACTATTCGTCATACTGCTGATT TCCATCCCACAATATGGGGAGATCAATTCCTCAAAGATACTTCTGAACTCAAGGTAATTA CACGTACACACTCTATATGAGATTAAATTTCTAAATGCACCCACCCTATGCATGCACATC ACAATAAACG

What is the solution to my problem??

written 20 months ago by Janey

Do you want to remove the duplicate sequences? Both sequences look like duplicates to me. If you want to remove duplicates: $ seqkit rmdup --quiet -i -n test.fa. If this is just an example file, then try $ grep -A 1 -w '> c50249_g1_i3' input.fa. This assumes that the sequence is on a single line after the ID.

modified 20 months ago • written 20 months ago by cpad0112

I have a list of IDs whose sequences I want to remove, and my FASTA file looks like the example above, with each sequence spanning several lines. Do you have a command for that?

written 20 months ago by Janey
$ seqkit grep -iv -f IDs.txt input.fa
written 20 months ago by cpad0112
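
For readers who cannot install extra tools, the same kind of removal can be sketched with plain awk. This is only an illustration under stated assumptions, not a replacement for seqkit: it assumes the ID list holds one ID per line (with or without a leading ">"), and it matches on the first word of each FASTA header. The file names demo_in.fa, demo_ids.txt and demo_out.fa are hypothetical.

```shell
# Toy input with wrapped sequences (hypothetical file names)
cat > demo_in.fa <<'EOF'
>abc first record
agtgc
NNNN
>cde
atgcatgc
NNN
>xyz
ttga
EOF
printf 'cde\n>xyz\n' > demo_ids.txt

awk '
  # First file: remember every ID to drop (strip an optional leading ">")
  FILENAME == ARGV[1] {
    id = $1; sub(/^>/, "", id)
    if (id != "") drop[id] = 1
    next
  }
  # Header line: decide whether this record is kept
  /^>/ { id = substr($1, 2); keep = !(id in drop) }
  # Print header and sequence lines of kept records only
  keep
' demo_ids.txt demo_in.fa > demo_out.fa

cat demo_out.fa
```

Because the keep flag stays set until the next header, this works for multi-line sequences without flattening the file first.
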
finswimmer (Germany) wrote 20 months ago:

Hello Janey,

you could use seqkit for this task.

$ seqkit grep -v -n -f id_list.txt in.fasta > out.fasta

fin swimmer

written 20 months ago by finswimmer

hi finswimmer

I love your name. I downloaded seqkit and unzipped it, but it did not work. Do you have any idea how to activate it?

written 20 months ago by Janey

Does anyone have a simpler solution to this problem ????? Does anyone hear my voice ?????

written 20 months ago by Janey

Hello Janey,

please be more patient. We are all doing this in our free time, so don't expect a ready-to-use solution within minutes.

Which file did you download from seqkit? What platform are you using (Windows, Linux distribution, ...)? What do you mean by "but did not work"? What did you do, and what was the result?

fin swimmer

written 20 months ago by finswimmer

I downloaded the "seqkit_linux_amd64 (1).tar.gz" file for Linux and unzipped it.

I run this command:

./seqkit grep -v -n -f Tran_Cod.txt Totalassembly.fasta > out.fasta

My output file was empty.

written 20 months ago by Janey

If your output file is empty and you didn't get any error message, this means that all IDs in your FASTA file match the IDs in your Tran_Cod.txt. Could this be?

What happens if you remove the -v?

Could you please post the output of head Tran_Cod.txt?

modified 20 months ago • written 20 months ago by finswimmer
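
One way to check whether every FASTA ID really appears in the list is to compare the two sets of IDs directly. The following is a rough sketch with made-up file names (demo_check.fa, demo_check_ids.txt); it assumes the ID is the first word of each header and that the list holds one ID per line.

```shell
# Toy data (hypothetical file names, for illustration only)
cat > demo_check.fa <<'EOF'
>abc some description
agtgc
>cde
atgcatgc
EOF
printf 'cde\nzzz\n' > demo_check_ids.txt

# IDs in the FASTA: first word of each header, without the leading ">"
grep '^>' demo_check.fa | sed 's/^>//' | awk '{ print $1 }' | LC_ALL=C sort -u > demo_fasta_ids.txt
# IDs in the list: strip an optional ">" and skip blank lines
sed 's/^>//' demo_check_ids.txt | awk 'NF { print $1 }' | LC_ALL=C sort -u > demo_list_ids.txt

# comm needs sorted input: -12 keeps lines common to both files,
# -23 keeps lines found only in the first file.
shared=$(LC_ALL=C comm -12 demo_fasta_ids.txt demo_list_ids.txt | wc -l | tr -d ' ')
fasta_only=$(LC_ALL=C comm -23 demo_fasta_ids.txt demo_list_ids.txt | wc -l | tr -d ' ')
echo "IDs in both files: $shared"
echo "IDs only in the FASTA (would survive removal): $fasta_only"
```

If the second number is zero on the real files, an empty output from a removal command is the expected result.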

@janey: what is your OS? Which version did you download? One doesn't have to activate this program.

written 20 months ago by cpad0112
erwan.scaon (Nantes - France) wrote 20 months ago:

You can easily achieve this with seqtk.

Quoting from the manual:

Extract sequences with names in file name.lst, one sequence name per line:
seqtk subseq in.fq/fa name.lst > out.fq/fa

written 20 months ago by erwan.scaon
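
Note that seqtk subseq extracts the listed sequences rather than removing them. To use it for removal, one option is to first build the complementary list of IDs to keep. A sketch, assuming the removal list holds bare IDs (no leading ">"), one per line; all file names are hypothetical, and the final seqtk call is left as a comment because it needs seqtk installed.

```shell
# Toy data (hypothetical file names, for illustration only)
cat > demo_seqtk_in.fa <<'EOF'
>abc
agtgc
>cde
atgcatgc
>xyz
ttga
EOF
printf 'cde\n' > demo_remove_ids.txt

# All IDs present in the FASTA (first word of each header, without ">")
grep '^>' demo_seqtk_in.fa | sed 's/^>//' | awk '{ print $1 }' > demo_all_ids.txt

# IDs to keep = all IDs minus the removal list
# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
grep -v -x -F -f demo_remove_ids.txt demo_all_ids.txt > demo_keep_ids.txt
cat demo_keep_ids.txt

# With seqtk installed, the kept records could then be extracted with:
# seqtk subseq demo_seqtk_in.fa demo_keep_ids.txt > demo_seqtk_out.fa
```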