Question: Fasta file filtering
Janey wrote:

Hi my friends

Maybe my question is very simple, but I'm not familiar with programming. I use the following command to extract sequences from a FASTA file. How can I write a similar command to remove sequences from a FASTA file by their sequence IDs?

cut -c 2- ID.text | xargs -n 1 samtools faidx in.fasta > out.fasta

Thanks for your help

Tags: rna-seq
modified 10 months ago by erwan.scaon • written 10 months ago by Janey

Are these FASTA files flattened, i.e. is each sequence on a single line after its ID? If so, you can use: grep -A 1 -w <ID> input.fasta

e.g., input:

$ cat test.fa
>abc
agtgcNNNN
>cde
atgcatgcNNN

output:

$ grep -A 1 -w 'cde' test.fa
>cde
atgcatgcNNN
modified 10 months ago • written 10 months ago by cpad0112
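The grep -A 1 approach only returns the header plus one following line, so it relies on each sequence sitting on a single line. For a wrapped (multi-line) file, one option is to flatten it first. Below is a minimal Python sketch of that step, not a tested tool: the function name linearize is made up for illustration, and the records reuse the test.fa example above, split across lines.

```python
def linearize(lines):
    """Join the wrapped sequence lines of each FASTA record into one line."""
    out, seq = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seq:
                out.append("".join(seq))  # flush the previous record
                seq = []
            out.append(line)
        else:
            seq.append(line)
    if seq:
        out.append("".join(seq))  # flush the last record
    return out


# Hypothetical wrapped version of the test.fa example
wrapped = [">abc", "agtgc", "NNNN", ">cde", "atgcatgc", "NNN"]
flat = linearize(wrapped)
print("\n".join(flat))
```

After flattening, every header is followed by exactly one sequence line, which is the layout the grep command assumes.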

hi cpad0112

my fasta file is like this:

>c50249_g1_i2
ATCAAAAATAGTTTCAGTTTGTGAAAAAATTCATTCTCTCAATCTCTTGTTTTACTTTTG
AATTATAAACTCGAGGCAAAGAAAAATGTTCATTCAAGAAGATTGATACCCAGTGTGCTC
ATAATGAAGAAGAAATTCGAAAGTAGAACATTCGGTATTTGGTCAAAGAAGAAATGTCGT
CTTCTCAAGTTACAGCTGTTCCAACTTCACCACCAAAAACTATTCGTCATACTGCTGATT
TCCATCCCACAATATGGGGAGATCAATTCCTCAAAGATACTTCTGAACTCAAGGTAATTA
CACGTACACACTCTATATGAGATTAAATTTCTAAATGCACCCACCCTATGCATGCACATC
ACAATAAACG
>c50249_g1_i2
ATCAAAAATAGTTTCAGTTTGTGAAAAAATTCATTCTCTCAATCTCTTGTTTTACTTTTG
AATTATAAACTCGAGGCAAAGAAAAATGTTCATTCAAGAAGATTGATACCCAGTGTGCTC
ATAATGAAGAAGAAATTCGAAAGTAGAACATTCGGTATTTGGTCAAAGAAGAAATGTCGT
CTTCTCAAGTTACAGCTGTTCCAACTTCACCACCAAAAACTATTCGTCATACTGCTGATT
TCCATCCCACAATATGGGGAGATCAATTCCTCAAAGATACTTCTGAACTCAAGGTAATTA
CACGTACACACTCTATATGAGATTAAATTTCTAAATGCACCCACCCTATGCATGCACATC
ACAATAAACG

What is the solution to my problem??

written 10 months ago by Janey

Do you want to remove the duplicate sequences? Both sequences look identical to me. If you want to remove duplicates: $ seqkit rmdup --quiet -i -n test.fa. If this is just an example file, then try $ grep -A 1 -w '> c50249_g1_i3' input.fa. This assumes that each sequence is on a single line directly after its ID.

modified 10 months ago • written 10 months ago by cpad0112

I have a list of IDs whose sequences I want to remove, and my FASTA file is like the one above, with each sequence spanning several lines. Do you have a command for that?

written 10 months ago by Janey
$ seqkit grep -iv -f IDs.txt input.fa
written 10 months ago by cpad0112
finswimmer wrote:

Hello Janey,

you could use seqkit for this task.

$ seqkit grep -v -n -f id_list.txt in.fasta > out.fasta

fin swimmer

written 10 months ago by finswimmer
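For readers who would rather see the logic spelled out than install a tool, here is a minimal Python sketch of removing records by ID. It is only an illustration under stated assumptions, not how seqkit works internally: it assumes the ID is the first whitespace-separated word of each header line, and the names read_fasta and remove_by_id are invented for this example.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence_lines) per record; header excludes the '>'."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line[1:], []
        elif header is not None and line:
            lines.append(line)
    if header is not None:
        yield header, lines


def remove_by_id(fasta_handle, ids_to_remove, out_handle):
    """Write every record whose ID is NOT in ids_to_remove; return the count kept."""
    kept = 0
    for header, lines in read_fasta(fasta_handle):
        fields = header.split()
        seq_id = fields[0] if fields else ""  # assumption: ID = first word
        if seq_id in ids_to_remove:
            continue
        out_handle.write(">" + header + "\n")
        for seq_line in lines:
            out_handle.write(seq_line + "\n")  # original line wrapping is preserved
        kept += 1
    return kept


# Tiny in-memory example with made-up records
fasta = io.StringIO(">abc\nagtgc\nNNNN\n>cde some description\natgcatgc\nNNN\n")
out = io.StringIO()
kept = remove_by_id(fasta, {"cde"}, out)
print(out.getvalue(), end="")
```

With real files, the same function could be given open("in.fasta"), a set built from the lines of id_list.txt, and open("out.fasta", "w"); those file names are simply the ones used in the command above.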

hi finswimmer

I love your name. I downloaded seqkit and unzipped it, but it did not work. Do you have any idea how to activate it?

written 10 months ago by Janey

Does anyone have a simpler solution to this problem? Can anyone hear me?

written 10 months ago by Janey

Hello Janey,

please be more patient. We are all doing this in our free time, so don't expect a ready-to-use solution within a few minutes.

Which seqkit file did you download? What platform are you using (Windows, a Linux distribution, ...)? What do you mean by "but did not worked"? What did you do, and what was the result?

fin swimmer

written 10 months ago by finswimmer

I downloaded the "seqkit_linux_amd64 (1).tar.gz" file for Linux and unzipped it.

I ran this command:

./seqkit grep -v -n -f Tran_Cod.txt Totalassembly.fasta > out.fasta

My output file was empty.

written 10 months ago by Janey

If your output file is empty and you didn't get any error message, that means every ID in your FASTA file matches an ID in your Tran_Cod.txt. Could this be the case?

What happens if you remove the -v?

Could you please post the output of head Tran_Cod.txt?

modified 10 months ago • written 10 months ago by finswimmer
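One way to check the "all IDs match" hypothesis is to compare the two ID sets directly. The sketch below is a rough Python illustration with invented names (normalize, compare_ids) and made-up IDs. It assumes the ID is the first word of each header and strips a leading ">" and stray line endings from the list entries, since the cut -c 2- in the original question hints that the ID file may carry an extra leading character.

```python
def normalize(raw_id):
    """Strip whitespace, Windows line endings and a leading '>' from an ID."""
    return raw_id.strip().lstrip(">").strip()


def compare_ids(fasta_headers, list_lines):
    """Summarise the overlap between FASTA header IDs and an ID list."""
    fasta_ids = {normalize(h).split()[0] for h in fasta_headers if normalize(h)}
    list_ids = {normalize(l).split()[0] for l in list_lines if normalize(l)}
    return {
        "in_fasta": len(fasta_ids),
        "in_list": len(list_ids),
        "shared": len(fasta_ids & list_ids),
        "only_in_list": sorted(list_ids - fasta_ids),
    }


# Hypothetical headers and list entries, for illustration only
headers = [">c1_g1_i1 len=300", ">c1_g1_i2 len=250", ">c2_g1_i1 len=410"]
list_lines = [">c1_g1_i2\r\n", "c2_g1_i1\n", "c9_g9_i9\n", "\n"]
report = compare_ids(headers, list_lines)
print(report)
```

If "shared" equals "in_fasta", every record would be filtered out by an inverted match, which would be consistent with an empty output file; entries under "only_in_list" point to IDs that never matched anything.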

@janey: What is your OS? Which version did you download? One doesn't have to activate this program.

written 10 months ago by cpad0112
erwan.scaon wrote:

You can easily achieve this with seqtk.

Quoting from the manual:

Extract sequences with names in file name.lst, one sequence name per line:
seqtk subseq in.fq/fa name.lst > out.fq/fa

written 10 months ago by erwan.scaon
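Note that seqtk subseq, as quoted, extracts the listed sequences, so to remove sequences the name list would have to contain the IDs to keep rather than the IDs to drop. A minimal Python sketch of building that complement list follows; the function name ids_to_keep is made up, and it assumes the ID is the first word of each header.

```python
def ids_to_keep(fasta_lines, ids_to_remove):
    """Return the FASTA IDs that are not in ids_to_remove, in file order."""
    keep = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields and fields[0] not in ids_to_remove:
                keep.append(fields[0])  # assumption: ID = first word of the header
    return keep


# Records borrowed from the small test.fa example earlier in the thread
fasta_lines = [">abc", "agtgcNNNN", ">cde", "atgcatgcNNN"]
keep = ids_to_keep(fasta_lines, {"cde"})
print("\n".join(keep))
```

Written one per line to a file, the resulting IDs could then serve as the name.lst argument in the seqtk command above.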