Question: How to filter a sequence database with bash?
Lisa Prudnikow (Germany) wrote 8 weeks ago:

Hello, I am trying to filter a FASTA sequence database using bash.

>122051856 Abies alba
ACAACATACGCCCTCCCTGCAAGTTTAGAGGGGAGGAGCGGACATGGTCGTCCGTGCCCATCGTGGTGCGGTTGGCTGAAATGCATTTGATGTCCCCTGCCTTGCATCGGTCAGCGGTGGCCTT
>638110106 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAG
>638110119 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAGT
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGCCCCCAACAAATATCTGTTGGGGGCGGATATTGGTCTCCCGTGCTCATGGTGTGGTTGGCCAAAATAAGAGTCCCTTCGATGGACACACGAACTAGTGGTGGTCGTAAAAACCCT
>49689252 Fumaria officinalis
ACGCACCGAGTCGCCCCCACCCGCCCCCCAAGAGGTGCCGCGGGAGGGAGCGGAGAATGGCCCCCCGTGCCCCAGCGCGCGGCCGGCCCAAACACAGGCCCCGGGAGGCCGGCGTCACGAT
...

It's a plant database and I want to filter it with a list of plants:

Abies alba  
Acer campestre  
Achillea millefolium subsp. sudetica
...

This is the result I need:

>122051856 Abies alba
ACAACATACGCCCTCCCTGCAAGTTTAGAGGGGAGGAGCGGACATGGTCGTCCGTGCCCATCGTGGTGCGGTTGGCTGAAATGCATTTGATGTCCCCTGCCTTGCATCGGTCAGCGGTGGCCTT
>638110106 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAG
>638110119 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAGT
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGCCCCCAACAAATATCTGTTGGGGGCGGATATTGGTCTCCCGTGCTCATGGTGTGGTTGGCCAAAATAAGAGTCCCTTCGATGGACACACGAACTAGTGGTGGTCGTAAAAACCCT

I already tried:

grep -Ff list.txt database.txt > filtered.txt

This gave me a list of the ID lines from the database that match the list of plants. With the following command I then added the matching sequences to the result:

grep -x -F -A 1 -f 'filtered.txt' 'database.fasta' > filtered_database.fasta

Since the database I want to filter is very large, and some plants occur multiple times because of their different tax IDs (e.g. Acer campestre), I am not sure whether this is the right approach or whether I got all the sequences for the names in the list...

Are there other ways to filter this FASTA database by a list of binomial names using bash?

Thank you very much!

Greetings, Lisa

Jorge Amigo (Santiago de Compostela, Spain) wrote 8 weeks ago:

If you are completely sure that your sequence database consists of pairs of ID + sequence lines, you can combine both of your greps into one. You'll get all the IDs that match your query, even if they are repeated, together with the sequence line that follows each match, so you get the desired result in a single, fast step:

grep --no-group-separator -F -w -A1 -f list.txt database.txt > filtered_database.fasta

I've added the -w option so that your patterns are matched not only as fixed strings (-F) but also as whole words. Also, since grep prints -- lines by default to separate groups of matches when -A is used, you can suppress them with --no-group-separator (a GNU grep option).
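
Two things may be worth checking before trusting the output: that the file really alternates header and sequence lines, and that every name in the list hits at least one header. The following is only a minimal sketch, assuming a grep that supports -w. The sample data, the temporary directory, the absent name "Quercus robur" and the file names list.clean.txt and missing.txt are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample data, shortened from the question.
workdir=$(mktemp -d)
cat > "$workdir/database.txt" <<'EOF'
>122051856 Abies alba
ACAACATACG
>638110106 Acer campestre
ATCGTTGCCC
>638110119 Acer campestre
ATCGTTGCCT
>49689252 Fumaria officinalis
ACGCACCGAG
EOF
# The list deliberately has trailing spaces, an empty line and one absent name.
printf '%s\n' 'Abies alba  ' 'Acer campestre  ' '' 'Quercus robur' > "$workdir/list.txt"

# Strip trailing whitespace and drop empty lines: with -F a trailing space
# becomes part of the pattern, and an empty pattern can match every line.
sed -e 's/[[:space:]]*$//' -e '/^$/d' "$workdir/list.txt" > "$workdir/list.clean.txt"

# Count lines that break the strict header/sequence alternation.
bad_lines=$(awk '{ odd = (NR % 2 == 1); hdr = ($0 ~ /^>/); if (odd != hdr) n++ }
                 END { print n + 0 }' "$workdir/database.txt")
echo "lines breaking the two-line layout: $bad_lines"

# Collect every name that matches no line of the database.
: > "$workdir/missing.txt"
while IFS= read -r name; do
  if ! grep -q -F -w -e "$name" "$workdir/database.txt"; then
    printf '%s\n' "$name" >> "$workdir/missing.txt"
  fi
done < "$workdir/list.clean.txt"
echo "names without any hit:"
cat "$workdir/missing.txt"
```

If the count of bad lines is not 0, some sequences are probably wrapped over several lines, and -A1 would return only the first line of each of them.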


Thank you very much! This helps a lot. And yes, the -- lines appeared before, so thank you for your advice :-)

Reply written 8 weeks ago by Lisa Prudnikow
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 8 weeks ago:

Linearize the FASTA (one record per line), grep, then convert back to FASTA.
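
A minimal sketch of those three steps, assuming the headers contain no tab characters and that list.txt holds one name per line without trailing whitespace. The sample records and the temporary directory are made up for illustration. Unlike -A1, this approach should also work when sequences are wrapped over several lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample data; the first record is wrapped over two lines.
workdir=$(mktemp -d)
cat > "$workdir/database.fasta" <<'EOF'
>122051856 Abies alba
ACAACATACG
CCCTCCCTGC
>49689252 Fumaria officinalis
ACGCACCGAG
>638110106 Acer campestre
ATCGTTGCCC
EOF
printf '%s\n' 'Abies alba' 'Acer campestre' > "$workdir/list.txt"

# 1) Linearize: header, TAB, then the whole sequence on a single line.
# 2) grep: keep the records whose line contains one of the names.
# 3) Convert back: turn the TAB into a newline again.
awk '/^>/ { if (NR > 1) printf "\n"; printf "%s\t", $0; next }
     { printf "%s", $0 }
     END { if (NR > 0) printf "\n" }' "$workdir/database.fasta" \
  | grep -F -w -f "$workdir/list.txt" \
  | tr '\t' '\n' > "$workdir/filtered_database.fasta"

cat "$workdir/filtered_database.fasta"
```

Note that the sequences come back unwrapped, with one sequence line per record.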
