Filter fasta sequences based on a list with grep or BBmap
3
0
Entering edit mode
4.0 years ago
endretoth ▴ 30

Hi,

I would like to filter sequences (in command line Unix) with grep or BBmap based on a list of names stored in a separate file. The list has the names but not the full names, just part of the full sequence name.

The name list looks like:

cre
cln
pab
pde
pta
ppt
smo
atr
seu
pgi
cca
han

The sequences look like this (the names, from the list I have, are at the beggining (first three characters)):

>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cel-let-7-3p MIMAT0015091 Caenorhabditis elegans let-7-3p
CUAUGCAAUUUUCUACCUUACC
>cel-lin-4-5p MIMAT0000002 Caenorhabditis elegans lin-4-5p
UCCCUGAGACCUCAAGUGUGA
>pad-lin-4-3p MIMAT0015092 Caenorhabditis elegans lin-4-3p
ACACCUGGGCUCUCCGGGUACC
>pad-miR-1-5p MIMAT0020301 Caenorhabditis elegans miR-1-5p
CAUACUUCCUUACAUGCCCAUA
>cel-miR-1-3p MIMAT0000003 Caenorhabditis elegans miR-1-3p
UGGAAUGUAAAGAAGUAUGUA
>cel-miR-2-5p MIMAT0020302 Caenorhabditis elegans miR-2-5p
CAUCAAAGCGGUGGUUGAUGUG
>cca-miR-2-3p MIMAT0000004 Caenorhabditis elegans miR-2-3p
UAUCACAGCCAGCUUUGAUGUGC
>cca-miR-34-5p MIMAT0000005 Caenorhabditis elegans miR-34-5p
AGGCAGUGUGGUUAGCUGGUUG

My BBmap code is the following:

./bbmap/filterbyname.sh in=mature.fa out=filtered.fa include=t names=names.txt substring

I don't have an idea for grep.

The problem is that this code filters other sequences too (wrong sequences) because I don't know how to tell to filter only those where name present at the beginning. Maybe 'grep' would be better?

Please help.

Best wishes,
thend

fasta sequence unix linux BBmap • 2.4k views
ADD COMMENT
1
Entering edit mode

adding a - to your input name list might already help in your example (putting a > in front will even help more)

also the substring parameter requires a value, no? I would set it to substring=name

ADD REPLY
0
Entering edit mode

Dear lieven.sterck,

The original command does not require '-', please look:

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

I tried also with substring=name, but unfortunately, it doesn't change the outcome.

ADD REPLY
0
Entering edit mode

I meant adding the dash to your names in the name list (not to the options), just like you did with your grep approach below ;-)

ADD REPLY
1
Entering edit mode
4.0 years ago
endretoth ▴ 30

Dear All ( lieven.sterck, nlehmann, genomax),

While I was waiting (I didn' see the last posts) I tried the following code:

grep -F -f names2.txt -A 1  mature2.fa > filtered_new.fa

This worked perfectly. (I only changed the names to be more specific):

cre-
cln-
pab-
pde-

...

I greatly appreciate the help, I have learned a lot :) Maybe this experience is useful for other Biostar members :)

ADD COMMENT
1
Entering edit mode

Not to dampen your enthusiasm but this may not work reliably if you have fasta sequences that span multiple lines etc.

ADD REPLY
0
Entering edit mode

These are microRNAs, only "few" bases and less than a line. :)

(PS: Why I never get votes? Can you explain how this works?)

ADD REPLY
0
Entering edit mode

as genomax pointed out, it will work in this specific case but do keep it in mind (you would not be the first one having issues with this). and that your sequences are short is kinda irrelevant, it's not that hard to format your fasta file using a line length of 20 bases and then you're in trouble already.

What do you mean you "never got votes"? you already have some? community members can vote on your posts/questions/answers ... if they feel it's useful or they confirm it's an OK answers or such

ADD REPLY
0
Entering edit mode

The voting system is as simple as it gets, if people find your comments and answers helpful (or like them, or press "thumb up" for whatever reason) then you get +10 reputation, well and if not, then not. Your goal should not be reputation, it is a meaningless metric. There are users with thousands of reps who never really contributed to the community (imho) and are just around for a long time and ask a lot of more or less elaborate questions, and then collect some upvotes over time just "by chance". You should take this here as an opportunity to improve yourself. If you feel like you can add an answer somewhere to a question then try to make it a solid, elaborate answers which you are sure is correct. Take it as a challenge to improve yourself as good answers often require that you really think about a problem or topic.

ADD REPLY
0
Entering edit mode

Reputation points are added for votes on comments/answers as well as people bookmarking your answers or questions. There is no 'automatic' system, people just have to like the content you upload.

The best way to gain reputation is to add thorough/clever answers, provide quality code snippets, or ask insightful questions that are useful to others.

Even then it's not guaranteed.

ADD REPLY
1
Entering edit mode
4.0 years ago
nlehmann ▴ 140

Hi,

If you want to keep the patterns 'cre, cln..', you can use:

awk '/^>cre/ || /^>cln/ {print; getline; print}' mature.fa > filtered.fa

If you want to filter them out, you could try:

sed -e '/^>cre/,+1 d' -e '/^>cln/,+1 d' mature.fa > filtered.fa

If you don't want to write down all the patterns, you can just make a loop or pipe the patterns to sed.

Nathalie

ADD COMMENT
1
Entering edit mode
4.0 years ago
GenoMax 141k

It is working for me:

$ more nam.txt
cel
pad

$ filterbyname.sh -Xmx10g in=mu.fa out=stdout.fa include=t names=nam.txt substring=t 

Executing driver.FilterReadsByName [-Xmx10g, in=mu.fa, out=stdout.fa, include=t, names=nam.txt, substring=t]

Input is being processed as unpaired
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cel-let-7-3p MIMAT0015091 Caenorhabditis elegans let-7-3p
CUAUGCAAUUUUCUACCUUACC
>cel-lin-4-5p MIMAT0000002 Caenorhabditis elegans lin-4-5p
UCCCUGAGACCUCAAGUGUGA
>pad-lin-4-3p MIMAT0015092 Caenorhabditis elegans lin-4-3p
ACACCUGGGCUCUCCGGGUACC
>pad-miR-1-5p MIMAT0020301 Caenorhabditis elegans miR-1-5p
CAUACUUCCUUACAUGCCCAUA
>cel-miR-1-3p MIMAT0000003 Caenorhabditis elegans miR-1-3p
UGGAAUGUAAAGAAGUAUGUA
>cel-miR-2-5p MIMAT0020302 Caenorhabditis elegans miR-2-5p
CAUCAAAGCGGUGGUUGAUGUG
Time:                           0.086 seconds.
Reads Processed:           9    0.10k reads/sec
Bases Processed:         197    0.00m bases/sec
Reads Out:          7
Bases Out:          152
ADD COMMENT

Login before adding your answer.

Traffic: 2699 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6