Question: Filtering Fasta sequences by length (only one must prevail)
0
gravatar for Solowars
15 months ago by
Solowars40
Brazil/Porto Alegre/UFRGS
Solowars40 wrote:

Hello everyone!

I'm working with fasta files containing sequences from different organisms, and for some of them I have more than one sequence. I would like to have only one representative sequence per organism, and I'd like it to be the longest one in each case. I've spent some time looking for an answer and learning to use some command line tools, but I couldn't get it right. My file kinda looks like this

>Mouse01
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTGTGGTGATGATG
>Mouse02
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTG
>Chimpanzee
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human01
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human02
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATGCACGTGAGA

In this case, I'd like to keep Mouse01, Chimpanzee, and Human02.

The workflow, I think, would be:

1) Identify sequences of the same species by regex (e.g. Mouse, Human)

2) Count sequence length for species with more than one match

3) Keep only the longest sequence in species with more than one match, leave the rest (e.g. Chimpanzee) untouched.

I bet there must be some magical recipe or one-liner to do this using command line, but how would it look like?

Thanks from a very very rookie bioinformatic tools learner.

sequence alignment fasta • 699 views
ADD COMMENTlink modified 15 months ago by Pierre Lindenbaum118k • written 15 months ago by Solowars40

What about matching sequences with the same length?

ADD REPLYlink written 15 months ago by Hussain Ather910

What will be your downstream analyses? Selecting the longest sequence may not be the best approach, depending on how your data has been generated and what you want to do.

ADD REPLYlink written 15 months ago by h.mon24k

Well, I intend to perform tests for positive selection among a set of ortholog genes. In some cases I know that the orthology is not 1-to-1, so I thought of keeping the longest sequence in species with no 1-to-1 homology. Perhaps this is not the best approach, in which case I'd appreciate your suggestions.

ADD REPLYlink written 14 months ago by Solowars40

Probably PhyloTreePruner is a better option then.

ADD REPLYlink written 14 months ago by h.mon24k
2
gravatar for Pierre Lindenbaum
15 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

linearize fasta. create a new column for length and duplicate title, normalize duplicated title with sed.

sort on lentgth (column 1)

sort on normalized name , with stable+uniq

convert back to fasta

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa  |\
awk '{printf("%d\t%s\t%s\t%s\n",length($2),$1,$1,$2);}'  |\
sed 's/>\(Mouse\|Human\)[0-9]*/>\1/' |\
sort -t $'\t'  -k1,1nr  |\
sort -t $'\t' -k2,2 --stable -u |\
cut -f 3,4 |\
tr "\t" "\n"


>Chimpanzee
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human02
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATGCACGTGAGA
>Mouse01
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTGTGGTGATGATG
ADD COMMENTlink written 15 months ago by Pierre Lindenbaum118k

Thank you Pierre! Is it possible to apply the same code but using a file with the list of identifier patterns rather than literal terms in the sed command?

ADD REPLYlink written 15 months ago by Solowars40
1

you can use the option -f of sed....

ADD REPLYlink written 15 months ago by Pierre Lindenbaum118k

It worked pretty well! Thank you so much

ADD REPLYlink written 15 months ago by Solowars40
0
gravatar for Hussain Ather
15 months ago by
Hussain Ather910
National Institutes of Health, Bethesda, MD
Hussain Ather910 wrote:

Python:

f = open("input.txt", "r")
o = open("output.txt", "w")

d = {}

for line in f.readlines():
        sline = line.replace("\n", "")
        if line.startswith(">"):
                if sline[-2:].isdigit():
                        key=sline[1:-2]
                else:
                        key=sline[1:]
        else:
                if  key in d:
                        if len(d[key])<len(sline):
                                d[key] = sline
                else:
                        d[key] = sline

for key in d:
        o.write(">" + str(key) + "\n")
        o.write(str(d[key] + "\n"))
ADD COMMENTlink modified 15 months ago • written 15 months ago by Hussain Ather910

Sorry to say that it didn't work. It didn't reduce the total amount of sequences, and it kept only the first line of each sequence (my sequences span several lines each). I think that a list of species identifiers should be used somewhere, since sequence names are not exactly identical. Perhaps some regex somewhere?

ADD REPLYlink written 15 months ago by Solowars40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1601 users visited in the last hour