Question: Filtering Fasta sequences by length (only one must prevail)
0
gravatar for Solowars
22 months ago by
Solowars50
Brazil/Porto Alegre/UFRGS
Solowars50 wrote:

Hello everyone!

I'm working with fasta files containing sequences from different organisms, and for some of them I have more than one sequence. I would like to have only one representative sequence per organism, and I'd like it to be the longest one in each case. I've spent some time looking for an answer and learning to use some command line tools, but I couldn't get it right. My file kinda looks like this

>Mouse01
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTGTGGTGATGATG
>Mouse02
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTG
>Chimpanzee
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human01
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human02
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATGCACGTGAGA

In this case, I'd like to keep Mouse01, Chimpanzee, and Human02.

The workflow, I think, would be:

1) Identify sequences of the same species by regex (e.g. Mouse, Human)

2) Count sequence length for species with more than one match

3) Keep only the longest sequence in species with more than one match, leave the rest (e.g. Chimpanzee) untouched.

I bet there must be some magical recipe or one-liner to do this using command line, but how would it look like?

Thanks from a very very rookie bioinformatic tools learner.

sequence alignment fasta • 867 views
ADD COMMENTlink modified 22 months ago by Pierre Lindenbaum123k • written 22 months ago by Solowars50

What about matching sequences with the same length?

ADD REPLYlink written 22 months ago by Hussain Ather940

What will be your downstream analyses? Selecting the longest sequence may not be the best approach, depending on how your data has been generated and what you want to do.

ADD REPLYlink written 21 months ago by h.mon27k

Well, I intend to perform tests for positive selection among a set of ortholog genes. In some cases I know that the orthology is not 1-to-1, so I thought of keeping the longest sequence in species with no 1-to-1 homology. Perhaps this is not the best approach, in which case I'd appreciate your suggestions.

ADD REPLYlink written 21 months ago by Solowars50

Probably PhyloTreePruner is a better option then.

ADD REPLYlink written 21 months ago by h.mon27k
2
gravatar for Pierre Lindenbaum
22 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

linearize fasta. create a new column for length and duplicate title, normalize duplicated title with sed.

sort on lentgth (column 1)

sort on normalized name , with stable+uniq

convert back to fasta

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa  |\
awk '{printf("%d\t%s\t%s\t%s\n",length($2),$1,$1,$2);}'  |\
sed 's/>\(Mouse\|Human\)[0-9]*/>\1/' |\
sort -t $'\t'  -k1,1nr  |\
sort -t $'\t' -k2,2 --stable -u |\
cut -f 3,4 |\
tr "\t" "\n"


>Chimpanzee
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human02
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATGCACGTGAGA
>Mouse01
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTGTGGTGATGATG
ADD COMMENTlink written 22 months ago by Pierre Lindenbaum123k

Thank you Pierre! Is it possible to apply the same code but using a file with the list of identifier patterns rather than literal terms in the sed command?

ADD REPLYlink written 22 months ago by Solowars50
1

you can use the option -f of sed....

ADD REPLYlink written 22 months ago by Pierre Lindenbaum123k

It worked pretty well! Thank you so much

ADD REPLYlink written 22 months ago by Solowars50
0
gravatar for Hussain Ather
22 months ago by
Hussain Ather940
National Institutes of Health, Bethesda, MD
Hussain Ather940 wrote:

Python:

f = open("input.txt", "r")
o = open("output.txt", "w")

d = {}

for line in f.readlines():
        sline = line.replace("\n", "")
        if line.startswith(">"):
                if sline[-2:].isdigit():
                        key=sline[1:-2]
                else:
                        key=sline[1:]
        else:
                if  key in d:
                        if len(d[key])<len(sline):
                                d[key] = sline
                else:
                        d[key] = sline

for key in d:
        o.write(">" + str(key) + "\n")
        o.write(str(d[key] + "\n"))
ADD COMMENTlink modified 22 months ago • written 22 months ago by Hussain Ather940

Sorry to say that it didn't work. It didn't reduce the total amount of sequences, and it kept only the first line of each sequence (my sequences span several lines each). I think that a list of species identifiers should be used somewhere, since sequence names are not exactly identical. Perhaps some regex somewhere?

ADD REPLYlink written 22 months ago by Solowars50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1816 users visited in the last hour