Filtering Fasta sequences by length (only one must prevail)
2
0
Entering edit mode
6.3 years ago
Solowars ▴ 70

Hello everyone!

I'm working with fasta files containing sequences from different organisms, and for some of them I have more than one sequence. I would like to have only one representative sequence per organism, and I'd like it to be the longest one in each case. I've spent some time looking for an answer and learning to use some command line tools, but I couldn't get it right. My file kinda looks like this

>Mouse01
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTGTGGTGATGATG
>Mouse02
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTG
>Chimpanzee
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human01
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human02
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATGCACGTGAGA

In this case, I'd like to keep Mouse01, Chimpanzee, and Human02.

The workflow, I think, would be:

1) Identify sequences of the same species by regex (e.g. Mouse, Human)

2) Count sequence length for species with more than one match

3) Keep only the longest sequence in species with more than one match, leave the rest (e.g. Chimpanzee) untouched.

I bet there must be some magical recipe or one-liner to do this using command line, but how would it look like?

Thanks from a very very rookie bioinformatic tools learner.

alignment sequence fasta • 2.5k views
ADD COMMENT
0
Entering edit mode

What about matching sequences with the same length?

ADD REPLY
0
Entering edit mode

What will be your downstream analyses? Selecting the longest sequence may not be the best approach, depending on how your data has been generated and what you want to do.

ADD REPLY
0
Entering edit mode

Well, I intend to perform tests for positive selection among a set of ortholog genes. In some cases I know that the orthology is not 1-to-1, so I thought of keeping the longest sequence in species with no 1-to-1 homology. Perhaps this is not the best approach, in which case I'd appreciate your suggestions.

ADD REPLY
0
Entering edit mode

Probably PhyloTreePruner is a better option then.

ADD REPLY
4
Entering edit mode
6.3 years ago

linearize fasta. create a new column for length and duplicate title, normalize duplicated title with sed.

sort on lentgth (column 1)

sort on normalized name , with stable+uniq

convert back to fasta

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa  |\
awk '{printf("%d\t%s\t%s\t%s\n",length($2),$1,$1,$2);}'  |\
sed 's/>\(Mouse\|Human\)[0-9]*/>\1/' |\
sort -t $'\t'  -k1,1nr  |\
sort -t $'\t' -k2,2 --stable -u |\
cut -f 3,4 |\
tr "\t" "\n"


>Chimpanzee
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human02
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATGCACGTGAGA
>Mouse01
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTGTGGTGATGATG
ADD COMMENT
0
Entering edit mode

Thank you Pierre! Is it possible to apply the same code but using a file with the list of identifier patterns rather than literal terms in the sed command?

ADD REPLY
1
Entering edit mode

you can use the option -f of sed....

ADD REPLY
0
Entering edit mode

It worked pretty well! Thank you so much

ADD REPLY
0
Entering edit mode
6.3 years ago
Hussain Ather ▴ 990

Python:

f = open("input.txt", "r")
o = open("output.txt", "w")

d = {}

for line in f.readlines():
        sline = line.replace("\n", "")
        if line.startswith(">"):
                if sline[-2:].isdigit():
                        key=sline[1:-2]
                else:
                        key=sline[1:]
        else:
                if  key in d:
                        if len(d[key])<len(sline):
                                d[key] = sline
                else:
                        d[key] = sline

for key in d:
        o.write(">" + str(key) + "\n")
        o.write(str(d[key] + "\n"))
ADD COMMENT
0
Entering edit mode

Sorry to say that it didn't work. It didn't reduce the total amount of sequences, and it kept only the first line of each sequence (my sequences span several lines each). I think that a list of species identifiers should be used somewhere, since sequence names are not exactly identical. Perhaps some regex somewhere?

ADD REPLY

Login before adding your answer.

Traffic: 2044 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6