Question: Filtering sequences from a text file in Python
3.6 years ago, ashkan110 wrote:

I have a file with many lines like this:

>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
CTCAGACGTCGGGCCGACGCAAGGCCACGCGCGCGAACACACAGGTGCGGCCCCGGGCCA
CACGCACACCGTACAC
>ENSG00000001630|ENST00000003100|CYP51A1|3210|92134365|92134530
TATATCACAGTTTCTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTTTTGCTCTTGTT
GCCCAGGCTGGAGTACAGTGACGCAATCTCGGCTCACTGCAACCTTTGCCTCCCAGGTTC
>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
TTAACTATAATCCCACTGCCTATTTTTTTATTTCTAAAAATATCATAAAAAGACACAAAA

The first line of each record (starting with >) is the identifier, and the following lines are its sequence; each identifier has its own sequence. In the example above, ENSG00000100206 is the name and ENST00000216024 is the isoform. My file contains several identifier lines that share the same name but differ in everything else. I would like to keep only the longest sequence for each name and write the result to a new file, so that each name appears only once (with its longest sequence). For the example above, the result would look like this:

>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
CTCAGACGTCGGGCCGACGCAAGGCCACGCGCGCGAACACACAGGTGCGGCCCCGGGCCA
CACGCACACCGTACAC
>ENSG00000001630|ENST00000003100|CYP51A1|3210|92134365|92134530
TATATCACAGTTTCTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTTTTGCTCTTGTT
GCCCAGGCTGGAGTACAGTGACGCAATCTCGGCTCACTGCAACCTTTGCCTCCCAGGTTC

I can count the number of identifiers by looping over the lines of the file:

count = 0
with open("file.txt", "r") as file:
    for line in file:
        # every record starts with a ">" identifier line
        if line.startswith(">"):
            count += 1

But I don't know how to do the filtering. Does anyone know how to do this in Python?


You'll need to use a dictionary for this. Are you familiar with dictionaries? How familiar are you with (Bio)python?

I can write the script you want, but I think it's more interesting for you if you do some thinking yourself. I don't mind assisting and steering your efforts :-)

It would also make more sense to call a file like this a FASTA file; it's not just a txt file.

written 3.6 years ago by WouterDeCoster44k
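
For anyone who wants to see the dictionary idea from the reply spelled out, here is a minimal sketch in plain Python. It is not the reply author's script, only one possible way to do it. The function names (read_fasta, longest_per_gene, write_fasta, filter_file) and the output file name are made up for illustration. It assumes the name is the first |-separated field after the >, that a tie in length keeps the record seen first, and that rewrapping the output sequence at 60 characters per line is acceptable.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(records):
    """Map each name to the (header, sequence) pair with the longest sequence."""
    best = {}
    for header, seq in records:
        # assumption: the name is the first |-separated field after ">"
        gene = header[1:].split("|")[0]
        # strict ">" means the first record wins when lengths are equal
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


def write_fasta(best, out, width=60):
    """Write the kept records to an open file, wrapping sequences at `width`."""
    for header, seq in best.values():
        out.write(header + "\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


def filter_file(in_path, out_path):
    """Read in_path and write one record per name (the longest) to out_path."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        write_fasta(longest_per_gene(read_fasta(fin)), fout)
```

Calling filter_file("file.txt", "longest.txt") would then write the filtered records to a new file. On Python 3.7 or later, the dictionary keeps insertion order, so names appear in the output in the order they were first seen in the input.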