Question

How can I count the symbols in a given string / text and, as a result of that count, remove the characters

0

4.8 years ago

USER • 0

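with open('Denv4-X-gb_AY947539.txt', 'r') as f:
    z = f.read()
# Both sums count every '-' in the whole file; reversing the text does not
# change the total, so leading and trailing gaps are not told apart.
count_inicio = sum(1 for x in z if x == '-')
count_fim = sum(1 for x in reversed(z) if x == '-')
print(count_inicio, count_fim)
Output: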
479 479

file contents:

lcl|NC_002640.1_cds_NP_073286.1_1 [gene=POLY] [locus_tag=DV4_gp1]
     [db_xref=GeneID:5075729] [protein=polyprotein]
     [protein_id=NP_073286.1] [location=102..10265] [gbkey=CDS]
     ------------------------------------------------------------
     ---------------------------------atgaaccaacgaaaaaaggtggttaga
     ccacctttcaatatgctgaaacgcgagagaaaccgcgtatcaacccctcaagggttggtg
     aagagattctcaaccggacttttttctgggaaaggacccttacggatggtgctagcattc
     atcacgtttttgcgagtcctttccatcccaccaacagcagggattctgaagagatgggga
     cagttgaagaaaaataaggccatcaagatactgattggattcaggaaggagataggccgc
     ------------------------------------------------------------ 

gb:AY947539|Organism:Dengue virus 4|Strain
     Name:H241|Segment:null|Subtype:4|Host:Human
     ggtcgtgtggaccgacaaggacagttccaaatcggaagcttgcttaacacagttctaaca
     gtttgtttagatagagagcagatctctggaaaaatgaaccaacgaaaaaaggtggttaga
     ccacctttcaatatgctgaaacgcgagagaaaccgcgtatcaacccctcaagggttggtg
     aagagattctcaaccggacttttttccgggaaaggacccttacggatggtgctagcattc
     atcacgtttttgcgagtcctttccatcccaccaacagcagggattctgaaaagatgggga
     cagttgaagaaaaacaaggccatcaaaatactgactggattcaggaaggagataggccgc
     atgctgaacatcttgaatggaagaaaaaggtcaacaatgacattgctgtgcttgattccc

For example, I need to take the sequence lcl|NC_002640.1_cds_NP_073286.1_1 (say, ---AATG-GG----) and count the number of "-" at the beginning and at the end.

Then I need to cut that many characters from the start and end of the second sequence, gb:AY947539|Organism:Dengue virus 4| (say, GGGAATG-GGAAAA).

For instance, with 3 "-" at the start of the first sequence and 4 at the end, the output I want is AATG-GG. But first I need to count the "-" at the beginning and at the end.

How do I count symbols in a given string / text and, as a result of that count, remove characters from another string / text in the same file?

genome alignment gene sequence software error • 1.2k views

ADD COMMENT • link 4.8 years ago by USER • 0

0

1) First, understand your format. It looks like a multiple alignment format, so check whether Biopython has a module to read it.

2) If not, you need to read the sequences yourself. The first sequence has a 3-line header (is it not a single line? That would make it easier to read) and the second sequence has a 2-line header; after each header comes the nucleotide sequence in blocks. Join the blocks of sequence 1 into one string and iterate over it to count the "-".

3) Load sequence 2 into another string and remove the matching blocks (use string slices).
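
The steps above can be sketched in plain Python roughly as follows. This is only a sketch under assumptions, not a tested pipeline: it assumes standard FASTA text with one `>` header line per record and that both sequences come from the same alignment (same length). The helper names `read_fasta`, `count_leading`, `count_trailing` and `trim_like`, and the toy records `ref` and `query`, are made up for illustration.

```python
from itertools import takewhile

GAP = "-"

def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs.

    Assumes each record starts with a single '>' header line; wrapped
    sequence lines are joined and any whitespace inside them is dropped.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def count_leading(seq, char=GAP):
    """Number of consecutive `char` at the start of seq."""
    return sum(1 for _ in takewhile(lambda c: c == char, seq))

def count_trailing(seq, char=GAP):
    """Number of consecutive `char` at the end of seq."""
    return count_leading(reversed(seq), char)

def trim_like(reference, target):
    """Cut from target as many characters as reference has gaps at each end."""
    start = count_leading(reference)
    end = len(target) - count_trailing(reference)
    return target[start:end]

# Toy alignment (hypothetical data), wrapped over two lines per record.
example = """>ref
---AATG
-GG----
>query
GGGAATG
-GGAAAA
"""
(ref_name, ref_seq), (query_name, query_seq) = read_fasta(example)
trimmed = trim_like(ref_seq, query_seq)
print(f">{query_name}\n{trimmed}")
```

Only the runs of "-" at the two ends are counted, so gaps inside the sequence are left alone. Counting every "-" in the file instead gives the same total in both directions.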

ADD REPLY • link 4.8 years ago by JC 13k

0

The header is on a single line in my FASTA file. I converted it to txt because I couldn't read it with Biopython. The alignment was done with MAFFT.

arquivo.fasta.aln or arquivo.aln, converted to txt

Could you give an example?

Input:

>lcl|NC_002640.1_cds_NP_073286.1_1
---AATG-GG----
>gb:AY947539|Organism:Dengue virus 4|
GGGAATG-GGAAAA

Output:

>gb:AY947539|Organism:Dengue virus 4|
AATG-GG

I need to count the number of "-" at the start and end of the first string and cut that many characters from the corresponding ends of the second string, so that the second string is left with only the CDS.
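
For the two toy strings in this example, a minimal sketch could look like the one below. It assumes both strings are rows of the same alignment, so they have the same length, and the function name `trim_to_reference` is hypothetical. It uses only the built-in `str.lstrip` and `str.rstrip`.

```python
def trim_to_reference(reference: str, target: str) -> str:
    """Trim target by the number of '-' at each end of reference."""
    # lstrip/rstrip remove only the run of '-' at that end, so the length
    # difference is the number of leading (or trailing) gaps.
    n_start = len(reference) - len(reference.lstrip("-"))
    n_end = len(reference) - len(reference.rstrip("-"))
    # Use an explicit end index: target[n_start:-n_end] would return an
    # empty string when n_end is 0.
    return target[n_start:len(target) - n_end]

reference = "---AATG-GG----"
target = "GGGAATG-GGAAAA"
result = trim_to_reference(reference, target)
print(result)  # AATG-GG
```

Here the reference has 3 leading and 4 trailing "-", so the slice keeps positions 3 to 9 of the second string. The "-" inside the sequence is not touched.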

ADD REPLY • link 4.8 years ago by USER • 0

1

Please, man, we need an input and the expected output.

For example:

Input:

> header
------AAAA---BBBBB-----
-----------ATGCATGC---
---ATGCATGCCCCC

> GB proteinA proteinB
aactgtgactgcatgcatgactgactg
tacactactgcatgcatgactgactgc

Desired output:

> GB proteinA ----- proteinB
aactgtgactgcatgcatgactgactg
tacactactgcatgcatgactgactgc