Question: (Closed) How to count a sequence and cut in the other sequence?
gravatar for projetoic
24 days ago by
projetoic0 wrote:

I have a reference sequence (CDS) and an aligned sequence in the same file. Format fasta.aln or aln.The alignment was done with MAFFT.


RefSeq - - - - - - AAGCTGC


Output I would like:


It would be a removal from sequence 1 according to the symbol "-" of RefSeq. I would like to extract only the CDS after the alignment. Is there any way to do this from the command line or some programming language? I tried to do it with biopython but was not successful!

f = open('Denv4-X-gb_AY947539.txt', 'r')
con = f.readlines()
con = [i.strip() for i in con]
length = len(con[0].split("-")[0])
result = f'{con[0].split("-")[0]} {con[0].split("-")[0][length:]}'
f = open('Denv4cds.txt', 'a') f.write(f'\n{result}')

Do you have a script or module or library that can do this?

I'm using windows wsl.

I'm a beginner in bioinformatics

the generated file just printed the first line ...

ADD COMMENTlink modified 24 days ago • written 24 days ago by projetoic0

I think you need to check the formatting of the post here. It's not clear to me from your description how GGGGG relates to the RefSeq sequence?

I interpret this to mean you want to trim off sequence in your query which corresponds to gap characters in the RefSeq aligned string?

ADD REPLYlink modified 24 days ago • written 24 days ago by Joe18k
> >lcl|NC_001477.1_cds_NP_059433.1_1/1-10179 [gene=POLY] [locus_tag=DV1_gp1] [db_xref=GeneID:5075725] [protein=polyprotein]
> [protein_id=NP_059433.1] [location=95..10273] [gbkey=CDS]
> ------------------------------------------------------------------------
> ----------------------atgaacaaccaacggaaaaagacgggtcgaccgtctttcaatatgctgaa acgcgcgagaaaccgcgtgtcaactgtttcacagttggcgaagagattctcaaaaggattgctttcaggcca
> aggacccatgaaattggtgatggcttttatagcattcctaagatttctagccatacctccaacagcaggaat
> tttggctagatggggctcattcaagaagaatggagcgcttaacagcacgagccacctgggccac
> caacatacaagtggccataaaccaagtgagaaggctcattgggaatgagaattatctagacttcatgacatc
> aatgaagagattcaaaaacgagagtgatcccgaaggggcactctggtaa-----------------------
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> >gb:AB189120|Organism:Dengue/1-10735 virus 1|Strain Name:98901518 DHF DV-1|Segment:null|Subtype:1-IV|Host:Human
> agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtagttctaacagttt
> tttattggagagcagatctctgatgaacaaccaacggaaaaagacgggtcgaccgtctttcaatatgctgaa
> acgcgcgagaaaccgcgtgtcaactgtttcacagttggcgaagagattctcaaaaggattgctttcaggtca
> aggacccatgaaactggtgatggcttttatagcattcctaagatttctagccatacctccaacagcaggaat
> tttggctagatggggctcattcaagaagaatggagcgattaaagtgttacggggtttgtaagagctatgctgcct
> gtgagccccgtccaaggacgtaaaatgaagtcaggccgaaagccacggtttgagcaagccgtgctgcctgta
> gctccatcgtggggatgtaaaaacctgggaggctgcaacccatggaagctgtacgcatggggtagcagacta
> gtggttagaggagacccctcccgaaacataacgcagcagcggggcccaacaccaggggaagctgtaccctgg
> tggtaaggactagaggttagaggagaccccccgcataacaataaacagcatattgacgctgggagagaccag
> agatcctgctgtctctacagcatcattccaggcacagaacgccagaaaatggaatggtgctgttgaatcaac
> aggttct-----------------------------------------------------------------
> ------------------------------------------------------------------------

it's something like this.The GGG part relates to the result of the sequence after cutting. I tried to use a fasta editor but there are many sequences
ADD REPLYlink modified 24 days ago • written 24 days ago by projetoic0

Please use the code formatting button (looks like 101010). I tried to format your output myself but it's all over the place.

ADD REPLYlink written 24 days ago by Joe18k

I have done the changes

ADD REPLYlink written 24 days ago by projetoic0

Hello projetoic!

This question has been asked again. Please do not open multiple threads for the same question. How do I count a number of ("-") in a reference string and remove it in a second string?

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.


ADD REPLYlink modified 24 days ago by genomax91k • written 24 days ago by Joe18k
gravatar for Mensur Dlakic
24 days ago by
Mensur Dlakic6.9k
Mensur Dlakic6.9k wrote:

You could get this done by using the -nogaps option of trimAl.

ADD COMMENTlink written 24 days ago by Mensur Dlakic6.9k

But I would not like to remove all gaps ... And according to the gaps of the reference sequence (lcl|NC_001477.1_cds_NP_059433.1_1/1-10179) , remove the characters or sequence of the second sequence (gb:AB189120|Organism:Dengue/1-10735 virus 1) of the organism or species. Is this option the most correct?

ADD REPLYlink written 24 days ago by projetoic0
Please log in to add an answer.
The thread is closed. No new answers may be added.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 781 users visited in the last hour