Question: Slicing multiple DNA sequences in a FASTA file into fixed-length substrings by index/location
6 months ago by
jamesdong0
China
jamesdong0 wrote:

I have a file containing multiple DNA sequences (FASTA format), like this:

>XM_123456
 ACTGTATGC

>XM_298778
 ATACACA
...

I want to extract every subsequence of a defined length from each DNA sequence, for example all subsequences 5 to 9 nucleotides long, with the output file like this (FASTA format):

>XM_123456(1-5)
 ACTGT
>XM_123456(1-6)
 ACTGTA
...
>XM_123456(1-9)
ACTGTATGC


>XM_123456(2-6)
CTGTA
 ...

>XM_123456(2-9)
 CTGTATGC


>XM_123456(3-7)
TGTAT
...

>XM_123456(3-9)
TGTATGC

>XM_123456(4-8)
GTATG
>XM_123456(4-9)
GTATGC

....

>XM_298778(1-5)
ATACA
>XM_298778(1-6)
ATACAC
>XM_298778(1-7)
ATACACA

>XM_298778(2-6)
TACAC
>XM_298778(2-7)
TACACA

Can anyone help me? Thanks in advance.

awk bash python • 243 views

What input files do you have? You mentioned you have a fasta file, but do you also have a file which maps the names to the lengths you need? What does this file look like?

ADD REPLYlink written 6 months ago by jrj.healey13k

Reasonably certain that seqkit should be able to do this ( https://github.com/shenwei356/seqkit ). You can take a look at the manual.

ADD REPLYlink written 6 months ago by genomax70k

@jrj.healey, thank you for the reply. I don't have a file that maps the names to the lengths. I only have the FASTA file and the required length range (5~9 nucleotides). These are my starting point, and I want to get the results from them.

ADD REPLYlink written 6 months ago by jamesdong0

Are these lengths in the correct corresponding order to the sequences in the fasta?

Basically what I'm getting at is there is no way to match up what length you require with the sequence you need it from based on your question at the moment.

ADD REPLYlink written 6 months ago by jrj.healey13k

Thanks for your good question about the mapping file. Using seqkit sliding -s 1 -W 5 also partially achieves that goal.

ADD REPLYlink modified 6 months ago • written 6 months ago by jamesdong0

@genomax, many thanks, I will try it. I piped the file through seqkit sliding -s 1 -W 5 and it works very well, though I have to set the window length separately for each run. Thanks, buddy.
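For readers who want every length from 5 to 9 in a single pass instead of one run per window size, the following is a minimal Python sketch, not a seqkit feature. It assumes the input is plain FASTA, that the whole header line after ">" is used as the identifier, and that coordinates are 1-based and inclusive as in the question's example. The function names read_fasta, fragments and slice_fasta are hypothetical, chosen only for illustration.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()          # also drops the leading spaces seen in the example
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)     # multi-line sequences are concatenated
    if name is not None:
        yield name, "".join(chunks)


def fragments(seq, min_len=5, max_len=9):
    """Yield (start, end, subsequence) using 1-based inclusive coordinates."""
    for start in range(len(seq) - min_len + 1):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(seq):
                break               # longer windows would run past the sequence end
            yield start + 1, end, seq[start:end]


def slice_fasta(lines, min_len=5, max_len=9):
    """Yield output lines in FASTA format, headers written as ID(start-end)."""
    for name, seq in read_fasta(lines):
        for start, end, sub in fragments(seq, min_len, max_len):
            yield f">{name}({start}-{end})"
            yield sub


if __name__ == "__main__":
    # The two records from the question, used as a stand-in for a real file.
    example = [">XM_123456", " ACTGTATGC", "", ">XM_298778", " ATACACA"]
    print("\n".join(slice_fasta(example)))
```

To run it on a real file, the example list could be replaced by an open file handle, since read_fasta only needs an iterable of lines. Sequences shorter than min_len produce no output.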

ADD REPLYlink modified 6 months ago • written 6 months ago by jamesdong0

@jamesdong: Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized. Using the Chrome browser helps if you are posting from China and can't access these buttons in a different browser.

ADD REPLYlink written 6 months ago by genomax70k

Yes, you are right, thank you for your help!

ADD REPLYlink written 6 months ago by jamesdong0

Please avoid terms such as "buddy". This is a professional forum where a certain etiquette needs to be followed.

ADD REPLYlink written 6 months ago by RamRS23k

Thank you for reminding me of this.

ADD REPLYlink written 6 months ago by jamesdong0

This looks like homework, or an XY problem. You should at least tell us what you have tried, or why you want to do this and what the original purpose is.

ADD REPLYlink written 6 months ago by shenwei3564.7k

Thank you for the reply. It is a simple problem that comes from an original idea: there are many metabolic fragments of DNA or proteins in the body, and I want to generate the predicted DNA or protein fragments before I evaluate their properties.

ADD REPLYlink modified 6 months ago • written 6 months ago by jamesdong0