Question

Slicing sequences in a multi-FASTA file into fixed-length substrings by index/location

5.3 years ago

jamesdong • 0

I have a file containing multiple DNA sequences in FASTA format, like this:

>XM_123456
ACTGTATGC

>XM_298778
ATACACA
...

I want to extract every substring of a defined length from each sequence. For example, with lengths of 5 to 9 nucleotides, the output file (FASTA format) should look like this:

>XM_123456(1-5)
ACTGT
>XM_123456(1-6)
ACTGTA
...
>XM_123456(1-9)
ACTGTATGC

>XM_123456(2-6)
CTGTA
...
>XM_123456(2-9)
CTGTATGC

>XM_123456(3-7)
TGTAT
...

>XM_123456(3-9)
TGTATGC

>XM_123456(4-8)
GTATG
>XM_123456(4-9)
GTATGC

....

>XM_298778(1-5)
ATACA
>XM_298778(1-6)
ATACAC
>XM_298778(1-7)
ATACACA

>XM_298778(2-6)
TACAC
>XM_298778(2-7)
TACACA

Can anyone help me? Thanks in advance.
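The enumeration described above can be sketched in plain Python. This is a minimal sketch, not a tested pipeline: the function names (read_fasta, windows, slice_fasta) are hypothetical, the header format ">ID(start-end)" with 1-based inclusive coordinates is inferred from the example output, and the first whitespace-delimited token of each header is assumed to be the sequence ID.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()  # tolerate leading/trailing whitespace
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the ID is the first token after '>'.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def windows(seq, min_len=5, max_len=9):
    """Yield (start, end, subsequence) using 1-based inclusive coordinates."""
    for start in range(len(seq)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(seq):
                break  # longer windows from this start would overrun
            yield start + 1, end, seq[start:end]


def slice_fasta(lines, min_len=5, max_len=9):
    """Yield output lines: a '>ID(start-end)' header, then the substring."""
    for name, seq in read_fasta(lines):
        for start, end, sub in windows(seq, min_len, max_len):
            yield f">{name}({start}-{end})"
            yield sub
```

Something like print("\n".join(slice_fasta(open("input.fa")))) would write the records to standard output, where input.fa is a placeholder file name. Windows are grouped by start position first and by length second, matching the order in the example.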

python bash awk • 1.3k views

ADD COMMENT • link 5.3 years ago by jamesdong • 0

What input files do you have? You mentioned you have a fasta file, but do you also have a file which maps the names to the lengths you need? What does this file look like?

ADD REPLY • link 5.3 years ago by Joe 21k

Reasonably certain that seqkit should be able to do this (https://github.com/shenwei356/seqkit). You can take a look at the manual.

ADD REPLY • link 5.3 years ago by GenoMax 141k

@jrj.healey, thank you for the reply. I don't have a file that maps the names to the lengths. I only have the FASTA file and the required lengths (5 to 9 nucleotides). That is my starting point, and I want to get the results shown above.

ADD REPLY • link 5.3 years ago by jamesdong • 0

Are these lengths in the correct corresponding order to the sequences in the fasta?

Basically what I'm getting at is there is no way to match up what length you require with the sequence you need it from based on your question at the moment.

ADD REPLY • link 5.3 years ago by Joe 21k

Thanks for the good question about the mapping. Using seqkit sliding -s 1 -W 5 also partially achieves the goal.

ADD REPLY • link 5.3 years ago by jamesdong • 0

@genomax, many thanks, I will try it. I piped the file through seqkit sliding -s 1 -W 5 and it works very well, though I have to define the length each time. Thanks, buddy.
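One way to avoid retyping the length each time is to generate one command per window size. The sketch below only builds the command lines. The helper name sliding_commands and the file name input.fa are hypothetical, and passing the file as a positional argument is an assumption, since only piped usage is mentioned here.

```python
import shlex


def sliding_commands(fasta_path, min_len=5, max_len=9, step=1):
    """Build one 'seqkit sliding' argument list per window length."""
    return [
        ["seqkit", "sliding", "-s", str(step), "-W", str(width), fasta_path]
        for width in range(min_len, max_len + 1)
    ]


# Print the commands for review; nothing is executed here.
for cmd in sliding_commands("input.fa"):
    print(shlex.join(cmd))
```

Each list could then be handed to subprocess.run(cmd, check=True) once seqkit is installed. The outputs would come back grouped by window length rather than by start position.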

ADD REPLY • link 5.3 years ago by jamesdong • 0

@jamesdong: Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized. Using the Chrome browser helps if you are posting from China and can't access these buttons in a different browser.

ADD REPLY • link 5.3 years ago by GenoMax 141k

Yes, you are right, thank you for your help!

ADD REPLY • link 5.3 years ago by jamesdong • 0

Please avoid terms such as "buddy". This is a professional forum where a certain etiquette needs to be followed.

ADD REPLY • link 5.3 years ago by Ram 43k

Thank you for reminding me of this.

ADD REPLY • link 5.3 years ago by jamesdong • 0

This looks like homework, or an XY problem. You should at least tell us what you have tried, or why you want to do this and what the original purpose is.

ADD REPLY • link 5.3 years ago by shenwei356 8.4k

Thank you for the reply. It is a simple problem that comes from an original idea: there are many metabolic fragments of DNA or proteins in the body, and I want to generate the predicted DNA or protein fragments before I evaluate their properties.

ADD REPLY • link 5.3 years ago by jamesdong • 0