How to cut a 15000 nt sequence file into multiple files of 1000 nt each and save them as new files like F1, F2 and so on?
6 weeks ago

I have a file with sequences more than 15000 nt long, and I want to split it into new files of 1000 nt each, named F1, F2, and so on.

Example

• MainFile

  ATGCATATGCCCAGTAGCGAATGATGATCA
  ATGCATA_TCCCAGTAGTGAATGATAATCA
  _CAT_TGCC_ATTAGAGAATGATGATCA_C


Into 3 files of 10 nt each:

• FILE1

  ATGCATATGC
  ATGCATA_TC
  _CAT_TGCC_

• File2

  CCAGTAGCGA
  CCAGTAGTGA
  ATTAGAGAAT

• File3

  ATGATGATCA
  ATGATAATCA
  GATGATCA_C

6 weeks ago

Try seqkit split (https://bioinf.shenwei.me/seqkit/usage/#split) or faSplit from kentutils.


Thanks, but seqkit split divides a set of many sequences into several parts. It does not cut sequences into fragments, which is the job of seqkit subseq or seqkit sliding.


input: test.fa

$ seq 1 10 30 | while read line; do (echo ">seq_${line}_$(expr $line + 9)" && cut -c "${line}-$(expr $line + 9)" test.fa | sed '/^$/d') > "seq_${line}_$(expr $line + 9).fa"; done

$ tail -n+1 test.fa seq_*.fa
==> test.fa <==
>seq
ATGCATATGCCCAGTAGCGAATGATGATCA
ATGCATA_TCCCAGTAGTGAATGATAATCA
_CAT_TGCC_ATTAGAGAATGATGATCA_

==> seq_1_10.fa <==
>seq_1_10
>seq
ATGCATATGC
ATGCATA_TC
_CAT_TGCC_

==> seq_11_20.fa <==
>seq_11_20
CCAGTAGCGA
CCAGTAGTGA
ATTAGAGAAT

==> seq_21_30.fa <==
>seq_21_30
ATGATGATCA
ATGATAATCA
GATGATCA_

Remove the original header (>seq here) from the very first split FASTA file.

Thank you, it worked. Can you please explain this code?

$ seq 1 10 30 | while read line; do (echo ">seq_${line}_$(expr $line + 9)" && cut -c "${line}-$(expr $line + 9)" test.fa | sed '/^$/d') > "seq_${line}_$(expr $line + 9).fa"; done

1. seq 1 10 30 prints the numbers from 1 to 30 in steps of 10 (1, 11, 21), i.e. the start of each non-overlapping window of 10.
2. The bash loop has the following logic:
   a) echo writes the new header: "seq", the number from step 1, and a second number generated by adding 9 to the number from step 1.
   b) cut extracts characters column-wise from test.fa. The range runs from the number from step 1 to that number plus 9.
   c) sed removes the empty lines.
   d) > writes the output to a file named "seq" followed by the number from step 1 and the second number (the number from step 1 plus 9).

The reason for adding 9: the window size is 10 and the numbers from step 1 are 1, 11 and 21, so the ranges needed for cutting characters are 1-10, 11-20 and 21-30.

Thank you for your valuable time and explanation. Can you tell me how I can get the output names in a format like seq00001, seq00002, and so on, or seq_000001_001000.fas? When I run sort on these file names, it sorts by the leading characters, e.g. seq_110001_111000.fas, seq_11001_12000.fas, seq_111001_112000.fas and so on, so it is difficult to sort by file number.

man sort

  -V, --version-sort    natural sort of (version) numbers within text

Thank you.
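On the zero-padded names asked about above: one option is to format the numbers with printf before building the file name, so that plain sort and shell globbing already give the numeric order. This is a minimal sketch, not the answerer's code. The six-digit width and the .fas extension are taken from the example names in the question, and the headerless three-line demo input and the variable names are assumptions.

```shell
# Demo input (assumption: one aligned sequence per line, no FASTA header).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' \
  ATGCATATGCCCAGTAGCGAATGATGATCA \
  ATGCATA_TCCCAGTAGTGAATGATAATCA \
  _CAT_TGCC_ATTAGAGAATGATGATCA_C > test.txt

win=10    # window size; use 1000 for the real data
len=30    # alignment length; use 15000 for the real data

for start in $(seq 1 "$win" "$len"); do
  end=$((start + win - 1))
  # %06d pads with leading zeros, e.g. 11 becomes 000011
  name=$(printf 'seq_%06d_%06d.fas' "$start" "$end")
  cut -c "${start}-${end}" test.txt > "$name"
done

ls seq_*.fas    # the padded names list in numeric order
```

With fixed-width numbers, lexical order and numeric order coincide, so sort -V is no longer needed for these names.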
6 weeks ago

• For the format you pasted (one line per sequence), use the commands below (change 10 to 1000, 30 to 15000, and 9 to 999).
• For the FASTA format (usually produced by multiple sequence alignment software), replace the awk command with another tool, such as seqkit subseq -r $s:$e msa.fasta > msa.$s-$e.txt.

Just use awk:

for s in $(seq 1 10 30); do \
e=$(expr $s + 9); \
echo $s $e; \
done
1 10
11 20
21 30
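With the real sizes from the question substituted in (window 1000, length 15000), the same boundary calculation might look like the sketch below. It uses shell arithmetic $(( )) in place of expr, which is an equivalent substitution rather than the answerer's code, and the variable names are illustrative.

```shell
win=1000     # window size in columns
len=15000    # alignment length in columns

# Collect "start end" for every non-overlapping window.
windows=$(for s in $(seq 1 "$win" "$len"); do
  e=$((s + win - 1))
  echo "$s $e"
done)

echo "$windows" | head -n 3    # first three windows
echo "$windows" | tail -n 1    # last window
```

This prints 1 1000, 1001 2000 and 2001 3000, followed by the last of the 15 windows, 14001 15000.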

for s in $(seq 1 10 30); do \
e=$(expr $s + 9); \
awk -v s=$s '{print substr($0, s, 10)}' msa.txt > msa.$s-$e.txt
done

more msa.*.txt
::::::::::::::
msa.1-10.txt
::::::::::::::
ATGCATATGC
ATGCATA_TC
_CAT_TGCC_
::::::::::::::
msa.11-20.txt
::::::::::::::
CCAGTAGCGA
CCAGTAGTGA
ATTAGAGAAT
::::::::::::::
msa.21-30.txt
::::::::::::::
ATGATGATCA
ATGATAATCA
GATGATCA_C
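As a variation on the loop above, awk can also write every window in a single pass over the file, which avoids rereading a large alignment once per window. This is a minimal sketch, assuming one aligned sequence per line with no header lines. The output names follow the msa.START-END.txt pattern used above, and the demo input is the example from the question.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' \
  ATGCATATGCCCAGTAGCGAATGATGATCA \
  ATGCATA_TCCCAGTAGTGAATGATAATCA \
  _CAT_TGCC_ATTAGAGAATGATGATCA_C > msa.txt

win=10    # use 1000 for the real data

awk -v w="$win" '{
  # Walk along the line in steps of w columns.
  for (s = 1; s <= length($0); s += w) {
    out = sprintf("msa.%d-%d.txt", s, s + w - 1)
    # Within one awk run, ">" truncates a file only when it is first opened,
    # so later lines are appended to the same window file.
    print substr($0, s, w) > out
  }
}' msa.txt

tail -n +1 msa.*.txt
```

awk keeps one output file open per window, so this suits a modest number of windows such as the 15 needed here. For thousands of windows the per-window loop may be the safer choice.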


Thank you, this code also worked. Can you explain the code?


The main file is not a single-line FASTA file; it is a multiple sequence alignment file. I want to cut it by column, i.e. horizontally into blocks of 1000 characters each.

Example
ATGCATATGCCCAGTAGCGAATGATGATCA
ATGCATA_TCCCAGTAGTGAATGATAATCA
_CAT_TGCC_ATTAGAGAATGATGATCA_C   (MainFile)

Into 3 files of 10 nt each:

ATGCATATGC   (File1)
ATGCATA_TC
_CAT_TGCC_

CCAGTAGCGA   (File2)
CCAGTAGTGA
ATTAGAGAAT

ATGATGATCA   (File3)
ATGATAATCA
GATGATCA_C
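If the alignment is stored as FASTA (a > header line followed by the aligned sequence, possibly wrapped over several lines), the headers have to be carried into every block. The following is a minimal awk sketch under those assumptions, not a replacement for a dedicated tool such as seqkit. The file name aln.fa, the headers s1 to s3 and the block_ output names are hypothetical.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
# Hypothetical aligned FASTA; the first record is wrapped over two lines.
cat > aln.fa <<'EOF'
>s1
ATGCATATGCCCAGT
AGCGAATGATGATCA
>s2
ATGCATA_TCCCAGTAGTGAATGATAATCA
>s3
_CAT_TGCC_ATTAGAGAATGATGATCA_C
EOF

win=10    # use 1000 for the real data

awk -v w="$win" '
# Write the current record into every column block, header first.
function emit(   s, e, out) {
  if (header == "") return
  for (s = 1; s <= length(seq); s += w) {
    e = s + w - 1
    if (e > length(seq)) e = length(seq)   # last block may be shorter
    out = sprintf("block_%06d_%06d.fa", s, e)
    print header > out
    print substr(seq, s, w) > out
  }
}
/^>/ { emit(); header = $0; seq = ""; next }   # new record: write the previous one
     { seq = seq $0 }                           # join wrapped sequence lines
END  { emit() }
' aln.fa

tail -n +1 block_*.fa
```

Each block file then holds all three headers with the matching 10 columns of sequence. The sketch assumes every sequence has the same aligned length, as in a multiple sequence alignment.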


Try with the above solutions. If they do not work, come back, post what you tried, the error you are having and any other relevant details.


I misunderstood your query. You want the sequences cut vertically. The above solutions may not work.