Question: how to change the sequence ID and order of sequences in a file to match an alignment output file
0
gravatar for Kay
4 weeks ago by
Kay0
Liverpool, UK
Kay0 wrote:

Hi, I have two files, one is protein sequences of a group of genes, and the other is their corresponding nucleotide sequences.

>maker-scaffold10x_338_pilon-snap-gene-0.71-mRNA-1
MHLKNGDPKPTIKPNQCTLFGFRFCPYVDRVRMVLQYYNVPHDNVWIHLYSKPDWYLELY
PVGKVPLLITKEGKTIVESDAIIRYLDETIGNKSLMSLCGEAEFERAGKLASKLMAQSHG
ILFGASVAEANASAYRDVCQEINDTIKGPYLLGDKLTLADFLLFSHVNHFEPIMARLDGL
APSDVHDLKATDQYRTKWPRLTTFLDVMRRLPCVLTVREPSQKLALFAETYRQGQPNPDL
>augustus_masked-scf7180006947290-processed-gene-0.5-mRNA-1
MSEIRSLNIFDANSQNSSEFRRNIPDFLRPYECYRCVIGKKKPDDVEYICRYSLSCLGDC
AKEKDYARYLEMKPCIFLQVNKVYGWIPDIVGENLLVKCFGKVGLIKIILNSITPEIVFN
YFGIKYDKVLINLQDKPEWFLKMYPEGKVPFIIDKQRQLGDSEIIIRDYDSKNNNKLITA
CGEEKFSETKDLISSFFGLCYTILFKDNKISDENADLFLKALEKVEAKIVGPFMWGDQLS
LADVILFTHLNMFECSLSRIEGIHPDQVKDGYPNAAREASFVKIPAYLKQMRNHSAVKDV
YVHPNDISKYAVGLRIGKPNPEGDN
>DILT_0000424901-mRNA-1
MGWVLGGDGSFLPTGCANHGDPEPSVNPENVTLYDMQFCPYCQRVRYTLDYHKIPYDRIL
IDLMSKPSWYLKMYPVGKVPLLLYRGKTMAESDVIMKYCDQMKGAKASLLSVCGEEGFKR
ALNLTSSVSLLLIALLRYKLLFSPDVTRADADSLKAALSNLDKAIQGPYLMDLLPFLTFE
GKELSLADLALFPFLHAWDLLISRLKDVGDDSDESAEPVAPRWPNVLKYCQLMNQKPFIM
KTAFRDDEFSKYMDTRLQAARP
>MS3_04642.1
MHLKRSDPKPLIDPNRLTLIGFRFCPYVDRVRLILSYYKIDYDLINVSLASKPEWFLKMY
PIGKVPLLLLPNEQKLPESDEIIRHIDKLYGSETLLSHCGIEEFEKVKELITGISRPSYM
IMCVQEINLCDVSLYRAACNKINDAIKGPYFTGSELSLADLILFPHLHRFEVVMGRITGK
KPEEINELNINDELRKEFPKLTEFLDTMRKQSFVIDVTIPYRIHVQYAASVLSGHANPDI
E

Here's the nucleotide sequences, I have deleted the remaining part of the last 3 sequences

>maker-scaffold10x_338_pilon-snap-gene-0.71-mRNA-1
ATGCATCTGAAAAACGGTGACCCAAAACCTACCATCAAGCCTAATCAATGTACTCTATTT
GGTTTTCGATTCTGTCCCTATGTGGATCGTGTCAGAATGGTACTCCAATATTACAACGTC
CCGCATGATAATGTTTGGATACATTTATACTCAAAACCGGATTGGTATCTGGAATTATAT
CCGGTCGGCAAAGTACCTCTTTTGATTACCAAAGAGGGGAAGACAATTGTGGAATCGGAT
GCGATTATACGGTATTTGGACGAAACGATCGGAAACAAGTCTCTGATGTCTTTGTGTGGT
GAAGCGGAGTTTGAGCGGGCCGGGAAATTGGCGTCTAAACTCATGGCTCAATCGCATGGT
ATTTTATTCGGCGCCAGTGTCGCGGAAGCTAATGCGTCTGCGTATCGTGACGTCTGTCAA
GAAATAAATGATACAATCAAGGGACCATACTTGTTGGGCGACAAGTTGACATTGGCCGAT
TTTCTGTTATTCTCTCATGTGAACCACTTCGAACCGATCATGGCTCGTTTAGACGGTCTA
GCACCCAGTGACGTTCATGATCTGAAAGCGACCGATCAGTACAGGACGAAATGGCCCCGG
TTGACCACCTTCTTGGATGTTATGCGTCGTTTGCCCTGTGTGCTTACCGTACGTGAGCCG
TCCCAAAAGCTTGCCCTTTTTGCGGAAACATATCGTCAAGGTCAACCAAATCCGGATCTA
TGA
>augustus_masked-scf7180006947290-processed-gene-0.5-mRNA-1
ATGAGTGAAATACGGAGTTTAAACATTTTCGATGCCAACAGCCAGAACTCA.......
>DILT_0000424901-mRNA-1
ATGGGCTGGGTATTAGGTGGCGACGGCTCCTTCTTACCCACCGGTTGTGCTAA.......
>MS3_04642.1
ATGCACCTCAAACGAAGTGACCCTAAACCACTGATTGATCCTAATC..........

After aligning my protein sequences, my output looks something likes (I have deleted the remaining part to reduce space):

CLUSTAL W (1.81) multiple sequence alignment

augustus_masked-scf7180006947290      MSEIRSLNIFDANSQNSSEFRRNIPD-FLRPYECYRCVIGKKKPDDVEYICRYSLSCLGD
DILT_0000424901-mRNA-1                ---MGWVLGGDGSFLPTGCANHGDPEPSVNPENV--------------------------
maker-scaffold10x_338_pilon-snap      -----------------MHLKNGDPKPTIKPNQC--------------------------
MS3_04642.1                           -----------------MHLKRSDPKPLIDPNRL--------------------------
                                                          ... *.  : *                             

augustus_masked-scf7180006947290      CAKEKDYARYLEMKPCIFLQVNKVYGWIPDIVGENLLVKCFGKVGLIKIILNSITPEIVF
DILT_0000424901-mRNA-1                --------TLYDMQFCPYCQRVR----------------------------------YTL
maker-scaffold10x_338_pilon-snap      --------TLFGFRFCPYVDRVR----------------------------------MVL
MS3_04642.1                           --------TLIGFRFCPYVDRVR----------------------------------LIL
                                                  :. * : :  .                                    :

augustus_masked-scf7180006947290      NYFGIKYDKVLINLQDKPEWFLKMYPEGKVPFIIDK-QRQLGDSEIIIRDYDSKNNNK--
DILT_0000424901-mRNA-1                DYHKIPYDRILIDLMSKPSWYLKMYPVGKVPLLLYR-GKTMAESDVIMKYCDQMKGAKAS
maker-scaffold10x_338_pilon-snap      QYYNVPHDNVWIHLYSKPDWYLELYPVGKVPLLITKEGKTIVESDAIIRYLDETIGNK-S
MS3_04642.1                           SYYKIDYDLINVSLASKPEWFLKMYPIGKVPLLLLPNEQKLPESDEIIRHIDKLYGSE-T
                                      .*. : :* : : * .**.*:*::** ****:::    . : :*: *:.  *.  . :

The challenge I'm having is generating a new file containing the nucleotide sequences matching the sequence ID (the alignment software shortens the sequence ID to 32 characters I think) and the order newly assigned to them the alignment software (especially if I have loads of sequences to align). The nucleotide sequences should now look like (I've deleted some of the nucleotide sequences):

>augustus_masked-scf7180006947290
ATGAGTGAAATACGGAGTTTAAACATTTTCGATGCCAACAGCCAGAACTCATCAGAATTT
AGACGTAATATTCCAGATTTCCTGAGACCCTATGAGTGTTATCGCTGTGTTATCGGGAAA
AAGAAGCCGGATGATGTTGAATACATTTGCAGATATTCTCTGTCATGTTTAGGTGATTGT
GCAAAAGAAAAGGACTATGCAAGGTATCTGGAAATGAAACCCTGCATTTTTCTTCAAGTC
AATAAAGTTTATGGCTGGATTCCAGACATTGTTGGTGAAAATTTACTCGTGAAATGTTTC
GGAAAGGTCGGTTTAATTAAAATTATATTAAATAGTATAACACCTGAAATTGTATTCAAC
TACTTCGGGATCAAATATGACAAGGTTCTAATAAATCTACAGGATAAACCTGAATGGTTT
CTCAAAATGTACCCTGAAGGCAAGGTTCCATTCATCATTGATAAACAGAGACAACTTGGT
GACTCTGAGATTATCATTCGAGACTATGACTCAAAGAACAATAATAAATTGATTACTGCC
TGTGGCGAAGAAAAGTTTTCTGAAACTAAAGATCTCATCTCAAGCTTCTTTGGCCTTTGC
TATACCATTCTCTTCAAGGATAATAAAATTTCCGATGAGAATGCTGATCTCTTCTTGAAA
GCTCTCGAGAAGGTTGAAGCGAAAATTGTTGGCCCCTTCATGTGGGGAGATCAACTATCT
CTAGCCGATGTAATTCTCTTCACACATTTGAACATGTTCGAGTGCTCTTTATCGAGAATC
GAGGGAATTCATCCTGACCAAGTGAAAGATGGTTATCCCAATGCCGCAAGGGAAGCTAGC
TTCGTCAAGATTCCCGCCTATCTGAAGCAAATGAGAAATCACTCTGCAGTTAAAGATGTC
TACGTTCATCCTAATGATATTTCCAAGTACGCTGTCGGTTTAAGAATTGGAAAGCCAAAT
CCGGAAGGCGATAACTAG
>DILT_0000424901-mRNA-1
ATGGGCTGGGTATTAGGTGGCGACGGCTCCTTCTTACCCACCGG.......
>maker-scaffold10x_338_pilon-snap
ATGCATCTGAAAAACGGTGACCCAAAACCTACCA..........
>MS3_04642.1
ATGCACCTCAAACGAAGTGACCCTAAACCACTGATTGATCCTAAT.........

I need the final nucleotide sequences file and the alignment file for another analysis, but I am stuck. How can I do this? been thinking of python, but my scripting skills aren't best yet. Any advice? Apologies for the long post. Thanks kay

script python • 173 views
ADD COMMENTlink modified 4 weeks ago by shenwei3564.5k • written 4 weeks ago by Kay0
0
gravatar for shenwei356
4 weeks ago by
shenwei3564.5k
China
shenwei3564.5k wrote:

It seems that you want sort sequences in FASTA format in a custom order. Here's a quick solution without writing script.

seqkit sort can sort FASTA, but it lacks the ability of sorting in custom order, luckily, which can be provided by csvtk sort:

An example:

$ cat seqs.fa 
>a
actg
>b
ACTG
>c
cagt
>d
CAGT

Using seqkit fx2tab to convert FASTA/Q into tabular format, and seqkit tab2fx convert it back to FASTA/Q.

$ seqkit fx2tab seqs.fa 
a       actg
b       ACTG
c       cagt
d       CAGT

csvtk sort can sort in user-defined order, defined by order.txt:

$ cat order.txt 
c
b
d
a

All in one:

$ seqkit fx2tab seqs.fa  | csvtk sort -H -t -L 1:order.txt -k 1:u | seqkit tab2fx 
>c
cagt
>b
ACTG
>d
CAGT
>a
actg
ADD COMMENTlink written 4 weeks ago by shenwei3564.5k

shenwei356 : Do you need to add anything to the command to recognize fasta names that are truncated to first 32 characters? It looks like the alignment software is doing that.

ADD REPLYlink written 4 weeks ago by genomax62k

thanks for your reply, but would for few sequences, I'm talking of when I have like 50 sequences (or more).

Also, Im not sure if truncating the sequence ID truncated to 32 characters makes Identifying unique, it appears it just truncates them, meaning in case of sequences where sequences have similar ID within the 32 characters, identifying them may be an issue.

thanks

ADD REPLYlink written 4 weeks ago by Kay0

According what you mentioned in another similar post, the MSA software can save aligment result in FASTA file, where sequence ids are not truncted. I also remember some MSA tools indeed support this, like clustalo/clustalw.

So, you can just output the sequence headers

seqkit seq -n alignment.fasta > order.txt
ADD REPLYlink written 4 weeks ago by shenwei3564.5k

thanks, figured another way to do it using a script.

ADD REPLYlink written 28 days ago by Kay0

Assuming alignment file is also in fasta file, and the sequence ID is not truncated. Is it possible to sort my nucleotides then to have something like:

>augustus_masked-scf7180006947290-processed-gene-0.5-mRNA-1
ATGAGTGAAATACGGAGTTTAAACATTTTCGATGCCAACAGCCAGAACTCATCAGAATTT
AGACGTAATATTCCAGATTTCCTGAGACCCTATGAGTGTTATCGCTGTGTTATCGGGAAA
AAGAAGCCGGATGATGTTGAATACATTTGCAGATATTCTCTGTCATGTTTAGGTGATTGT
GCAAAAGAAAAGGACTATGCAAGGTATCTGGAAATGAAACCCTGCATTTTTCTTCAAGTC
AATAAAGTTTATGGCTGGATTCCAGACATTGTTGGTGAAAATTTACTCGTGAAATGTTTC
GGAAAGGTCGGTTTAATTAAAATTATATTAAATAGTATAACACCTGAAATTGTATTCAAC
TACTTCGGGATCAAATATGACAAGGTTCTAATAAATCTACAGGATAAACCTGAATGGTTT
CTCAAAATGTACCCTGAAGGCAAGGTTCCATTCATCATTGATAAACAGAGACAACTTGGT
GACTCTGAGATTATCATTCGAGACTATGACTCAAAGAACAATAATAAATTGATTACTGCC
TGTGGCGAAGAAAAGTTTTCTGAAACTAAAGATCTCATCTCAAGCTTCTTTGGCCTTTGC
TATACCATTCTCTTCAAGGATAATAAAATTTCCGATGAGAATGCTGATCTCTTCTTGAAA
GCTCTCGAGAAGGTTGAAGCGAAAATTGTTGGCCCCTTCATGTGGGGAGATCAACTATCT
CTAGCCGATGTAATTCTCTTCACACATTTGAACATGTTCGAGTGCTCTTTATCGAGAATC
GAGGGAATTCATCCTGACCAAGTGAAAGATGGTTATCCCAATGCCGCAAGGGAAGCTAGC
TTCGTCAAGATTCCCGCCTATCTGAAGCAAATGAGAAATCACTCTGCAGTTAAAGATGTC
TACGTTCATCCTAATGATATTTCCAAGTACGCTGTCGGTTTAAGAATTGGAAAGCCAAAT
CCGGAAGGCGATAACTAG
>DILT_0000424901-mRNA-1
ATGGGCTGGGTATTAGGTGGCGACGGCTCCTTCTTACCCACCGG.......
>maker-scaffold10x_338_pilon-snap-gene-0.71-mRNA-1
ATGCATCTGAAAAACGGTGACCCAAAACCTACCA..........
>MS3_04642.1
ATGCACCTCAAACGAAGTGACCCTAAACCACTGATTGATCCTAAT.........
ADD REPLYlink written 4 weeks ago by Kay0

using this approach I will have to extract the new sequence ID from the alignment file and write it to a new file:

augustus_masked-scf7180006947290      
DILT_0000424901-mRNA-1                
maker-scaffold10x_338_pilon-snap
MS3_04642.1

and then use csvtk sort, what if I have many sequences?

ADD REPLYlink written 4 weeks ago by Kay0

The desired output looks like it is just sorted sorted by sequence name. Shouldn't be seqkit sort enough to do the job? Truncating the names could than be done by a little awk script.

fin swimmer

ADD REPLYlink written 4 weeks ago by finswimmer10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1401 users visited in the last hour