How to split transposon sub-families from a FASTA file into separate files?
2.8 years ago
ANAM • 0

I'm working on the classification of transposable elements and want to write the sequences of each sub-class to a separate file. Is there any code or tool that can separate the sub-families? The dataset contains thousands of sequence entries from different species.

I really appreciate any help or suggestion!

DATASET SOURCE: https://pgsb.helmholtz-muenchen.de/plant/recat/index.jsp

For example:

I want to put the RLC sequences in their own file, and likewise for the other entries such as RLX and TXX.

>RLC_163294|LTR_Gr_chr_04_982|LTR/Copia|02.01.01.05|29730|Gossypium
tgttagagtagttagtaaagttgttagtagttaaaactgttgtacgttcagttaacagttgagctgttaaatagttgacctgttagttatgcattcatttgagtataaaactatgagaagtctgtacttaaagatatgagttttataatgaagaaattctaagtctttgtttttaagctgcttgtttagcttaacatggtatcag
>RLX_163369|LTR_Gr_chr_10_2326|LTR|02.01.01|29730|Gossypium
tgtcacgggcaaaagtgcaaagcccgtgaccatggcataagatgtgccccatggaggtctatcgattagacaaggaacatttagcccacgagaacttgcccgattcaaaaaactgttggagaagcctgtcagattgaagcctggttggcccgataatgaagacgtggcaacttaggccaattttggt
>TXX_174935|TXX_Gr_DX404975.1_8351|MobileElement|02|29730|Gossypium
atccgtgcccatgccatgtcccagacatggtcttatgggggactctcatctcggtgccaacgccatatcccagacatggtcttacatgggacctctcataatctcaattatgccaatgccatgtcccagacatggtcttacatgggatctctttacccaaatatcatgacatttgtatccattacattcccaatgtttcaacggggcttttatcactgattctctgtcatctcatacttgagttaacattagatattttcatgaaataaatacataattgctggaaaatagcagcattaa
fasta transposons • 1.0k views

With awk (not tested on a large dataset, so please back up your data first; this assumes each sequence is on a single line):

$ awk -F '[>_]' '/^>/{getline seq; print $0"\n"seq > ($2".fa")}' test.fa

With seqkit (works for multi-line sequences and writes each sequence on a single line):

$ seqkit -w 0 split -i --id-regexp '(^[A-Z]{3})_*' test.fa -O new_files

The output files are written to a new directory, "new_files", and are named test.id_RLC.fa, test.id_RLX.fa and test.id_TXX.fa. To strip the test.id_ prefix from the file names, run rename -n 's/test\.id_//' *.fa inside the new_files folder. The -n option only previews the changes, so drop it to actually rename the files.

Please note that the regex is deliberately strict, to be on the safe side. If your sub-family IDs start with more than 3 letters, please change the regex accordingly.
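
If the sequences are wrapped over several lines, a variant of the awk approach that remembers the current output file may help. This is only a minimal sketch and has not been run on the full dataset: it assumes every header starts with the sub-family code followed by an underscore (as in the example in the question), and test.fa and the split_out directory are placeholder names. The demo input is made up for illustration.

```shell
# Tiny made-up input for illustration; use your own FASTA file instead
printf '>RLC_1|a\nacgt\nacgt\n>RLX_2|b\nttga\n>RLC_3|c\nggcc\n' > test.fa

# Output directory (placeholder name); it should be empty, because the
# awk script below appends to existing files
mkdir -p split_out

awk -v dir="split_out" '
  /^>/ {
    code = substr($0, 2)          # drop the leading ">"
    sub(/_.*/, "", code)          # keep the text before the first "_"
    new = dir "/" code ".fa"
    if (new != out) {             # sub-family changed: switch output file
      if (out != "") close(out)   # keep the number of open files low
      out = new
    }
  }
  out != "" { print >> out }      # header and sequence lines go to the current file
' test.fa
```

Because each line is appended to the file of the most recent header, multi-line records are kept as they are. Headers without an underscore would end up in a file named after the whole header, so it is worth checking the file names afterwards.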

2.8 years ago
GenoMax 142k

You can linearize the fasta file (code by @Pierre), search for the pattern you want and then reformat back to fasta.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  your.fa | grep "^>RLC" | tr "\t" "\n" > RLC.fa
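
To get one file per sub-family code without typing each pattern by hand, the same linearize-and-filter idea can be wrapped in a loop. This is a rough sketch, assuming the code is the text between ">" and the first underscore of each header; your.fa is the placeholder name used above, your.tab is a hypothetical intermediate file, and the tiny input is made up for illustration.

```shell
# Made-up demo input; replace with your own file
printf '>RLC_1|a\nacgt\nacgt\n>RLX_2|b\nttga\n>TXX_3|c\nggcc\n' > your.fa

# Linearize once: header and sequence on one tab-separated line
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' your.fa > your.tab

# Collect the distinct codes, then filter the linearized file for each one
for code in $(cut -c2- your.tab | cut -d_ -f1 | sort -u); do
  grep "^>${code}_" your.tab | tr "\t" "\n" > "${code}.fa"
done
```

Running `cut -c2- your.tab | cut -d_ -f1 | sort | uniq -c` on its own first shows which codes are present and how many sequences each has, which is a quick sanity check before writing any files.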

It works for me. Thank you !
