Rename fasta file based on coordinates
1
0
Entering edit mode
2.2 years ago
Mason • 0

Hi,

I have a multifasta file with one sequence per line that is sorted by scaffold then coordinate within the scaffold. Example below:

>scaffold1:1851005..1851521_LTR#LTR/unknown
TGTGACAGCGCCATCGT
>scaffold1:2063846..2064928_LTR#LTR/Gypsy
TGATAAGACAAGCTCTACGCTTG
>scaffold1:2064929..2073244_INT#LTR/Gypsy
TAAACTTTGGAGTGCGTTCAGACA
>scaffold1:2088220..2090221_LTR#LTR/Gypsy
TGTCATGAAATATTCTAATCCACAT
>scaffold2:1003216..1004021_LTR#LTR/Gypsy
TGTTACGGTGTTTTTATTGAGG
>scaffold2:1022539..1026480_INT#LTR/unknown
GAGGTAACTCTTTTGAAAGAAAAGATTACTAAAC

I am looking to rename the sequences so they each have their own unique identifer, except for those that are flanking one another on the same scaffold. Flanking sequences will have the same identifier but will be distinguished by _LTR or _INT. Given the code above, it would look like this.

>ID1_LTR#LTR/unknown
TGTGACAGCGCCATCGT
>ID2_LTR#LTR/Gypsy
TGATAAGACAAGCTCTACGCTTG
>ID2_INT#LTR/Gypsy
TAAACTTTGGAGTGCGTTCAGACA
>ID3_LTR#LTR/Gypsy
TGTCATGAAATATTCTAATCCACAT
>ID4_LTR#LTR/Gypsy
TGTTACGGTGTTTTTATTGAGG
>ID5_INT#LTR/unknown
GAGGTAACTCTTTTGAAAGAAAAGATTACTAAAC

Does anyone have a solution to this problem? I have been trying to write my own shell and perl script to little success.

M.

rename fasta sort • 828 views
ADD COMMENT
0
Entering edit mode

Are they in order? Or does one have to do coordinate math to confirm/detect flanking? (It looks like the latter, which makes it a potentially much harder problem).

ADD REPLY
0
Entering edit mode

All of them are in order but not all are flanking. I have made some progress using multiple bash commands and intermediate files but not a full blown solution. Posting my jank code in case it helps another.

I first grabbed the sorted fasta headers with and filtered it into a tab delimited file.

grep '>' scaffolds.fa | sed 's/:/\t/g' | sed  's/\../\t/g'  | sed  's/_/\t/g' | cut -d '#' -f 1 > sorted_head

Then I found all the flanking terms and their corresponding names in the original fasta using the following:

TE=1
{ read scaffold start stop type; while read scaffold1 start1 stop1 type1; do 
if [[ $((start1-stop)) == 1 ]]
    then 
    echo -e "${scaffold}:${start}..${stop}_\tTE${TE}_${type}"
    echo -e "${scaffold}:${start1}..${stop1}\tTE${TE}_${type1}"
    TE=$((TE+1))
    fi
scaffold=$scaffold1
start=$start1
stop=$stop1
type=$type1
done } < sorted_head > TE_flank_table

I have then replaced the names of the fasta file with the TE_flank_table using awk after splitting using cut. I still need to rename the rest of the elements but it is not an issue for my downstream process.

M.

ADD REPLY
0
Entering edit mode
$ awk -v OFS="\t" '/>/ {getline seq; print $0,seq}'  test.fa | nl | sed -re 's/^\s+/ID/'|  awk -F '[:|..|_|\t]' '{ ($3-1==prev)?$1=q:$1=$1; prev=$5; q=$1; print ">"$1"_"$6"\n"$7}'

>ID1_LTR#LTR/unknown
TGTGACAGCGCCATCGT
>ID2_LTR#LTR/Gypsy
TGATAAGACAAGCTCTACGCTTG
>ID2_INT#LTR/Gypsy
TAAACTTTGGAGTGCGTTCAGACA
>ID4_LTR#LTR/Gypsy
TGTCATGAAATATTCTAATCCACAT
>ID5_LTR#LTR/Gypsy
TGTTACGGTGTTTTTATTGAGG
>ID6_INT#LTR/unknown
GAGGTAACTCTTTTGAAAGAAAAGATTACTAAAC

Flanking here is defined as difference of 1 between previous entry end and current entry start. You can define this length.

ADD REPLY
0
Entering edit mode
2.2 years ago
jkim ▴ 170

Python3 script

import os


def parse_fasta(fa):
    name, seq = None, []
    for line in fa:
        line = line.rstrip()
        if line.startswith(">"):
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name:
        yield (name, ''.join(seq))


def main(fa):
    assert os.path.isfile(fa), f"{fa} is not a file!"
    with open(fa)as fin:
        for id, seq in parse_fasta(fin):
            new_id = id.split("_")[1] ## you can manipulate the identifiers.
            print(new_id)
            print(seq)


if __name__ == '__main__':
    fa = "tmp.fa"
    main(fa)

output

/usr/local/bin/python3 /Users/jkim/tmp/test.py
LTR#LTR/unknown
TGTGACAGCGCCATCGT
LTR#LTR/Gypsy
TGATAAGACAAGCTCTACGCTTG
INT#LTR/Gypsy
TAAACTTTGGAGTGCGTTCAGACA
LTR#LTR/Gypsy
TGTCATGAAATATTCTAATCCACAT
LTR#LTR/Gypsy
TGTTACGGTGTTTTTATTGAGG
INT#LTR/unknown
GAGGTAACTCTTTTGAAAGAAAAGATTACTAAAC

Hope it helps.

ADD COMMENT

Login before adding your answer.

Traffic: 2540 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6