Removing ">" before coordinates and splitting single-string coordinates into columns for a BED file
3.3 years ago
Kai_Qi ▴ 130

I have a FASTA file whose header lines hold genomic coordinates, each followed by the sequence for that region:

head my.fasta
>16:23107820-23108019(+)
GTACGGCGCTCCCGGGGCGGCCGGTGGCCTGTAGTCAAGGTCACTAGGACCCGCGTTGAGGTGGGTTGCTTGGCGGCCACACTGCAGGTATGCGGGCTTTTTCTTAGGGCACACACTTCTCCTTGTGCCCTTCGAGAAGCTTCCATGATGGTAAGACTCCAGATGTTGGGGAGACAGGACGGATACAAGAACGGAGTAT
>14:54909471-54909670(-)
GTAAGTGGCACCCTGCCAGAGATCCCTCTCTGCCCTGGGTCTCATGCCTTCCTTTCTGCACCTCCAGACAATTTCTGCTGCCCCTAGGTCCCAGATTTCAGCTGTCCAGATGTCCAGGCCTTTTAAAGGGTCTAGGCAGGGGGTCCTACTGCTCACACAGTCCTCCCACTGGCTGTTATGTTTAAAATCCTAACCTGGC
>7:127020805-127021004(-)
GTAGGTGTGGACGACAGACAGCTGGGTGGCATGAGAATGCAGGTGCCAGGCGAACTAGAGGGTGGTGCTGGGTGCGTCGTACCATCGGGAGAAGATCCCCTCCCCCTCAGCCTCTGCTGAAAGCAACAAGGGAACCCCTAAAAGAAGGGCTAAGAAGGTATGCACAAGATACTGGGTCTTCCCCAAGAATGGGGCTGGA
>X:20848619-20848818(+)
GTGAGGGCAGGCCCGGTAGGGTTCGGGTTTTGGAGCGGCTGCGGGACCCGGGTATGAAGTCCAGACCGAAAGCTCAGCTCCAAGATGCTTCCGTCTGAATCTCAGCGTTCTCCCGCCCGGAACCAAAGGAGTGGTTTGACCAGGGCGAGACCGTCGTCATCGACCGTGGGAGTGGATGGAGGAGTCGGCCTGCAGGCTG
>1:75547398-75547597(+)
GTGGGTAGCCTGGGGACCCCTAGCACCCCAGCCTTCACCACCATCACCTTCATCGCCACCATTACTGCGCTCACCTCCGGCTTGATCACTCAGTGTCATCCTGTGCTGGACGCTGTGCTGGGCCACCATGCCATGTTAAGTCATCCTGCCTCTCATACCATCATCACCTTGTTCACCTGTCAGGGGAGATGTAGGGGAG

I extracted the header lines with grep "^>" my.fasta > mycoord.csv (and likewise with grep "^>" my.fasta > mycoord.bed). The resulting file looks like this:

$ head mycoord.csv (or head mycoord.bed)

>16:23107820-23108019(+)
>14:54909471-54909670(-)
>7:127020805-127021004(-)
>X:20848619-20848818(+)
>1:75547398-75547597(+)
>11:102777648-102777847(+)
>7:25314905-25315104(+)
>2:180025312-180025511(+)
>7:30533903-30534102(-)
>X:8128769-8128968(-)

My question has two parts. First, how do I remove the ">" in front of each coordinate? Second, how can I split each coordinate string into separate columns, so that I can look up the gene name from the coordinates and strand information? (I didn't know how to phrase the ">" part, so my searches for how to remove ">" from coordinates turned up almost nothing.)

Thanks,

rna-seq gene next-gen sequence • 721 views

I found the answer to the first part: running sed 's/>//' mycoord.csv > mycoord_1.csv removes the ">".

$ head mycoord_1.csv
16:23107820-23108019(+)
14:54909471-54909670(-)
7:127020805-127021004(-)
X:20848619-20848818(+)
1:75547398-75547597(+)
11:102777648-102777847(+)
7:25314905-25315104(+)
2:180025312-180025511(+)
7:30533903-30534102(-)
X:8128769-8128968(-)
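
The grep and sed steps can probably be combined into a single sed call that reads the FASTA file directly. The sketch below assumes every header starts with ">" in the first column. The file names demo.fasta and demo_coord.txt are made up for illustration, and the sequences are shortened.

```shell
# Hypothetical demo input: two headers from the question, sequences shortened.
printf '%s\n' \
  '>16:23107820-23108019(+)' 'GTACGGCGCT' \
  '>14:54909471-54909670(-)' 'GTAAGTGGCA' > demo.fasta

# -n suppresses sed's default printing; the trailing p prints only the lines
# where the substitution succeeded, i.e. header lines with the ">" removed.
sed -n 's/^>//p' demo.fasta > demo_coord.txt

cat demo_coord.txt
```

Anchoring the pattern with ^ means only a leading ">" is removed, so a ">" elsewhere in a header would be left alone.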
3.3 years ago

There are several ways to do this. With grep:

$ grep -oP '(?<=\>).*' test.txt

16:23107820-23108019(+)
14:54909471-54909670(-)
7:127020805-127021004(-)
X:20848619-20848818(+)
1:75547398-75547597(+)

with seqkit (https://github.com/shenwei356/seqkit):

$ seqkit seq -n test.txt

16:23107820-23108019(+)
14:54909471-54909670(-)
7:127020805-127021004(-)
X:20848619-20848818(+)
1:75547398-75547597(+)

To convert to space-separated coordinates:

$ seqkit replace -ip '(\w+):(\w*)-(\w*).([+,-]).' -r '$1 $2 $3 $4' test.txt | seqkit seq -n

16 23107820 23108019 +
14 54909471 54909670 -
7 127020805 127021004 -
X 20848619 20848818 +
1 75547398 75547597 +

To convert to tab-separated coordinates:

$ sed -n '/^>/ s/^>\(\w\+\)\W\(\w\+\)\W\(\w\+\).\([+,-]\)./\1\t\2\t\3\t\4/p' test.txt

16      23107820        23108019        +
14      54909471        54909670        -
7       127020805       127021004       -
X       20848619        20848818        +
1       75547398        75547597        +

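The output is 1-based and not yet in BED format. To convert it to BED, subtract 1 from the start (BED starts are 0-based; the end stays as it is), then intersect the result with the appropriate GTF to get the gene names.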
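
One possible way to do the BED conversion in a single awk pass is sketched below. It assumes the headers are 1-based and inclusive, as stated above, and that each header has exactly the form >chrom:start-end(strand). The file names demo.fasta and demo.bed, the use of the original header as the name column, and the score of 0 are illustrative choices, not part of the original commands.

```shell
# Hypothetical demo input: two headers from the question, sequences shortened.
printf '%s\n' \
  '>16:23107820-23108019(+)' 'GTACGGCGCT' \
  '>14:54909471-54909670(-)' 'GTAAGTGGCA' > demo.fasta

awk 'BEGIN { OFS = "\t" }
/^>/ {
    name   = substr($0, 2)                       # header without the leading ">"
    strand = substr(name, length(name) - 1, 1)   # "+" or "-" inside the parentheses
    region = substr(name, 1, length(name) - 3)   # drop the trailing "(+)" or "(-)"
    split(region, a, /[:-]/)                     # a[1]=chrom, a[2]=start, a[3]=end
    # BED6: chrom, 0-based start, end, name, score, strand
    print a[1], a[2] - 1, a[3], name, 0, strand
}' demo.fasta > demo.bed

cat demo.bed
```

The strand is read by position rather than by splitting on "-", because a minus strand would otherwise be mistaken for a field separator. The resulting file could then be intersected with an annotation, for example with bedtools intersect -s -wa -wb -a demo.bed -b annotation.gtf, where annotation.gtf is a placeholder name. The chromosome names have to match the annotation (for example "16" versus "chr16").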


The solutions worked. Thanks a lot.
