Parse results of predicted miRNAs
2
0
Entering edit mode
6.7 years ago
s_bio ▴ 10

Hi BioStar Community,

I have the following text file where I have predicted miRNAs for numerous sequences. I would appreciate if someone can help me out with these questions.

1) miRNA that targets maximum number of sequences (most common miRNA).
2) miRNA that targets minimum number of sequences
3) Generate a list of unique miRNAs and prints the accession numbers which it targets.

Thanks :-)

>NC_002.1
hsa-miR-3
hsa-miR-6
hsa-miR-32
hsa-miR-4
hsa-miR-46
hsa-miR-43
hsa-miR-18
hsa-miR-13
hsa-miR-44
hsa-miR-445
hsa-miR-467
hsa-miR-454
hsa-miR-698
hsa-miR-421
>NC_003
hsa-miR-4
hsa-miR-46
hsa-miR-43
hsa-miR-18
>NC_04
hsa-miR-86
hsa-miR-94
hsa-miR-31
hsa-miR-328
>NC_06
hsa-miR-467
hsa-miR-454
hsa-miR-698
hsa-miR-421
>NC_008
hsa-miR-6
hsa-miR-32
hsa-miR-4
hsa-miR-46
hsa-miR-43
hsa-miR-18
>NC_009
hsa-miR-43
hsa-miR-18
hsa-miR-13
hsa-miR-44
>NC_11
hsa-miR-445
hsa-miR-467
hsa-miR-454
hsa-miR-698
hsa-miR-421
>NC_012
hsa-miR-467
hsa-miR-454
>NC_023
hsa-miR-6
hsa-miR-32
hsa-miR-4
hsa-miR-46
hsa-miR-43
hsa-miR-18
hsa-miR-13
hsa-miR-44
hsa-miR-86
hsa-miR-94
hsa-miR-31
hsa-miR-328
hsa-miR-445
hsa-miR-467
>NC_045
hsa-miR-31
hsa-miR-328
hsa-miR-445
hsa-miR-467
>NC_00455
hsa-miR-6
hsa-miR-32
hsa-miR-4
hsa-miR-46
hsa-miR-43
>NC_0875
hsa-miR-86
hsa-miR-94
hsa-miR-31
hsa-miR-328
genome sequence • 2.3k views
ADD COMMENT
1
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

This is not the first time that I have to tell you this. Please put some effort when formatting posts.
Interesting guidelines for posting can be found in the following posts:

I also modified your title to make it more specific. "Processing txt file" doesn't mean anything at all.

When you want help on a Q&A forum it's good practice to show what you tried and what didn't work. We will be more eager to point out your mistakes or put you back on the right track. We don't like open questions such as "please write a script for me".

ADD REPLY
0
Entering edit mode

Hi Carlo and WouterDeCoster,

Thanks for your suggestions. @WouterDeCoster, you are right that we need to put in efforts to write code and thats how a person learns... however, as a beginner, its difficult to write ever small chunks of code. As a faculty in Biology, I have some ideas in my mind but completely lacks computational expertise. Hence, I think that this platform should also promote project based collaborations which might lead to publication.

ADD REPLY
0
Entering edit mode

you are right that we need to put in efforts to write code and thats how a person learns... however, as a beginner, its difficult to write ever small chunks of code. As a faculty in Biology, I have some ideas in my mind but completely lacks computational expertise.

This one isn't too hard, and the only way to learn coding is by trying.

Hence, I think that this platform should also promote project based collaborations which might lead to publication.

I'm not sure how that would work.

ADD REPLY
1
Entering edit mode

@WouterDeCoster, Thanks for the insights. However, I think that technology has not only transformed biology (Big Data) but has omitted physical boundaries for collaboration. For example https://www.nature.com/articles/ncomms12846 this research article performs data analysis using crowd where co-authors are across the globe and hardly know each other.

ADD REPLY
0
Entering edit mode

There is an issue with the formatting of your text file. Please use "code formatting" (the 101010 box) to improve its readability.

ADD REPLY
2
Entering edit mode
6.7 years ago

1 and 2) Remove the targets name (starting by '>') then count the number of each miRNA. As you can see, hsa-miR-3 targets only one sequence while hsa-miR-43 and hsa-miR-467 targets both 6 sequences.

grep -v '^>' test.txt | sort | uniq -c | sort -k1 -n
   1 hsa-miR-3
   3 hsa-miR-13
   3 hsa-miR-421
   3 hsa-miR-44
   3 hsa-miR-698
   3 hsa-miR-86
   3 hsa-miR-94
   4 hsa-miR-31
   4 hsa-miR-32
   4 hsa-miR-328
   4 hsa-miR-445
   4 hsa-miR-454
   4 hsa-miR-6
   5 hsa-miR-18
   5 hsa-miR-4
   5 hsa-miR-46
   6 hsa-miR-43
   6 hsa-miR-467

3) is a bit more complicated. I would first save all unique miRNA IDs in a temporary file, then linearize your pseudo fasta file, then use the unique identifiers to retrieve the sequence name using grep.

# save unique miRNA IDs
grep -v '^>' test.txt | sort | uniq > temp
#linearize fasta file
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s\t",$0);} END {printf("\n");}' test.txt > temp2
# iterate through unique IDs and find headers
while read i; do echo "$i";grep -E "$i\t" temp2|cut -f1 ; done < temp

 hsa-miR-13
>NC_002.1
>NC_009
>NC_023
hsa-miR-18
>NC_002.1
>NC_003
>NC_008
>NC_009
>NC_023
hsa-miR-3
>NC_002.1
hsa-miR-31
>NC_04
>NC_023
>NC_045
>NC_0875
hsa-miR-32
>NC_002.1
>NC_008
>NC_023
>NC_00455
hsa-miR-328
>NC_04
>NC_023
>NC_045
>NC_0875
hsa-miR-4
>NC_002.1
>NC_003
>NC_008
>NC_023
>NC_00455
hsa-miR-421
>NC_002.1
>NC_06
>NC_11
hsa-miR-43
>NC_002.1
>NC_003
>NC_008
>NC_009
>NC_023
>NC_00455
hsa-miR-44
>NC_002.1
>NC_009
>NC_023
hsa-miR-445
>NC_002.1
>NC_11
>NC_023
>NC_045
hsa-miR-454
>NC_002.1
>NC_06
>NC_11
>NC_012
hsa-miR-46
>NC_002.1
>NC_003
>NC_008
>NC_023
>NC_00455
hsa-miR-467
>NC_002.1
>NC_06
>NC_11
>NC_012
>NC_023
>NC_045
hsa-miR-6
>NC_002.1
>NC_008
>NC_023
>NC_00455
hsa-miR-698
>NC_002.1
>NC_06
>NC_11
hsa-miR-86
>NC_04
>NC_023
>NC_0875
hsa-miR-94
>NC_04
>NC_023
>NC_0875
ADD COMMENT
1
Entering edit mode

below code is modification of Carlo Yague's code above.

as @carlo said above, store the unique miRNAs in one file (miRNA.txt contains list pasted by OP as it is):

grep -v ">" miRNA.txt | sort | uniq | sort -V > temp2.txt

This would store the miRNAs in sorted order and stores in temp2.txt. Transform the original miRNA.txt into two column table and store in a file:

seqkit fx2tab miRNA.txt | awk '{gsub("hsa",",hsa"); sub("\t,","\t");print}' > temp.txt

With few modifications with Carlo's code, store the output in tab format:

cat temp2.txt |while read i; do echo ">$i" ; grep  -w $i temp.txt; done  | cut -f1 | seqkit fx2tab | awk '{gsub("NC",", NC"); sub(", ","\t"); print}'

output would be:

hsa-miR-3       NC_002.1    
hsa-miR-4       NC_002.1, NC_003, NC_008, NC_023, NC_00455  
hsa-miR-6       NC_002.1, NC_008, NC_023, NC_00455  
hsa-miR-13      NC_002.1, NC_009, NC_023    
hsa-miR-18      NC_002.1, NC_003, NC_008, NC_009, NC_023    
hsa-miR-31      NC_04, NC_023, NC_045, NC_0875  
hsa-miR-32      NC_002.1, NC_008, NC_023, NC_00455  
hsa-miR-43      NC_002.1, NC_003, NC_008, NC_009, NC_023, NC_00455  
hsa-miR-44      NC_002.1, NC_009, NC_023    
hsa-miR-46      NC_002.1, NC_003, NC_008, NC_023, NC_00455  
hsa-miR-86      NC_04, NC_023, NC_0875  
hsa-miR-94      NC_04, NC_023, NC_0875  
hsa-miR-328     NC_04, NC_023, NC_045, NC_0875  
hsa-miR-421     NC_002.1, NC_06, NC_11  
hsa-miR-445     NC_002.1, NC_11, NC_023, NC_045 
hsa-miR-454     NC_002.1, NC_06, NC_11, NC_012  
hsa-miR-467     NC_002.1, NC_06, NC_11, NC_012, NC_023, NC_045  
hsa-miR-698     NC_002.1, NC_06, NC_11

This in turn would answer first and two questions:

cat temp2.txt |while read i; do echo ">$i" ; grep  -w $i temp.txt; done  | cut -f1 | seqkit fx2tab | awk '{gsub("NC",", NC"); sub(", ","\t"); print}'  | awk  '{print $1,"\t", NF-1}' | sort -k2

output:

hsa-miR-3    1
hsa-miR-13   3
hsa-miR-421      3
hsa-miR-44   3
hsa-miR-698      3
hsa-miR-86   3
hsa-miR-94   3
hsa-miR-31   4
hsa-miR-32   4
hsa-miR-328      4
hsa-miR-445      4
hsa-miR-454      4
hsa-miR-6    4
hsa-miR-18   5
hsa-miR-4    5
hsa-miR-46   5
hsa-miR-43   6
hsa-miR-467      6

Seqkit can be downloaded from here

ADD REPLY
1
Entering edit mode

This certainly improves the output formatting. But wy do you use uniq -d ? As a consequence, you ignore all miRNAs with less than two targets (such as hsa-miR-3 in this case). I would rather use uniq without the -d.

ADD REPLY
0
Entering edit mode

that is true. Code is updated. Thanks

ADD REPLY
0
Entering edit mode

@cpad0112, Thanks for the codes.

ADD REPLY
0
Entering edit mode

@Carlo, Thanks for the code. However, in case of huge file, if one particular miRNA is targeting 100-200 sequences then it will be difficult to fish out these sequences. Hence, it would be great if we can get the list of sequences also (Accession number) that miRNA is targeting.

ADD REPLY
0
Entering edit mode

@Carlo, Thanks a lot for answering all queries :-)

ADD REPLY
0
Entering edit mode
6.7 years ago
Buffo ★ 2.4k

Something like this? micro_rna.org or this? mirtarbase or this? mirdb.. or mirwalk, etc, etc, You can just download .xls files and sort as you wish..

ADD COMMENT

Login before adding your answer.

Traffic: 3298 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6