Question: Parse results of predicted miRNAs
0
gravatar for s_bio
20 months ago by
s_bio10
s_bio10 wrote:

Hi BioStar Community,

I have the following text file where I have predicted miRNAs for numerous sequences. I would appreciate if someone can help me out with these questions.

1) miRNA that targets maximum number of sequences (most common miRNA).
2) miRNA that targets minimum number of sequences
3) Generate a list of unique miRNAs and prints the accession numbers which it targets.

Thanks :-)

>NC_002.1
hsa-miR-3
hsa-miR-6
hsa-miR-32
hsa-miR-4
hsa-miR-46
hsa-miR-43
hsa-miR-18
hsa-miR-13
hsa-miR-44
hsa-miR-445
hsa-miR-467
hsa-miR-454
hsa-miR-698
hsa-miR-421
>NC_003
hsa-miR-4
hsa-miR-46
hsa-miR-43
hsa-miR-18
>NC_04
hsa-miR-86
hsa-miR-94
hsa-miR-31
hsa-miR-328
>NC_06
hsa-miR-467
hsa-miR-454
hsa-miR-698
hsa-miR-421
>NC_008
hsa-miR-6
hsa-miR-32
hsa-miR-4
hsa-miR-46
hsa-miR-43
hsa-miR-18
>NC_009
hsa-miR-43
hsa-miR-18
hsa-miR-13
hsa-miR-44
>NC_11
hsa-miR-445
hsa-miR-467
hsa-miR-454
hsa-miR-698
hsa-miR-421
>NC_012
hsa-miR-467
hsa-miR-454
>NC_023
hsa-miR-6
hsa-miR-32
hsa-miR-4
hsa-miR-46
hsa-miR-43
hsa-miR-18
hsa-miR-13
hsa-miR-44
hsa-miR-86
hsa-miR-94
hsa-miR-31
hsa-miR-328
hsa-miR-445
hsa-miR-467
>NC_045
hsa-miR-31
hsa-miR-328
hsa-miR-445
hsa-miR-467
>NC_00455
hsa-miR-6
hsa-miR-32
hsa-miR-4
hsa-miR-46
hsa-miR-43
>NC_0875
hsa-miR-86
hsa-miR-94
hsa-miR-31
hsa-miR-328
sequence genome • 844 views
ADD COMMENTlink modified 20 months ago by Buffo1.5k • written 20 months ago by s_bio10
1

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

This is not the first time that I have to tell you this. Please put some effort when formatting posts.
Interesting guidelines for posting can be found in the following posts:

I also modified your title to make it more specific. "Processing txt file" doesn't mean anything at all.

When you want help on a Q&A forum it's good practice to show what you tried and what didn't work. We will be more eager to point out your mistakes or put you back on the right track. We don't like open questions such as "please write a script for me".

ADD REPLYlink modified 20 months ago • written 20 months ago by WouterDeCoster38k

Hi Carlo and WouterDeCoster,

Thanks for your suggestions. @WouterDeCoster, you are right that we need to put in efforts to write code and thats how a person learns... however, as a beginner, its difficult to write ever small chunks of code. As a faculty in Biology, I have some ideas in my mind but completely lacks computational expertise. Hence, I think that this platform should also promote project based collaborations which might lead to publication.

ADD REPLYlink written 20 months ago by s_bio10

you are right that we need to put in efforts to write code and thats how a person learns... however, as a beginner, its difficult to write ever small chunks of code. As a faculty in Biology, I have some ideas in my mind but completely lacks computational expertise.

This one isn't too hard, and the only way to learn coding is by trying.

Hence, I think that this platform should also promote project based collaborations which might lead to publication.

I'm not sure how that would work.

ADD REPLYlink written 20 months ago by WouterDeCoster38k
1

@WouterDeCoster, Thanks for the insights. However, I think that technology has not only transformed biology (Big Data) but has omitted physical boundaries for collaboration. For example https://www.nature.com/articles/ncomms12846 this research article performs data analysis using crowd where co-authors are across the globe and hardly know each other.

ADD REPLYlink written 20 months ago by s_bio10

There is an issue with the formatting of your text file. Please use "code formatting" (the 101010 box) to improve its readability.

ADD REPLYlink written 20 months ago by Carlo Yague4.4k
2
gravatar for Carlo Yague
20 months ago by
Carlo Yague4.4k
Belgium
Carlo Yague4.4k wrote:

1 and 2) Remove the targets name (starting by '>') then count the number of each miRNA. As you can see, hsa-miR-3 targets only one sequence while hsa-miR-43 and hsa-miR-467 targets both 6 sequences.

grep -v '^>' test.txt | sort | uniq -c | sort -k1 -n
   1 hsa-miR-3
   3 hsa-miR-13
   3 hsa-miR-421
   3 hsa-miR-44
   3 hsa-miR-698
   3 hsa-miR-86
   3 hsa-miR-94
   4 hsa-miR-31
   4 hsa-miR-32
   4 hsa-miR-328
   4 hsa-miR-445
   4 hsa-miR-454
   4 hsa-miR-6
   5 hsa-miR-18
   5 hsa-miR-4
   5 hsa-miR-46
   6 hsa-miR-43
   6 hsa-miR-467

3) is a bit more complicated. I would first save all unique miRNA IDs in a temporary file, then linearize your pseudo fasta file, then use the unique identifiers to retrieve the sequence name using grep.

# save unique miRNA IDs
grep -v '^>' test.txt | sort | uniq > temp
#linearize fasta file
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s\t",$0);} END {printf("\n");}' test.txt > temp2
# iterate through unique IDs and find headers
while read i; do echo "$i";grep -E "$i\t" temp2|cut -f1 ; done < temp

 hsa-miR-13
>NC_002.1
>NC_009
>NC_023
hsa-miR-18
>NC_002.1
>NC_003
>NC_008
>NC_009
>NC_023
hsa-miR-3
>NC_002.1
hsa-miR-31
>NC_04
>NC_023
>NC_045
>NC_0875
hsa-miR-32
>NC_002.1
>NC_008
>NC_023
>NC_00455
hsa-miR-328
>NC_04
>NC_023
>NC_045
>NC_0875
hsa-miR-4
>NC_002.1
>NC_003
>NC_008
>NC_023
>NC_00455
hsa-miR-421
>NC_002.1
>NC_06
>NC_11
hsa-miR-43
>NC_002.1
>NC_003
>NC_008
>NC_009
>NC_023
>NC_00455
hsa-miR-44
>NC_002.1
>NC_009
>NC_023
hsa-miR-445
>NC_002.1
>NC_11
>NC_023
>NC_045
hsa-miR-454
>NC_002.1
>NC_06
>NC_11
>NC_012
hsa-miR-46
>NC_002.1
>NC_003
>NC_008
>NC_023
>NC_00455
hsa-miR-467
>NC_002.1
>NC_06
>NC_11
>NC_012
>NC_023
>NC_045
hsa-miR-6
>NC_002.1
>NC_008
>NC_023
>NC_00455
hsa-miR-698
>NC_002.1
>NC_06
>NC_11
hsa-miR-86
>NC_04
>NC_023
>NC_0875
hsa-miR-94
>NC_04
>NC_023
>NC_0875
ADD COMMENTlink modified 20 months ago • written 20 months ago by Carlo Yague4.4k
1

below code is modification of Carlo Yague's code above.

as @carlo said above, store the unique miRNAs in one file (miRNA.txt contains list pasted by OP as it is):

grep -v ">" miRNA.txt | sort | uniq | sort -V > temp2.txt

This would store the miRNAs in sorted order and stores in temp2.txt. Transform the original miRNA.txt into two column table and store in a file:

seqkit fx2tab miRNA.txt | awk '{gsub("hsa",",hsa"); sub("\t,","\t");print}' > temp.txt

With few modifications with Carlo's code, store the output in tab format:

cat temp2.txt |while read i; do echo ">$i" ; grep  -w $i temp.txt; done  | cut -f1 | seqkit fx2tab | awk '{gsub("NC",", NC"); sub(", ","\t"); print}'

output would be:

hsa-miR-3       NC_002.1    
hsa-miR-4       NC_002.1, NC_003, NC_008, NC_023, NC_00455  
hsa-miR-6       NC_002.1, NC_008, NC_023, NC_00455  
hsa-miR-13      NC_002.1, NC_009, NC_023    
hsa-miR-18      NC_002.1, NC_003, NC_008, NC_009, NC_023    
hsa-miR-31      NC_04, NC_023, NC_045, NC_0875  
hsa-miR-32      NC_002.1, NC_008, NC_023, NC_00455  
hsa-miR-43      NC_002.1, NC_003, NC_008, NC_009, NC_023, NC_00455  
hsa-miR-44      NC_002.1, NC_009, NC_023    
hsa-miR-46      NC_002.1, NC_003, NC_008, NC_023, NC_00455  
hsa-miR-86      NC_04, NC_023, NC_0875  
hsa-miR-94      NC_04, NC_023, NC_0875  
hsa-miR-328     NC_04, NC_023, NC_045, NC_0875  
hsa-miR-421     NC_002.1, NC_06, NC_11  
hsa-miR-445     NC_002.1, NC_11, NC_023, NC_045 
hsa-miR-454     NC_002.1, NC_06, NC_11, NC_012  
hsa-miR-467     NC_002.1, NC_06, NC_11, NC_012, NC_023, NC_045  
hsa-miR-698     NC_002.1, NC_06, NC_11

This in turn would answer first and two questions:

cat temp2.txt |while read i; do echo ">$i" ; grep  -w $i temp.txt; done  | cut -f1 | seqkit fx2tab | awk '{gsub("NC",", NC"); sub(", ","\t"); print}'  | awk  '{print $1,"\t", NF-1}' | sort -k2

output:

hsa-miR-3    1
hsa-miR-13   3
hsa-miR-421      3
hsa-miR-44   3
hsa-miR-698      3
hsa-miR-86   3
hsa-miR-94   3
hsa-miR-31   4
hsa-miR-32   4
hsa-miR-328      4
hsa-miR-445      4
hsa-miR-454      4
hsa-miR-6    4
hsa-miR-18   5
hsa-miR-4    5
hsa-miR-46   5
hsa-miR-43   6
hsa-miR-467      6

Seqkit can be downloaded from here

ADD REPLYlink modified 20 months ago • written 20 months ago by cpad011211k
1

This certainly improves the output formatting. But wy do you use uniq -d ? As a consequence, you ignore all miRNAs with less than two targets (such as hsa-miR-3 in this case). I would rather use uniq without the -d.

ADD REPLYlink written 20 months ago by Carlo Yague4.4k

that is true. Code is updated. Thanks

ADD REPLYlink written 20 months ago by cpad011211k

@cpad0112, Thanks for the codes.

ADD REPLYlink written 20 months ago by s_bio10

@Carlo, Thanks for the code. However, in case of huge file, if one particular miRNA is targeting 100-200 sequences then it will be difficult to fish out these sequences. Hence, it would be great if we can get the list of sequences also (Accession number) that miRNA is targeting.

ADD REPLYlink written 20 months ago by s_bio10

@Carlo, Thanks a lot for answering all queries :-)

ADD REPLYlink written 20 months ago by s_bio10
0
gravatar for Buffo
20 months ago by
Buffo1.5k
Buffo1.5k wrote:

Something like this? micro_rna.org or this? mirtarbase or this? mirdb.. or mirwalk, etc, etc, You can just download .xls files and sort as you wish..

ADD COMMENTlink written 20 months ago by Buffo1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1396 users visited in the last hour