Question

bash and awk code

1

6.5 years ago

Sam ▴ 150

Hello all, I'm new to bash. Could you help me solve this small technical problem? I want to write a small script that copies the miRNA name to the left of each sequence until a new name is found. The file is in CSV and tab delimited format. Thanks

mir162
,ATAAACCTCTGCATCCAG
----------
,CGATAAACCTCTGCATCC
----------
,CGATAAACCTCTGCATCCAG
----------
mir172
----------
,AGAATCTTGATGATGCTGC

convert to :

----------
mir162,ATAAACCTCTGCATCCAG
----------
mir162,CGATAAACCTCTGCATCC
----------
mir162,CGATAAACCTCTGCATCCAG
----------
mir172,AGAATCTTGATGATGCTGC

bash awk terminal • 2.8k views

ADD COMMENT • link 16 months ago by Sam ▴ 150

2

I added markup to your post for increased readability. I hope the format is correct as displayed. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.

ADD REPLY • link 6.5 years ago by WouterDeCoster 47k

score 4 · Answer 1 · 2017-10-08

4

6.5 years ago

Kevin Blighe 87k

Does your data have those dashed lines in it? It didn't when I first looked.

Without your dashed lines:

paste -d"\0" <(awk '/^mir/{mir=$0}; /,[ATGC]*/{print mir}' MyMirData.txt) <(grep -v -e "^mir" MyMirData.txt)
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

...or:

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}' MyMirData.txt
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

With the dashed lines:

paste -d"\0" <(awk '/^mir/{mir=$0}; /,[ATGC]*/{print mir}' MyMirDataWithDashes.txt) <(grep -v -e "^mir" -v -e "--" MyMirDataWithDashes.txt)
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

...or:

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}' MyMirDataWithDashes.txt
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

ADD COMMENT • link 6.5 years ago by Kevin Blighe 87k

0

Hey Kevin, I'm not so good at awk and I would like to improve. Could you explain this?

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}'

ADD REPLY • link 6.5 years ago by lessismore ★ 1.3k

3

Sure, my friend. AWK is a very powerful program. It was/is its own programming language.

The awk code above will be executed on each line in the input file.

The first part, /^mir/{mir=$0}, is saying: if the line begins with 'mir' (the caret ^ is an anchor that matches the beginning of a line), then assign the value of the line ($0) to a variable called 'mir'.

The second part, /,[ATGC]*/{printf mir $0 "\n"}, is saying: if the line contains a comma followed by any number (including zero) and combination of ATGC (indicated by ,[ATGC]*), then print the value held in the 'mir' variable followed by the entire line and an end-line ('\n'). I probably could have also used ^,[ATGC]*$ here, with the ^ and $ being anchors for the beginning and end of the line. The square brackets match any one of the letters inside them, and the * repeats that match, so the letters can appear in any order.

Due to the way that this is structured, the 'mir' variable will only change when the next mir is encountered in the input file.
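
To make that concrete, here is a minimal sketch using the anchored pattern mentioned above. The sample input is built inline and the variable names `input` and `result` are only for illustration; `+` is used instead of `*` on the assumption that a lone comma should not count as a sequence.

```shell
# Sample input in the layout from the question (built inline for illustration).
input='mir162
,ATAAACCTCTGCATCCAG
,CGATAAACCTCTGCATCC
mir172
,AGAATCTTGATGATGCTGC'

# Rule 1: remember the most recent line that starts with "mir".
# Rule 2: for a line that is a comma followed only by A/T/G/C, print the
#         remembered name in front of it.
result=$(printf '%s\n' "$input" | awk '
  /^mir/       { mir = $0 }
  /^,[ATGC]+$/ { print mir $0 }
')

printf '%s\n' "$result"
```

Both rules are tested against every line, in order, which is why the name has to be stored in a variable before the sequence lines that follow it are reached.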

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Hi Kevin, could you modify the first part to have the organism and mir name on the same line?

ADD REPLY • link 6.5 years ago by Sam ▴ 150

0

Sure, can you let me see the exact input?

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

input file format:

Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC

convert to :

Organism: aau, mir162,,ATAAACCTCTGCATCCAG
Organism: aau, mir162,,CGATAAACCTCTGCATCC

ADD REPLY • link 6.5 years ago by Sam ▴ 150

1

cat MyMirData.txt 
Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC
Organism: aau,
    ,mir182
    ,,CCTCTGCATCCAG
    ,,AACCTCTGCATCC

sed 's/[[:space:]]\+,//' MyMirData.txt | awk '/^Organism/{varOrg=$0}; /^mir/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}' | sed 's/,/, /g'
Organism: aau, mir162, ATAAACCTCTGCATCCAG
Organism: aau, mir162, CGATAAACCTCTGCATCC
Organism: aau, mir182, CCTCTGCATCCAG
Organism: aau, mir182, AACCTCTGCATCC

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

That's very helpful, man! Thanks! So the two conditions are separated by ";"? What's the use of "printf" here? You could have had the same output with

 awk '/^mir/{mir=$0};/,[ATGC]*/{print mir $0}'

ADD REPLY • link 6.5 years ago by lessismore ★ 1.3k

0

I have posted the answer above, which should work with the 'Organism' included. It just involved an extra part to the awk command, but I also remove all leading whitespace from the file with sed before piping into awk.

Yes, you are quite correct regarding print and printf. I did not have to use printf.
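
One practical reason to prefer print here, sketched below: the first argument to awk's printf is a format string, so a form like printf mir $0 "\n" would treat any % in the data as a format specifier. The name `mir%d` is made up purely to show the effect and does not come from the thread; the variable names are also just for illustration.

```shell
# Hypothetical input whose name contains a percent sign (not from the thread).
input='mir%d
,ATAAACC'

# print never interprets its arguments as a format string.
with_print=$(printf '%s\n' "$input" | awk '
  /^mir/       { mir = $0 }
  /^,[ATGC]+$/ { print mir $0 }
')

# If printf is wanted, pass the data as arguments to a fixed format string.
with_printf=$(printf '%s\n' "$input" | awk '
  /^mir/       { mir = $0 }
  /^,[ATGC]+$/ { printf "%s%s\n", mir, $0 }
')

printf '%s\n%s\n' "$with_print" "$with_printf"
```

Both variants print the name and sequence unchanged, percent sign included.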

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Hi Kevin, how can I revise this code to handle "let" and "bantam" microRNAs? This code only works for mir, and I have other miRNAs such as let-7 and bantam-3p. Thanks for your help.

ADD REPLY • link 6.5 years ago by Sam ▴ 150

2

Hi Sam,

Assuming that everything is in the same format, this should function correctly:

sed 's/[[:space:]]\+,//' MyMirData.txt | awk '/^Organism/{varOrg=$0}; /^mir|^let|^bantam/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}' | sed 's/,/, /g'

The only part that I changed is the matching pattern for awk: /^mir|^let|^bantam/, i.e., match 'mir' or 'let' or 'bantam' at the beginning of the line.
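
The same alternation can also be written with a single anchor and a group, /^(mir|let|bantam)/, which is easier to extend as more prefixes turn up. A minimal sketch follows, with inline sample data that is assumed rather than taken from a real file; the first sed uses ^[[:space:]]*, in place of [[:space:]]\+, only because \+ is a GNU extension.

```shell
# Assumed sample data in the same layout, with let and bantam names.
input='Organism: aau,
    ,let-7
    ,,ATAAACC
    ,bantam-3p
    ,,CGATAA'

result=$(printf '%s\n' "$input" |
  sed 's/^[[:space:]]*,//' |          # strip leading blanks and the first comma
  awk '
    /^Organism/          { varOrg = $0 }
    /^(mir|let|bantam)/  { varMir = $0 }   # one anchor, grouped alternatives
    /^,[ATGC]*$/         { print varOrg varMir $0 }
  ' |
  sed 's/,/, /g')                     # put a space after every comma

printf '%s\n' "$result"
```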

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Dear Kevin

Is it possible to revise the code according to these criteria?

input file

Organism: hsa,
,let-7a-2-3p
,,CTGTACAGCCTCCTAGCTTTCC,
,,Totals: ,
,mir-7a-3p
,,CTATACAATCTACTGTC,
,,CTATACAATCTACTGTCT,

output

Organism: hsa,let-7a-2-3p,CTGTACAGCCTCCTAGCTTTCC,
Organism: hsa,let-7a-2-3p,Totals: ,
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTC,
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTCT,

ADD REPLY • link 16 months ago by Sam ▴ 150

score 3 · Answer 2 · 2017-10-08

3

6.5 years ago

Pierre Lindenbaum 161k

awk '/^-/ {next;} /^,/ {printf("%s%s\n----------\n",M,$0);next;}{M=$0;}' input.txt
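
For readers newer to awk, the same one-liner can be spread over several lines with comments. This is only a reformatted sketch with inline sample data (the `input` and `result` names are illustrative); the logic is unchanged.

```shell
# Sample input with dashed separator lines, built inline for illustration.
input='mir162
,ATAAACCTCTGCATCCAG
----------
,CGATAAACCTCTGCATCC
----------
mir172
----------
,AGAATCTTGATGATGCTGC'

result=$(printf '%s\n' "$input" | awk '
  # Dashed separator lines are skipped entirely.
  /^-/ { next }
  # A line starting with a comma is a sequence: print name + sequence,
  # followed by a fresh separator line.
  /^,/ { printf("%s%s\n----------\n", M, $0); next }
  # Anything else is taken to be a miRNA name and remembered in M.
  { M = $0 }
')

printf '%s\n' "$result"
```

Because the last rule has no pattern, it does not depend on the name starting with 'mir'; any line that is neither a separator nor a sequence becomes the current name.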

ADD COMMENT • link 6.5 years ago by Pierre Lindenbaum 161k

0

could you help about this ?

  Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC
convert to :
Organism: aau, mir162,,ATAAACCTCTGCATCCAG
Organism: aau, mir162,,CGATAAACCTCTGCATCC

ADD REPLY • link 6.5 years ago by Sam ▴ 150

1

I'm not very good at awk, but you can always do it step by step if you don't know how to do it in one command, for example something like this

awk '/,mir*/{mir=$0}; /,[ATGC]*$/{printf mir $0 "\n"}' test.data > test1.data 
#adding the mir162 to the sequences
awk '/^Organism/{org=$0" " }; /^mir*/ {printf org  $0 "\n"}' test1.data > test2.data
#adding the organism to the lines
sed 's/    ,,/,,/g' test2.data > test3.data
#modifying the lines the way you want them to be    sed 's/find/replace/g' filename  : finds and replaces; for example, here it removes the extra spaces before the commas

ADD REPLY • link 6.5 years ago by Fatima ▴ 1000

1

Hey Afagh, I posted a solution in my answer (above), in case you wanted to take a look. It only takes one line.

AWK can be difficult to learn, but it is worth the time to learn it! :) It is very powerful.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Hi Kevin,

Your code works fine, but why doesn't awk '/^Organism/{org=$0" "}; /,mir/{mir=$0}; /,[ATGC]*$/ {print org mir $0 "\n"}' test.data | sed 's/ ,/,/g' work?

Outputs:

Organism: aau, Organism: aau,

Organism: aau, ,mir162,,ATAAACCTCTGCATCCAG

Organism: aau, ,mir162,,CGATAAACCTCTGCATCC

ADD REPLY • link 6.5 years ago by Fatima ▴ 1000

1

You need to remove all leading whitespace with sed 's/[[:space:]]\+,//' MyMirData.txt first, and then pipe this into awk.
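
A sketch of why that matters, with the stripping done inside awk via sub() instead of a separate sed (a substitution made here for illustration, not the command above). The anchored patterns ^mir and ^,[ATGC]+$ only match once the leading whitespace and first comma are gone. Anchoring also stops the sequence rule from firing on the 'Organism: aau,' line, which the unanchored /,[ATGC]*$/ does match because that line ends in a comma.

```shell
# Sample input from the question, built inline; variable names are illustrative.
input='Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC'

result=$(printf '%s\n' "$input" | awk '
  # Drop leading blanks plus the first comma, so names start at column 1
  # and sequence lines start with exactly one comma.
  { sub(/^[ \t]+,/, "") }
  /^Organism/  { org = $0 " " }
  /^mir/       { mir = $0 }
  /^,[ATGC]+$/ { print org mir $0 }
')

printf '%s\n' "$result"
```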

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Keep working with it and you'll get it! :) We love AWK

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Amazing Kevin! Thanks to people like you!

ADD REPLY • link 6.5 years ago by lessismore ★ 1.3k

0

Pierre is an expert at awk too - better than me, I think.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

2

you are real biostarS :)

ADD REPLY • link 6.5 years ago by lessismore ★ 1.3k