Question: bash and awk code
Sam20 wrote:

Hello all, I'm new to bash. Could you help me solve this small technical problem? I want to write a small script that copies each miRNA name to the left of every sequence that follows it, until a new name is found. The file is in CSV and tab-delimited format. Thanks.

mir162
,ATAAACCTCTGCATCCAG
----------
,CGATAAACCTCTGCATCC
----------
,CGATAAACCTCTGCATCCAG
----------
mir172
----------
,AGAATCTTGATGATGCTGC

convert to :

----------
mir162,ATAAACCTCTGCATCCAG
----------
mir162,CGATAAACCTCTGCATCC
----------
mir162,CGATAAACCTCTGCATCCAG
----------
mir172,AGAATCTTGATGATGCTGC
awk terminal bash • 456 views
modified 6 weeks ago • written 6 weeks ago by Sam20

I added markup to your post for increased readability. I hope the format is correct as displayed. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar whenever you compose or edit a post.

written 6 weeks ago by WouterDeCoster23k

Republic of Ireland (Éire)
Kevin Blighe6.7k wrote:

Does your data have those dashed lines in it? It didn't when I first looked.

Without your dashed lines:

paste -d"\0" <(awk '/^mir/{mir=$0}; /,[ATGC]*/{print mir}' MyMirData.txt) <(grep -v -e "^mir" MyMirData.txt)
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

...or:

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}' MyMirData.txt
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

With the dashed lines:

paste -d"\0" <(awk '/^mir/{mir=$0}; /,[ATGC]*/{print mir}' MyMirDataWithDashes.txt) <(grep -v -e "^mir" -v -e "--" MyMirDataWithDashes.txt)
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

...or:

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}' MyMirDataWithDashes.txt
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC
modified 6 weeks ago • written 6 weeks ago by Kevin Blighe6.7k

Hey Kevin, I'm not so good at awk and I would like to improve. Could you explain this?

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}'
written 6 weeks ago by lessismore370

Sure, my friend. AWK is a very powerful program, and it is a programming language in its own right.

The awk code above will be executed on each line in the input file.

The first part, /^mir/{mir=$0}, is saying: if the line begins with 'mir' (the caret ^ is an anchor that matches the beginning of a line), then assign the value of the line ($0) to a variable called 'mir'.

The second part, /,[ATGC]*/{printf mir $0 "\n"}, is saying: if the line contains a comma followed by any number and combination of A, T, G and C (indicated by ,[ATGC]*), then print the value held in the 'mir' variable followed by the entire line and an end-of-line character ('\n'). Because * also allows zero letters, this pattern strictly matches any line that contains a comma, which is enough for this input. I probably could have also used ^,[ATGC]*$ here, with ^ and $ anchoring the match to the beginning and end of the line. The square brackets match any one of the letters inside them, and the * repeats that match, so the letters can come in any order.

Due to the way that this is structured, the 'mir' variable will only change when the next mir is encountered in the input file.
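
As a minimal sketch of those two pattern-action pairs, assuming the dash-free input from the question: the version below uses the anchored pattern with + (so at least one base is required) and plain print instead of printf. The shell variable names input and result are only for illustration.

```shell
# Sample input from the question (without the dashed lines).
input='mir162
,ATAAACCTCTGCATCCAG
,CGATAAACCTCTGCATCC
,CGATAAACCTCTGCATCCAG
mir172
,AGAATCTTGATGATGCTGC'

result=$(printf '%s\n' "$input" | awk '
    /^mir/       { mir = $0 }       # name line: remember it
    /^,[ATGC]+$/ { print mir $0 }   # sequence line: prefix the remembered name
')

printf '%s\n' "$result"
```

Using print here also avoids treating the data as a printf format string, which could matter if a line ever contained a % character.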

modified 6 weeks ago • written 6 weeks ago by Kevin Blighe6.7k

Hi Kevin, could you modify the first part to put the organism and the mir name on the same line?

written 6 weeks ago by Sam20

Sure, can you let me see the exact input?

written 6 weeks ago by Kevin Blighe6.7k

input file format:

Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC

    convert to :

    Organism: aau, mir162,,ATAAACCTCTGCATCCAG
    Organism: aau, mir162,,CGATAAACCTCTGCATCC
modified 6 weeks ago • written 6 weeks ago by Sam20

cat MyMirData.txt 
Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC
Organism: aau,
    ,mir182
    ,,CCTCTGCATCCAG
    ,,AACCTCTGCATCC

sed 's/[[:space:]]\+,//' MyMirData.txt | awk '/^Organism/{varOrg=$0}; /^mir/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}' | sed 's/,/, /g'
Organism: aau, mir162, ATAAACCTCTGCATCCAG
Organism: aau, mir162, CGATAAACCTCTGCATCC
Organism: aau, mir182, CCTCTGCATCCAG
Organism: aau, mir182, AACCTCTGCATCC
modified 6 weeks ago • written 6 weeks ago by Kevin Blighe6.7k

That's very helpful, man! Thanks! So the two conditions are separated by ";"? What's the use of "printf" here? You could have had the same output with:

 awk '/^mir/{mir=$0};/,[ATGC]*/{print mir $0}'
written 6 weeks ago by lessismore370

I have posted the answer above, which should work with the 'Organism' line included. It just involved an extra part in the awk command, but I also remove the leading whitespace (together with the first comma) from each line with sed before piping into awk.

Yes, you are quite correct regarding print and printf. I did not have to use printf.
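
For what it is worth, the sed steps and the awk step from the answer above can probably be folded into a single awk program. This is only a sketch, assuming the input looks exactly like the MyMirData.txt example; sub() and gsub() are standard awk functions, while the names org, mir and line (and the shell variables input and result) are just illustrative.

```shell
# Input copied from the MyMirData.txt example above.
input='Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC
Organism: aau,
    ,mir182
    ,,CCTCTGCATCCAG
    ,,AACCTCTGCATCC'

result=$(printf '%s\n' "$input" | awk '
    { sub(/^[ \t]+,/, "") }           # drop leading whitespace and the first comma
    /^Organism/  { org = $0; next }   # remember the organism line
    /^mir/       { mir = $0; next }   # remember the mir name
    /^,[ATGC]+$/ {
        line = org mir $0             # glue the three pieces together
        gsub(/,/, ", ", line)         # put a space after every comma
        print line
    }
')

printf '%s\n' "$result"
```

The first rule has no pattern, so it runs on every line and the later patterns are tested against the already-stripped line.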

written 6 weeks ago by Kevin Blighe6.7k

Hi Kevin, how can I revise this code to handle "let" and "bantam" microRNAs? This code only works for mir, and I have other miRNAs such as let-7 and bantam-3p. Thanks for your help.

modified 28 days ago • written 28 days ago by Sam20

Hi Sam,

Assuming that everything is in the same format, this should function correctly:

sed 's/[[:space:]]\+,//' MyMirData.txt | awk '/^Organism/{varOrg=$0}; /^mir|^let|^bantam/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}' | sed 's/,/, /g'

The only part that I changed is the matching pattern for awk: /^mir|^let|^bantam/, i.e. match 'mir' or 'let' or 'bantam' at the beginning of the line.
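
The same alternation can also be written with a group, /^(mir|let|bantam)/. A hedged sketch follows. The input is hypothetical: the names let-7 and bantam-3p come from the question, and the sequences are reused from the earlier example. The sed expression uses [[:space:]][[:space:]]* in place of [[:space:]]\+ only because \+ is a GNU extension that some sed implementations may not accept.

```shell
# Hypothetical input: names from the question, sequences reused from the earlier example.
input='Organism: aau,
    ,let-7
    ,,ATAAACCTCTGCATCCAG
    ,bantam-3p
    ,,CCTCTGCATCCAG'

result=$(printf '%s\n' "$input" \
    | sed 's/[[:space:]][[:space:]]*,//' \
    | awk '/^Organism/{varOrg=$0}; /^(mir|let|bantam)/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}' \
    | sed 's/,/, /g')

printf '%s\n' "$result"
```

Adding another miRNA family should then only need one more name inside the parentheses.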

written 28 days ago by Kevin Blighe6.7k

France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum101k wrote:
awk '/^-/ {next;} /^,/ {printf("%s%s\n----------\n",M,$0);next;}{M=$0;}' input.txt
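
For readers newer to awk, the same one-liner can be spread over several lines with comments. This is only a sketch of one reading of it, run on the dashed input from the question; the shell variable names input and result are illustrative.

```shell
# Dashed input from the question.
input='mir162
,ATAAACCTCTGCATCCAG
----------
,CGATAAACCTCTGCATCC
----------
,CGATAAACCTCTGCATCCAG
----------
mir172
----------
,AGAATCTTGATGATGCTGC'

result=$(printf '%s\n' "$input" | awk '
    /^-/ { next }                                       # skip the dashed separator lines
    /^,/ { printf("%s%s\n----------\n", M, $0); next }  # sequence: name + line, then a separator
         { M = $0 }                                     # anything else is a name: remember it
')

printf '%s\n' "$result"
```

Note that the separator is printed after each joined line, so the output ends with a dashed line rather than starting with one.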
modified 6 weeks ago • written 6 weeks ago by Pierre Lindenbaum101k

Could you help with this?

  Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC
convert to :
Organism: aau, mir162,,ATAAACCTCTGCATCCAG
Organism: aau, mir162,,CGATAAACCTCTGCATCC
written 6 weeks ago by Sam20

I'm not very good at awk, but you can always do it step by step if you don't know how to do it in one command, for example something like this:

awk '/^Organism/{print; next}; /,mir/{mir=$0; next}; /,,[ATGC]+$/{print mir $0}' test.data > test1.data
# add the mir name (e.g. mir162) in front of each sequence; Organism lines pass through unchanged
awk '/^Organism/{org=$0" "; next}; {print org $0}' test1.data > test2.data
# add the organism in front of each line
sed 's/[[:space:]]*,mir/ mir/; s/[[:space:]]*,,/,,/' test2.data > test3.data
# tidy the lines the way you want them: sed 's/find/replace/' filename finds and replaces; here it removes the extra whitespace and the comma before the mir name
modified 6 weeks ago • written 6 weeks ago by Afagh70

Hey Afagh, I posted a solution in my answer (above), in case you wanted to take a look. It only takes one line.

AWK can be difficult to learn, but it is worth the time to learn it! :) It is very powerful.

written 6 weeks ago by Kevin Blighe6.7k

Hi Kevin,

Your code works fine, but why doesn't awk '/^Organism/{org=$0" "}; /,mir/{mir=$0}; /,[ATGC]*$/ {print org mir $0 "\n"}' test.data | sed 's/ ,/,/g' work?

Outputs:

Organism: aau, Organism: aau,

Organism: aau, ,mir162,,ATAAACCTCTGCATCCAG

Organism: aau, ,mir162,,CGATAAACCTCTGCATCC

modified 6 weeks ago • written 6 weeks ago by Afagh70

You need to remove all leading whitespace with sed 's/[[:space:]]\+,//' MyMirData.txt first, and then pipe this into awk.
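
One more thing that seems to contribute, judging from the first output line: the pattern /,[ATGC]*$/ is not anchored at the start of the line, and * allows zero letters, so it also matches the header line 'Organism: aau,'. A small sketch that counts how many input lines each pattern matches (the variable names input, loose and strict are only illustrative):

```shell
# Input from the question, with its leading whitespace left in place.
input='Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC'

# Lines matched by the unanchored pattern from the question.
loose=$(printf '%s\n' "$input" | awk '/,[ATGC]*$/ { n++ } END { print n + 0 }')

# Lines matched when the pattern is anchored and requires at least one base.
strict=$(printf '%s\n' "$input" | awk '/^[ \t]*,,[ATGC]+$/ { n++ } END { print n + 0 }')

printf 'loose=%s strict=%s\n' "$loose" "$strict"
```

The loose pattern counts the Organism line in addition to the two sequence lines, which is where the extra "Organism: aau, Organism: aau," line in the output comes from.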

written 6 weeks ago by Kevin Blighe6.7k

Keep working with it and you'll get it! :) We love AWK

written 6 weeks ago by Kevin Blighe6.7k

Amazing Kevin! Thanks to people like you!

written 28 days ago by lessismore370

Pierre is an expert at awk too - better than me, I think.

written 27 days ago by Kevin Blighe6.7k

you are real biostarS :)

modified 27 days ago • written 27 days ago by lessismore370