Question: How to fetch those lines that contain more than one kind of character
14 months ago, kartikayprasad wrote:

Hello friends, I have a file with multiple nucleotides per line in a comma-separated format. I want to keep only those lines that contain at least one different nucleotide. I don't want the lines that contain only one kind of nucleotide throughout. For example, I have a file like the one shown below, though the number of nucleotides in a single line can be as high as 100.

T,T,

G,G,

T,T,

G,G,

A,A,

T,T,T,

G,G,

C,C,C,

A,A,A,

T,T,T,

A,A,

A,A,

T,T,

A,A,

G,G,G,

T,T,T,

A,A,

A,T,A,

G,G,

G,G,

C,C,

G,G,G,

T,T,

A,A,

A,A,

G,T,

A,T,A,

T,C,

C,C,C,

G,G,

C,C,C,

A,A,

G,G,

G,G,

A,A,

T,T,

C,C,

A,A,

T,T,

T,T,T,

C,C,C,

T,T,

A,A,

T,T,

T,T,

G,G,G,

T,T,

A,A,A,

T,T,

T,T,

A,A,

T,T,

G,G,G,

A,A,

G,G,G,A,

T,T,

T,T,

T,T,

C,C,

A,A,

A,A,

T,T,

A,A,A,

T,T,

T,T,T,

C,C,

A,A,.

A,T,G,C

I want the answer to look like this:

A,T,A,

A,T,A,

G,G,G,A,

A,T,G,C

It would be very helpful if there is a one-liner for this. Please explain the code as well, so that I can understand it properly and use it in the future.

Thanks in advance to everyone who helps.

snp rna-seq next-gen assembly • 476 views
modified 14 months ago by steve • written 14 months ago by kartikayprasad

As a side comment: you can shorten your post a lot, by adding only some example lines of your file that are enough for us to understand the problem. Answer is coming right away (in the answers below).

written 14 months ago by Macspider

Macspider, thanks for the comment; I will keep it in mind. Waiting for the answer :)

written 14 months ago by kartikayprasad

Quite a few people have made a good effort here. Could you take the time to test each and then up-vote or accept the answers that have helped?

Thanks!

written 14 months ago by Kevin Blighe

Yes, I will try all of the code and will surely upvote the answers. Thanks once again for the help.

written 14 months ago by kartikayprasad
14 months ago, mxs wrote:
perl -lne '$a[0] = ($_ =~ tr/A//);$a[1]= ($_ =~ tr/T//);$a[2]=($_ =~ tr/C//);$a[3]=($_ =~ tr/G//); my $b =0; foreach(@a){$b++ if $_>0} print $_ if $b >1 ' myfile

This assumes that only A, T, C and G are under investigation.

modified 14 months ago • written 14 months ago by mxs

Another Perl solution:

perl -lne '%a = (); @b = split /,/, $_; foreach $b (@b) { $a{$b}++; }; @b = keys %a; print if ($#b > 0)' < in.txt

A,T,A,
G,T,
A,T,A,
T,C,
G,G,G,A,
A,T,G,C
modified 14 months ago • written 14 months ago by JC

Hey, thanks for the code. Can you please help me a little more? I tried to add something to your code, but it failed. I would like the code to also print \n for those lines that contain only one kind of nucleotide. For example, the code should print \n for A,A and for T,T, in addition to the result it already gives.

written 14 months ago by kartikayprasad

Hi, I'm not sure what you want to print: just a "\n" if the line contains a homozygous base, or only for A,A and T,T?

What you need is to extend the last if, something like:

perl -lne '%a = (); @b = split /,/, $_; foreach $b (@b) { $a{$b}++; }; @b = keys %a; if ($#b > 0) { print } else { print "\n" } ' < in.txt

written 14 months ago by JC
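For comparison, here is a minimal Python sketch of the same "print the line or a placeholder" idea. It assumes that a single empty line is the wanted placeholder for lines with only one kind of nucleotide; note that with perl -l, print "\n" emits two newlines, because -l already appends one. The function name mark_uniform_lines is made up for illustration and is not part of any answer above.

```python
def mark_uniform_lines(lines):
    """Yield mixed lines unchanged and an empty string for uniform ones."""
    for line in lines:
        raw = line.rstrip("\n")
        # Distinct non-empty fields; a trailing comma yields an empty field, which is dropped
        bases = {field for field in raw.split(",") if field}
        yield raw if len(bases) > 1 else ""


if __name__ == "__main__":
    sample = ["A,A,\n", "A,T,A,\n", "T,T,\n", "G,T,\n"]
    for out in mark_uniform_lines(sample):
        print(out)
```

Keeping one output line per input line means the line numbers of the output still match those of the input file.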

Hi mxs, thanks for the reply. Can you please explain the code? Thanks.

written 14 months ago by kartikayprasad

The idea is that you count each character separately. If a line contains only one type of character, the count for that character will be > 0 while the counts for the rest will be 0. So you count how many characters have a count > 0; if that number is > 1, the line is not poly-something, and that is the line you print. Learn Perl one-liners or awk. Don't waste your time on such trivial tasks; it adds up once you start doing bioinformatics professionally :)

PS: there is a shorter version of this solution. Can you figure it out? :)

modified 14 months ago • written 14 months ago by mxs
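The counting idea described above can be sketched in Python as follows. This is only an illustration under the same assumption as the Perl one-liner (only A, T, C and G matter); the function name has_multiple_bases and the alphabet parameter are hypothetical.

```python
def has_multiple_bases(line, alphabet="ATCG"):
    """Return True if at least two different letters of `alphabet` occur in `line`."""
    # Count how many letters of the alphabet appear at least once in the line
    present = sum(1 for base in alphabet if line.count(base) > 0)
    return present > 1


if __name__ == "__main__":
    for line in ["T,T,", "A,T,A,", "G,G,G,A,", "C,C,C,"]:
        if has_multiple_bases(line):
            print(line)
```

Characters outside the alphabet (commas, or anything unexpected in the data) are ignored by this check, which is the same limitation the Perl version has.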
14 months ago, Kevin Blighe (Guy's Hospital, London) wrote:

Assuming that each line in your file does not actually end in a comma:

awk '{strPrevBase=$1; boolDiff=0; for (i=2; i<=NF; i++) {if ($(i)!=strPrevBase) {boolDiff=1}} if (boolDiff==1) {print $0}} ' FS="," test
A,T,A
G,T
A,T,A
T,C
G,G,G,A
A,T,G,C

Note that I identified 2 extra lines in your pasted data where at least one base differs.

Kevin

written 14 months ago by Kevin Blighe
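A rough Python equivalent of the "compare every field with the first one" logic is sketched below. Unlike the awk command, it strips a trailing separator first, on the assumption that lines may end in a comma as in the pasted data. The function name differs_from_first is made up for this example.

```python
def differs_from_first(line, sep=","):
    """Return True if any field differs from the first field of the line."""
    # Remove surrounding whitespace and any trailing separator, then split into fields
    fields = line.strip().rstrip(sep).split(sep)
    first = fields[0]
    return any(field != first for field in fields[1:])


if __name__ == "__main__":
    for line in ["T,T,", "G,T,", "A,T,G,C", "C,C,C,"]:
        if differs_from_first(line):
            print(line)
```

Without the rstrip(sep) step, a line such as T,T, would end in an empty field that never equals the first field, so every line would be reported as mixed.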

Thank you very much for the help.

written 14 months ago by kartikayprasad

Hey, thanks for the code. Can you please help me a little more? I tried to add something to your code, but it failed. I would like the code to also print \n for those lines that contain only one kind of nucleotide. For example, the code should print \n for A,A and for T,T.

What I added to your code is an else branch at the end, but it is not working. Can you please help?

awk '{strPrevBase=$1; boolDiff=0; for (i=2; i<=NF; i++) {if ($(i)!=strPrevBase) {boolDiff=1}} if (boolDiff==1) {print $0} else {print "\n"}} ' FS="," test

written 14 months ago by kartikayprasad

Sure, can you try this (I think that this is what you want):

awk '{strPrevBase=$1; boolDiff=0; for (i=2; i<=NF; i++) {if ($(i)!=strPrevBase) {boolDiff=1}} if (boolDiff==1) {print $0} else print "\n"} ' FS="," test
written 14 months ago by Kevin Blighe
14 months ago, Macspider (Vienna - BOKU) wrote:

Here's a one-liner that will do the trick. It's 95% Python, so I hope that's fine. Replace "test.txt" with your file name.

command

cat test.txt | python2.7 -c 'import sys; lst=[[line.rstrip("\n"), list(set(line.rstrip("\b\r\n, ").split(",")))] for line in sys.stdin]; tmp=[x[0] for x in lst if len(x[1])>1]; sys.stdout.write("\n".join(tmp) + "\n")'

explanation

We cat the file and pipe it to python2.7, using the -c option to pass a command within quotes (''). We first import the sys module just to make reading the input and writing the output easy (or at least, I like it haha). The statements of the Python command are separated by semicolons (;).

We create a list (lst). We read the input through the Python list comprehension syntax (see the end of the first statement: ... for line in sys.stdin). What we declare before that is the value stored in the list for each line. In this case, it is another list composed of two elements. The first item of this sub-list is the raw line you want to print out; the second is a processed version of it.

The first item of the sub-list is simply the line stripped of its newline character (rstrip("\n")). The second is processed more. We remove the trailing control characters, commas and spaces (rstrip("\b\r\n, ")). We then split this item at every comma (split(",")). This produces an output like ["A", "T", "A"], a list where each item is one of the values you had separated by commas. So each line at this point looks like this:

["A,T,A", ["A", "T", "A"]]

A list of two elements: the raw line as a string and the processed line as a list.

Since you want only the lines that contain more than one "letter", one neat way to do so is to "unique" the list and see whether the final length is > 1 (i.e. there is more than one letter). To do so in Python: list(set()). set() removes the duplicates in the list, and list() turns the result back into a list. So each line at this point looks like this (the order of the unique items is not guaranteed):

["A,T,A", ["A", "T"]]

Note that the last A has disappeared, being a duplicate.

The next statement in the Python part selects only those lines whose uniqued list has more than one item, meaning the ones you are interested in. It does so by checking the list length (if len(x[1])>1). Each selected item is a list of two elements, where the first is the raw input line. We make a list, which I call tmp here, that contains only the raw input line of each selected item. That is what we now print out: with sys.stdout.write("\n".join(tmp) + "\n") we join() the elements of this list with newline characters, forming the line-formatted output, and we add a final newline to complete it (+ "\n").

written 14 months ago by Macspider
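Since python2.7 is no longer maintained, here is a minimal Python 3 sketch of the same set-based idea, written as a generator instead of a one-liner. It is an illustration rather than a drop-in replacement; the function name mixed_lines is hypothetical, and the sample lines are taken from the question.

```python
def mixed_lines(lines):
    """Yield the raw lines that contain more than one distinct field."""
    for line in lines:
        raw = line.rstrip("\n")
        # Drop trailing carriage returns, commas and spaces, then deduplicate the fields
        distinct = set(raw.rstrip("\r, ").split(","))
        if len(distinct) > 1:
            yield raw


if __name__ == "__main__":
    sample = ["T,T,\n", "A,T,A,\n", "G,G,\n", "A,T,G,C\n"]
    for kept in mixed_lines(sample):
        print(kept)
```

Because this processes one line at a time, it does not need to hold the whole file in memory; passing an open file object or sys.stdin as lines works the same way as the sample list.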
14 months ago, steve (United States) wrote:

Python

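import sys

input_file = "nucleotides.csv"  # path to the comma-separated input file
with open(input_file) as f:
    for line in f:
        # Split on commas and drop empty fields (e.g. from a trailing comma)
        parts = [x for x in line.strip().split(',') if x != '']
        # True when every field equals the first one (also True for an empty line)
        all_equal = all(x == parts[0] for x in parts)
        # Print only non-empty lines that contain at least two different values
        if not all_equal and len(parts) > 0:
            sys.stdout.write(line)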
written 14 months ago by steve