Question: My file does not have duplicates, but it still shows "duplicate 'row.names' are not allowed"
mikysyc2016 wrote, 3 months ago:

Hi all, I checked my file with which(duplicated(file)) and removed the duplicates it found. But when I read the file into R, it still shows the error below:

 x <- read.delim("merged_6_rd.txt", row.names = 1, stringsAsFactors = FALSE)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  duplicate 'row.names' are not allowed

I do not know how to deal with this. Thanks!

Tags: rna-seq, R
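For context on why the check in the question missed these: duplicated(file) tests whole rows, so a second row that repeats an ID but is empty (or different) in the other columns is not flagged, while read.delim(..., row.names = 1) looks only at column 1. A minimal R sketch for locating the offending rows before assigning row names (assuming the file otherwise parses cleanly; read.delim's default fill = TRUE pads short lines with NA):

 # Read without row.names so duplicate IDs cannot trigger the error
 x <- read.delim("merged_6_rd.txt", stringsAsFactors = FALSE)
 # IDs in column 1 that occur more than once
 dup_ids <- unique(x[[1]][duplicated(x[[1]])])
 # Inspect every row carrying a duplicated ID
 x[x[[1]] %in% dup_ids, ]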

Assuming that you are on *nix/macOS, run the following command and let us know the output:

~ $ cut -f1 merged_6_rd.txt | uniq -d

Please add sort, as mentioned in Pierre's post, if the entries in column 1 are not sorted; if they are already sorted, you don't have to.

written 3 months ago by cpad0112

uniq needs sorted input:

 cut -f1 merged_6_rd.txt | sort | uniq -d
written 3 months ago by Pierre Lindenbaum

Why not count the occurrences in the first column?

written 3 months ago by shenwei356
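One way to act on this suggestion in R (a sketch, equivalent to piping column 1 through uniq -c and keeping counts above 1):

 x <- read.delim("merged_6_rd.txt", stringsAsFactors = FALSE)
 counts <- table(x[[1]])   # occurrences of each ID in column 1
 counts[counts > 1]        # IDs appearing more than once, with their counts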

When I use

 cut -f1 merged_6_rd.txt | sort | uniq -d

I get:

ID
NM_001001130
NM_001001144
NM_001001152
NM_001001160
NM_001001176
NM_001001177
NM_001001178
NM_001001180
NM_001001181
NM_001001182
NM_001001183
NM_00100118
..........
written 3 months ago by mikysyc2016

Those are the duplicated entries in your data (note that ID is listed too, so the header line itself appears more than once). Now run:

 grep -i -w 'NM_001001130' merged_6_rd.txt

You should get more than one row, and in the first column of the resulting rows you should see duplicate entries of NM_001001130.

written 3 months ago by cpad0112

You are right, I get two:

NM_001001130    22  16  14  12  25  18  2218
NM_001001130

How can I remove the second one? Thanks!

written 3 months ago by mikysyc2016

Well, you need to look at the other duplicate entries and see if they follow the same pattern; then one can write a script to remove the empty entries. Otherwise, you need to come up with a way to handle such entries. Making a list of the duplicate entries in a separate file may help.

If it is the same pattern, see if the following works:

 $ awk '!a[$1]++' merged_6_rd.txt

This keeps only the first line seen for each value of column 1, on the assumption that the empty lines come second when there are duplicates; please validate the output for the previously identified duplicates. If that is not the case, try:

 $ awk '$2!=""' merged_6_rd.txt

This is on the assumption that the duplicate lines to be removed have an empty second column.

written 3 months ago by cpad0112
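For those who prefer to stay in R, a minimal sketch of the same idea as the awk one-liners, under the same assumption that the lines to drop are the near-empty second copies (read.delim's default fill = TRUE pads them with NA):

 x <- read.delim("merged_6_rd.txt", stringsAsFactors = FALSE)
 x <- x[!duplicated(x[[1]]), ]   # keep the first row per ID, like awk '!a[$1]++'
 rownames(x) <- x[[1]]           # safe now: IDs are unique
 x[[1]] <- NULL                  # drop the ID column, mirroring row.names = 1

As with the awk commands, validate the result against the IDs reported by uniq -d before overwriting the original file.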

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

written 3 months ago by RamRS

You got good pointers on how to remove rows with duplicate names, but I feel you should investigate why the file has them in the first place: generally, analysis pipelines output results with unique identifiers. How was this file created?

written 9 weeks ago by h.mon