Question: Remove rows based on duplicates in one column
0
gravatar for thomasterTW
4.4 years ago by
thomasterTW20
RIMLS, Nijmegen, The Netherlands
thomasterTW20 wrote:

I'm doing RNAseq analysis to determine differential expression patterns. After running a custom R script to determine the cut off between expressed and unexpressed genes (kernel density plot), I get a list with 4 columns.

Ensembl ID          Short gene name     Locus                       10Log FPKM
ENSG00000201699     RNU1-59P            chr1:144534037-144534199    1.587315
ENSG00000215861     WI2-1896O14.1       chr1:144676873-144679969    0.975001
ENSG00000254539     RP4-791M13.3        chr1:144833167-144835867    -0.38691
ENSG00000225241     RP11-640M9.2        chr1:144593362-144621656    0.982576
ENSG00000203843     PFN1P2              chr1:144612265-144612683    -0.55001
ENSG00000235398     LINC00623           chr1:144275917-144311653    -0.44573 <-- dup
ENSG00000235398     LINC00623           chr1:144299757-144341756    0.524043 <-- dup
ENSG00000207106     RNVU1-4             chr1:144311212-144311376    0.949058
ENSG00000231360     AL592284.1          chr1:144339737-144521058    -0.37044
ENSG00000236943     RP11-640M9.1        chr1:144456137-144521970    0.434007

These lists averagely contain 25k genes. As you can see, there's a duplicate in this list (LINC00623). What I want to do is write a script (preferably perl or R, or maybe something like awk in the command line) to find these duplicates based on the Ensembl ID, and remove the line with the lowest FPKM (since it is the same gene averagely on the same locus but apparently cufflinks decides that it is expressed twice). I can write a script to determine this for my example, but the problem is that there can be genes in between the duplicates, so I need to find duplicates in the entire column. I haven't been able to figure this out so I really hope someone can help me.

Thanks in advance!

rna-seq duplicates • 1.3k views
ADD COMMENTlink modified 22 months ago by RamRS27k • written 4.4 years ago by thomasterTW20
1

I would first identify why cufflinks decided that it is expressed twice...

If you do in the terminal:

cut -f 2 file.txt | uniq -d > duplicates

Then you know which "short gene names" are duplicates and you could identify a pattern or something and the number of them.

ADD REPLYlink modified 22 months ago by RamRS27k • written 4.4 years ago by Floris Brenk940

I'd agree with this, I guess it's something to do with the XLOC codes produced and how your script handles those. I'd also highly suggest that you try the htseq_count -> DESeq2 method for differential gene expression, if anything just to compare results.

ADD REPLYlink modified 22 months ago by RamRS27k • written 4.4 years ago by andrew.j.skelton736.0k
2
gravatar for 5heikki
4.4 years ago by
5heikki8.9k
Finland
5heikki8.9k wrote:

In shell:

export LC_ALL=C; sort -k1,1 -k4,4gr file | sort -mu -k1,1 > out

or with proper GNU sort

export LC_ALL=C; sort --parallel $NUMCORES(max 8) -k1,1 -k4,4gr file | sort -mu -k1,1 > out
ADD COMMENTlink modified 22 months ago by RamRS27k • written 4.4 years ago by 5heikki8.9k

Thanks! This works very well

ADD REPLYlink modified 22 months ago by RamRS27k • written 4.4 years ago by thomasterTW20
0
gravatar for george.ry
4.4 years ago by
george.ry1.1k
United Kingdom
george.ry1.1k wrote:

Sort the table by the last column:

df = df[order(-df$Log10_FPKM),]

Then remove duplicates based on first column:

df[!duplicated(df$Ensembl_ID),]
ADD COMMENTlink modified 22 months ago by RamRS27k • written 4.4 years ago by george.ry1.1k
0
gravatar for Charles Plessy
4.4 years ago by
Charles Plessy2.7k
Japan
Charles Plessy2.7k wrote:

Not in R, but have a look at bedtools groupby.

ADD COMMENTlink modified 22 months ago by RamRS27k • written 4.4 years ago by Charles Plessy2.7k
0
gravatar for Veera
4.4 years ago by
Veera 90
Denmark
Veera 90 wrote:

In R you can do by splitting the data frame using ensemblID column and extracting the rows with maximum FPKM and combining them in to a data frame.

dfm <- read.table(..)

dfm.lst <- split(dfm, dfm$EnsemblID)
dfm.lst.dupRemoved <- sapply(dfm.split, function(x) x[x$10LogFPKM == max(x$10LogFPKM), ])
dfm.dupRemoved <- do.call("rbind",df.lst.dupRemoved)

write.table(dfm.dupRemoved, "youNameIt", row.names = FALSE, quote = FALSE, sep = "\t")
ADD COMMENTlink modified 22 months ago by RamRS27k • written 4.4 years ago by Veera 90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1651 users visited in the last hour