How to check if a massive CSV is broken
2.1 years ago

I have a massive CSV that contains miRNA-Seq data from TCGAbiolinks. But when I use it in another script, it does not work, so it is possible that my CSV file contains a formatting error, most likely a misplaced comma or a white space in a column.

But I don't know how to look through it. Any suggestions or advice?

My CSV looks like this:

brokenfile csv r analysis • 950 views
2.1 years ago
Mensur Dlakic ★ 27k

But when I use it in another script, it does not work

What is the exact error in your other script? It would help inform our thinking if you provided as much information as possible. Stating the exact error message is always more helpful than just saying that something doesn't work.

I doubt that it is a space or a misplaced comma. What strikes me as an obvious potential error is your header line. Many R scripts don't want or need a header entry for the first column, which is usually a sample ID, a gene name or something along those lines. In your header that column is named as an empty string (""), which strikes me as odd.

I'd start by typing something between those first two quotation marks ("sample_ID") and see if that works. If not, I would try deleting the first three characters ("",) from your header line.
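
A minimal sketch of the first suggestion, assuming the header really does start with an empty quoted field. The file contents and the name sample_ID are placeholders for illustration, not the poster's data:

```r
# Build a tiny stand-in file with the same kind of header (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c('"","s1","s2"',
             '"miR-1",10,20',
             '"miR-2",30,40'), path)

# Read the raw lines so the header can be inspected and edited as text.
lines <- readLines(path)
header <- lines[1]

# If the header starts with an empty quoted field, give that field a name.
if (startsWith(header, '"",')) {
  lines[1] <- sub('^"",', '"sample_ID",', header)
}

# Write the repaired copy to a new file and read it back.
fixed_path <- tempfile(fileext = ".csv")
writeLines(lines, fixed_path)
fixed <- read.csv(fixed_path, check.names = FALSE)
print(names(fixed))
```

For a truly massive file, readLines(path, n = 1) is enough to look at the header without loading everything; the full read above is only needed to write a repaired copy.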

Thank you for your response, you are right.

This is the error message that R displays.

I am going to delete the first three characters and see what happens. Thank you so much!

That's not informative. Show code. The problem here is that the row names were saved but their header is empty (""). The easiest fix would be to load the file into R while skipping the first row; basically all reader functions have an option to skip leading rows, something like skip = 1 in read.csv. Then simply remove the first column and put back the column names manually. Or use something like data.table::fread, which is usually smart enough to fill that empty column name for the first column with some dummy name.
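
A rough base-R sketch of the skip-the-header approach described above. The file contents are made up, and data.table::fread is left out only to keep the example free of extra packages:

```r
# Tiny stand-in file with an empty first header field (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c('"","s1","s2"',
             '"miR-1",10,20',
             '"miR-2",30,40'), path)

# Read only the header line; scan() strips the surrounding quotes.
hdr <- scan(path, what = character(), sep = ",", nlines = 1, quiet = TRUE)

# Read the body without a header, skipping the first line of the file.
body <- read.csv(path, header = FALSE, skip = 1, stringsAsFactors = FALSE)

# The first column holds the row names; the remaining header fields
# name the remaining columns.
rownames(body) <- body[[1]]
body <- body[, -1, drop = FALSE]
colnames(body) <- hdr[-1]

print(body)
```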

I think that having an empty header entry for the first column is the default for read.csv, write.csv, write.table, etc. in R. In R, the semantics are that the first column contains the row.names. Therefore, the easiest way to check whether the file is broken is to read it with read.csv(). If there were a misplaced comma, there would be an error message saying something like: "Line X doesn't contain N entries."
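
Because read.csv defaults to fill = TRUE, rows that are too short can be padded with NA instead of raising an error, so an explicit per-line field count may be a useful complement. A small sketch using count.fields from base R, on a made-up file with one deliberately stray comma:

```r
# Stand-in file; the third line has one comma too many (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c('"","s1","s2"',
             '"miR-1",10,20',
             '"miR-2",30,,40',
             '"miR-3",50,60'), path)

# Count the comma-separated fields on every line. Blank lines are kept and
# comment handling is disabled so that positions match file line numbers.
n_fields <- count.fields(path, sep = ",", quote = "\"",
                         blank.lines.skip = FALSE, comment.char = "")

# Treat the header's field count as the expected width. NA counts mark
# lines that fall inside a multi-line quoted string, so flag those too.
expected <- n_fields[1]
bad_lines <- which(is.na(n_fields) | n_fields != expected)

print(table(n_fields, useNA = "ifany"))
print(bad_lines)
```

Any line numbers reported in bad_lines can then be inspected directly, for example with readLines(path)[bad_lines].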

Thanks, that is what I did, but there is no error message, so I do not know what to do. This is my script. I am just obtaining RNA-Seq and miRNA-Seq data from TCGAbiolinks. When I try to use those files in another script, the resulting file is broken. My second script is correct, so the error must be in these files.

library("TCGAbiolinks")
library("SummarizedExperiment")

############# miRNA data

# Query miRNA-Seq isoform quantification for the TCGA-COAD project
query_mirna <- GDCquery(project = "TCGA-COAD",
                        legacy = FALSE,
                        data.category = "Transcriptome Profiling",
                        data.type = "Isoform Expression Quantification",
                        #workflow.type = "HTSeq - Counts",
                        experimental.strategy = "miRNA-Seq")

# Download the files and assemble them into a single R object
GDCdownload(query_mirna, method = "API")
data <- GDCprepare(query_mirna)

# write.csv stores the row names as the first column with an empty header field
write.csv(data, "data_mirnas.csv")

############# RNA data

# Query RNA-Seq gene-level counts for the same project
query_rna <- GDCquery(project = "TCGA-COAD",
                      legacy = FALSE,
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      experimental.strategy = "RNA-Seq")

GDCdownload(query_rna, method = "API")
data_RNA <- GDCprepare(query_rna)

# assay() extracts the expression matrix from the prepared object
data_rna_exp <- assay(data_RNA)
write.csv(data_rna_exp, "data_mrna_transcriptome.csv")
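
One thing that may be worth ruling out, offered only as a guess since the second script is not shown: with its default check.names = TRUE, read.csv rewrites column names that are not valid R names, so hyphens in TCGA-style barcodes become dots. A small round-trip sketch on a made-up matrix (the gene IDs and barcodes are placeholders):

```r
# Hypothetical miniature expression matrix: genes in rows,
# TCGA-style barcodes as column names.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("ENSG01", "ENSG02"),
                            c("TCGA-AA-0001-01A", "TCGA-AA-0002-01A")))

path <- tempfile(fileext = ".csv")
write.csv(m, path)

# Default read: column names are passed through make.names().
default_read <- read.csv(path, row.names = 1)

# Same read, but keeping the column names exactly as written.
exact_read <- read.csv(path, row.names = 1, check.names = FALSE)

print(colnames(default_read))
print(colnames(exact_read))
```

If a downstream script matches samples by barcode, comparing the column names of the reloaded table against the original object in this way can show whether they still agree.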