Question: Why does parsing XML clinical data with TCGAbiolinks give an error in RStudio?
Björn wrote (10 months ago):

The following commands for TCGA clinical data give an error in TCGAbiolinks. I am trying to parse the XML clinical data:

query <- GDCquery(project = "TCGA-PRAD", 
              data.category = "Clinical")


radiation<-GDCprepare_clinic(query, clinical.info = "radiation")

|====     |   4%
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Start tag expected, '<' not found [4]

I assume it is because of the large file size. How can I solve the problem?


"Start tag expected, '<' not found [4]"

It's because it is not an XML file, or the file is corrupted.

Test this with:

xmllint --stream --noout your.xml
written 10 months ago by Pierre Lindenbaum

Thanks. I have nearly 600 individual XML files in individual folders, which is the standard download layout from TCGA, so I have no idea how to sort this out.

written 10 months ago by Björn
find /path/to/dir  -type f -name "*.xml" -exec xmllint --stream --noout '{}' ';'
written 10 months ago by Pierre Lindenbaum
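
With around 600 files, it can help to print only the ones that fail the check. The following is a minimal sketch of that idea, not part of the original reply: the function name `report_bad_xml` is made up for illustration, and it assumes `xmllint` (from libxml2) is on the PATH.

```shell
# Hypothetical helper: print every *.xml file under a directory
# that xmllint cannot parse as well-formed XML.
report_bad_xml() {
    dir="$1"
    find "$dir" -type f -name '*.xml' | while IFS= read -r f; do
        # --noout suppresses the parsed document; a non-zero exit
        # status means xmllint reported a parse error for this file.
        if ! xmllint --noout "$f" >/dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling `report_bad_xml /path/to/dir` prints nothing when every `.xml` file parses, which matches the situation described in the next reply.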

I'm facing the same problem, and all of the files are valid XML.

Edit: for the KIRC project, something is odd: there should be 627 patients, but there are only 621 XML files in the clinical directory.

Edit 2: all files in the query results exist, so there are multiple references to the same file.

written 10 months ago by afsverissimo

Actually, the code is trying to read a .txt file as XML -_-

I'm guessing this is a problem with TCGAbiolinks trying to parse the wrong files, or not recognizing that a file is not XML.

find /path/to/dir -type f -name "*.txt"

written 10 months ago by afsverissimo
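
The parser message "Start tag expected, '<' not found" fits a file whose content does not begin with markup, so another way to find the offending files is to look at content rather than extension. The sketch below is an illustration under that assumption: the function name `list_non_xml` is hypothetical, it only inspects the first 512 bytes of each file, and a file that starts with a byte-order mark would also be listed.

```shell
# Hypothetical helper: list files under a directory whose first
# non-whitespace character is not '<' (so they cannot be XML).
list_non_xml() {
    dir="$1"
    find "$dir" -type f | sort | while IFS= read -r f; do
        # Drop all whitespace from the first 512 bytes, keep the first character left
        first=$(head -c 512 "$f" | tr -d '[:space:]' | cut -c1)
        if [ "$first" != "<" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `list_non_xml /path/to/dir` on the download directory should print the text files that the `find` command above locates by name, plus any other non-XML file regardless of its extension.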
afsverissimo wrote (10 months ago):

According to the Bioconductor support site, you need to restrict the query to XML files:

query <- GDCquery(project = "TCGA-KIRC", data.category = "Clinical", file.type = "xml")

Here is the link: https://support.bioconductor.org/p/110056/#110057

I haven't been able to confirm this, as TCGAbiolinks says that the GDC server is down.

Edit: the solution works for me. I solved the server-down problem by using the GitHub version of the package.


Install the latest version from GitHub. See the answer in the thread "GDC server down???".

written 10 months ago by genomax

Thanks a lot, it solved the problem!

written 10 months ago by afsverissimo