Why does "Parse XML clinical data" in TCGAbiolinks give an error in RStudio?
3.3 years ago
Björn ▴ 100

The following commands for TCGA clinical data give an error in TCGAbiolinks. I am trying to parse the XML clinical data:

query <- GDCquery(project = "TCGA-PRAD",
                  data.category = "Clinical")

radiation <- GDCprepare_clinic(query, clinical.info = "radiation")

|==== | 4%
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
  Start tag expected, '<' not found [4]

I assume it is because of the large file size. How can I solve the problem?

"Start tag expected, '<' not found [4]"

This error means that the file being parsed is not an XML file, or that the file is corrupted.

You can test a file with:

xmllint --stream --noout your.xml

Thanks. I have nearly 600 individual XML files, each in its own folder, which is the standard download layout from TCGA, so I have no idea how to sort this out.

find /path/to/dir  -type f -name "*.xml" -exec xmllint --stream --noout '{}' ';'

I am facing the same problem, and all of my files are valid XML.

Edit: I am working with the KIRC project instead. Something is odd: there should be 627 patients, but there are only 621 XML files in the clinical directory.

Edit 2: all files listed in the query results exist, so the query must contain multiple references to the same file.
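
One way to see what the download directory actually contains is to count the files by extension. This is only a minimal sketch: `count_by_extension` is a hypothetical helper name, the path is a placeholder, and it assumes a POSIX shell with find, awk and sort available.

```shell
# Count files under a directory, grouped by file extension.
# Prints one "<count> <extension>" line per extension, most frequent first.
count_by_extension() {
    find "$1" -type f | awk -F/ '
        {
            name = $NF                       # file name without the directory part
            n = split(name, parts, ".")
            ext = (n > 1) ? parts[n] : "(none)"
            count[ext]++
        }
        END { for (e in count) printf "%d %s\n", count[e], e }
    ' | sort -k1,1nr -k2,2
}
```

Run it as `count_by_extension /path/to/dir`. Any extension other than xml in the output points at files that an XML parser may reject.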


Actually, the code is trying to read a .txt file as XML.

I'm guessing the problem is that TCGAbiolinks either tries to parse the wrong files or does not recognize that a file is not XML. You can list the text files in the download directory with:

find /path/to/dir -type f -name "*.txt"
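
Matching on the file name only helps if the stray files really end in .txt. A content-based check is sketched below: it prints every file whose first non-blank character is not '<', which is the condition the "Start tag expected" error complains about. `list_non_xml` is a hypothetical helper name, and the check assumes the files do not start with a byte-order mark.

```shell
# Print every file under a directory whose first non-blank character is not '<'.
# Empty files are reported too, since they cannot be parsed as XML either.
list_non_xml() {
    find "$1" -type f | while IFS= read -r f; do
        # Take the first character of the first line that has non-blank content.
        first=$(awk '/[^ \t\r]/ { sub(/^[ \t\r]+/, ""); print substr($0, 1, 1); exit }' "$f")
        if [ "$first" != "<" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run it as `list_non_xml /path/to/dir`. This is a quick screen only; a file that passes can still be malformed, so xmllint remains the real validity check.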

3.3 years ago

According to the Bioconductor support website, you need to restrict the query to XML files:

query <- GDCquery(project = "TCGA-KIRC", data.category = "Clinical", file.type = "xml")

Here is the link: https://support.bioconductor.org/p/110056/#110057

I haven't been able to confirm this, because TCGAbiolinks reports that the GDC server is down.

Edit: the solution works for me. I solved the server-down problem by using the GitHub version of the package.


Install the latest version from GitHub. See "A: GDC server down???".


Thanks a lot, it solved the problem!
