Why does "Parse XML clinical data" in TCGAbiolinks give an error in RStudio?
3.3 years ago
Björn ▴ 100

I am trying to parse TCGA XML clinical data with TCGAbiolinks, but the following command gives an error:

query <- GDCquery(project = "TCGA-PRAD",
                  data.category = "Clinical")



|==== | 4% Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Start tag expected, '<' not found [4]

I assume this is caused by the large file size. How can I solve the problem?

It's because the file is not an XML file, or the file is corrupted.

Test this with:

xmllint --stream --noout your.xml

Thanks. I have nearly 600 individual XML files in individual folders, which is the standard download layout from TCGA. Any idea how to check all of them?

find /path/to/dir  -type f -name "*.xml" -exec xmllint --stream --noout '{}' ';'

I am facing the same problem, and all of my files are valid XML.

Edit: I am using the KIRC project instead. Something is odd: there should be 627 patients, but there are only 621 XML files in the clinical directory.

Edit 2: all files listed in the query results exist, so some files must be referenced more than once.

Actually, the code is trying to read a .txt file as XML -_-

I'm guessing the problem is that TCGAbiolinks tries to parse the wrong files, or does not recognize that a file is not XML.

find /path/to/dir -type f -name "*.txt"
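Searching by extension only catches files that are named .txt. As a rough complement, the sketch below lists every file whose first non-whitespace character is not '<', which is the condition behind the "Start tag expected, '<' not found" message. The function name list_non_xml is made up for illustration, and the check is only a heuristic: it does not validate the XML, and a file starting with a byte-order mark would be reported even if it is otherwise fine.

```shell
#!/bin/sh
# Hypothetical helper (not part of TCGAbiolinks or xmllint):
# print every regular file under a directory whose first non-whitespace
# character is not '<'. Such files cannot be parsed as XML.
list_non_xml() {
    dir=$1
    find "$dir" -type f | while IFS= read -r f; do
        # Look at the first 256 bytes, strip whitespace, keep one character.
        first=$(head -c 256 "$f" | tr -d '[:space:]' | cut -c1)
        if [ "$first" != "<" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling `list_non_xml /path/to/dir` on the download directory should print the stray text files (and any empty files) that end up being handed to the XML parser.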

3.3 years ago

According to the Bioconductor support website, you need to restrict the query to XML files:

query <- GDCquery(project = "TCGA-KIRC", data.category = "Clinical", file.type = "xml")

I haven't been able to confirm this, as TCGAbiolinks reports that the GDC server is down.

Edit: the solution works for me. I solved that last problem by using the GitHub version of the package.

Thanks a lot, it solved the problem!