Question

grep value from html file

16 months ago

arshad1292 ▴ 110

I have 200 HTML files that contain information such as Filename, File type, Total Sequences, etc. Please see the attached screenshot.

I need to grep the Filename and Total Sequences from the Value column (in this screenshot I need IGM17-B_S162_read_1.fastq and the value 9237623) and save them in a separate .txt file.

Maybe with the grep or cat command. Again, these are HTML files.

I would really appreciate help from anyone who is experienced in writing scripts on the command line.

cat script commandline shell grep • 860 views

updated 16 months ago by dariober 14k • written 16 months ago by arshad1292 ▴ 110

This can be done, but it seems that you wish to aggregate FastQC reports and possibly other log files. So maybe you want to try MultiQC first before coming up with your own solution?

16 months ago by Matthias Zepper 4.9k

This may be a fun read: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

16 months ago by dariober 14k

score 4 · Answer 1 · 2023-05-25

The HTML produced by FastQC is XML+HTML, so you can use an XPath expression to extract things.

$ xmllint --xpath '//tr[td[1]/text()="Filename"]/td[2]/text()'   fastqc_report.html
jeter.fastq.gz

$ xmllint --xpath '//tr[td[1]/text()="Total Sequences"]/td[2]/text()'  fastqc_report.html
147142898
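
To run this over all 200 reports and collect the results in one file, a loop along these lines might work. It is only a minimal sketch: the function name `summarize_html`, the `*.html` glob and the output name `summary.txt` are assumptions for illustration, and it relies on every report being parseable by xmllint as in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: print "Filename<TAB>Total Sequences" for each FastQC HTML report.
# summarize_html is a hypothetical helper name; requires xmllint (libxml2).
summarize_html() {
    local f name total
    for f in "$@"; do
        # Same XPath expressions as above, applied to each file in turn
        name=$(xmllint --xpath '//tr[td[1]/text()="Filename"]/td[2]/text()' "$f")
        total=$(xmllint --xpath '//tr[td[1]/text()="Total Sequences"]/td[2]/text()' "$f")
        # One tab-separated line per report
        printf '%s\t%s\n' "$name" "$total"
    done
}

# Example usage (assumed layout: all reports in the current directory):
# summarize_html *.html > summary.txt
```

Each report contributes one line, so `summary.txt` should end up with as many lines as there are HTML files.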

FastQC also comes with a text file, fastqc_data.txt:

$ grep -E '(Filename|Total Sequences)'  fastqc_data.txt 
Filename    jeter.fastq.gz
Total Sequences 147142898
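
The same idea can be extended to many samples at once. The following is a minimal sketch, assuming each sample's fastqc_data.txt has been extracted and that its key/value lines are tab-separated with Filename appearing before Total Sequences; the helper name `summarize_txt`, the `*/fastqc_data.txt` glob and the output name `summary.txt` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print "Filename<TAB>Total Sequences" for each fastqc_data.txt.
# summarize_txt is a hypothetical helper name; assumes tab-separated fields.
summarize_txt() {
    awk -F'\t' '
        # Remember the most recent Filename value
        $1 == "Filename"        { name = $2 }
        # Emit one line when the matching Total Sequences row is seen
        $1 == "Total Sequences" { print name "\t" $2 }
    ' "$@"
}

# Example usage (assumed layout: one extracted folder per sample):
# summarize_txt */fastqc_data.txt > summary.txt
```

Because awk reads the files in order, each Total Sequences row is paired with the Filename row from the same file.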