Hello everyone,
I am a brand-new user and dare to ask my first question here. Please bear with me, as I am a total beginner with all things Android and B4A. I am trying to find my way into B4A by trying a few things out. Among them is this: I want to download an HTML file from the internet and process its contents within my app. The particular website I am interested in is a nightmare, code-wise: while it displays in a browser, it is still invalid HTML with a lot of boilerplate JavaScript, unbalanced tags and whatnot... So, quite an interesting task to deal with, haha. I want to be able to repeat the process whenever the website changes, i.e. fixing the document's problems manually in order to make it parse is not an option.
So far, I have managed to download the HTML file at runtime using OkHttpUtils2, which works OK (I followed the thread about OkHttpUtils2 with Wait For). Now I need to parse it. I want to use data from a table deeply buried in the HTML. Ideally, I would just use an XPath expression to point to the nodes whose information I want, but I could not find a good starting point for that. So I decided to try a DOM parser rather than SAX, using the promising XOM library (specification). But since the source document is far from being XHTML, I tried to make it parseable using jTidy. Unfortunately, jTidy creates an empty file, and the unfiltered log says, among many other things (translated by me, because jTidy's log is in German on my machine): "The content of the document looks like HTML 4.01 Transitional. There have been 84 warnings and 11 errors! This document has errors that need to be corrected before HTML Tidy can clean it up." So, no luck here, given that manually fixing things beforehand at code time is not an option if the process is to stay repeatable once the website updates at runtime.
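The download step described above typically follows the pattern below. This is a minimal sketch rather than the actual project code: the sub name DownloadHtml and the URL in the usage line are placeholders, and it assumes the OkHttpUtils2 library is referenced in the project.

```b4x
' Downloads a page and returns its HTML as a String ("" on failure).
' DownloadHtml is a hypothetical helper name, not part of OkHttpUtils2.
Sub DownloadHtml (Url As String) As ResumableSub
    Dim result As String = ""
    Dim j As HttpJob
    j.Initialize("", Me)
    j.Download(Url)
    ' Execution resumes here once the request has finished.
    Wait For (j) JobDone (j As HttpJob)
    If j.Success Then
        result = j.GetString
    Else
        Log("Download failed: " & j.ErrorMessage)
    End If
    j.Release
    Return result
End Sub
```

A caller would then wait for the result, for example with `Wait For (DownloadHtml("https://example.com/page.html")) Complete (html As String)`, where the URL is again only a placeholder.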
So, I tried using the jSoup parser. Amazingly, it parses the unpreprocessed, faulty HTML file and lets me do this, for instance:
B4X:
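' LoadHtmlFromDisk is my own helper that reads the file into a String.
Dim html As String = LoadHtmlFromDisk(File.DirAssets, "the-document.html")
Dim js As jSoup
' One list item per <tr> element, each as an HTML string.
' No separate Initialize is needed, because the list is assigned right away.
Dim tablerows As List = js.getElementsByTag(html, "tr")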
This actually gives me all the <tr> tags including their respective <td> columns, with each <tr> being a separate list item in tablerows, like this:
tablerows.Get(0) would, for instance, return:
<tr>
<td style="text-align: center">some text</td>
<td style="text-align: center">other text</td>
<td style="text-align: center">third column text</td>
<!-- ... etc ... -->
</tr>
Well, that's a start... But trying to get the text out of each td node, I fail:
B4X:
For i = 0 To tablerows.Size - 1
    Dim columns As List : columns.Initialize
    columns = js.selectorElementText(tablerows.Get(i), "td") ' <-- this comes up empty
    For j = 0 To columns.Size - 1
        Log($"${i}-${j}: ${columns.Get(j)}"$)
    Next
Next
Line #3 (the call to js.selectorElementText) results in an empty list.
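A possible explanation, untested here: an HTML parser following the HTML5 rules ignores <tr> and <td> tags that appear outside a <table>, so a row string parsed on its own may lose its cells before the selector ever runs. The sketch below shows two ways to work around that, reusing only the two wrapper calls already shown above. It assumes the wrapper parses every string it receives as a complete HTML document, and LogCellTexts is a hypothetical helper name.

```b4x
' Sketch only: LogCellTexts is a hypothetical helper name.
' Assumes the wrapper parses each string it receives as a full HTML document.
Sub LogCellTexts (html As String)
    Dim js As jSoup

    ' Variant A: select the cells from the full document, where every <td>
    ' still sits inside its <table>. Row grouping is lost this way.
    Dim cells As List = js.selectorElementText(html, "td")
    For k = 0 To cells.Size - 1
        Log($"cell ${k}: ${cells.Get(k)}"$)
    Next

    ' Variant B: keep the per-row loop, but wrap each row in <table> tags
    ' so the parser has a table context for the <tr> and <td> tags.
    Dim tablerows As List = js.getElementsByTag(html, "tr")
    For i = 0 To tablerows.Size - 1
        Dim rowHtml As String = "<table>" & tablerows.Get(i) & "</table>"
        Dim columns As List = js.selectorElementText(rowHtml, "td")
        For j = 0 To columns.Size - 1
            Log($"${i}-${j}: ${columns.Get(j)}"$)
        Next
    Next
End Sub
```

Variant A is simpler but returns the cells of every table in the page as one flat list. Variant B keeps the row and column indices from the original loop.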
So, I have (at least?) two questions, I guess:
- How can I correctly get the text node out of an element node (i.e. the text inside a tag) using jSoup?
- Since I would like to use XOM's types to process everything: how can I have jSoup output a cleaned document (which I could then possibly feed to jTidy and finally to XOM)? I think I would want to use jSoup's "clean_HTML" method, but I can't figure out how to use it... It seems I'd want to use "relaxed" as the whitelist level, but I don't know how to pass this as a parameter.