B4J Question MiniHTMLParser Error

MathiasM · Mar 19, 2021

Hello

I am trying to get the text inside the <a> tags on a webpage.
However, I get an error:

Waiting for debugger to connect...
Program started.
Error occurred on line: 276 (MiniHtmlParser)
java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
...
Program terminated (StartMessageLoop was not called).

The tags look like this:

HTML:

<div class="breadcrumbs">
         <a href="../../home.html">Home</a> &gt; <a href="../../commands.html">Commands</a> &gt; <a href="../Image.htm">Image</a> &gt; CreateRenderImage
</div>

So I am trying to get "Home", "Commands" and "Image".

This is my code:

B4X:

Private HtmlParser As MiniHtmlParser
HtmlParser.Initialize
Dim root As HtmlNode = HtmlParser.Parse(File.ReadString(File.DirAssets, "TestHTML.txt"))
Dim breadcrumbs As HtmlNode = HtmlParser.FindNode(root, "div", HtmlParser.CreateHtmlAttribute("class", "breadcrumbs"))
For Each n As HtmlNode In breadcrumbs.Children
    Log(HtmlParser.GetTextFromNode(n, 0))
Next

A minimal project is attached to this post.

Thanks a lot.

OliverA · Mar 19, 2021

MathiasM said:
Log(HtmlParser.GetTextFromNode(n, 0))

You're assuming that a node has children, and in some cases it may not.

B4X:

If n.Children.Size > 0 Then Log(HtmlParser.GetTextFromNode(n, 0))

MathiasM · Mar 19, 2021

OliverA said:
You're assuming that a node has children, and in some cases it may not.

Thanks for your answer, OliverA. I understand what your code does, but I can't see why it is needed.

Given this HTML:

HTML:

<div class="breadcrumbs">
         <a href="../../home.html">Home</a> &gt; <a href="../../commands.html">Commands</a> &gt; <a href="../Image.htm">Image</a> &gt; CreateRenderImage
</div>

I see the structure as follows:
The <div> breadcrumbs has 3 children, the 3 <a> tags, and they all have a text value. Why is it necessary to check whether an <a> has children before getting its text value?
And if the text value is treated as a child, why would there be an out-of-bounds exception, since they all have a text value?

I seem to be missing something fundamental about these HTML tags.

Thanks for any input!

OliverA · Mar 19, 2021

MathiasM said:
<div> breadcrumbs has 3 children, the 3 <a>

But that is not what this library sees. Log the size of the breadcrumbs children list and the content of each n (use one of the HtmlParser methods for that) to see what it actually sees.
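
A minimal sketch of that logging, reusing the breadcrumbs variable from the first post. It assumes the HtmlNode type has a Name field next to the Children list already used in this thread; that field name is an assumption here and should be checked against the library source.

```b4x
' Log how many direct children the parser actually produced for the <div>.
Log("breadcrumbs children: " & breadcrumbs.Children.Size)
For Each n As HtmlNode In breadcrumbs.Children
    ' n.Name is assumed to hold the node's tag name (hypothetical field, verify in the library).
    ' The text between the <a> tags may show up here as separate nodes of its own.
    Log(n.Name & " -> " & n.Children.Size & " children")
Next
```

If the logged size is larger than 3, the extra entries are most likely the whitespace and "&gt;" text between the <a> tags. Such nodes would have no children, which would explain why GetTextFromNode(n, 0) throws for them.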

Erel · Mar 21, 2021

Moved to the questions forum.
