How to tidy up dirty HTML

Mark Read · Apr 4, 2013

Hopefully an easy question for someone.

I am reading a large HTML file from a server and scanning it line by line to find links to image files. Each line should contain a thumbnail link and a main image link, and I collect them in a loop over the lines. Everything, including the regex, was working fine until a few days ago.

Now, checking the original page's HTML in Notepad++, I see that the hyperlinks have a CRLF in the middle. This has no effect in a web browser, but my app sees only part of the link and therefore finds no files.

I don't really want to make two loops if I can help it, as the web page could change back and be okay again.

Can I import the whole HTML file somehow and "clean" the code so that my app will work?

Example:

old:

B4X:

<a href="thumbs/niketta-1-8007.jpg"><img alt="" src="thumbs/niketta-1.jpg" border="2" height="316" width="200"></a><br>

new:

B4X:

<p> <i> </i> <a href="thumbs/niketta-1-8007.jpg"><img
                          alt="" src="thumbs/niketta-1.jpg" border="2"
                          height="316" width="200"></a><br>

Many thanks
Mark

Erel · Apr 4, 2013

You should use the JTidy library to convert the HTML to a valid XML document and then parse it with an XML parser.
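If pulling in JTidy is more than the job needs, a lighter alternative (a different technique from the one recommended above) may be enough: read the whole page into one string, collapse every run of whitespace, CRLF included, into a single space, and run the regex over the full text instead of line by line. The sketch below is plain Java using only the standard library. The class name `LinkScanner`, its method names, and the regex itself are illustrative assumptions, since the original pattern was not posted. It also assumes the line breaks fall between attributes, as in the example above, and not inside a quoted URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any B4X or JTidy API.
final class LinkScanner {

    // Assumed pattern: an <a href="..."> immediately followed by an <img ... src="...">.
    // Group 1 is the link target, group 2 is the image source.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=\"([^\"]+)\"\\s*>\\s*<img\\b[^>]*?\\bsrc=\"([^\"]+)\"",
            Pattern.CASE_INSENSITIVE);

    // Replace every whitespace run (spaces, tabs, CR, LF) with one space,
    // so a tag that was wrapped across lines becomes a single line again.
    static String collapseWhitespace(String html) {
        return html.replaceAll("\\s+", " ");
    }

    // Scan the whole document at once and return {href, src} pairs in order.
    static List<String[]> findLinks(String html) {
        List<String[]> result = new ArrayList<>();
        Matcher m = LINK.matcher(collapseWhitespace(html));
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        // The "new" markup from the question, with CRLF line breaks inside the tag.
        String html = "<p> <i> </i> <a href=\"thumbs/niketta-1-8007.jpg\"><img\r\n"
                + "                          alt=\"\" src=\"thumbs/niketta-1.jpg\" border=\"2\"\r\n"
                + "                          height=\"316\" width=\"200\"></a><br>";
        for (String[] pair : findLinks(html)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Because the cleaning step is harmless on markup that is already on one line, the same single loop should handle both the old and the new page layout. For anything messier than stray line breaks, such as unclosed tags or reordered attributes, converting with JTidy and using an XML parser is likely the more robust route.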
