B4A Library JTidy library - Convert HTML pages to XML

This library wraps the open source JTidy project.
It is supported by B4A and B4J.

It allows you to convert an HTML page to an XHTML page. XHTML can be parsed with an XML parser.

This approach is better than trying to parse HTML with regular expressions.

Usage is simple:
B4X:
Sub Process_Globals
   Dim sax As SaxParser
   Dim tid As Tidy
End Sub

Sub Activity_Create(FirstTime As Boolean)
   tid.Initialize
   'parse the Html page and create a new xml document.
   tid.Parse(File.OpenInput(File.DirAssets, "index.html"), File.DirRootExternal, "1.xml")
   sax.Initialize
   sax.Parse(File.OpenInput(File.DirRootExternal, "1.xml"), "sax")
End Sub
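
The "sax" event name passed to sax.Parse means the parser raises its events on subs named sax_StartElement and sax_EndElement. As a minimal sketch of what such handlers might look like, the code below logs the href of every link in the converted document. It assumes the XmlSax library's standard event signatures and that JTidy emits anchor elements under the name "a"; the logging itself is only illustrative and not part of the original example.
B4X:
'Raised for every opening tag in the XHTML document.
Sub sax_StartElement (Uri As String, Name As String, Attributes As Attributes)
   'Assumption: links arrive as "a" elements; compare case-insensitively to be safe.
   If Name.EqualsIgnoreCase("a") Then
      'Pass an empty Uri because the href attribute is not namespaced.
      Dim href As String = Attributes.GetValue2("", "href")
      If href <> "" Then Log(href)
   End If
End Sub

'Raised for every closing tag. Text holds the element's text content.
Sub sax_EndElement (Uri As String, Name As String, Text As StringBuilder)
End Sub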

Tip: By default JTidy will not output anything if it encounters an error. You can see the errors in the unfiltered logs.

You can force it to always produce output with:
B4X:
tid.Initialize
Dim jo As JavaObject = tid
jo.GetFieldJO("tidy").RunMethod("setForceOutput", Array(True))

If parsing of the generated document is very slow then follow this post: https://www.b4x.com/android/forum/t...seems-very-slow-long-delay.91627/#post-578641
 

Attachments

  • JTidy.zip
    245 KB · Views: 1,192

NJDude

Expert
Licensed User
Longtime User
Should this lib be added to the Extras or Internal libraries directory?

I'm asking to keep the "official" libs where they belong in case of a B4A upgrade.

Thanks
 

Inman

Well-Known Member
Licensed User
Longtime User
Great news. Coming from VB with its DOM parser, HTML parsing had always been an issue for me on B4A.

Not any more!
 

walterf25

Expert
Licensed User
Longtime User
Quick question about jtidy

Hello Erel and NJ, I was giving this library a try. I need to update one of my apps and thought I would use this library to make the parsing a little faster. I gave it a go, but the "1.xml" file it creates is empty when I open it. Is there a reason why this would happen? Attached is the HTML file I'm using; maybe one of you can help me figure this out. I know the HTML tags in the file may not be very well formed, and that is the only reason I can think of for the created XML file being empty.

Any thoughts or ideas?

Here is the file:

View attachment 1.zip
 

walterf25

Expert
Licensed User
Longtime User
Jtidy

Hi Erel, sorry for not following up on this sooner. I am now back to trying to update this app. Can you suggest an easy way to fix this issue? I still can't seem to make JTidy work with the HTML file I'm using.
 

merlin2049er

Well-Known Member
Licensed User
Longtime User
That's great. So extracting tags from XML is easier than from HTML?

I need to extract some links to put into my download manager.
 

Cableguy

Expert
Licensed User
Longtime User
What would be the correct syntax to convert a webpage on the fly, meaning doing something like
B4X:
   tid.Parse(WebPage, File.DirRootExternal, "1.xml")
????
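
One possible approach, offered only as a sketch: tid.Parse takes an input stream as its first argument, so a downloaded page can be passed to it directly instead of a file from the assets folder. The code below assumes the OkHttpUtils2 library (HttpJob) and a B4A version that supports Wait For. ConvertPage is a hypothetical helper name, File.DirInternal is an arbitrary choice of output folder, and tid is assumed to be the already initialized Tidy object from the first post.
B4X:
'Hypothetical helper: downloads Url and converts the response to XHTML.
Sub ConvertPage (Url As String)
   Dim j As HttpJob
   j.Initialize("", Me)
   j.Download(Url)
   Wait For (j) JobDone (j As HttpJob)
   If j.Success Then
      'Feed the response stream straight into JTidy.
      tid.Parse(j.GetInputStream, File.DirInternal, "1.xml")
   Else
      Log(j.ErrorMessage)
   End If
   j.Release
End Sub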
 

Cableguy

Expert
Licensed User
Longtime User
Thanks Erel
 

mouhaddab

Member
Licensed User
Hi Erel,

I want to convert a web page "index.html" to "index.xml" using the example above, but the parser fails and gives me this error:

"org.apache.harmony.xml.ExpatParser$ParseException: At line 16, column 815: not well-formed (invalid token)"

Attached is the XML file I got using JTidy.

Thanks
 

Attachments

  • index.xml
    44.6 KB · Views: 454

DonManfred

Expert
Licensed User
Longtime User
The parser cannot parse JavaScript.

Edit: I guess. I just saw JavaScript at this line.
 