B4J Question: How to Parse an HTML File

atiaust

Hi All,

I have a small HTML file that I would like to extract data from (name, address, etc.).

<p><strong>Client Title:</strong> Mr<br />
<strong>Client First Name:</strong> John<br />
<strong>Client Last Name:</strong> Smith<br />
<strong>Client Phone:</strong> 0400555000<br />
<strong>Client Email:</strong> johnS@yahoo.com.au<br />
<strong>Client Preferred Contact:</strong> Phone<br />
<strong>Pickup Unit Number:</strong>5<br />
<strong>Pickup Street Number:</strong> 10<br />
<strong>Pickup Street Name:</strong> Long Street<br />
<strong>Pickup Suburb:</strong> Sydney</p>

What is the best way to parse the data so that I can add it to a database?

Any and all ideas gratefully accepted.

Thanks
 

atiaust

Thanks Erel,

After reading lots of posts I figured that was probably best.

I get the following error when I try to parse the .xml file.

B4X:
Dim t As Tidy
t.Initialize
'Convert the HTML file to temp.xml with jTidy
t.Parse(File.OpenInput(dirData, filename), dirData, "temp.xml")
'Feed the converted file to the SAX parser
Dim In As InputStream = File.OpenInput(dirData, "temp.xml")
items.Initialize
sax.Parse(In, "sax")

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 35 column 1 - Warning: inserting missing 'title' element
InputStream: Document content looks like HTML 2.0
2 warnings, no errors were found!
[Fatal Error] :1:50: White spaces are required between publicId and systemId.
Error occurred on line: 158 (Main)
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 50; White spaces are required between publicId and systemId.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at anywheresoftware.b4a.objects.SaxParser.parse(SaxParser.java:80)
at anywheresoftware.b4a.objects.SaxParser.Parse(SaxParser.java:73)
at ati.ccsemail.main._handlemessage(main.java:272)
at ati.ccsemail.main._pop_downloadcompleted(main.java:375)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at anywheresoftware.b4a.shell.Shell.runMethod(Shell.java:612)
at anywheresoftware.b4a.shell.Shell.raiseEventImpl(Shell.java:229)
at anywheresoftware.b4a.shell.Shell.raiseEvent(Shell.java:159)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at anywheresoftware.b4a.BA.raiseEvent2(BA.java:93)
at anywheresoftware.b4a.ShellBA.raiseEvent2(ShellBA.java:90)
at anywheresoftware.b4a.BA$4.run(BA.java:196)
at com.sun.javafx.application.PlatformImpl.lambda$null$173(PlatformImpl.java:295)
at java.security.AccessController.doPrivileged(Native Method)

Any ideas?

Thanks
 

atiaust

Output from log.

line 35 column 1 - Warning: inserting missing 'title' element
InputStream: Document content looks like HTML 2.0
2 warnings, no errors were found!
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net">
<title></title>
</head>
<body>
<p><strong>Client Title:</strong> Mr<br>
<strong>Client First Name:</strong> Rob<br>
<strong>Client Last Name:</strong> Smith<br>
<strong>Client Phone:</strong> 0400000205<br>
<strong>Client Email:</strong> rsmith@tpg.com.au<br>
<strong>Client Preferred Contact:</strong> Email<br>
<strong>Pickup Unit Number:</strong> not required as only drop off
was selected<br>
<strong>Pickup Street Number:</strong> not required as only drop
off was selected<br>
<strong>Pickup Street Name:</strong> not required as only drop off
was selected<br>
<strong>Pickup Suburb:</strong> not required as only drop off was
selected</p>
<p><strong>Dropoff Unit Number:</strong><br>
<strong>Dropoff Street Number:</strong> 1<br>
<strong>Dropoff Street Name:</strong> Short Ave<br>
<strong>Dropoff Suburb:</strong> Blue Bay</p>
<p><strong>Travel Details</strong> - <strong>One Way</strong><br>
<strong>Pickup</strong> from/at/to <strong>Airport</strong><br>
____________________________________________________</p>
<p><strong>Departure Date:</strong> // (dd/mm/yyyy)<br>
<strong>Departure Time:</strong> :<br>
<strong>Departure Flight/Ship/Hotel Number/Name:</strong><br>
<strong>Departure Flight Type:</strong> Domestic<br>
<strong>Number of Departing adults:</strong> 0<br>
<strong>Number of Departing Children:</strong> 0<br>
_____________________________________________________</p>
<p><strong>Arrival Date:</strong> 23/November/2016 (dd/mm/yyyy)<br>
<strong>Arrival Time:</strong> 10:15 AM<br>
<strong>Arrival Flight/Ship/Hotel Number/Name:</strong> QF0144<br>
<strong>Arrival Flight Type:</strong> International<br>
<strong>Number of Arriving adults:</strong> 2<br>
<strong>Number of Arriving Children:</strong> 0<br>
_____________________________________________________</p>
<p><strong>Additional Requirements:</strong><br>
<strong>Special Requests/Comments:</strong></p>
</body>
</html>
[Fatal Error] :1:50: White spaces are required between publicId and systemId.
Error occurred on line: 159 (Main)
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 50; White spaces are required between publicId and systemId.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at anywheresoftware.b4a.objects.SaxParser.parse(SaxParser.java:80)
at anywheresoftware.b4a.objects.SaxParser.Parse(SaxParser.java:73)
at ati.ccsemail.main._handlemessage(main.java:275)
at ati.ccsemail.main._pop_downloadcompleted(main.java:378)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

etc......
 

inakigarm

For parsing HTML documents I use the jsoup library (quick and easy), but your HTML document is easy to parse without any external library (there are no HTML tags in the body content beyond <strong> and <br>).
 

atiaust

Thanks Inakigarm,
I tried to parse the file using the suggestion from Erel but I think the items in the resulting xml file don't have tags to parse.

I read in the posts not to use the jsoup library, but I was a little confused as to why.

Attached are the HTML and XML files.

I'd appreciate it if you could have a look at them and advise.

Thanks
 

Attachments

  • CCSEMail.zip

inakigarm

One question first: is the HTML structure always the same (the HTML tags don't change)? If it is, I think it's easier to extract the page content with string operations, because some lines contain more than one <strong> tag, so parsing them with jsoup is not automatic.

If the structure is always the same, my approach would be to parse the page with string functions:

1. Read 0.html from disk into a list (each entry of the list is one line, as a string).
2. Create a map of key/value pairs by walking through each line and using the substring methods.
3. Take the map and save it to disk, write it to a DB, etc.

B4X:
Sub AppStart (Args() As String)

    'Read the file into a list, one line of HTML per entry
    Dim lstsource As List = File.ReadList(File.DirApp, "0.html")

    Dim map1 As Map = Createlistmap(lstsource)

    'Log every label / value pair
    For i = 0 To map1.Size - 1
        Log(map1.GetKeyAt(i) & " " & map1.GetValueAt(i))
    Next

End Sub

B4X:
Sub Createlistmap (l As List) As Map

    Dim m As Map
    m.Initialize

    For i = 0 To l.Size - 1
        'strtmp is the current line, strtmp1 the label (key), strtmp2 the value
        Dim strtmp, strtmp1, strtmp2 As String
        Dim count As Int

        strtmp = l.Get(i)
        count = StringCount(strtmp, "<strong>", True)
        Select Case count
            Case 0
                Log("no")
            Case 1
                strtmp1 = strtmp.SubString2(strtmp.IndexOf("<strong>") + 8, strtmp.IndexOf("</strong>"))
                'The value ends at <br /> or, on the last line of a paragraph, at </p>
                If strtmp.Contains("<br />") Then
                    strtmp2 = strtmp.SubString2(strtmp.LastIndexOf("</strong>") + 9, strtmp.LastIndexOf("<br />"))
                Else
                    strtmp2 = strtmp.SubString2(strtmp.LastIndexOf("</strong>") + 9, strtmp.LastIndexOf("</p"))
                End If
            Case 2
                'Two labels on one line: join the first label, the text between, and the second label into the key
                strtmp1 = strtmp.SubString2(strtmp.IndexOf("<strong>") + 8, strtmp.IndexOf("</strong>"))
                strtmp2 = strtmp.SubString2(strtmp.IndexOf("</strong>") + 9, strtmp.LastIndexOf("<strong>"))
                strtmp1 = strtmp1 & " " & strtmp2
                strtmp2 = strtmp.SubString2(strtmp.LastIndexOf("<strong>") + 8, strtmp.LastIndexOf("</strong>"))
                strtmp1 = strtmp1 & " " & strtmp2
                If strtmp.Contains("<br />") Then
                    strtmp2 = strtmp.SubString2(strtmp.LastIndexOf("</strong>") + 9, strtmp.LastIndexOf("<br />"))
                Else
                    strtmp2 = strtmp.SubString2(strtmp.LastIndexOf("</strong>") + 9, strtmp.LastIndexOf("</p"))
                End If
            Case 3
                'Other cases
        End Select

        'Store only lines that produced a key (Case 0 and unhandled cases leave it empty)
        If strtmp1.Length > 0 Then
            Log(strtmp1 & " " & strtmp2)
            m.Put(strtmp1, strtmp2)
        End If

    Next

    Return m

End Sub

'Sub from user stevel05
Sub StringCount(StringToSearch As String,TargetStr As String,IgnoreCase As Boolean) As Int
    If IgnoreCase Then
        StringToSearch = StringToSearch.ToLowerCase
        TargetStr = TargetStr.ToLowerCase
    End If

    Return (StringToSearch.Length - StringToSearch.Replace(TargetStr,"").Length) / TargetStr.Length

End Sub
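
Step 3 (writing the map to a DB) is not shown in the code above. A minimal sketch of it follows, assuming the jSQL library is referenced and the SQLite JDBC driver is added with #AdditionalJar. The sub name SaveClient, the enquiries.db file, and the clients table with its columns are hypothetical and do not come from the thread. The map keys are assumed to be the <strong> labels as extracted, including the trailing colon.

```b4x
'Hypothetical sketch: table, column and file names are assumptions, not from the thread.
'Needs the jSQL library and the SQLite JDBC driver (#AdditionalJar) in the project.
Sub SaveClient (fields As Map)
    Dim db As SQL
    db.InitializeSQLite(File.DirApp, "enquiries.db", True) 'True = create the file if it is missing
    db.ExecNonQuery("CREATE TABLE IF NOT EXISTS clients (first_name TEXT, last_name TEXT, phone TEXT, email TEXT)")

    'Keys are the <strong> labels exactly as extracted, including the trailing colon
    Dim fieldNames() As String = Array As String("Client First Name:", "Client Last Name:", "Client Phone:", "Client Email:")
    Dim args As List
    args.Initialize
    For Each fieldName As String In fieldNames
        'A missing label is stored as an empty string
        Dim value As String = fields.GetDefault(fieldName, "")
        args.Add(value.Trim)
    Next

    'Parameterised insert: the values are bound, not concatenated into the SQL text
    db.ExecNonQuery2("INSERT INTO clients (first_name, last_name, phone, email) VALUES (?, ?, ?, ?)", args)
    db.Close
End Sub
```

It could be called as SaveClient(map1) once Createlistmap has returned.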
 

Erel

B4X founder
Staff member
Regex to the rescue:
B4X:
Sub AppStart (Form1 As Form, Args() As String)
   Dim m As Matcher = Regex.Matcher("<strong>([^<]+)</strong>\s*([^<]+)<br />", File.ReadString(File.DirAssets, "0.html"))
   Do While m.Find
     Dim key As String = m.Group(1)
     Dim value As String = m.Group(2)
     Log(key & "=" & value)
   Loop
End Sub
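
The pattern above requires a literal <br /> after a non-empty value, so it does not match the last field of each paragraph (which ends in </p>), fields with an empty value, or the <br> form that Tidy writes. A possible variant, offered only as a sketch, relaxes the terminator and collects the pairs in a Map. The sub name ParseFields is hypothetical, and the keys keep the trailing colon exactly as they appear inside <strong>.

```b4x
'Hypothetical helper: returns label -> value pairs found in the HTML text.
Sub ParseFields (html As String) As Map
    Dim fields As Map
    fields.Initialize
    'Group 1: label inside <strong>. Group 2: value, possibly empty.
    'The value must be followed by <br>, <br/>, <br /> or </p>.
    Dim pattern As String = "<strong>([^<]+)</strong>\s*([^<]*?)\s*(?:<br\s*/?>|</p>)"
    Dim m As Matcher = Regex.Matcher(pattern, html)
    Do While m.Find
        Dim key As String = m.Group(1)
        Dim value As String = m.Group(2)
        fields.Put(key, value)
    Loop
    Return fields
End Sub
```

For example, ParseFields(File.ReadString(File.DirAssets, "0.html")) returns the map, which can then be logged with For Each k As String In fields.Keys. Note that on a line holding two <strong> elements, such as the "Travel Details" heading, only the second one is followed by a terminator, so it is stored with an empty value.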
 

atiaust

Thank you both for your advice.

The html file is generated from the body of an email sent from an online web enquiry.
The file format will always be constant.

Regex looks by far the simplest now that I can see how to apply it.
 

rwblinn

Hi,

I thought about converting the content to a table, then parsing it to XML, then building a map of the fields.
See the attached attempt using jTidy, with its output.

Note: I corrected a typo spotted in the source; the attachment has been updated.

 

Attachments

  • htmltoxmltofields.zip