I currently have a working program that scrapes a website for details.
It downloads the page using an HttpJob, then saves certain lines to a file to be parsed with the SaxParser.
It had been working fine until I tried to parse this:
HTML:
<table class="calendarTable" border="0" cellspacing="0" cellpadding="0">
<tr>
<th>Monday</th>
<th>Tuesday</th>
<th>Wednesday </th>
<th>Thursday </th>
<th>Friday </th>
<th>Saturday </th>
<th>Sunday </th>
</tr>
<tr><td class="lastmonth"><div class="calendarTableDay">26</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">27</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">28</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">29</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">30</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">31</div><p class=""> </p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">1</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11247/result-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">2</div><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11248/result-detail.aspx">Taranaki GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">3</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11249/result-detail.aspx">Otago GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11250/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">4</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a 
href="/catch-the-action/11251/result-detail.aspx">Palmerston North GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">5</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11252/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11253/result-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">6</div><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11255/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11254/result-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">7</div><p class=""> </p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">8</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11256/result-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">9</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11257/result-detail.aspx">Palmerston North GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">10</div><p class="results"><img 
class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11258/result-detail.aspx">Southland GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11259/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">11</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11260/result-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">12</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11261/result-detail.aspx">Waikato GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11262/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">13</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11263/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11264/result-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">14</div><p class=""> </p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">15</div><p 
class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11265/result-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">16</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11266/result-detail.aspx">Palmerston North GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">17</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11267/result-detail.aspx">Otago GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11268/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">18</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11269/result-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">19</div><p class="results"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11270/result-detail.aspx">Waikato GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="results"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11271/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">20</div><p class="results"><img 
class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11272/result-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="fields"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11273/field-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">21</div><p class=""> </p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">22</div><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11274/field-detail.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11275/field-detail.aspx">Ashburton GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">23</div><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11276/field-detail.aspx">Palmerston North GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">24</div><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11277/field-detail.aspx">Southland GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11278/field-detail.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">25</div><p 
class="fields"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11279/field-detail.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">26</div><p class="schedule"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11267/meeting-schedule.aspx">Waikato GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="schedule"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11268/meeting-schedule.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">27</div><p class="schedule"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11269/meeting-schedule.aspx">Christchurch GRC</a><br/><br/></p><p style="height: 3px;"></p><p class="schedule"><img class="trophy" src="/Images/icons/night_race_icon.png" alt="Night Meeting" width="17" height="17" /><a href="/catch-the-action/11270/meeting-schedule.aspx">Wanganui GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">28</div><p class=""> </p><p style="height: 3px;"></p></td><td class="thismonth"><div class="calendarTableDay">29</div><p class="schedule"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11271/meeting-schedule.aspx">Auckland GRC</a><br/><br/></p><p style="height: 3px;"></p></td></tr><tr><td class="thismonth"><div class="calendarTableDay">30</div><p class="schedule"><img class="trophy" src="/Images/icons/day_race_icon.png" alt="Day Meeting" width="17" height="17" /><a href="/catch-the-action/11272/meeting-schedule.aspx">Palmerston North 
GRC</a><br/><br/></p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">1</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">2</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">3</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">4</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">5</div><p class=""> </p><p style="height: 3px;"></p></td><td class="lastmonth"><div class="calendarTableDay">6</div><p class=""> </p><p style="height: 3px;"></p>
</table>
I am having trouble parsing the line that starts with "<tr><td class="lastmonth"><div class="calendarTableDay"". It runs all the way to the final "</p>" just before the last line, "</table>".
I write it to the file successfully, but when I parse it with "parser.Parse(File.OpenInput(File.DirRootExternal, "index.html"), "parser")" I get the error below, which indicates a problem with the XML/HTML.
Any tips on parsing this very long line, or on converting it into a layout that can be parsed, would be helpful. Thanks!
The error I receive:
B4X:
org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 84: undefined entity
at org.apache.harmony.xml.ExpatParser.parseFragment(ExpatParser.java:515)
at org.apache.harmony.xml.ExpatParser.parseDocument(ExpatParser.java:474)
at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:321)
at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:279)
at anywheresoftware.b4a.objects.SaxParser.parse(SaxParser.java:80)
at anywheresoftware.b4a.objects.SaxParser.Parse(SaxParser.java:73)
at b4a.jtidy.main._jobdone(main.java:438)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:511)
at anywheresoftware.b4a.BA.raiseEvent2(BA.java:174)
at anywheresoftware.b4a.keywords.Common$5.run(Common.java:957)
at android.os.Handler.handleCallback(Handler.java:725)
at android.os.Handler.dispatchMessage(Handler.java:92)
at android.os.Looper.loop(Looper.java:213)
at android.app.ActivityThread.main(ActivityThread.java:5092)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:511)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:797)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:564)
at dalvik.system.NativeStart.main(Native Method)
org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 84: undefined entity
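The "undefined entity" message suggests the parser is objecting to the markup rather than to the length of the line. As a rough check outside B4A, the sketch below feeds one calendar cell to Python's xml.parsers.expat, which wraps the same Expat library named in the stack trace. It assumes the blank cells in the downloaded page actually contain the HTML entity &nbsp; (they show up as plain spaces above). The helper name expat_error and the &#160; replacement are illustrative choices, not part of the original program.

```python
import xml.parsers.expat


def expat_error(fragment):
    """Return Expat's error text for an XML fragment, or None if it parses cleanly."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(fragment, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        return xml.parsers.expat.ErrorString(err.code)
    return None


# Assumption: the blank cell holds &nbsp; in the page as downloaded.
cell = ('<tr><td class="lastmonth"><div class="calendarTableDay">26</div>'
        '<p class="">&nbsp;</p><p style="height: 3px;"></p></td></tr>')

# XML predefines only &lt; &gt; &amp; &quot; &apos;, so &nbsp; is unknown to Expat.
print(expat_error(cell))

# A numeric character reference needs no declaration.
print(expat_error(cell.replace("&nbsp;", "&#160;")))
```

If that assumption holds, replacing &nbsp; with &#160; (or a plain space) in the saved text before calling parser.Parse might be enough to get past this particular error. The final row above also lacks its closing </td></tr>, which a strict XML parser is likely to reject as well.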