Android Question Scraping web with html

3394509365

Active Member
Licensed User
Longtime User
Hello, I wrote a VB.NET program that scrapes data from a web page.
I would like to do the same thing in an app written in Basic4Android (B4A).
From what I understand, there is no library that can pick out the tags and nodes of an HTML page, so perhaps I should first download the page locally as XML and then extract the tags that interest me. Have I understood correctly, or is there another way to approach the problem?

Do you have any sample programs to start from?

Thanks
Regards
 

MicroDrie

Well-Known Member
Licensed User
Longtime User
I think I have found the reason for your problems in B4A: add jsoup-1.8.1.jar to the shared libraries folder.
 

Attachments

  • jsoup-1.8.1.jar
    293.8 KB

3394509365

OK, now it works.
Let's go back to the code:

B4X:
    '    --- Get the date for the numbers
    Dim FirstDateRow As String = js.getElementsByClass(HTML, "t1").Get(1)'


For the extraction of the numbers I can change the .Get(x) index, but in the code above I cannot: on the page there is only one occurrence of "t1".

Somehow I have to use a For Each ...
 

OliverA

Expert
Licensed User
Longtime User
3394509365 said: "somehow i have to use the for each ....."

Why? What is the issue? Show your code and any errors you are getting.
 

MicroDrie


Sorry for the missing jar; good that it works now.

Scraping works top down. Start by looking at how the HTML code is built. In your example, the structure is one date element with class "t1" and many number rows under it with class "t2". So you do not need to change how you resolve class "t1". You need to change the class "t2" code to a structure like this:
B4X:
'    --- As an example, resolve a year of the latest "superstar-24px" numbers
    Dim x As Int = 0
    For x = 0 To 11    
        
    '    --- Get row x of the table from the t2 class
        Dim FirstTableRow As String = js.getElementsByClass(HTML, "t2").Get(x)

    '    --- We need only the td rows of the wanted table    
        Extract02 = js.getElementsByClass(FirstTableRow, "ball-24px")
        Extract03 = js.getElementsByClass(FirstTableRow, "jolly-24px")
        Extract04 = js.getElementsByClass(FirstTableRow, "superstar-24px")
        Log("Extract04: "& Extract04)
    Next

This code is not a complete solution; I have only pointed out a possible way to go.
 

3394509365

So, should I keep version 1.8.1 or use 1.13.1?

Returning to the code: the number 11 is not fixed, however; it changes as new draws are added.
But somehow I will manage to make it dynamic.
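
One possible way to do that, as a minimal sketch: it assumes js is the jSoup wrapper object used above, that js.getElementsByClass returns a List (as its use with .Get(x) suggests), and that each "t2" element holds one draw. LogAllSuperstars, Rows, Row and Stars are names introduced here only for illustration.

B4X:
'    --- Sketch: take the loop bound from the page instead of a fixed 11
Private Sub LogAllSuperstars (HTML As String)
    '    --- Fetch all t2 rows once (assumption: one t2 element per draw)
    Dim Rows As List = js.getElementsByClass(HTML, "t2")
    For x = 0 To Rows.Size - 1
        Dim Row As String = Rows.Get(x)
        Dim Stars As List = js.getElementsByClass(Row, "superstar-24px")
        Log("Extract04: " & Stars)
    Next
End Sub

Fetching the list once before the loop presumably also saves querying the whole page again on every pass.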

Thanks
 

3394509365

It finally works the way I wanted.

Below is the code in case anyone needs it, for now in B4J.


B4X:
#Region Project Attributes
    #MainFormWidth: 600
    #MainFormHeight: 600
#End Region

Sub Process_Globals
    Private fx As JFX
    Private MainForm As Form
    
    '    --- Variable for Jsoup
    Private js As jSoup
    Private Ciclo As List
    Private Extract01 As List
    Private Extract02 As List
    Private Extract03 As List
    Private Extract04 As List
End Sub

Sub AppStart (Form1 As Form, Args() As String)
    MainForm = Form1
    '    MainForm.RootPane.LoadLayout("Layout1") 'Load the layout file.
    '    MainForm.Show

    Extract01.Initialize
    Extract02.Initialize
    Extract03.Initialize
    Extract04.Initialize
    
    ScrapeTable
    
#IF B4A
    Activity.Finish
#End If

#if B4J
    ExitApplication        ' ends the program
#End If
    
End Sub

Private Sub ScrapeTable

    '    --- Load the url
    Dim url As String = "https://www.superenalotto.com/archivio"
    
    '    ---  Get the page content
    Dim HTML As String = js.connect(url) ' re-enable this line to read directly from the site
''xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   
'    ' for now, read only from the text file
'    '--- Load the raw text File
'    Dim RawHTML As String = ""
'    Dim read1 As String="miotesto.txt"
'    RawHTML = File.ReadString(File.DirAssets,read1)
'    ' Log("Raw RawHTML: " & RawHTML)  'displays the content of file
'    '    --- Remove empty lines and reformat the layout
'    Dim HTML As String = js.parse_HTML(RawHTML)
'    'Log("Clean HTML: " & HTML)
    
    '    Log (HTML)
''xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    
    '    --- Get the date for the numbers
    Dim FirstDateRow As String = js.getElementsByClass(HTML, "t1").Get(0) ' the outer date block, read only once
    '    --- We need the t1 class for the date
    Extract01 = js.selectorElementText(FirstDateRow, "a")
    Dim Count As Int = Extract01.Size
    Log("Count: "& Count)
    '   
'    '    --- Start with the date which is the first line
'    Log($"The number from ${Extract01.Get(0)} are: ${CRLF}"$)
'    '    --- Show results during the test period
'    Log("Extract01: "& Extract01)' without an index it takes all the dates
'    '    Dim Count As Int = Extract01.Size
'    '    Log("Count: "& Count)
'   
    
    
    ' determine how many times to loop over the draws
    Dim Totclicli As String = js.getElementsByClass(HTML, "t1").Get(0)
    Ciclo = js.selectorElementText(Totclicli, "a")
    Dim CountCiclo As Int = Ciclo.Size
    Log("Count: "& CountCiclo)
            
'   
    For w=0 To CountCiclo-1
        
        Log("Extract01: "& Extract01.Get(w))
        
        
        Log("xxxxxxxxxxxxxxxxx        new round " & (w+1) & "         xxxxxxxxxxxxxxxxxxxx")

'   
'       
        '    --- Get row w of the table from the t2 class
        Dim FirstTableRow As String = js.getElementsByClass(HTML, "t2").Get(w)
        
'    '    --- We need only the td rows of the wanted table
        Extract02 = js.getElementsByClass(FirstTableRow, "ball-24px")
        Extract03 = js.getElementsByClass(FirstTableRow, "jolly-24px")
        Extract04 = js.getElementsByClass(FirstTableRow, "superstar-24px")
        
        '    '    --- We need to scrape the rows and columns
        Dim columns As List
        columns.Initialize
        
        '    --- To display a counter at the variables end
        Dim x As Int = 0
        
        '    --- Scrape the first 6 numbers with the same class
        For i = 0 To Extract02.Size -1
            columns = js.selectorElementText($"<table>${Extract02.Get(i)}</table>"$, "td")
        
            For j = 0 To columns.Size -1
                x = x + 1
                Log($"Number${x}: ${columns.Get(j)}"$)
            Next
        Next
        
        '    --- And scrape the Jolly number with different class
        For i = 0 To Extract03.Size -1
            columns = js.selectorElementText($"<table>${Extract03.Get(i)}</table>"$, "td")
        
            For j = 0 To columns.Size -1
                x = x + 1
                Log($"Jolly Number${x}: ${columns.Get(j)}"$)
            Next
        Next
        
        '    --- And scrape the super number with different class
        For i = 0 To Extract04.Size -1
            columns = js.selectorElementText($"<table>${Extract04.Get(i)}</table>"$, "td")
        
            For j = 0 To columns.Size -1
                x = x + 1
                Log($"Super Number${x}: ${columns.Get(j)}"$)
            Next
            
        Next
        
        
    Next

End Sub
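
The three inner loops above differ only in their label, so they could be folded into one helper. This is a sketch only, assuming js.selectorElementText behaves as it does in ScrapeTable; LogCells is a hypothetical name and not part of the jSoup wrapper.

B4X:
'    --- Hypothetical helper: logs every td text found in Cells and returns the updated counter
Private Sub LogCells (Cells As List, Label As String, Counter As Int) As Int
    For i = 0 To Cells.Size - 1
        '    --- Wrap the fragment in <table>, exactly as ScrapeTable does
        Dim columns As List = js.selectorElementText($"<table>${Cells.Get(i)}</table>"$, "td")
        For j = 0 To columns.Size - 1
            Counter = Counter + 1
            Log($"${Label}${Counter}: ${columns.Get(j)}"$)
        Next
    Next
    Return Counter
End Sub

Inside the For w loop it would be called as x = LogCells(Extract02, "Number", x), then with Extract03 and "Jolly Number", then with Extract04 and "Super Number".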

Thank you, MicroDrie.
 

MicroDrie

3394509365 said: "thank You Microdrie"

It's great that I was able to help you get on the right track with jSoup. I just managed to communicate with jSoup via inline Java code in a class module.
 

MicroDrie

It works with version 1.8.1, so you can use the functions it provides. If you reach a point where you miss a function in version 1.8.1, version 1.13.1 is then a possible solution.
 