Scraping with jSoup
I recommend that you first develop your solution in B4J as a B4XPages solution. Development, and especially scraping, is easier on a desktop with a keyboard and a bigger screen than on a phone. If it works in B4J, you only need minimal adjustments to use it in B4A.
I call jSoup directly through inline Java because the existing jSoup library is old and out of date and, as always with a library, may not contain the routine you need or want.
The following is needed:
Download the latest jSoup jar from here: https://jsoup.org/download and save it in the external library folder.
Add #AdditionalJar: jsoup-1.17.2 to the Main module. The name must match the version of the jar you downloaded; 1.17.2 is the current version of jSoup.
Sub Class_Globals
    Private Root As B4XView 'ignore
    Private xui As XUI 'ignore
    Private clv1 As CustomListView
    Private txturl As B4XFloatTextField
End Sub
'You can add more parameters here.
Public Sub Initialize As Object
    Return Me
End Sub
'This event will be called once, before the page becomes visible.
Private Sub B4XPage_Created (Root1 As B4XView)
    Root = Root1
    'load the layout to Root
    Root.LoadLayout("frmDantesjSoup")
End Sub
'You can see the list of page related events in the B4XPagesManager object. The event name is B4XPage.
Private Sub txturl_EnterPressed
    ' Me.As(JavaObject).RunMethod("firstScrape", Array As String("https://quotes.toscrape.com/"))
    Me.As(JavaObject).RunMethod("firstScrape", Array ("https://b4x.com"))
End Sub
#IF JAVA
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
/**
 * --- firstScrape is a jSoup example to get started with.
 * 1) Download the latest jSoup jar from here: https://jsoup.org/download
 * 2) Add to the Main module: #AdditionalJar: jsoup-1.17.2
 * 3) Call this routine with:
 * Me.As(JavaObject).RunMethod("firstScrape", Array ("https://b4x.com"))
 *
 * Happy scraping!
 *
 * One of the many resources on the Internet to start with:
 * https://www.tutorialspoint.com/jsoup/index.htm
 * ---
 * @param url the web page to scrape
 * @throws IOException if the page cannot be fetched
 */
public void firstScrape(String url) throws IOException {
    // Connect to the target website with an HTTP GET request
    Document doc = Jsoup.connect(url).userAgent(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
            .get();
    // Extract the title of the web page
    String title = doc.title();
    System.out.println(title);
    // Extract the textual part of the web page
    String text = doc.body().text();
    System.out.println(text);
    System.out.println("----");
    // Extract all hyperlinks on the page
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        String linkHref = link.attr("href");           // raw href value, may be relative
        String linkText = link.text();                 // visible link text
        String linkOuterHtml = link.outerHtml();       // the complete <a ...>...</a> element
        String linkInnerHtml = link.html();            // the HTML between the <a> tags
        System.out.println(linkHref + "\t" + linkText + "\t" + linkOuterHtml + "\t" + linkInnerHtml);
    }
}
#End If
Private Sub clv1_ItemClick (Index As Int, Value As Object)
End Sub
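The href values printed by firstScrape are taken as they appear in the page, so many of them will be relative (for example /b4j.html). jSoup can return absolute links itself through link.absUrl("href"). The sketch below only illustrates the idea with plain java.net.URI; the class LinkResolver and its method names are hypothetical and not part of jSoup or B4X.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of jSoup or B4X.
final class LinkResolver {

    // Resolves one href against the URL of the scraped page.
    // Returns null when the href or the page URL cannot be parsed.
    static String resolve(String pageUrl, String href) {
        if (pageUrl == null || href == null) {
            return null;
        }
        try {
            URI base = URI.create(pageUrl.trim());
            // A base without a path ("https://b4x.com") is given the root path
            // first, otherwise java.net.URI glues the host and the href together.
            if ("".equals(base.getPath())) {
                base = base.resolve("/");
            }
            return base.resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            return null; // not a syntactically valid URI
        }
    }

    // Resolves a list of hrefs and silently skips the ones that cannot be parsed.
    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            String absolute = resolve(pageUrl, href);
            if (absolute != null) {
                result.add(absolute);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of("/b4j.html", "b4a.html", "https://jsoup.org/download");
        for (String link : resolveAll("https://b4x.com", hrefs)) {
            System.out.println(link);
        }
    }
}
```

Inside firstScrape you would pass url and linkHref to such a helper, or simply replace link.attr("href") with link.absUrl("href").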
As you can see, I use the modern B4X views to save space. With the hint text of the text field and its EnterPressed event, both a button and a label become redundant, which saves space on a small screen.
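If you pass the text typed into txturl to firstScrape instead of a fixed URL, keep in mind that Jsoup.connect expects a complete http or https URL, while users often type just b4x.com. A minimal sketch of a check you could run first, assuming that input without a scheme is meant as https; the class UrlInput and the method normalize are hypothetical names used for illustration.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of jSoup or B4X.
final class UrlInput {

    // Returns a usable http(s) URL for the typed text, or null if there is none.
    static String normalize(String typed) {
        if (typed == null) {
            return null;
        }
        String s = typed.trim();
        if (s.isEmpty()) {
            return null;
        }
        // Assumption: text without a scheme ("b4x.com") is meant as https.
        if (!s.contains("://")) {
            s = "https://" + s;
        }
        try {
            URI uri = URI.create(s);
            String scheme = uri.getScheme();
            boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            // Only accept web URLs that actually contain a host name.
            return (web && uri.getHost() != null) ? uri.toString() : null;
        } catch (IllegalArgumentException e) {
            return null; // for example text containing spaces
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize(" b4x.com "));
        System.out.println(normalize("https://quotes.toscrape.com/"));
        System.out.println(normalize("two words"));
    }
}
```

A null result means there is nothing sensible to scrape, so the EnterPressed routine can return without calling firstScrape.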
Finally:
Do not hesitate to ask a question on this forum, but please add a small example project in which you demonstrate:
- What you want to achieve
- The URL, already entered in the program
- Where the problem is
- What's missing.
Update: external library folder as destination added