get links from webpage

ringmyr

Member
Licensed User
Longtime User
How do I get the links from a web page onto buttons in my app?
The page has about 25 links.

I want the link name and the link's URL:

linkname-button.text
button.click-links-url

/ringmyr
 

warwound

Expert
Licensed User
Longtime User
Hi.

Can you clarify exactly what you're trying to achieve?

Here's a simple bit of javascript that can get the text and href attributes of all links on a web page:

B4X:
function getLinks(){
   var anchors=document.getElementsByTagName('a');
   for(var i=0, j=anchors.length; i<j; i++){
      console.log(anchors[i].innerHTML+' | '+anchors[i].href);
   }
}

That will output the text and href values to the browser console log.

You can use my WebViewExtras library to execute javascript in a WebView.

Update the javascript so that it gets all link text and hrefs and sends them to a B4A Sub:

B4X:
var anchors=document.getElementsByTagName('a'), string=[];
for(var i=0, j=anchors.length; i<j; i++){
   string.push({text:anchors[i].innerHTML, href:anchors[i].href});
}
B4A.CallSub('ProcessLinks', true, JSON.stringify(string));

That javascript will create an array of objects.
Each object has two properties: 'text' and 'href'.
The array will be JSON encoded to a String and sent to a B4A Sub named 'ProcessLinks'.
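
To make the payload concrete, here is a minimal sketch of the same collection step written as a standalone function. `collectLinks` is a hypothetical name, not part of WebViewExtras. It accepts any array-like list of anchor elements (inside the page you would pass `document.getElementsByTagName('a')`) and returns the JSON string that would be handed to `B4A.CallSub`. The sample data is made up.

```javascript
// Hypothetical helper: builds the JSON string described above.
function collectLinks(anchors) {
  var links = [];
  for (var i = 0, j = anchors.length; i < j; i++) {
    links.push({ text: anchors[i].innerHTML, href: anchors[i].href });
  }
  return JSON.stringify(links);
}

// Stand-in objects instead of real DOM anchors, so the sketch runs outside a browser.
var sampleAnchors = [{ innerHTML: 'Home', href: 'http://example.com/' }];
console.log(collectLinks(sampleAnchors));
// prints [{"text":"Home","href":"http://example.com/"}]
```

The B4A side therefore receives one String containing a JSON array, with one object per link.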

So add WebViewExtras to your project, create the ProcessLinks Sub and execute the javascript after the web page has loaded:

B4X:
Sub Process_Globals
   Dim MyWebViewExtras As WebViewExtras
End Sub

Sub ProcessLinks(Links As String)
   Log(Links)
   ' parse the JSON encoded String back to text and href values
End Sub

Sub WebView1_PageFinished (Url As String)
   Dim Javascript As String
   Javascript="var anchors=document.getElementsByTagName('a'), string=[];for(var i=0, j=anchors.length; i<j; i++){string.push({text:anchors[i].innerHTML, href:anchors[i].href});}B4A.CallSub('ProcessLinks', true, JSON.stringify(string));"
   MyWebViewExtras.executeJavascript(WebView1, Javascript)
End Sub

Martin.
 

warwound

Expert
Licensed User
Longtime User
Hi.

Yes you could download the web page using HttpUtils.
How would you parse the HTML to find the links?

The XmlSax library sounds obvious but what if the web page is not 100% valid HTML?
I think XmlSax would fail to parse a badly formed and invalid HTML document - and there's more than a few web pages out there that are invalid/badly formed.

Can you test your web page at The W3C Markup Validation Service to see if it passes the tests?

With the WebView method i posted, the WebView will parse the HTML and handle all but the most serious errors (if any) in the web page.
You could add the WebView to your activity programmatically, make it invisible and then once you have your link data remove the WebView from your activity - no need for the user to see it at all.

Martin.
 

NeoTechni

Well-Known Member
Licensed User
Longtime User
I've made code to parse HTML from HttpUtils. If you want it, PM me so I'll remember to post it when I get home Friday morning.
 

peacemaker

Expert
Licensed User
Longtime User
That javascript will create an array of objects.
Each object has two properties: 'text' and 'href'.
The object will be JSON encoded to a String and sent to a B4A Sub named 'ProcessLinks'.
Martin.

Hi, Martin.
I can't get it to decode properly :-(

B4X:
Sub ProcessLinks(JSONLinks As String)
    Log(JSONLinks)
    ' parse the JSON encoded String back to text and href values
   Dim JSON As JSONParser, Lst1 As List
   JSON.Initialize(JSONLinks)
   Lst1 = JSON.NextArray
   Dim a As Map
   
   For i = 0 To Lst1.Size - 1
      JSON.Initialize(Lst1.Get(i))
      a = JSON.NextObject               ' Here i have error: org.json.JSONException: Unterminated object at character 11 of {href=http://www.7ya.ru/click/?rid=109554&bn=21191&u
      For j=0 To a.Size - 1
         'Log(a.GetValueAt(i) & "------" & a.GetKeyAt(i))
      Next
   Next

End Sub

First list is like:
{href=http://solnet.ee/contests/quiz.php, text=&gt;&gt;&gt;}
{href=http://www.solnet.ee/ban_counter/down.php?f=1680, text=dettvorchestvo.ru}
{href=http://solnet.ee/gallery/kormushka.html, text=<img src="gallery/pic/kormushka/kormushka.gif" width="65" height="65" border="0" alt="Фотоконкурс" align="left"> <font color="#3366FF">Экологический фотоконкурс<br><b>"Покормите птиц зимой!"</b></font>}
......

I already get regular hrefs via "getHitTestResult", so the aim here is to get the JavaScript links of the images. Later I will parse the "<img src=" part.

It seems the format of these JSON objects is not correct: no tag names, no quotes...
It seems easier to parse them by splitting the string.
 

warwound

Expert
Licensed User
Longtime User
Try the attached code:

B4X:
Sub ProcessLinks(Json As String)
   '   Json represents an array of javascript objects
   '   Log(Json)
   Dim JavascriptArray As List
   Dim JavascriptObject As Map
   Dim i As Int
   Dim Parser As JSONParser
   Parser.Initialize(Json)
   JavascriptArray=Parser.NextArray
   For i=0 To JavascriptArray.Size-1
      JavascriptObject=JavascriptArray.Get(i)
      Log(JavascriptObject.Get("text")&", "&JavascriptObject.Get("href"))
   Next
End Sub

Sub WebView1_PageFinished (Url As String)
   Dim Javascript As String
   Javascript="var anchors=document.getElementsByTagName('a'), string=[];for(var i=0, j=anchors.length; i<j; i++){string.push({text:anchors[i].innerHTML, href:anchors[i].href});}B4A.CallSub('ProcessLinks', true, JSON.stringify(string));"
   WebViewExtras1.executeJavascript(WebView1, Javascript)
End Sub

That works for me - i get each anchor's href and text properties.
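
A note on why the earlier attempt most likely failed: `NextArray` already returns a List whose items are Maps, so nothing is left to parse. Passing one of those Maps to `Initialize` turns it into text of the form `{href=..., text=...}`, which is not JSON, hence the exception. The same pitfall can be reproduced in plain javascript; this is only an analogy with made-up data, not B4A code.

```javascript
var json = '[{"text":"Home","href":"http://example.com/"}]';
var list = JSON.parse(json);       // one parse gives ready-to-use objects
console.log(list[0].href);         // prints http://example.com/

var reparsed;
try {
  // Turning an already-parsed object back into text does not give JSON:
  // String(list[0]) is "[object Object]", so a second parse throws.
  reparsed = JSON.parse(String(list[0]));
} catch (e) {
  reparsed = null;
}
console.log(reparsed);             // prints null
```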

Regular href-s i get via "getHitTestResult" - so, the aim is to get JavaScript's links of the images. Later i will parse "<img src=".

Do you mean you just want to get anchor elements that have an image instead of text to click on?

Martin.
 

Attachments

  • WebViewGetElements.zip
    5.9 KB · Views: 387

peacemaker

Expert
Licensed User
Longtime User
Do you mean you just want to get anchor elements that have an image instead of text to click on?

Martin.

Thanks, Martin.
Yes, the final aim is a LongClick "Open in new window" function for WebView, to build a web browser with further customisation.

That works, but HitTestResult has other result types besides anchor.

B4X:
Sub OnLongClick(viewtag As Object) As Boolean
   DoEvents
   Dim a As Reflector, b As String, d As Object, e As Int
   d = Obj1.RunMethod("getHitTestResult") 
   a.Target = d
   e = a.RunMethod("getType")
   b = a.RunMethod("getExtra")
   If e = 1 OR e = 7 Then   'link
   Else   'others
      b = JavaScriptLink(b)
   End If
   If b.Contains("://") Then  .....
'making the popup menu...

But JavaScript makes getting the links very difficult for me :-(

<img src="http://img.files.7ja.ru/img4.0/1x1.gif" border="0" alt="Придумано и сделано в России!">, http://www.7ya.ru/click/?rid=109555&bn=21192&url=

Thanks, the code with the Map is working.
Maybe JavaScript has a function to strip the HTML?
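
On stripping the HTML: DOM elements expose a `textContent` property that holds only the text of an element and its descendants, so it can stand in for `innerHTML` when the markup is not wanted. A minimal sketch follows. `linkLabel` is a hypothetical helper, and falling back to the image's `alt` text for image-only links is an assumption about what would be useful here.

```javascript
// Hypothetical helper: returns a plain-text label for an anchor element.
function linkLabel(anchor) {
  // textContent carries no markup; trim surrounding whitespace by hand.
  var text = (anchor.textContent || '').replace(/^\s+|\s+$/g, '');
  if (text) {
    return text;
  }
  // Assumed fallback: use the alt text of the first image inside the anchor.
  var imgs = anchor.getElementsByTagName ? anchor.getElementsByTagName('img') : [];
  return imgs.length ? (imgs[0].alt || '') : '';
}

// Stand-in objects instead of real DOM anchors, so the sketch runs outside a browser.
var textLink = { textContent: '  Home ', innerHTML: '<b> Home </b>' };
var imageLink = {
  textContent: ' ',
  getElementsByTagName: function () { return [{ alt: 'Logo' }]; }
};
console.log(linkLabel(textLink));   // prints Home
console.log(linkLabel(imageLink));  // prints Logo
```

In the script sent to the WebView, `text:anchors[i].innerHTML` could then become `text:linkLabel(anchors[i])`, provided the helper is defined in the same script.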
 

warwound

Expert
Licensed User
Longtime User
Javascript has many powerful functions to do most tasks - i doubt you'll need to resort to manually parsing a String to find the info you require.

Take a look at this javascript:

B4X:
var anchor, img, parent, imgs=document.getElementsByTagName('img'), i=imgs.length, string=[];
while(i--){
   img=imgs[i];
   parent=img.parentNode;
   if(parent.tagName.toLowerCase()==='a'){
      //   this IMG element is the child of an A (anchor) element
      string.push({imageSource:img.src, href:parent.href});
   }
}
B4A.CallSub('ProcessLinks', true, JSON.stringify(string));

And some example B4A code:

B4X:
'Activity module
Sub Process_Globals
   'These global variables will be declared once when the application starts.
   'These variables can be accessed from all modules.

End Sub

Sub Globals
   'These global variables will be redeclared each time the activity is created.
   'These variables can only be accessed from this module.
   Dim WebView1 As WebView
   Dim WebViewExtras1 As WebViewExtras
End Sub

Sub Activity_Create(FirstTime As Boolean)
   WebView1.Initialize("WebView1")
   WebViewExtras1.addJavascriptInterface(WebView1, "B4A")
   WebViewExtras1.addWebChromeClient(WebView1)
   
   Activity.AddView(WebView1, 0, 0, 100%x, 100%y)
   
   WebView1.LoadUrl("http://code.martinpearman.co.uk/deleteme/image_anchors.htm")
End Sub

Sub Activity_Resume

End Sub

Sub Activity_Pause (UserClosed As Boolean)

End Sub

Sub ProcessLinks(Json As String)
   '   Json represents an array of javascript objects
   Log(Json)
   Dim JavascriptArray As List
   Dim JavascriptObject As Map
   Dim i As Int
   Dim Parser As JSONParser
   Parser.Initialize(Json)
   JavascriptArray=Parser.NextArray
   For i=0 To JavascriptArray.Size-1
      JavascriptObject=JavascriptArray.Get(i)
      Log(JavascriptObject.Get("imageSource")&" | "&JavascriptObject.Get("href"))
   Next
End Sub

Sub WebView1_PageFinished (Url As String)
   Dim Javascript As String
   Javascript="var anchor,img,parent,imgs=document.getElementsByTagName('img'),i=imgs.length,string=[];while(i--){img=imgs[i];parent=img.parentNode;if(parent.tagName.toLowerCase()==='a'){string.push({imageSource:img.src,href:parent.href})}}B4A.CallSub('ProcessLinks',true,JSON.stringify(string));"
   WebViewExtras1.executeJavascript(WebView1, Javascript)
End Sub

The webpage is one i created just to test the script:

B4X:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Images as anchors</title>
</head>
<body>
<a href="http://www.b4x.com/forum/index.php">
   <img src="http://upload.wikimedia.org/wikipedia/commons/d/d8/Basic4android-Logo.jpg" width="336" height="270" alt="" />
</a>
<br />
<a href="http://google.co.uk">
   <img src="http://blogs.creativepool.co.uk/files/2011/01/google_logo1.jpg" width="256" height="256" alt="" />
</a>
</body>
</html>

If an image has been made into a clickable link by setting its javascript onclick to a function then things get tricky - maybe impossible.
But if an image element is a child of an anchor element you can get the info you require.
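
One limitation of the script above is that it only looks at the image's direct parent. If a page wraps the image in another element inside the anchor, the check misses it. A possible variation, sketched here with the hypothetical helper name `closestAnchor`, climbs the ancestors until it finds an A element:

```javascript
// Hypothetical helper: climbs from a node to the nearest enclosing A element.
function closestAnchor(node) {
  while (node) {
    if (node.tagName && node.tagName.toLowerCase() === 'a') {
      return node;
    }
    node = node.parentNode;
  }
  return null;
}

// Stand-in objects mimicking <a><span><img></span></a>, so this runs outside a browser.
var wrapAnchor = { tagName: 'A', href: 'http://example.com/page', parentNode: null };
var wrapSpan = { tagName: 'SPAN', parentNode: wrapAnchor };
var wrapImg = { tagName: 'IMG', src: 'http://example.com/pic.gif', parentNode: wrapSpan };

var found = closestAnchor(wrapImg);
console.log(found ? found.href : 'no anchor');  // prints http://example.com/page
```

Inside the loop over `imgs`, `parent=img.parentNode` and the `tagName` test could be replaced by `parent=closestAnchor(img)` followed by a check that `parent` is not null.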

Martin.
 

Attachments

  • WebViewGetElements.zip
    6 KB · Views: 335

peacemaker

Expert
Licensed User
Longtime User
If an image has been made into a clickable link by setting its javascript onclick to a function then things get tricky - maybe impossible.
But if an image element is a child of an anchor element you can get the info you require.

Martin.

Yes, exactly. I mean that any browser needs to get the link under the cursor (at long-tap or right mouse button) for anything clickable on any web page.
SRC_IMAGE_ANCHOR_TYPE can be detected, along with its URL.

But there are many other variants in your first sample that gets TEXT and HREFs: URLs of images and all the other variants.

If it's impossible to do universally, then it seems there is no fast and easy way to make a full-function web browser with b4a and WebKit, short of starting from scratch.

BTW - your latest code for images always gets null for "imageSource:img.src"

 

peacemaker

Expert
Licensed User
Longtime User
Sorry, Martin.
It seems JavaScript is the wrong way to go in my case.
I need to work with

public void requestFocusNodeHref (Message hrefMsg)

Since: API Level 1
Request the anchor or image element URL at the last tapped point. If hrefMsg is null, this method returns immediately and does not dispatch hrefMsg to its target. If the tapped point hits an image, an anchor, or an image in an anchor, the message associates strings in named keys in its data. The value paired with the key may be an empty string.

if getHitTestResult returns UNKNOWN result

public WebView.HitTestResult getHitTestResult ()

Since: API Level 1
Return a HitTestResult based on the current cursor node. If a HTML::a tag is found and the anchor has a non-JavaScript url, the HitTestResult type is set to SRC_ANCHOR_TYPE and the url is set in the "extra" field. If the anchor does not have a url or if it is a JavaScript url, the type will be UNKNOWN_TYPE and the url has to be retrieved through requestFocusNodeHref(Message) asynchronously. If a HTML::img tag is found, the HitTestResult type is set to IMAGE_TYPE and the url is set in the "extra" field. A type of SRC_IMAGE_ANCHOR_TYPE indicates an anchor with a url that has an image as a child node. If a phone number is found, the HitTestResult type is set to PHONE_TYPE and the phone number is set in the "extra" field of HitTestResult. If a map address is found, the HitTestResult type is set to GEO_TYPE and the address is set in the "extra" field of HitTestResult. If an email address is found, the HitTestResult type is set to EMAIL_TYPE and the email is set in the "extra" field of HitTestResult. Otherwise, HitTestResult type is set to UNKNOWN_TYPE.
 

peacemaker

Expert
Licensed User
Longtime User
Javascript has many powerful functions to do most tasks - i doubt you'll need to resort to manually parsing a String to find the info you require.

Martin.

Hi, Martin,

Getting all HREFs from the HTML works, but... it doesn't help in my case; it seems I was wrong again :)
It seems I just need to get ONLY ONE URL (href): the one under the latest cursor position (at LongClick).

Can JavaScript get the HREF under the latest cursor position universally, I mean the HREF for any tag type?

The latest links I checked were (from http://solnet.ee):
B4X:
<a href="parents/index.html" onmouseover="iName='image2'; 
Ichange('changed2')" onmouseout="Ichange('default2')">
<img src="http://solnet.ee/pic/o22.gif" width="85" height="50" name="image2" border="0" alt="Родителям"></a>

Such a link is detected by getHitTestResult as SRC_IMAGE_ANCHOR_TYPE, which lets me get the http://solnet.ee/pic/o22.gif URL. But I need href="parents/index.html".

Is it possible to reliably get the HREF from such a tag, or from any other type (at the latest cursor position), with a JavaScript function started as you showed above?
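
One avenue that might be worth trying, offered only as a sketch: the DOM method `document.elementFromPoint(x, y)` returns the topmost element at a given viewport position, and from there a script can climb to the enclosing anchor. The function name `hrefAtPoint` is hypothetical. The sketch assumes the app can supply the tap position in the page's own CSS pixel coordinates, which nothing in this thread establishes, and it would not help where navigation is done purely by a script handler rather than an anchor.

```javascript
// Hypothetical helper: returns the href of the anchor at (x, y), or '' if there is none.
// doc is passed in so the sketch can run outside a browser; in a page it would be document.
function hrefAtPoint(doc, x, y) {
  var node = doc.elementFromPoint(x, y);
  while (node) {
    if (node.tagName && node.tagName.toLowerCase() === 'a' && node.href) {
      return node.href;
    }
    node = node.parentNode;
  }
  return '';
}

// Stand-in document mimicking the <a><img></a> link quoted above.
var tapAnchor = { tagName: 'A', href: 'http://solnet.ee/parents/index.html', parentNode: null };
var tapImage = { tagName: 'IMG', src: 'http://solnet.ee/pic/o22.gif', parentNode: tapAnchor };
var fakeDocument = { elementFromPoint: function (x, y) { return tapImage; } };

console.log(hrefAtPoint(fakeDocument, 10, 20));
// prints http://solnet.ee/parents/index.html
```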
 

warwound

Expert
Licensed User
Longtime User
Hi.

I'm looking at the documentation for WebView here: WebView | Android Developers

Return a HitTestResult based on the current cursor node. If a HTML::a tag is found and the anchor has a non-JavaScript url, the HitTestResult type is set to SRC_ANCHOR_TYPE and the url is set in the "extra" field. If the anchor does not have a url or if it is a JavaScript url, the type will be UNKNOWN_TYPE and the url has to be retrieved through requestFocusNodeHref(Message) asynchronously. If a HTML::img tag is found, the HitTestResult type is set to IMAGE_TYPE and the url is set in the "extra" field. A type of SRC_IMAGE_ANCHOR_TYPE indicates an anchor with a url that has an image as a child node. If a phone number is found, the HitTestResult type is set to PHONE_TYPE and the phone number is set in the "extra" field of HitTestResult. If a map address is found, the HitTestResult type is set to GEO_TYPE and the address is set in the "extra" field of HitTestResult. If an email address is found, the HitTestResult type is set to EMAIL_TYPE and the email is set in the "extra" field of HitTestResult. Otherwise, HitTestResult type is set to UNKNOWN_TYPE.

I didn't originally realise that you were creating a browser app.
I think you'll find that javascript is of little use now - you need code that can handle all possibilities.
Javascript cannot get what the documentation refers to as a JavaScript url - that is, a case where a function has been assigned to an element to perform navigation.

For that example HTML you posted, this part of the documentation seems to apply:

A type of SRC_IMAGE_ANCHOR_TYPE indicates an anchor with a url that has an image as a child node.

It makes no mention of where you can find the anchor URL, but the preceding sentence says:

If a HTML::img tag is found, the HitTestResult type is set to IMAGE_TYPE and the url is set in the "extra" field.

Does that not mean that if HitTestResult is either IMAGE_TYPE or SRC_IMAGE_ANCHOR_TYPE then the URL can be found in the 'extra' field?

Is the extra field actually the source of the image and not the URL being navigated to?

Martin.
 

peacemaker

Expert
Licensed User
Longtime User
Yes! I tested it in practice before posting.

That tag with IMG and JavaScript has getType = 8 (SRC_IMAGE_ANCHOR_TYPE), and getExtra gives the IMG's URL, not the HREF's :-((((

But a simple tap on this tag works fine in WebView: navigation goes to the correct HREF URL.
It's the same in the 2.2 emulator and on a real 2.3 device.

So, where is the universal way to get the HREF's URL for long-tap operations, as in a normal browser?
 

warwound

Expert
Licensed User
Longtime User
Have you made any progress with this then?

I've done a bit of research but not found the info you are after.

Martin.
 

peacemaker

Expert
Licensed User
Longtime User
No :-(
I cannot understand how browsers recognize the hrefs of such complex tags on a simple tap.
Anyway, I need to get the href under the cursor, maybe by manually parsing the tag. But I need to get the tag first.
 

andrewj

Active Member
Licensed User
Longtime User
Hi @warwound,
Thanks very much for the code in #7. It works beautifully if you match the URL you have with the links on the page using a .EndsWith() check.
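
One reading of that tip, sketched in javascript with made-up data: take the image URL reported by getHitTestResult and look for it in the list produced by the image script, comparing with an ends-with check so that a relative and an absolute form of the same address still match. `findHrefForImage` and `endsWith` are hypothetical helper names, and which of the two strings should be treated as the suffix is an assumption, so the sketch tries both.

```javascript
// True when str ends with suffix (written out by hand for older script engines).
function endsWith(str, suffix) {
  return suffix.length <= str.length &&
    str.indexOf(suffix, str.length - suffix.length) !== -1;
}

// Hypothetical helper: finds the href paired with a known image URL.
// links is the parsed array of {imageSource, href} objects.
function findHrefForImage(links, imageUrl) {
  if (!imageUrl) {
    return null;
  }
  for (var i = 0; i < links.length; i++) {
    var src = links[i].imageSource;
    if (!src) {
      continue;  // skip entries without a usable image source
    }
    if (endsWith(src, imageUrl) || endsWith(imageUrl, src)) {
      return links[i].href;
    }
  }
  return null;
}

// Made-up sample data.
var pageLinks = [
  { imageSource: 'http://example.com/pic/one.gif', href: 'http://example.com/first.html' },
  { imageSource: 'http://example.com/pic/two.gif', href: 'http://example.com/second.html' }
];
console.log(findHrefForImage(pageLinks, 'pic/two.gif'));
// prints http://example.com/second.html
```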
Andrew
 