Scraping a web page

Caravelle

Active Member
Licensed User
Longtime User
I am making my first attempt at "scraping", using NJDude's helpful basic example here to guide me.

What puzzles me is that the SearchString.IndexOf(whatever) construction will find some phrases on the page but not others. I know my phrases are correctly typed because I cut and pasted them from the "View Source" version of the page that Google Chrome kindly gives me. And I have been making sure that all double quotes are converted to single quotes, as in the example, using
B4X:
SearchString = SearchString.Replace(QUOTE, "'")
Surely my actual B4A code must be right, or it wouldn't find any phrases at all?

Is there some reason why this doesn't work with some characters, or are there "invisible" characters that Chrome's "View Source" isn't showing me?

An example of the page I'm trying to read is the "Site Search 'G-EZWA'" results page on Planespotters.net, and I'm trying to get to the items in the table below the word "Registrations".
The result of
B4X:
SearchString.IndexOf("<td class='nowrap dt-asc'>")
is -1, but as far as I can see it's there, albeit with double quotes in the original. I just can't find it with SearchString.IndexOf().
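
I suppose something like this would at least show me what the downloaded source actually contains around that point (just a sketch; "nowrap" is only a guess at a shorter anchor that might survive whatever difference there is between Chrome's view and the downloaded page):
B4X:
' Rough check: look for a shorter fragment and log what surrounds it,
' to compare the downloaded source with what "View Source" shows.
Dim i As Int = SearchString.IndexOf("nowrap")
If i > -1 Then
    Log(SearchString.SubString2(i, Min(i + 80, SearchString.Length)))
Else
    Log("'nowrap' not found either")
End If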

Can anyone help, please? The zip file should be attached.

Caravelle
 

Attachments

  • WordsScrape.zip
    6.8 KB · Views: 273

NJDude

Expert
Licensed User
Longtime User
Scraping can be a very unreliable process. In this case, you could do this:
B4X:
StrStart = SearchString.IndexOf("G-EZWA")
The query you entered appears in upper case, so you could use that as a reference point to scrape what you're looking for.
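
From there, something along these lines could pull out the cell that follows (just a sketch; "</td>" as the end marker is an assumption about the page's markup, so check it against the actual source):
B4X:
'Sketch only: find the registration, then grab the text up to the end
'of that table cell. "</td>" is an assumed delimiter, not verified.
Dim StrStart As Int = SearchString.IndexOf("G-EZWA")
If StrStart > -1 Then
    Dim StrEnd As Int = SearchString.IndexOf2("</td>", StrStart)
    If StrEnd > StrStart Then
        Log(SearchString.SubString2(StrStart, StrEnd))
    End If
End If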
 

Caravelle

Active Member
Licensed User
Longtime User
Thank you both, and especially NJDude for the very helpful example code in the first place.

So apparently some web pages are not quite what they seem. I know the pages on my own site are OK; I coded them by hand in a plain text editor.

I'm not quite sure what NJDude is saying about capitalization. What difference does that make? The text in the "cell" under the word "Reg" is always going to be in upper case and match the original search term forming the end of the URL, but, so long as this registration exists in the database, the text in that position should be the third instance of it on the page (though I haven't experimented enough yet to see whether this is guaranteed). I had thought it better to search for something unique that only appeared once. Moreover, it's possible that the page will return a table with more than one line - aircraft registrations can be used on many aircraft, though not at the same time - and only one will be marked as "Active". Maybe I should search for the word "Active" and work back from there; if that works, it should find the right line first time and kill several birds with one stone.
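
Something like this is what I have in mind (only a sketch; "<tr" and "</tr>" as the row markers are my assumptions about the page's markup):
B4X:
'Sketch: find the "Active" flag, then work back to the start of that
'table row and forward to its end. "<tr" and "</tr>" are assumed markers.
Dim ActivePos As Int = SearchString.IndexOf("Active")
If ActivePos > -1 Then
    Dim RowStart As Int = SearchString.LastIndexOf2("<tr", ActivePos)
    Dim RowEnd As Int = SearchString.IndexOf2("</tr>", ActivePos)
    If RowStart > -1 And RowEnd > -1 Then
        Log(SearchString.SubString2(RowStart, RowEnd))
    End If
End If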

I will also investigate "Tidy". I think one of the main problems I have in getting to grips with B4A is that, to do anything practical, it seems you have to learn six other things first, but you don't discover that until you are half-way through something. Finding the time and the spare brain cells is hard for me.

Caravelle
 

Caravelle

Active Member
Licensed User
Longtime User
I'm back. I now have a working application, sort of, using JTidy as advised, and am retrieving the specific details I want from the website.

I have a problem, however, with some entries, which I have tracked down, I think, to the operation of JTidy. If you look at the attached .txt file, you will see that there are some very long lines. These are being wrapped, with a linefeed inserted in places that are (for me) inappropriate, to create a new line. For example, look at line 361 in the text file; it has become lines 173 and 174 in the XML version. I can't feed the airline name
"British
Airways"
into my database without stripping the linefeed.

I thought I could easily remove the line feeds from the retrieved strings with this code:
B4X:
GoodString = BadString.Replace(CHR(10), " ")
but it doesn't make any difference, much to my surprise.

So, can someone please advise whether this is the expected behaviour of JTidy, or a setting that can be modified, perhaps to allow for longer lines; and, if I have to live with it, how do I remove the unnecessary linefeeds?

Thanks for any help.

Caravelle
 

Attachments

  • Planespotters.zip
    6.3 KB · Views: 219

Caravelle

Active Member
Licensed User
Longtime User
Apologies,
B4X:
GoodString = BadString.Replace(CHR(10), " ")
does work; I was looking in the wrong place for the GoodString that results and still seeing the BadString.

But I would still prefer that the LF character were not inserted in the first place. I now have to apply the code above to everything I scrape that is likely to contain a space. Maybe I will also get LFs following a dash ("-") if JTidy deems that a line should end at that point; there are plenty of those in the data too.
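
If I do have to live with it, I will probably push everything through a small cleaning sub like this (a rough sketch; collapsing every run of whitespace into one space is my own assumption about what is safe for this data):
B4X:
'Sketch: collapse any run of whitespace (LF, CR, tab, repeated spaces)
'into a single space before the value goes into the database.
'Usage (hypothetical field name): Airline = CleanField(Airline)
Sub CleanField(BadString As String) As String
    Return Regex.Replace("\s+", BadString, " ").Trim
End Sub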

Caravelle
 