Android Question Regexreplace problem

rosippc64a · May 24, 2018

Hi All!
I use the RegexReplace function that I found here.
I have an expression to remove unwanted tags from an HTML web page:

B4X:

szov = RegexReplace("<(img|head|nav|form|footer|style|script|noscript|aside|button|label|i\040|input)[^<>]*>((.|\n|\r\n)*?)</\1>",szov," ")

I tried it with this URL: view-source:http://www.erdekesvilag.hu/a-patkanyok-temploma-indiaban/
After the replace, the page source still contains a lot of tags such as <source... and <style.
If I run a second RegexReplace, the mentioned tags are replaced correctly (I checked).

B4X:

szov = RegexReplace("<(img|head|nav|form|footer|style|script|noscript|aside|button|label|i\040|input)[^<>]*>((.|\n|\r\n)*?)</\1>",szov," ")
If szov.Contains("<script") Then
    'view-source:http://www.erdekesvilag.hu/a-patkanyok-temploma-indiaban/
    szov = RegexReplace("<script[^<>]*>((.|\n|\r\n)*?)</script>",szov," ")
End If
If szov.Contains("<style") Then
    szov = RegexReplace("<style[^<>]*>((.|\n|\r\n)*?)</style>",szov," ")
End If

Did I make a mistake in the first expression that explains why they aren't replaced?
Thanks in advance,
Steven

rosippc64a · May 24, 2018

I also tried wrapping the whole expression in an outer pair of parentheses, thinking that maybe RegexReplace replaces group(0):

B4X:

szov = RegexReplace("(<(img|head|nav|form|footer|style|script|noscript|aside|button|label|i\040|input)[^<>]*>((.|\n|\r\n)*?)</\2>)",szov," ")

That did not help either.
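For readers wondering why the extra parentheses change nothing, here is a minimal sketch in Java. It assumes that the RegexReplace helper ends up calling `Matcher.replaceAll` from java.util.regex, which is not confirmed in this thread. The class name `GroupZeroDemo` and the shortened tag list are made up for illustration. `replaceAll` always replaces the whole match (group 0), so an outer group only shifts the backreference from `\1` to `\2`.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not the RegexReplace helper from the forum.
class GroupZeroDemo {
    // Shortened version of the pattern from the post (only two tag names).
    static final String INNER = "<(style|script)[^<>]*>((.|\\n|\\r\\n)*?)</\\1>";
    // Same pattern wrapped in an outer group; the tag name is now group 2.
    static final String WRAPPED = "(<(style|script)[^<>]*>((.|\\n|\\r\\n)*?)</\\2>)";

    // Assumed behaviour of the helper: replace every whole match.
    static String replaceAll(String pattern, String text, String replacement) {
        return Pattern.compile(pattern).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "x<style>a\nb</style>y";
        System.out.println(replaceAll(INNER, text, " "));   // x y
        System.out.println(replaceAll(WRAPPED, text, " ")); // x y
    }
}
```

Both calls give the same result, so under this assumption the outer parentheses cannot be what makes the difference.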

Erel · May 24, 2018

It is probably a mistake in your pattern. The best way to debug it is to extract the shortest possible text that is not parsed as you expect. We can then check it with your pattern.

Note that it will probably be simpler to use jTidy to parse the page.
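The shortest-possible-text approach could look roughly like the sketch below. It is written in Java because B4A's regex runs on java.util.regex. The class name `ShortestCaseDemo` and the sample snippets are invented for illustration and are not taken from the page in question. The pattern is the one from the first post, with backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical test harness: run the original pattern on tiny snippets.
class ShortestCaseDemo {
    // Pattern from the first post (backslashes doubled for Java).
    static final String ORIGINAL =
        "<(img|head|nav|form|footer|style|script|noscript|aside|button|label|i\\040|input)"
        + "[^<>]*>((.|\\n|\\r\\n)*?)</\\1>";

    // Replaces every match of ORIGINAL with a single space.
    static String strip(String html) {
        return Pattern.compile(ORIGINAL).matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        String[] samples = {
            "a<style>p{}</style>b",        // becomes "a b"
            "a<script>var a=1;</script>b", // becomes "a b"
            "a<i class=\"x\">t</i>b",      // left unchanged
            "a<img src=\"x.png\">b",       // left unchanged
        };
        for (String s : samples) {
            System.out.println(s + "  ->  " + strip(s));
        }
    }
}
```

On these snippets, two things show up. The alternative `i\040` captures the trailing space, so `</\1>` looks for `</i >` and never finds `</i>`. Elements such as `img` have no closing tag, so `</\1>` cannot match for them either. Whether either of these is what leaves `<style` and `<script` behind on the real page is not established here; the sketch only shows how small cases can narrow it down.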

rosippc64a · May 24, 2018

The shortest cases work well (except the empty scripts), but there are also a lot of very complex ones.
I solved it with substrings...
Thank you, Erel!
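The substring code itself was not posted. A minimal sketch of one way such an approach might work, in Java, with a hypothetical helper `removeBlocks` and made-up sample text:

```java
// Hypothetical sketch of a substring-based cleanup; not the poster's code.
class SubstringStripDemo {
    // Removes every block that starts with openPrefix (e.g. "<script")
    // and ends with closeTag (e.g. "</script>"), leaving one space behind.
    static String removeBlocks(String text, String openPrefix, String closeTag) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = text.indexOf(openPrefix, pos);
            if (start < 0) break;              // no further opening tag
            int end = text.indexOf(closeTag, start);
            if (end < 0) break;                // unterminated block: keep the rest as is
            out.append(text, pos, start).append(' ');
            pos = end + closeTag.length();     // continue after the closing tag
        }
        out.append(text, pos, text.length());  // copy whatever remains
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "a<script>x</script>b<script src='y'></script>c";
        System.out.println(removeBlocks(html, "<script", "</script>")); // a b c
    }
}
```

This avoids regex backtracking entirely, at the cost of being case-sensitive and of treating any tag name that merely begins with the prefix as a match.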
