Android Question remove all HTML tags from a string

luke2012 · Oct 22, 2021

Hi all,
is there a quick way to "clean" a string by removing all HTML tags, so that only plain text without any tags is left (see the example)?

B4X:

<p class="text-align-justify">Multibrand giovane con i migliori marchi come....</p>

For this specific string the fastest solution is...

B4X:

'str is the above string; Replace returns a new string, so assign the result back
str = str.Replace($"<p class="text-align-justify">"$, "").Replace($"</p>"$, "")

But this works only for this specific string. If the tags change, there is no longer any guarantee that the result is plain text without HTML.

So which is the best solution in this case?

1) Regex (is there a pattern that removes all HTML tags)?
2) HTML parser (parsing the tags to extract the text)?

DonManfred · Oct 22, 2021

Maybe this lib can help?
At least I found a reference to jsoup using Google, and this is a lib about jsoup.

Use a HTML parser. Here's a Jsoup example.
String input = "some text\nanother text";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);

Result:
some text another text

jSoup HTML Parser

This is my first attempt at a wrapper so it's a work in progress. Consider it a Beta although it isn't feature complete, I'm adding features as I require them :) Not all functions and documentation implemented or tested fully yet. Library is compatible with B4A and B4J. 1. Download jsoup...

www.b4x.com

I don't know whether this lib supports parse(input).text(), though.

tchart · Oct 23, 2021

I use an HTML sanitizer library. Let me see if I've posted it here.

tchart · Oct 23, 2021

Here you go. I'm not sure whether it works on B4A, as I only use it on B4J.

OWASP Java HTML Sanitizer

This is a wrapper for the OWASP Java HTML Sanitizer library. "A fast and easy to configure HTML Sanitizer written in Java which lets you include HTML authored by third-parties in your web application while protecting against XSS." I needed a way to sanitize request inputs to a web app after...

www.b4x.com

Star-Dust · Oct 23, 2021

B4X:

Dim Html As String = $"<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>"$
 
Log(Regex.Replace("<[^>]*>",Html,""))

My First Heading
My first paragraph.
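
For comparison, the same pattern can be tried in plain Java (B4A and B4J run on Java, and the B4X Regex object is, as far as I know, built on Java's regex engine). This is only a minimal sketch: the class name TagStripper and both method names are made up for illustration, and the whitespace tidy-up is an extra step that is not part of the snippet above. Keep in mind that a tag-matching regex leaves entities such as &amp; untouched and can be confused by a ">" inside an attribute value, so for arbitrary HTML a parser such as jsoup is probably the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class TagStripper {
    // Same pattern as the B4X snippet: "<", any run of characters except ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes everything that looks like a tag; the text between tags is left as it is.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Extra step (an assumption, not from the thread): collapse the line breaks
    // and spaces left behind by removed tags into single spaces.
    static String stripTagsAndTidy(String html) {
        return stripTags(html).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String paragraph = "<p class=\"text-align-justify\">Multibrand giovane</p>";
        System.out.println(stripTags(paragraph)); // Multibrand giovane

        String page = "<h1>My First Heading</h1>\n<p>My first paragraph.</p>";
        System.out.println(stripTagsAndTidy(page)); // My First Heading My first paragraph.

        // Limitation: entities are not decoded, only tags are removed.
        System.out.println(stripTags("<b>Tom &amp; Jerry</b>")); // Tom &amp; Jerry
    }
}
```

If the entities also need decoding, that is a separate step; a parser-based approach like the jsoup one mentioned earlier in the thread handles it as part of text extraction.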

luke2012 · Oct 23, 2021

Hi @DonManfred, @tchart and @Star-Dust, and thanks for your suggestions.

I have to choose one and try it.

tigrot · Oct 23, 2021

Star-Dust said:
Log(Regex.Replace("<[^>]*>",Html,""))

I must start using regex.
