B4J Question Replacing foreign accents and signs... why is it so difficult?

wimpie3 · Tuesday at 8:15 PM

I've got a UTF-8 string I want to convert to ASCII characters (< 128). An example: the French élève has to become eleve (which is readable in ASCII).

A quick search brought me to the existing Java class java.text.Normalizer.

But this class doesn't replace all of them! Some of the characters that don't get replaced: ı, ə, ß, Þ, etc.

Is there really no function that can convert ALL of these foreign signs to the ASCII format?
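
For context, here is a small Java sketch of why Normalizer leaves these characters alone. NFD only splits a character when Unicode defines a decomposition for it. That is the case for é (e plus a combining acute accent), but ı, ə, ß and Þ are standalone letters with no decomposition. The class and method names are made up for illustration.

```java
import java.text.Normalizer;

class DecompositionDemo {
    // Returns the NFD form of s as a space-separated list of U+XXXX code points,
    // so that combining marks become visible.
    static String codePoints(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codePoints("\u00E9")); // é: splits into base letter + accent
        System.out.println(codePoints("\u00DF")); // ß: stays a single code point
    }
}
```

Running main prints `U+0065 U+0301` for é and `U+00DF` for ß, so stripping combining marks after NFD can only ever help with the first kind of character.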

Andrew (Digitwell) · Tuesday at 8:53 PM

How about StringUtils.stripAccents?

Is there a way to get rid of accents and convert a whole string to regular letters?

Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll() method and replacing letters one by one? Example: Input: orčpžsíáýd Output:

stackoverflow.com

wimpie3 · Tuesday at 9:32 PM

Andrew (Digitwell) said:
How about StringUtils.stripAccents?

Seems like a better solution, but how can I implement this in B4J?

Daestrum · Tuesday at 10:05 PM

Uses JavaObject library

B4X:

Sub AppStart (Args() As String)
    ' Call the inline Java method below through JavaObject.
    Log(Me.as(JavaObject).RunMethod("defunkyfy", Array("Café au lait")))
End Sub

#if java
import java.text.Normalizer;
import java.util.regex.Pattern;

public static String defunkyfy(String unicodeString) {
    // NFD splits "é" into "e" followed by a combining acute accent.
    String normalizedString = Normalizer.normalize(unicodeString, Normalizer.Form.NFD);
    // Remove the combining marks that NFD separated from their base letters.
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    String asciiString = pattern.matcher(normalizedString).replaceAll("");
    StringBuilder result = new StringBuilder();
    // Anything still outside ASCII is dropped, not transliterated.
    for (char c : asciiString.toCharArray()) {
        if (c <= 127) {
            result.append(c);
        }
    }
    return result.toString();
}
#End If

wimpie3 · Wednesday at 5:54 AM

Daestrum said:

Uses JavaObject library

Sorry, but this uses the Normalizer function, which doesn't work for all characters.

Erel · Wednesday at 6:08 AM

A solution based on Normalizer is also available here: https://www.b4x.com/android/forum/threads/remove-accents-from-string.40429/#content

The Apache StringUtils solution, which I don't think meets your requirements:

B4X:

Private Sub StripAccents(s As String) As String
    Dim jo As JavaObject
    Return jo.InitializeStatic("org.apache.commons.lang3.StringUtils").RunMethod("stripAccents", Array(s))
End Sub

B4X:

#AdditionalJar: commons-lang3-3.9.jar

wimpie3 · Wednesday at 11:53 AM

Erel said:
A solution based on Normalizer is also available here: https://www.b4x.com/android/forum/threads/remove-accents-from-string.40429/#content

The Apache StringUtils solution, which I don't think meets your requirements.

Thank you Erel for helping me out. Unfortunately the results are the same as with Normalizer: ı ə ß Þ still appear in the converted string.

I think it's a matter of creating your own conversion table... which will be a lot of work...
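
A minimal Java sketch of that idea, assuming the Normalizer step is kept for everything it can handle and a hand-made table covers only the leftovers. The class name, the four table entries and the "?" fallback are illustrative choices, not part of any library; a real table would need many more entries.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class AsciiFallback {
    // Hypothetical, deliberately tiny conversion table for characters
    // that have no Unicode decomposition.
    static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0131', "i");  // ı
        TABLE.put('\u0259', "e");  // ə
        TABLE.put('\u00DF', "ss"); // ß
        TABLE.put('\u00DE', "Th"); // Þ
    }

    static String toAscii(String input) {
        // Step 1: decompose and strip combining marks (handles é, è, č, ...).
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: replace whatever is still non-ASCII via the table.
        StringBuilder out = new StringBuilder();
        for (char c : stripped.toCharArray()) {
            if (c < 128) {
                out.append(c);
            } else {
                out.append(TABLE.getOrDefault(c, "?")); // "?" marks a missing entry
            }
        }
        return out.toString();
    }
}
```

With this table, `AsciiFallback.toAscii("Café au lait ı ə ß Þ")` gives `Cafe au lait i e ss Th`. Marking unknown characters with "?" instead of dropping them makes gaps in the table easy to spot.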

Daestrum · Wednesday at 11:59 AM

I just tried that code I posted with "Café au lait ı ə ß Þ"

The result that was returned was "Cafe au lait"

wimpie3 · Wednesday at 3:25 PM

Daestrum said:
I just tried that code I posted with "Café au lait ı ə ß Þ"

The result that was returned was "Cafe au lait"

Yes, because the routine filters out all characters that are > 127. That's not what I want. I want the ı to be replaced by an i, etc...

tchart · Wednesday at 7:39 PM

wimpie3 said:
Yes, because the routine filters out all characters that are > 127. That's not what I want. I want the ı to be replaced by an i, etc...

I was looking into this a few months ago. I use a variation of the Java code above (i.e. similar to defunkyfy).

The problem is that some accented characters (Greek/Cyrillic) take two bytes, whereas most normalizers only process single-byte characters.

I haven't found a solution to the problem yet.

tchart · Wednesday at 7:49 PM

I think the solution is something like this library

GitHub - jirutka/unidecode: Transliteration from Unicode to US-ASCII and ISO 8859-2.

Transliteration from Unicode to US-ASCII and ISO 8859-2. - jirutka/unidecode

github.com

wimpie3 · Wednesday at 8:01 PM

tchart said:
I think the solution is something like this library


Yes, that library uses three simple lookup tables (and I've already found the symbols giving me problems in one of the tables, so I'm pretty sure this is what I'm looking for, great find!). Perhaps it's not even THAT hard to convert the code into a pure B4J lib...
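
As a rough Java sketch of how a table-driven transliterator can be organised (this is not the library's actual code or layout, just one common arrangement): the upper byte of a character selects a 256-entry block, and the lower byte selects the replacement inside it. All names and the few entries filled in below are assumptions for illustration.

```java
class BlockTableSketch {
    // One optional 256-entry block per upper byte of a UTF-16 char.
    // Only a few illustrative entries are filled in here.
    static final String[][] BLOCKS = new String[256][];
    static {
        BLOCKS[0x00] = new String[256];
        BLOCKS[0x00][0xDF] = "ss"; // U+00DF ß
        BLOCKS[0x00][0xDE] = "Th"; // U+00DE Þ
        BLOCKS[0x01] = new String[256];
        BLOCKS[0x01][0x31] = "i";  // U+0131 ı
        BLOCKS[0x02] = new String[256];
        BLOCKS[0x02][0x59] = "e";  // U+0259 ə
    }

    static String lookup(char c) {
        if (c < 128) {
            return String.valueOf(c); // ASCII passes through unchanged
        }
        String[] block = BLOCKS[c >> 8];
        String replacement = (block == null) ? null : block[c & 0xFF];
        return (replacement == null) ? "" : replacement; // unknown chars are dropped
    }

    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            out.append(lookup(c));
        }
        return out.toString();
    }
}
```

Because the lookup is just two array indexes, the same structure should translate fairly directly to plain arrays in B4J.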
