B4J Question Replacing foreign accents and signs... why is it so difficult?

wimpie3

Well-Known Member
Licensed User
Longtime User
I've got a UTF-8 string that I want to convert to plain ASCII characters (code points below 128). An example: the French élève has to become eleve (which is readable in ASCII).

A quick search brought me to the existing Java class java.text.Normalizer.

But this class doesn't replace all of them! Some of the characters that don't get replaced: ı ə ß Þ, etc.

Is there really no function that can convert ALL of these foreign signs to the ASCII format?
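A likely explanation is that ı, ə, ß and Þ are letters in their own right rather than a base letter plus an accent, so Unicode defines no decomposition for them and NFD leaves them untouched. The small check below is only a sketch of that: the class name NormalizerGaps and the helper stripMarks are made up for illustration, and the characters are written as Unicode escapes to avoid source-encoding problems.

```java
import java.text.Normalizer;

class NormalizerGaps {
    // Decompose to NFD, then delete the combining diacritical marks.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // e-acute, dotless i, schwa, sharp s, capital thorn
        String[] samples = {"\u00e9", "\u0131", "\u0259", "\u00df", "\u00de"};
        for (String s : samples) {
            String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
            // A length of 2 means "base letter + combining mark";
            // a length of 1 means there was nothing to split off.
            System.out.println(s + " -> NFD length " + nfd.length()
                    + ", stripped: " + stripMarks(s));
        }
    }
}
```

Only the first sample splits into two chars; the other four come back unchanged, which matches what I see.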
 

Daestrum

Expert
Licensed User
Longtime User
Uses JavaObject library
B4X:
Sub AppStart (Args() As String)
    Log(Me.as(JavaObject).RunMethod("defunkyfy",Array("Café au lait")))
End Sub

#if java
import java.text.*;
import java.util.regex.Pattern;

public static String defunkyfy(String unicodeString) {
    // Split accented letters into base letter + combining mark (NFD).
    String normalizedString = Normalizer.normalize(unicodeString, Normalizer.Form.NFD);
    // Remove the combining marks that NFD separated out.
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    String asciiString = pattern.matcher(normalizedString).replaceAll("");
    StringBuilder result = new StringBuilder();
    // Keep only ASCII; anything still above 127 is dropped, not replaced.
    for (char c : asciiString.toCharArray()) {
        if (c <= 127) {
            result.append(c);
        }
    }
    return result.toString();
}
#End If
 

wimpie3

Well-Known Member
Licensed User
Longtime User
Sorry but this uses the normalizer function which does not work all the time.
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
A solution based on Normalizer is also available here: https://www.b4x.com/android/forum/threads/remove-accents-from-string.40429/#content

The Apache StringUtils solution, which I don't think meets your requirements:
B4X:
Private Sub StripAccents(s As String) As String
    Dim jo As JavaObject
    Return jo.InitializeStatic("org.apache.commons.lang3.StringUtils").RunMethod("stripAccents", Array(s))
End Sub
B4X:
#AdditionalJar: commons-lang3-3.9.jar
 

wimpie3

Well-Known Member
Licensed User
Longtime User
Thank you Erel for helping me out. Unfortunately the results are the same as with Normalizer: ı ə ß Þ still appear in the converted string.

I think it's a matter of creating your own conversion table... which will be a lot of work...
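A rough sketch of what such a table could look like, layered on top of Normalizer so the table only has to cover the letters that do not decompose. Everything here is an assumption for illustration: the class name AsciiFolder, the handful of table entries, the chosen replacements (for example "ss" for ß and "Th" for Þ), and the "?" placeholder for anything unmapped. A usable table would need far more entries.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class AsciiFolder {
    // Hypothetical, deliberately tiny table keyed by code point.
    private static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x0131, "i");   // dotless i
        TABLE.put(0x0259, "e");   // schwa (replacement is a judgment call)
        TABLE.put(0x00DF, "ss");  // sharp s
        TABLE.put(0x00DE, "Th");  // capital thorn
        TABLE.put(0x00FE, "th");  // small thorn
        TABLE.put(0x00D8, "O");   // capital o with stroke
        TABLE.put(0x00F8, "o");   // small o with stroke
        TABLE.put(0x00E6, "ae");  // small ae ligature
        TABLE.put(0x0142, "l");   // small l with stroke
    }

    static String toAscii(String input) {
        // Step 1: let Normalizer split off the accents it knows about.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int cp : decomposed.codePoints().toArray()) {
            int type = Character.getType(cp);
            if (cp < 128) {
                out.appendCodePoint(cp);            // already ASCII
            } else if (type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK) {
                // Step 2: drop the combining marks.
            } else {
                // Step 3: fall back to the table; "?" marks a missing entry.
                String replacement = TABLE.get(cp);
                out.append(replacement != null ? replacement : "?");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Cafe au lait" with e-acute, then dotless i, schwa, sharp s, thorn
        System.out.println(toAscii("Caf\u00e9 au lait \u0131 \u0259 \u00df \u00de"));
    }
}
```

Emitting "?" instead of silently dropping unmapped characters makes it easy to spot which entries are still missing from the table.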
 

Daestrum

Expert
Licensed User
Longtime User
I just tried that code I posted with "Café au lait ı ə ß Þ"

The result that was returned was "Cafe au lait"
 

tchart

Well-Known Member
Licensed User
Longtime User
Yes, because the routine filters out all characters that are > 127. That's not what I want. I want the ı to be replaced by an i, etc...
I was looking into this a few months ago. I use a variation of the Java code above (ie similar to defunkyfy).

The problem is that some accented characters (Greek/Cyrillic) take two bytes, whereas most normalizers only process single-byte chars.

I haven't found a solution to the problem yet.
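One way to probe the byte-length idea is to compare UTF-8 sizes with what Normalizer actually does. In the sketch below (the class name ScriptProbe and the helper names are made up), é is also two bytes in UTF-8 yet strips fine, while accented Greek and Cyrillic letters do decompose but their base letters are still not ASCII. That would suggest the missing piece is transliteration of the base letters rather than byte width, though I have only checked these few characters.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class ScriptProbe {
    // Decompose to NFD, then delete the combining diacritical marks.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when every char of the string is in the ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        // Latin e-acute, Greek alpha with tonos, Cyrillic short i
        String[] samples = {"\u00e9", "\u03ac", "\u0439"};
        for (String s : samples) {
            String stripped = stripMarks(s);
            System.out.println(s + ": " + utf8Length(s) + " UTF-8 bytes, stripped to "
                    + stripped + ", ASCII: " + isAscii(stripped));
        }
    }
}
```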
 

wimpie3

Well-Known Member
Licensed User
Longtime User
I think the solution is something like this library

Yes, that library uses three simple lookup tables (and I've already found the symbols giving me problems in one of the tables, so I'm pretty sure this is what I'm looking for, great find!). Perhaps it's not even THAT hard to convert the code into a pure B4J lib...
 