Android Question Regex.Matcher() ...

StephenRM

Member
The following is the text. I am trying to extract the Hebrew/Greek text and the English text:

stext = '<heb onclick="w(1,2120)" onmouseover="iw(1,2120)">זֶ֣ה</heb><gloss>this</gloss> <heb onclick="w(1,2121)" onmouseover="iw(1,2121)">סֵ֔פֶר</heb><gloss>letter</gloss> <heb onclick="w(1,2122)" onmouseover="iw(1,2122)">תֹּולְדֹ֖ת</heb><gloss>generations</gloss> <heb onclick="w(1,2123)" onmouseover="iw(1,2123)">אָדָ֑ם</heb><gloss>Adam</gloss> <heb onclick="w(1,2124)" onmouseover="iw(1,2124)">בְּ</heb><gloss>in</gloss> <heb onclick="w(1,2125)" onmouseover="iw(1,2125)">יֹ֗ום</heb><gloss>day</gloss> <heb onclick="w(1,2126)" onmouseover="iw(1,2126)">בְּרֹ֤א</heb><gloss>create</gloss> <heb onclick="w(1,2127)" onmouseover="iw(1,2127)">אֱלֹהִים֙</heb><gloss>god [pl.]</gloss> <heb onclick="w(1,2128)" onmouseover="iw(1,2128)">אָדָ֔ם</heb><gloss>human, mankind</gloss> <heb onclick="w(1,2129)" onmouseover="iw(1,2129)">בִּ</heb><gloss>in</gloss> <heb onclick="w(1,2130)" onmouseover="iw(1,2130)">דְמ֥וּת</heb><gloss>likeness</gloss> <heb onclick="w(1,2131)" onmouseover="iw(1,2131)">אֱלֹהִ֖ים</heb><gloss>god [pl.]</gloss> <heb onclick="w(1,2132)" onmouseover="iw(1,2132)">עָשָׂ֥ה</heb><gloss>[he]+ make</gloss> <heb onclick="w(1,2133)" onmouseover="iw(1,2133)">אֹתֹֽו</heb><gloss>[object marker] +[him]</gloss>'
Dim mat1, mat2 As Matcher
Regex:
   Try
        mat1 = Regex.Matcher(">(\W+)\S<",stext)                       'Hebrew / Greek Text
        mat2 = Regex.Matcher("<gloss>(\D+)</gloss>",stext)    'English Text
        Do While (mat1.Find)
            If mat1.GroupCount = mat2.GroupCount Then
                For g = 1 To mat1.GroupCount
                    retText = retText & mat1.Group(g) & " <sub> " & mat2.Group(g) & " </sub> "
                Next
            End If
        Loop
    Catch
        Log(LastException)
    End Try

But mat1.Find returns False.
 

sirjo66

Well-Known Member
Licensed User
Longtime User
Try with
B4X:
mat1 = Regex.Matcher("<heb.*?>(.*?)<\/heb>", stext)
or
B4X:
mat1 = Regex.Matcher("<heb[^>]*>([^<]*)<\/heb>", stext)

For the English text, the correct pattern is
B4X:
mat2 = Regex.Matcher("<gloss>(.*?)<\/gloss>",stext)
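
A minimal sketch of how either pattern might be used, assuming a shortened sample string (the `sample` and `m` variables are made up for illustration). Each successful call to `Find` advances to the next match, and `Group(1)` returns the text captured by the first pair of parentheses. `GroupCount` is the number of capture groups in the pattern, not the number of matches, so it is not useful as a loop bound here.

```b4x
' Hypothetical shortened sample; the real stext is much longer.
Dim sample As String = $"<heb onclick="w(1,2120)">זֶ֣ה</heb><gloss>this</gloss> <heb onclick="w(1,2121)">סֵ֔פֶר</heb><gloss>letter</gloss>"$
Dim m As Matcher = Regex.Matcher("<heb[^>]*>([^<]*)</heb>", sample)
Do While m.Find
    Log(m.Group(1))   ' the text between <heb ...> and </heb>
Loop
```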
 

JohnC

Expert
Licensed User
Longtime User
I did not test the patterns in the previous post, but this is what ChatGPT says:

The issue lies with the regular expression you're using to match the Hebrew/Greek text. Specifically, the regex ">(\W+)\S<" does not describe the structure of your HTML-like tags. A likely cause is that \W matches only non-word characters: a Unicode-aware regex engine treats Hebrew letters and vowel points as word characters, so (\W+) cannot match them. Let's modify the regular expression to capture the Hebrew/Greek content between the <heb> tags directly.

Here's a revised version of your code with corrected regular expressions:

B4X:
Dim mat1, mat2 As Matcher
Try
    ' Hebrew/Greek text inside <heb>...</heb>
    mat1 = Regex.Matcher("<heb[^>]*>([^<]+)</heb>", stext)
    ' English gloss text inside <gloss>...</gloss>
    mat2 = Regex.Matcher("<gloss>([^<]+)</gloss>", stext)

    Do While (mat1.Find And mat2.Find)
        ' Concatenate Hebrew/Greek text with corresponding English gloss
        retText = retText & mat1.Group(1) & " <sub> " & mat2.Group(1) & " </sub> "
    Loop

Catch
    Log(LastException)
End Try

Explanation of changes:

  1. For Hebrew/Greek text: The regular expression "<heb[^>]*>([^<]+)</heb>" captures the text inside the <heb> tags. The [^>]* part ensures that any attributes (like onclick, onmouseover) are ignored, and ([^<]+) captures the text between the tags.
  2. For English gloss text: The regular expression "<gloss>([^<]+)</gloss>" captures the text inside the <gloss> tags, ensuring only the English text is matched.

Important notes:

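  • Ensure that both matchers (mat1 and mat2) are synchronized by running mat1.Find and mat2.Find together in the same loop. This pairing assumes that every <heb> element has exactly one corresponding <gloss> element and that neither is empty.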
  • The mat1.Group(1) and mat2.Group(1) refer to the captured text within the first parentheses of each regex.
This should fix the issue and allow you to extract both the Hebrew/Greek text and the English gloss text.
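
One way to avoid keeping two matchers in step is to capture each pair with a single pattern. The following is only a sketch, not tested against the full text: it assumes every <heb> element is immediately followed by its <gloss> (optionally separated by whitespace), as in the sample text, and the names `pairs` and `sb` are made up for illustration.

```b4x
' Assumes stext already holds the markup from the first post.
Dim sb As StringBuilder
sb.Initialize
' Group 1: text inside <heb ...>; Group 2: text inside the <gloss> that follows it
Dim pairs As Matcher = Regex.Matcher("<heb[^>]*>([^<]*)</heb>\s*<gloss>([^<]*)</gloss>", stext)
Do While pairs.Find
    sb.Append(pairs.Group(1)).Append(" <sub> ").Append(pairs.Group(2)).Append(" </sub> ")
Loop
Log(sb.ToString)
```

Because both values come from the same match, a <heb> without a following <gloss> is skipped rather than shifting every later pair by one.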
 

StephenRM

Member
Thank you so much for the reply, it worked...:D
 