Time for a LOA quiz

Erel

B4X founder
Staff member
Licensed User
Longtime User

Task: count the number of occurrences of each English letter in this list, with 370K words: https://github.com/dwyl/english-words/blob/master/words_alpha.txt

The output should be a list with the letters, sorted by the number of occurrences.

B4X:
Log(SortedLetters)
(ArrayList) [e, i, a, o, n, ....

I've implemented it in two ways. The first is straightforward and robust and the second is more optimized.
Testing with B4J in release mode, the first method takes ~200ms and the second ~25ms.
 

aeric

Expert
Licensed User
Longtime User
Sorry, I cheated with AI.

B4X:
Sub AppStart (Args() As String)
    ShowAnswer
    StartMessageLoop
End Sub

Sub ShowAnswer
    'Dim start As Long = DateTime.Now
    Wait For (CountAndSortLetters) Complete (SortedLetters As List)
    Log(SortedLetters)
    'Dim elapsed As Long = DateTime.Now - start
    'Log("Elapsed=" & elapsed & "ms")
End Sub

Public Sub CountAndSortLetters As ResumableSub '(FilePath As String, FileName As String) As List
    ' 1. Fast file read into a string list
    Dim Lines As List '= File.ReadList(FilePath, FileName)
    
    Dim words As String
    Dim job As HttpJob
    job.Initialize("", Me)
    job.Download("https://raw.githubusercontent.com/dwyl/english-words/refs/heads/master/words_alpha.txt")
    Wait For (job) JobDone(job As HttpJob)
    If job.Success Then
        words = job.GetString
    End If
    job.Release
    
    If words = "" Then Return Lines
    
    Lines.Initialize
    Lines = Regex.Split(CRLF, words)
'    Log("size=" & Lines.Size)
'    Log("first=" & Lines.Get(0))
'    Log("last=" & Lines.Get(Lines.Size - 1))
    
'    For i = 0 To Lines.Size - 1
'        Log(Lines.Get(i))
'    Next
    
    'Start benchmark
    Dim start As Long = DateTime.Now
    
    ' 2. Fixed array mapping to standard ASCII/English lowercase letters ('a' = 97 to 'z' = 122)
    Dim Counts(26) As Int
    
    ' 3. Optimized iteration using fast character/code point inspection
    For Each Word As String In Lines
        Dim Length As Int = Word.Length
        For i = 0 To Length - 1
            Dim CodePoint As Int = Asc(Word.CharAt(i))
            
            ' Bounds check to ensure we only count english alpha characters
            ' If you need it case-insensitive, add a lowercase conversion helper
            If CodePoint >= 97 And CodePoint <= 122 Then 
                Dim Index As Int = CodePoint - 97
                Counts(Index) = Counts(Index) + 1
            End If
        Next
    Next
    
    ' 4. Harness ListOfArrays (LOA) to cleanly package, map, and sort our structure
    Dim LetterTable As ListOfArrays = LOAUtils.CreateEmpty(Array("Letter", "Occurrences", "LetterAndOccurences"))
    
    For i = 0 To 25
        Dim LetterStr As String = Chr(i + 97)
        LetterTable.AddRow(Array(LetterStr, Counts(i), $"${LetterStr} (${Counts(i)})"$))
    Next
    
    ' Sort by "Occurrences" descending (False = Descending)
    LetterTable.Sort("Occurrences", False)
    
    ' Extract the single-dimension sorted column
    Dim SortedLetters As List = LetterTable.GetColumn("Letter")
    'Dim SortedLetters As List = LetterTable.GetColumn("LetterAndOccurences")
    
    ' End benchmark
    Dim elapsed As Long = DateTime.Now - start
    Log("Elapsed=" & elapsed & "ms")
    
    Return SortedLetters
End Sub

B4X:
Elapsed=247ms
(ArrayList) [e, i, a, o, n, s, r, t, l, c, u, p, d, m, h, g, y, b, f, v, k, w, z, x, q, j]

(ArrayList) [e (376455), i (313008), a (295792), o (251596), n (251435), s (250282), r (246143), t (230895), l (194915), c (152980), u (131495), p (113663), d (113192), m (105208), h (92368), g (82627), y (70581), b (63942), f (39238), v (33075), k (26814), w (22407), z (14757), x (10493), q (5883), j (5456)]
 
Last edited:

William Lancee

Well-Known Member
Licensed User
Longtime User
B4X:
Private Sub B4XPage_Created (Root1 As B4XView)
    Root = Root1
    Dim marktime As Long = DateTime.now
    Dim b() As Byte = File.ReadBytes(File.DirApp, "words_alpha.txt")
    Dim cnts(26) As Int
    For Each c As Int In b
        If c >= 97 Then c = c - 97 Else c = c - 65
        If c >= 0 And c <= 25 Then cnts(c) = cnts(c) + 1
    Next
    Dim LOA As ListOfArrays = LOAUtils.CreateEmpty(Array("Letter", "Count"))
    For i = 0 To 25
        LOA.AddRow(Array(Chr(i + 97), cnts(i)))
    Next
    LOA.Sort("Count", False)
    Log(LOA.GetColumn("Letter")) '(ArrayList) [e, i, a, o, n, s, r, t, l, c, u, p, d, m, h, g, y, b, f, v, k, w, z, x, q, j]
    Log(DateTime.Now - marktime) '23
End Sub
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
Great job. Here are my implementations:

B4X:
Sub Slower
    Dim n As Long = DateTime.Now
    Dim words As List = File.ReadList("C:\Users\H\Downloads\words_alpha.txt", "")
    Dim letters As Map = CreateMap()
    For Each word As String In words
        For i = 0 To word.Length - 1
            Dim c As Char = word.CharAt(i)
            Dim count As Int = letters.GetDefault(c, 0)
            letters.Put(c, count + 1)
        Next
    Next
    Dim loa As ListOfArrays = LOAUtils.CreateFromMap(Array("letter", "count"), letters)
    loa.Sort("count", False)
    Log(loa.ToString(0))
    Dim SortedLetters As List = loa.GetColumn("letter")
    Log(SortedLetters)
    Log(DateTime.Now - n) '~200ms
End Sub

Sub Fast
    Dim n As Long = DateTime.Now
    Dim data() As Byte = File.ReadBytes("C:\Users\H\Downloads\words_alpha.txt", "")
    Dim letters(26) As Int
    For i = 0 To data.Length - 1
        Dim c As Byte = data(i)
        If c = 10 Or c = 13 Then Continue
        letters(c - 97) = letters(c - 97) + 1
    Next
    Dim loa As ListOfArrays = LOAUtils.CreateEmpty(Array("letter", "count"))
    For i = 0 To letters.Length - 1
        loa.AddRow(Array(Chr(97 + i), letters(i)))
    Next
    Dim SortedLetters As List = loa.GetColumn("letter")
    Log(SortedLetters)
    Log(DateTime.Now - n) '~25ms
End Sub

The "slower" solution is not bad at all and will properly handle most unicode characters.
 

Chris2

Active Member
Licensed User
Longtime User
From a learning pont of view it would be interesting (to me at least, maybe others too?) if you can explain why the faster versions are faster please?

Is it mostly down to @Erel's 'Fast' version & @William Lancee starting with an array of bytes with File.ReadBytes rather than @Erel's 'Slower' verson & @aeric/AI's using a list of strings (File.ReadList or Regex.Split)?
 

William Lancee

Well-Known Member
Licensed User
Longtime User
Faster because of 3 things:
1. As you surmised, the reading of the file without parsing into words/lines
2. The fast version uses direct indexing of the counts, whereas the slow version uses a map and a look-up of each letter
3. The parsing of words into letters with CharAt(i) is not needed in the fast version.

The A.I. @aeric code reads the file twice, the second time through a slow HTTP web download.
It used direct indexing but still extracts each letter from a word.

The Linux pipe instruction offered by @Sandman will probably? be faster, although the routine requires six passes through of all of the data.
 
Last edited:

William Lancee

Well-Known Member
Licensed User
Longtime User
Important note!

The words_alpha.txt file in the exercise is a pure ASCII file.
The fast version "capitalizes" on this and assumes that each byte is a Ascii character.

Most modern "text files" use UTF-8, which is a Unicode encoding that is backwards-compatible with ASCII.
It uses 1 byte for standard ASCII characters and up to 4 bytes for other Unicode characters.

If the .txt file had multibyte unicode characters, the fast routine would still work statistically.
But only if the number of multi-byte characters was small compared to the total.

The slow version deals with multibyte characters properly.

Conclusion: : don't use the fast version - it is not worth the trouble it can cause!
The 'best general approach' is the AI version without the web download.
 

William Lancee

Well-Known Member
Licensed User
Longtime User
Here is a more general version of getting letter frequencies from any UTF-8 .txt file - even if it has capitals and special characters.
It takes only 58 msecs.

B4X:
Private Sub B4XPage_Created (Root1 As B4XView)
    Root = Root1
    Dim marktime As Long = DateTime.now
    Dim words As List = File.ReadList(File.DirApp, "words_alpha.txt")
    Dim cnts(26) As Int
    Dim c As Int
    For Each w As String In words
        For i = 0 To w.Length -1
            c = Asc(w.CharAt(i))
            If c >= 97 Then c = c - 97 Else c = c - 65
            If c >= 0 And c <= 25 Then cnts(c) = cnts(c) + 1
        Next
    Next
    Dim LOA As ListOfArrays = LOAUtils.CreateEmpty(Array("Letter", "Count"))
    For i = 0 To 25
        LOA.AddRow(Array(Chr(i + 97), cnts(i)))
    Next
    LOA.Sort("Count", False)
    Log(LOA.GetColumn("Letter")) '(ArrayList) [e, i, a, o, n, s, r, t, l, c, u, p, d, m, h, g, y, b, f, v, k, w, z, x, q, j]
    Log(DateTime.Now - marktime) '58 msecs
End Sub
 

William Lancee

Well-Known Member
Licensed User
Longtime User
You can shave 18 msecs from the above post #9 by using ByteConverter to transform the whole file to an array of Char first.
[In all the years I have used B4X this is the first time I used the Char datatype.]
[So, why don't I leave well enough alone? It is in my nature!]

Note: You can see from the progressive time marks that the file reading and Char transformation takes 22 msecs,
the counting of 4,234,910 characters takes 12 msecs, and the creation, sorting, column selection, and displaying of the LOA takes 6 msecs.

Edit: The File reading took 20 msecs and the bc.toChar() conversion took only 2 msecs - not shown below.

B4X:
Private Sub B4XPage_Created (Root1 As B4XView)
    Root = Root1
    Dim bc As ByteConverter
    Dim marktime As Long = DateTime.now
    Dim chars() As Char = bc.ToChars(File.ReadString(File.DirApp, "words_alpha.txt"))
    Log(chars.Length) '4234910
    Log("Time since start: " & (DateTime.Now - marktime))    'Time since start: 22
    Dim cnts(26) As Int
    Dim m As Int
    For Each c As Char In chars
        m = Asc(c)
        If m >= 97 Then m = m - 97 Else m = m - 65
        If m >= 0 And m <= 25 Then cnts(m) = cnts(m) + 1
    Next
    Log("Time since start: " & (DateTime.Now - marktime))    'Time since start: 34

    Dim LOA As ListOfArrays = LOAUtils.CreateEmpty(Array("Letter", "Count"))
    For i = 0 To 25
        LOA.AddRow(Array(Chr(i + 97), cnts(i)))
    Next
    LOA.Sort("Count", False)
    Log(LOA.GetColumn("Letter")) '(ArrayList) [e, i, a, o, n, s, r, t, l, c, u, p, d, m, h, g, y, b, f, v, k, w, z, x, q, j]
    Log("Time since start: " & (DateTime.Now - marktime))    'Time since start: 40
End Sub
 
Last edited:

emexes

Expert
Licensed User
Longtime User
You can shave 18 msecs from the above post #9 by using ByteConverter to transform the whole file to an array of Char first.

I found it was quicker again by dimming the count array large enough to count all byte values 0..255, so that you don't need the tests and offsets to count only the letters, and then the counts for letters are like:

B4X:
For Letter = 1 To 26
    Dim LetterCount As Int = cnts(64 + Letter) + cnts(96 + Letter)    'uppercase + lowercase
    Log("letter " & Chr(64 + Letter) & " occurs " & LetterCount & " times")
Next
 
Last edited:

William Lancee

Well-Known Member
Licensed User
Longtime User
Bravo, @emexes shaves another 7 msecs off the fastest method - from 23 msecs to 16 msecs - a new record!

B4X:
Private Sub Emexes_Version
    Dim marktime As Long = DateTime.now
    Dim b() As Byte = File.ReadBytes(File.DirApp, "words_alpha.txt")
    Log("Time since start: " & (DateTime.Now - marktime))    'Time since start: 5  [TO READ BYTES]
    Dim cnts(256) As Int
    For Each c As Int In b
        cnts(c) = cnts(c) + 1
    Next
    Log("Time since start: " & (DateTime.Now - marktime))    'Time since start: 10    [5 msecs more to COUNT]
    Dim LOA As ListOfArrays = LOAUtils.CreateEmpty(Array("Letter", "Count"))
    For letter = 1 To 26
        LOA.AddRow(Array(Chr(letter + 96), cnts(64 + letter) + cnts(96 + letter)))
    Next
    Log("Time since start: " & (DateTime.Now - marktime))    'Time since start: 12    [2 msecs more to create LOA]
    LOA.Sort("Count", False)
    Log(LOA.GetColumn("Letter")) '(ArrayList) [e, i, a, o, n, s, r, t, l, c, u, p, d, m, h, g, y, b, f, v, k, w, z, x, q, j]
    Log("Time since start: " & (DateTime.Now - marktime))    'Time since start:     16 [4 msecs more to sort and display LOA]
End Sub

Considering that the number of bytes in the file is 4234910 - B4X is a super fast in doing each step.
 

Giovanni_C

Member
Licensed User
The algorithmic part is very interesting. My compliments to everyone.
The sample used is questionable , it contains 370,105 words, which seems to be too many; it contains all the plurals and other variations... okay, more work for the programs.😁
P.S.: respect for Sandman, the wizard of Linux command line
 
Top