Android Question Number of lines in a text file

Sergey_New

Well-Known Member
Licensed User
Longtime User
I need to read a text file.
For the progress bar to work, I need to know the number of lines (scount).
B4X:
    Try
        Dim rd As TextReader
        rd.Initialize(File.OpenInput(Starter.myFolder, FileName))
        scount=Regex.Split(CRLF,rd.ReadAll).Length
        rd.Close
    Catch
        Log(LastException)
    End Try
If the number of lines is less than 90,000, then everything works, but if, for example, there are 1,500,000 lines, the application crashes on line #4.

What restrictions might there be?
 

Mahares

Expert
Licensed User
Longtime User
Did you try using a List? I don't think TextReader is recommended.
B4X:
Dim l As List
l.Initialize
l = File.ReadList(....
 

drgottjr

Expert
Licensed User
Longtime User
one limitation, at least, should be obvious:
it's the amount of available memory (as determined by
the system). if your app crashes, you should see
an exception in the log.
if a file has 1,500,000 lines and each line has an average
of - what - 10 or 20 characters, you are into OOM territory.
there are ways of handling files greater than available memory,
but it's unclear how, if at all, android implements them. at least
when using java routines which may have their own limitations.

there is a system utility called wc. it does read through an entire
file to count lines, words and bytes, but it may do this in
a different way than standard file reading functions work.

here is how it would work (assuming no crash):

B4X:
    Dim p As Phone
    Dim stdout,stderr As StringBuilder
    stdout.Initialize: stderr.initialize
    
    Dim args() As String = Array As String(File.Combine(File.Dirinternal, "some_big_text_file.txt"))
    p.Shell("wc", args, stdout, stderr)
    
    If stderr.Length = 0 Then
        'wc prints lines, words, bytes and the file name, in that order
        Dim results() As String = Regex.Split("\s+", stdout.ToString.Trim)
        MsgboxAsync( results(0) & " lines, " & results(1) & " words, " & results(2) & " bytes", "FYI:")
    Else
        MsgboxAsync("there was a problem: " & stderr.tostring, "FYI:")
    End If
 

Daestrum

Expert
Licensed User
Longtime User
Maybe

B4X:
    Dim tr As TextReader
    Dim linecount As Int
    tr.Initialize(File.OpenInput("","yourFile.txt"))
    Do While tr.ReadLine <> Null
        linecount = linecount + 1
    Loop
    tr.Close
    Log(linecount)
 

RichardN

Well-Known Member
Licensed User
Longtime User
Is the objective to read 1.5m lines into memory? Because that is not only REALLY slow but also asking for an OOM exception.

I would go with @Daestrum's solution or, better still, put your 1.5m strings into an SQLite database and execute the following. It will be an awful lot faster.

B4X:
Dim RecordCount As Int

RecordCount = SQL.ExecQuerySingleResult("SELECT COUNT(Record) FROM MyTable")
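
To get the lines into a table in the first place, one option is to read the file line by line and insert each line inside a single transaction, so that only one line is held in memory at a time. The following is a minimal sketch, not a tested implementation. It assumes an already initialized SQL object named sql1 (hypothetical), reuses the MyTable and Record names from the query above, and the ImportLines sub name is made up for illustration.

```b4x
'Sketch only: sql1 is an assumed, already initialized SQL object.
'MyTable and Record follow the COUNT query above.
Sub ImportLines(Dir As String, FileName As String)
    sql1.ExecNonQuery("CREATE TABLE IF NOT EXISTS MyTable (Record TEXT)")
    Dim tr As TextReader
    tr.Initialize(File.OpenInput(Dir, FileName))
    sql1.BeginTransaction
    Try
        Dim line As String = tr.ReadLine
        Do While line <> Null
            'One row per line; nothing is kept in a List or array.
            sql1.ExecNonQuery2("INSERT INTO MyTable VALUES (?)", Array As Object(line))
            line = tr.ReadLine
        Loop
        sql1.TransactionSuccessful
    Catch
        Log(LastException)
    End Try
    sql1.EndTransaction
    tr.Close
End Sub
```

Wrapping all inserts in one transaction is the usual way to keep bulk inserts fast in SQLite. Without it, each insert is committed separately.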
 

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Or maybe something like this:

B4X:
Sub Class_Globals
    Private RAF As RandomAccessFile
End Sub

Sub GetTextFileLineCount(strFolder As String, strFile As String, btEndOfLineByte As Byte) As Int

    Dim i As Int
    Dim iBytes As Int
    Dim lPosition As Long
    Dim iLines As Int
    
    RAF.Initialize(strFolder, strFile,True)
    
    iBytes = 10000 'could make smaller or larger

    Do While lPosition < RAF.Size
        
        Dim arrBytes(iBytes) As Byte
        iBytes = RAF.ReadBytes(arrBytes, 0, iBytes, lPosition)
        
        For i = 0 To iBytes - 1
            If arrBytes(i) = btEndOfLineByte Then
                iLines = iLines + 1
            End If
        Next
        
        lPosition = lPosition + iBytes
    
    Loop

    RAF.Close 'release the file handle

    Return iLines
    
End Sub

Tested on a 4 MB .txt file with 40,000 lines; it takes about 30 milliseconds.

RBS
 

Sergey_New

Well-Known Member
Licensed User
Longtime User
Did you try to use list.
Thank you!
I determined the number of lines according to your advice.
But the file is only read up to about 30%, and then the process stops. There is still not enough RAM.

here is how it would work (assuming no crash):
Thank you!
I tried your example. I have a message for a large and a small file that there are no problems.
 

drgottjr

Expert
Licensed User
Longtime User
i'm confused; the message is supposed to give you the information that you're looking for, not say that there were no problems. what did the messages say? did you get a line count for a big file or not? if yes, then it is able to read a big file and provide a line count without crashing.
 

DonManfred

Expert
Licensed User
Longtime User
Why is that?
Without seeing your file, that cannot be answered. Upload such a file or, better, create a small project (including the file) that shows the problem.
 

LucaMs

Expert
Licensed User
Longtime User
When dealing with hundreds of thousands of records, the best solution is always to use a database, so that you can filter the data.
In this case, even a SQLite DB would take up a lot of memory, but you could put the DB file on external storage.
 

emexes

Expert
Licensed User
I need to read a text file.
For the progress bar to work, I need to know the number of lines count.

Are you sure about that? Perhaps you can walk around the mountain rather than over it, by getting the file size first, and then using current_file_position / file_size to drive your progress bar. Although you might have to estimate current_file_position by tracking the cumulative length of all strings read up to that point, which might have some issues with line terminator size and variable-number-of-bytes UTF characters.

Or if you know the average line length (in bytes, including line terminator) then you could estimate the number of lines from the file size. For this use case, close enough should be good enough.
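
The file-position idea could be sketched roughly as follows. This is an untested illustration under stated assumptions: pb is a ProgressBar that already exists in the calling module, the ReadWithProgress sub name is hypothetical, and the byte position is only estimated as one byte per character plus one byte for the line terminator. As noted above, that estimate will drift for multi-byte UTF-8 characters or two-byte line terminators.

```b4x
'Sketch only: pb is an assumed ProgressBar declared elsewhere.
Sub ReadWithProgress(Dir As String, FileName As String)
    Dim total As Long = File.Size(Dir, FileName)
    If total = 0 Then Return
    Dim tr As TextReader
    tr.Initialize(File.OpenInput(Dir, FileName))
    Dim bytesRead As Long = 0
    Dim lastPercent As Int = -1
    Dim line As String = tr.ReadLine
    Do While line <> Null
        'Estimate: one byte per character plus one byte for the terminator.
        bytesRead = bytesRead + line.Length + 1
        'Process the line here instead of storing it.
        Dim percent As Int = Min(100, bytesRead * 100 / total)
        If percent <> lastPercent Then
            pb.Progress = percent
            lastPercent = percent
            Sleep(0) 'give the UI a chance to redraw
        End If
        line = tr.ReadLine
    Loop
    tr.Close
    pb.Progress = 100
End Sub
```

The progress bar is only touched when the whole-number percentage changes, so at most about 100 UI updates happen regardless of how many lines the file has.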
 

RB Smissaert

Well-Known Member
Licensed User
Longtime User
40,000 lines was a bit small, so I also tested on a 625 MB file with 5,891,115 lines: that took 3 seconds, with no memory problems that I could see.

RBS
 

Sergey_New

Well-Known Member
Licensed User
Longtime User
Upload such a file
The file is large, you can download it from the Link.
The number of lines in this file is determined without problems:
B4X:
    Dim lst As List
    lst.Initialize
    lst=File.ReadList(Starter.myFolder, FileName)
    Log(lst.Size)
But reading the lines and adding them to, for example, a List causes an out-of-memory error.
 

MicroDrie

Well-Known Member
Licensed User
Longtime User
Let's apply some flip-thinking to this challenge.

Whether the file with unknown contents has 90,000 or 1,500,000 lines, the progress bar cannot possibly show the difference between, for example, lines 67,000 and 67,001 on the screen. Why use such a detailed approach, like Microsoft does with a file copy, and spend a lot of resources (time, processor and memory capacity) on something that usually never works for me?

If the progress bar works in 100 steps, then each step represents 1%. With 1,500,000 lines, 1 step corresponds to 15,000 lines. We don't know exactly how long each block of 15,000 lines is. However, it is reasonable to use a statistically based calculation. Given the large number of lines, assuming an average line length is reasonably feasible. You can then divide the file length by this estimated average line length to get an estimated number of lines for the progress bar. Since the user's patience wears thin towards the end, you can also make a pessimistic estimate, so that the final steps go much faster for the user.

And yes, if the length of the lines in the file is as variable as the weather, you can still recalculate the average line length while processing the file and use it for the next calculation if necessary.

This flip-thinking solution avoids reading the file an extra time, which saves time and uses hardly any memory, and as long as the file is reasonably large, I expect the user will not notice much of this estimation approach.
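
As a rough illustration of the estimate, the sketch below samples the first lines of the file, computes their average length, and divides the file size by it. It is untested and makes assumptions: the EstimateLineCount name and the sampleLines parameter are hypothetical, and line length is approximated as one byte per character plus one byte for the terminator.

```b4x
'Sketch only: estimates the line count from the first sampleLines lines.
Sub EstimateLineCount(Dir As String, FileName As String, sampleLines As Int) As Long
    Dim tr As TextReader
    tr.Initialize(File.OpenInput(Dir, FileName))
    Dim sampled As Int = 0
    Dim sampledBytes As Long = 0
    Dim line As String = tr.ReadLine
    Do While line <> Null And sampled < sampleLines
        'Assumed: one byte per character plus one terminator byte.
        sampledBytes = sampledBytes + line.Length + 1
        sampled = sampled + 1
        line = tr.ReadLine
    Loop
    tr.Close
    If sampled = 0 Then Return 0
    'The file ended within the sample, so the count is exact.
    If line = Null Then Return sampled
    Dim avg As Double = sampledBytes / sampled
    Return File.Size(Dir, FileName) / avg
End Sub
```

A call such as EstimateLineCount(Starter.myFolder, FileName, 1000) would read only the first 1,000 lines. If the line lengths vary a lot, the average can be recalculated during the real pass, as described above.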
 

emexes

Expert
Licensed User
The file is large, you can download it from the Link.
I downloaded this and read Royal.ged using your original code - in B4J, not B4A - and it worked fine.

The regex pattern for line breaks is "\n" so I was a bit dubious about using CRLF, but... it seems to work OK.
 