B4J Question Reading in large amounts of data in DB

strupp01 · Jun 27, 2017

How can I import a file with 40,000 - 100,000 lines as fast as possible into a database. Data per line are separated by commas.
"LOAD DATA INFILE" probably does not work. With "LoadCSV" you can only load into a list? Or does it go directly into a DB?
Who can help and can show me the command?

strupp01 · Jun 30, 2017

Here's the code

keirS · Jun 30, 2017

I have had a look and indeed your code runs incredibly slowly in the debugger.

This is pretty similar to what you did.

B4X:

#Region  Project Attributes
    #MainFormWidth: 600
    #MainFormHeight: 400
    #AdditionalJar: sqlite-jdbc-3.7.2
#End Region

Sub Process_Globals
    Private fx As JFX
    Private MainForm As Form
    Private TableView1 As TableView
    Dim T1 As Long
    Dim SQL1 As SQL

End Sub

Sub AppStart (Form1 As Form, Args() As String)
    MainForm = Form1
    MainForm.RootPane.LoadLayout("Main")
      MainForm.SetFormStyle("UTILITY")
    MainForm.Title = "B4J HowTo Test"
    MainForm.Show
    '
    'SQL1.InitializeSQLite("d:\B4J_SQL_Schnell_Test_Neu\TableViewImportCSV\Daten\","Schlaf_gut_Test.db", True)
    SQL1.InitializeSQLite(File.DirApp & "\SQL_Datenbank\","Schlaf_gut_Test.db", True)
    SQL1.ExecNonQuery("CREATE TABLE IF NOT EXISTS BRP (Datum_Aufzeichnung Text, Datum TEXT, Uhrzeit TEXT, Plus_Zeit Text, Flow_Rate Text, Mask_Pressure Text)")

    AppInit

'    Daten_To_DB
End Sub

'Init controls, settings etc.
Sub AppInit
    FillTableView
End Sub

Sub FillTableView

    Dim DB_Daten(3) As String
    Dim su As StringUtils
    Dim l As List

    Dim n As Long = DateTime.Now
    T1=DateTime.Now

    l = File.ReadList(File.DirApp  , "BRP.txt")
    Log("List Size: " & l.Size)
    'ProgressDialogShow2("Inserting data...", False)
    Log("Read File: ")
    Log((DateTime.Now - T1)/1000)
    Log((DateTime.Now - T1)/1000)
    For i = 0 To l.Size-1
        Dim str() As String = Regex.split(",",l.Get(i))
        SQL1.AddNonQueryToBatch("INSERT INTO BRP VALUES (?,?,?,?,?,?)", Array As Object("1", "2", "3", str(0), str(1), str(2)))
    Next

    Log("Read File Loop ")
    Log((DateTime.Now - T1)/1000)

     SQL1.ExecNonQueryBatch("SQL")
     'Causes massive performance hit in debug mode!
     'Wait For  SQL_NonQueryComplete (Success As Boolean)


    'Log("NonQuery:")
    Log((DateTime.Now - T1)/1000)



    'MainForm.Close

End Sub

'Close the app - add any specifics here
Sub AppClose
    MainForm.Close
End Sub


'Handle form closing via system close (top right X button)
Sub MainForm_CloseRequest (EventData As Event)
    AppClose
End Sub

Sub SQL_NonQueryComplete (Success As Boolean)
    Log("NonQuery: " & Success)
    Log((DateTime.Now - T1)/1000)
    SQL1.Close
End Sub

In the debug mode at least Wait For for has a massive hit on performance. Perhaps @Erel would like to comment on why this is. Take out the Wait For and use the NonQueryComplete sub and it runs pretty fast, or run your current code in release mode.

OliverA · Jun 30, 2017

Ok, some of my results using your code and the file you provided:

Win 10 via Parallels on a i7 Mac Mini with 16GB RAM and SSD drive:

First in B4J in RELEASE mode:

Run#1: 4855, 30611, 31047 -> Batch execution time (T2-T1): 25756
Run#2: 4661, 29557, 29776 -> 24896
Run#3: 4408, 28124, 28541 -> 23716

From command line on same machine:

Run#1: 52195, 69451, 69868 -> 17256
Run#2: 45068, 62978, 63394 -> 17910
Run#3: 47548, 64705, 65107 -> 17157

Ok, first, the terminal in an emulation sucks (that's why the first time is so much longer from the command line than B4J). Yet notice, the difference between between the second time and the first time (the ExecNonQueryBatch) is from 6 to 8 seconds faster when the program is executed from the command line then from B4J itself. Keep that in mind when bench marking.

Ok, the same program ran from the command line on an i3, regular hard drive, 6GB memory, Windows 10:

Run#1: 10767, 22726, 23049
Run#2: 10814, 22650, 22911
Run#3: 10601, 22564, 22903

Now we are sub 12 seconds (virtualization hurts me here).

For fun, I've created a non-ui version of the app (attached) and ran it on the i3-3320:

Run#1: 1782, 14083, 14208
Run#2: 1609, 13466, 16045
Run#3: 1558, 16045, 16154

A tad over 12 seconds.

Now for a change, I have a little Celeron 1037U, 4GB RAM, Ubuntu 16.04 LTS and an SSD and it gets (using non-ui app):

Run#1: 1577, 10868, 11056
Run#2: 1472, 10830, 11020
Run#3: 1548, 11343, 11533

Yeah, sub 10 seconds!

Some notes: DEBUG mode would not work for me. So all results are release mode code. There is a difference even in release mode when run from within B4J or directly from the command line. Virtualization will hurt your bench marks. SSD will also bring down your time, more so than CPU, since, in my case above, the i3 is almost twice as fast as the Celeron.

Question: What is your D: drive? A local drive? A external USB drive? A network drive? This could also make a difference.

strupp01 · Jun 30, 2017

My D: drive is an internal hard drive partitioned into drive C: and D:
However, I can not test more tomorrow. Must end today unfortunately.
Thank you for your effort.

Ed Brown · Jul 3, 2017

This may be too late but have you tried using the latest SQLite jar file? The one you have in code dates back to 2010. The latest jar is sqlite-jdbc-3.19.3.jar and is just over twice (2x) faster than the 2010 version. You can get the latest jar file from here https://bitbucket.org/xerial/sqlite-jdbc/downloads/ NOTE: this might not be the best place to get it but it's where I get mine from. You can review the performance increases in SQLite here https://sqlite.org/cpu.html

Also, depending on how the AddNonQueryToBatch method works AND if you are having issues with code breaking/crashing then this might help. There is a documented limitation on the size of a SQL statement for SQLite which the current default is 1,000,000 characters refer to #3 in this link https://sqlite.org/limits.html. If the AddNonQueryToBatch creates a simple sql statement then adding 400k rows will quickly reach this defined limit. If you are experiencing this problem then it might be useful to break the number of updates/inserts into smaller chunks.

strupp01 · Jul 3, 2017

Thank you for the information. I will try.

OliverA · Jul 5, 2017

Ed Brown said:
The latest jar is sqlite-jdbc-3.19.3.jar and is just over twice (2x) faster than the 2010 version

Yes and no. On my i3 system, the version 3.19.3 did bring my times down to sub 7 second run times (using a non-ui program). On the Celeron 1037U system, the times INCREASED by over a second. Interesting...

Using some instructions found in this stackoverflow.com post, I was able to bring my number to below 5.5 seconds on my i3 system using a standard hard drive. The additional code is

B4X:

SQL1.ExecNonQuery("PRAGMA page_size = 16384")
SQL1.ExecNonQuery("PRAGMA synchronous = OFF")
SQL1.ExecNonQuery("PRAGMA journal_mode = MEMORY")

16384 for the page size seems to be a sweet spot for my system. Please note at this point we are talking fractions of seconds in difference. See the SQLite documentation on this pramga for notes on when the page size becomes effective. Total memory consumption in my case seemed to hover around 300MB, no matter the page size chosen.

I've attached the code I used for testing. Please note that it is just a slightly modified version of the various code that was provided by other members of this discussion thread. I'm also amazed (lucky?) that nobody called my out on my bad math (by a factor of 1000) on one of my previous posts.

keirS · Jul 5, 2017

OliverA said:

B4X:

SQL1.ExecNonQuery("PRAGMA synchronous = OFF")
SQL1.ExecNonQuery("PRAGMA journal_mode = MEMORY")

[

Those settings are a recipe to corrupt your DB according to the SQLite authors.

OliverA · Jul 6, 2017

keirS said:
Those settings are a recipe to corrupt your DB according to the SQLite authors.

Correct. I guess I should have made a point for that, but please note that the source that I used did warn users as follows:

By setting synchronous = OFF, we are instructing SQLite to simply hand-off the data to the OS for writing and then continue. There's a chance that the database file may become corrupted if the computer suffers a catastrophic crash (or power failure) before the data is written to the platter.

Consider storing the rollback journal in memory by evaluating PRAGMA journal_mode = MEMORY. Your transaction will be faster, but if you lose power or your program crashes during a transaction you database could be left in a corrupt state with a partially-completed transaction

The rationale for using them is given as:

Let's combine the previous two optimizations. It's a little more risky (in case of a crash), but we're just importing data (not running a bank)

So yes, they are risky, but they may be useful in certain circumstances (in this case fast import of data). Please note that I did not advocate (nor does the link) that the db should be run with these settings 24/7.

Ed Brown · Jul 6, 2017

Hi @OliverA, thanks for info. I do agree with @keirS about the volatility of using those two pragmas.

Just reading details for the page_size pragma while looking at your code sample. You need to VACUUM the database after executing the page_size pragma otherwise it will have no effect.

B4J Question Reading in large amounts of data in DB

strupp01

Active Member

strupp01

Active Member

Attachments

keirS

Well-Known Member

OliverA

Expert

Attachments

strupp01

Active Member

Ed Brown

Active Member

strupp01

Active Member

OliverA

Expert

Attachments

keirS

Well-Known Member

OliverA

Expert

Ed Brown

Active Member

Similar Threads