Can anyone point me towards an example of removing duplicate records in a db?
I have searched for methods/commands in SQLite and DBUtils but nothing stood out to me.
Trying to match every record against every other record through iteration does not appeal, and I'm sure there is a cleaner way to do this.
NOTE: When I say duplicate records, I mean the data in all columns of both records is the same not just matching the "Names" column.
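One common SQLite pattern for this, sketched below with the table and column names that appear later in this thread, is a single DELETE that keeps one rowid per group of identical rows:
B4X:
'Sketch only: keeps one row (the lowest rowid) per group of identical rows.
'Assumes an initialized SQL object named SQL1 and a table named Sites.
SQL1.ExecNonQuery("DELETE FROM Sites WHERE rowid NOT IN (SELECT MIN(rowid) FROM Sites GROUP BY SiteName, Longitude, Latitude)")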
It is great that you solved it using either of the two sets of code, but there is an important piece of the puzzle missing. To prevent future duplicates in the existing table you have to include a primary key composed of the three columns. For that you have to create a temp table with the PRIMARY KEY:
B4X:
strQuery="CREATE TABLE IF NOT EXISTS sitesBU (SiteName TEXT, Longitude TEXT, Latitude TEXT, PRIMARY KEY (SiteName,Longitude,Latitude))"
Then insert all the records from the Sites table into the sitesBU table, drop the Sites table, and finally rename sitesBU to Sites.
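A minimal sketch of those three steps, assuming the SQL object is named SQL1 and using the table and column names from the query above:
B4X:
'Copy the records into the keyed table; rows that would violate the
'composite PRIMARY KEY (i.e. duplicates) are silently skipped.
SQL1.ExecNonQuery("INSERT OR IGNORE INTO sitesBU SELECT SiteName, Longitude, Latitude FROM Sites")
'Remove the original table and put the cleaned copy in its place.
SQL1.ExecNonQuery("DROP TABLE Sites")
SQL1.ExecNonQuery("ALTER TABLE sitesBU RENAME TO Sites")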
Mahares
Fortunately not an issue for me in this case. I bulk load a CSV file which is appended to the existing table, then run this Sub to clean out any duplicates.
Where entries are made manually, they are checked for duplicates at the time of entry.
Good information for others trying to do something similar.
Use a Long variable, BeginTime = DateTime.Now, at the beginning, and at the end subtract it from the finishing time: Duration = DateTime.Now - BeginTime.
Then Log it, Log(Duration), in both Debug and Release mode.
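In B4X that pattern looks something like this (DeleteDuplicates is a placeholder for whichever Sub is being timed):
B4X:
Dim BeginTime As Long = DateTime.Now
DeleteDuplicates 'placeholder for the Sub being measured
Dim Duration As Long = DateTime.Now - BeginTime
Log("Duration: " & Duration & " ms") 'compare this in both Debug and Release mode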
Using the three line code I get a count of 34.
Using the single line code I get a count of 26.
The single line is faster.
Regards Roger
CORRECTION:
After running both versions of the Sub many times I find the time is highly variable and there is NO stand-out difference. Sorry about the first statement, which was based on a single run.
With only 1100 records it is not worth measuring.
In that case it is premature optimization, which we all know is the root of all evil.
You will need something like a few hundred thousand records.
I don't come from a DBA background, I'm just a general database user. Based on my experience, this is one of the disadvantages of a table that has no constraint (primary key or indexed column): with a small number of records it works just fine, but with a huge number of records performance can become an issue.
A WHERE clause on a non-indexed column will take more time than one on an indexed column, and joining two or more tables on non-indexed columns can have performance issues too.
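For example, an index covering the columns used in the WHERE clause or join can be added with one statement (the index name idxSites is illustrative):
B4X:
'Speeds up lookups and joins on these columns, at the cost of some disk
'space and slightly slower inserts.
SQL1.ExecNonQuery("CREATE INDEX IF NOT EXISTS idxSites ON Sites (SiteName, Longitude, Latitude)")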
I suspect the GROUP BY on several fields plays a big role in slowing down the single line query. Although a one line statement looks prettier and more elegant, we are not running a pageantry event. We may be better off with the method that uses the multi line statement, and you can still add the PRIMARY KEY to it to make it complete. I would love to get hold of a database that has thousands of records, thousands of duplicates and many columns to test it.
The site on your link is not accessible. If you or Roger has a very large SQLite database with duplicates and many columns to attach, I will be glad to test or you can do it yourself and let us know the results
I created a 100,000 record SQLite table with only 3 columns, containing some duplicates, on a Galaxy Tab A 7 inch tablet with OS 5.1. Here are the benchmark tests:
Using the single line syntax provided by RB Smissaert:
Number of recs created with dupl.: 100000
Secs to delete duplicates: 3.664
Rec count after del dupl.: 99668
Using the multi line syntax provided by rraswisak:
Number of recs created with dupl.: 100000
Secs to delete duplicates: 2.717
Rec count after del dupl.: 99717
When I used the multi line syntax plus the PRIMARY KEY, the time was actually 2.619 seconds. When I created a UNIQUE INDEX on the table stripped of duplicates, it took a whopping 5.7 seconds to delete the duplicates and create the index. You decide.
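For reference, such a unique index would look something like the statement below; once it exists, any later INSERT of an exact duplicate fails (or is skipped when using INSERT OR IGNORE):
B4X:
'Run this only after the duplicates have been deleted, otherwise creation fails.
SQL1.ExecNonQuery("CREATE UNIQUE INDEX IF NOT EXISTS idxSitesUnique ON Sites (SiteName, Longitude, Latitude)")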
It is becoming difficult to decide.
Better to show a big continuous progress circle and send the process to the background if the user presses back/cancel.
We simply do not know how long it will take on a particular database.
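One way to do that in B4A is the asynchronous batch API, which runs the statement off the main thread while an indeterminate dialog is shown. A sketch, assuming an initialized SQL object named SQL1 and the single line DELETE from earlier in the thread:
B4X:
ProgressDialogShow2("Removing duplicates...", True) 'indeterminate, cancelable dialog
SQL1.AddNonQueryToBatch("DELETE FROM Sites WHERE rowid NOT IN (SELECT MIN(rowid) FROM Sites GROUP BY SiteName, Longitude, Latitude)", Null) 'no bound parameters
Dim SenderFilter As Object = SQL1.ExecNonQueryBatch("SQL1")
Wait For (SenderFilter) SQL1_NonQueryComplete (Success As Boolean)
ProgressDialogHide
Log("Duplicates removed: " & Success)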
Causing some confusion there:
You mention different record counts remaining after deleting the duplicates. So, did one of the 2 methods do it wrongly?
Of course the unique index only needs to be created once.
No confusion. The database table was created on the fly using random numbers, because I do not have a table that has duplicate problems. That is why the number changes slightly each time.
The purpose of the test is only to demonstrate that on occasion, although the one line syntax may do the job and look more elegant, the multi line syntax performed better. You can check it out for yourself or ignore the test. The test on a much faster device showed much lower elapsed times for both methods, but still a higher time for the single line syntax.
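A sketch of how such a throwaway test table can be filled (the 1000-value range passed to Rnd is arbitrary and simply guarantees that some exact duplicates occur):
B4X:
'Assumes the Sites table from earlier in the thread already exists.
SQL1.BeginTransaction 'a single transaction keeps the bulk insert fast
For i = 1 To 100000
    Dim n As Int = Rnd(0, 1000) 'small range so identical rows are generated
    SQL1.ExecNonQuery2("INSERT INTO Sites VALUES (?, ?, ?)", Array As Object("Site" & n, n & ".1", n & ".2"))
Next
SQL1.TransactionSuccessful
SQL1.EndTransaction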