Android Question: SQLite remove duplicate records [SOLVED TWICE]

Roger Daley

Hi All,

Can anyone point me towards an example of removing duplicate records in a db?
I have searched for methods/commands in SQLite and DBUtils but nothing stood out to me.
Trying to match every record against every other record through iteration does not appeal, and I'm sure there is a cleaner way to do this.
NOTE: When I say duplicate records, I mean the data in all columns of both records is the same, not just a matching "Names" column.

Regards Roger
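For reference, two standard SQLite ways to delete exact duplicates, sketched against the Sites table (SiteName, Longitude, Latitude) that appears later in this thread. SQL1 stands for an initialized SQL object; these are generic illustrations, not necessarily the exact code exchanged in the thread.
B4X:
'Single statement: keep the lowest rowid of each duplicate group, delete the rest
SQL1.ExecNonQuery("DELETE FROM Sites WHERE rowid NOT IN (SELECT MIN(rowid) FROM Sites GROUP BY SiteName, Longitude, Latitude)")

'Rebuild variant: copy the distinct rows into a new table, then swap the tables
SQL1.ExecNonQuery("CREATE TABLE SitesTmp AS SELECT DISTINCT * FROM Sites")
SQL1.ExecNonQuery("DROP TABLE Sites")
SQL1.ExecNonQuery("ALTER TABLE SitesTmp RENAME TO Sites")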
 

Roger Daley

It is great that you solved it using either of the two sets of code. But there is an important piece of the puzzle missing: to prevent future duplicates in the existing table, you have to include a primary key composed of the three columns. For that you have to create a temp table with the PRIMARY KEY:
B4X:
strQuery="CREATE TABLE IF NOT EXISTS  sitesBU (SiteName TEXT, Longitude TEXT, Latitude TEXT, PRIMARY KEY (SiteName,Longitude,Latitude))"
then insert all the records from the Sites table into the SitesBU table, then drop the Sites table, and finally rename SitesBU to Sites.
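A sketch of that remaining sequence, assuming an initialized SQL object named SQL1 (a stand-in name; OR IGNORE lets the composite primary key silently drop the duplicate rows):
B4X:
'OR IGNORE skips any row that violates the composite primary key, i.e. the duplicates
SQL1.ExecNonQuery("INSERT OR IGNORE INTO SitesBU SELECT * FROM Sites")
SQL1.ExecNonQuery("DROP TABLE Sites")
SQL1.ExecNonQuery("ALTER TABLE SitesBU RENAME TO Sites")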

Mahares
Fortunately not an issue for me in this case. I bulk load a CSV file which is appended to the existing table, then run this Sub to clean out any duplicates.
Where entries are made manually, they are checked for duplicates at the time of entry.

Good information for others trying to do something similar.

Regards Roger
 

Roger Daley

Use a Long variable: BeginTime = DateTime.Now at the beginning, then at the end subtract it from the final time: Duration = DateTime.Now - BeginTime.
And log it in both Debug and Release mode: Log(Duration).
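A minimal B4X sketch of that pattern (RemoveDuplicates is a hypothetical name for the Sub being timed):
B4X:
Dim BeginTime As Long = DateTime.Now   'DateTime.Now ticks are milliseconds
RemoveDuplicates                       'hypothetical Sub under test
Dim Duration As Long = DateTime.Now - BeginTime
Log("Duration: " & Duration & " ms")   'log it in both Debug and Release mode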

Klaus,

Using the three line code I get a count of 34.
Using the single line code I get a count of 26.

The single line is faster.

Regards Roger

CORRECTION:
After running both versions of the Sub many times I find the time is highly variable and there is NO standout difference. Sorry about the first statement, which was based on a single run.

Roger
 

RB Smissaert

Using the three line code I get a count of 34. Using the single line code I get a count of 26. ... CORRECTION: After running both versions of the Sub many times I find the time is highly variable and there is NO standout difference.

With only 1100 records it is not worth measuring.
In that case it is premature optimization, which we all know is the root of all evil.
You will need something like a few hundred thousand records.

RBS
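For anyone who wants to reproduce such a test, a sketch of one way to build a table with plenty of exact duplicates, again assuming an initialized SQL object SQL1 (the batch API wraps the inserts in a single transaction and runs them off the UI thread):
B4X:
For i = 1 To 300000
    'a narrow Rnd range guarantees many exact duplicates
    SQL1.AddNonQueryToBatch("INSERT INTO Sites VALUES (?, ?, ?)", _
        Array As Object("Site" & Rnd(1, 100), Rnd(1, 50), Rnd(1, 50)))
Next
Dim SenderFilter As Object = SQL1.ExecNonQueryBatch("SQL")
Wait For (SenderFilter) SQL_NonQueryComplete (Success As Boolean)
Log("Test data inserted: " & Success)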
 

Roger Daley

RBS, Klaus, rraswisak, Et Al

I have done a rerun with a CSV file of over 145K records. Five runs of both versions produced the following results.

Single line of code time (ms): 1096, 1100, 1107, 1194, 1129
Three lines of code time (ms): 693, 632, 615, 973, 694

Interesting!

Regards Roger
 

rraswisak

I don't come from a DBA background, I'm just a general database user. Based on my experience, this is one of the disadvantages of a table without any constraint (primary key or indexed column): with a small number of records it works just fine, but with a huge number of records performance can be an issue.

Using a WHERE clause on a non-indexed column takes more time than on an indexed one. Joining two or more tables on non-indexed columns can have performance issues too.
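For example, a sketch (the index column is whatever the WHERE clause or JOIN actually filters on):
B4X:
'an index on the filtered column lets SQLite seek instead of scanning the whole table
SQL1.ExecNonQuery("CREATE INDEX IF NOT EXISTS idx_sites_name ON Sites (SiteName)")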
 

Mahares

Single line of code time (ms): 1096, 1100, 1107, 1194, 1129
Three lines of code time (ms): 693, 632, 615, 973, 694
Interesting!
I suspect the GROUP BY on several fields plays a big role in slowing down the single line query. Although it looks prettier and more elegant to use a one line statement, we are not running a pageantry event. We may be better off with the method that uses the multi line statement, and you can still add the PRIMARY KEY to it to make it complete. I would love to get hold of a database that has thousands of records, thousands of duplicates, and many columns to test it.
 

Mahares

Just append the same data two/three times to get duplicate records
The site your link points to is not accessible. If you or Roger have a very large SQLite database with duplicates and many columns to attach, I will be glad to test it, or you can do it yourself and let us know the results.
 

Roger Daley

If you or Roger have a very large SQLite database with duplicates and many columns to attach, I will be glad to test it ...
Sorry Mahares, I only have small needs. A thousand records with 3 columns is big for me.

Regards Roger
 

Mahares

I created a 100,000-record SQLite table with only 3 columns and some duplicates, on a Galaxy Tab A 7 inch tablet with OS 5.1. Here are the benchmark tests:

Using the single line syntax provided by RB Smissaert:
Number of records created (with duplicates): 100000
Seconds to delete duplicates: 3.664
Record count after deleting duplicates: 99668

Using the multi line syntax provided by rraswisak:
Number of records created (with duplicates): 100000
Seconds to delete duplicates: 2.717
Record count after deleting duplicates: 99717

When I used the multi line syntax plus a PRIMARY KEY, the time was actually 2.619 seconds. When I created a UNIQUE INDEX on the table stripped of duplicates, it took a whopping 5.7 seconds to delete the duplicates and create the index. You decide.
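For reference, the UNIQUE INDEX variant looks along these lines (a sketch, run after the duplicates have been removed; from then on it rejects exact duplicates much like the composite primary key does):
B4X:
SQL1.ExecNonQuery("CREATE UNIQUE INDEX IF NOT EXISTS idx_sites_unique ON Sites (SiteName, Longitude, Latitude)")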
 

AnandGupta

When I used the multi line syntax plus a PRIMARY KEY, the time was actually 2.619 seconds. When I created a UNIQUE INDEX on the table stripped of duplicates, it took a whopping 5.7 seconds to delete the duplicates and create the index. You decide.
It is becoming difficult to decide :)
Better to show a big continuous progress circle and send the process to the background if the user presses back/cancel.
We simply do not know how long it will take on a particular database.

Regards,

Anand
 

RB Smissaert

I created a 100,000-record SQLite table with only 3 columns and some duplicates ... Record count after deleting duplicates: 99668 ... Record count after deleting duplicates: 99717 ... You decide.

Causing some confusion there:
You mention different record numbers remaining after deleting the duplicates. So, did one of the two methods do it wrongly?
Of course the unique index only needs to be created once.

RBS
 

Mahares

Causing some confusion there:
You mention different record numbers remaining after deleting the duplicates
No confusion. The database table was created on the fly using random numbers, because I do not have a table that has duplicate problems. That is why the number changes slightly each time.
It is becoming difficult to decide
The purpose of the test is only to demonstrate that on occasion, although the one line syntax may do the job and look more elegant, the multi line syntax performed better. You can check it out for yourself or ignore the test. The test on a much faster device showed much lower elapsed times for both methods, but still a higher time for the single line syntax.
 

RB Smissaert

No confusion. The database table was created on the fly using random numbers ... That is why the number changes slightly each time.

> using random numbers

OK, confusion cleared.

RBS
 

AnandGupta

No confusion. The database table was created on the fly using random numbers ... although the one line syntax may do the job and look more elegant, the multi line syntax performed better.
OK. Better to use the multi line method, with a progress circle, depending on the size of the data.

Regards,

Anand
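One way to implement that, assuming SQL1 and the single-statement delete sketched earlier; the batch API runs the statement off the UI thread and raises an event when it finishes, so a busy indicator can be shown meanwhile:
B4X:
ProgressDialogShow("Removing duplicates...")   'simple busy indicator
SQL1.AddNonQueryToBatch("DELETE FROM Sites WHERE rowid NOT IN " & _
    "(SELECT MIN(rowid) FROM Sites GROUP BY SiteName, Longitude, Latitude)", Null)
Dim SenderFilter As Object = SQL1.ExecNonQueryBatch("SQL")
Wait For (SenderFilter) SQL_NonQueryComplete (Success As Boolean)
ProgressDialogHide
Log("Duplicates removed: " & Success)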
 