Android Question SQLite remove duplicate records.[SOLVED TWICE]

Roger Daley

Well-Known Member
Licensed User
Longtime User
Hi All,

Can anyone point me towards an example of removing duplicate records in a db?
I have searched for methods/commands in SQLite and DBUtils but nothing stood out to me.
Trying to match every record with every other record through iteration does not appeal and I'm sure there is a cleaner way do this.
NOTE: When I say duplicate records, I mean the data in all columns of both records is the same not just matching the "Names" column.

Regards Roger
 

RB Smissaert

Well-Known Member
Licensed User
Longtime User
How practical is your code if the table has many columns, say 50 or 100 where you have to GROUP by each column individually.

Can't see a major problem, but if your table has that many columns then probably the schema can be designed better.

RBS
 
Upvote 0

rraswisak

Active Member
Licensed User
Longtime User
DELETE
FROM duptest
WHERE rowid NOT IN
(SELECT MAX(rowid)
FROM duptest4
GROUP BY id, name, age)

this query work with other case which is a table has column with unique value inside (say, primary or auto_increment value). As in your sql sample rowid has unique value

TS (Thread Starter, @Roger Daley ) said in the first post he want to eliminate duplicate record value match in all column, so your sql does not work in this case.
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
this query work with other case which is a table has column with unique value inside (say, primary or auto_increment value). As in your sql sample rowid has unique value

TS (Thread Starter, @Roger Daley ) said in the first post he want to eliminate duplicate record value match in all column, so your sql does not work in this case.

Don't quite get what you are saying there and I think it does work, but feel free to prove me wrong with a test case.

RBS
 
Upvote 0

Mahares

Expert
Licensed User
Longtime User
so your sql does not work in this case.
Don't quite get what you are saying there an
The code presented by RBS works indeed because each table record has a unique row id regardless whether it has duplicates or not unless the table was created WITHOUT ROWID. Note that he uses NOT IN in his syntax. I was just wondering how efficient it is if you have a table with very large number of records and many columns in the case of a table comprising personal data like: last name, first name, sex, birth date, occupation, etc., etc. I agree the table should be optimized to avoid large number of columns, but sometimes it is unavoidable.
 
Upvote 0

rraswisak

Active Member
Licensed User
Longtime User
There is no doubt that the query is work as i state before:

this query work with other case which is a table has column with unique value inside (say, primary or auto_increment value)

Information as describe here:

SQLite primary key and rowid table
When you create a table without specifying the WITHOUT ROWID option, SQLite adds an implicit column called rowid that stores 64-bit signed integer. The rowid column is a key that uniquely identifies the rows in the table. Tables that have rowid columns are called rowid tables.

WITHOUT ROWID is an option on how table created as shown in this diagram:

create-table-stmt.gif


So it's depend on how the table was created, if the tabled created with rowid then the query works fine, as opposite one it doesn't.

By using distinct * no matter table created with or without rowid will keep works.
 
Last edited:
Upvote 0

Roger Daley

Well-Known Member
Licensed User
Longtime User
Hi All,

The preceding "discussion" has been an education for me. As such I tried the RBS one line solution, replacing three lines of code in my Sub.
After some fumbling with the syntax all worked well. It only deletes records if they match on all three columns which is correct.
I have attached the db as sites.zip to show the structure.

Many thanks for all the input.
Regards Roger

B4X:
Sub SQLDuplicate(SQLD As SQL, TableName As String)
    
    SQLD.BeginTransaction
    
    SQLD.ExecNonQuery("DELETE FROM "&TableName &" WHERE rowid Not IN (Select Max(rowid) FROM "& TableName & " GROUP BY SiteName, Longitude, Latitude)")
    
'    SQLD.ExecNonQuery("CREATE TABLE table_temp AS SELECT DISTINCT * FROM "& TableName)
'    SQLD.ExecNonQuery("DROP TABLE "& TableName)
'    SQLD.ExecNonQuery("ALTER TABLE table_temp RENAME TO "& TableName)
    SQLD.TransactionSuccessful
    SQLD.EndTransaction
End Sub
 

Attachments

  • sites.zip
    952 bytes · Views: 222
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Hi All,

The preceding "discussion" has been an education for me. As such I tried the RBS one line solution, replacing three lines of code in my Sub.
After some fumbling with the syntax all worked well. It only deletes records if they match on all three columns which is correct.
I have attached the db as sites.zip to show the structure.

Many thanks for all the input.
Regards Roger

B4X:
[/COLOR][/COLOR]
[COLOR=rgb(255, 0, 0)][COLOR=rgb(0, 0, 0)]Sub SQLDuplicate(SQLD As SQL, TableName As String)
   
    SQLD.BeginTransaction
   
    SQLD.ExecNonQuery("DELETE FROM "&TableName &" WHERE rowid Not IN (Select Max(rowid) FROM "& TableName & " GROUP BY SiteName, Longitude, Latitude)")
   
'    SQLD.ExecNonQuery("CREATE TABLE table_temp AS SELECT DISTINCT * FROM "& TableName)
'    SQLD.ExecNonQuery("DROP TABLE "& TableName)
'    SQLD.ExecNonQuery("ALTER TABLE table_temp RENAME TO "& TableName)
    SQLD.TransactionSuccessful
    SQLD.EndTransaction
End Sub

Nice to hear it works fine.
Maybe you could tell us which one was the faster solution.
You may need to compare on a larger sample.

RBS
 
Upvote 0

Mahares

Expert
Licensed User
Longtime User
all worked well. It only deletes records if they match on all three columns which is correct.

That is great that you solved it using any of two sets of codes. But there is an important piece of the puzzle missing. To prevent future duplicates from the existing table you have to include a primary key composed of the three columns. For that you have to create a temp table with the PRIMARY key:
B4X:
strQuery="CREATE TABLE IF NOT EXISTS  sitesBU (SiteName TEXT, Longitude TEXT, Latitude TEXT, PRIMARY KEY (SiteName,Longitude,Latitude))"
then insert all the records from Sites table to SitesBU table, then drop table Sites, and finally rename SItesBU to Sites.
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
That is great that you solved it using any of two sets of codes. But there is an important piece of the puzzle missing. To prevent future duplicates from the existing table you have to include a primary key composed of the three columns. For that you have to create a temp table with the PRIMARY key:
B4X:
strQuery="CREATE TABLE IF NOT EXISTS  sitesBU (SiteName TEXT, Longitude TEXT, Latitude TEXT, PRIMARY KEY (SiteName,Longitude,Latitude))"
then insert all the records from Sites table to SitesBU table, then drop table Sites, and finally rename SItesBU to Sites.

No need to copy the rows to a new table.
After deleting the duplicates just create a unique key on all the fields.

RBS
 
Upvote 0

Mahares

Expert
Licensed User
Longtime User
No need to copy the rows to a new table.
After deleting the duplicates just create a unique key on all the fields.
As far as I know the UNIQUE constraint works with CREATE TABLE. So, when you create a table, it has no records initially, so you need to first create a temp table with UNIQUE constraint and only then insert the records to it. If you disagree, provide an example to back up your statement.
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
As far as I know the UNIQUE constraint works with CREATE TABLE. So, when you create a table, it has no records initially, so you need to first create a temp table with UNIQUE constraint and only then insert the records to it. If you disagree, provide an example to back up your statement.

Just try:

create table test_unique_index(id int, name text, age int)

create unique index idx_test_unique_index_id_name_age on test_unique_index(id, name, age)

insert into test_unique_index values(1,'John', 25)

insert into test_unique_index values(1,'John', 25) <<< will fail

insert into test_unique_index values(1,'John', 26) <<< will succeed

RBS
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
You should have explicitly mentioned UNIQUE INDEX instead of unique key in post #32 which are two different things. we could have avoided unnecessary additional posts if you included your example in post #32

Yes, should be unique index, simple typo, typing quick.
In any case there is no need to copy the rows to a different table.

RBS
 
Upvote 0

Mahares

Expert
Licensed User
Longtime User
Last edited:
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
UNIQUE INDEX and UNIQUE constraint in creating a table are totally different. So, it depends on the objective and it in many cases it is better to use it in creating the table and copying the rows.. See this link for an explanation of the downside of UNIQUE INDEX
https://littlekendra.com/2016/09/08/unique-constraints-vs-unique-indexes/
Case is closed.

> Case is closed

??
I was just suggesting another option to the OP where there was no need to copy to another table.
What the better way is will depend on the particular case, but at least the OP may have learned something from it.

RBS
 
Upvote 0

Roger Daley

Well-Known Member
Licensed User
Longtime User
Nice to hear it works fine.
Maybe you could tell us which one was the faster solution.
You may need to compare on a larger sample.

RBS


RBS,
I loaded in over 1100 records and loaded them again, ran both versions of the sub. Subjectively I could not detect any difference in speed.
I tried to measure using a Timer but this did not work.

Regards Roger
 
Upvote 0
Top