Android Question SQLite remove duplicate records.[SOLVED TWICE]

Roger Daley · Oct 12, 2019

Hi All,

Can anyone point me towards an example of removing duplicate records in a db?
I have searched for methods/commands in SQLite and DBUtils but nothing stood out to me.
Trying to match every record with every other record through iteration does not appeal, and I'm sure there is a cleaner way to do this.
NOTE: When I say duplicate records, I mean the data in all columns of both records is the same not just matching the "Names" column.

Regards Roger

Mahares · Oct 14, 2019

RB Smissaert said:
DELETE
FROM duptest
WHERE rowid NOT IN
(SELECT MAX(rowid)
FROM duptest4
GROUP BY id, name, age)

How practical is your code if the table has many columns, say 50 or 100, where you have to GROUP BY each column individually?

RB Smissaert · Oct 14, 2019

Mahares said:
How practical is your code if the table has many columns, say 50 or 100, where you have to GROUP BY each column individually?

Can't see a major problem, but if your table has that many columns then probably the schema can be designed better.

RBS
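For a wide table, the GROUP BY list does not have to be typed by hand. The following is only a sketch, written in Python with the standard sqlite3 module rather than B4X so that it can be run anywhere. The helper name delete_duplicates is hypothetical, and the sample table simply reuses the duptest columns from the quoted query. It reads the column names with PRAGMA table_info and builds the same DELETE statement from them.

```python
import sqlite3


def delete_duplicates(con, table):
    """Delete rows that repeat another row in every column (hypothetical helper).

    The table name is pasted into the SQL text, so it must come from trusted code.
    """
    # PRAGMA table_info returns one row per column; index 1 holds the column name.
    cols = [row[1] for row in con.execute(f'PRAGMA table_info("{table}")')]
    group_by = ", ".join(f'"{c}"' for c in cols)
    cur = con.execute(
        f'DELETE FROM "{table}" WHERE rowid NOT IN '
        f'(SELECT MAX(rowid) FROM "{table}" GROUP BY {group_by})'
    )
    con.commit()
    return cur.rowcount  # number of rows deleted


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE duptest (id INTEGER, name TEXT, age INTEGER)")
con.executemany(
    "INSERT INTO duptest VALUES (?, ?, ?)",
    [(1, "John", 25), (1, "John", 25), (2, "Mary", 30)],
)
deleted = delete_duplicates(con, "duptest")
left = con.execute("SELECT COUNT(*) FROM duptest").fetchone()[0]
print(deleted, left)
```

The same idea should carry over to a table with 50 or 100 columns, since only the generated column list grows.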

rraswisak · Oct 15, 2019

RB Smissaert said:
DELETE
FROM duptest
WHERE rowid NOT IN
(SELECT MAX(rowid)
FROM duptest4
GROUP BY id, name, age)

This query works in the other case, where the table has a column with unique values (say, a primary key or an auto-increment value). In your SQL sample, rowid holds the unique value.

The TS (Thread Starter, @Roger Daley) said in the first post that he wants to eliminate records whose values match in all columns, so your SQL does not work in this case.

RB Smissaert · Oct 15, 2019

rraswisak said:
This query works in the other case, where the table has a column with unique values (say, a primary key or an auto-increment value). In your SQL sample, rowid holds the unique value.

The TS (Thread Starter, @Roger Daley) said in the first post that he wants to eliminate records whose values match in all columns, so your SQL does not work in this case.

Don't quite get what you are saying there and I think it does work, but feel free to prove me wrong with a test case.

RBS
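A small test case of the kind asked for here could look like the sketch below. It uses Python's standard sqlite3 module instead of B4X so it can be run anywhere. The table and column names come from the quoted query (with duptest used in both places), and the sample rows are made up. None of the three columns is declared as a key, so the only thing that tells two identical rows apart is the implicit rowid.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE duptest (id INTEGER, name TEXT, age INTEGER)")
con.executemany(
    "INSERT INTO duptest VALUES (?, ?, ?)",
    [
        (1, "John", 25),
        (1, "John", 25),  # identical to the first row in every column
        (1, "John", 26),  # differs in age, so it is not a duplicate
        (2, "Mary", 30),
        (2, "Mary", 30),  # identical to the previous row
    ],
)

# Keep the highest rowid in each group of identical rows and delete the rest.
con.execute(
    """
    DELETE FROM duptest
    WHERE rowid NOT IN (
        SELECT MAX(rowid) FROM duptest GROUP BY id, name, age
    )
    """
)
con.commit()

remaining = con.execute(
    "SELECT id, name, age FROM duptest ORDER BY id, age"
).fetchall()
print(remaining)
```

Each distinct combination of id, name and age is left exactly once, including the (1, 'John', 26) row that only partly matches.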

Mahares · Oct 15, 2019

rraswisak said:
so your sql does not work in this case.

RB Smissaert said:
Don't quite get what you are saying there an

The code presented by RBS does work, because every table record has a unique rowid whether or not it has duplicates, unless the table was created WITHOUT ROWID. Note that he uses NOT IN in his syntax. I was just wondering how efficient it is on a table with a very large number of records and many columns, for example a table of personal data such as last name, first name, sex, birth date, occupation, etc. I agree the table should be designed to avoid a large number of columns, but sometimes it is unavoidable.

RB Smissaert · Oct 15, 2019

RB Smissaert said:
Actually, it can indeed be done with one simple SQL:

DELETE
FROM duptest
WHERE rowid NOT IN
(SELECT MAX(rowid)
FROM duptest4
GROUP BY id, name, age)

RBS

To avoid any confusion:
duptest4 should be duptest.
Just a typo.

RBS

rraswisak · Oct 15, 2019

There is no doubt that the query works, as I stated before:

rraswisak said:
This query works in the other case, where the table has a column with unique values (say, a primary key or an auto-increment value)

The information is described here:

SQLite primary key and rowid table
When you create a table without specifying the WITHOUT ROWID option, SQLite adds an implicit column called rowid that stores 64-bit signed integer. The rowid column is a key that uniquely identifies the rows in the table. Tables that have rowid columns are called rowid tables.

WITHOUT ROWID is an optional clause of the CREATE TABLE statement.

So it depends on how the table was created: if the table was created with rowid, the query works fine; otherwise it doesn't.

Using DISTINCT * keeps working whether the table was created with or without rowid.
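Both points can be checked with a short sketch, again in Python's standard sqlite3 module rather than B4X. The table names keyed and table_temp and the sample rows are assumptions for illustration. One detail worth noting: SQLite requires a PRIMARY KEY on every WITHOUT ROWID table, so such a table cannot contain rows that are identical in every column in the first place.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# 1) A WITHOUT ROWID table has no rowid column to refer to.
con.execute("CREATE TABLE keyed (id INTEGER PRIMARY KEY, name TEXT) WITHOUT ROWID")
try:
    con.execute("SELECT rowid FROM keyed")
    rowid_available = True
except sqlite3.OperationalError:
    rowid_available = False

# 2) The DISTINCT approach: rebuild the table from its distinct rows.
con.execute("CREATE TABLE duptest (id INTEGER, name TEXT, age INTEGER)")
con.executemany(
    "INSERT INTO duptest VALUES (?, ?, ?)",
    [(1, "John", 25), (1, "John", 25), (1, "John", 26)],
)
con.execute("CREATE TABLE table_temp AS SELECT DISTINCT * FROM duptest")
con.execute("DROP TABLE duptest")
con.execute("ALTER TABLE table_temp RENAME TO duptest")
con.commit()

rows = con.execute("SELECT id, name, age FROM duptest ORDER BY age").fetchall()
print(rowid_available, rows)
```

A table created with CREATE TABLE ... AS SELECT carries over the data but not the original constraints or indexes, so those would have to be recreated afterwards.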

Roger Daley · Oct 16, 2019

Hi All,

The preceding "discussion" has been an education for me. As such I tried the RBS one line solution, replacing three lines of code in my Sub.
After some fumbling with the syntax all worked well. It only deletes records if they match on all three columns which is correct.
I have attached the db as sites.zip to show the structure.

Many thanks for all the input.
Regards Roger

B4X:

Sub SQLDuplicate(SQLD As SQL, TableName As String)
    SQLD.BeginTransaction

    ' One-statement solution: keep the row with the highest rowid in each group of identical rows.
    SQLD.ExecNonQuery("DELETE FROM " & TableName & " WHERE rowid NOT IN (SELECT MAX(rowid) FROM " & TableName & " GROUP BY SiteName, Longitude, Latitude)")

    ' Previous three-statement solution, kept for comparison: rebuild the table from its distinct rows.
'    SQLD.ExecNonQuery("CREATE TABLE table_temp AS SELECT DISTINCT * FROM " & TableName)
'    SQLD.ExecNonQuery("DROP TABLE " & TableName)
'    SQLD.ExecNonQuery("ALTER TABLE table_temp RENAME TO " & TableName)

    SQLD.TransactionSuccessful
    SQLD.EndTransaction
End Sub

RB Smissaert · Oct 16, 2019

Roger Daley said:
Hi All,

The preceding "discussion" has been an education for me. As such I tried the RBS one line solution, replacing three lines of code in my Sub.
After some fumbling with the syntax all worked well. It only deletes records if they match on all three columns which is correct.
I have attached the db as sites.zip to show the structure.

Many thanks for all the input.
Regards Roger

B4X:

Sub SQLDuplicate(SQLD As SQL, TableName As String)
    SQLD.BeginTransaction
    SQLD.ExecNonQuery("DELETE FROM "&TableName &" WHERE rowid Not IN (Select Max(rowid) FROM "& TableName & " GROUP BY SiteName, Longitude, Latitude)")
'    SQLD.ExecNonQuery("CREATE TABLE table_temp AS SELECT DISTINCT * FROM "& TableName)
'    SQLD.ExecNonQuery("DROP TABLE "& TableName)
'    SQLD.ExecNonQuery("ALTER TABLE table_temp RENAME TO "& TableName)
    SQLD.TransactionSuccessful
    SQLD.EndTransaction
End Sub

Nice to hear it works fine.
Maybe you could tell us which one was the faster solution.
You may need to compare on a larger sample.

RBS

klaus · Oct 16, 2019

Maybe you could tell us which one was the faster solution.
You may need to compare on a larger sample.

And in Release mode!

Mahares · Oct 16, 2019

Roger Daley said:
all worked well. It only deletes records if they match on all three columns which is correct.

It is great that you solved it using either of the two sets of code. But an important piece of the puzzle is missing. To prevent future duplicates in the existing table, you have to include a primary key composed of the three columns. For that you have to create a temp table with the PRIMARY KEY:

B4X:

strQuery="CREATE TABLE IF NOT EXISTS  sitesBU (SiteName TEXT, Longitude TEXT, Latitude TEXT, PRIMARY KEY (SiteName,Longitude,Latitude))"

Then insert all the records from the Sites table into the SitesBU table, drop the Sites table, and finally rename SitesBU to Sites.
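The steps described above can be sketched as follows, in Python's standard sqlite3 module rather than B4X. The sample rows are invented, and using INSERT OR IGNORE for the copy is an assumption on top of the description: a plain INSERT would stop at the first row that violates the new primary key if the source table still held duplicates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sites (SiteName TEXT, Longitude TEXT, Latitude TEXT)")
con.executemany(
    "INSERT INTO Sites VALUES (?, ?, ?)",
    [
        ("Alpha", "151.2", "-33.8"),
        ("Alpha", "151.2", "-33.8"),  # duplicate in all three columns
        ("Beta", "144.9", "-37.8"),
    ],
)

# New table with a primary key composed of the three columns.
con.execute(
    "CREATE TABLE IF NOT EXISTS SitesBU ("
    "SiteName TEXT, Longitude TEXT, Latitude TEXT, "
    "PRIMARY KEY (SiteName, Longitude, Latitude))"
)
# Assumption: OR IGNORE silently skips rows that would violate the key.
con.execute(
    "INSERT OR IGNORE INTO SitesBU SELECT SiteName, Longitude, Latitude FROM Sites"
)
con.execute("DROP TABLE Sites")
con.execute("ALTER TABLE SitesBU RENAME TO Sites")
con.commit()

count = con.execute("SELECT COUNT(*) FROM Sites").fetchone()[0]

# A later attempt to insert an existing site is now rejected.
try:
    con.execute("INSERT INTO Sites VALUES ('Alpha', '151.2', '-33.8')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(count, duplicate_rejected)
```

With this variant the copy itself removes the duplicates, so a separate DELETE pass beforehand would not be needed.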

RB Smissaert · Oct 16, 2019

Mahares said:
It is great that you solved it using either of the two sets of code. But an important piece of the puzzle is missing. To prevent future duplicates in the existing table, you have to include a primary key composed of the three columns. For that you have to create a temp table with the PRIMARY KEY:

B4X:

strQuery="CREATE TABLE IF NOT EXISTS sitesBU (SiteName TEXT, Longitude TEXT, Latitude TEXT, PRIMARY KEY (SiteName,Longitude,Latitude))"

Then insert all the records from the Sites table into the SitesBU table, drop the Sites table, and finally rename SitesBU to Sites.

No need to copy the rows to a new table.
After deleting the duplicates just create a unique key on all the fields.

RBS

Mahares · Oct 16, 2019

RB Smissaert said:
No need to copy the rows to a new table.
After deleting the duplicates just create a unique key on all the fields.

As far as I know the UNIQUE constraint works with CREATE TABLE. So, when you create a table, it has no records initially, so you need to first create a temp table with UNIQUE constraint and only then insert the records to it. If you disagree, provide an example to back up your statement.

RB Smissaert · Oct 16, 2019

Mahares said:
As far as I know the UNIQUE constraint works with CREATE TABLE. So, when you create a table, it has no records initially, so you need to first create a temp table with UNIQUE constraint and only then insert the records to it. If you disagree, provide an example to back up your statement.

Just try:

create table test_unique_index(id int, name text, age int);

create unique index idx_test_unique_index_id_name_age on test_unique_index(id, name, age);

insert into test_unique_index values(1,'John', 25);

insert into test_unique_index values(1,'John', 25); -- will fail

insert into test_unique_index values(1,'John', 26); -- will succeed

RBS
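The example above starts from an empty table. For a table that already holds rows, a sketch of the same idea follows, in Python's standard sqlite3 module rather than B4X. The Sites columns are taken from the thread, while the index name idx_sites_all and the sample rows are made up. It shows that the unique index cannot be created while duplicates are still present, and that it can be created on the populated table once they are deleted.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sites (SiteName TEXT, Longitude TEXT, Latitude TEXT)")
con.executemany(
    "INSERT INTO Sites VALUES (?, ?, ?)",
    [
        ("Alpha", "151.2", "-33.8"),
        ("Alpha", "151.2", "-33.8"),  # duplicate in all three columns
        ("Beta", "144.9", "-37.8"),
    ],
)

index_sql = (
    "CREATE UNIQUE INDEX idx_sites_all "
    "ON Sites (SiteName, Longitude, Latitude)"
)

# With duplicates still in the table, SQLite refuses to build the index.
try:
    con.execute(index_sql)
    created_before_cleanup = True
except sqlite3.DatabaseError:
    created_before_cleanup = False

# Remove the duplicates, then create the index on the existing rows.
con.execute(
    "DELETE FROM Sites WHERE rowid NOT IN "
    "(SELECT MAX(rowid) FROM Sites GROUP BY SiteName, Longitude, Latitude)"
)
con.execute(index_sql)
con.commit()

# From now on an exact duplicate is rejected on insert.
try:
    con.execute("INSERT INTO Sites VALUES ('Beta', '144.9', '-37.8')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(created_before_cleanup, duplicate_rejected)
```

No rows are copied to another table at any point; the existing table simply gains the index after the cleanup.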

Mahares · Oct 16, 2019

RB Smissaert said:
unique key

You should have explicitly mentioned UNIQUE INDEX instead of unique key in post #32; they are two different things. We could have avoided unnecessary additional posts if you had included your example in post #32.

RB Smissaert · Oct 16, 2019

Mahares said:
You should have explicitly mentioned UNIQUE INDEX instead of unique key in post #32; they are two different things. We could have avoided unnecessary additional posts if you had included your example in post #32.

Yes, it should be unique index; a simple typo from typing quickly.
In any case there is no need to copy the rows to a different table.

RBS

Mahares · Oct 16, 2019

RB Smissaert said:
In any case there is no need to copy the rows to a different table.

UNIQUE INDEX and a UNIQUE constraint in CREATE TABLE are totally different. So it depends on the objective, and in many cases it is better to use the constraint when creating the table and copy the rows. See this link for an explanation of the downside of UNIQUE INDEX:
https://littlekendra.com/2016/09/08/unique-constraints-vs-unique-indexes/
Case is closed.

RB Smissaert · Oct 16, 2019

Mahares said:
UNIQUE INDEX and a UNIQUE constraint in CREATE TABLE are totally different. So it depends on the objective, and in many cases it is better to use the constraint when creating the table and copy the rows. See this link for an explanation of the downside of UNIQUE INDEX:
https://littlekendra.com/2016/09/08/unique-constraints-vs-unique-indexes/
Case is closed.

> Case is closed

??
I was just suggesting another option to the OP where there was no need to copy to another table.
What the better way is will depend on the particular case, but at least the OP may have learned something from it.

RBS

Roger Daley · Oct 16, 2019

RB Smissaert said:
Nice to hear it works fine.
Maybe you could tell us which one was the faster solution.
You may need to compare on a larger sample.

RBS

RBS,
I loaded in over 1100 records and loaded them again, ran both versions of the sub. Subjectively I could not detect any difference in speed.
I tried to measure using a Timer but this did not work.

Regards Roger
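One way to get numbers rather than an impression is to read a high-resolution clock immediately before and after each version runs on identical data. The sketch below does that in Python's standard sqlite3 and time modules rather than B4X. The function names, the generated site data and the in-memory database are all assumptions, and the timings it prints will differ from those on an Android device. Presumably the same before-and-after pattern applies in B4X.

```python
import sqlite3
import time


def make_db(n):
    """Build an in-memory Sites table in which every row appears twice."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Sites (SiteName TEXT, Longitude TEXT, Latitude TEXT)")
    rows = [(f"Site{i}", str(i * 0.01), str(-i * 0.01)) for i in range(n)]
    con.executemany("INSERT INTO Sites VALUES (?, ?, ?)", rows + rows)
    con.commit()
    return con


def dedupe_rowid(con):
    con.execute(
        "DELETE FROM Sites WHERE rowid NOT IN "
        "(SELECT MAX(rowid) FROM Sites GROUP BY SiteName, Longitude, Latitude)"
    )
    con.commit()


def dedupe_distinct(con):
    con.execute("CREATE TABLE table_temp AS SELECT DISTINCT * FROM Sites")
    con.execute("DROP TABLE Sites")
    con.execute("ALTER TABLE table_temp RENAME TO Sites")
    con.commit()


def timed(func, n):
    """Run func on a fresh database; return (seconds elapsed, rows left)."""
    con = make_db(n)
    start = time.perf_counter()
    func(con)
    elapsed = time.perf_counter() - start
    count = con.execute("SELECT COUNT(*) FROM Sites").fetchone()[0]
    con.close()
    return elapsed, count


results = {f.__name__: timed(f, 1100) for f in (dedupe_rowid, dedupe_distinct)}
for name, (elapsed, count) in results.items():
    print(f"{name}: {elapsed * 1000:.2f} ms, {count} rows left")
```

With only about 1100 records both versions are likely to finish too quickly to tell apart by eye, which fits the observation above; raising n or repeating each run several times should make any difference easier to see.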

Roger Daley · Oct 16, 2019

klaus said:
And in Release mode!

Klaus,

Any clues on how to measure the difference? Subjectively I can't pick one.

Regards Roger
