mirror of
https://github.com/jbranchaud/til
synced 2026-01-04 23:58:01 +00:00
Add Find Duplicate Records In Table Without Unique Id as a postgres til
@@ -10,7 +10,7 @@ pairing with smart people at Hashrocket.

 For a steady stream of TILs, [sign up for my newsletter](https://tinyletter.com/jbranchaud).

-_1124 TILs and counting..._
+_1125 TILs and counting..._

 ---
@@ -558,6 +558,7 @@ _1124 TILs and counting..._
 - [Escaping String Literals With Dollar Quoting](postgres/escaping-string-literals-with-dollar-quoting.md)
 - [Export Query Results To A CSV](postgres/export-query-results-to-a-csv.md)
 - [Extracting Nested JSON Data](postgres/extracting-nested-json-data.md)
+- [Find Duplicate Records In Table Without Unique Id](postgres/find-duplicate-records-in-table-without-unique-id.md)
 - [Find Records That Contain Duplicate Values](postgres/find-records-that-contain-duplicate-values.md)
 - [Find Records That Have Multiple Associated Records](postgres/find-records-that-have-multiple-associated-records.md)
 - [Find The Data Directory](postgres/find-the-data-directory.md)
@@ -0,0 +1,45 @@
# Find Duplicate Records In Table Without Unique Id

I recently came across a couple of methods for listing out instances of
duplicate records in a table where the table doesn't have an explicit unique
identifier. Here is [a post](find-records-that-contain-duplicate-values.md)
that explains how to do this when a unique identifier is present.

If the table doesn't have an explicit primary key or other uniquely identifying
value, then we'll have to get some help from [PostgreSQL's internal system
columns](https://www.postgresql.org/docs/current/ddl-system-columns.html) —
namely the `ctid`.

The `ctid` is:

> The physical location of the row version within its table.

Let's use the example of a `mailing_list` table with potential duplicate
`email` values.
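
To get a feel for what these values look like, `ctid` can be selected like any
other column. A quick sketch (this query is illustrative and isn't part of the
original approaches):

```sql
-- ctid values are (page, tuple) pairs like (0,1), (0,2), ...
-- they identify the current physical row version and can change after
-- updates or a VACUUM FULL, so they shouldn't be stored long-term
select ctid, email
from mailing_list
order by email, ctid;
```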

Here is the [first approach](https://stackoverflow.com/a/26773018/535590):

```sql
delete from mailing_list
where ctid not in (
  select min(ctid)
  from mailing_list
  group by email
);
```

This uses a subquery to find the first occurrence of every unique email and
then deletes the rest. The `ctid` serves as the unique value that we can apply
the `min` aggregate to.
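
Because this statement is destructive, it can be worth previewing the rows it
would remove by swapping `delete from` for `select * from`. A sketch, under the
same `mailing_list` example:

```sql
-- lists every row that is not the first (minimum-ctid) occurrence of
-- its email; these are exactly the rows the delete above would remove
select *
from mailing_list
where ctid not in (
  select min(ctid)
  from mailing_list
  group by email
);
```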

A [second approach](https://stackoverflow.com/a/46775289/535590):

```sql
delete from mailing_list ml1
using mailing_list ml2
where ml1.ctid < ml2.ctid
  and ml1.email = ml2.email;
```

This uses `delete using` to join the table against itself as a Cartesian
product, comparing every entry to every other entry. Any row that shares an
`email` with a row that has a higher `ctid` gets deleted. Note that unlike the
first approach, which keeps the row with the minimum `ctid` for each email,
this one keeps the row with the maximum `ctid`.
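
Either delete can be wrapped in a transaction so the result can be inspected,
and rolled back if it looks wrong, before it becomes permanent. A sketch, not
from the original answers:

```sql
begin;

delete from mailing_list ml1
using mailing_list ml2
where ml1.ctid < ml2.ctid
  and ml1.email = ml2.email;

-- if deduplication worked, these two counts should now be equal
select count(*), count(distinct email) from mailing_list;

commit; -- or: rollback;
```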