From 466310eb6d0e61bc5362a138e2739d8b8e89988d Mon Sep 17 00:00:00 2001 From: jbranchaud Date: Fri, 14 May 2021 17:16:21 -0500 Subject: [PATCH] Add Find Duplicate Records In Table Without Unique Id as a postgres til --- README.md | 3 +- ...cate-records-in-table-without-unique-id.md | 45 +++++++++++++++++++ 2 files changed, 47 insertions(+), 1 deletion(-) create mode 100644 postgres/find-duplicate-records-in-table-without-unique-id.md diff --git a/README.md b/README.md index 59101ba..a4627cb 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ pairing with smart people at Hashrocket. For a steady stream of TILs, [sign up for my newsletter](https://tinyletter.com/jbranchaud). -_1124 TILs and counting..._ +_1125 TILs and counting..._ --- @@ -558,6 +558,7 @@ _1124 TILs and counting..._ - [Escaping String Literals With Dollar Quoting](postgres/escaping-string-literals-with-dollar-quoting.md) - [Export Query Results To A CSV](postgres/export-query-results-to-a-csv.md) - [Extracting Nested JSON Data](postgres/extracting-nested-json-data.md) +- [Find Duplicate Records In Table Without Unique Id](postgres/find-duplicate-records-in-table-without-unique-id.md) - [Find Records That Contain Duplicate Values](postgres/find-records-that-contain-duplicate-values.md) - [Find Records That Have Multiple Associated Records](postgres/find-records-that-have-multiple-associated-records.md) - [Find The Data Directory](postgres/find-the-data-directory.md) diff --git a/postgres/find-duplicate-records-in-table-without-unique-id.md b/postgres/find-duplicate-records-in-table-without-unique-id.md new file mode 100644 index 0000000..9f1edcd --- /dev/null +++ b/postgres/find-duplicate-records-in-table-without-unique-id.md @@ -0,0 +1,45 @@ +# Find Duplicate Records In Table Without Unique Id + +I recently came across a couple methods for listing out instances of duplicate +records in a table where the table doesn't have an explicit unique identifier. +Here is [a post](find-records-that-contain-duplicate-values.md) that explains +how to do this when a unique identifier is present. + +If the table doesn't have an explicit primary key or other uniquely identifying +value, then we'll have to get some help from [PostgreSQL's internal system +columns](https://www.postgresql.org/docs/current/ddl-system-columns.html) — +namely the `ctid`. + +The `ctid` is: + +> The physical location of the row version within its table. + +Let's use the example of the `mailing_list` table with potential duplicate +`email` values. + +Here is the [first approach](https://stackoverflow.com/a/26773018/535590): + +```sql +delete from mailing_list +where ctid not in ( + select min(ctid) + from mailing_list + group by email +); +``` + +This uses a subquery to find the first occurrence of every unique email and +then deletes the rest. The `ctid` is the unique value that we can call the +`min` aggregate on. + +A [second approach](https://stackoverflow.com/a/46775289/535590): + +```sql +delete from mailing_list ml1 + using mailing_list ml2 +where ml1.ctid < ml2.ctid + and ml1.email = ml2.email; +``` + +This uses `delete using` to join the table against itself as a cartesian +product to compare every entry to every other entry.