Duplicate Rows in SQL and PostgreSQL

By xw19, published on 2019-01-25

We will use PostgreSQL 10 for this experiment, starting with a table that has no indexes:

CREATE TABLE account(
 username VARCHAR (50) NOT NULL,
 password VARCHAR (50) NOT NULL,
 email VARCHAR (355) NOT NULL
);

Next, we will populate it with one million rows:

INSERT INTO account (username, password, email)
SELECT 'test'||g.id, 'test'||g.id, 'test'||g.id||'@example.com'
FROM generate_series(1, 1000000) AS g (id);

We will then create some duplicate rows; generate_series(2000, 10000) re-inserts 8,001 of them (ids 2000 through 10000), bringing the table to 1,008,001 rows:

INSERT INTO account (username, password, email)
SELECT 'test'||g.id, 'test'||g.id, 'test'||g.id||'@example.com'
FROM generate_series(2000, 10000) AS g (id);
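As a quick sanity check, and assuming nothing else has touched the table, the two inserts should add up to 1,000,000 + 8,001 rows:

```sql
-- 1,000,000 original rows + 8,001 duplicated rows
SELECT count(*) FROM account;  -- expect 1008001
```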

The query to find the duplicates using standard SQL:

SELECT username, email, count(*)
FROM account
GROUP BY username, email
HAVING count(*) > 1;
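Each of the 8,001 duplicated (username, email) pairs was inserted exactly twice, so this query returns 8,001 rows, each with a count of 2 (the row order is unspecified without an ORDER BY):

```
 username |        email         | count
----------+----------------------+-------
 test2000 | test2000@example.com |     2
 test2001 | test2001@example.com |     2
 ...
(8001 rows)
```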

We can also use a query based on the ROW_NUMBER() window function (standard since SQL:2003, and supported by PostgreSQL since 8.4):

SELECT * FROM (
  SELECT username, ROW_NUMBER() OVER (PARTITION BY username, email ORDER BY username ASC) AS ROW
  FROM account
) duplicates WHERE duplicates.ROW > 1;
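The same window-function pattern can be extended to remove the duplicates, not just find them. Here is a sketch (not something this article runs) that keeps one physical copy of each (username, email) pair by using ctid, PostgreSQL's physical row locator; note that ctid values are only stable within a transaction, so this should run as a single statement:

```sql
-- Delete every copy after the first within each (username, email) group.
-- ctid identifies the physical row, so exact duplicates can be told apart.
DELETE FROM account
WHERE ctid IN (
  SELECT ctid FROM (
    SELECT ctid,
           ROW_NUMBER() OVER (PARTITION BY username, email) AS rn
    FROM account
  ) d
  WHERE d.rn > 1
);
```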

Let's compare the two queries with EXPLAIN ANALYZE.

EXPLAIN ANALYZE SELECT username, email, count(*)
FROM account
GROUP BY username, email
HAVING count(*) > 1;

postgres=# EXPLAIN ANALYZE SELECT username, email, count(*)
postgres-# from account
postgres-# group by username, email
postgres-# having count(*) > 1;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=144114.66..166619.78 rows=990511 width=40) (actual time=1041.315..1595.514 rows=8001 loops=1)
   Group Key: username, email
   Filter: (count(*) > 1)
   Rows Removed by Filter: 991999
   ->  Sort  (cost=144114.66..146634.66 rows=1008001 width=32) (actual time=1041.303..1405.274 rows=1008001 loops=1)
         Sort Key: username, email
         Sort Method: external merge  Disk: 43224kB
         ->  Seq Scan on account  (cost=0.00..19483.01 rows=1008001 width=32) (actual time=0.020..102.912 rows=1008001 loops=1)
 Planning time: 0.213 ms
 Execution time: 1600.822 ms
(10 rows)


EXPLAIN ANALYZE SELECT * FROM (
  SELECT username, ROW_NUMBER() OVER(PARTITION BY username, email) AS ROW
  FROM account
) duplicates WHERE duplicates.ROW > 1;

postgres=# EXPLAIN ANALYZE SELECT * FROM (
postgres(#   SELECT username, ROW_NUMBER() OVER(PARTITION BY username, email) AS ROW
postgres(#   FROM account
postgres(# ) duplicates WHERE duplicates.ROW > 1;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on duplicates  (cost=144114.66..176874.69 rows=336000 width=18) (actual time=1077.131..1916.132 rows=8001 loops=1)
   Filter: (duplicates."row" > 1)
   Rows Removed by Filter: 1000000
   ->  WindowAgg  (cost=144114.66..164274.68 rows=1008001 width=40) (actual time=1077.122..1871.481 rows=1008001 loops=1)
         ->  Sort  (cost=144114.66..146634.66 rows=1008001 width=32) (actual time=1077.116..1473.237 rows=1008001 loops=1)
               Sort Key: account.username, account.email
               Sort Method: external merge  Disk: 43224kB
               ->  Seq Scan on account  (cost=0.00..19483.01 rows=1008001 width=32) (actual time=0.021..105.845 rows=1008001 loops=1)
 Planning time: 0.141 ms
 Execution time: 1921.535 ms
(10 rows)
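Both plans sort all 1,008,001 rows on disk (Sort Method: external merge  Disk: 43224kB), a sign that the sort does not fit in the default 4MB of work_mem. One possible follow-up, sketched here with an illustrative value rather than a tuned recommendation, is to raise work_mem for the session and re-run the EXPLAIN; with enough memory PostgreSQL can sort in memory (or choose a HashAggregate for the GROUP BY form) instead of spilling to disk:

```sql
-- Session-only change; 64MB is an illustrative value, not a recommendation.
SET work_mem = '64MB';

EXPLAIN ANALYZE SELECT username, email, count(*)
FROM account
GROUP BY username, email
HAVING count(*) > 1;
```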
