Skip to content
This repository has been archived by the owner on Dec 2, 2021. It is now read-only.

measure.websiteananalytics table postgres IntegrityError #58

Open
brew opened this issue Jun 28, 2018 · 1 comment
Open

measure.websiteananalytics table postgres IntegrityError #58

brew opened this issue Jun 28, 2018 · 1 comment

Comments

@brew
Copy link
Collaborator

brew commented Jun 28, 2018

When the website_analytics pipeline runs, it fails at the dump.to_sql step due to an IntegrityError:

IntegrityError: duplicate key value violates unique constraint "websiteanalytics_pkey"
DETAIL: Key (id)=(ca184c48-4b2b-4d69-8c26-10aa3eb28d95) already exists.

The websiteanalytics table either contains duplicate rows or can't update the latest row (last row added from previous successful pipeline run).

These duplicate rows can be discovered with a query like:

SELECT date, page_path, domain, project_id, count(*) 
FROM websiteanalytics 
WHERE domain='frictionlessdata.io' and project_id='frictionlessdata'
GROUP BY date, page_path, domain, project_id 
HAVING count(*) > 1;
@brew
Copy link
Collaborator Author

brew commented Jul 2, 2018

Rows are considered the same if date, page_path, domain and project_id are the same.

Some SQL queries to get row counts:

  1. Get the total row count:
-- Count total row
SELECT count(*)
FROM websiteanalytics;
  1. Get the number of duplicate rows (we want to get rid of these):
-- This is the number of duplicate rows in the table
WITH dups AS
    (SELECT date, page_path, domain, project_id, count(*)-1 AS dup_count
    FROM websiteanalytics
    GROUP BY date, page_path, domain, project_id
    HAVING count(*) > 1)
SELECT SUM(dup_count) FROM dups;
  1. Get the number of distinct rows without duplications (we want to keep these):
-- Count distinct rows (unique, for date, page_path, domain, project_id, without duplicates)
WITH unique_rows AS
    (SELECT DISTINCT ON (date, page_path, domain, project_id) * 
    FROM websiteanalytics)
SELECT count(*) FROM unique_rows;

Row counts before deletion

Total Dups Distinct
1599864 56099 1543765

Now we can run this query to delete duplicate rows:

DELETE FROM websiteanalytics a USING (
      SELECT MIN(ctid) as ctid, date, page_path, domain, project_id
        FROM websiteanalytics
        GROUP BY date, page_path, domain, project_id HAVING COUNT(*) > 1
      ) b
WHERE a.date = b.date
AND a.page_path = b.page_path
AND a.domain = b.domain
AND a.project_id = b.project_id
AND a.ctid <> b.ctid;

Row counts after deletion:

Total Dups Distinct
1543765 null 1543765

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant