When the website_analytics pipeline runs, it fails at the dump.to_sql step due to an IntegrityError:

IntegrityError: duplicate key value violates unique constraint "websiteanalytics_pkey"
DETAIL: Key (id)=(ca184c48-4b2b-4d69-8c26-10aa3eb28d95) already exists.

The websiteanalytics table either contains duplicate rows or can't update the latest row (the last row added by the previous successful pipeline run). These duplicate rows can be discovered with the queries below.
Rows are considered the same if date, page_path, domain and project_id are the same.
Some SQL queries to get row counts:
Get the total row count:
-- Count total rows
SELECT count(*)
FROM websiteanalytics;
Get the number of duplicate rows (we want to get rid of these):
-- This is the number of duplicate rows in the table
WITH dups AS
  (SELECT date, page_path, domain, project_id, count(*) - 1 AS dup_count
   FROM websiteanalytics
   GROUP BY date, page_path, domain, project_id
   HAVING count(*) > 1)
SELECT SUM(dup_count) FROM dups;
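To see which keys are actually duplicated, for example to spot-check a few groups before deleting anything, a query along these lines should work (a sketch, not a query from the issue):

-- List the most-duplicated keys (sketch, for inspection only)
SELECT date, page_path, domain, project_id, count(*) AS copies
FROM websiteanalytics
GROUP BY date, page_path, domain, project_id
HAVING count(*) > 1
ORDER BY copies DESC
LIMIT 20;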
Get the number of distinct rows without duplicates (we want to keep these):

-- Count distinct rows (unique for date, page_path, domain, project_id, without duplicates)
WITH unique_rows AS
  (SELECT DISTINCT ON (date, page_path, domain, project_id) * FROM websiteanalytics)
SELECT count(*) FROM unique_rows;
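The three counts should be consistent: total = dups + distinct. A sanity check along these lines (a sketch) recomputes the duplicate count from the other two:

-- Sanity check (sketch): should return the same number as the dups query
SELECT
  (SELECT count(*) FROM websiteanalytics)
  -
  (SELECT count(*)
   FROM (SELECT DISTINCT ON (date, page_path, domain, project_id) *
         FROM websiteanalytics) u) AS dup_count;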
Row counts before deletion:

Total:    1599864
Dups:     56099
Distinct: 1543765
Now we can run this query to delete duplicate rows:
-- Keep the row with the smallest ctid in each duplicate group; delete the rest
DELETE FROM websiteanalytics a USING (
  SELECT MIN(ctid) AS ctid, date, page_path, domain, project_id
  FROM websiteanalytics
  GROUP BY date, page_path, domain, project_id
  HAVING COUNT(*) > 1
) b
WHERE a.date = b.date
  AND a.page_path = b.page_path
  AND a.domain = b.domain
  AND a.project_id = b.project_id
  AND a.ctid <> b.ctid;
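Since this DELETE is destructive, one cautious approach (a sketch, not something prescribed in the issue) is to run it inside a transaction and verify the surviving row count before committing:

BEGIN;
-- ... run the DELETE above ...
-- Verify that only the distinct rows survive before making it permanent
SELECT count(*) FROM websiteanalytics;  -- expect 1543765
COMMIT;  -- or ROLLBACK if the count looks wrong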
Row counts after deletion:

Total:    1543765
Dups:     null
Distinct: 1543765

(The dups query returns null rather than 0 because SUM over an empty set is NULL.)
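To keep the duplicates from coming back, one option (a sketch; the index name and the staging-table pattern are assumptions, not part of the pipeline) is to enforce the natural key with a unique index and make the load step idempotent with an upsert:

-- Hypothetical follow-up: enforce the natural key (index name is illustrative)
CREATE UNIQUE INDEX websiteanalytics_natural_key
  ON websiteanalytics (date, page_path, domain, project_id);

-- Hypothetical idempotent load: skip rows whose primary key already exists
-- (websiteanalytics_staging is an assumed staging table, not from the issue)
INSERT INTO websiteanalytics
SELECT * FROM websiteanalytics_staging
ON CONFLICT (id) DO NOTHING;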