Provide partial support for string "format" constraint #16

Open
wants to merge 5 commits into
base: master

Conversation

stefanov-sm

@stefanov-sm stefanov-sm commented Feb 25, 2024

Provide partial support for string "format" constraint

Formats that correspond to native PostgreSQL data types are implemented:

date-time, date, time, duration, uuid, ipv4, ipv6, regex

The email format is validated with a regex suggested by pbaumard.

Consistent with current behaviour, unsupported formats validate successfully.
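
As a rough illustration of the approach described above, the sketch below validates a string by casting it to a matching PostgreSQL type inside a block with an exception handler. The function name `format_is_valid`, its two-argument shape, and the exact type chosen for each format are assumptions made for this example and are not taken from the PR.

```sql
-- Hypothetical helper, not the PR's code: returns true when `target`
-- can be cast to the PostgreSQL type assumed to correspond to `fmt`.
CREATE OR REPLACE FUNCTION format_is_valid(fmt text, target text)
RETURNS boolean
LANGUAGE plpgsql
AS $fn$
BEGIN
  CASE fmt
    WHEN 'date-time' THEN PERFORM target::timestamptz;
    WHEN 'date'      THEN PERFORM target::date;
    WHEN 'time'      THEN PERFORM target::timetz;
    WHEN 'duration'  THEN PERFORM target::interval;
    WHEN 'uuid'      THEN PERFORM target::uuid;
    ELSE NULL;  -- unsupported formats fall through and validate
  END CASE;
  RETURN true;
EXCEPTION
  WHEN OTHERS THEN
    RETURN false;  -- the cast raised an error, so the value is rejected
END;
$fn$;

SELECT format_is_valid('uuid', 'not-a-uuid');   -- cast fails
SELECT format_is_valid('hostname', 'anything'); -- format not handled here
```

PostgreSQL input parsing is more lenient than the JSON Schema formats (for example, `'today'::date` is accepted), so a check like this is permissive rather than strict.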

@stefanov-sm
Author

"email" format validation may be added with one extra line in the case list that uses the popular regular expression:

 WHEN 'email' THEN IF target !~* '^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$' THEN RAISE; END IF;

It would not be as strict as the rest though. The regular expression might be brushed up too.

@kwakwaversal
Contributor

Nice work. I've thought about adding this myself and can see it might be useful (especially for uuid).

Have you performance-tested this, though? Something in the back of my head says that adding exception handling adds a bit of overhead.

@stefanov-sm
Author

stefanov-sm commented Mar 5, 2024

Fair enough, exception handling does add some overhead. Yet given the relative complexity of regular expressions and ISO 8601 validation (620+ timezone names and abbreviations), I think this is a price worth paying. The performance loss is not that significant. Benchmark on a very modest laptop: 100K validations against {"type":"number"} take 11.4 s, and 100K validations against {"type":"string", "format":"date-time"} take 12.9 s.

@kwakwaversal
Contributor

Thanks for checking. What are the benchmark times with and without exception handling?

It's worth mentioning because people might validate large amounts of data (for example, if triggers run on updates), and a slowdown relative to the current version could have unintended consequences.
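
One way to isolate the cost being asked about is a micro-benchmark that runs the same cast with and without an enclosing exception block. The sketch below is not part of the PR; the function name `bench_exception_overhead` and the sample timestamp are assumptions, and the timings will vary by machine and PostgreSQL version.

```sql
-- Hypothetical micro-benchmark: times n casts to timestamptz
-- with and without an enclosing EXCEPTION clause.
CREATE OR REPLACE FUNCTION bench_exception_overhead(n integer)
RETURNS TABLE (variant text, elapsed interval)
LANGUAGE plpgsql
AS $fn$
DECLARE
  sample  constant text := '2024-03-05T12:00:00Z';
  started timestamptz;
BEGIN
  -- Variant 1: plain cast, no exception handling.
  started := clock_timestamp();
  FOR i IN 1..n LOOP
    PERFORM sample::timestamptz;
  END LOOP;
  variant := 'plain cast';
  elapsed := clock_timestamp() - started;
  RETURN NEXT;

  -- Variant 2: the same cast inside a block with an EXCEPTION clause,
  -- which PL/pgSQL implements with a subtransaction per block entry.
  started := clock_timestamp();
  FOR i IN 1..n LOOP
    BEGIN
      PERFORM sample::timestamptz;
    EXCEPTION WHEN OTHERS THEN
      NULL;  -- never reached for a valid sample
    END;
  END LOOP;
  variant := 'cast in exception block';
  elapsed := clock_timestamp() - started;
  RETURN NEXT;
END;
$fn$;

SELECT * FROM bench_exception_overhead(100000);
```

Because the sample value is always valid, the difference between the two rows reflects only the cost of entering and leaving the exception block, not of raising an error.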

@pbaumard

pbaumard commented Mar 5, 2024

> "email" format validation may be added with one extra line in the case list that uses the popular regular expression:
>
>  WHEN 'email' THEN IF target !~* '^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$' THEN RAISE; END IF;
>
> It would not be as strict as the rest though. The regular expression might be brushed up too.

Using the regexp from https://dba.stackexchange.com/a/165923, which follows the HTML5 email spec, would surely be better:

WHEN 'email' THEN IF target !~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$' THEN RAISE; END IF;

It does not follow RFC 5321, section 4.1.2, which the JSON Schema "format" specification refers to, but as the HTML5 email spec puts it:

> This requirement is a willful violation of RFC 5322, which defines a syntax for email addresses that is simultaneously too strict (before the "@" character), too vague (after the "@" character), and too lax (allowing comments, whitespace characters, and quoted strings in manners unfamiliar to most users) to be of practical use here.
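
To make the difference between the two patterns concrete, the sketch below wraps each one in a small SQL function and runs both over a few sample addresses. The function names `email_simple_ok` and `email_html5_ok` and the sample addresses are made up for this illustration.

```sql
-- Hypothetical wrappers around the two patterns quoted in this thread.
CREATE OR REPLACE FUNCTION email_simple_ok(addr text)
RETURNS boolean
LANGUAGE sql
AS $fn$
  SELECT addr ~* '^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'
$fn$;

CREATE OR REPLACE FUNCTION email_html5_ok(addr text)
RETURNS boolean
LANGUAGE sql
AS $fn$
  SELECT addr ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$'
$fn$;

-- Compare both checks side by side.
SELECT addr,
       email_simple_ok(addr) AS simple_ok,
       email_html5_ok(addr)  AS html5_ok
FROM (VALUES ('user@example.com'),
             ('user@example.museum'),
             ('user@localhost'),
             ('user@-bad.com')) AS samples(addr);
```

The simple pattern rejects `user@example.museum` and `user@localhost` because it requires a final label of two to four letters, while the HTML5-style pattern accepts both. Conversely, the simple pattern accepts `user@-bad.com`, which the HTML5-style pattern rejects because a domain label may not start with a hyphen.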

@stefanov-sm
Author

Validating 100K values against {"type":"number"} with the "format" section commented out (i.e. the original function with no exception handling code) took about the same time, 11.3 s. There is no performance change unless {"type":"string", "format":"date-time"} is hit and the exception machinery gets invoked.

Uses the HTML5-style regex suggested by [Pierre Baumard](https://github.com/pbaumard).

@stefanov-sm
Author

Nice & clean. I have added it to the PR, quoting your comment/suggestion.

IF schema ? 'format' AND jsonb_typeof(data) = 'string' THEN
DECLARE
target text := (data #>> '{}');
EMAIL_RX constant text := '^[\w.!#$%&''*+/=?^`{|}~-]+@[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?(?:\.[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?)*$';

A case-sensitive check is faster, so why not use the full case-sensitive regex here and target !~ EMAIL_RX below?

Suggested change
- EMAIL_RX constant text := '^[\w.!#$%&''*+/=?^`{|}~-]+@[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?(?:\.[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?)*$';
+ EMAIL_RX constant text := '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$';

Author

Fine. Longer and maybe harder to read, but 10% is always 10%.

WHEN 'ipv6' THEN PERFORM target::inet; IF target NOT LIKE '%:%' THEN RAISE; END IF;
WHEN 'ipv4' THEN PERFORM target::inet; IF target LIKE '%:%' THEN RAISE; END IF;
WHEN 'regex' THEN PERFORM '' ~ target;
WHEN 'email' THEN IF target !~* EMAIL_RX THEN RAISE; END IF;

See the regex comment above; a case-sensitive check would be faster:

Suggested change
- WHEN 'email' THEN IF target !~* EMAIL_RX THEN RAISE; END IF;
+ WHEN 'email' THEN IF target !~ EMAIL_RX THEN RAISE; END IF;
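
The ipv4, ipv6 and regex branches from the diff above can be tried in isolation with a sketch like the following. The function name `net_or_regex_is_valid` is hypothetical, and unlike the PR code it returns a boolean directly instead of raising inside the CASE branches.

```sql
-- Hypothetical helper mirroring the ipv4/ipv6/regex branches of the diff.
CREATE OR REPLACE FUNCTION net_or_regex_is_valid(fmt text, target text)
RETURNS boolean
LANGUAGE plpgsql
AS $fn$
BEGIN
  CASE fmt
    WHEN 'ipv6' THEN
      PERFORM target::inet;          -- raises on malformed input
      RETURN target LIKE '%:%';      -- IPv6 literals contain a colon
    WHEN 'ipv4' THEN
      PERFORM target::inet;
      RETURN target NOT LIKE '%:%';  -- IPv4 literals do not
    WHEN 'regex' THEN
      PERFORM '' ~ target;           -- compiling a bad pattern raises
      RETURN true;
    ELSE
      RETURN true;                   -- other formats are out of scope here
  END CASE;
EXCEPTION
  WHEN OTHERS THEN
    RETURN false;                    -- cast or pattern compilation failed
END;
$fn$;

SELECT net_or_regex_is_valid('ipv4', '192.168.0.1');
SELECT net_or_regex_is_valid('ipv6', '::1');
SELECT net_or_regex_is_valid('regex', '(');
```

Note that the `inet` type also accepts a netmask suffix such as `/24`, so this check is more permissive than a strict address grammar.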

3 participants