Author: Zara Lim (@jzaralim) | Release Target: 0.20.0; 7.0.0 | Status: Merged | Discussion: #7417
tl;dr: Add support for DATE and TIME types. This will give users a more complete set of types to represent time data with, and makes it easier to handle date/time data from other data sources.
The TIMESTAMP data type was introduced in ksqlDB 0.17. It was an improvement on time data representation in ksqlDB, and we would like to expand on this by introducing DATE and TIME. DATE and TIME are useful to have because:
- They store a different kind of time data. A TIMESTAMP stores some specific point in time, whereas DATE and TIME store a date on a calendar or a time on a clock. For example, if we wanted to represent a daily alarm time in ksqlDB, it would not make sense to store it as a TIMESTAMP.
- Clear representation of the underlying Kafka types. Currently, ksqlDB converts Date and Time data from Kafka to BIGINT. This is inconsistent with how ksqlDB converts Timestamp data, and is a messy user experience especially when moving data to/from different sources. For example, if a user wants to use ksqlDB to manipulate their Kafka Data and Time data using the built-in time functions and then write their results to an external database, they would first have to convert their BIGINT data to TIMESTAMP and then set up their Sink connector to convert their data back to Data/Time.
- Add TIME and DATE types to KSQL
- Support comparisons with the TIME and DATE types
- Support TIME and DATE usage in STRUCT, MAP and ARRAY
- Serialization and de-serialization of TIME and DATE to Avro, JSON, Protobuf and Delimited formats
- New UDFs to support the TIME and DATE types, as well as deprecating old DATE functions
- Casting between TIMESTAMP, STRING and DATE/TIME
- A DATETIME type - there is no underlying Kafka type for this
- Allowing TIME values to be defined with timezones (13:32:52+0100) - most other systems don't support this
The DATE data type will store a calendar date independent of time zone. The syntax is as follows:
CREATE STREAM stream_name (birthday DATE, COL2 STRING) AS ...
CREATE TABLE table_name (col1 STRUCT<field DATE>) AS ...
DATEs will be displayed in console as strings in the form yyyy-MM-dd
:
> SELECT date FROM stream_name EMIT CHANGES;
+------------------------+
|date |
+------------------------+
|1994-11-05 |
DATEs can be represented by date strings:
INSERT INTO stream_name VALUES ("1994-11-05");
The format yyyy-MM-dd
can also represent a TIMESTAMP, so if a DATE or TIMESTAMP value are both
acceptable in the context that a date string is used, then TIMESTAMP will take precedence.
For example, because the function FORMAT_DATE
can only accept DATE parameters, the date string will
be converted to a DATE.
SELECT DATE_ADD(DAYS, 2, '1994-11-05') FROM stream_name;
+------------------------+
|date |
+------------------------+
|1994-11-07 |
In the following example, the query makes sense as both a DATE and a TIMESTAMP comparison, so ksqlDB will read them as TIMESTAMP.
SELECT '1994-11-07' > '2020-03-31' FROM stream_name;
The TIME data type will store a time without timezone information. The syntax is as follows:
CREATE STREAM stream_name (alarm_time TIME, COL2 STRING) AS ...
CREATE TABLE table_name (col1 STRUCT<field TIME>) AS ...
TIMEs will be displayed in console as strings in the form hh:mm:ss.SSS
:
> SELECT time FROM stream_name EMIT CHANGES;
+------------------------+
|time |
+------------------------+
|00:40:53.222 |
TIMEs can be represented by time strings:
INSERT INTO stream_name VALUES ("00:40:53.222");
Time zones will not be recognized in time strings.
The following UDFs should be deprecated:
DATETOSTRING
STRINGTODATE
These functions will still be available so that existing queries using these functions won't break, but the documentation will state that they are deprecated and will direct users towards using the DATE type and the new functions.
The following functions will be added/updated:
FORMAT_DATE(format, date)
- converts a date to a string in the specified formatFORMAT_TIME(format, time)
- converts a time to a string in the specified formatPARSE_DATE(format, date_string)
- converts a date string in the specified format to a DATEPARSE_TIME(format, time_string)
- converts a time string in the specified format to a TIMEUNIX_DATE(date)
- returns an INTEGER number of days that have passed between Unix epoch and the specified dateFROM_DAYS(int)
- convert epoch days to a DATE valueDATEADD(time unit, integer, date)
- Adds an interval to the date. The time unit must beDAYS
orYEARS
. If it is not, then the function will throw an errorDATESUB(time unit, integer, date)
- Subtracts an interval from the date. The time unit must beDAYS
orYEARS
. If it is not, then the function will throw an errorTIMEADD(time unit, integer, time)
- Adds an interval to the time. If the result falls outside of the00:00:00.000
to23:59:59.999
range, then the result will be adjusted.TIMESUB(time unit, integer, time)
- Subtracts an interval from the time. If the result falls outside of the00:00:00.000
to23:59:59.999
range, then the result will be adjusted.
The DATE type will be handled by the java.sql.Date
class within KSQL and the TIME type will be
handled by java.sql.Time
. The corresponding Kafka Connect types are org.apache.kafka.connect.data.Date
and org.apache.kafka.connect.data.Time.
Both are represented as integers in Schema Registry with tags to identify their logical types.
Avro schemas represent dates as
{
"type": "int",
"logicalType": "date"
}
And time as
{
"type": "int",
"logicalType": "time-millis"
}
The Avro deserializer ksqlDB uses support these.
Dates will get stored in JSON and CSV files as an int number of days since Unix Epoch. Times will get stored as an int number of milliseconds since 00:00:00.000
The KSQL JSON and delimited deserializers will be updated to parse dates and times.
Protobuf has a google.type.TimeOfDay
type for time and a google.type.Date
type for dates.
The Protouf deserializer KSQL uses supports this.
Casting from TIMESTAMP to DATE discards the time portion of the TIMESTAMP:
-- ts is the timestamp, '2007-01-01T03:20:45'
SELECT CAST(ts AS DATE) FROM s;
> 2007-01-1
Casting from TIMESTAMP to DATE discards the date portion of the TIMESTAMP:
-- ts is the timestamp, '2007-01-01T03:20:45'
SELECT CAST(ts AS TIME) FROM s;
> 03:20:45
Casting from DATE to TIMESTAMP adds 00:00:00.000
as the time portion to the DATE:
-- d is the date, '2007-01-01'
SELECT CAST(d AS TIMESTAMP) FROM s;
> 2007-01-1T00:00:00
Casting from TIME to STRING will convert the TIME value to a STRING of the form hh:mm:ss
-- t is the time, '03:20:45'
SELECT CAST(t AS STRING) FROM s;
> 03:20:45
Casting from DATE to STRING will convert the DATE value to a STRING of the form YYYY-MM-DD
-- d is the date, '2007-01-01'
SELECT CAST(d AS STRING) FROM s;
> 2007-01-01
Casting from STRING to TIME will attempt to parse the string to a TIME value in hh:mm:ss[.sss]
format.
SELECT CAST('03:20:45' AS TIME) FROM s;
> 03:20:45
Casting from STRING to DATE will attempt to parse the string to a DATE value in YYYY-MM-DD
format.
SELECT CAST('2007-01-01' AS DATE) FROM s;
> 2007-01-01
The result of a DATE/DATE or TIME/TIME comparison will be the same as the comparison of the integer values of them (number of days since Unix epoch for DATE and number of milliseconds since the beginning of the day for TIME).
Any comparison with a TIMESTAMP will convert the other value to TIMESTAMP.
A STRING/DATE comparison will convert both the STRING and the DATE to TIMESTAMP.
A STRING/TIME comparison will convert the STRING to TIME if the format is hh:mm:ss[.sss]
, otherwise
it will convert the STRING to TIMEZONE.
Comparisons between DATE and TIME will not be allowed.
There will need to be tests for the following:
- Integration with Kafka Connect and Schema Registry
- All serialization formats
- QTTs with all of the new and updated UDFs
DATE and TIME are each a milestone. They can both be broken up as follows:
- Adding the types to the syntax - 2 days
- Serialization/deserialization - 3 days
- Documentation - 3 days
- Adding and updating UDFs + documentation - 1 week
- Add to Connect integration test - 1 day
DATE and TIME should be implemented together for the following componenets:
- Casting - 4 days
- Comparisons - 3 days
- The deprecation of DATETOSTRING and STRINGTODATE should be stated and users should be directed to the DATE type and the new set of UDFs.
- Add and update UDFs to
docs/developer-guide/ksqldb-reference/scalar-functions.md
- Serialization/deserialization information in
docs/reference/serialization.md
- Section on casting in
docs/developer-guide/ksqldb-reference/type-coercion.md
- Detailed description of all the time types in
docs/reference/sql/data-types.md
- New section in
docs/developer-guide/ksqldb-reference/operations.md
for comparisons
If a user issues a command that includes the DATE or TIME type, then previous versions of KSQL will not recognize the TIMESTAMP type, and the server will enter a DEGRADED state.
None