Skip to content

Commit

Permalink
[InLong-1012][Doc] Add InLong sort format usage and extend doc
Browse files Browse the repository at this point in the history
  • Loading branch information
baomingyu committed Sep 29, 2024
1 parent 675ad89 commit 95d3656
Show file tree
Hide file tree
Showing 7 changed files with 3,268 additions and 0 deletions.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
---
title: InLong sort format extend
sidebar_position: 5
---
## Overview

This article is aimed at InLong-Sort-Formats data format parsing developers and aims to comprehensively explain the process of developing data parsing for a data format.

The InLong-Sort-Formats module supports two major types of data format parsing, implemented based on the Flink Row and Flink RowData types. These two implementations differ only in the Flink API used. This article will describe the implementation based on the Flink RowData.

Currently, InLong-Sort supports the following formats, including 6 formats encapsulated in the InLongMsg format and 3 original data formats:
- InLongMsg binlog
- InLongMSg CSV
- InLongMsg KV
- InLongMsg Tlog-CSV
- InLongMsg Tlog-KV
- InLongMsg PB
- CSV
- KV
- JSON
By implementing the data parsing process for these formats, developers can effectively handle and process data in the InLong-Sort module.

## Before Development

- InLongMsg refer to [InLongMsg](img/inlong_msg.md);
- When selecting a specific parsing method, it is important to note that the original data, after passing through the DataProxy module, is encapsulated using the InLongMsg format;
- InLongMsg encapsulates at least one data record, so the parsing logic should handle scenarios with multiple records.
- Each InLongMsg consists of two parts: the message header and the message body:
+ The message header is composed of key-value pairs in the format of k1=v1&k2=v2. It should contain essential information such as the streamId (stream ID) and dt (data timestamp).
+ The message body is represented by a binary array of the specific data format to be parsed, such as key-value (kv), comma-separated values (csv), and so on.
When implementing the parsing process, you need to extract the message header and body separately. The header can be parsed to retrieve the necessary information, while the body should be processed based on the specific data format such as CSV, kv and so on.

By understanding the structure and components of the InLongMsg format, you can develop the appropriate parsing logic to handle multiple records and extract the necessary information from the message header and body.

## Processing
- Raw Format Data
+ Parse processing
Step 1: Build a DeserializationFormatFactory object for the specific format, for example, CsvFormatFactory.
Step 2: Use the DeserializationFormatFactory object to obtain the corresponding DecodingFormat object.
Step 3: Use the DecodingFormat object to obtain the DeserializationSchema object for the specific format.
Step 4: Use the DeserializationSchema object to call the deserialize(byte[] message) or deserialize(byte[] message, Collector<T> out) method, passing the data to be parsed and retrieving the parsed result.

+ extend processing
To extend the parsing of raw data formats that are not encapsulated in the InLongMsg format, you would need to implement the three interfaces shown in Figure 1. The arrows in the diagram represent the calling relationships between the implementations.
![The extension of parsing raw data formats not encapsulated in the InLongMsg](img/sort_data_raw_format_extend.png) <center> Figure 1 The extension of parsing raw data formats not encapsulated in the InLongMsg</center>

- Data formatted using InLongMsg
+ Parse processing
Step 1: Build a specific format DeserializationFormatFactory, such as InLongMsgCsvFormatFactory. This factory class is responsible for creating parsers for the desired format.
Step 2: Using the DeserializationFormatFactory object, obtain the corresponding DecodingFormat object, which is a subclass of AbstractInLongMsgDecodingFormat. This object is used to decode the InLongMsg formatted data.
Step 3: Using the DecodingFormat object, obtain the DeserializationSchema object corresponding to the specific format. This object is a subclass of AbstractInLongMsgDeserializationSchema. It defines the parsing rules and how the parsed results are returned.
Step 4: Using the DeserializationSchema object, call the deserialize(byte[] message) or deserialize(byte[] message, Collector<T> out) methods. Pass the data to be parsed as input and retrieve the parsed results. These methods will parse the data according to the defined rules and return the parsed results.
By following these steps, you can parse data in the specific format of InLongMsg and obtain the parsed results. The actual implementation may involve specific classes and methods based on your requirements and the specific parsing format.

+ extend processing
To extend the parsing of data formats encapsulated in the InLongMsg format, you need to implement one interface and inherit three base classes as shown in Figure 2. The arrows in the diagram represent the calling relationships between the implementations. ![To extend the parsing of data formats encapsulated in the InLongMsg format](img/sort_data_inlongmsg_format_extend.png) <center>Figure 2 To extend the parsing of data formats encapsulated in the InLongMsg format</center>
Compareing with the parsing process shown in Figure 1, the main difference is that the DecodingFormat and DeserializationSchema obtained in the Figure 2 are subclasses of AbstractInLongMsgDecodingFormat and AbstractInLongMsgDeserializationSchema.
For the implementation of the subclass of AbstractInLongMsgDeserializationSchema, it is recommended to implement a subclass of AbstractInLongMsgFormatDeserializer and invoke it.

## Demo

- For Raw Format Data
refer to:[format-rowdata-kv](https://github.com/apache/inlong/tree/master/inlong-sort/sort-formats/format-rowdata/format-rowdata-kv)
- For InLongMsg Format Data
refer to:[format-inlongmsg-rowdata-kv](https://github.com/apache/inlong/tree/master/inlong-sort/sort-formats/format-rowdata/format-inlongmsg-rowdata-kv)

## Last but not Least

In Inlong-Sort-formats, we provide implementations for common data formats to achieve out-of-the-box usability. We have also designed a unified data parsing and processing framework that allows developers to customize their own data parsing and transformation based on the characteristics of the data format they are working with.

However, it's important to note that our current implementation and architectural design are primarily focused on meeting the current parsing and processing requirements, and we currently only support a few common data formats. We are committed to continuously expanding the range of supported data formats, improving parsing efficiency, and making overall enhancements. We also welcome your contributions and involvement in this endeavor.

Thank you for your feedback and support! If you have any questions or suggestions, please feel free to reach out to us.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
---
title: InLong 分拣数据组织及协议解析
sidebar_position: 5
---
## 总览

本文面向 InLong-Sort-Formats 数据分拣开发人员, 尝试尽可能全面地阐述开发一个数据格式的数据解析过程。
InLong-Sort-Formats 模块支持两大类的数据格式解析,分别基于 Flink Row 和 Flink RowData 类型实现,这两类实现仅仅是,使用的 Flink API 不同,本文基于 Flink RowData 方式的实现进行描述。
目前,InLong-Sort 支持如下几种格式(通过 InLongMsg 格式封装的 6 种,原始的数据格式 3 种):
- InLongMsg binlog
- InLongMSg CSV
- InLongMsg KV
- InLongMsg Tlog-CSV
- InLongMsg Tlog-KV
- InLongMsg PB
- CSV
- KV
- JSON

## 开发之前

- InLongMsg 格式介绍参照 [InLongMsg](img/inlong_msg.md)
- 原始数据经过 DataProxy 模块的数据均会使用 InLongMsg 格式进行封装,在选用具体的解析方式时,需要注意。
- InLongMsg 中会包含至少一条数据,解析的时候需要处理包含多条的场景;
- InLongMsg 每条消息由两部分组成,消息头和消息体,其中:
+ 消息头:是由 k1=v1&k2=v2 这种格式的 kv 对组成,至少包含 streamId (流 Id)、dt(数据时间戳)等基本信息。
+ 消息体:由具体要解析的数据格式的二进制数组表示,如 kv、csv 等格式。

## 流程图示
- 原始格式数据
+ 解析流程
第一步:构建具体格式的 DeserializationFormatFactory 对象,例如:CsvFormatFactory;
第二步:通过 DeserializationFormatFactory 对象,获取对应格式的 DecodingFormat 对象;
第三步:通过 DecodingFormat 对象获取 具体格式对应的 DeserializationSchema 对象;
第四步:通过 DeserializationSchema,调用 deserialize(byte[] message) 或 deserialize(byte[] message, Collector<T> out) 方法,传递要解析的数据及获取数据解析后的结果。

+ 扩展流程
如图1 所示,扩展非 InLongMsg 格式封装的原始数据格式的解析,需要实现图1 中的三个接口, 图中箭头表示实现间的调用关系。 ![非 InLongMsg 格式封装的原始数据格式解析扩展](img/sort_data_raw_format_extend.png) <center>图 1 非 InLongMsg 格式封装的原始数据格式解析扩展</center>

- 经过 InLong Msg 封装的格式数据
+ 解析流程
第一步:构建具体格式的 DeserializationFormatFactory InLongMsgCsvFormatFactory
第二步:通过 DeserializationFormatFactory 对象,获取对应格式的 DecodingFormat 对象 (AbstractInLongMsgDecodingFormat 类的子类);
第三步:通过 DecodingFormat 对象获取 具体格式对应的 DeserializationSchema 对象 (AbstractInLongMsgDeserializationSchema 类的子类);
第四步:通过 DeserializationSchema,调用 deserialize(byte[] message) 或 deserialize(byte[] message, Collector<T> out) 方法,传递要解析的数据及获取数据解析后的结果。

+ 扩展流程
如图2 所示,扩展 InLongMsg 格式封装的数据格式的解析,需要实现图 2 中的1个接口和继承实现3个基类,图中箭头表示实现间的调用关系: ![InLongMsg 格式封装的数据格式解析扩展](img/sort_data_inlongmsg_format_extend.png) <center>图 2 InLongMsg 格式封装的数据格式解析扩展</center>
扩展流程与图1 中所示流程最大的区别是,中间的获取的 DecodingFormat、DeserializationSchema 分别是 AbstractInLongMsgDecodingFormat 与 AbstractInLongMsgDeserializationSchema 类的子类。
其中,AbstractInLongMsgDeserializationSchema 子类的实现,建议通过实现 AbstractInLongMsgFormatDeserializer 基类的子类, 并调用实现。

## 参考 Demo

- 原始格式数据解析扩展
参考:[format-rowdata-kv](https://github.com/apache/inlong/tree/master/inlong-sort/sort-formats/format-rowdata/format-rowdata-kv)
- InLongMsg 封装后的格式数据解析扩展
参考:[format-inlongmsg-rowdata-kv](https://github.com/apache/inlong/tree/master/inlong-sort/sort-formats/format-rowdata/format-inlongmsg-rowdata-kv)

## 写在最后

我们在 Inlong-Sort-formats 中提供常见的多种数据格式的解析处理实现,以便达到开箱即用的目的,同时设计了统一的数据解析处理框架,方便开发人员基于接入的数据格式特点,自定义实现数据解析、转换等方式。诚然,当前实现方式、架构设计仅仅是为了满足当前的解析处理需求和扩展需求, 当前也仅仅支持了几种常见的数据格式的解析,我们会持续致力于丰富更多数据格式的解析和解析效率的提升、改进,也欢迎您的加入。
Loading

0 comments on commit 95d3656

Please sign in to comment.