Merge pull request #19 from Cysharp/utf8
UTF8 String serialization
neuecc authored Oct 3, 2022
2 parents 5c28dee + c494495 commit af6c443
Showing 20 changed files with 490 additions and 82 deletions.
34 changes: 26 additions & 8 deletions README.md
@@ -336,9 +336,9 @@ Serialize has three overloads.

```csharp
// A non-generic API is also available; those versions take a Type as the first argument and the value as object?
byte[] Serialize<T>(in T? value)
void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value)
async ValueTask SerializeAsync<T>(Stream stream, T? value, CancellationToken cancellationToken = default)
byte[] Serialize<T>(in T? value, MemoryPackSerializeOptions? options = default)
void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value, MemoryPackSerializeOptions? options = default)
async ValueTask SerializeAsync<T>(Stream stream, T? value, MemoryPackSerializeOptions? options = default, CancellationToken cancellationToken = default)
```

For best performance, the recommended approach is to use `BufferWriter`, which serializes directly into the buffer. It can be applied to `PipeWriter` in `System.IO.Pipelines`, `BodyWriter` in ASP.NET Core, etc.
@@ -349,6 +349,16 @@ Note that `SerializeAsync` for `Stream` is asynchronous only for Flush; it seria

If you want a fully streaming write, see the [Streaming Serialization](#streaming-serialization) section.

### MemoryPackSerializeOptions

`MemoryPackSerializeOptions` configures whether strings are serialized as UTF16 or UTF8. If you pass null, `MemoryPackSerializeOptions.Default` is used, which is the same as `MemoryPackSerializeOptions.Utf8`; in other words, strings are serialized as UTF8. To serialize as UTF16, use `MemoryPackSerializeOptions.Utf16`.

Since C#'s internal string representation is UTF16, UTF16 performs better. However, the payload tends to be larger: an ASCII character takes one byte in UTF8 but two bytes in UTF16. Because this difference in payload size is so large, UTF8 is the default.

If the data is non-ASCII (e.g. Japanese, where a character can take 3 bytes or more in UTF8, making the UTF8 payload larger), or if you compress the payload separately, UTF16 may give better results.

Whether UTF8 or UTF16 is selected during serialization, it is not necessary to specify it during deserialization. It will be automatically detected and deserialized normally.
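
As a rough illustration of the size trade-off described above, the following sketch compares the size of the string body in both encodings using only `System.Text.Encoding`. The helper names `Utf8BodyBytes` and `Utf16BodyBytes` are hypothetical and not part of MemoryPack, and the counts exclude any length header the serializer writes.

```csharp
using System;
using System.Text;

// Hypothetical helpers (not part of MemoryPack): byte counts of the string body only.
static int Utf8BodyBytes(string text) => Encoding.UTF8.GetByteCount(text);
static int Utf16BodyBytes(string text) => text.Length * sizeof(char);

var asciiSample = "abcdefghijklmnopqrstuvwxyz0123456789"; // 36 ASCII characters
var hiraganaSample = "あいうえお";                          // 5 Hiragana characters

// ASCII: UTF8 is half the size of UTF16.
Console.WriteLine($"ASCII:    utf8={Utf8BodyBytes(asciiSample)}, utf16={Utf16BodyBytes(asciiSample)}");
// prints "ASCII:    utf8=36, utf16=72"

// Hiragana: each character is 3 bytes in UTF8 but 2 bytes in UTF16.
Console.WriteLine($"Hiragana: utf8={Utf8BodyBytes(hiraganaSample)}, utf16={Utf16BodyBytes(hiraganaSample)}");
// prints "Hiragana: utf8=15, utf16=10"
```

Measuring real payloads this way before choosing an option is probably more reliable than guessing, since mixed ASCII and non-ASCII text can go either way.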

Deserialize API
---
Deserialize has `ReadOnlySpan<byte>`, `ReadOnlySequence<byte>`, and `Stream` overloads, plus `ref` support.
@@ -473,10 +483,10 @@ Payload size depends on the target value; unlike JSON, there are no keys and it

Compared with formats that use varint encoding, such as MessagePack and Protobuf, MemoryPack payloads tend to be larger when many ints are used (in MemoryPack, ints are always 4 bytes due to fixed-size encoding, while in MsgPack they take 1 to 5 bytes).

Also, strings are usually UTF8 for other formats, but MemoryPack is UTF16 fixed length (2 bytes), so MemoryPack is larger if the string occupies ASCII. Conversely, MemoryPack may be smaller if the string contains many UTF8 characters of 3 bytes or more, such as Japanese.

float and double are 4 bytes and 8 bytes in MemoryPack, but 5 bytes and 9 bytes in MsgPack. So MemoryPack is smaller, for example, for Vector3 (float, float, float) arrays.

Strings are UTF8 by default, as in other serializers; if the UTF16 option is chosen, the size characteristics differ.

In any case, if the payload size is large, compression should be considered. LZ4, ZStandard and Brotli are recommended. An efficient way to combine compression and serialization will be presented at a later date.

Packages
@@ -548,15 +558,16 @@ If you request it, there is a possibility to make a detuned Unity version. Pleas

Binary wire format specification
---
The type `T` specified in `Serialize<T>` and `Deserialize<T>` is called the C# schema. The MemoryPack format is not self-describing, so Deserialize requires the corresponding C# schema. Four types exist as internal representations in the binary, but they cannot be determined without a C# schema.
The type `T` specified in `Serialize<T>` and `Deserialize<T>` is called the C# schema. The MemoryPack format is not self-describing, so Deserialize requires the corresponding C# schema. Five types exist as internal representations in the binary, but they cannot be determined without a C# schema.

Endianness is not specified, so payloads cannot be exchanged between machines with different endianness. However, modern computers are usually little-endian.

The format has four kinds of value.
The format has five kinds of value.

* Unmanaged struct
* Object
* Collection
* String
* Union

### Unmanaged struct
@@ -574,7 +585,14 @@ Object has 1byte unsigned byte as member count in header. Member count allows `0

`[int length, values...]`

A collection has a 4-byte signed integer data count in its header; `-1` represents `null`. Values stores that many MemoryPack values. A string is a collection (serialized as `ReadOnlySpan<char>`, in other words, UTF16).
A collection has a 4-byte signed integer data count in its header; `-1` represents `null`. Values stores that many MemoryPack values.
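
To make the `[int length, values...]` layout concrete, here is a minimal sketch that encodes an `int[]` by hand. `WriteIntCollection` is a hypothetical helper, not MemoryPack's implementation, and little-endian byte order is an assumption chosen so the output is deterministic.

```csharp
#nullable enable
using System;
using System.Buffers.Binary;

// Hypothetical helper: encodes int[] as [int length, values...].
// Little-endian byte order is an assumption of this sketch.
static byte[] WriteIntCollection(int[]? values)
{
    if (values is null)
    {
        // A null collection is only the header, with -1 as the data count.
        var nullHeader = new byte[4];
        BinaryPrimitives.WriteInt32LittleEndian(nullHeader, -1);
        return nullHeader;
    }

    // 4-byte header followed by one fixed-size 4-byte value per element.
    var buffer = new byte[4 + values.Length * 4];
    BinaryPrimitives.WriteInt32LittleEndian(buffer, values.Length);
    for (var i = 0; i < values.Length; i++)
    {
        BinaryPrimitives.WriteInt32LittleEndian(buffer.AsSpan(4 + i * 4), values[i]);
    }
    return buffer;
}

Console.WriteLine(Convert.ToHexString(WriteIntCollection(new[] { 1, 2 })));
// prints "020000000100000002000000"
Console.WriteLine(Convert.ToHexString(WriteIntCollection(null)));
// prints "FFFFFFFF"
```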

### String

`(int utf16-length, utf16-value)`
`(int ~utf8-length, int utf16-length, utf8-value)`

A string has two forms, UTF16 and UTF8. If the first 4-byte signed integer is `-1`, the string is null; if it is `0`, the string is empty. The UTF16 form is the same as a collection (serialized as `ReadOnlySpan<char>`; the byte count of utf16-value is utf16-length * 2). If the first signed integer is <= `-2`, the value is encoded as UTF8. utf8-length is stored as its bitwise complement, so applying `~` to the header retrieves the length. The next signed integer is utf16-length; it may be `-1`, which represents unknown length. utf8-value stores utf8-length bytes.
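
The two string forms can be sketched by hand as follows. All helper names here (`HeaderOnly`, `WriteUtf16Form`, `WriteUtf8Form`, `ReadStringForm`) are hypothetical and this is not MemoryPack's implementation; little-endian byte order is assumed, and the reader ignores the utf16-length hint of the UTF8 form for simplicity.

```csharp
#nullable enable
using System;
using System.Buffers.Binary;
using System.Text;

// Writes only the 4-byte header: -1 for null, 0 for an empty string.
static byte[] HeaderOnly(int header)
{
    var buffer = new byte[4];
    BinaryPrimitives.WriteInt32LittleEndian(buffer, header);
    return buffer;
}

// UTF16 form: (int utf16-length, utf16-value)
static byte[] WriteUtf16Form(string? text)
{
    if (text is null) return HeaderOnly(-1);
    if (text.Length == 0) return HeaderOnly(0);

    var buffer = new byte[4 + text.Length * 2];
    BinaryPrimitives.WriteInt32LittleEndian(buffer, text.Length);
    Encoding.Unicode.GetBytes(text.AsSpan(), buffer.AsSpan(4)); // UTF-16LE
    return buffer;
}

// UTF8 form: (int ~utf8-length, int utf16-length, utf8-value)
static byte[] WriteUtf8Form(string? text)
{
    if (text is null) return HeaderOnly(-1);
    if (text.Length == 0) return HeaderOnly(0);

    var utf8Length = Encoding.UTF8.GetByteCount(text);
    var buffer = new byte[8 + utf8Length];
    // utf8Length >= 1 here, so the complement is always <= -2.
    BinaryPrimitives.WriteInt32LittleEndian(buffer, ~utf8Length);
    BinaryPrimitives.WriteInt32LittleEndian(buffer.AsSpan(4), text.Length);
    Encoding.UTF8.GetBytes(text.AsSpan(), buffer.AsSpan(8));
    return buffer;
}

// The header alone tells the reader which form follows.
static string? ReadStringForm(ReadOnlySpan<byte> payload)
{
    var header = BinaryPrimitives.ReadInt32LittleEndian(payload);
    if (header == -1) return null;
    if (header == 0) return string.Empty;
    if (header > 0)
    {
        // UTF16 form: header is the char count, body is header * 2 bytes.
        return Encoding.Unicode.GetString(payload.Slice(4, header * 2));
    }

    // UTF8 form: header is ~utf8-length; bytes 4..8 hold utf16-length (unused here).
    var utf8Length = ~header;
    return Encoding.UTF8.GetString(payload.Slice(8, utf8Length));
}

var sample = "あいう";
Console.WriteLine(ReadStringForm(WriteUtf16Form(sample)) == sample); // prints "True"
Console.WriteLine(ReadStringForm(WriteUtf8Form(sample)) == sample);  // prints "True"
```

Because the sign of the header distinguishes the two forms, the reader needs no option to choose between them, which matches the automatic detection on deserialization described in the README.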

### Union

12 changes: 6 additions & 6 deletions sandbox/Benchmark/Benchmarks/DeserializeTest.cs
@@ -18,10 +18,10 @@

namespace Benchmark.Benchmarks;

[GenericTypeArguments(typeof(int))]
[GenericTypeArguments(typeof(Vector3[]))]
[GenericTypeArguments(typeof(JsonResponseModel))]
[GenericTypeArguments(typeof(NeuralNetworkLayerModel))]
//[GenericTypeArguments(typeof(int))]
//[GenericTypeArguments(typeof(Vector3[]))]
//[GenericTypeArguments(typeof(JsonResponseModel))]
//[GenericTypeArguments(typeof(NeuralNetworkLayerModel))]
public class DeserializeTest<T> : SerializerTestBase<T>
{
//SerializerSessionPool pool;
@@ -51,13 +51,13 @@ public DeserializeTest()
payloadJson = JsonSerializer.SerializeToUtf8Bytes(value);
}

[Benchmark(Baseline = true)]
[Benchmark]
public T MessagePackDeserialize()
{
return MessagePackSerializer.Deserialize<T>(payloadMessagePack);
}

[Benchmark]
[Benchmark(Baseline = true)]
public T? MemoryPackDeserialize()
{
return MemoryPackSerializer.Deserialize<T>(payloadMemoryPack);
23 changes: 18 additions & 5 deletions sandbox/Benchmark/Benchmarks/SerializeTest.cs
@@ -70,16 +70,22 @@ public SerializeTest()
jsonWriter = new Utf8JsonWriter(writer);
}

[Benchmark(Baseline = true), BenchmarkCategory(Categories.Bytes)]
[Benchmark, BenchmarkCategory(Categories.Bytes)]
public byte[] MessagePackSerialize()
{
return MessagePackSerializer.Serialize(value);
}

[Benchmark, BenchmarkCategory(Categories.Bytes)]
[Benchmark(Baseline = true), BenchmarkCategory(Categories.Bytes)]
public byte[] MemoryPackSerialize()
{
return MemoryPackSerializer.Serialize(value);
return MemoryPackSerializer.Serialize(value, MemoryPackSerializeOptions.Default);
}

[Benchmark, BenchmarkCategory(Categories.Bytes)]
public byte[] MemoryPackSerializeUtf16()
{
return MemoryPackSerializer.Serialize(value, MemoryPackSerializeOptions.Utf16);
}

// requires T:new(), can't test it.
@@ -113,20 +119,27 @@ public byte[] SystemTextJsonSerialize()
// return orleansSerializer.SerializeToArray(value);
//}

[Benchmark(Baseline = true), BenchmarkCategory(Categories.BufferWriter)]
[Benchmark, BenchmarkCategory(Categories.BufferWriter)]
public void MessagePackBufferWriter()
{
MessagePackSerializer.Serialize(writer, value);
writer.Clear();
}

[Benchmark, BenchmarkCategory(Categories.BufferWriter)]
[Benchmark(Baseline = true), BenchmarkCategory(Categories.BufferWriter)]
public void MemoryPackBufferWriter()
{
MemoryPackSerializer.Serialize(writer, value);
writer.Clear();
}

[Benchmark, BenchmarkCategory(Categories.BufferWriter)]
public void MemoryPackBufferWriterUtf16()
{
MemoryPackSerializer.Serialize(writer, value, MemoryPackSerializeOptions.Utf16);
writer.Clear();
}

//[Benchmark]
//public void BinaryPackStream()
//{
107 changes: 107 additions & 0 deletions sandbox/Benchmark/Benchmarks/Utf16VsUtf8.cs
@@ -0,0 +1,107 @@
using Benchmark.BenchmarkNetUtilities;
using BinaryPack.Models.Helpers;
using MemoryPack;
using System.Net.Http;

namespace Benchmark.Benchmarks;

[PayloadColumn]
public class Utf16VsUtf8
{
readonly string ascii;
readonly string japanese;
readonly string largeAscii;

readonly byte[] utf16Jpn;
readonly byte[] utf8Jpn;
readonly byte[] utf16Ascii;
readonly byte[] utf8Ascii;
readonly byte[] utf16LargeAscii;
readonly byte[] utf8LargeAscii;

public Utf16VsUtf8()
{
this.japanese = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん";
this.ascii = "abcedfghijklmnopqrstuvwxyz0123456789";
this.utf16Jpn = MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf16);
this.utf8Jpn = MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf8);
this.utf16Ascii = MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf16);
this.utf8Ascii = MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf8);

this.largeAscii = RandomProvider.NextString(600);
this.utf16LargeAscii = MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf16);
this.utf8LargeAscii = MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf8);
}

[Benchmark]
public byte[] SerializeUtf16Ascii()
{
return MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf16);
}

[Benchmark]
public byte[] SerializeUtf16Japanese()
{
return MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf16);
}

[Benchmark]
public byte[] SerializeUtf8Ascii()
{
return MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf8);
}

[Benchmark]
public byte[] SerializeUtf8Japanese()
{
return MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf8);
}

[Benchmark]
public byte[] SerializeUtf16LargeAscii()
{
return MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf16);
}

[Benchmark]
public byte[] SerializeUtf8LargeAscii()
{
return MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf8);
}

[Benchmark]
public void DeserializeUtf16Ascii()
{
MemoryPackSerializer.Deserialize<string>(utf16Ascii);
}

[Benchmark]
public void DeserializeUtf16Japanese()
{
MemoryPackSerializer.Deserialize<string>(utf16Jpn);
}

[Benchmark]
public void DeserializeUtf8Ascii()
{
MemoryPackSerializer.Deserialize<string>(utf8Ascii);
}

[Benchmark]
public void DeserializeUtf8Japanese()
{
MemoryPackSerializer.Deserialize<string>(utf8Jpn);
}

[Benchmark]
public void DeserializeUtf16LargeAscii()
{
MemoryPackSerializer.Deserialize<string>(utf16LargeAscii);
}

[Benchmark]
public void DeserializeUtf8LargeAscii()
{
MemoryPackSerializer.Deserialize<string>(utf8LargeAscii);
}
}
4 changes: 2 additions & 2 deletions sandbox/Benchmark/Micro/GetLocalVsStaticField.cs
@@ -24,7 +24,7 @@ public GetLocalVsStaticField()
[Benchmark(Baseline = true)]
public void GetFromProvider()
{
var writer = new MemoryPackWriter<ArrayBufferWriter<byte>>(ref bufferWriter);
var writer = new MemoryPackWriter<ArrayBufferWriter<byte>>(ref bufferWriter, MemoryPackSerializeOptions.Default);
for (int i = 0; i < 100; i++)
{
writer.GetFormatter<int>().Serialize(ref writer, ref i);
@@ -35,7 +35,7 @@ public void GetFromProvider()
[Benchmark]
public void GetFromLocal()
{
var writer = new MemoryPackWriter<ArrayBufferWriter<byte>>(ref bufferWriter);
var writer = new MemoryPackWriter<ArrayBufferWriter<byte>>(ref bufferWriter, MemoryPackSerializeOptions.Default);
var provider = writer.GetFormatter<int>();
for (int i = 0; i < 100; i++)
{
10 changes: 5 additions & 5 deletions sandbox/Benchmark/Micro/RawSerialize.cs
@@ -71,7 +71,7 @@ public byte[] HandMemoryPackWriterEmpty()
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
}

var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
try
{
if (value == null)
@@ -106,7 +106,7 @@ public byte[] HandMemoryPackWriterHeaderOnly()
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
}

var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
try
{
if (value == null)
@@ -140,7 +140,7 @@ public byte[] HandMemoryPackWriterHeaderInt3()
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
}

var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
try
{
if (value == null)
@@ -174,7 +174,7 @@ public byte[] HandMemoryPackWriterHeaderInt3String1()
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
}

var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
try
{
if (value == null)
@@ -208,7 +208,7 @@ public byte[] HandMemoryPackFull()
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
}

var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
try
{
if (value == null)
39 changes: 39 additions & 0 deletions sandbox/Benchmark/Micro/Utf8Decoding.cs
@@ -0,0 +1,39 @@
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.Unicode;
using System.Threading.Tasks;

namespace Benchmark.Micro;

public class Utf8Decoding
{
byte[] utf8bytes;
int utf8length;
int utf16length;

public Utf8Decoding()
{
// Japanese Hiragana
var text = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん";
utf8bytes = Encoding.UTF8.GetBytes(text);
utf8length = utf8bytes.Length;
utf16length = text.Length;
}

[Benchmark]
public string UTF8GetString()
{
return Encoding.UTF8.GetString(utf8bytes);
}

[Benchmark]
public string Utf16LengthUtf8ToUtf16()
{
return string.Create(utf16length, utf8bytes, static (dest, source) =>
{
Utf8.ToUtf16(source, dest, out var read, out var written);
});
}
}