UTF8 as default
neuecc committed Oct 3, 2022
1 parent 22cac66 commit c494495
Showing 11 changed files with 154 additions and 43 deletions.
20 changes: 15 additions & 5 deletions README.md
@@ -336,9 +336,9 @@ Serialize has three overloads.

```csharp
// A non-generic API is also available; in those overloads the first argument is a Type and the value is object?
byte[] Serialize<T>(in T? value)
void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value)
async ValueTask SerializeAsync<T>(Stream stream, T? value, CancellationToken cancellationToken = default)
byte[] Serialize<T>(in T? value, MemoryPackSerializeOptions? options = default)
void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value, MemoryPackSerializeOptions? options = default)
async ValueTask SerializeAsync<T>(Stream stream, T? value, MemoryPackSerializeOptions? options = default, CancellationToken cancellationToken = default)
```

For best performance, the recommended approach is to use `BufferWriter`, which serializes directly into the buffer. It can be applied to `PipeWriter` in `System.IO.Pipelines`, `BodyWriter` in ASP.NET Core, etc.
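
As a rough sketch of the `BufferWriter` path, the snippet below serializes into an `ArrayBufferWriter<byte>` from `System.Buffers`. The choice of `ArrayBufferWriter<byte>` and the sample string are assumptions made for illustration; the `Serialize(writer, value, options)` call shape follows the overload listed above.

```csharp
using System;
using System.Buffers;
using MemoryPack;

// Assumption: ArrayBufferWriter<byte> stands in for any IBufferWriter<byte>,
// such as a PipeWriter or an ASP.NET Core BodyWriter.
var writer = new ArrayBufferWriter<byte>();
var value = "hogehoge";

// Serialize directly into the writer's buffer; no byte[] is returned.
MemoryPackSerializer.Serialize(writer, value, MemoryPackSerializeOptions.Utf8);

// The payload is whatever the writer has accumulated so far.
var payloadLength = writer.WrittenCount;
Console.WriteLine(payloadLength);

// Reset the writer so it can be reused for the next message.
writer.Clear();
```

Reusing one writer across messages avoids allocating a fresh array per call, which is the main reason to prefer this overload over the `byte[]` one.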
@@ -349,6 +349,16 @@ Note that `SerializeAsync` for `Stream` is asynchronous only for Flush; it seria

If you want fully streaming writes, see the [Streaming Serialization](#streaming-serialization) section.

### MemoryPackSerializeOptions

`MemoryPackSerializeOptions` configures whether strings are serialized as UTF16 or UTF8. Passing null uses `MemoryPackSerializeOptions.Default`, which is the same as `MemoryPackSerializeOptions.Utf8`; in other words, strings are serialized as UTF8. To serialize as UTF16, use `MemoryPackSerializeOptions.Utf16`.

Since C#'s internal string representation is UTF16, UTF16 performs better. However, the payload tends to be larger: an ASCII character takes one byte in UTF8 but two bytes in UTF16. Because this difference in payload size is so large, UTF8 is the default.

If the data is mostly non-ASCII (e.g. Japanese, where a character typically takes 3 bytes in UTF8 versus 2 bytes in UTF16, so UTF8 is larger), or if you compress the payload separately, UTF16 may give better results.
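
To make the size trade-off concrete, here is a small sketch that uses only `System.Text.Encoding`. It counts the bytes of the raw character data and does not include MemoryPack's own length headers. The sample strings are arbitrary choices for illustration.

```csharp
using System;
using System.Text;

var ascii = "abcdefghijklmnopqrstuvwxyz0123456789"; // 36 ASCII characters
var japanese = "あいうえおかきくけこ";               // 10 hiragana characters

// Encoding.Unicode is UTF-16 (little endian) in .NET.
var asciiUtf8 = Encoding.UTF8.GetByteCount(ascii);
var asciiUtf16 = Encoding.Unicode.GetByteCount(ascii);
var japaneseUtf8 = Encoding.UTF8.GetByteCount(japanese);
var japaneseUtf16 = Encoding.Unicode.GetByteCount(japanese);

Console.WriteLine($"ASCII:    UTF8={asciiUtf8}, UTF16={asciiUtf16}");       // 36 vs 72
Console.WriteLine($"Japanese: UTF8={japaneseUtf8}, UTF16={japaneseUtf16}"); // 30 vs 20
```

ASCII text doubles in size under UTF16, while hiragana shrinks by a third, which matches the reasoning for choosing UTF8 as the default and UTF16 as an opt-in.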

Whether UTF8 or UTF16 is selected during serialization, it is not necessary to specify it during deserialization. It will be automatically detected and deserialized normally.
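
A minimal round-trip sketch of this behavior, using only calls that appear elsewhere in this commit (`Serialize(value, options)` and `Deserialize<string>(bin)`). The sample string is arbitrary.

```csharp
using System;
using MemoryPack;

var text = "hogehoge";

// Serialize the same string once per encoding option.
byte[] utf8Bin = MemoryPackSerializer.Serialize(text, MemoryPackSerializeOptions.Utf8);
byte[] utf16Bin = MemoryPackSerializer.Serialize(text, MemoryPackSerializeOptions.Utf16);

// Deserialize takes no encoding option; the reader works out the format from the payload.
var fromUtf8 = MemoryPackSerializer.Deserialize<string>(utf8Bin);
var fromUtf16 = MemoryPackSerializer.Deserialize<string>(utf16Bin);

Console.WriteLine(fromUtf8 == text);
Console.WriteLine(fromUtf16 == text);
Console.WriteLine($"UTF8 payload: {utf8Bin.Length} bytes, UTF16 payload: {utf16Bin.Length} bytes");
```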

Deserialize API
---
Deserialize has `ReadOnlySpan<byte>`, `ReadOnlySequence<byte>`, and `Stream` overloads, plus `ref` support.
@@ -473,10 +483,10 @@ Payload size depends on the target value; unlike JSON, there are no keys and it

Compared with formats that use varint encoding, such as MessagePack and Protobuf, MemoryPack tends to be larger when ints are used heavily (in MemoryPack, ints are always 4 bytes due to fixed-size encoding, while in MsgPack they are 1 to 5 bytes).
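
The sketch below illustrates how a variable-length integer compares with a fixed 4-byte int. It computes the length of a Protobuf-style base-128 varint, which is an assumption chosen for simplicity. MessagePack uses a different, tag-based integer layout, so treat the numbers as an illustration of the general idea rather than of MsgPack's exact sizes. `VarintLength` is a hypothetical helper, not part of any library.

```csharp
using System;

// Hypothetical helper: bytes needed by a base-128 varint (7 payload bits per byte).
static int VarintLength(uint value)
{
    var length = 1;
    while (value >= 0x80)
    {
        value >>= 7;
        length++;
    }
    return length;
}

var small = VarintLength(1);             // fits in 7 bits
var medium = VarintLength(300);          // needs 9 bits
var large = VarintLength(uint.MaxValue); // needs 32 bits

Console.WriteLine($"varint: {small}, {medium}, {large} bytes; fixed: {sizeof(int)} bytes");
// prints "varint: 1, 2, 5 bytes; fixed: 4 bytes"
```

Small values save up to three bytes each under a varint scheme, whereas the fixed-size encoding only comes out ahead for values that need the full width.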

Also, strings are usually UTF8 for other formats, but MemoryPack is UTF16 fixed length (2 bytes), so MemoryPack is larger if the string occupies ASCII. Conversely, MemoryPack may be smaller if the string contains many UTF8 characters of 3 bytes or more, such as Japanese.

float and double are 4 bytes and 8 bytes in MemoryPack, but 5 bytes and 9 bytes in MsgPack. So MemoryPack is smaller, for example, for Vector3 (float, float, float) arrays.

Strings are UTF8 by default, as in other serializers; if the UTF16 option is chosen, the size characteristics differ.

In any case, if the payload size is large, compression should be considered. LZ4, ZStandard and Brotli are recommended. An efficient way to combine compression and serialization will be presented at a later date.

Packages
12 changes: 6 additions & 6 deletions sandbox/Benchmark/Benchmarks/DeserializeTest.cs
@@ -18,10 +18,10 @@

namespace Benchmark.Benchmarks;

[GenericTypeArguments(typeof(int))]
[GenericTypeArguments(typeof(Vector3[]))]
[GenericTypeArguments(typeof(JsonResponseModel))]
[GenericTypeArguments(typeof(NeuralNetworkLayerModel))]
//[GenericTypeArguments(typeof(int))]
//[GenericTypeArguments(typeof(Vector3[]))]
//[GenericTypeArguments(typeof(JsonResponseModel))]
//[GenericTypeArguments(typeof(NeuralNetworkLayerModel))]
public class DeserializeTest<T> : SerializerTestBase<T>
{
//SerializerSessionPool pool;
@@ -51,13 +51,13 @@ public DeserializeTest()
payloadJson = JsonSerializer.SerializeToUtf8Bytes(value);
}

[Benchmark(Baseline = true)]
[Benchmark]
public T MessagePackDeserialize()
{
return MessagePackSerializer.Deserialize<T>(payloadMessagePack);
}

[Benchmark]
[Benchmark(Baseline = true)]
public T? MemoryPackDeserialize()
{
return MemoryPackSerializer.Deserialize<T>(payloadMemoryPack);
29 changes: 21 additions & 8 deletions sandbox/Benchmark/Benchmarks/SerializeTest.cs
@@ -30,10 +30,10 @@ namespace Benchmark.Benchmarks;
//[GenericTypeArguments(typeof(MyClass))]


//[GenericTypeArguments(typeof(int))]
//[GenericTypeArguments(typeof(Vector3[]))]
//[GenericTypeArguments(typeof(JsonResponseModel))]
//[GenericTypeArguments(typeof(NeuralNetworkLayerModel))]
[GenericTypeArguments(typeof(int))]
[GenericTypeArguments(typeof(Vector3[]))]
[GenericTypeArguments(typeof(JsonResponseModel))]
[GenericTypeArguments(typeof(NeuralNetworkLayerModel))]
[CategoriesColumn]
[PayloadColumn]
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory)]
@@ -70,18 +70,24 @@ public SerializeTest()
jsonWriter = new Utf8JsonWriter(writer);
}

[Benchmark(Baseline = true), BenchmarkCategory(Categories.Bytes)]
[Benchmark, BenchmarkCategory(Categories.Bytes)]
public byte[] MessagePackSerialize()
{
return MessagePackSerializer.Serialize(value);
}

[Benchmark, BenchmarkCategory(Categories.Bytes)]
[Benchmark(Baseline = true), BenchmarkCategory(Categories.Bytes)]
public byte[] MemoryPackSerialize()
{
return MemoryPackSerializer.Serialize(value, MemoryPackSerializeOptions.Default);
}

[Benchmark, BenchmarkCategory(Categories.Bytes)]
public byte[] MemoryPackSerializeUtf16()
{
return MemoryPackSerializer.Serialize(value, MemoryPackSerializeOptions.Utf16);
}

// requires T:new(), can't test it.
//[Benchmark]
//public byte[] BinaryPackSerialize()
@@ -113,20 +119,27 @@ public byte[] SystemTextJsonSerialize()
// return orleansSerializer.SerializeToArray(value);
//}

[Benchmark(Baseline = true), BenchmarkCategory(Categories.BufferWriter)]
[Benchmark, BenchmarkCategory(Categories.BufferWriter)]
public void MessagePackBufferWriter()
{
MessagePackSerializer.Serialize(writer, value);
writer.Clear();
}

[Benchmark, BenchmarkCategory(Categories.BufferWriter)]
[Benchmark(Baseline = true), BenchmarkCategory(Categories.BufferWriter)]
public void MemoryPackBufferWriter()
{
MemoryPackSerializer.Serialize(writer, value);
writer.Clear();
}

[Benchmark, BenchmarkCategory(Categories.BufferWriter)]
public void MemoryPackBufferWriterUtf16()
{
MemoryPackSerializer.Serialize(writer, value, MemoryPackSerializeOptions.Utf16);
writer.Clear();
}

//[Benchmark]
//public void BinaryPackStream()
//{
12 changes: 6 additions & 6 deletions sandbox/Benchmark/Benchmarks/Utf16VsUtf8.cs
@@ -23,26 +23,26 @@ public Utf16VsUtf8()
{
this.japanese = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん";
this.ascii = "abcedfghijklmnopqrstuvwxyz0123456789";
this.utf16Jpn = MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Default);
this.utf16Jpn = MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf16);
this.utf8Jpn = MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf8);
this.utf16Ascii = MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Default);
this.utf16Ascii = MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf16);
this.utf8Ascii = MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf8);

this.largeAscii = RandomProvider.NextString(600);
this.utf16LargeAscii = MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Default);
this.utf16LargeAscii = MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf16);
this.utf8LargeAscii = MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf8);
}

[Benchmark]
public byte[] SerializeUtf16Ascii()
{
return MemoryPackSerializer.Serialize(ascii);
return MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf16);
}

[Benchmark]
public byte[] SerializeUtf16Japanese()
{
return MemoryPackSerializer.Serialize(japanese);
return MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf16);
}

[Benchmark]
@@ -60,7 +60,7 @@ public byte[] SerializeUtf8Japanese()
[Benchmark]
public byte[] SerializeUtf16LargeAscii()
{
return MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Default);
return MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf16);
}

[Benchmark]
8 changes: 5 additions & 3 deletions sandbox/Benchmark/Program.cs
@@ -44,12 +44,14 @@

//BenchmarkRunner.Run<SerializeTest<JsonResponseModel>>(config, args);

BenchmarkRunner.Run<Utf16VsUtf8>(config, args);
//BenchmarkRunner.Run<Utf16VsUtf8>(config, args);

//BenchmarkRunner.Run<SerializeTest<NeuralNetworkLayerModel>>(config, args);

// BenchmarkRunner.Run<DeserializeTest<NeuralNetworkLayerModel>>(config, args);
//BenchmarkRunner.Run<DeserializeTest<JsonResponseModel>>(config, args);


BenchmarkRunner.Run<DeserializeTest<JsonResponseModel>>(config, args);


//BenchmarkRunner.Run<GetLocalVsStaticField>(config, args);
@@ -67,7 +69,7 @@
Console.WriteLine(foo);

Check<JsonResponseModel>();
//Check<NeuralNetworkLayerModel>();
Check<NeuralNetworkLayerModel>();

void Check<T>()
where T : IInitializable, IEquatable<T>, new()
26 changes: 22 additions & 4 deletions sandbox/SandboxConsoleApp/Program.cs
@@ -16,22 +16,40 @@
using System.Linq;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Xml.Linq;



var bin = MemoryPackSerializer.Serialize("hogehoge");
var takotako = MemoryPackSerializer.Deserialize<string>(bin);
//var bin = MemoryPackSerializer.Serialize("hogehoge");
//var takotako = MemoryPackSerializer.Deserialize<string>(bin);

Console.WriteLine(takotako);
//Console.WriteLine(takotako);

// ---

var str = "あいうえおかきくけこさしすせそたちつてとなにぬねの";
var bytes = Encoding.UTF8.GetBytes(str);

var encoder = new BrotliEncoder(4, 22);

//var encoder = new BrotliEncoder(4, 22);




var dest = new byte[1024];

//bytes = MemoryMarshal.AsBytes(str.AsSpan()).ToArray();

encoder.Compress(bytes, dest, out var consumed, out var written, true);


var foo = dest.AsSpan(0, written).ToArray();

Console.WriteLine(bytes.Length);
Console.WriteLine(foo.Length);

//// new BrotliDecoder().Decompress(

20 changes: 16 additions & 4 deletions src/MemoryPack.Core/MemoryPackReader.cs
@@ -1,4 +1,5 @@
using System.Buffers;
using System.Reflection.Emit;
using System.Reflection.Metadata;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
@@ -234,28 +235,39 @@ string ReadUtf8(int utf8Length)

utf8Length = ~utf8Length;

// TODO:security

ref var spanRef = ref GetSpanReference(utf8Length + 4); // + read utf16 length

string str;
var utf16Length = Unsafe.ReadUnaligned<int>(ref spanRef);

if (utf16Length <= 0)
{
var src = MemoryMarshal.CreateReadOnlySpan(ref Unsafe.Add(ref spanRef, 4), utf8Length);
str = Encoding.UTF8.GetString(src);
}
else
{
// check malformed utf16Length
var max = unchecked((Remaining + 1) * 3);
if (max < 0) max = int.MaxValue;
if (max < utf16Length)
{
MemoryPackSerializationException.ThrowInsufficientBufferUnless(utf8Length);
}

// regular path: knowing the decoded UTF16 length up front makes decoding faster
unsafe
{
fixed (byte* p = &Unsafe.Add(ref spanRef, 4))
{
str = string.Create(utf16Length, ((IntPtr)p, utf8Length), static (dest, state) =>
{
var src = MemoryMarshal.CreateSpan(ref Unsafe.AsRef<byte>((byte*)state.Item1), state.Item2);
var status = Utf8.ToUtf16(src, dest, out var bytesRead, out var charsWritten);
// TODO: throw when status failed
var status = Utf8.ToUtf16(src, dest, out var bytesRead, out var charsWritten, replaceInvalidSequences: false);
if (status != OperationStatus.Done)
{
MemoryPackSerializationException.ThrowFailedEncoding(status);
}
});
}
}
9 changes: 8 additions & 1 deletion src/MemoryPack.Core/MemoryPackSerializationException.cs
@@ -1,4 +1,5 @@
using System.Diagnostics.CodeAnalysis;
using System.Buffers;
using System.Diagnostics.CodeAnalysis;

namespace MemoryPack;

@@ -103,4 +104,10 @@ public static void ThrowDeserializeObjectIsNull(string target)
{
throw new MemoryPackSerializationException($"Deserialized {target} is null.");
}

[DoesNotReturn]
public static void ThrowFailedEncoding(OperationStatus status)
{
throw new MemoryPackSerializationException($"Failed Utf8 encoding/decoding process, status: {status}.");
}
}
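
The `ThrowFailedEncoding` helper above receives the `OperationStatus` returned by the `Utf8` transcoder. As a standalone sketch of when that status is not `Done`, the snippet below feeds `Utf8.FromUtf16` a string containing a lone surrogate with `replaceInvalidSequences: false`. The input string and buffer size are assumptions chosen for illustration, and the snippet does not call into MemoryPack.

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

// A lone high surrogate (U+D800) between two letters is not valid UTF-16.
ReadOnlySpan<char> source = "a\uD800b";
Span<byte> dest = stackalloc byte[16];

// With replaceInvalidSequences: false the transcoder reports the problem
// instead of silently substituting U+FFFD replacement characters.
var status = Utf8.FromUtf16(source, dest, out var charsRead, out var bytesWritten,
    replaceInvalidSequences: false);

Console.WriteLine(status); // InvalidData
```

Surfacing the failure as an exception, rather than writing replacement characters, means a malformed string cannot be altered silently during a round trip.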
7 changes: 5 additions & 2 deletions src/MemoryPack.Core/MemoryPackSerializeOptions.cs
@@ -2,8 +2,11 @@

public record MemoryPackSerializeOptions
{
public static MemoryPackSerializeOptions Default = new MemoryPackSerializeOptions { StringEncoding = StringEncoding.Utf16 };
public static MemoryPackSerializeOptions Utf8 = Default with { StringEncoding = StringEncoding.Utf8 };
// Default is Utf8
public static readonly MemoryPackSerializeOptions Default = new MemoryPackSerializeOptions { StringEncoding = StringEncoding.Utf8 };

public static readonly MemoryPackSerializeOptions Utf8 = Default with { StringEncoding = StringEncoding.Utf8 };
public static readonly MemoryPackSerializeOptions Utf16 = Default with { StringEncoding = StringEncoding.Utf16 };

public StringEncoding StringEncoding { get; init; }
}
8 changes: 4 additions & 4 deletions src/MemoryPack.Core/MemoryPackWriter.cs
@@ -208,7 +208,7 @@ void WriteUtf16(string value)
Advance(copyByteCount + 4);
}

[MethodImpl(MethodImplOptions.NoInlining)] // non default, no inline
[MethodImpl(MethodImplOptions.AggressiveInlining)]
void WriteUtf8(string value)
{
// [utf8-length, utf16-length, utf8-value]
@@ -220,14 +220,14 @@ void WriteUtf8(string value)

ref var destPointer = ref GetSpanReference(maxByteCount + 8); // header

// write utf8-length is final
// write utf16-length
Unsafe.WriteUnaligned(ref Unsafe.Add(ref destPointer, 4), source.Length);

var dest = MemoryMarshal.CreateSpan(ref Unsafe.Add(ref destPointer, 8), maxByteCount);
var status = Utf8.FromUtf16(source, dest, out var _, out var bytesWritten);
var status = Utf8.FromUtf16(source, dest, out var _, out var bytesWritten, replaceInvalidSequences: false);
if (status != OperationStatus.Done)
{
// TODO: throw when write failed.
MemoryPackSerializationException.ThrowFailedEncoding(status);
}

// write written utf8-length in header, that is ~length