WebAssembly · ppenzin · Dec 20, 2024
diff --git a/simd/2024/SIMD-04-12.md b/simd/2024/SIMD-04-12.md
@@ -36,13 +36,13 @@ Logistics: is this still a good time slot for everybody? Maybe we should file an
 
 ### Attendees
 
-Anton Kirilov
-Deepti Gandluri
-Ilya Rezvov
-Petr Penzin
-Shravan Narayan
-Thomas Lively
-Yury Delendik
+- Anton Kirilov
+- Deepti Gandluri
+- Ilya Rezvov
+- Petr Penzin
+- Shravan Narayan
+- Thomas Lively
+- Yury Delendik
 
 ### Update and discussion on fp16 support
 

diff --git a/simd/2024/SIMD-06-21.md b/simd/2024/SIMD-06-21.md
@@ -42,6 +42,7 @@ This meeting will be a Google Meet video conference.
 - Yury Delendik
 
 ### Update and discussion on FP16
+
 AB: I am curious about instruction lowering and when hardware support for it would be available across the board
 
 IR: (Shares link to [Lowering](https://github.com/WebAssembly/half-precision/blob/main/proposals/half-precision/Lowering.md))

diff --git a/simd/2024/SIMD-07-19.md b/simd/2024/SIMD-07-19.md
@@ -32,5 +32,55 @@ This meeting will be a Google Meet video conference.
 
 ## Meeting notes
 
-TBD
+### Attendees
+
+- Anton Kirilov
+- Petr Penzin
+- Yury Delendik
+
+### Horizontal operations
+
+https://github.com/WebAssembly/flexible-vectors/issues/65
+
+AK: The goal is to pattern match some vector additions and some shuffles.
+
+PP: was hoping to talk about this before LLVM patch gets merged … What is the pattern?
+
+AK: AFAIK the patch produces horizontal operations as a series of addi
+
+PP: yes, there seems to be fp32
+
+AK: this should help other runtimes
+
+PP: we should document that, given this is a bit more hardware oriented patch
+
+AK: there was a patch for integer splats vs integer … Were horizontal ops discussed in SIMD proposal before?
+
+PP: https://github.com/WebAssembly/simd/issues/20 There are some concerns for performance
+
+AK: Horizontal ops are also slower on Arm, but still useful
+
+AB: Looking at the LLVM PR. Seems like patch adds pairwise rather than split in half, why was that done?
+
+AK: The original output had shuffle masks that would change with subtle changes to input
+
+PP: Since this is pairwise would this interact with x86 pattern matching?
+
+YD: We are working on shuffle lowering for horizontal min. Shuffles generated for autovectorized ops are not performance efficient and are non-deterministic. Sometimes part of the shuffle mask represents lanes that are discarded. We are having a problem when we can’t make the right selection for either Arm or x86. I am wondering whether we should even have a shuffle with non-deterministic output.
+
+AK: Horizontal min reduction
+
+PP: discarded lane indices (FK’s patch)
+
+AK: Neon only has byte shuffles, SVE has other shuffles
+
+YD: https://bugzilla.mozilla.org/show_bug.cgi?id=1887312
+
+PP: what is the incompatibility between x86 and Arm here
+
+YD: For example rotate produces a shuffle that matches neither of the two styles… Autovectorizer
+
+AK: in Sam’s notes one of the places this is coming from is SPEC, autovectorizer
+
+PP: there is cost model problem
 
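The pairwise versus split-in-half question raised above concerns the order in which lanes are combined during a horizontal reduction. The following is a minimal scalar sketch in C, assuming a 4-lane `f32` vector modelled as a plain array; the function names are hypothetical and this is not the code the LLVM patch emits.

```c
#include <assert.h>
#include <stddef.h>

/* Pairwise: neighbouring lanes are added first, (v0+v1) and (v2+v3),
 * then the two partial sums are combined. */
static float hsum_pairwise(const float *v, size_t lanes) {
    assert(lanes == 4); /* sketch only covers a 4-lane vector */
    float p0 = v[0] + v[1];
    float p1 = v[2] + v[3];
    return p0 + p1;
}

/* Split in half: the upper half is added onto the lower half,
 * (v0+v2) and (v1+v3), then the two partial sums are combined. */
static float hsum_halves(const float *v, size_t lanes) {
    assert(lanes == 4); /* sketch only covers a 4-lane vector */
    float h0 = v[0] + v[2];
    float h1 = v[1] + v[3];
    return h0 + h1;
}
```

Both shapes produce the same result for integers. For floating point they can differ in rounding, because addition is not associative, so the choice of shape is visible to the program as well as to the pattern matcher.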
diff --git a/simd/2024/SIMD-08-02 b/simd/2024/SIMD-08-02
diff --git a/simd/2024/SIMD-08-02.md b/simd/2024/SIMD-08-02.md
@@ -0,0 +1,100 @@
+![WebAssembly logo](/images/WebAssembly.png)
+
+## Agenda for the August 2 video call of WebAssembly's SIMD Subgroup
+
+- **Dates**: 2024-08-02
+- **Times**:
+    - 3pm-4pm UTC (8am-9am PDT)
+- **Location**: *link on calendar invite*
+- **Contact**:
+    - Name: Petr Penzin
+    - Email: [email protected]
+
+
+### Registration
+
+Fill out [sign-up form](https://forms.gle/bscWhsD9U4hZEsUV9) to attend.
+
+### Logistics
+
+This meeting will be a Google Meet video conference.
+
+## Agenda items
+
+1. Opening, welcome and roll call
+    1. Opening of the meeting
+    1. Introduction of attendees
+1. Find volunteers for note taking
+1. Adoption of the agenda
+1. Proposals and discussions
+    1. Continue discussion of shuffle masks (https://github.com/WebAssembly/flexible-vectors/issues/66)
+1. Closure
+
+## Meeting notes
+
+### Attendees
+
+- Andrew Brown
+- Anton Kirilov
+- Petr Penzin
+- Yury Delendik
+
+### Continue discussion of shuffle masks
+
+https://github.com/WebAssembly/flexible-vectors/issues/66
+
+PP: do you have sources?
+
+YD: no, but I can reproduce it via simple C++ code, something like:
+
+```
+unsigned char arr[10000];
+unsigned char m = 255; for (int i = 0; i < 10000; i++) if (arr[i] < m) m = arr[i];
+```
+
+PP: We should suggest to LLVM how to generate better patterns for these operations
+
+YD: You have to load the mask into a register, perform the shuffle and then post-process the results
+
+PP: Having a few of them in a row would really explode register usage
+
+AK: There is still value to add horizontal reduction to Wasm irrespective of whether we do something about loop tails in LLVM. Not sure about the other three examples.
+
+YD: https://github.com/Microsoft/onnxruntime
+
+PP: Pattern 3 looks like 8->32 bit extend, but the 4 elements it extending it would produce
+
+AK: If we have this flexibility we could broadcast the element. Non-determinism w.r.t. masks
+
+YD: Maybe we can ask for a relaxed variant
+
+PP: In the flexible vectors proposal we can express just the shrinking if we implement per-value length, but that would likely just move the uncertainty somewhere else
+
+YD: the only way to experiment is to measure ONNX, we should figure out how to run that as a benchmark
+
+YD: Bugzilla bug https://bugzilla.mozilla.org/show_bug.cgi?id=1887312
+
+PP: Maybe this is the source: https://github.com/xenova/transformers.js
+
+YD: Don’t yet know why V8 is faster
+
+AK: Maybe it is matching specific permutation patterns
+
+PP: to summarize maybe disable LLVM, adding operation, and indicate indices to be ignored in the result. Numbers would be good if we are going to present new instruction proposals.
+
+AK: Similar question for flexible vectors, if you get a value and then increase its length
+
+YD: maybe set the lanes to zero explicitly by a SIMD `and` after the shuffle and pattern match on that
+
+PP: would that be too much dataflow analysis for JS engine?
+
+YD: don’t know yet, but we are considering adding an additional pass for dataflow
+
+PP: maybe room for dataflow in other direction by taking lanes out of explicitly zero value if only one value really used
+
+YD: maybe also interleave
+
+PP: LLVM can produce some cheaper pattern, like splat of 4-byte lane, but that goes back to LLVM not having a cost model
+
+https://github.com/llvm/llvm-project/issues/101725 filed
+
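The horizontal-min reduction discussed above can be sketched as a sequence of shuffle-then-min steps. This is a scalar emulation in C, assuming a 16-lane byte vector; `shuffle16` and `hmin_u8x16` are hypothetical names, and the mask values chosen for the unused lanes are arbitrary. It illustrates why part of each shuffle mask refers to lanes whose results are discarded: after every step only the lower `step` lanes still matter.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16

/* Emulates a byte shuffle of a single vector: out[i] = in[mask[i]]. */
static void shuffle16(const uint8_t in[LANES], const uint8_t mask[LANES],
                      uint8_t out[LANES]) {
    for (int i = 0; i < LANES; i++) {
        assert(mask[i] < LANES); /* indices must select an existing lane */
        out[i] = in[mask[i]];
    }
}

/* Horizontal unsigned minimum of 16 bytes in log2(16) = 4 steps. */
static uint8_t hmin_u8x16(const uint8_t input[LANES]) {
    uint8_t v[LANES], mask[LANES], s[LANES];
    for (int i = 0; i < LANES; i++) v[i] = input[i];

    for (int step = LANES / 2; step >= 1; step /= 2) {
        for (int i = 0; i < LANES; i++) {
            /* Lanes [0, step) pull in their partner from the upper half.
             * The remaining lanes are "don't care"; the identity index
             * used here is an arbitrary choice, which is exactly the
             * freedom that makes generated masks vary. */
            mask[i] = (uint8_t)(i < step ? i + step : i);
        }
        shuffle16(v, mask, s);
        for (int i = 0; i < LANES; i++) v[i] = v[i] < s[i] ? v[i] : s[i];
    }
    return v[0]; /* lane 0 now holds the minimum of all 16 inputs */
}
```

An engine that could see which lanes are ignored would be free to pick whichever mask is cheapest on its target, which is the motivation for indicating ignored indices or for zeroing them explicitly after the shuffle.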
diff --git a/simd/2024/SIMD-09-27.md b/simd/2024/SIMD-09-27.md
@@ -0,0 +1,101 @@
+![WebAssembly logo](/images/WebAssembly.png)
+
+## Agenda for the September 27 video call of WebAssembly's SIMD Subgroup
+
+- **Dates**: 2024-09-27
+- **Times**:
+    - 3pm-4pm UTC (8am-9am PDT)
+- **Location**: *link on calendar invite*
+- **Contact**:
+    - Name: Petr Penzin
+    - Email: [email protected]
+
+
+### Registration
+
+Fill out [sign-up form](https://forms.gle/bscWhsD9U4hZEsUV9) to attend.
+
+### Logistics
+
+This meeting will be a Google Meet video conference.
+
+## Agenda items
+
+1. Opening, welcome and roll call
+    1. Opening of the meeting
+    1. Introduction of attendees
+1. Find volunteers for note taking
+1. Adoption of the agenda
+1. Proposals and discussions
+    1. Continue discussion of shuffle masks (https://github.com/WebAssembly/flexible-vectors/issues/66)
+1. Closure
+
+## Meeting notes
+
+### Attendees
+
+- Andrew Brown
+- Anton Kirilov
+- Brendan Dahl
+- Petr Penzin
+- Sergey Rubanov
+- Yury Delendik
+
+### Hardware Specialized WebAssembly
+
+AB giving an overview of https://github.com/WebAssembly/design/issues/1528, interested in feedback.
+
+AK is wondering how software emulation would work for the builtins that are not supported on the platform.
+
+AB: there is a detection mechanism for whether a builtin is supported natively by the platform.
+
+AK: If a developer provides an alternative implementation, it would likely be different from the implementation that uses the accelerated version. Would that be an issue?
+
+AB: As an example, XNNPACK would provide different kernel implementations based on hardware support
+
+AK: Would this be similar to high-level API?
+
+AB: The CG discussion is leaning towards limiting the size of the builtins. I personally think it may be OK to try that, or at least explore it
+
+AK: Another question - maintaining builtins database, if we worry about the speed of CG proposals, this might be similar
+
+AB: That might be a concern, though for the sake of trying this, we might want to lift process restrictions and let engines add builtins
+
+AK: Centralized process would reduce the risk of multiple builtins with slightly different semantics
+
+AB: The proposal includes a way to ensure that fallback code is doing what is expected
+
+YD: We have experience with a builtin implementation already where we load imported functions and then substitute with native implementation with a fallback path, that is pretty much identical to the proposal. The downside is that it is hard for developers to rely on fast builtin functions. This is not exposed to the web, only on extension level, look up mozIntGemm, also ticket: https://bugzilla.mozilla.org/show_bug.cgi?id=1720747
+
+AK: From our experience with optimized implementations, addition of new instructions is a bit easier for developers to target.
+
+AK: For the libraries required as native implementation there is going to be an issue with integration, as engines are implemented differently
+
+PP: This is going to be an issue for long builtins and less for the ones producing a single instruction
+
+YD: We need to have the builtins integrated into our compilation pipelines to integrate with register allocation, etc
+
+AK: Couple thoughts. The registry should be machine-readable, which I think you have already. This can be even used to add assembly sequences to that, but maybe at a later point. Assembly templates that proposal discusses might help.
+
+PP: Worth (again) mentioning that CG process plays a role, and maybe we need to improve that to some extent. On the other hand, opcode space is so full that it is worth relieving some pressure there.
+
+AK: What is the deprecation process? I think that is one of the motivations for the CG process.
+
+YD, AB: fallback _is_ the deprecation process
+
+YD: Who would be the authority to add/remove the builtins?
+
+PP: There is a case to be made of having more than one authority/registry
+
+AB: Builtins subgroup, which would figure that out eventually?
+
+YD: Can we borrow from JS builtins, maybe
+
+AB - asks BD and SR for any thoughts
+
+BD is curious about the tool integration story, AK suggests function multiversioning. 
+
+YD: JS strings is going to be similar to this proposal
+
+AK: Bulk memory ops could’ve been implemented via this style of proposal
+
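The detect-then-fall-back shape discussed above can be sketched in plain C. This is only an illustration under stated assumptions: every name here (`dot_fn`, `dot_i8_fallback`, `dot_i8_accelerated`, `select_dot_i8`) is hypothetical, the "accelerated" routine is an ordinary unrolled loop standing in for an engine-provided builtin, and the detection result is passed in as a flag rather than obtained through the mechanism the proposal describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*dot_fn)(const int8_t *a, const int8_t *b, size_t n);

/* Portable fallback shipped with the module. */
static int32_t dot_i8_fallback(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Stand-in for a builtin the engine would replace with native code.
 * Here it is the same arithmetic, unrolled by four. */
static int32_t dot_i8_accelerated(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += (int32_t)a[i] * b[i] + (int32_t)a[i + 1] * b[i + 1] +
               (int32_t)a[i + 2] * b[i + 2] + (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Chooses an implementation once, based on a (hypothetical) detection result. */
static dot_fn select_dot_i8(bool builtin_available) {
    dot_fn chosen = builtin_available ? dot_i8_accelerated : dot_i8_fallback;
    assert(chosen != NULL);
    return chosen;
}
```

Comparing the two paths on the same inputs is one simple way to check that a fallback does what is expected, which is also why builtins with slightly different semantics across engines would be a problem.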
diff --git a/simd/2024/SIMD-11-22.md b/simd/2024/SIMD-11-22.md
@@ -0,0 +1,67 @@
+![WebAssembly logo](/images/WebAssembly.png)
+
+## Agenda for the November 22 video call of WebAssembly's SIMD Subgroup
+
+- **Dates**: 2024-11-22
+- **Times**:
+    - 3pm-4pm UTC (8am-9am PDT)
+- **Location**: *link on calendar invite*
+- **Contact**:
+    - Name: Petr Penzin
+    - Email: [email protected]
+
+
+### Registration
+
+Fill out [sign-up form](https://forms.gle/bscWhsD9U4hZEsUV9) to attend.
+
+### Logistics
+
+This meeting will be a Google Meet video conference.
+
+## Agenda items
+
+1. Opening, welcome and roll call
+    1. Opening of the meeting
+    1. Introduction of attendees
+1. Find volunteers for note taking
+1. Adoption of the agenda
+1. Proposals and discussions
+    1. Relaxed SIMD trunc NaN semantics
+1. Closure
+
+## Meeting notes
+
+### Attendees
+
+- Yury Delendik
+- Evan Nemerson
+- Brendan Dahl
+- Petr Penzin
+
+https://github.com/WebAssembly/relaxed-simd/pull/144 and https://github.com/WebAssembly/relaxed-simd/pull/140
+
+YD: this was passing on some particular test because liftoff was producing just the right output, neither turbofan, nor liftoff pass this test
+
+YD: 140 is adding alternative values produced by other engines, AR in 144 said that was rather arbitrary
+
+EN: can we just accept any value when a NaN value is passed
+
+PP: converting NaN to int is not a valid operation, should that be an error?
+
+YD: if we tighten the semantics (i.e. raise error) we get non-relaxed version
+
+EN: what is the difference between relaxed and strict?
+
+YD: number of operations
+
+PP: if you sanitize the inputs you end up with more instructions
+
+EN: Can we consider this behavior undefined?
+
+PP: spec really tries to avoid undefined, I do support the idea that we might not need this operation
+
+YD: maybe we should check what the use cases are in the wild, e.g. xnnpack, onnx runtime. Will open an issue, want to know what Andreas has to say about it
+
+BD: xnnpack doesn’t seem to use this particular intrinsic (and it doesn't use autovectorization)
+