Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Introduce LeftJoin LINQ operator #110292

Open
roji opened this issue Dec 1, 2024 · 38 comments · May be fixed by #110872
Open

[API Proposal]: Introduce LeftJoin LINQ operator #110292

roji opened this issue Dec 1, 2024 · 38 comments · May be fixed by #110872
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq in-pr There is an active PR which will close this issue when it is merged
Milestone

Comments

@roji
Copy link
Member

roji commented Dec 1, 2024

Background and motivation

Background

LINQ has a Join operator, which, like its SQL INNER JOIN counterpart, correlates elements of two sequences based on matching keys; the LINQ join implementation internally creates a Lookup for the inner sequence, and then loops over the outer sequence, doing a lookup for the matching inner elements. In SQL database parlance, this is known as the hash join strategy (SQL Server docs, PostgreSQL docs as well as this useful post).

In addition to the above, SQL also has LEFT JOIN, which returns outer elements even if there's no corresponding inner ones; LINQ, in contrast, lacks this operator. The LINQ conceptual documentation shows how to combine existing operators to achieve a left join:

var query = students
    .GroupJoin(departments, student => student.DepartmentID, department => department.ID, (student, departmentList) => new { student, subgroup = departmentList })
    .SelectMany(
        joinedSet => joinedSet.subgroup.DefaultIfEmpty(),
        (student, department) => new
        {
            student.student.FirstName,
            student.student.LastName,
            Department = department.Name
        });

There are two issues with the above suggestion:

  • It's complicated, requiring combining 3 different LINQ operators in a specific way, and is easy to accidentally get wrong. Many EF users have complained about the complexity of this construct for expressing a simple SQL LEFT JOIN.
  • It is inefficient - the combination of operators adds significant overhead compared to a single operator using a hash join (as Join does).

Proposal

This proposes introducing a 1st-class LeftJoin operator, which operates very similar to Join, except that it returns outer elements for which no inner element could be correlated. Aside from being much simpler to use than GroupJoin/SelectMany, it would also simply use Lookup internally - just like Join - and would therefore be much faster.

An initial implementation shows significant performance improvement compared to GroupJoin/SelectMany; LeftJoin is always faster than the equivalent GroupJoin/SelectMany construct, since GroupJoin itself constructs and uses a Lookup internally to implement an inner join internally - just like LeftJoin does - but also adds additional work on top.

BenchmarkDotNet v0.14.0, macOS Sequoia 15.1.1 (24B91) [Darwin 24.1.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 9.0.0 (9.0.24.52809), Arm64 RyuJIT AdvSIMD
Method InnersPerOuter OuterCount Mean Error StdDev Ratio RatioSD Gen0 Gen1 Gen2 Allocated Alloc Ratio
LeftJoin 1 1 97.73 ns 0.402 ns 0.376 ns 0.64 0.00 0.0573 - - 480 B 0.71
GroupJoin_SelectMany 1 1 153.02 ns 0.370 ns 0.309 ns 1.00 0.00 0.0811 - - 680 B 1.00
LeftJoin 1 10 488.66 ns 5.764 ns 5.391 ns 0.57 0.01 0.2031 - - 1704 B 0.57
GroupJoin_SelectMany 1 10 856.40 ns 4.305 ns 4.027 ns 1.00 0.01 0.3567 - - 2984 B 1.00
LeftJoin 1 100 4,814.61 ns 19.884 ns 17.627 ns 0.59 0.00 1.7090 0.0534 - 14344 B 0.54
GroupJoin_SelectMany 1 100 8,173.26 ns 37.560 ns 35.134 ns 1.00 0.01 3.1586 0.1068 - 26424 B 1.00
LeftJoin 1 1000 47,619.93 ns 227.566 ns 201.731 ns 0.59 0.00 16.2964 3.4790 - 136728 B 0.53
GroupJoin_SelectMany 1 1000 80,828.42 ns 323.010 ns 302.144 ns 1.00 0.01 30.6396 6.5918 - 256808 B 1.00
LeftJoin 10 1 300.03 ns 3.089 ns 2.580 ns 0.70 0.01 0.1316 - - 1104 B 0.85
GroupJoin_SelectMany 10 1 430.06 ns 3.919 ns 3.273 ns 1.00 0.01 0.1554 - - 1304 B 1.00
LeftJoin 10 10 2,954.84 ns 28.864 ns 25.587 ns 0.87 0.01 0.9460 0.0076 - 7944 B 0.86
GroupJoin_SelectMany 10 10 3,397.03 ns 9.497 ns 8.419 ns 1.00 0.00 1.1024 0.0114 - 9224 B 1.00
LeftJoin 10 100 31,873.63 ns 120.088 ns 106.455 ns 0.81 0.02 9.1553 0.7324 - 76744 B 0.86
GroupJoin_SelectMany 10 100 39,299.04 ns 743.548 ns 856.271 ns 1.00 0.03 10.5591 0.8545 - 88824 B 1.00
LeftJoin 10 1000 402,837.52 ns 5,140.127 ns 4,808.078 ns 0.88 0.01 90.8203 30.2734 - 760728 B 0.86
GroupJoin_SelectMany 10 1000 457,658.94 ns 5,117.136 ns 4,786.572 ns 1.00 0.01 104.9805 34.6680 - 880808 B 1.00
LeftJoin 100 1 1,906.55 ns 9.026 ns 8.443 ns 0.83 0.01 0.6981 0.0019 - 5848 B 0.97
GroupJoin_SelectMany 100 1 2,310.49 ns 29.862 ns 27.933 ns 1.00 0.02 0.7210 0.0038 - 6048 B 1.00
LeftJoin 100 10 18,861.98 ns 366.707 ns 560.001 ns 0.85 0.03 6.5918 0.2747 - 55384 B 0.98
GroupJoin_SelectMany 100 10 22,285.02 ns 189.605 ns 177.356 ns 1.00 0.01 6.7444 0.2747 - 56664 B 1.00
LeftJoin 100 100 197,041.60 ns 2,033.092 ns 1,802.283 ns 0.83 0.01 65.6738 15.8691 - 551144 B 0.98
GroupJoin_SelectMany 100 100 236,428.77 ns 1,920.429 ns 1,702.410 ns 1.00 0.01 67.1387 16.6016 - 563224 B 1.00
LeftJoin 100 1000 2,628,960.79 ns 22,819.664 ns 21,345.528 ns 0.85 0.03 656.2500 316.4063 - 5504731 B 0.98
GroupJoin_SelectMany 100 1000 3,113,006.15 ns 62,259.419 ns 95,076.717 ns 1.00 0.04 671.8750 328.1250 - 5624811 B 1.00
LeftJoin 1000 1 17,487.96 ns 203.775 ns 180.641 ns 0.84 0.01 5.8594 0.1221 - 49056 B 1.00
GroupJoin_SelectMany 1000 1 20,818.61 ns 251.607 ns 210.103 ns 1.00 0.01 5.8594 0.1831 - 49256 B 1.00
LeftJoin 1000 10 175,500.86 ns 2,377.458 ns 1,856.162 ns 0.77 0.02 58.1055 11.9629 - 487464 B 1.00
GroupJoin_SelectMany 1000 10 228,064.76 ns 4,414.845 ns 4,907.089 ns 1.00 0.03 58.3496 12.6953 - 488744 B 1.00
LeftJoin 1000 100 2,092,691.16 ns 39,429.196 ns 77,829.392 ns 0.86 0.04 582.0313 250.0000 - 4871947 B 1.00
GroupJoin_SelectMany 1000 100 2,422,638.76 ns 47,560.763 ns 84,539.215 ns 1.00 0.05 582.0313 246.0938 - 4884027 B 1.00
LeftJoin 1000 1000 42,875,766.40 ns 738,934.130 ns 691,199.444 ns 0.89 0.02 5900.0000 1700.0000 100.0000 48712842 B 1.00
GroupJoin_SelectMany 1000 1000 47,966,803.96 ns 925,858.818 ns 1,066,220.392 ns 1.00 0.03 5888.8889 1666.6667 111.1111 48832891 B 1.00
Benchmark code
[MemoryDiagnoser]
public class LeftJoinBenchmarks
{
    [Params(1, 10, 100, 1000, Priority = 1)]
    // [Params(1, Priority = 1)]
    public int InnersPerOuter { get; set; }

    [Params(1, 10, 100, 1000, Priority = 2)]
    // [Params(1, Priority = 2)]
    public int OuterCount { get; set; }

    private Outer[] _outers = null!;
    private Inner[] _inners = null!;

    class Outer
    {
        public int Id { get; set; }
        public string? OuterPayload { get; set; }
    }

    class Inner
    {
        public int Id { get; set; }
        public int OuterId { get; set; }
        public string? InnerPayload { get; set; }
    }

    private const int RandomSeed = 42;

    [GlobalSetup]
    public void Setup()
    {
        _outers = new Outer[OuterCount];
        _inners = new Inner[OuterCount * InnersPerOuter];

        var remainingInners = new List<Inner>(OuterCount * InnersPerOuter);
        for (var outerId = 0; outerId < OuterCount; outerId++)
        {
            _outers[outerId] = new Outer { Id = outerId, OuterPayload = $"Outer{outerId}" };

            for (var j = 0; j < InnersPerOuter; j++)
            {
                var innerId = outerId * j + j;
                remainingInners.Add(new Inner { Id = innerId, OuterId = outerId, InnerPayload = $"Inner{innerId}" });
            }
        }

        var random = new Random(RandomSeed);

        for (var i = 0; i < _inners.Length; i++)
        {
            var j = random.Next(0, remainingInners.Count);
            _inners[i] = remainingInners[j];
            remainingInners.RemoveAt(j);
        }

        Debug.Assert(remainingInners.Count == 0);
    }

    [Benchmark]
    public int LeftJoin()
        => _outers.LeftJoin(_inners, o => o.Id, i => i.OuterId, (o, i) => new { o.OuterPayload, i?.InnerPayload }).Count();

    [Benchmark(Baseline = true)]
    public int GroupJoin_SelectMany()
        => _outers
            .GroupJoin(_inners, o => o.Id, i => i.OuterId, (o, inners) => new { Outer = o, Inners = inners })
            .SelectMany(
                joinedSet => joinedSet.Inners.DefaultIfEmpty(),
                (o, i) => new
                {
                    o.Outer.OuterPayload,
                    i?.InnerPayload
                })
            .Count();
}
LeftJoin operator prototype implementation
private static IEnumerable<TResult> LeftJoinIterator<TOuter, TInner, TKey, TResult>(IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner?, TResult> resultSelector, IEqualityComparer<TKey>? comparer)
{
    using (IEnumerator<TOuter> e = outer.GetEnumerator())
    {
        if (e.MoveNext())
        {
            Lookup<TKey, TInner> innerLookup =
                Lookup<TKey, TInner>.CreateForJoin(inner, innerKeySelector, comparer);
            do
            {
                TOuter item = e.Current;
                Grouping<TKey, TInner>? g = innerLookup.GetGrouping(outerKeySelector(item), create: false);
                if (g is null)
                {
                    yield return resultSelector(item, default);
                }
                else
                {
                    int count = g._count;
                    TInner[] elements = g._elements;
                    for (int i = 0; i != count; ++i)
                    {
                        yield return resultSelector(item, elements[i]);
                    }
                }
            } while (e.MoveNext());
        }
    }
}

Note: The current LINQ documentation for GroupJoin/SelectMany shows using AsQueryable, for no apparent reason. The addition of AsQueryable here adds very significant perf overhead - see dotnet/docs#43807 for benchmarks and a proposal to remove AsQueryable from that code sample.

Additional Notes

  • This proposal also helps LINQ provider such as EF - the ability to directly express a SQL LEFT JOIN comes up regularly.
  • For inner value types, the operator returns default when an outer has no inners; this makes it impossible to distinguish between an inner not being found, and the inner being found but being null. This is similar to e.g. FirstOrDefault; although it's quite contrived for LeftJoin, see below for some notes and alternative API designs.
  • Mostly for symmetry's sake, we could also introduce RightJoin(), which is the reverse of LeftJoin() (i.e. elements from the inner sequence are returned if no correlated outer is found). Right joins are seldom used in SQL, and it's always possible to flip the sequences around to express the join as a left join instead.
  • In theory, we could also introduce an operator to perform a FULL OUTER JOIN, which is a combination of LEFT and RIGHT JOIN (elements from both outer and inner are returned even if there's no correlated element on the other side). This is, however, a rare operation and not generally very useful, and is more complicated to implement efficiently. We can do this later if the need arises, but should keep naming in mind.
  • Naming-wise:
    • Following the full operation name, we could call the operator LeftOuterJoin instead of LeftJoin; though the OUTER keyword is optional in all major SQL databases, and routinely dropped. LeftJoin is a lighter, more accessible name.
    • It may seem a bit odd to introduce the very SQL-oriented LeftJoin, RightJoin. However, LINQ operators already follow SQL naming (e.g. Where/Select rather than Filter/Map).
    • We could have JoinLeft, JoinRight to have all join operators grouped together in Intellisense (though the naming is less intuitive).
  • Corresponding syntax could be added to LINQ expression syntax, on the C# side (just like join) - but this is optional. Proposal: Proposal: introduce left and right join clauses to C# query expression syntax csharplang#8892.
  • We should consider an analyzer+code fix to transform the current recommended GroupJoin/SelectMany into the new LeftJoin - including with the unneeded AsQueryable shown in the docs - as many people have been copy-pasting that and have very slow code (see this).

/cc @jeffhandley @dotnet/area-system-linq @dotnet/efteam

API Proposal

namespace System.Linq;

public static class Enumerable
{
    // Alternative naming: LeftOuterJoin
    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector);

    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);

// ... plus optionally RightJoin version of the above
}

public static class Queryable
{
    public static System.Linq.IQueryable<TResult> LeftJoin<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter,TInner?,TResult>> resultSelector);
	
    public static System.Linq.IQueryable<TResult> Join<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter,TInner?,TResult>> resultSelector,
        System.Collections.Generic.IEqualityComparer<TKey>? comparer);
}

// If we decide to also introduce RightJoin (or RightOuterJoin):

public static class Enumerable
{
    // Alternative naming: RightOuterJoin
    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector);

    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);

// ... plus optionally RightJoin version of the above
}

public static class Queryable
{
    public static System.Linq.IQueryable<TResult> RightJoin<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter?,TInner,TResult>> resultSelector);
	
    public static System.Linq.IQueryable<TResult> RightJoin<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter?,TInner,TResult>> resultSelector,
        System.Collections.Generic.IEqualityComparer<TKey>? comparer);
}

API Usage

var query = outers.LeftJoin(inners, o => o.Id, i => i.OuterId, (o, i) => new { o.OuterPayload, i?.InnerPayload });

Value types and alternative LeftJoin API shapes

As pointed out above, the fact that the result selector accepts a defaultable inner means that it's impossible to distinguish between no inner being found for an outer, and the situation where the inner itself happens to be the default (thanks for discussion on this, @eiriktsarpalis). This problem isn't specific to this proposal, other operators (e.g. FirstOrDefault) have the same problem.

An alternative API to address this would pass a boolean to the result selector, representing whether an inner was found or not:

public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
	this IEnumerable<TOuter> outer,  
	IEnumerable<TInner> inner,
	Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
	Func<TOuter, bool, TInner, TResult> resultSelector);

Alternatively, with the upcoming introduction of discriminated unions to .NET, an Optional type could allow the same thing:

public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
	this IEnumerable<TOuter> outer,  
	IEnumerable<TInner> inner,
	Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
	Func<TOuter, Optional<TInner>, TResult> resultSelector);

However, it seems that cases where the distinction between "not found" and "found but default" matters are especially rare for joining; the inner key selector would have to accept a null/default inner and "extract" a key out of that (matching the outer key); not impossible, but definitely feels contrived. It's also possible to work around the ambiguity (in some cases) by switching to a nullable value type.

In other words, we should IMHO avoid making the API more complex/heavy for everyone because of a 1% case.

Previous related issues

@roji roji added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Dec 1, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Dec 1, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 1, 2024
@teo-tsirpanis teo-tsirpanis added area-System.Linq and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Dec 1, 2024
@roji roji removed api-suggestion Early API idea and discussion, it is NOT ready for implementation untriaged New issue has not been triaged by the area owner labels Dec 1, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

@roji roji added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Dec 1, 2024
@jeffhandley
Copy link
Member

Your Additional Notes covered all the questions and naming thoughts I had, @roji. Thanks for including all of that.

@Clockwork-Muse
Copy link
Contributor

@stephentoub - Is this something we should also consider for PLINQ?

(For others - because of the internals of PLINQ, it's not possible for third parties to implement their own operators, as it is with normal LINQ)

@stephentoub
Copy link
Member

stephentoub commented Dec 2, 2024

@stephentoub - Is this something we should also consider for PLINQ?

From an API perspective, #98689 covers adding to PLINQ any APIs added to LINQ that PLINQ doesn't already have. I don't think this particular API is any more special than the others already omitted, so I'd just want to lump this one in with those.

From an implementation perspective, I suspect the PLINQ implementation would just be the code @roji wrote in his opening comments, with that operator implemented by delegating to GroupJoin/SelectMany. Doing a full-fledged open-coded implementation of PLINQ's would be a lot of difficult to validate code.

because of the internals of PLINQ, it's not possible for third parties to implement their own operators

When doing an open-coded implementation, sure. But anyone can layer an implementation on top of the existing operators, just as with LINQ, e.g.

public static ParallelQuery<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
    this ParallelQuery<TOuter> outer, ParallelQuery<TInner> inner,
    Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner?, TResult> resultSelector) =>
    outer.GroupJoin(inner, outerKeySelector, innerKeySelector, (outerItem, innerItem) => (outerItem, innerItem))
         .SelectMany(joinedSet => joinedSet.innerItem.DefaultIfEmpty(), (joinedSet, innerItem) => resultSelector(joinedSet.outerItem, innerItem));

@roji
Copy link
Member Author

roji commented Dec 2, 2024

From an implementation perspective, I suspect the PLINQ implementation would just be the code @roji wrote in his opening comments, with that operator implemented by delegating to GroupJoin/SelectMany. Doing a full-fledged open-coded implementation of PLINQ's would be a lot of difficult to validate code.

@stephentoub I know very little about PLINQ, but assuming an implementation of Join already exists, wouldn't an implementation of LeftJoin be very similar (just as the non-PLINQ implementation of LeftJoin is very similar to the Join implementation)?

@stephentoub
Copy link
Member

stephentoub commented Dec 2, 2024

wouldn't an implementation of LeftJoin be very similar (just as the non-PLINQ implementation of LeftJoin is very similar to the Join implementation)?

Likely. But to put it in context, this is the LINQ implementation of Join:

private static IEnumerable<TResult> JoinIterator<TOuter, TInner, TKey, TResult>(IEnumerable<TOuter> outer, IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector, Func<TOuter, TInner, TResult> resultSelector, IEqualityComparer<TKey>? comparer)
{
using (IEnumerator<TOuter> e = outer.GetEnumerator())
{
if (e.MoveNext())
{
Lookup<TKey, TInner> lookup = Lookup<TKey, TInner>.CreateForJoin(inner, innerKeySelector, comparer);
if (lookup.Count != 0)
{
do
{
TOuter item = e.Current;
Grouping<TKey, TInner>? g = lookup.GetGrouping(outerKeySelector(item), create: false);
if (g is not null)
{
int count = g._count;
TInner[] elements = g._elements;
for (int i = 0; i != count; ++i)
{
yield return resultSelector(item, elements[i]);
}
}
}
while (e.MoveNext());
}
}
}

and this monster is the PLINQ implementation:
/// <summary>
/// A join operator takes a left query tree and a right query tree, and then yields the
/// matching pairs between the two. LINQ supports equi-key-based joins. Hence, a key-
/// selection function for the left and right data types will yield keys of the same
/// type for both. We then merely have to match elements from the left with elements from
/// the right that have the same exact key. Note that this is an inner join. In other
/// words, outer elements with no matching inner elements do not appear in the output.
///
/// Hash-joins work in two phases:
///
/// (1) Building - we build a hash-table from one of the data sources. In the case
/// of this specific operator, the table is built from the hash-codes of
/// keys selected via the key selector function. Because elements may share
/// the same key, the table must support one-key-to-many-values.
/// (2) Probing - for each element in the data source not used for building, we
/// use its key to look into the hash-table. If we find elements under this
/// key, we just enumerate all of them, yielding them as join matches.
///
/// Because hash-tables exhibit on average O(1) lookup, we turn what would have been
/// an O(n*m) algorithm -- in the case of nested loops joins -- into an O(n) algorithm.
/// We of course require some additional storage to do so, but in general this pays.
/// </summary>
/// <typeparam name="TLeftInput"></typeparam>
/// <typeparam name="TRightInput"></typeparam>
/// <typeparam name="TKey"></typeparam>
/// <typeparam name="TOutput"></typeparam>
internal sealed class JoinQueryOperator<TLeftInput, TRightInput, TKey, TOutput> : BinaryQueryOperator<TLeftInput, TRightInput, TOutput>
{
private readonly Func<TLeftInput, TKey> _leftKeySelector; // The key selection routine for the outer (left) data source.
private readonly Func<TRightInput, TKey> _rightKeySelector; // The key selection routine for the inner (right) data source.
private readonly Func<TLeftInput, TRightInput, TOutput> _resultSelector; // The result selection routine.
private readonly IEqualityComparer<TKey>? _keyComparer; // An optional key comparison object.
//---------------------------------------------------------------------------------------
// Constructs a new join operator.
//
internal JoinQueryOperator(ParallelQuery<TLeftInput> left, ParallelQuery<TRightInput> right,
Func<TLeftInput, TKey> leftKeySelector,
Func<TRightInput, TKey> rightKeySelector,
Func<TLeftInput, TRightInput, TOutput> resultSelector,
IEqualityComparer<TKey>? keyComparer)
: base(left, right)
{
Debug.Assert(left != null && right != null, "child data sources cannot be null");
Debug.Assert(leftKeySelector != null, "left key selector must not be null");
Debug.Assert(rightKeySelector != null, "right key selector must not be null");
Debug.Assert(resultSelector != null, "need a result selector function");
_leftKeySelector = leftKeySelector;
_rightKeySelector = rightKeySelector;
_resultSelector = resultSelector;
_keyComparer = keyComparer;
_outputOrdered = LeftChild.OutputOrdered;
SetOrdinalIndex(OrdinalIndexState.Shuffled);
}
public override void WrapPartitionedStream<TLeftKey, TRightKey>(
PartitionedStream<TLeftInput, TLeftKey> leftStream, PartitionedStream<TRightInput, TRightKey> rightStream,
IPartitionedStreamRecipient<TOutput> outputRecipient, bool preferStriping, QuerySettings settings)
{
Debug.Assert(rightStream.PartitionCount == leftStream.PartitionCount);
if (LeftChild.OutputOrdered)
{
if (ExchangeUtilities.IsWorseThan(LeftChild.OrdinalIndexState, OrdinalIndexState.Increasing))
{
PartitionedStream<TLeftInput, int> leftStreamInt =
QueryOperator<TLeftInput>.ExecuteAndCollectResults(leftStream, leftStream.PartitionCount, OutputOrdered, preferStriping, settings)
.GetPartitionedStream();
WrapPartitionedStreamHelper<int, TRightKey>(
ExchangeUtilities.HashRepartitionOrdered(leftStreamInt, _leftKeySelector, _keyComparer, null, settings.CancellationState.MergedCancellationToken),
rightStream, outputRecipient, settings.CancellationState.MergedCancellationToken);
}
else
{
WrapPartitionedStreamHelper<TLeftKey, TRightKey>(
ExchangeUtilities.HashRepartitionOrdered(leftStream, _leftKeySelector, _keyComparer, null, settings.CancellationState.MergedCancellationToken),
rightStream, outputRecipient, settings.CancellationState.MergedCancellationToken);
}
}
else
{
WrapPartitionedStreamHelper<int, TRightKey>(
ExchangeUtilities.HashRepartition(leftStream, _leftKeySelector, _keyComparer, null, settings.CancellationState.MergedCancellationToken),
rightStream, outputRecipient, settings.CancellationState.MergedCancellationToken);
}
}
//---------------------------------------------------------------------------------------
// This is a helper method. WrapPartitionedStream decides what type TLeftKey is going
// to be, and then call this method with that key as a generic parameter.
//
private void WrapPartitionedStreamHelper<TLeftKey, TRightKey>(
PartitionedStream<Pair<TLeftInput, TKey>, TLeftKey> leftHashStream, PartitionedStream<TRightInput, TRightKey> rightPartitionedStream,
IPartitionedStreamRecipient<TOutput> outputRecipient, CancellationToken cancellationToken)
{
if (RightChild.OutputOrdered && LeftChild.OutputOrdered)
{
PairOutputKeyBuilder<TLeftKey, TRightKey> outputKeyBuilder = new PairOutputKeyBuilder<TLeftKey, TRightKey>();
IComparer<Pair<TLeftKey, TRightKey>> outputKeyComparer =
new PairComparer<TLeftKey, TRightKey>(leftHashStream.KeyComparer, rightPartitionedStream.KeyComparer);
WrapPartitionedStreamHelper<TLeftKey, TRightKey, Pair<TLeftKey, TRightKey>>(leftHashStream,
ExchangeUtilities.HashRepartitionOrdered(rightPartitionedStream, _rightKeySelector, _keyComparer, null, cancellationToken),
outputKeyBuilder, outputKeyComparer, outputRecipient, cancellationToken);
}
else
{
LeftKeyOutputKeyBuilder<TLeftKey, int> outputKeyBuilder = new LeftKeyOutputKeyBuilder<TLeftKey, int>();
WrapPartitionedStreamHelper<TLeftKey, int, TLeftKey>(leftHashStream,
ExchangeUtilities.HashRepartition(rightPartitionedStream, _rightKeySelector, _keyComparer, null, cancellationToken),
outputKeyBuilder, leftHashStream.KeyComparer, outputRecipient, cancellationToken);
}
}
private void WrapPartitionedStreamHelper<TLeftKey, TRightKey, TOutputKey>(
PartitionedStream<Pair<TLeftInput, TKey>, TLeftKey> leftHashStream, PartitionedStream<Pair<TRightInput, TKey>, TRightKey> rightHashStream,
HashJoinOutputKeyBuilder<TLeftKey, TRightKey, TOutputKey> outputKeyBuilder, IComparer<TOutputKey> outputKeyComparer,
IPartitionedStreamRecipient<TOutput> outputRecipient, CancellationToken cancellationToken)
{
int partitionCount = leftHashStream.PartitionCount;
PartitionedStream<TOutput, TOutputKey> outputStream =
new PartitionedStream<TOutput, TOutputKey>(partitionCount, outputKeyComparer, OrdinalIndexState);
for (int i = 0; i < partitionCount; i++)
{
JoinHashLookupBuilder<TRightInput, TRightKey, TKey> rightLookupBuilder =
new JoinHashLookupBuilder<TRightInput, TRightKey, TKey>(rightHashStream[i], _keyComparer);
outputStream[i] = new HashJoinQueryOperatorEnumerator<TLeftInput, TLeftKey, TRightInput, TRightKey, TKey, TOutput, TOutputKey>(
leftHashStream[i], rightLookupBuilder, _resultSelector, outputKeyBuilder, cancellationToken);
}
outputRecipient.Receive(outputStream);
}
internal override QueryResults<TOutput> Open(QuerySettings settings, bool preferStriping)
{
QueryResults<TLeftInput> leftResults = LeftChild.Open(settings, false);
QueryResults<TRightInput> rightResults = RightChild.Open(settings, false);
return new BinaryQueryOperatorResults(leftResults, rightResults, this, settings, false);
}
//---------------------------------------------------------------------------------------
// Returns an enumerable that represents the query executing sequentially.
//
internal override IEnumerable<TOutput> AsSequentialQuery(CancellationToken token)
{
IEnumerable<TLeftInput> wrappedLeftChild = CancellableEnumerable.Wrap(LeftChild.AsSequentialQuery(token), token);
IEnumerable<TRightInput> wrappedRightChild = CancellableEnumerable.Wrap(RightChild.AsSequentialQuery(token), token);
return wrappedLeftChild.Join(
wrappedRightChild, _leftKeySelector, _rightKeySelector, _resultSelector, _keyComparer);
}
//---------------------------------------------------------------------------------------
// Whether this operator performs a premature merge that would not be performed in
// a similar sequential operation (i.e., in LINQ to Objects).
//
internal override bool LimitsParallelism
{
get { return false; }
}
}
/// <summary>
/// Class to build a HashJoinHashLookup of right elements for use in Join operations.
/// </summary>
/// <typeparam name="TElement"></typeparam>
/// <typeparam name="TOrderKey"></typeparam>
/// <typeparam name="THashKey"></typeparam>
internal sealed class JoinHashLookupBuilder<TElement, TOrderKey, THashKey> : HashLookupBuilder<TElement, TOrderKey, THashKey>
{
private readonly QueryOperatorEnumerator<Pair<TElement, THashKey>, TOrderKey> _dataSource; // data source. For building.
private readonly IEqualityComparer<THashKey>? _keyComparer; // An optional key comparison object.
internal JoinHashLookupBuilder(QueryOperatorEnumerator<Pair<TElement, THashKey>, TOrderKey> dataSource, IEqualityComparer<THashKey>? keyComparer)
{
Debug.Assert(dataSource != null);
_dataSource = dataSource;
_keyComparer = keyComparer;
}
public override HashJoinHashLookup<THashKey, TElement, TOrderKey> BuildHashLookup(CancellationToken cancellationToken)
{
HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> lookup =
new HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>>(_keyComparer);
JoinBaseHashBuilder baseHashBuilder = new JoinBaseHashBuilder(lookup);
BuildBaseHashLookup(_dataSource, baseHashBuilder, cancellationToken);
return new JoinHashLookup(lookup);
}
protected override void Dispose(bool disposing)
{
Debug.Assert(_dataSource != null);
_dataSource.Dispose();
}
/// <summary>
/// Adds TElement,TOrderKey values to a HashLookup of HashLookupValueLists.
/// </summary>
private readonly struct JoinBaseHashBuilder : IBaseHashBuilder<TElement, TOrderKey>
{
private readonly HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> _base;
public JoinBaseHashBuilder(HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> baseLookup)
{
Debug.Assert(baseLookup != null);
_base = baseLookup;
}
public bool Add(THashKey hashKey, TElement element, TOrderKey orderKey)
{
HashLookupValueList<TElement, TOrderKey> currentValue = default(HashLookupValueList<TElement, TOrderKey>);
if (!_base.TryGetValue(hashKey, ref currentValue))
{
currentValue = new HashLookupValueList<TElement, TOrderKey>(element, orderKey);
_base.Add(hashKey, currentValue);
return false;
}
else
{
if (currentValue.Add(element, orderKey))
{
// We need to re-store this element because the pair is a value type.
_base[hashKey] = currentValue;
}
return true;
}
}
}
/// <summary>
/// A wrapper for the HashLookup returned by JoinHashLookupBuilder.
///
/// Since Join operations do not require a default, this just passes the call on to the base lookup.
/// </summary>
private sealed class JoinHashLookup : HashJoinHashLookup<THashKey, TElement, TOrderKey>
{
private readonly HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> _base;
internal JoinHashLookup(HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> baseLookup)
{
Debug.Assert(baseLookup != null);
_base = baseLookup;
}
public override bool TryGetValue(THashKey key, ref HashLookupValueList<TElement, TOrderKey> value)
{
return _base.TryGetValue(key, ref value);
}
}
}
}

@roji
Copy link
Member Author

roji commented Dec 2, 2024

Point taken :) Maybe we can have a common implementation with a flag/enum that determines left vs. inner... If that works..

@JohnGalt1717
Copy link

Please add leftjoin for query syntax. (And crossjoin and rightjoin)

This is desperately needed

@bachratyg
Copy link

Corresponding syntax could be added to LINQ expression syntax, on the C# side (just like join) - but this is optional.

Please please please don't skip query syntax! Operators that introduce new variables (SelectMany, Join, GroupBy) are a pain to both read and write using method syntax.

@OJacot-Descombes
Copy link

A version having no result selector returning an IEnumerable<(TOuter, TInner?)> would be handy. Note that the Enumerable.Index(IEnumerable) Method returns a value tuple enumeration as well.

@CyrusNajmabadi
Copy link
Member

Language change requests (like query syntax) would need a discussion opened for that idea at dotnet/csharplang

@roji
Copy link
Member Author

roji commented Dec 3, 2024

@OJacot-Descombes this proposal intentionally follows the existing Enumerable.Join API shape very closely (that's the right model here, rather than Enumerable.Index). I'd prefer keeping additional new APIs as a separate, later discussion for both Join and LeftJoin, assuming the latter makes it in.

@CyrusNajmabadi yep. I'd like to get sign-off on the operator here first, once that happens I'll start the language conversation.

@eiriktsarpalis
Copy link
Member

For inner value types, the operator returns default when an outer has no inners.

I know this just reflects convention in existing methods such as FirstOrDefault, but I do think this approach is problematic in the general case. As any user of LINQ has probably already internalized, value sequences must be lifted to Nullable<T> and in reference types null doesn't necessarily represent the absence of a value. I'm wondering if we could do better going forward, e.g. by passing a boolean to the delegate or perhaps looking at using true option types assuming the language does get DU support soon.

@roji
Copy link
Member Author

roji commented Dec 3, 2024

Updated the proposal above, with the signatures for the Queryable variants, for RightJoin (in case we decide to add it), and with notes and alternative API proposals based on a conversation with @eiriktsarpalis on the ambiguity caused by the operator passing default to the result selector's inner parameter.

I agree that there's an ambiguity here in the general caes, though I think the cases where it matters are generally rare/contrived to warrant a more complex API... See above for more notes.

@eiriktsarpalis eiriktsarpalis added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Dec 3, 2024
@eiriktsarpalis eiriktsarpalis added this to the 10.0.0 milestone Dec 3, 2024
@aloraman
Copy link

aloraman commented Dec 3, 2024

Have you looked at alternative implementations, such as https://github.com/morelinq/MoreLINQ/blob/master/MoreLinq/LeftJoin.cs ?

P.S. People ask for cross join, right join in query syntax. If you would consider it, please, consider also adding support for theta joins (that is, joining not on equality (equijoins), but arbitrary conditions)

@terrajobst
Copy link
Member

terrajobst commented Dec 3, 2024

Video

  • We should do both LeftJoin and RightJoin as reordering the Linq query expression is much more involved than it would be in SQL
  • We could do an overload that doesn't take a result selector and returns a tuple. But we decided to keep this separate as we should do the same for the existing Join method
  • We decided against including Outer because the existing Join method omitted Inner.
namespace System.Linq;

public static class Enumerable
{
    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector);

    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);

    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector);

    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);
}

public static class Queryable
{
    public static IQueryable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IQueryable<TOuter> outer,
        IEnumerable<TInner> inner,
        Expression<Func<TOuter, TKey>> outerKeySelector,
        Expression<Func<TInner, TKey>> innerKeySelector,
        Expression<Func<TOuter, TInner?, TResult>> resultSelector);
	
    public static IQueryable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IQueryable<TOuter> outer,
        IEnumerable<TInner> inner,
        Expression<Func<TOuter, TKey>> outerKeySelector,
        Expression<Func<TInner, TKey>> innerKeySelector,
        Expression<Func<TOuter, TInner?, TResult>> resultSelector,
        IEqualityComparer<TKey>? comparer);
    
    public static IQueryable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IQueryable<TOuter> outer,
        IEnumerable<TInner> inner,
        Expression<Func<TOuter, TKey>> outerKeySelector,
        Expression<Func<TInner, TKey>> innerKeySelector,
        Expression<Func<TOuter?, TInner, TResult>> resultSelector);
	
    public static IQueryable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IQueryable<TOuter> outer,
        IEnumerable<TInner> inner,
        Expression<Func<TOuter, TKey>> outerKeySelector,
        Expression<Func<TInner, TKey>> innerKeySelector,
        Expression<Func<TOuter?, TInner, TResult>> resultSelector,
        IEqualityComparer<TKey>? comparer);
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Dec 3, 2024
@jogibear9988
Copy link

Linq2db already also added extension methods like this for query syntax:

https://linq2db.github.io/articles/sql/Join-Operators.html

@jogibear9988
Copy link

linq2db has:

LeftJoin, RightJoin, InnerJoin, FullJoin and CrossJoin

they are defined here: https://github.com/linq2db/linq2db/blob/507843d091b4d28c1808d568a729a456de4dde53/Source/LinqToDB/LinqExtensions.cs#L3438

@roji
Copy link
Member Author

roji commented Dec 3, 2024

Have you looked at alternative implementations, such as https://github.com/morelinq/MoreLINQ/blob/master/MoreLinq/LeftJoin.cs ?

Yep. One comment on that API is that it seems overly complex - requiring both a firstSelector and a bothSelector; it does account for the case of disambiguating between "not found" and "found but default" (see "Value types and alternative LeftJoin API shapes" in the OP), but as with the other proposals, it seems to make the basic API more complicated for a 1% edge case.

P.S. People ask for cross join, right join in query syntax. If you would consider it, please, consider also adding support for theta joins (that is, joining not on equality (equijoins), but arbitrary conditions)

Left join has received overwhelmingly more requests than the others, which is why this proposal concentrates on it (note that RightJoin is included as well). Cross join specifically is already quite easy to express (from ... from ..., SelectMany), so it doesn't necessarily require a dedicated operator.

But of course, the fact that other operators aren't included in this specific proposal doesn't mean that they won't be added in the future - I'm just not sure we've seen lots of demand for them.

@obiwanjacobi
Copy link

Decl: So you have a TOuter and a TInner. Which one is the left (or right) one?
Usage: Is the this extension argument the left one: left.LeftJoin(right, ...)?

@roji
Copy link
Member Author

roji commented Dec 11, 2024

@obiwanjacobi this works exactly like the existing (non-left) Join(), and the signature expresses this:

public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
	this IEnumerable<TOuter> outer,  
	IEnumerable<TInner> inner,
	Func<TOuter, TKey> outerKeySelector,
	Func<TInner, TKey> innerKeySelector,
	Func<TOuter, TInner?, TResult> resultSelector);

In other words, in x.LeftJoin(y, ...), x is the TOuter and y is the TInner.

@obiwanjacobi
Copy link

@roji but with a normal Join left or right is not important as it is here.
Is TOuter left? Or TInner? It is not clear.
Tell which side is seen as left (and right).
Or are you going for the Compare() vibe :-)

@JohnGalt1717
Copy link

Indeed. The inner should be the non-nullable entries, and the outer should be the optional join.

RightJoin join would be the opposite and CrossJoin would be just "left and right" with both being optional.

@terrajobst
Copy link
Member

terrajobst commented Dec 11, 2024

Maybe I'm missing something but in regular SQL it looks like this:

SELECT *
FROM   Outer o
        LEFT JOIN Inner i ON i.Id = O.Id

In C# it looks like this:

var results = dbContext.Outer.LeftJoin(dbContext.Inner, o => o.Id, i => i.Id, (o, i) => (Outer: o, Inner: i));

The same argument can be made for RightJoin. Outer is always on the left, inner is always on the right. LeftJoin and RightJoin signatures don't need to change to reflect that.

The confusing bit I guess is that in SQL, the join types use INNER and OUTER terminology, that is also used to describe the sides of the join. The way I think of this is that a join like X JOIN Y ON P can be implemented as nested loops like FOREACH X { FOREACH Y { IF (P) RETURN X, Y } }. There, outer and inner refer to loops (the inner loop vs the outer loop). Obviously, a SQL DB tries hard to avoid nested loops because it's O(X*Y), but for the purposes here that's an implementation detail.

@Clockwork-Muse
Copy link
Contributor

CrossJoin would be just "left and right" with both being optional.

No, that's FULL OUTER JOIN.
A CROSSJOIN is a full cartesian product of all rows in the table multiplied by the rows in the result set. It's equivalent to:

SELECT ...
FROM <table reference>
INNER JOIN <other reference>
ON 1 = 1

Both sides are always required.

(This capability is rarely useful with tables with real data, instead more often being used with subquery result sets and to build other such computed sets. This is particularly useful when constructing ad-hoc buckets for date/time window analysis).

@terrajobst
Copy link
Member

terrajobst commented Dec 11, 2024

We already have a cross join (kinda) with SelectMany().

Also, in standard SQL you can actually express a cross join:

SELECT *
FROM   Outer o
        CROSS JOIN Inner i

@obiwanjacobi
Copy link

Maybe I'm missing something but in regular SQL it looks like this:

SELECT *
FROM Outer o
LEFT JOIN Inner i ON i.Id = O.Id
In C# it looks like this:

var results = dbContext.Outer.LeftJoin(dbContext.Inner, o => o.Id, i => i.Id, (o, i) => (Outer: o, Inner: i));
The same argument can be made for RightJoin. Outer is always on the left, inner is always on the right. LeftJoin and RightJoin signatures don't need to change to reflect that.

The confusing bit I guess is that in SQL, the join types use INNER and OUTER terminology, that is also used to describe the sides of the join. The way I think of this is that a join like X JOIN Y ON P can be implemented as nested loops like FOREACH X { FOREACH Y { IF (P) RETURN X, Y } }. There, outer and inner refer to loops (the inner loop vs the outer loop). Obviously, a SQL DB tries hard to avoid nested loops because it's O(X*Y), but for the purposes here that's an implementation detail.

I don't know any SQL (I do - but suppose) so a function JoinLeft with no indication what left is, looks weird to me (the Compare vibe). Maybe I know a bit of set-related stuff, so I know what left and right joins do. Then I see this decl and think: what does inner and outer have to do with left and right? (okay I'm done ;-))

@roji
Copy link
Member Author

roji commented Dec 12, 2024

@obiwanjacobi naming is hard. First, LINQ already follows SQL naming with most of its operators: verbs such as Select and Where seem intuitive, but they also follow SQL naming (various other languages uess e.g. map and filter instead). Also, the majority of programmers have some knowledge of SQL, probably to the level where the concept of left joins is familiar. That's a major advantage of this naming - if you just know a bit of SQL you know exactly what the function does.

I'm also not sure if another naming would do any better. We can't replace the name LeftJoin with OuterJoin, since RightJoin would then have to be called InnerJoin, which would be completely in conflict with the concept of inner joins (the current Join operator). Any other name would also be unlikely to instantly convey the meaning anyway, so users would have to look at documentation regardless - LeftJoin at least has the advantage of being transparent for anyone with a passing knowledge of SQL.

@Clockwork-Muse
Copy link
Contributor

... The overlap with SQL is even more pronounced then just the method names, since the set-math behavior of Union/Intersect/Except is inherited from there as well.

@terrajobst
Copy link
Member

terrajobst commented Dec 12, 2024

@obiwanjacobi

I don't know any SQL (I do - but suppose) so a function JoinLeft with no indication what left is, looks weird to me (the Compare vibe)

FWIW, there many examples where methods are used as infix operators, like A + B becomes a.Add(b) or Op.Add(a, b). So there is quite a bit of prior art for methods processing two sides to name the first argument left and the second right because that's where they appear in operator syntax or when invoked as an instance method.

Is that ideal? Maybe not, but as @roji said, naming is hard and there are trade offs to any naming convention. Our overriding principle is consistency with what is already there and Linq (for better or worse) uses SQL-like naming. My personal opinion is that it's not ideal because some concepts don't translate well, such as SelectMany. But that train has sailed (deliberately mixing metaphors here).

@Mr0N
Copy link

Mr0N commented Dec 14, 2024

It would be good to have such an API, if it can be implemented, of course.

using System;
using System.Linq;

Enumerable.Empty<string>().LeftJoin(Enumerable.Empty<int>(), a => a.input == a.join.ToString());
static class Test
{
    public record JoinType<TInput, TJoin>(TInput input, TJoin join);
    public record JoinTypeInput<TInput, TJoin>(TInput input, TJoin join);
    public static IEnumerable<JoinType<TInput, TJoin>> LeftJoin<TJoin, TInput>(this IEnumerable<TInput> inputs, IEnumerable<TJoin> join, Func<JoinTypeInput<TInput, TJoin>, bool> selector)
    {
        throw new Exception();
    }
}

@roji
Copy link
Member Author

roji commented Dec 14, 2024

@Mr0N I'm very unclear on what it is you're asking for there, some usage examples may help clarify.

@Mr0N
Copy link

Mr0N commented Dec 14, 2024

@Mr0N I'm very unclear on what it is you're asking for there, some usage examples may help clarify.

It seems that the LeftJoin interface was meant to work like in this method. In SQL, for example, you don’t have to first select the fields to compare and then compare them separately; you can directly select the fields and compare them in one function. This results in the join of two tables, which can’t be changed. In the interface above, you first need to choose the fields to join, and then in a separate function, you compare those fields, which doesn’t seem very logical, because it could be done in one function.

select * from tab as t
left join jTab as j on t.id = j.id -- function

You can reduce the entire LeftJoin method interface from three functions to one.

var result = tab.LeftJoin(jTab,a=>a.join.id==a.input.id);

@Mr0N
Copy link

Mr0N commented Dec 14, 2024

Well, it's not that easy to implement so that with the help of a single function two collections can be joined using Func<JoinTypeInput<TInput, TJoin>, bool>. Essentially, the information from this function is quite sufficient to join the two collections. The function returns true or false, meaning whether the two elements are equal or not.

@roji
Copy link
Member Author

roji commented Dec 14, 2024

var result = tab.LeftJoin(jTab,a=>a.join.id==a.input.id);

Both the existing Join() operator and the proposed new LeftJoin() operator represent equijoin operations only; while SQL allows you to join over any expression function (e.g. SELECT * FROM a JOIN b ON b.x > a.y). This is a very intentional design: since only equality is supported, the internal implementation constructs a lookup table for the keys of the inner collection, and then loops over the outer collection, performing hash lookups. This is known in SQL databases as a "hash join", and is generally a much more efficient joining strategy than simply looping over all the inner elements for each outer element (that's known as a "nested loop join").

In other words, the fact that LINQ Join and LeftJoin are constrained to equijoins only - and their signatures look the way they do - allows them to perform much, much better than if they accepted an arbitrary bool-returning lambda, as in your example.

By the way, SQL works the same way: the database can perform hash joins for equijoins, but the moment you use a non-equality operator in your join condition, that join strategy cannot be used, and you may end up with a far slower implementation (though what exactly happens varies).

We could consider introducing an additional overload, which looks like what you propose (i.e. accepts a bool-returning lambda). This would unrelated to this LeftJoin proposal, since we'd also do it for Join and GroupJoin; it would also probably be quite a pit of failure, as users would be tempted to use it and get much worse performance compared to the existing overloads.

@manandre
Copy link
Contributor

I would like contributing to the implementation of these new LINQ operators.
Tell me if you would accept a PR for this.

@roji
Copy link
Member Author

roji commented Dec 15, 2024

@manandre that's appreciated, but I have a PR mostly already ready (I did the work so that I could benchmark - mostly tests and cleanup remain)...

@roji
Copy link
Member Author

roji commented Dec 15, 2024

C# query syntax proposal: dotnet/csharplang#8892

roji added a commit to roji/runtime that referenced this issue Dec 20, 2024
@roji roji linked a pull request Dec 20, 2024 that will close this issue
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq in-pr There is an active PR which will close this issue when it is merged
Projects
None yet
Development

Successfully merging a pull request may close this issue.