[API Proposal]: Introduce LeftJoin LINQ operator #110292
Comments
Tagging subscribers to this area: @dotnet/area-system-linq |
Your Additional Notes covered all the questions and naming thoughts I had, @roji. Thanks for including all of that. |
@stephentoub - Is this something we should also consider for PLINQ? (For others - because of the internals of PLINQ, it's not possible for third parties to implement their own operators, as it is with normal LINQ) |
From an API perspective, #98689 covers adding to PLINQ any APIs added to LINQ that PLINQ doesn't already have. I don't think this particular API is any more special than the others already omitted, so I'd just want to lump this one in with those. From an implementation perspective, I suspect the PLINQ implementation would just be the code @roji wrote in his opening comments, with that operator implemented by delegating to GroupJoin/SelectMany. Doing a full-fledged open-coded implementation in PLINQ would be a lot of difficult-to-validate code.
When doing an open-coded implementation, sure. But anyone can layer an implementation on top of the existing operators, just as with LINQ, e.g.
public static ParallelQuery<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this ParallelQuery<TOuter> outer, ParallelQuery<TInner> inner,
Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner?, TResult> resultSelector) =>
outer.GroupJoin(inner, outerKeySelector, innerKeySelector, (outerItem, innerItem) => (outerItem, innerItem))
.SelectMany(joinedSet => joinedSet.innerItem.DefaultIfEmpty(), (joinedSet, innerItem) => resultSelector(joinedSet.outerItem, innerItem)); |
@stephentoub I know very little about PLINQ, but assuming an implementation of Join already exists, wouldn't an implementation of LeftJoin be very similar (just as the non-PLINQ implementation of LeftJoin is very similar to the Join implementation)? |
Likely. But to put it in context, this is the LINQ implementation of Join: runtime/src/libraries/System.Linq/src/System/Linq/Join.cs Lines 48 to 74 in 342936c
and this monster is the PLINQ implementation: Lines 16 to 282 in 342936c
|
Point taken :) Maybe we can have a common implementation with a flag/enum that determines left vs. inner... If that works.. |
Please add leftjoin for query syntax. (And crossjoin and rightjoin) This is desperately needed |
Please please please don't skip query syntax! Operators that introduce new variables (SelectMany, Join, GroupBy) are a pain to both read and write using method syntax. |
A version having no result selector returning an |
Language change requests (like query syntax) would need a discussion opened for that idea at dotnet/csharplang |
@OJacot-Descombes this proposal intentionally follows the existing Enumerable.Join API shape very closely (that's the right model here, rather than Enumerable.Index). I'd prefer keeping additional new APIs as a separate, later discussion for both Join and LeftJoin, assuming the latter makes it in. @CyrusNajmabadi yep. I'd like to get sign-off on the operator here first, once that happens I'll start the language conversation. |
I know this just reflects convention in existing methods such as |
Updated the proposal above, with the signatures for the Queryable variants, for RightJoin (in case we decide to add it), and with notes and alternative API proposals based on a conversation with @eiriktsarpalis on the ambiguity caused by the operator passing I agree that there's an ambiguity here in the general case, though I think the cases where it matters are too rare/contrived to warrant a more complex API... See above for more notes. |
Have you looked at alternative implementations, such as https://github.com/morelinq/MoreLINQ/blob/master/MoreLinq/LeftJoin.cs ? P.S. People ask for cross join, right join in query syntax. If you would consider it, please, consider also adding support for theta joins (that is, joining not on equality (equijoins), but arbitrary conditions) |
namespace System.Linq;
public static class Enumerable
{
public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner?, TResult> resultSelector);
public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner?, TResult> resultSelector,
IEqualityComparer<TKey>? comparer);
public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter?, TInner, TResult> resultSelector);
public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter?, TInner, TResult> resultSelector,
IEqualityComparer<TKey>? comparer);
}
public static class Queryable
{
public static IQueryable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this IQueryable<TOuter> outer,
IEnumerable<TInner> inner,
Expression<Func<TOuter, TKey>> outerKeySelector,
Expression<Func<TInner, TKey>> innerKeySelector,
Expression<Func<TOuter, TInner?, TResult>> resultSelector);
public static IQueryable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this IQueryable<TOuter> outer,
IEnumerable<TInner> inner,
Expression<Func<TOuter, TKey>> outerKeySelector,
Expression<Func<TInner, TKey>> innerKeySelector,
Expression<Func<TOuter, TInner?, TResult>> resultSelector,
IEqualityComparer<TKey>? comparer);
public static IQueryable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
this IQueryable<TOuter> outer,
IEnumerable<TInner> inner,
Expression<Func<TOuter, TKey>> outerKeySelector,
Expression<Func<TInner, TKey>> innerKeySelector,
Expression<Func<TOuter?, TInner, TResult>> resultSelector);
public static IQueryable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
this IQueryable<TOuter> outer,
IEnumerable<TInner> inner,
Expression<Func<TOuter, TKey>> outerKeySelector,
Expression<Func<TInner, TKey>> innerKeySelector,
Expression<Func<TOuter?, TInner, TResult>> resultSelector,
IEqualityComparer<TKey>? comparer);
} |
Linq2db already also added extension methods like this for query syntax: |
linq2db has: LeftJoin, RightJoin, InnerJoin, FullJoin and CrossJoin they are defined here: https://github.com/linq2db/linq2db/blob/507843d091b4d28c1808d568a729a456de4dde53/Source/LinqToDB/LinqExtensions.cs#L3438 |
Yep. One comment on that API is that it seems overly complex - requiring both a firstSelector and a bothSelector; it does account for the case of disambiguating between "not found" and "found but default" (see "Value types and alternative LeftJoin API shapes" in the OP), but as with the other proposals, it seems to make the basic API more complicated for a 1% edge case.
Left join has received overwhelmingly more requests than the others, which is why this proposal concentrates on it (note that RightJoin is included as well). Cross join specifically is already quite easy to express ( But of course, the fact that other operators aren't included in this specific proposal doesn't mean that they won't be added in the future - I'm just not sure we've seen lots of demand for them. |
Decl: So you have a |
@obiwanjacobi this works exactly like the existing (non-left) Join(), and the signature expresses this:
public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner?, TResult> resultSelector);
In other words, |
@roji but with a normal |
Indeed. The inner should be the non-nullable entries, and the outer should be the optional join. RightJoin would be the opposite and CrossJoin would be just "left and right" with both being optional. |
Maybe I'm missing something but in regular SQL it looks like this:
SELECT *
FROM Outer o
LEFT JOIN Inner i ON i.Id = o.Id
In C# it looks like this:
// An anonymous type is used for the result, since expression trees (IQueryable) cannot contain tuple literals.
var results = dbContext.Outer.LeftJoin(dbContext.Inner, o => o.Id, i => i.Id, (o, i) => new { Outer = o, Inner = i });
The same argument can be made for The confusing bit I guess is that in SQL, the join types use |
No, that's SELECT ...
FROM <table reference>
INNER JOIN <other reference>
ON 1 = 1
Both sides are always required. (This capability is rarely useful with tables with real data, instead more often being used with subquery result sets and to build other such computed sets. This is particularly useful when constructing ad-hoc buckets for date/time window analysis). |
We already have a cross join (kinda) with Also, in standard SQL you can actually express a cross join:
SELECT *
FROM Outer o
CROSS JOIN Inner i |
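For reference, a cross join can be expressed with the existing operators by using SelectMany with a collection selector that ignores the outer element. This is a minimal sketch; the class name and sample data are made up for illustration, and it is only one possible reading of the comment above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CrossJoinSketch
{
    // Hypothetical demo method: pairs every outer element with every inner element.
    public static List<(int Outer, char Inner)> Run()
    {
        int[] outer = { 1, 2 };
        char[] inner = { 'a', 'b' };

        // The collection selector ignores the outer element, so each outer
        // element is combined with the whole inner sequence.
        return outer
            .SelectMany(_ => inner, (o, i) => (Outer: o, Inner: i))
            .ToList();
    }
}
```

The result contains |outer| x |inner| pairs, in outer-major order.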
I don't know any SQL (I do - but suppose) so a function |
@obiwanjacobi naming is hard. First, LINQ already follows SQL naming with most of its operators: verbs such as Select and Where seem intuitive, but they also follow SQL naming (various other languages use e.g. map and filter instead). Also, the majority of programmers have some knowledge of SQL, probably to the level where the concept of left joins is familiar. That's a major advantage of this naming - if you just know a bit of SQL you know exactly what the function does. I'm also not sure if another naming would do any better. We can't replace the name LeftJoin with OuterJoin, since RightJoin would then have to be called InnerJoin, which would be completely in conflict with the concept of inner joins (the current Join operator). Any other name would also be unlikely to instantly convey the meaning anyway, so users would have to look at documentation regardless - LeftJoin at least has the advantage of being transparent for anyone with a passing knowledge of SQL. |
... The overlap with SQL is even more pronounced than just the method names, since the set-math behavior of |
FWIW, there are many examples where methods are used as infix operators, like Is that ideal? Maybe not, but as @roji said, naming is hard and there are trade offs to any naming convention. Our overriding principle is consistency with what is already there and Linq (for better or worse) uses SQL-like naming. My personal opinion is that it's not ideal because some concepts don't translate well, such as |
It would be good to have such an API, if it can be implemented, of course.
|
@Mr0N I'm very unclear on what it is you're asking for there, some usage examples may help clarify. |
It seems that the LeftJoin interface was meant to work like in this method. In SQL, for example, you don’t have to first select the fields to compare and then compare them separately; you can directly select the fields and compare them in one function. This results in the join of two tables, which can’t be changed. In the interface above, you first need to choose the fields to join, and then in a separate function, you compare those fields, which doesn’t seem very logical, because it could be done in one function.
You can reduce the entire LeftJoin method interface from three functions to one.
|
Well, it's not that easy to implement so that with the help of a single function two collections can be joined using |
Both the existing Join() operator and the proposed new LeftJoin() operator represent equijoin operations only; while SQL allows you to join over any expression function (e.g. In other words, the fact that LINQ Join and LeftJoin are constrained to equijoins only - and their signatures look the way they do - allows them to perform much, much better than if they accepted an arbitrary bool-returning lambda, as in your example. By the way, SQL works the same way: the database can perform hash joins for equijoins, but the moment you use a non-equality operator in your join condition, that join strategy cannot be used, and you may end up with a far slower implementation (though what exactly happens varies). We could consider introducing an additional overload, which looks like what you propose (i.e. accepts a bool-returning lambda). This would be unrelated to this LeftJoin proposal, since we'd also do it for Join and GroupJoin; it would also probably be quite a pit of failure, as users would be tempted to use it and get much worse performance compared to the existing overloads. |
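To make the performance point concrete, here is a rough sketch of what a predicate-based left join would have to do. LeftJoinOnPredicate is a hypothetical name and shape, not a proposed API: with an opaque bool-returning lambda there is no key to hash, so every outer element has to be tested against every inner element.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public static class PredicateJoinSketch
{
    // Hypothetical overload shape, for illustration only.
    // Cost is O(|outer| * |inner|) predicate calls, versus roughly
    // O(|outer| + |inner|) hash operations for a key-based equijoin.
    public static IEnumerable<TResult> LeftJoinOnPredicate<TOuter, TInner, TResult>(
        this IEnumerable<TOuter> outer,
        IEnumerable<TInner> inner,
        Func<TOuter, TInner, bool> predicate,
        Func<TOuter, TInner?, TResult> resultSelector)
    {
        // Buffer the inner sequence once, since it is scanned for every outer element.
        List<TInner> buffered = inner.ToList();

        foreach (TOuter o in outer)
        {
            bool matched = false;
            foreach (TInner i in buffered)
            {
                if (predicate(o, i))
                {
                    matched = true;
                    yield return resultSelector(o, i);
                }
            }

            // Left join semantics: unmatched outers are still returned, with a default inner.
            if (!matched)
                yield return resultSelector(o, default);
        }
    }
}
```

A call such as outer.LeftJoinOnPredicate(inner, (o, i) => o < i, (o, i) => (o, i)) works for any condition, but it cannot use a Lookup the way the key-selector overloads do.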
I would like to contribute to the implementation of these new LINQ operators. |
@manandre that's appreciated, but I have a PR mostly already ready (I did the work so that I could benchmark - mostly tests and cleanup remain)... |
C# query syntax proposal: dotnet/csharplang#8892 |
Background and motivation
Background
LINQ has a Join operator, which, like its SQL INNER JOIN counterpart, correlates elements of two sequences based on matching keys; the LINQ join implementation internally creates a Lookup for the inner sequence, and then loops over the outer sequence, doing a lookup for the matching inner elements. In SQL database parlance, this is known as the hash join strategy (SQL Server docs, PostgreSQL docs as well as this useful post).
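The build/probe strategy described above can be sketched roughly as follows. This is not the runtime's actual Join code; InnerJoinSketch is a hypothetical helper that only mirrors the idea (build a Lookup over the inner sequence once, then probe it per outer element). One known simplification: as far as I know the real Join skips elements with null keys, which this sketch does not do.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashJoinSketch
{
    // Hypothetical helper, for illustration only.
    public static IEnumerable<TResult> InnerJoinSketch<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner, TResult> resultSelector)
    {
        // Build phase: hash every inner element by its key, once.
        ILookup<TKey, TInner> lookup = inner.ToLookup(innerKeySelector);

        // Probe phase: one hash lookup per outer element.
        foreach (TOuter o in outer)
        {
            // Indexing a Lookup with a missing key yields an empty sequence,
            // so outers without a match simply produce no results.
            foreach (TInner i in lookup[outerKeySelector(o)])
            {
                yield return resultSelector(o, i);
            }
        }
    }
}
```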
In addition to the above, SQL also has LEFT JOIN, which returns outer elements even if there's no corresponding inner ones; LINQ, in contrast, lacks this operator. The LINQ conceptual documentation shows how to combine existing operators to achieve a left join:
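The documented pattern is along these lines: GroupJoin, then SelectMany over each group with DefaultIfEmpty. This is a sketch rather than the verbatim docs sample; the Person and Pet types and the data are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LeftJoinViaGroupJoin
{
    // Hypothetical sample types.
    public record Person(int Id, string Name);
    public record Pet(string Name, int OwnerId);

    public static List<string> Run()
    {
        var people = new[] { new Person(1, "Ana"), new Person(2, "Ben") };
        var pets = new[] { new Pet("Rex", 1) };

        return people
            // Step 1: pair each person with the (possibly empty) group of their pets.
            .GroupJoin(pets, p => p.Id, pet => pet.OwnerId,
                (p, owned) => (Person: p, Pets: owned))
            // Step 2: flatten, substituting a single null for an empty group.
            .SelectMany(
                g => g.Pets.DefaultIfEmpty(),
                (g, pet) => g.Person.Name + ": " + (pet?.Name ?? "(no pet)"))
            .ToList();
    }
}
```

Run() yields one row per person/pet pair, plus one row for each person without pets.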
There are two issues with the above suggestion:
Proposal
This proposes introducing a 1st-class LeftJoin operator, which operates very similarly to Join, except that it also returns outer elements for which no inner element could be correlated. Aside from being much simpler to use than GroupJoin/SelectMany, it would also simply use Lookup internally - just like Join - and would therefore be much faster.
An initial implementation shows significant performance improvement compared to GroupJoin/SelectMany; LeftJoin is always faster than the equivalent GroupJoin/SelectMany construct, since GroupJoin itself constructs and uses a Lookup internally - just like LeftJoin does - but also adds additional work on top.
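As a rough illustration of the idea (not the prototype referenced in this issue), a Lookup-based left join could look like the sketch below. The name LeftJoinSketch is hypothetical and chosen to avoid clashing with a real LeftJoin; null-key handling and argument validation are omitted.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public static class LeftJoinSketchExtensions
{
    // Hypothetical helper, for illustration only.
    public static IEnumerable<TResult> LeftJoinSketch<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer = null)
    {
        // Same build phase as an inner join: hash the inner sequence once.
        ILookup<TKey, TInner> lookup = inner.ToLookup(innerKeySelector, comparer);

        foreach (TOuter o in outer)
        {
            bool matched = false;
            foreach (TInner i in lookup[outerKeySelector(o)])
            {
                matched = true;
                yield return resultSelector(o, i);
            }

            // The only difference from an inner join: an unmatched outer
            // still produces one result, with a default inner.
            if (!matched)
                yield return resultSelector(o, default);
        }
    }
}
```

Compared to the GroupJoin/SelectMany construct, this avoids the intermediate grouping results and the DefaultIfEmpty iterator per outer element.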
Benchmark code
LeftJoin operator prototype implementation
Note: The current LINQ documentation for GroupJoin/SelectMany shows using AsQueryable, for no apparent reason. The addition of AsQueryable here adds very significant perf overhead - see dotnet/docs#43807 for benchmarks and a proposal to remove AsQueryable from that code sample.
Additional Notes
default when an outer has no inners; this makes it impossible to distinguish between an inner not being found, and the inner being found but being null. This is similar to e.g. FirstOrDefault; although it's quite contrived for LeftJoin, see below for some notes and alternative API designs.
RightJoin(), which is the reverse of LeftJoin() (i.e. elements from the inner sequence are returned if no correlated outer is found). Right joins are seldom used in SQL, and it's always possible to flip the sequences around to express the join as a left join instead.
join) - but this is optional. Proposal: introduce left and right join clauses to C# query expression syntax csharplang#8892.
/cc @jeffhandley @dotnet/area-system-linq @dotnet/efteam
API Proposal
API Usage
Value types and alternative LeftJoin API shapes
As pointed out above, the fact that the result selector accepts a defaultable inner means that it's impossible to distinguish between no inner being found for an outer, and the situation where the inner itself happens to be the default (thanks for discussion on this, @eiriktsarpalis). This problem isn't specific to this proposal; other operators (e.g. FirstOrDefault) have the same problem.
An alternative API to address this would pass a boolean to the result selector, representing whether an inner was found or not:
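One possible shape for such an overload is sketched below. LeftJoinWithFlag is a hypothetical name, and the body is an assumption based on the Lookup-based approach described earlier, not the signature from the original proposal.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public static class LeftJoinWithFlagSketch
{
    // Hypothetical overload: the result selector also receives a bool
    // telling whether an inner element was actually found.
    public static IEnumerable<TResult> LeftJoinWithFlag<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, bool, TResult> resultSelector)
    {
        ILookup<TKey, TInner> lookup = inner.ToLookup(innerKeySelector);

        foreach (TOuter o in outer)
        {
            bool found = false;
            foreach (TInner i in lookup[outerKeySelector(o)])
            {
                found = true;
                yield return resultSelector(o, i, true);
            }

            // No match: pass default plus found == false, so the caller can
            // tell this apart from a matched inner that equals default.
            if (!found)
                yield return resultSelector(o, default, false);
        }
    }
}
```

With int inners, for example, a matched 0 arrives as (0, true) while a missing inner arrives as (0, false).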
Alternatively, with the upcoming introduction of discriminated unions to .NET, an Optional type could allow the same thing:
However, it seems that cases where the distinction between "not found" and "found but default" matters are especially rare for joining; the inner key selector would have to accept a null/default inner and "extract" a key out of that (matching the outer key); not impossible, but definitely feels contrived. It's also possible to work around the ambiguity (in some cases) by switching to a nullable value type.
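The nullable value type workaround can be sketched with today's operators. The example uses the GroupJoin/SelectMany pattern so that it runs without the proposed operator; the class name and data are made up. Projecting the inner ints to int? makes "no match" show up as null instead of 0.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NullableWorkaroundSketch
{
    public static List<string> Run()
    {
        int[] outer = { 0, 1 };
        int[] inner = { 0 };

        return outer
            // Project inner to int? so that default(int?) is null, not 0.
            .GroupJoin(inner.Select(i => (int?)i), o => (int?)o, i => i,
                (o, matches) => (Outer: o, Matches: matches))
            .SelectMany(
                g => g.Matches.DefaultIfEmpty(),
                (g, i) => g.Outer + " -> " + (i.HasValue ? i.Value.ToString() : "no match"))
            .ToList();
    }
}
```

Here the matched inner 0 and the missing inner for 1 are distinguishable, which would not be the case with a plain int inner.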
In other words, we should IMHO avoid making the API more complex/heavy for everyone because of a 1% case.
Previous related issues