Make a simple safer version of VLOOKUP and INDEX MATCH
The majority of Excel users use VLOOKUP to get exact matches from a row or column, It would be good if they all used INDEX MATCH but it's more complex so how about this simpler version for the masses.
=GETMATCH(ReferenceToMatch, MatchColumn/Row, GetColumn/Row, Optional Value if no match)
If seen a few other references to updating VLOOKUPS and there are lots of opportunities, but it would be good to address the simple most widely used issue first.
Thanks for the votes! Based on your feedback/votes we’ve just announced the new XLOOKUP function. You can read more about it here: https://techcommunity.microsoft.com/t5/Excel-Blog/Announcing-XLOOKUP/ba-p/811376
And have an option to search for multiple columns. So first argument (ref to match) could be a 1x2 range, and the second (match column) would be a nx2 range.
Corey Becker commented
I think it would be cool if there was an input for which match to return (0 to return an array of the matched values, -1 to return the last occurrence, 1+ to return the corresponding match). Hopefully that would be an optional input and wouldn't hurt performance for most use cases.
Charles Williams commented
I agree that a simple form of exact match lookup is needed that is optimised for both unsorted and sorted data. Performance can be significantly improved for linear search by storing the last index found and rechecking it next time. My suggested implementation of the simple form (MEMLOOKUP and MEMMATCH) handles both sorted and unsorted data, always gives an exact match and allows the column to be either a number (for compatibility with existing functions) or a column label. More complex variations of LOOKUPS are handled by the other members of the AVLOOKUP family (the function gets too complex to use if you try to satisfy all the options with a single function). You can try out these implementations - download from http://www.decisionmodels.com/Downloads/InstallFxlV3.zip
For your "re tangent", the =+ notation isn't the fault of Lotus 1-2-3 users. There's nothing wrong with using + to start a formula. The question is: why does Excel keep the +? I think that you could make a suggestion out of that.
Great to make it to page 1 of the ideas. Please encourage others to vote to move this into the top 10
Excel will return something for MATCH(1,(Rng=Criteria)^-1), but it's the index of the last item in Rng which satisfies Criteria, and it only does that due to the bracketing binary search requires. It's useful for getting the LAST index, but it's debatable that's what most Excel users want.
Binary search on unsorted data is possible
MATCH(1,(Rng=Criteria)^-1) - CSE
The problem is on large data the expression (Rng=Criteria)^-1 become slow
If there is a faster way of generating this array then we can utilize the speed of the binary search in Match
First re tangent: I have a utility macro which does just 2 things: replaces all instances of =+ with = (@#$%&*! 1-2-3 users), then deletes all non-3D-reference instances of the worksheet's name in formulas. I thought about creating Application-level event handlers and putting that in BeforeSave, but I decided not to risk long shutdowns.
If you have multiple arrays of booleans, don't bother with N(booleanarray1)*N(booleanarray2). Just use (booleanarray1)*(booleanarray2). Excel converts TRUE/FALSE to 1/0 in arithmetic expressions. Also, it has limitations. If you had a RANGE of boolean values, say, X3:X10, N(X3:X10) doesn't return an 8x1 array, just the first cell converted to 1/0.
Multiple criteria and linear search: binary search REQUIRES the ability to bracket the value sought, that is, find contiguous subsets of the lookup set where ALL values before (location) the value sought are less than (value comparison) the value sought and ALL values after (location) the value sought are greater than (value comparison) the value sought. Multiple criteria would often if not usually produce the same composite FALSE value on both sides of the value sought, possibly even within 3-items subsets (FALSE, TRUE, FALSE). Even if you have a table sorted on col A then col B, the composite criteria A:A,"<10",B:B,"=*X*" need not be sorted, which in the context of boolean values means all FALSE values before all TRUE values or vice versa. The A:A,"<10" portion would provide an initial subset, but B:B,"=*X*" wouldn't necessarily be sorted, so would require linear search.
Maybe it's possible to determine sorted state for several fields as a composite, but I figure the expected instances of sorted composites would be well below 1% of all uses. If so, the overhead to determine composite sorted state would overwhelm the rare advantage binary search would provide.
I should mention that to simulate Corey's GETMATCH, I use something like this:
N() converts TRUE and FALSE to 1 and 0, respectively. The INDEX(,0) forces array evaluation of the N()*N(). N()*N() simulates AND. MATCH(1,,0) returns the position the first match to both criteria. The outer INDEX uses this position on the return column.
Corey's GETMATCH is definitely simpler.
I should remind you that the Google Sheets FILTER function is not as generalized as they get. If you are using FILTER multiple times, you are probably repeating many of the same criteria, so to speed things up, you should be able to save and reuse common criteria. This is difficult when FILTER uses multiple arguments to represent AND. See my post below:
I also wouldn't say that multi-criterion lookups using Corey's GETMATCH necessarily mean a linear search. We'd just need radix sort detection as opposed to sort detection. As with my previous sort detection comments, the radix sort states would have to be precalculated and used for multiple GETMATCH to have any benefit.
Tangent first. I agree about the annoyance of Excel STUPIDLY adding the worksheet name to cell references in the same worksheet as the formula. If you're in SheetX!X99, start typing a formula, select a range in another worksheet, then select a range in SheetX, Excel should be smart enough to understand that it doesn't need the worksheet qualifier. At the very least there should be an option to drop the unnecessary worksheet qualifier.
MATCHIFS? What would it be? Would it return the topmost/leftmost row/column index of the first instance in which all criteria were satisfied? Or would it be a generalized lookup? The former might be interesting, but would always have to use linear search, though it could also use short-circuit evaluation, so if the Nth item in range1 didn't satisfy criteria1, the corresponding items in the other ranges wouldn't need to be checked. Yet another way to perform lookups is problematic, and it gets back to something like the FILTER function in Google Sheets. For me, as general as possible beats all alternatives.
Good suggestion Corey, a MATCHIFS could be a good option. I would prefer the ability to select the lookup value first as creating the formula flows better and avoids that Sheet1! annoyance when you jump back to a sheet while creating a formula. However, I can see the logic in consistency with SUMIFS and COUNTIFS
Corey Becker commented
I think it would make more sense to keep in consistent with SUMIFS/COUNTIFS and allow multiple criteria.
=GETMATCH(return_range, criteria_range1, criteria1, [criteria_range2, criteria2], ...)
In that case, my sort arrow suggestion would have complemented your suggestion perfectly… if I didn't delete it. -_-' It would allow for fast searches on sorted ranges without putting any burden on the formula user.
I would really like a simple and safe formula that does not require a match type. I.e exact match only. Speed is not an issue for 90% of users, simplicity is. I agree another formula similar to GetMatch with a match type would be useful but that's not what I'm proposing or voting for here
The dynamic match_type is meant to overcome the limitations of using a constant match_type. Assuming a constant match_type, MATCH(,,0) is slow even on sorted arrays while MATCH(,,±1) relies on the lookup array always being sorted. The dynamic match_type solves this by always using the fastest search method that still returns the correct result (i.e. ±1 for sorted lookup arrays and 0 otherwise).
As long as you are using helper cells, the dynamic match_type only needs to be calculated once per lookup column, not once per MATCH. Thus a dynamic match_type only costs us 2(N − 1) comparisons, or about 4 average-case exact-match searches, for any number of MATCHes. BTW, I updated the file that I linked.
For anyone thinking that I am getting off-topic, my current stance is that GETMATCH should have a match_type. Without a match_type, GETMATCH would be forced to always use a linear search, which is not the fastest search method for sorted arrays. If GETMATCH has a match_type, then we can provide a dynamic match_type to speed up exact-match searches on sorted lookup arrays without relying on the lookup array always being sorted.
Your dynamic sort state defeats any benefits from binary search.
For unsorted matching, all MATCH does is iterate through its 2nd arg (which should be a 1D list) comparing it to its 1st arg, stops when it finds a match, and returns the match's index. If no match is found, it returns #N/A.
For ascending sorted matching, all MATCH does is check that its 1st arg is >= the first item in its 2nd arg (if not it returns #N/A quickly), then checks if its 1st arg is > than the last item in its 2nd arg (if so, returns the last item's index quickly). If still going, MATCH checks its 1st arg against the middle item in its 2nd arg list, and effectively replaces the whole list with the portion within which its 1st arg falls. It repeats this last bit until it finds an exact match or until item K is < 1st arg and item K+1 > 1st arg. Descending sorted matching works the opposite of this.
With N items in the list, unsorted matching involves no more than N comparisons, and sorted matching involves no more than INT(LOG(N,2))+2 comparisons. Your dynamic matching would involve 2(N-1) comparisons before the matching would begin. How could that aid efficiency?
I'm an idiot. My main argument for not using match types was that if you use, say, MATCH(,,1), then you are assuming that the lookup column will always be in ascending order. However, I never thought of making the match_type dynamic to adapt to the sort state of the lookup array. MATCH(,,±1) is fine. The way that I was using it is not. Now that I realize this, I see that my sort arrow suggestion was unnecessary, so I deleted it.
To set up a dynamic match_type, see the file linked below:
Hopefully the default match_type for GETMATCH is the exact match, since it does not require a sorted lookup array.
Maybe for you most large tables have filters. Not for me. One of the models I work with several times a day has a 2000+ row table in a hidden worksheet. Filters in hidden worksheets would be kinda pointless. There are over 6,000 lookup formulas into this table in other worksheets. I want that table sorted, and I DON'T want lookup functions which would only use binary search when there was an autofilter showing it was sorted.
You're talking about interrelating various pieces of functionality. I'd prefer keeping separate pieces of functionality as orthogonal as possible.
Also, tables can include columns with formulas. Autofiltered tables can be sorted on columns containing formulas, and the autofilter arrow will show sorted. However, those formulas could recalculate and change values, rendering the column no longer sorted, but the autofilter would still show it sorted. Point: relying on autofilter sorted state isn't robust. I've seen too many BIG MISTAKES arising from small assumptions in spreadsheets to be comfortable relying on any assumptions.
My Tables comment refers to looking up using columns other than the first one, and if those other columns were sorted, MATCH could use binary search. Your whole first paragraph in you previous comment showed you didn't understand that this was the situation to which I was referring.
Tables that are big enough for the binary search to save significant time usually have filters on. At some point, before you do the binary search, you need to sort the lookup range. What I am suggesting is that this sorted state is stored as the sort arrow, and if an edit breaks the sort, then remove the arrow.
This way, whenever the sort arrow is there, you know that the range is sorted. Then you don't have to check if the list is sorted by comparing each item with the next. You instead check the sort state by looking if there is a sort arrow, which takes constant time. The sort state wouldn't even use additional memory, since Excel already has a sort arrow, but it currently behaves more uselessly from what I'm suggesting.
I hope that that clears up my side of the argument. Also, here is a picture of the sort arrow:
Lastly, I'm not sure why you mentioned "Tables can be sorted on columns other than the first one.". I never made that assumption, but VLOOKUP does.