Feedback by UserVoice

How can we improve Excel for Windows (Desktop Application)?

Built-in RegEx / Regular Expressions Functions to Search, Extract, and Replace

I've constantly found myself installing and re-installing add-ins that add RegEx Functions to Excel. Everyone I introduce them too finds them extremely useful. There's over half a million results in google for the query ["excel" ("regular expression" OR "regex")].

353 votes
Sign in
Check!
(thinking…)
Reset
or sign in with
  • facebook
  • google
    Password icon
    Signed in as (Sign out)

    We’ll send you updates on this idea

    Anonymous shared this idea  ·   ·  Flag idea as inappropriate…  ·  Admin →

    26 comments

    Sign in
    Check!
    (thinking…)
    Reset
    or sign in with
    • facebook
    • google
      Password icon
      Signed in as (Sign out)
      Submitting...
      • Dale Phurrough commented  ·   ·  Flag as inappropriate

        Wow, excel has fallen behind. Regex is a defacto standard feature for data parsing/manipulation. When implemented, must be one of the standard regex flavors - not Microsoft new. Flavor of PCRE or Javascript is preferred. [I just now looked into your upcoming excel javascript, but it is inside and the process is far too complex. So I instead uploaded my excel spreadsheet to google sheets, made the regex there in about 3 minutes, and now it is working great. One less spreadsheet in Excel -- one more spreadsheet in google.]

      • Mark Fitzpatrick commented  ·   ·  Flag as inappropriate

        It should be in both Native Excel and M.

        Dynamic Arrays will likely increase the need and power of a RegEx function.

      • Any Mouse commented  ·   ·  Flag as inappropriate

        Since REGEX is such a multifaceted animal, I wish to express my interest in at least having the functionality to cull a formatted sub-string from another cell. Something like =LEFT(A5, 11), except that I don't know where in the string the 11 characters that I want will be (based on format). This is for pulling specific information out of user entered data. The point is, I am not looking for find or find/replace, which is interactive, or for a filter; I am looking to extract consistently formatted information from inconsistently entered data, applied in every row of data provided.

      • jedikangaroo commented  ·   ·  Flag as inappropriate

        Absolutely. Today I needed a function to remove multiple unwanted characters, then replace spaces with dashes.
        Google Sheets:
        =REGEXREPLACE(REGEXREPLACE(C2,"[()\.&:\+',=?/]","")," +","\-")

        Excel:
        =SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(C2,"(",""),")",""),".",""),"&","")," ","-"),":",""),"+",""),"'",""),",",""),"=",""),"?",""),"/",""),"--","-")

        The Google sheets version will replace multiple spaces with a single dash. The Excel version will remove double dashes, but that is all.

      • Perry Wheeler commented  ·   ·  Flag as inappropriate

        Support this 100%.
        Don't over-complicate or attempt to simplify it, just implement a =REGEX(String) formula based in one of the industry-standard models (PCRE maybe?). I have a VBA version that I use all the time, but, as it uses the MS RegEx library 5.0 it is an incomplete implementation (no look-behind).
        For me, this is probably the single biggest missing game-changer function in Excel and is the one that I most have to build workarounds for. I know there are those that argue that most functionality can be duplicated using existing functions but, once you get past the basics, these become horrifically complex, difficult to debug, nested messes to emulate something that can be done in a few chars of regex.

      • Harlan Grove commented  ·   ·  Flag as inappropriate

        @Brett Ables

        Regular expressions (regex) would be a specialized feature with a limited potential user base, though I figure a much large user base than for the Bessel functions, so only a few regex functions would be needed. FIND and SUBSTITUTE alternatives would be all that's needed. Want to count the number of substrings matching \b[^a-z]+foo\b ?

        =LEN(REGEXSUBSTITUTE(x,"(\b[^a-z]+foo\b)","$1#"))-LEN(x)

        similar to finding the last \ in a pathname using current functions.

        Yes, indeed, I'd like better text functions, not least being able to specify position from the end of strings and instance number when there are multiple instances, but there are formula workarounds. There's no alternative to general regular expression support. LibreOffice Calc has limited support, Google Sheets has lots of support, but Excel is stuck in the late 1980s.

      • Brett Ables commented  ·   ·  Flag as inappropriate

        Translating regex capabilities to Worksheet Functions does require a lot of consideration since regex can be used in many different workflows. Matching (True/False), Splitting, token extraction, searching, and counting are all various workflows that can utilize regex pattern matching. This would require a family of functions that implement different capabilities, with some aimed at user-friendly simple usage and a couple other array functions for power users. (Think slope, intercept, forecast vs the power-user function linear).

        The most obvious use to me would be to replace the existing workflow to extract something from a string which is currently a painful combination of find, left, right, mid, and substitute. By default the entire match is returned; if the pattern contains tokens, and optional argument allows selecting which of the tokens to return. A more powerful array function would be capable of returning multiple matches containing multiple tokens as a 2D array which could either be given a worksheet range to populate or aggregated to a single result with a function like Index() like any other array function.

        Similar functions could perform replacement, counting, splitting, etc. Each regex related function would accept text, a pattern, and the standard regex options (ignorecase, dotall, multiline, etc.) plus additional arguments specific to each type of operation.

        While most languages these days have a somewhat object oriented approach to Regex returning complex Match or MatchCollection objects (to use VBA's nomenclature), this would need to be a more functional implementation similar to what Matlab does with regexp and regexprep.

      • Mike_884 commented  ·   ·  Flag as inappropriate

        YES please!
        For example a find-and-replace with preview based on regex.
        It would be great.
        I know of only 1 RegEx Find/Replace (Codedown) add-in, but there is no preview, which I believe is a little tricky.
        =

      • Stilez commented  ·   ·  Flag as inappropriate

        The idea's good and I would like to see it as well, but the suggested functions/implementations feel clumsy to me. Can they be refined?

        It's also useful to avoid array functions since a lot of this will already be sourced as table data or lists of text, with sorting+filtering between rows and text being broken up across.

        *** Proposal for a PARSE() function ***

        Suppose there were four functions like this:

        PARSE(text_data, parse_pattern, separator)
        The first arg is the raw data to be parsed. The second arg is a string that says how the string is to be parsed into chunks, and any parsed chunks to ignore - a possible syntax is below. The third arg is a separator to be used in the output.

        PARSECOUNT(parsed_string, separator)
        counts the number of chunks in a parsed string

        PARSESELECT(parsed_string, separator, [chunk#])
        or: PARSESELECT(parsed_string, separator, [chunk_from#], [chunk_to#])
        returns a giuven chunk, or a subset of the parsed chunks

        PARSESORT(parsed_string, separator, sort_string)
        reorders the parsed string. The sort_string would be something like "2,1,3-" to indicate chunk 2, then chunk 1, then chunks >=3.

        Clearly the parser syntax can't be too complicated, but regex is usable in Microsoft Word and for simple matching we use ^ or ~ in Word and Excel already, so this comment uses a simple regex-based syntax, although it only needs a few functions for good parsing ability.

        To keep the comment simple, I've put the syntax explanation at the end of this comment. In simple terms it breaks a string into chunks using the stated separator. (...) is used to demarcate the chunks, similar to regex named groups. ^(..) indicates that a group is matched but then ignored in the returned string. The entire string must be matched for simplicity so if there is junk at the 3end, it's matched using ^(?*) (match a chunk containing any character any number of repeats, and then ignore it)

        *** Example 1 ***

        Data in A1: September 23, 2003
        Function: PARSE(A1, "([a-z]+)^( )([0-9]{-2})^(, )([0-9]{4})", "/")
        Output: "September/23/2003"

        Function: PARSESORT(PARSE(A1, "([a-z]+)^( )([0-9]{-2})^(, )([0-9]{4})", " ")," ","2,1,3")
        Output: "23 September 2003"

        *** Example 2 ***

        Data in A1: 01/23/2016 "J:\data\mary\excel\project1\myfile.xlsx" 300KB
        Task: Extract drive letter, bare folder, bare filename and size in bytes
        Formulae in Excel:
        B1: PARSE(A1, "^([0-9/]{10} )([a-z])^(:")(!"+)^(" )([0-9]+)((KB|MB))",";")
        > B1 now contains: J;\data\mary\excel\project1\myfile.xlsx;300;KB
        C1: PARSESELECT(B1,";",2)
        > C1 now contains: \data\mary\excel\project1\myfile.xlsx
        D1: PARSECOUNT(C1,"\")
        > D1 now contains: 6
        > Being clever - we've parsed the original string to get the URI, then a second parse using the backslashes *already* in the filename to get the path and file name. (the initial "\", treated as part of a parsed string, implies a null 1st chunk)
        Then the desired extracted data is:
        > Drive letter is PARSESELECT(B1,";",1)
        > Folder is PARSESELECT(C1,"\",1,D1-1) (all chunks in C1 from the first except the last, using "\" to split chunks)
        > Filename is PARSESELECT(C1,"\",D1) (the last chunk in C1)
        > Byte size is PARSESELECT(B1,";",3)*if(PARSESELECT(B1,";",4)="KB",1024,1024^2)

        *** Syntax details for above example ***

        Escapable characters: ()[]{}|*+?!^~
        Escape character ~ (already used as escape character in Excel Find)
        () identifies "chunks" in the parser, and is nestable for selections (much like regex)
        []{}*+| alternative characters, repetitions, alternatives, same as basic regex, but use "{1-3}" not "{1,3}" for simplicity
        ? single "any character" as usual (equivalent to regex ".")
        ! negation of following match ("must not match")
        ^ match following chunk but don't include it in the returned output string of the function
        ~ escape, which will need ~~ to escape itself

        Example: If A1 contains the example string given by Mitch_Judd (including its prefix):
        1.6.2.8.7280.1.54656.1::evAttributes(FALS,FALS,32768.00,32768.00);
        then PARSE(A1, "([1-9.]+)^(::!~(*~))((![,]+)^(,))+^(~);)","@") would give
        1.6.2.8.7280.1.54656.1@FALS@FALS@32768.00@32768.00
        because
        ([1-9.]+) matches one or more "0-9." and returns a chunk 1.6.2.8.7280.1.54656.1
        ^(::!~(*~)) matches "::" followed by not-literal-( repeated 0+ times, followed by literal-(. The initial ^ means this must match but is then ignored, so "::evAttributes(" doesn't appear in the output
        ((!,+)^(,))+ matches >=1 chunks, each comprising >=1 non-commas followed by a comma, but the ^ means that the final comma isn't returned as part of the chunk. It returns these each separated by a separator.
        ^(~);) requires a match with literal-) followed by semicolon, but then doesn't return these as a chunk.
        The output is therefore 1.6.2.8.7280.1.54656.1@FALS@FALS@32768.00@32768.00

        I think that's all you'd need for a parsing function set. The syntax would need some review and tightening up - this was a brief idea version only.

      • aja commented  ·   ·  Flag as inappropriate

        Quite often I need to quickly extract certain words from lines of text. Text-to-columns is overkill for this because I usually have to delete all the columns that I'm not interested in.

        What I'd like to see is a function called "word" which extracts the nth word from the text in a cell, which can be used like this:

        A B C
        the cat sat on the mat =word(A1,2) =word(A1,6)

        The result is:

        the cat sat on the mat cat mat

        This is so much easier than hacking about with =mid(A1,1,find(" ",A1)) ...and similar acrobatics. A "word" is space-delimited text by default, and leading/trailing/multiple spaces should be insignificant. That is, when you ask for word 1, you should get the first space-delimited word and no surrounding spaces.

        The full function could allow a delimiter to be specified, for example =word(cellref,pos[,delimiter]) in which you could specify one or more delimiters. The delimiter, if omitted, should default to a space.

        Invalid word position numbers should simply cause the function to return empty text ("").

        A negative word position number should return the nth last word in the string. For example, word("the cat",-1) should return "cat".

        Specifying zero as the word position should return empty text ("").

        The function should never return an error. The result is not used for calculation, so an error indication is unnecessary. Empty text is an acceptable return value when an error occurs.

        The function should be called "word". A precedent for this is the almost identical function built-in to the Rexx Programming Language - see http://www.rexxinfo.org/html/functions.html#word.

      • javier commented  ·   ·  Flag as inappropriate

        Allow use regexp in functions and formats.

        (2016-07-14 Dan [MS]: updated title from "regexp" to be more clear)

      • Regis commented  ·   ·  Flag as inappropriate

        I agree. Other example: date entered as text, not in the format expected for Excel to recognize a date like 2015/9/22. I'd like to be able to easily find 22 using maybe RIGHTOF(<cellref>,"/",last) meaning return the content in <cellref>, to the right of the last occurrence of "/". This function could take a number instead of "last", that number would be the occurrence of the delimiter. RIGHTOF(<cellref>,"/",2) would also return 22 in this case (Right of <cellref> after the second occurrence of "/"). This would avoid having to calculate a bunch of things like RIGHT(<cellref>,LEN(<cellref>)-FIND(<string>,<cellref>,[startnum]))

        LEFTOF would work the same to avoid having to do things like LEFT(<cellref>,FIND(<string>,<cellref>)-1)

        Also MIDBETWEN(<cellref>,"/",<occurrence1>,<occurrence2>) where <occurrence1> and <occurrence2> can be an integer, or "first", or "last". MIDBETWEEN(<cellref>,"/",1,2) or MIDBETWEEN(<cellref>,"/",first,last) would return 9 in this example.

      • Igor commented  ·   ·  Flag as inappropriate

        VBA supports regex after you add the reference to "Microsoft VBScript Regular Expressions 5.5". And building a small function that can be used in cell formulas is not that difficult... Used it myself quite frequently.

      • Anonymous commented  ·   ·  Flag as inappropriate

        Definitely would love this, and in the Mac version too. I often export Excel files as .CSV and then work in Notepad++ to use the regex in there, before reimporting back into Excel. It'd be great to just keep it in Excel.

      • Tor Jensen commented  ·   ·  Flag as inappropriate

        I recently started using regular expressions in other programs and now that I can read the matrix, the lack of them in Excel is really frustrating! Would love to see it as a feature!

      ← Previous 1

      Feedback and Knowledge Base