Feedback by UserVoice

How can we improve Excel for Windows (Desktop Application)?

Built-in RegEx / Regular Expressions Functions to Search, Extract, and Replace

I've constantly found myself installing and re-installing add-ins that add RegEx Functions to Excel. Everyone I introduce them too finds them extremely useful. There's over half a million results in google for the query ["excel" ("regular expression" OR "regex")].

289 votes
Sign in
Check!
(thinking…)
Reset
or sign in with
  • facebook
  • google
    Password icon
    Signed in as (Sign out)

    We’ll send you updates on this idea

    Anonymous shared this idea  ·   ·  Flag idea as inappropriate…  ·  Admin →

    23 comments

    Sign in
    Check!
    (thinking…)
    Reset
    or sign in with
    • facebook
    • google
      Password icon
      Signed in as (Sign out)
      Submitting...
      • jedikangaroo commented  ·   ·  Flag as inappropriate

        Absolutely. Today I needed a function to remove multiple unwanted characters, then replace spaces with dashes.
        Google Sheets:
        =REGEXREPLACE(REGEXREPLACE(C2,"[()\.&:\+',=?/]","")," +","\-")

        Excel:
        =SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(C2,"(",""),")",""),".",""),"&","")," ","-"),":",""),"+",""),"'",""),",",""),"=",""),"?",""),"/",""),"--","-")

        The Google sheets version will replace multiple spaces with a single dash. The Excel version will remove double dashes, but that is all.

      • Perry Wheeler commented  ·   ·  Flag as inappropriate

        Support this 100%.
        Don't over-complicate or attempt to simplify it, just implement a =REGEX(String) formula based in one of the industry-standard models (PCRE maybe?). I have a VBA version that I use all the time, but, as it uses the MS RegEx library 5.0 it is an incomplete implementation (no look-behind).
        For me, this is probably the single biggest missing game-changer function in Excel and is the one that I most have to build workarounds for. I know there are those that argue that most functionality can be duplicated using existing functions but, once you get past the basics, these become horrifically complex, difficult to debug, nested messes to emulate something that can be done in a few chars of regex.

      • Harlan Grove commented  ·   ·  Flag as inappropriate

        @Brett Ables

        Regular expressions (regex) would be a specialized feature with a limited potential user base, though I figure a much large user base than for the Bessel functions, so only a few regex functions would be needed. FIND and SUBSTITUTE alternatives would be all that's needed. Want to count the number of substrings matching \b[^a-z]+foo\b ?

        =LEN(REGEXSUBSTITUTE(x,"(\b[^a-z]+foo\b)","$1#"))-LEN(x)

        similar to finding the last \ in a pathname using current functions.

        Yes, indeed, I'd like better text functions, not least being able to specify position from the end of strings and instance number when there are multiple instances, but there are formula workarounds. There's no alternative to general regular expression support. LibreOffice Calc has limited support, Google Sheets has lots of support, but Excel is stuck in the late 1980s.

      • Brett Ables commented  ·   ·  Flag as inappropriate

        Translating regex capabilities to Worksheet Functions does require a lot of consideration since regex can be used in many different workflows. Matching (True/False), Splitting, token extraction, searching, and counting are all various workflows that can utilize regex pattern matching. This would require a family of functions that implement different capabilities, with some aimed at user-friendly simple usage and a couple other array functions for power users. (Think slope, intercept, forecast vs the power-user function linear).

        The most obvious use to me would be to replace the existing workflow to extract something from a string which is currently a painful combination of find, left, right, mid, and substitute. By default the entire match is returned; if the pattern contains tokens, and optional argument allows selecting which of the tokens to return. A more powerful array function would be capable of returning multiple matches containing multiple tokens as a 2D array which could either be given a worksheet range to populate or aggregated to a single result with a function like Index() like any other array function.

        Similar functions could perform replacement, counting, splitting, etc. Each regex related function would accept text, a pattern, and the standard regex options (ignorecase, dotall, multiline, etc.) plus additional arguments specific to each type of operation.

        While most languages these days have a somewhat object oriented approach to Regex returning complex Match or MatchCollection objects (to use VBA's nomenclature), this would need to be a more functional implementation similar to what Matlab does with regexp and regexprep.

      • Mike_884 commented  ·   ·  Flag as inappropriate

        YES please!
        For example a find-and-replace with preview based on regex.
        It would be great.
        I know of only 1 RegEx Find/Replace (Codedown) add-in, but there is no preview, which I believe is a little tricky.
        =

      • Stilez commented  ·   ·  Flag as inappropriate

        The idea's good and I would like to see it as well, but the suggested functions/implementations feel clumsy to me. Can they be refined?

        It's also useful to avoid array functions since a lot of this will already be sourced as table data or lists of text, with sorting+filtering between rows and text being broken up across.

        *** Proposal for a PARSE() function ***

        Suppose there were four functions like this:

        PARSE(text_data, parse_pattern, separator)
        The first arg is the raw data to be parsed. The second arg is a string that says how the string is to be parsed into chunks, and any parsed chunks to ignore - a possible syntax is below. The third arg is a separator to be used in the output.

        PARSECOUNT(parsed_string, separator)
        counts the number of chunks in a parsed string

        PARSESELECT(parsed_string, separator, [chunk#])
        or: PARSESELECT(parsed_string, separator, [chunk_from#], [chunk_to#])
        returns a giuven chunk, or a subset of the parsed chunks

        PARSESORT(parsed_string, separator, sort_string)
        reorders the parsed string. The sort_string would be something like "2,1,3-" to indicate chunk 2, then chunk 1, then chunks >=3.

        Clearly the parser syntax can't be too complicated, but regex is usable in Microsoft Word and for simple matching we use ^ or ~ in Word and Excel already, so this comment uses a simple regex-based syntax, although it only needs a few functions for good parsing ability.

        To keep the comment simple, I've put the syntax explanation at the end of this comment. In simple terms it breaks a string into chunks using the stated separator. (...) is used to demarcate the chunks, similar to regex named groups. ^(..) indicates that a group is matched but then ignored in the returned string. The entire string must be matched for simplicity so if there is junk at the 3end, it's matched using ^(?*) (match a chunk containing any character any number of repeats, and then ignore it)

        *** Example 1 ***

        Data in A1: September 23, 2003
        Function: PARSE(A1, "([a-z]+)^( )([0-9]{-2})^(, )([0-9]{4})", "/")
        Output: "September/23/2003"

        Function: PARSESORT(PARSE(A1, "([a-z]+)^( )([0-9]{-2})^(, )([0-9]{4})", " ")," ","2,1,3")
        Output: "23 September 2003"

        *** Example 2 ***

        Data in A1: 01/23/2016 "J:\data\mary\excel\project1\myfile.xlsx" 300KB
        Task: Extract drive letter, bare folder, bare filename and size in bytes
        Formulae in Excel:
        B1: PARSE(A1, "^([0-9/]{10} )([a-z])^(:")(!"+)^(" )([0-9]+)((KB|MB))",";")
        > B1 now contains: J;\data\mary\excel\project1\myfile.xlsx;300;KB
        C1: PARSESELECT(B1,";",2)
        > C1 now contains: \data\mary\excel\project1\myfile.xlsx
        D1: PARSECOUNT(C1,"\")
        > D1 now contains: 6
        > Being clever - we've parsed the original string to get the URI, then a second parse using the backslashes *already* in the filename to get the path and file name. (the initial "\", treated as part of a parsed string, implies a null 1st chunk)
        Then the desired extracted data is:
        > Drive letter is PARSESELECT(B1,";",1)
        > Folder is PARSESELECT(C1,"\",1,D1-1) (all chunks in C1 from the first except the last, using "\" to split chunks)
        > Filename is PARSESELECT(C1,"\",D1) (the last chunk in C1)
        > Byte size is PARSESELECT(B1,";",3)*if(PARSESELECT(B1,";",4)="KB",1024,1024^2)

        *** Syntax details for above example ***

        Escapable characters: ()[]{}|*+?!^~
        Escape character ~ (already used as escape character in Excel Find)
        () identifies "chunks" in the parser, and is nestable for selections (much like regex)
        []{}*+| alternative characters, repetitions, alternatives, same as basic regex, but use "{1-3}" not "{1,3}" for simplicity
        ? single "any character" as usual (equivalent to regex ".")
        ! negation of following match ("must not match")
        ^ match following chunk but don't include it in the returned output string of the function
        ~ escape, which will need ~~ to escape itself

        Example: If A1 contains the example string given by Mitch_Judd (including its prefix):
        1.6.2.8.7280.1.54656.1::evAttributes(FALS,FALS,32768.00,32768.00);
        then PARSE(A1, "([1-9.]+)^(::!~(*~))((![,]+)^(,))+^(~);)","@") would give
        1.6.2.8.7280.1.54656.1@FALS@FALS@32768.00@32768.00
        because
        ([1-9.]+) matches one or more "0-9." and returns a chunk 1.6.2.8.7280.1.54656.1
        ^(::!~(*~)) matches "::" followed by not-literal-( repeated 0+ times, followed by literal-(. The initial ^ means this must match but is then ignored, so "::evAttributes(" doesn't appear in the output
        ((!,+)^(,))+ matches >=1 chunks, each comprising >=1 non-commas followed by a comma, but the ^ means that the final comma isn't returned as part of the chunk. It returns these each separated by a separator.
        ^(~);) requires a match with literal-) followed by semicolon, but then doesn't return these as a chunk.
        The output is therefore 1.6.2.8.7280.1.54656.1@FALS@FALS@32768.00@32768.00

        I think that's all you'd need for a parsing function set. The syntax would need some review and tightening up - this was a brief idea version only.

      • aja commented  ·   ·  Flag as inappropriate

        Quite often I need to quickly extract certain words from lines of text. Text-to-columns is overkill for this because I usually have to delete all the columns that I'm not interested in.

        What I'd like to see is a function called "word" which extracts the nth word from the text in a cell, which can be used like this:

        A B C
        the cat sat on the mat =word(A1,2) =word(A1,6)

        The result is:

        the cat sat on the mat cat mat

        This is so much easier than hacking about with =mid(A1,1,find(" ",A1)) ...and similar acrobatics. A "word" is space-delimited text by default, and leading/trailing/multiple spaces should be insignificant. That is, when you ask for word 1, you should get the first space-delimited word and no surrounding spaces.

        The full function could allow a delimiter to be specified, for example =word(cellref,pos[,delimiter]) in which you could specify one or more delimiters. The delimiter, if omitted, should default to a space.

        Invalid word position numbers should simply cause the function to return empty text ("").

        A negative word position number should return the nth last word in the string. For example, word("the cat",-1) should return "cat".

        Specifying zero as the word position should return empty text ("").

        The function should never return an error. The result is not used for calculation, so an error indication is unnecessary. Empty text is an acceptable return value when an error occurs.

        The function should be called "word". A precedent for this is the almost identical function built-in to the Rexx Programming Language - see http://www.rexxinfo.org/html/functions.html#word.

      • javier commented  ·   ·  Flag as inappropriate

        Allow use regexp in functions and formats.

        (2016-07-14 Dan [MS]: updated title from "regexp" to be more clear)

      • Regis commented  ·   ·  Flag as inappropriate

        I agree. Other example: date entered as text, not in the format expected for Excel to recognize a date like 2015/9/22. I'd like to be able to easily find 22 using maybe RIGHTOF(<cellref>,"/",last) meaning return the content in <cellref>, to the right of the last occurrence of "/". This function could take a number instead of "last", that number would be the occurrence of the delimiter. RIGHTOF(<cellref>,"/",2) would also return 22 in this case (Right of <cellref> after the second occurrence of "/"). This would avoid having to calculate a bunch of things like RIGHT(<cellref>,LEN(<cellref>)-FIND(<string>,<cellref>,[startnum]))

        LEFTOF would work the same to avoid having to do things like LEFT(<cellref>,FIND(<string>,<cellref>)-1)

        Also MIDBETWEN(<cellref>,"/",<occurrence1>,<occurrence2>) where <occurrence1> and <occurrence2> can be an integer, or "first", or "last". MIDBETWEEN(<cellref>,"/",1,2) or MIDBETWEEN(<cellref>,"/",first,last) would return 9 in this example.

      • Igor commented  ·   ·  Flag as inappropriate

        VBA supports regex after you add the reference to "Microsoft VBScript Regular Expressions 5.5". And building a small function that can be used in cell formulas is not that difficult... Used it myself quite frequently.

      • Anonymous commented  ·   ·  Flag as inappropriate

        Definitely would love this, and in the Mac version too. I often export Excel files as .CSV and then work in Notepad++ to use the regex in there, before reimporting back into Excel. It'd be great to just keep it in Excel.

      • Tor Jensen commented  ·   ·  Flag as inappropriate

        I recently started using regular expressions in other programs and now that I can read the matrix, the lack of them in Excel is really frustrating! Would love to see it as a feature!

      • Austin commented  ·   ·  Flag as inappropriate

        I'm pumped that you're adding support for this, I use Regex frequently around the office so this should be a nice time saver.

        I have a quick question though, do you intend on adding support for lookbehinds? From what I've seen Javascript and VBA don't support it and the alternative is usually overly complicated.

        Thanks!

      • Mitch Judd commented  ·   ·  Flag as inappropriate

        Functions like Left, Mid, Right, Search are good for what they are intended for but lack the power to do real parsing. Need functions to extract, count and manipulate strings.

        I have to write massive formulas to get what I need from different types of strings from log files.
        Ex 1

        string ex. 1.6.2.8.7280.1.54656.1::evAttributes(FALS,FALS,32768.00,32768.00,32768.00,32768.00,32768.00,32768.00,32768.00,32768.00);

        I would like to be able to do something like =GET(A1,",",2,",",3) that will bring back everything between comma 2 and comma 3 that is in Cell A1

        Ex 2
        I used a short string here to save space but imagine it is 514 characters in length.
        1.10.1.46765::evData("2015/01/12-00:00:03","A40022120164D9FF1955CC45001BAE04F34500000000EAFDEAFD00000000000000000000D6EC0500E0882900A4007F00F345C445D154B354D154B35451D4A45451D4A454F345EC43F345E2471A47D1449D00000000000000000000000000000000000100000000000000");

        Like to have a way to tell text to columns To look for a pattern ex.
        BREAK(A1,",",1,")",-1,"224222464222242244422")

        This would tell text to column to make the first break at the first character after the comma "," +1, make the second break at ")" -1 then break the result up starting from left to right 2 characters, 2 characters, 4 characters, etc etc

        Lastly give us some RegX functions so our hands are not tied.

      ← Previous 1

      Feedback and Knowledge Base