Last modified: May 19, 2000 (Japanese Version)
Regular Expressions for Beginners

Yasumasa Someya


"Regular expressions" are combinations of ordinary characters and special symbols used for pattern matching; i.e., you specify a particular combination of such characters and symbols (= a regular expression) and the computer searches the text data (the BLC in our case) for strings that match it. The following is a very short, and hopefully easy-to-follow, introduction to some of the most useful regular expressions that you will want to know when using the BLC concordancer.
 

1.  Ordinary alphanumerics: Ordinary letters (a, b, c, ...; A, B, C, ...) and numerals (1, 2, 3, ...) will match "as is", as in the following examples:
 
you specify     and it will match
a               a
word            word
123             123
this word       this word
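
The literal matches in the table above can be tried out with a short script. This is only a sketch: it uses Python's re module rather than the BLC concordancer, and the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text; it is not taken from the BLC.
text = "Type this word, then 123, then a final word."

# Letters, digits, and spaces have no special meaning,
# so each of these patterns simply matches itself.
literal_patterns = ["a", "word", "123", "this word"]

# re.search returns the first place in the text where the pattern occurs.
matches = [re.search(pattern, text).group(0) for pattern in literal_patterns]

for pattern, found in zip(literal_patterns, matches):
    print(repr(pattern), "matched", repr(found))
```

Each pattern is reported as matching exactly the same string in the text.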

2. Major regular expression symbols and their meanings
 
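RegEx Symbol    Meaning
 *              Match the preceding item 0 or more times
 +              Match the preceding item 1 or more times
 .              Match any single character, including a space
 ^              Match the beginning of a line (if used as the first character inside square brackets, this means "NOT")  => See Note below.
 [   ]          Character class: match any one of the characters listed inside
 (   )          Grouping
 |              Alternation: match either the item on the left or the item on the right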

Note: The "^" symbol (called "caret") is not accepted as a beginning-of-line marker by the current BLC concordancer due to the particular data structure of the corpus. It can, however, be used within square brackets.
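
The two roles of the caret, and the behaviour of the dot, can be seen in the following sketch. It assumes Python's re module, whose syntax for these symbols is the common one; the BLC concordancer itself may behave differently (for example, it rejects the caret outside brackets, as noted above). The sample line is hypothetical.

```python
import re

line = "Word 42 word"  # hypothetical sample line, not BLC data

# Outside brackets, ^ ties the match to the beginning of the line.
starts_with_capital = re.search(r"^Word", line) is not None   # True
starts_with_lower = re.search(r"^word", line) is not None     # False: "word" is not at the start

# Inside brackets, ^ means "NOT": [^a-zA-Z] is any single non-letter.
non_letters = re.findall(r"[^a-zA-Z]", line)

# The dot matches any single character, and a space counts as one.
dot_matches_space = re.fullmatch(r"Word.42", "Word 42") is not None

print(starts_with_capital, starts_with_lower)
print(non_letters)        # [' ', '4', '2', ' ']
print(dot_matches_space)  # True
```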

3. Examples
 
 you specify      and it will...
 a*               match "a" repeated 0 or more times (e.g. the empty string, a, aa, aaa, ...)
 a+               match "a" repeated 1 or more times (e.g. a, aa, aaa, ...)
 ...              match any sequence of three characters, including spaces (by adding a space before and after this sequence, it means "any single word consisting of three characters").
 ^Word            match "Word" when it appears at the beginning of a line/sentence (=> this symbol, however, is not accepted at the moment).
 [abc]            match either "a" or "b" or "c".
 [a-z]            match any one lowercase letter.
 [A-Z]            match any one UPPERCASE letter.
 [0-9]            match any one of the digits 0 through 9.
 [a-zA-Z0-9]      match any one letter or digit.
 [a-z]+           match a single word of any length consisting of lowercase letters.
 [A-Za-z]+        match a single word of any length consisting of uppercase and/or lowercase letters.
 [^a-zA-Z]        match any one character other than a letter (i.e., a space, digit, punctuation mark, or symbol).
 (aaa|bbb|ccc)    match either "aaa" or "bbb" or "ccc".
 ab(c|cd|cde)     match either "abc" or "abcd" or "abcde".
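
Several of the examples above can be checked with the sketch below. It assumes Python's re module and made-up sample phrases, so it illustrates the notation rather than the exact behaviour of the BLC concordancer.

```python
import re

# Quantifiers: a+ needs at least one "a"; a* also accepts none at all.
runs_of_a = re.findall(r"a+", "a aa aaa")
star_accepts_empty = re.fullmatch(r"a*", "") is not None

# Character classes (sample phrase is hypothetical, not BLC data).
phrase = "Tokyo 2000 office"
lowercase_runs = re.findall(r"[a-z]+", phrase)  # the capital "T" is not in [a-z]
words = re.findall(r"[A-Za-z]+", phrase)
digits = re.findall(r"[0-9]", phrase)

# Three dots with a space on each side: a word of exactly three characters.
three_char_words = re.findall(r" ... ", "go to the office")

# Alternation inside a group. fullmatch requires the whole string to match,
# so all three spellings are accepted and "abd" is rejected.
candidates = ["abc", "abcd", "abcde", "abd"]
variants = [s for s in candidates if re.fullmatch(r"ab(c|cd|cde)", s)]

print(runs_of_a)         # ['a', 'aa', 'aaa']
print(lowercase_runs)    # ['okyo', 'office']
print(words)             # ['Tokyo', 'office']
print(digits)            # ['2', '0', '0', '0']
print(three_char_words)  # [' the ']
print(variants)          # ['abc', 'abcd', 'abcde']
```

Note that [a-z]+ on its own picks up "okyo" from "Tokyo": a character class says nothing about where a word begins, which is why spaces (or the dot) are added around a pattern when whole words are wanted.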

Since RegEx symbols such as those listed in section 2 above (called "metacharacters") have special meanings, they must be "escaped" with a preceding backslash (\) whenever you want to match them literally. For instance, if you want to find any sequence of digits preceded by a plus sign, your search string should be:
    \+[0-9]+ 
which will match, for instance, +123, +5427, and so on (Note: There are no such instances in the BLC, however). Likewise, if you want to search for instances of any single word within round brackets, your search string would be:
    \([a-zA-Z]+\)
which will match, for instance, (s), (txt), (Japan), etc. Due to the particular data structure of the BLC, however, you need to add a full stop (the RegEx symbol ".", used here as a stand-in for a space) before and after every punctuation mark in cases like this. Thus, your RegEx search string to match instances like ( s ), ( txt ), ( Japan ), etc. should be:
    .\(.[a-zA-Z]+.\). 
or, if you want to allow two or more words within the brackets, you simply specify the same RegEx string without the closing bracket, as follows:
    .\(.[a-zA-Z]+
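
The escaped patterns above can be tried out as follows. This is a sketch using Python's re module, and the sample strings are invented; in particular, the "spaced" string only imitates the way the BLC is described as putting spaces around punctuation marks.

```python
import re

# An escaped plus sign followed by one or more digits.
plus_numbers = re.findall(r"\+[0-9]+", "call +123 or +5427, not 99")

# A single word inside literal round brackets.
bracketed = re.findall(r"\([a-zA-Z]+\)", "item(s) in (txt) for (Japan)")

# Hypothetical text with a space on each side of every bracket.
spaced = "see ( txt ) and ( Japan ) here"
bracketed_spaced = re.findall(r".\(.[a-zA-Z]+.\).", spaced)

# Without the closing bracket, the pattern only pins down the opening
# bracket and the first word, so any number of words may follow.
opening_only = re.findall(r".\(.[a-zA-Z]+", "note ( two words ) end")

print(plus_numbers)      # ['+123', '+5427']
print(bracketed)         # ['(s)', '(txt)', '(Japan)']
print(bracketed_spaced)  # [' ( txt ) ', ' ( Japan ) ']
print(opening_only)      # [' ( two']
```

Without the backslashes, "+" would be read as "1 or more times" and the round brackets as grouping, so the literal characters would not be matched.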
I've tested all these regular expressions, and in most cases they work fine and return the results as expected. If you didn't get what you want, make sure your regular expression is correct and try again, or simply forget it.




 

(c) 2000 Yasumasa Someya