Last updated:  May 5, 2007


I M P O R T A N T    N O T I C E
Date              Remarks
May 5, 07 I have added the following three collections of State of the Union speeches by U.S. Presidents:
18)  State of the Union Address 1 (1934-2006): 428,075 word tokens
19)  State of the Union Address 2 (1934-1969): 214,723 word tokens -- Roosevelt, Truman, Eisenhower, Kennedy, and Johnson
20)  State of the Union Address 3 (1970-2006): 213,377 word tokens -- Nixon, Ford, Carter, Reagan, Bush, Clinton, and G. W. Bush
Oct. 30, 04 I have added a U.S. Code Collection search via Google. The U.S. Code Collection is a comprehensive collection of laws passed by the U.S. Congress and is very useful for finding samples of legal expressions (a.k.a. legalese).
Oct. 7, 04 I have added frequency tables of meaningful 3-, 4-, and 5-grams extracted from the BLC (Ver. 2000). The tables are provided as Excel spreadsheets and can be sorted either by frequency or in alphabetical order. You may also download the tables for your own research purposes.
Feb. 8, 04 A complete PDF version of my 1999 MA thesis, A Corpus-based Study of Lexical and Grammatical Features of Written Business English, is now available online. If you want to download the PDF files, send me an email.
Jan 21, 04 Przemek Kaszubski, coordinator of the PICLE Project (the Polish part of the International Corpus of Learner English), currently teaching at Adam Mickiewicz University in Poznań, Poland, has set up an excellent online concordancing site, Search the PICLE Corpus. The site gives access to the PICLE Corpus, which contains a total of 330,000 running words taken from essays written by Polish advanced EFL students, and offers four different concordancers, including modified versions of my Online BLC Concordancer and Bigram PLUS. Good job, Przemek, and thank you!
Dec 10, 03  The Japanese version of the KWIC Concordancer mentioned below has been temporarily withdrawn for technical reasons. Please wait for further notice.
Dec 15, 02 Not much is going on lately, but I have added a Japanese version of the KWIC Concordancer. The system is still tentative and has some bugs (which I'm not sure I can fix), but it works anyway. Currently, there are two corpora: one, a complete collection of Natsume Soseki (approx. 5 MB: 2.5 million Kana-Kanji characters), and the other, all the articles contained in the Journal of the Association for Interpretation Studies (No. 2, 2002). I hope to add other corpora in the near future.
July 13 Bigram Plus has been updated. You can now use regular expressions as your search strings. My appreciation goes to Mr. Ohama for his help.
Jun 14 A new bigram search program (I call it "Bigram Plus") has now been added. The program lets you retrieve any two-word combination you want from among the corpora that are currently available on the Concordancer site. It still has some problems, but it works anyway. The development of this program was inspired by a similar bigram search program posted on the Corpora List by Jens Enlund on 24 Apr 2001.
Mar 22 The system now does "case-sensitive matching" by default. This means that the search string thank you, for instance, matches the lower-case "thank you" only. If you want to search for all instances of "thank you," including the capitalized "Thank you," your search string should be: (Thank|thank) you or (T|t)hank you
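
If you would like to test such patterns offline first, here is a tiny standalone Perl illustration (my own test snippet, not part of the Concordancer code):

# case_test.pl (a standalone illustration of case-sensitive matching)
# usage: perl case_test.pl INFILE

while (<>) {
    print "lower case only: $_" if /thank you/;       # case-sensitive: misses "Thank you"
    print "either case:     $_" if /(T|t)hank you/;   # the alternation trick shown above
}
# end of script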
Mar 17, 2001 (Rev. Mar 18)
On July 19 last year, I posted an AWK script for bigram counting. Someone sent me an email asking how to modify the script so that he could count trigrams (three-word sequences). Here's the modified version of the original script that I sent him. You may also wish to use it.

# trigram.awk (Yasumasa Someya, March 17, 2001. Based on bigram.awk)
# function: count the numbers of trigrams (any three-word combinations) in a text, and
# print out the result in order of decreasing freq.
# usage: jgawk -f trigram.awk INFILE > OUTFILE

{
# $0 = tolower($0)                                # [option] change to lowercase
   gsub(/[.,:;!?"<>\[\]#(){}]/,"")                # [option] delete .,:;!?"<>[]#(){}
   for(i=1; i<=NF; i++){
      trigram = word1 " " word2 " " $i            # get trigrams
      word1 = word2                               # keep track of the previous two words
      word2 = $i                                  # ditto
      count[trigram]++                            # count trigram frequencies
   }
   printf "\rCounting trigrams ... %6d lines done.", NR > "CON"
}
END {for (w in count)
   print count[w], "\t" w | "sortf -rn"           # sorted in descending order (needs "sortf")
}

=======
By replacing the four lines inside the "for( )" loop with the following five lines, you can change it to a "4-gram" count script. Note that the larger the n, the more memory the script requires.

      fourgram = word1 " " word2 " " word3 " " $i
      word1 = word2
      word2 = word3
      word3 = $i
      count[fourgram]++

The AWK version I'm using is jgawk 2.11.1 + 3.0 (Japanized GNU AWK for MS-DOS). You also need the "sortf" program, which can be downloaded from: http://download.vector.co.jp/pack/dos/util/text/sort/srtf10at.lzh
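
If you would rather not edit the script for every new n, the same sliding-window idea can be written once in Perl. The sketch below (ngram.pl is just my working name for it, not one of the scripts above) takes n as its first argument and uses Perl's built-in sort, so "sortf" is not needed:

# ngram.pl (a generalized sketch of bigram.awk/trigram.awk; n is a command-line argument)
# function: count the numbers of n-grams in a text for any n, and
# print out the result in order of decreasing freq.
# usage: perl ngram.pl N INFILE > OUTFILE

$n = shift;                                       # n-gram size (2 = bigrams, 3 = trigrams, ...)
while (<>) {
    # $_ = lc($_);                                # [option] change to lowercase
    tr/.,:;!?"<>[]#(){}//d;                       # [option] delete punctuation marks and symbols
    foreach $word (split) {                       # split line into words
        push @window, $word;                      # slide an n-word window over the text
        shift @window if @window > $n;            # keep only the last n words
        $count{"@window"}++ if @window == $n;     # count complete n-grams only
    }
}
foreach $gram (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$count{$gram}\t $gram\n";              # print "frequency \t n-gram \n"
}
# end of script

Note that, unlike the AWK scripts, this sketch ignores the incomplete word sequences at the very beginning of the input, so the counts for the first few items may differ slightly.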

=======
The following are the ten most frequent trigrams and 4-grams in Alice's Adventures in Wonderland and the Personal Letter Corpus (PLC2000).
 
Alice's Adventures in Wonderland (trigrams):
48  the Mock Turtle
29  said the King
28  the March Hare
20  the White Rabbit
20  said the Hatter
19  said to herself
19  said the Mock
18  said the Caterpillar
17  she said to
17  said the Gryphon

Alice's Adventures in Wonderland (4-grams):
19  said the Mock Turtle
16  she said to herself
11  a minute or two
  8  said the March Hare
  7  said Alice in a
  6  in a tone of
  6  as well as she
  6  in a great hurry
  6  well as she could
  6  you won't you will

Personal Letter Corpus (trigrams):
133  Thank you for
  58  Would you please
  55  look forward to
  52  I'd like to
  51  let me know
  49  Very truly yours
  46  I want to
  44  a copy of
  39  I hope you
  35  be able to

Personal Letter Corpus (4-grams):
64  Thank you for your
30  I look forward to
24  I would like to
20  Please let me know
18  look forward to hearing
18  to hear from you
17  forward to hearing from
17  to hearing from you
17  to let you know
17  Would you please send

Note: If you are interested in bigram statistics, go to Ted Pedersen's website.
 

Mar 14, 2001
The following two new corpora have been added:

ID. No.: CPR46
Title: It's a Wonderful Life (Columbia Pictures, 1946)
Directed by: Frank Capra
Screenplay by: Frank Capra
Source: Screenplay Public Domain Database
No. of Words: 17,066

ID. No.: HCK40
Title: REBECCA (Selznick International Pictures, 1940)
Directed by: Alfred Hitchcock
Screenplay by: Robert E. Sherwood
Source: Screenplay Public Domain Database
No. of Words: 16,062

Mar 13, 2001
A POS-tagged version of the PLC (Personal Letter Corpus) has been uploaded. Tagging was done with a revised version of the original Brill Tagger (ver. 1). The AWK scripts used to assist tagging were:

open_con.awk . . . opens contractions.
ad_sp.awk . . . adds a space after/before punctuation marks.
if#.awk . . . deletes POS tags in comment lines starting with the # mark.
ltr_nbr.awk . . . adds an ID number to each letter in the corpus (a modified version of txt_id.awk).

Send me an email if you want the above scripts.
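
If you cannot run the AWK originals, here is a rough Perl sketch of what ad_sp.awk does, reconstructed from nothing more than the one-line description above (so the real script almost certainly differs in its details):

# ad_sp.pl (my reconstruction of ad_sp.awk from its description; details are guesses)
# function: put a space before/after punctuation marks so that a tagger
# sees them as separate tokens.
# usage: perl ad_sp.pl INFILE > OUTFILE

while (<>) {
    s/([.,:;!?"()])/ $1 /g;       # detach punctuation from the adjoining words
    s/ {2,}/ /g;                  # squeeze the double spaces this creates
    print;
}
# end of script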

Mar 11, 2001
The following four new corpora have been added:

ID. No.: Carroll-99
Title: Alice's Adventures in Wonderland (1865) 
Author: Lewis Carroll
No. of words: 26,949

ID. No.: Carroll-101
Title: Through the Looking Glass and What Alice Found There  (1872)
Author: Lewis Carroll
No. of words: 29,888

ID. No.: Twain-27
Title: The Adventures of Tom Sawyer  (1876) 
Author: Mark Twain
No. of words: 65,942

ID. No.: Twain-28
Title: The Adventures of Huckleberry Finn (1884) 
Author: Mark Twain
No. of words: 110,865

Mar 5, 2001
XREFER (a search engine for business, law, economics, and financial terms) has been added to the top page. It is also useful as a reference for linguistics terms and English grammar and usage.
Feb 2, 2001
The Concordancer top page has been redesigned again. This time, the instructions page was separated from the main page.
11-22 The Concordancer top page has been redesigned. A link to the Merriam-Webster Online Dictionary has been added to the concordance output page, so that you can look up words without going back to the previous page.
8-14 The DDW (Data-driven Writing) Project is now complete. If you are interested in reading a preliminary report, click here.
7-19 Here's an AWK version of the bigram count script (see "bigram.pl" in my note of May 18).

# bigram.awk (Yasumasa Someya, July 10, 2000)
# function: count the numbers of bigrams (i.e. any two-word combinations) in a text, and
# print out the result in order of decreasing freq.
# usage: jgawk -f bigram.awk INFILE > OUTFILE

{
# $0 = tolower($0)                               # [option] change to lowercase
  gsub(/[.,:;!?"<>\[\]#(){}]/,"")                # [option] delete .,:;!?"<>[]#(){}
  for(i=1; i<=NF; i++){
      bigram = word " " $i                       # get bigrams
      word = $i                                  # keep track of the previous word
      count[bigram]++                            # count bigram frequencies
  }
  printf "\rCounting bigrams ... %6d lines done.", NR > "CON"
}
END { for (w in count)
      print count[w], "\t" w | "sortf -rn"       # sorted in descending order (=> needs "sortf")
}

# "sortf" can be downloaded from: http://download.vector.co.jp/pack/dos/util/text/sort/srtf10at.lzh
# The AWK version I'm using is jgawk 2.11.1 + 3.0 (Japanized GNU AWK for MS-DOS).
 

6-23 The URL of the current Concordancer page has been changed to: http://isweb9.infoseek.co.jp/school/ysomeya
6-18 (Rev. 6-24)
1. All the excess space before and/or after punctuation marks and symbols (needed for POS tagging) has now been deleted, except for the space before/after < angle brackets > and " quotation marks ", which remains for technical reasons.
2. All the contractions and possessive nouns that had been separated into two units (ditto) in the plain-text version of the BLC have now been restored to their usual forms. This means that you can now search for such instances as I'm, we'd, you'll, don't, can't, owner's, etc. directly, by typing these strings as they are. (They remain separated in the POS-tagged version of the BLC, e.g. I_PRP 'm_BE, you_PRP 'll_MD, do_DO n't_NEG, owner_NN 's_POS, etc.)
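
Incidentally, for those wondering how such a restoration can be done, here is a minimal sketch of the idea in Perl (an illustration written for this note, not the script actually used on the BLC):

# join_con.pl (an illustration only -- not the script actually used on the BLC)
# function: re-attach contractions and possessives that had been split off
# as separate tokens (e.g. "I 'm" -> "I'm", "owner 's" -> "owner's").
# usage: perl join_con.pl INFILE > OUTFILE

while (<>) {
    s/ ('m|'d|'ll|'re|'ve|'s|n't)\b/$1/g;    # glue the clitic back onto the previous word
    print;
}
# end of script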
6-16 Five new corpora containing personal and professional letters by Thomas Jefferson, George Henry Borrow, General Robert E. Lee, Charles Darwin, and Robert Louis Stevenson have been added. Click here for more details.
5-19 There were some errors (not really errors, but...) on the "Regular Expressions for Beginners" page. Everything should be OK now.
5-18 Another request, and here's my response. 

> Suppose you've got the following concordance lines (306 lines in total) in response to the search string "[a-z]+_VB [a-z]+_VBG", which matches any sequence of a base-form verb (tagged VB) immediately followed by another verb in the -ing form (tagged VBG).

    1  been_BEN able_JJ to_TO absorb_VB rising_VBG costs_NNS by_IN economies_NNS 
  6  N plan._NN I_PRP 'd_MD appreciate_VB knowing_VBG a_ART little_JJ more_JJR
  7         We_PRP would_MD appreciate_VB knowing_VBG as_RB soon_RB as_IN poss
  8  N ,_, we_PRP should_MD appreciate_VB knowing_VBG it_PRP ,_, but_CC if_IF 
  9  NN ,_, and_CC shall_MD appreciate_VB receiving_VBG ,_, at_IN your_PRP$ ea
 ... (omitted)
 41                         Avoid_VB making_VBG personal_NN calls_VBZ unless_I
 42  B that_IN I_PRP may_MD avoid_VB taking_VBG steps_NNS which_WDT neither_RB
 43                   To_TO avoid_VB wasting_VBG your_PRP$ time_NN ,_, may_MD 
 44  ro_NNP Dan_NNP will_MD become_VB acting_VBG manager_NN of_IN the_ART West
 45  NP 1_CD we_PRP will_MD begin_VB assigning_VBG the_ART booth_NN spaces_NNS
 46  d_VBD that_WDT you_PRP begin_VB attending_VBG AA_NNP meetings_NNS each_DT
 47   will_MD have_HV to_TO begin_VB charging_VBG a_ART fee_NN for_IN this_DT 
 ... (omitted)
 91  N ,_, would_MD you_PRP consider_VB making_VBG a_ART donation_NN so_IN we_
 92  pe_VBP you_PRP will_MD consider_VB making_VBG another_DT order_NN with_IN
 93  RP$ family_NN would_MD consider_VB moving_VBG to_TO <_< new_JJ city_NN >_
 94  pe_VBP you_PRP will_MD consider_VB ordering_VBG the_ART 184CZ_CD ,_, whic
...  (omitted)
165  ion._NN you_PRP 'll_MD enjoy_VB working_VBG with_IN him_PRP ._.
166  hat_IN you_PRP will_MD enjoy_VB working_VBG with_IN them_PRP ._.
167  We_PRP shall_MD all_DT enjoy_VB working_VBG with_IN you_PRP ,_, I_PRP 'm_
...  (omitted)
305  vn't_NEG have_HV to_TO work_VB delivering_VBG pizzas_NNS ._.
306  PRP$ dollars_NNS to_TO work_VB earning_VBG the_ART financial_JJ independe

Suppose also that what you want to do from here is to COUNT the number of occurrences of the respective "VB+VBG" sequences, and, of course, you don't want to do it the old way when there are so many to count.

Here's a Perl script that will do the counting for you. Note that you should first delete all the POS tags from your text (use "del_tag.awk" for this purpose -- see my note of May 2 below) before feeding it to the following Perl script; things will be much easier that way. Also note that this is a general-purpose script: it counts the numbers of ANY and ALL two-word combinations, not just the "VB+VBG" sequences. (A tag-aware alternative that skips the tag-deletion step is sketched after the trigram script at the end of this note.)

# bigram.pl (Yasumasa Someya, May 10, 2000)
# function: count the numbers of bigrams (i.e. any two-word combinations) in a text, and 
# print out the result in order of decreasing freq.
# usage: perl bigram.pl INFILE > OUTFILE

while(<>) {
    chop;
    tr/A-Z/a-z/;                                     # [option] change to lower case
    tr/.,:;!?"<>[]#(){}//d;                        # [option] delete punctuation marks and symbols
    foreach $word1 (split) {                 # split line into words
        $bigram = "$word2 $word1";    # get bigrams 
        $word2 = $word1;                     # keep track of the previous word
        $count{$bigram}++;                  # count bigram frequencies
    }
}
foreach $bigram (sort numerically keys %count) {     # sort bigrams numerically
   print "$count{$bigram}\t $bigram\n";                     # print "frequency \t trigram \n"
}
sub numerically {                               # compare two words numerically
   $count{$b} <=> $count{$a};            # decreasing order
}
# end of script

The script will return a simple bigram frequency list like the following (only the first 10 lines are quoted):

24   appreciate receiving
16   we would
16   to continue
14   we will
14   would appreciate
14   you will
13   to begin
13   would you
10   working with
10   to consider

From this list, you pick out only what you want (need another script? Well, what I recommend for further processing of the data, such as sorting and counting, is to import the table into MS Excel; things will be much easier that way), and you get a new list like the following (only the first 18 lines are quoted, in descending order of frequency):

24   appreciate receiving
9    enjoy working
4    start making
4    continue serving
3    enjoy serving
3    start taking
3    start working
3    begin using
3    begin operating
3    appreciate knowing
3    stop envying
2    consider working
2    continue providing
2    consider recommending
2    stop talking
2    enjoy dealing
2    consider centralizing
2    consider raising

The following is a trigram version of the same script -- in case you want to pick up any three-word combinations:

# trigram.pl (Yasumasa Someya, May 10, 2000)
# function: count the numbers of trigrams (any three-word combinations) in a text, and 
# print out the result in order of decreasing freq.
# usage: perl trigram.pl INFILE > OUTFILE

while(<>) {
    chop;
    tr/A-Z/a-z/;                                                     # [option] change to lower case
    tr/.,:;!?"<>[]#(){}//d;                                         # [option] delete punctuation marks and symbols
    foreach $word1 (split) {                                  # split line into words
        $trigram = "$word3 $word2 $word1";        # get trigrams
        $word3 = $word2;                                      # keep track of the previous two words
        $word2 = $word1;                                      # ditto
        $count{$trigram}++;                                   # count trigram frequencies
    }
}
foreach $trigram (sort numerically keys %count) {     # sort trigrams numerically
   print "$count{$trigram}\t $trigram\n";                     # print "frequency \t trigram \n"
}
sub numerically {                                              # compare words numerically
   $count{$b} <=> $count{$a};                           # decreasing order
}
# end of script
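
As promised above, here is the tag-aware alternative: if all you want is the "VB+VBG" pairs, you can also count them in one pass directly from the tagged text, without deleting the tags first. This is a sketch written for this note (not one of the scripts above), and it sorts with Perl itself:

# vb_vbg.pl (a one-pass sketch for this particular task; not a general-purpose script)
# function: find every word tagged _VB that is immediately followed by a word
# tagged _VBG, strip the tags, and count the pairs.
# usage: perl vb_vbg.pl INFILE > OUTFILE

while (<>) {
    while (/([A-Za-z]+)_VB ([A-Za-z]+)_VBG/g) {   # the same pattern as the search string
        $count{lc("$1 $2")}++;                    # fold case so "Avoid" and "avoid" count together
    }
}
foreach $pair (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$count{$pair}\t $pair\n";              # print "frequency \t pair \n"
}
# end of script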

5-17 Although I still haven't received any further warning/action from the Webmaster (see the note of March 14), I plan to move my entire website, including the Concordancer page, to a new site operated on a commercial basis, hoping to improve accessibility and connection speed. Watch this page for further notice.
5-17 Someone emailed me several days ago saying that the regular expression page referred to (and linked to) on the BLC Concordancer top page was a bit too cumbersome, if not too difficult, to read through. Well, I agree. So I have added a new page entitled "Regular Expressions for Beginners" (in both English and Japanese). I hope this helps.
5-01 Another modification has been made to the design of the BLC Concordancer top page. Some of the errors (and inconsistencies) in the POS-tagged BLC have been corrected.
5-02 (Rev. 5-17)
One of the users of the Online BLC Concordancer has asked me if there's any way to delete all the TAG data automatically from the output. Well, it's not that difficult if all you want to do is delete the POS tags. Here's an AWK script I wrote for that purpose. (The AWK version used is JGAWK 2.11.1 + 3.0 [= Japanized GNU AWK for MS-DOS], but the script should also work with GAWK.)

# del_tag.awk (formerly "prn_txt.awk"; Yasumasa Someya, May 2, 2000)
# function: delete POS tags from a text tagged in the "word_TAG" format.
# usage: jgawk -f del_tag.awk INFILE > OUTFILE
{
  gsub("_", " _")           # insert space before the underbar to separate WORD from TAG
  gsub(/ _[A-Z]+/,"")     # delete all the tags heahed by the underbar
  gsub(/( _,| _:| _>| _<| _\.| _\)| _\(| _;| _\?)/,"")    # delete all the symbol tags (e.g. ,_, :_:  >_>, etc.)
  gsub(/ _/,"")               # delete stranded underbar
  gsub(/\$/,"")               # delete stranded $ mark (as in "NNP$")
  printf "\rExtracting text data. Please wait... %7d",NR > "CON" 
  print
}
# end of script

(Sample Input = Concordance Output from BLC POS-tagged Corpus) --------------------------------------------

2   my_PRP$ correspondence_NN ,_, it_PRP is_BE clear_JJ that_IN you_PRP were_BE 
3                                 It_PRP is_BE critical_JJ that_IN you_PRP do_DO 
4   G a_ART moderate_JJ income_NN it_PRP is_BE doubtful_JJ that_IN I_PRP will_MD 
5                                 It_PRP is_BE essential_JJ that_IN a_ART certai
6                                 It_PRP is_BE essential_JJ that_IN we_PRP act_VBP

(Sample Output) -----------------------------------------------------------------------------------------------------------------

2   my correspondence , it is clear that you were 
3                                 It is critical that you do 
4   G a moderate income it is doubtful that I will 
5                                 It is essential that a certai
6                                 It is essential that we act
----------------------------------------------------------------------------------------------------------------------------------------

Note: The original KWIC format will not be maintained in the output, as shown in the above example.
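
If you have Perl but no JGAWK, a rough Perl equivalent of the script above (my sketch; it skips the progress counter but should produce the same text, with the same caveat about the KWIC format) is:

# del_tag.pl (a Perl sketch of del_tag.awk above)
# function: delete POS tags from a text tagged in the "word_TAG" format.
# usage: perl del_tag.pl INFILE > OUTFILE

while (<>) {
    s/_[A-Z]+\$?//g;                       # word tags: my_PRP$ -> my, is_BE -> is
    s/([.,:;!?()<>])_[.,:;!?()<>]/$1/g;    # symbol tags: ,_, -> ,  and <_< -> <
    print;
}
# end of script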
 

4-01 I've made some minor modifications to the design of the BLC Concordancer top page to make it look a bit better. I have also added a complete list of the part-of-speech tags used in the current POS-tagged BLC.
4-01 A Personal Letter Corpus has been added to the Online BLC Concordancer. The corpus contains a total of 141,608 word tokens of American English. The total number of letters contained in the PLC is 1,037.
4-01 Someone sent me an email asking if he could use my CGI Perl script. My answer is yes. If you also wish to install your own KWIC Concordancer on your website and want to use my CGI Perl script for that purpose, please feel free to contact me. (Sorry, but no technical assistance will be provided. Also, this is for educational and/or research purposes only.)
4-01 (Rev. 4-14)
The top page of the Concordancer has become all-English as of today, at the many requests of (and some complaints from) international users. This means that the character set used by the BLC KWIC Concordancer is now defined internally as "charset=ISO-8859-1". The Yen marks, therefore, will all be replaced by backslashes (\) when the search results are displayed on your terminal. For Japanese users, this also means that the Yen-mark regular-expression escape code is no longer valid (you have to use the backslash instead). Any two-byte characters and symbols contained in the original corpus (such as the pound mark) will not be shown properly.
3-21 So far, there has been no further warning or action from the Webmaster, and the Concordancer is working fine.
3-14, 2000 Access to this Concordancer may soon be suspended (forbidden) because, according to the Webmaster of freeweb.co.jp, the CGI script used for the Concordancer requires more Web resources than is normally allowed for the FREE Web pages they provide. Such being the case, I may have to close this page unless I find other free Web space in the very near future. Any suggestions?