|Last updated: May 5, 2007|
|May 5, 07||I have added the following three collections of State of the Union addresses by U.S. Presidents:
18) State of the Union Address 1 (1934-2006): 428,075 word tokens
19) State of the Union Address 2 (1934-1969): 214,723 word tokens -- Roosevelt, Truman, Eisenhower, Kennedy, and Johnson
20) State of the Union Address 3 (1970-2006): 213,377 word tokens -- Nixon, Ford, Carter, Reagan, Bush, Clinton, and G.W. Bush
|Oct. 30, 04||I have added U.S. Code Collection Search via Google. The U.S. Code Collection is a comprehensive collection of laws passed by the U.S. Congress and is very useful for finding samples of legal expressions (a.k.a. legalese).|
|Oct. 7, 04||I have added Frequency Tables of meaningful 3-, 4- and 5-grams extracted from the BLC (Ver. 2000). The tables are provided as Excel spreadsheets and are sortable either by frequency or in alphabetical order. You may also download the tables for your research purposes.|
|Feb. 8, 04||A complete PDF version of my 1999 MA thesis, A Corpus-based Study of Lexical and Grammatical Features of Written Business English, is now available online. If you want to download the PDF files, send me an email.|
|Jan 21, 04||Przemek Kaszubski, coordinator of the PICLE Project (the Polish part of the International Corpus of Learner English), currently teaching at Adam Mickiewicz University in Poznań, Poland, has set up an excellent online concordancing site, Search the PICLE Corpus. The site allows us to use the PICLE Corpus, which contains a total of 330,000 running words taken from essays written by Polish advanced EFL students. It offers four different concordancers, including modified versions of my Online BLC Concordancer and Bigram PLUS. Good job, Przemek, and thank you!|
|Dec 10, 03||The Japanese version of the KWIC Concordancer mentioned below has been temporarily withdrawn for technical reasons. Please wait for further notice.|
|Dec 15, 02||Not much is going on lately, but I have added a Japanese version of the KWIC Concordancer. The system is still tentative and has some bugs (which I'm not sure I can fix), but it works anyway. Currently, there are two corpora: one, a complete collection of Natsume Soseki's works (approx. 5 MB: 2.5 million Kana-Kanji characters), and the other, all the articles contained in the Journal of the Association for Interpretation Studies (No. 2, 2002). I hope to add other corpora in the near future.|
|July 13||Bigram Plus updated. You can now use regular expressions as your search strings. My appreciation goes to Mr. Ohama for his help.|
|Jun 14||A new bigram search program (call it "Bigram Plus") has now been added. The program lets you retrieve any two-word combination you want to find from among the corpora that are currently available on the Concordancer site. It still has some problems, but it works anyway. The development of this program was inspired by the similar bigram search program posted on the Corpora List by Jens Enlund on 24 Apr 2001.
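The core of such a search is tiny; a toy stand-alone sketch might look like the following (the file name and variable names are mine, not from the actual CGI program):
# bigram_search.awk (hypothetical toy version of the Bigram Plus idea):
# print every corpus line in which word w1 is immediately followed by w2.
# Usage: jgawk -f bigram_search.awk -v w1=thank -v w2=you corpus.txt
{ for (i = 1; i < NF; i++) if ($i == w1 && $(i+1) == w2) { print; next } }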
|Mar 22||The system will now do "case-sensitive matching" by default. This means that the search string thank you, for instance, matches lower-case "thank you" only. If you want to search for all instances of "thank you", including the upper-case "Thank you," your search string should be: (Thank|thank) you or (T|t)hank you.|
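You can try the alternation trick offline in (j)gawk, which handles alternation groups the same way as the concordancer's search strings (this snippet is only an illustration):
# Case-sensitive matching: only the alternation form matches "Thank".
BEGIN {
    s = "Thank you for your letter."
    print (s ~ /thank you/)        # prints 0: no match
    print (s ~ /(T|t)hank you/)    # prints 1: matches either capitalization
}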
|On July 19 last year, I posted an AWK script for bigram counting. Someone sent me email asking how to modify the script so that he could count trigrams (three-word sequences). Here's a modified version of the original script that I sent him. You may also wish to use it.
# trigram.awk (Yasumasa Someya, March 17, 2001. Based on bigram.awk)
# Prints each three-word sequence on its own line, for sorting and
# counting with "sortf" (script body is a minimal reconstruction).
{ for (i = 1; i <= NF; i++) { if (++n >= 3) print word1 " " word2 " " $i; word1 = word2; word2 = $i } }
The AWK version I'm using is jgawk 2.11.1 + 3.0 (Japanized GNU AWK for MS-DOS). You also need the "sortf" program, which can be downloaded from: http://download.vector.co.jp/pack/dos/util/text/sort/srtf10at.lzh
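By the way, if your AWK has associative arrays (GAWK and jgawk both do), you can let AWK do the counting itself and skip the external "sortf" step altogether. A minimal sketch of that variant (the file name is mine):
# trigram_count.awk (hypothetical variant of trigram.awk): tally each
# trigram in an associative array instead of printing it for "sortf".
{ for (i = 1; i <= NF; i++) { if (++n >= 3) freq[word1 " " word2 " " $i]++; word1 = word2; word2 = $i } }
# Print "frequency trigram" pairs (in arbitrary order) at end of input.
END { for (t in freq) print freq[t], t }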
Note: If you are interested in bigram statistics, go to
|The following two new corpora have been added:
ID. No.: CPR46
ID. No.: HCK40
|A POS-tagged version of the PLC (Personal Letter Corpus) has been uploaded. Tagging was done with a revised version of the original Brill Tagger (ver. 1). The AWK scripts used to assist tagging were:
open_con.awk . . . opens contractions (a sketch of the idea follows below).
Send me an email if you want the above scripts.
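For those curious about what such a script does: the tagger expects contractions to be split into two tokens. A contraction-opening rule can be as simple as the following sketch (my reconstruction of the idea, not the original open_con.awk; the \> word-boundary operator is a GNU AWK extension):
# Split contractions and possessives into two tokens so the tagger can
# tag each part, e.g. "don't" -> "do n't", "owner's" -> "owner 's".
{
    gsub(/n't\>/, " n't")               # negative contractions
    gsub(/'(s|m|re|ve|ll|d)\>/, " &")   # 's 'm 're 've 'll 'd
    print
}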
|The following four new corpora have been added:
ID. No.: Carroll-99
ID. No.: Carroll-101
ID. No.: Twain-27
ID. No.: Twain-28
|XREFER (a search engine for Business, Law, Economics, and Financial terms) has been added to the top page. It is also useful as a reference for Linguistics terms and English grammar & usage.|
|Feb 2||The Concordancer top page has been redesigned again. This time, the instructions page was separated from the main page.|
|11-22||The Concordancer top page has been redesigned. A link to the Merriam-Webster Online Dictionary has been added to the concordance output page, so that you can look up words without going back to the previous page.|
|8-14||The DDW (Data-driven Writing) Project is now completed. If you are interested in reading a preliminary report, click here.|
|7-19||Here's an AWK version of the bigram count script (see my note of May 18).
# bigram.awk (Yasumasa Someya, July 10, 2000)
# Prints each two-word sequence on its own line; sort and count the
# output with "sortf" (script body is a minimal reconstruction).
# "sortf" can be downloaded from: http://download.vector.co.jp/pack/dos/util/text/sort/srtf10at.lzh
{ for (i = 1; i <= NF; i++) { if (++n >= 2) print word1 " " $i; word1 = $i } }
|6-23||The URL of the current Concordancer page has been changed to: http://isweb9.infoseek.co.jp/school/ysomeya|
|1. All the excess spaces before and/or after punctuation marks and symbols (needed for POS tagging) have now been deleted, except for the spaces before/after < angle brackets > and " quotation marks ", which are kept for technical reasons.
2. All the contractions and possessive nouns that had been separated into two units (ditto) in the plain-text version of the BLC have now been restored to their usual forms. This means that you can now search for such instances as I'm, we'd, you'll, don't, can't, owner's, etc. directly by typing these strings as they are. (They remain separated in the POS-tagged version of the BLC, e.g. I_PRP 'm_BE, you_PRP 'll_MD, do_DO n't_NEG, owner_NN 's_POS, etc.)
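The restoration itself is mechanical; something along these lines would do it (a sketch with a hypothetical file name, not the exact program used; gensub() requires GAWK 3.0 or later):
# close_con.awk (hypothetical): rejoin contractions that were split for
# tagging, e.g. "do n't" -> "don't", "owner 's" -> "owner's". Quotation
# marks stay spaced (see point 1 above), so they are excluded here.
{
    gsub(/ n't\>/, "n't")
    $0 = gensub(/ '(s|m|re|ve|ll|d)\>/, "'\\1", "g")
    print
}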
|6-16||Five new corpora containing personal and professional letters by Thomas Jefferson, George Henry Borrow, General Robert E. Lee, Charles Darwin, and Robert Louis Stevenson have been added. Click here for more details.|
|5-19||There were some errors (not really errors, but...) in the "Regular Expressions for Beginners". Everything should be OK now.|
|5-18||Another request, and here's my response.
> Suppose you've got the following concordance lines (306 lines in total) in response to the search string "[a-z]+_VB [a-z]+_VBG", which matches any sequence of a base-form verb (VB) immediately followed by another verb in the -ing form (VBG).
1 been_BEN able_JJ to_TO absorb_VB rising_VBG costs_NNS by_IN economies_NNS
Suppose also that what you want to do from here is to COUNT the number of occurrences of the respective "VB+VBG" sequences, and, of course, you don't want to do it the old way when there are so many to count.
Here's a Perl script that will do the counting for you. Note that you should first delete all the POS tags from your text (use "del_tag.awk" for this purpose -- see my note of May 2 below) before feeding it to the following Perl script; things will be much easier that way. Also, this is a general-purpose script that counts the numbers of ANY and ALL two-word combinations, not just the "VB+VBG" sequences.
# bigram.pl (Yasumasa Someya, May 10, 2000)
# Prints "frequency bigram" pairs, one per line (minimal reconstruction).
while (<>) { for $w (split) { $n{"$p $w"}++ if defined $p; $p = $w; } }
print "$n{$_} $_\n" for sort { $n{$b} <=> $n{$a} } keys %n;
The script will return a simple bigram frequency list like the following (only the first 10 lines are quoted):
24 appreciate receiving
From this list, you pick out only what you want (Need another script? Well, what I recommend for further processing of the data, like sorting and counting, is to import the table into MS Excel. Things will be much easier that way), and you get a new list like the following (only the first 18 lines are quoted, in descending order of frequency):
24 appreciate receiving
The following is a trigram version of the same script -- in case you want to pick up any three-word combinations:
# trigram.pl (Yasumasa Someya, May 10, 2000)
# Same as bigram.pl, but counts three-word sequences (minimal reconstruction).
while (<>) { for $w (split) { $n{"$p1 $p2 $w"}++ if defined $p1; ($p1, $p2) = ($p2, $w); } }
print "$n{$_} $_\n" for sort { $n{$b} <=> $n{$a} } keys %n;
|5-17||Although I still haven't received any further warning/action from the Webmaster (see note of March 14), I plan to move my entire website including the Concordancer page to a new site operated on a commercial basis, hoping to improve accessibility and connection speed. Watch this page for further notice.|
|5-17||Someone mailed me several days ago saying that the regular expression page referred to (and linked) in the BLC Concordancer top page was a bit too cumbersome, if not too difficult, to read through. Well, I agree. So I have added a new page entitled "Regular Expressions for Beginners" (both in English and Japanese). Hope this helps.|
|5-01||Another modification to the design of the BLC Concordancer top page. Some of the errors (and inconsistencies) in the POS-tagged BLC have been corrected.|
|5-02||One of the users of the Online BLC Concordancer has asked me if there's any way to delete all the TAG data automatically from the output. Well, it's not that difficult if all you want is to delete the POS tags. Here's an AWK script I wrote for that purpose. (The version of AWK I used is JGAWK 2.11.1 + 3.0 [= Japanized GNU AWK for MS-DOS], but the script should also work with GAWK.)
# del_tag.awk (formerly "prn_txt.awk"; Yasumasa Someya, May 2, 2000)
# Strips the "_TAG" suffix from every token, e.g. "it_PRP" -> "it"
# (script body is a minimal reconstruction).
{ for (i = 1; i <= NF; i++) sub(/_[^_ \t]+$/, "", $i); print }
(Sample Input = Concordance Output from the POS-tagged BLC)
2 my_PRP$ correspondence_NN ,_, it_PRP is_BE clear_JJ that_IN you_PRP were_BE
(Sample Output)
2 my correspondence , it is clear that you were
Note: The original KWIC format will not be maintained in the output, as shown in the above example.
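If you do need to keep the KWIC columns roughly aligned, one option is to pad each de-tagged word back to the width of its original tagged token. A sketch (hypothetical file name; the "*" width specifier works in GAWK's printf):
# del_tag_kwic.awk (hypothetical variant): strip the tags but pad each
# word to the width of its original token, keeping the columns aligned.
{
    for (i = 1; i <= NF; i++) {
        w = $i; sub(/_[^_ \t]+$/, "", w)
        printf "%-*s ", length($i), w
    }
    print ""
}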
|4-01||I've made some minor modifications to the design of the BLC Concordancer top page to make it look a bit better. Also, I have added a complete list of the part-of-speech tags used in the current POS-tagged BLC.|
|4-01||A Personal Letter Corpus has been added to the Online BLC Concordancer. The corpus contains a total of 141,608 word tokens of American English. The total number of letters contained in the PLC is 1,037.|
|4-01||Someone sent me email asking if he could use my CGI Perl script. My answer is yes. If you also wish to install your own KWIC Concordancer onto your Website and want to use my CGI Perl script for that purpose, please feel free to contact me. (Sorry, but no technical assistance will be provided. Also, this is for educational and/or research purposes only).|
|The top page of the Concordancer has become all-English as of today, in response to many requests from (and some complaints by) international users. This means that the character set used in the BLC KWIC Concordancer is now defined internally as "charset=ISO-8859-1". Yen marks, therefore, will all be replaced by the backslash (\) when the search result is displayed on your terminal. For Japanese users, this also means that the Yen-mark regular expression escape code is no longer valid (you have to use the backslash instead). Any two-byte characters and symbols contained in the original corpus (such as the pound mark) will not be shown properly.|
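The reason for the Yen/backslash swap is that both glyphs share the same byte, 0x5C: ISO-8859-1 terminals draw it as a backslash, while JIS-based Japanese terminals draw it as the Yen sign. You can check which one your terminal shows with a one-liner (illustration only):
# Prints byte 0x5C: "\" on an ISO-8859-1 terminal, Yen sign on JIS.
BEGIN { printf "%c\n", 92 }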
|3-21||So far, there's no further warning/action from the Webmaster and the Concordancer is working fine.|
|3-14 Y2K||Access to this Concordancer may soon be suspended (forbidden), because, according to the Webmaster of freeweb.co.jp, the CGI script used for the Concordancer requires more Web resources than is normally allowed for the FREE Web pages they provide. Such being the case, I may have to close this page unless I find other free Web space in the very near future. Any suggestions?|