So I just got back from Korea, and that's a bit of a story, but more importantly, my program is complete!
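;; Split ST on spaces and newlines and return the tokens in their original order.
;; Standard Common Lisp has no WHILE, so LOOP WHILE ... DO is used instead.
;; Consecutive separators produce empty tokens.
(defun tokenizer (st)
  (let ((x 0) (total nil))
    (loop while (< x (length st)) do
      (let ((y x))
        ;; Advance y to the next separator or to the end of the string.
        (loop while (and (< y (length st))
                         (not (char= (char st y) #\Newline))
                         (not (char= (char st y) #\Space)))
              do (setq y (+ y 1)))
        (setq total (cons (subseq st x y) total))
        (setq x (+ 1 y))))
    (reverse total)))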
;; Same as tokenizer, but leaves the tokens in reverse order (no final REVERSE).
(defun tokenizer-reverse (st)
  (let ((x 0) (total nil))
    (loop while (< x (length st)) do
      (let ((y x))
        (loop while (and (< y (length st))
                         (not (char= (char st y) #\Newline))
                         (not (char= (char st y) #\Space)))
              do (setq y (+ y 1)))
        (setq total (cons (subseq st x y) total))
        (setq x (+ 1 y))))
    total))
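;; Collect every word in MEGALIST that ends with STR.
;; EQUAL compares strings case-sensitively; STRING-EQUAL would ignore case.
(defun find-final (megalist str)
  (let ((lis nil))
    (dolist (word megalist)
      ;; Skip words shorter than STR so SUBSEQ never gets a negative start.
      (when (and (>= (length word) (length str))
                 (equal str (subseq word (- (length word) (length str)))))
        (setq lis (cons word lis))))
    lis))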
It's a pretty simple and inefficient program, but I like it.
At first it seemed incapable of recognizing words that are exactly the searched phrase, but oh well. Update: the problem is case, not the length of the word. The comparison is case-sensitive, so "Ku" with a capital "K" is not equal to "ku" with a lowercase "k". An easy fix, but oh well. The reverse version of the tokenizer function just leaves the list in reverse order, which is fine. Flipping a list this size takes another full pass and a second copy of the list, so it is not worth it and doesn't help anything in this case.
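For the curious, the "easy fix" for the case problem could look something like this. It is only a sketch: `ends-with-p` and `find-final-ci` are names made up for this example and are not part of the program above. It leans on the standard `string-equal`, which compares strings while ignoring case, and it skips any word shorter than the search string.

```lisp
;; Hypothetical helper (not in the original program): true when WORD ends
;; with SUFFIX, ignoring case.
(defun ends-with-p (word suffix)
  (let ((start (- (length word) (length suffix))))
    ;; A negative start means WORD is shorter than SUFFIX, so it cannot match.
    (and (>= start 0)
         (string-equal suffix word :start2 start))))

;; Hypothetical case-insensitive variant of find-final.
;; Unlike find-final, it returns the matches in their original order.
(defun find-final-ci (megalist str)
  (remove-if-not (lambda (word) (ends-with-p word str)) megalist))

(find-final-ci '("Ku" "haiku" "Baku" "back" "u") "ku")
;; => ("Ku" "haiku" "Baku")
```

Comparing in place with `:start2` avoids building a fresh substring for every word, which starts to matter with tens of thousands of entries.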
I pulled a list of words from:
http://www.nicklea.com/articles/wordlist.txt
Some 78,205 words
Update: This list contains words and their conjugates, e.g. "run", "runs", and "running" and "aardvark", "aardvarks", and "aardvark's"; so that may slant the results a bit, depending on how you view conjugates.
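If you want to repeat the experiment, the list has to get from a file into a Lisp list somehow. Here is one rough way to do it, assuming the file is plain text with one word per line (I have not checked that this holds for every line of the list). `read-word-list` and the local file name `wordlist.txt` are made up for this example.

```lisp
;; Hypothetical helper: read one word per line from STREAM into a list.
;; Assumes one word per line; blank lines are skipped and stray
;; whitespace (including a trailing carriage return) is trimmed.
(defun read-word-list (stream)
  (let ((words nil))
    (loop for line = (read-line stream nil nil)
          while line
          do (let ((word (string-trim '(#\Space #\Tab #\Return) line)))
               (unless (string= word "")
                 (push word words))))
    (nreverse words)))

;; Usage, assuming the list was saved locally as "wordlist.txt":
;; (with-open-file (in "wordlist.txt")
;;   (read-word-list in))

;; Small self-contained check that reads from a string instead of a file.
(with-input-from-string (in (format nil "aardvark~%aardvarks~%~%you~%"))
  (read-word-list in))
;; => ("aardvark" "aardvarks" "you")
```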
Of course you're all thinking "but why?" The Japanese language has some rather restrictive phonetics. So, when a Japanese student tries to spell an English word, they tend to add in a bunch of extra vowels. After seeing my students write "Baraku Obama" and "ketchupu" or other u-ending variants, I got to thinking - how many words in the English language end in "ku"? Or "pu"? Or even "u" at all? The Japanese tend to use "to" or "do" for t/d-terminating words respectively, so I tried that, too.
78,205 words aren't all the words in the English language, but it probably covers almost all of what my students will be using in their lives.
The results:
Of the 81 words that end in u,
"adieu", "Ainu", "Baku", "Bantu", "bateau", "bayou", "beau", "Bissau", "bureau", "caribou",
"chateau", "Chou", "coypu", "cpu", "du", "Fizeau", "flu", "Fontainebleau", "Frau", "Fujitsu",
"gnu", "guru", "haiku", "Hindu", "Honolulu", "Honshu", "impromptu", "juju", "Juneau",
"Katmandu", "Kikuyu", "kinkajou", "Kitakyushu", "kivu", "kombu", "Ku", "kudzu", "landau",
"lieu", "Lou", "lulu", "Malibu", "Marceau", "Maseru", "Mathieu", "menu", "milieu", "mu",
"Nassau", "Nehru", "nouveau", "nu", "Ouagadougou", "parvenu", "Peru", "plateau",
"portmanteau", "Rousseau", "Shu", "situ", "snafu", "sou", "tableau", "tabu", "tau", "Thimbu",
"Thoreau", "thou", "thru", "Thu", "tofu", "Trudeau", "Tsunematsu", "tutu", "u", "Urdu",
"Vishnu", "Wausau", "Wu", "you", and "Zulu"
3 end in "ku" - "Baku", "haiku", and "Ku"
Baku, it turns out, is a city in Azerbaijan, part of the old USSR
Haiku is cheating since that's obviously a loan word from Japanese
Ku doesn't show up in my dictionary and I have never used it, so it is not a particularly helpful word
But 670 end in "k"
2 end in "pu" - "coypu" and "cpu"
Coypus are large, South American, aquatic rodents
CPU is also cheating since it is an abbreviation
But 452 end in "p"
0 end in "gu"
But 6707 end in "g"
5 end in "ru" - "guru", "Maseru", "Nehru", "Peru", and "thru"
Guru is another loan word
Maseru and Peru are places, and Nehru is a surname (India's first prime minister)
Thru is just a lazy way of spelling a real word
But 5939 end in "r"
3 end in "shu" - "Honshu", "Kitakyushu", and "Shu"
The first two are places in or parts of Japan (Honshu is one of the four main islands and Kitakyushu is a city)
Shu is the Confucian version of the Golden Rule or the Egyptian god of air
But 279 end in "sh"
1 ends in "ju" - "juju"
A loan word from Africa, something I doubt my students will ever learn, let alone use
Though to be fair, only 3 words end in "j"
0 end in "chu"
But 238 end in "ch"
1 ends in "yu" (which also covers anything that might end in "kyu", "gyu", "nyu", "myu", etc.) - "Kikuyu"
A language and people from Kenya, probably not helpful to my students
Any word ending in y on its own would be transliterated to "i" or something probably, so I won't look up the comparison
1 ends in "hu" that doesn't end in "shu" - "Thu"
Thu does not even show up in the dictionary I am using, so I am guessing it is not too useful a word
But 912 end in "h"
2 end in "fu" (a more common way of transliterating "hu") - "tofu" and "snafu"
Tofu is another loan word
Snafu counts as a word I guess
But 173 words end in "f"
4 end in "bu" - "Thimbu", "tabu", "Malibu", and "kombu"
Thimbu and Malibu are places
Tabu is just a different (and less common) spelling of taboo
Kombu is another loan word from Japanese - a type of brown seaweed (also spelled konbu)
But 143 words end in "b"
2 end in either "tsu" or "su" (both "tsu") - "Tsunematsu" and "Fujitsu"
Both are presumably Japanese place names or surnames; even a Japanese teacher was not sure
But 24259 words end in "s", and of them, 1675 end in "ts"
4 end in "du" (which is rarely used in Japanese and 'do' would probably be used before 'du') - "Urdu", "Katmandu", "Hindu", and "du"
Urdu is a language, and Hindu is a religion. They aren't native words, but they are somewhat common.
Katmandu is a city
Du is found in some French names, or is an abbreviation of Duke or Dutch
But 8734 end in "d"
1 ends in "zu" or "dzu" - "kudzu"
Kudzu is a type of Japanese and Chinese vine, so another loan word
But 82 end in "z", and of them, 0 end in "dz" - so there's one time a u-terminating word beats out the consonant version!
1 ends in "mu" - "mu"
Mu is a Greek letter, used as the symbol for the coefficient of friction,
or apparently a lost continent that sank when Atlantis did
But 972 end in "m", though it would probably just be transliterated to "ん" anyway
If a word ends in "n", Japanese can represent it, like "m", as "ん", so I won't go through that list. The remaining words either cannot be directly transliterated or end in a vowel followed by "u", so you can look them up on your own if you care.
"to" and "do" are solutions to transliterating t/d-terminating English words into Japanese, and a lot more "to" and "do" words exist.
545 end in "o", of which 82 are "to" and 48 are "do", but 3542 end in "t" and 8734 end in "d".
Of the 670 words ending in "k", 287 end in "ck". Separately, 162 words end in "ke" and 1018 end in "c".
There are a bunch of other interesting phrases to look for, like "r" vs "l" or "y" vs "i", but this'll do for this post.
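All of the comparisons above follow the same pattern: count the words ending in a consonant plus "u", then set that against the words ending in the bare consonant. A minimal sketch of that bookkeeping follows. `count-ending` and `compare-endings` are hypothetical names, the matching is case-sensitive like `find-final`, and the tiny word list is only there for illustration.

```lisp
;; Hypothetical helper: how many words in WORDS end with SUFFIX
;; (case-sensitive, like find-final).
(defun count-ending (words suffix)
  (count-if (lambda (word)
              (let ((start (- (length word) (length suffix))))
                (and (>= start 0)
                     (string= suffix word :start2 start))))
            words))

;; Hypothetical helper: pair the count for CONSONANT + "u" with the count
;; for the bare CONSONANT, e.g. "ku" versus "k".
(defun compare-endings (words consonant)
  (list (count-ending words (concatenate 'string consonant "u"))
        (count-ending words consonant)))

;; Illustrative data only, not the real word list.
(compare-endings '("haiku" "Baku" "back" "book" "walk" "you") "k")
;; => (2 3)
```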
Update: I found myself curious about the distribution of final letters and have found these numbers:
Vowels - Of the 10,059 words that end in vowels,
1,181 are "a",
7,999 are "e",
253 are "i",
545 are "o",
and of course,
81 are "u".
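Counts like these can be tallied in a single pass with a hash table keyed on the last character of each word. This is a sketch rather than the code behind the numbers in this post: `final-letter-counts` is a made-up name, and folding upper and lower case together is an assumption on my part.

```lisp
;; Hypothetical helper: tally the final character of every word in WORDS.
;; Returns an alist of (character . count), most frequent first.
;; Assumption: case is folded, so "CPU" and "cpu" both count under #\u.
(defun final-letter-counts (words)
  (let ((table (make-hash-table)))
    (dolist (word words)
      (when (> (length word) 0)
        (incf (gethash (char-downcase (char word (1- (length word))))
                       table 0))))
    (let ((result nil))
      (maphash (lambda (ch n) (push (cons ch n) result)) table)
      (sort result #'> :key #'cdr))))

;; Illustrative data only, not the real word list.
(final-letter-counts '("runs" "aardvarks" "dogs" "CPU" "you" "walk"))
;; => ((#\s . 3) (#\u . 2) (#\k . 1))
```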
Here is the whole alphabet:
"a" - 1,181
"b" - 143
"c" - 1,018
"d" - 8,734
"e" - 7,999
"f" - 173
"g" - 6,707
"h" - 912
"i" - 253
"j" - 3
"k" - 670
"l" - 2,108
"m" - 973
"n" - 4,506
"o" - 545
"p" - 452
"q" - 4
"r" - 5,939
"s" - 24,259
"t" - 3,542
"u" - 81
"v" - 25
"w" - 238
"x" - 172
"y" - 7,258
"z" - 82
Because of the conjugates (e.g. "aardvarks" and "aardvark's"),
"s" comes way out on top with 24,259 words.
("g" probably has so many results because of these conjugates, too (e.g. "running").)
"j" bottoms out the list at only 3 words, but
"q" is a close follower with only 4 words.
"j", "q", and "v" are the only letters less likely to end a word than "u", with
"z" only 1 word more likely.
The point, of course, is that almost no words in English end in "u". Of the 81 that do, only "you" and "thou" appear in the list of the top 2,126 words used in American newspapers, and I had to look up almost all of them. So if you're a Japanese student of English and find yourself wondering whether a word ends in "u", I think you can assume it doesn't.