Tense Prediction for English speech

Background
Past and present tense sound alike for most regular verbs in English, which makes the two forms difficult to distinguish in speech recognition tasks. The surrounding text, however, should contain enough information to predict which form is used with considerable accuracy. Most notably, the other verbs in the text should indicate whether the text is in past or present tense.

Tense Progression N-Grams?
One idea we discussed previously, and which Satoshi apparently tested, was using tense n-grams. N-grams, however, model very local context. This is a problem, since many verbs in a text do not follow the text's tense model at all: e.g. all verbs in indirect contexts such as quoted speech, and verbs in attributive constructions or in subordinate clauses. Rather than n-grams, Christer Samuelsson suggested I use the ratio of the two tenses over the entire text as the distinguishing criterion: this measure is more global.

Material
I looked at one issue of the Wall Street Journal, processed by Engcg (wsj_08*.engcg). The ratio of present to past tense form frequencies varies from about 0.1 (e.g. WSJ900801-0075: 9 present, 78 past: 0.115) to over 5 (e.g. WSJ900801-0061: 11 present, 2 past: 5.5), with an average of 1.2 – suggesting that there is plenty of information to be tapped. The distribution of ratios for this issue is shown in the following graph.

0 ****************************************
0.16 **********
0.2 ***
0.25 *****
0.33 *****
0.5 *******************
1 *************************************
2 *************************
3 ************
4 *****
5 ****
6 ***
> 6 *********
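A minimal sketch of the ratio computation, assuming the per-document tense counts have already been extracted from the Engcg tags (the helper name is mine, not from train.perl):

```python
# Raw present-to-past ratio for one document, from its tense counts.

def tense_ratio(n_pres, n_past):
    """Ratio of present to past tense forms in one document."""
    return n_pres / n_past if n_past else float("inf")

# The two example documents from the text:
print(round(tense_ratio(9, 78), 3))   # WSJ900801-0075 -> 0.115
print(tense_ratio(11, 2))             # WSJ900801-0061 -> 5.5
```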
So, given a text, I look at the ratio of present to past among auditively distinct cases: there are plenty of irregular verbs (go – went) and verbs with regular stem changes (win – won), as well as verbs ending in dental stops (-t, -d), to use as fixed points.
I extract probabilities (train.perl) on the training material (wsj_0702.engcg) using several different settings.

Details
The probabilities are based on the distribution of observed relative frequencies. If f(pres,doc) and f(past,doc) are the frequencies of present and past tenses in a document, respectively, and q(doc) is the ratio f(pres,doc)/(f(past,doc)+1), then the q's for the training set are tabulated as in the following table.

q(doc) > k
k         |    6    5    4    3    2    1  0.5 0.33 0.25 0.20 0.17    0 |   sum
----------+------------------------------------------------------------+------
f(pres)   | 2007  427  445 1151  719 1997 1554  743  193   78    9  301 |  9624
f(past)   |  322  104  115  465  325 1548 1966 1374  481  235   24 1117 |  8076
----------+------------------------------------------------------------+------
sum per q | 2329  531  560 1616 1044 3545 3520 2117  674  313   33 1418 | 17700

The probability of a verb in a text being of either tense, given the observed tense frequencies, is then simply approximated by the relative frequencies of f(pres) and f(past) in the matching column. If, for example, a text has so far yielded 40 present tense forms and 6 past tense forms:

q(doc) = 40/(6+1) = 5.7

and

p(pres|q(doc)>5) = f(pres)/(f(pres)+f(past)) = 427/531 = 0.80
p(past|q(doc)>5) = f(past)/(f(pres)+f(past)) = 104/531 = 0.20
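The same estimate as a Python sketch, reading the counts 427 (present) and 104 (past) from the k=5 column of the table above; the function name is illustrative, not from the original scripts:

```python
# Estimate p(pres) and p(past) from the tabulated counts for one q(doc) bin.

def tense_probs(f_pres, f_past):
    """Relative frequencies of present and past within one bin."""
    total = f_pres + f_past
    return f_pres / total, f_past / total

# Worked example from the text: 40 present and 6 past forms observed so far.
q = 40 / (6 + 1)                        # q(doc) = 5.7, so the k=5 bin applies
p_pres, p_past = tense_probs(427, 104)  # counts for that bin
print(round(q, 1), round(p_pres, 2), round(p_past, 2))  # 5.7 0.8 0.2
```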

Settings used
08probs.prespret
Uses all verbs as the observed criterion. In the real case this is no good: we will not have this information for the test material. (prespret? pres is for present, pret for preterite – another name for the past tense.)
k (bin on q(doc))   p(pres)   p(past)

0 0.085 0.915
0.17 0.167 0.833
0.2 0.186 0.814
0.25 0.233 0.767
0.33 0.302 0.698
0.5 0.425 0.575
1 0.587 0.413
2 0.720 0.281
3 0.783 0.217
4 0.827 0.173
5 0.857 0.143
6 0.906 0.094
08probs.ab
Uses “strong” verbs: verbs with a stem change or other irregularity. I filtered out those with identical present and past forms (e.g. put – put). The list is in the code; it should instead be obtained from COMLEX or some other better-thought-out source. (“ab” stands for “ablaut”, one of the paradigms for regular stem change.)
08probs.td
Uses only verbs ending in dental stops. These should have auditively very different past forms.
08probs.tdab
Uses a simple sum of td + ab. This proved not to help. The sources should be merged differently.
08probs.ab.skip
Skips every finite verb immediately after a relative pronoun (which, that), on the assumption that such verbs will not be part of the overall text tense system. This proved not to help.
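The ab.skip filter can be sketched as below; the (word, tag) token representation and the tag names are assumptions for illustration, not the original code:

```python
# Drop every finite verb that immediately follows a relative pronoun
# (which, that) before counting tenses.

RELATIVE_PRONOUNS = {"which", "that"}

def skip_relative_clause_verbs(tokens):
    """tokens: list of (word, tag) pairs; returns the pairs to keep."""
    kept = []
    prev_word = None
    for word, tag in tokens:
        if not (tag == "VERB" and prev_word in RELATIVE_PRONOUNS):
            kept.append((word, tag))
        prev_word = word.lower()
    return kept

tokens = [("gains", "NOUN"), ("that", "PRON"),
          ("reflect", "VERB"), ("growth", "NOUN")]
print(skip_relative_clause_verbs(tokens))
# "reflect" is dropped; the other tokens are kept
```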
Results
Using bare probability estimates
                   verbs  corr  miss   acc
results.prespret:   6973  4139  2834   59%
results.ab:         3194  1971  1223   62%
results.td:         5348  3188  2160   60%
results.tdab:       1569   972   591   62%
results.ab.skip:    2986  1655  1331   55%
A baseline estimate would be to always guess the more common form. For the material at hand this would be present tense, and would give us between 54 and 56% correct guesses.
Playing it safe
If we back off from making predictions in cases where the probabilities are inconclusive, we gain accuracy at the cost of coverage. This may be a sensible way to stave off some of the more embarrassing errors.

strong verbs

          verbs  corr  miss   acc
p > 0.5    3194  1971  1223   62%
p > 0.6    2058  1339   719   65%
p > 0.7    1603  1062   541   66%
p > 0.8     583   436   147   75%

dental stop verbs

          verbs  corr  miss   acc
p > 0.5    5348  3188  2160   60%
p > 0.6    2717  1734   983   64%
p > 0.7
p > 0.8     529   376   153   71%
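The backoff amounts to a threshold test on the stronger of the two probabilities; a sketch (threshold value and function name are illustrative):

```python
# Predict a tense only when the probability estimate is conclusive enough,
# otherwise abstain, trading coverage for accuracy.

def predict(p_pres, threshold=0.7):
    """Return 'pres', 'past', or None (abstain) for one verb."""
    p_past = 1.0 - p_pres
    if p_pres >= threshold:
        return "pres"
    if p_past >= threshold:
        return "past"
    return None

print(predict(0.857))  # confident -> 'pres'
print(predict(0.587))  # inconclusive -> None (abstain)
```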

What next?
Run a test that spots indirect contexts (quoted speech and the like).
Use a better relative-clause spotter.
Many errors could be fixed using an adverbial spotter: a text which mainly reports in the present tense will have past tense clauses at unpredictable intervals, clearly marked with an adverbial:
… most justices agree that … But, as retired Justice Lewis Powell
warned in a speech last year, … (WSJ900702-0182)
New York University, December 1996
