Another Kind of Character
In some situations, the relationships between quantities allow us to make predictions. This text will explore how to make accurate predictions based on incomplete information and develop methods for combining multiple sources of uncertain information to make decisions.
As an example of visualizing information derived from multiple sources, let us first use the computer to get some information that would be tedious to acquire by hand. In the context of novels, the word "character" has a second meaning: a printed symbol such as a letter or number or punctuation symbol. Here, we ask the computer to count the number of characters and the number of periods in each chapter of both Huckleberry Finn and Little Women.
# In each chapter, count the number of all characters;
# call this the "length" of the chapter.
# Also count the number of periods.
chars_periods_huck_finn = Table().with_columns([
    'Huck Finn Chapter Length', [len(s) for s in huck_finn_chapters],
    'Number of Periods', np.char.count(huck_finn_chapters, '.')
])
chars_periods_little_women = Table().with_columns([
    'Little Women Chapter Length', [len(s) for s in little_women_chapters],
    'Number of Periods', np.char.count(little_women_chapters, '.')
])
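To see what these two counts do on a small scale, here is a minimal sketch that applies the same expressions to two short strings. The strings and the name `toy_chapters` are made up for illustration and are not taken from either novel. Note that only periods are counted, so a sentence ending in a question mark adds to the length but not to the period count.

```python
import numpy as np

# Hypothetical stand-ins for chapter texts; real chapters are far longer.
toy_chapters = [
    "It was late. We went home.",
    "Who is there? Nobody answered. The door creaked. Then silence.",
]

# "Length" of each text: every character counts, including spaces
# and punctuation.
lengths = [len(s) for s in toy_chapters]

# Number of periods in each text; the question mark is not counted.
periods = np.char.count(toy_chapters, '.')

print(lengths)
print(periods)
```

The second string holds four sentences but only three periods, which is one reason the number of periods does not determine the length exactly.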
Here are the data for Huckleberry Finn. Each row of the table corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods: the shorter the chapter, the fewer sentences it tends to contain, and vice versa. The relation is not entirely predictable, however, because sentences vary in length and can end with other punctuation such as question marks.
|Huck Finn Chapter Length||Number of Periods|
... (33 rows omitted)
Here are the corresponding data for Little Women.
|Little Women Chapter Length||Number of Periods|
... (37 rows omitted)
You can see that the chapters of Little Women are in general longer than those of Huckleberry Finn. Let us see if these two simple variables – the length and number of periods in each chapter – can tell us anything more about the two books. One way for us to do this is to plot both sets of data on the same axes.
In the plot below, there is a dot for each chapter in each book. Blue dots correspond to Huckleberry Finn and gold dots to Little Women. The horizontal axis represents the number of periods and the vertical axis represents the number of characters.
plots.figure(figsize=(6, 6))
plots.scatter(chars_periods_huck_finn.column(1),
              chars_periods_huck_finn.column(0),
              color='darkblue')
plots.scatter(chars_periods_little_women.column(1),
              chars_periods_little_women.column(0),
              color='gold')
plots.xlabel('Number of periods in chapter')
plots.ylabel('Number of characters in chapter');
The plot shows us that many but not all of the chapters of Little Women are longer than those of Huckleberry Finn, as we had observed by just looking at the numbers. But it also shows us something more. Notice how the blue points are roughly clustered around a straight line, as are the gold points. Moreover, it looks as though both colors of points might be clustered around the same straight line.
Now look at all the chapters that contain about 100 periods. The plot shows that those chapters contain roughly 10,000 to 15,000 characters. That's about 100 to 150 characters per period.
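The arithmetic behind that estimate can be written out as a short sketch. The numbers below are the rough values read off the plot, not actual chapter data, and the variable names are chosen here for illustration.

```python
import numpy as np

# Rough values read off the plot (not actual chapter data):
# two hypothetical chapters, each with about 100 periods.
periods = np.array([100, 100])
characters = np.array([10_000, 15_000])

# Characters per period: divide each chapter's length by its period count.
chars_per_period = characters / periods

print(chars_per_period)
```

The same element-wise division could be applied to the two columns of either table to get a characters-per-period figure for every chapter.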
Indeed, it appears from looking at the plot that on average both books tend to have somewhere between 100 and 150 characters between periods, as a very rough estimate. Perhaps these two great 19th-century novels were signaling something so very familiar to us now: the 140-character limit of Twitter.