Last Friday, I was tinkering with TextTeaser to improve its backend. It's been a while since I modified anything in it. But the modifications and improvements are for a different blog post.
While I was working on TextTeaser, I remembered something I've wanted to do for a while now: test how well TextTeaser can summarize a book. And no book is better to try it on than The Hunger Games. It is the best choice because I have read the book series and know it well. I'm not really a book reader, but I managed to finish the series.
Preparing for summarization:
I managed to download a text file of The Hunger Games. TextTeaser needed to be modified to be able to summarize a very large chunk of text. The setup is that TextTeaser reads the text file and assigns its contents to a variable.
I also disabled the “learning” mechanism to speed up the process. I did it for speed reasons, but it turns out that it affects more than just that. I will discuss it later.
I reduced the summary count to three sentences, just for fun.
After preparing TextTeaser, I ran the Scala script. I thought it would be a long wait even without the “learning” mechanism. I was wrong: I think the summary was generated in less than ten seconds.
Here it is:
- Bright and bubbly as ever, Effie Trinket trots to the podium and gives her signature, Happy Hunger Games!
- The Hunger Games aren’t a beauty contest, but the best-looking tributes always seem to pull more sponsors.
- Stop! Stop! Ladies and gentlemen, I am pleased to present the victors of the Seventy-fourth Hunger Games, Katniss Everdeen and Peeta Mellark!
I think it's a pretty good summary. The three sentences cover the three parts of the story: the selection of tributes, the description of the Hunger Games, and the end result. It was actually cool that TextTeaser managed to extract those sentences.
Of course, there's an analysis part. So how does TextTeaser manage to extract those sentences?
Another reason why The Hunger Games is a good book for this experiment is that the title really tells the story. The title “The Hunger Games” tells you that the book is, of course, about the Hunger Games. That is unlike the other books in the series: the titles Catching Fire and Mockingjay don't sum up the stories of those books.
TextTeaser gives heavy weight to the relationship between the title and the content. That's why it performs really well with news articles. And that is also why “Hunger Games” appears in every line of the summary.
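To make the title idea concrete, here is a minimal sketch in Python of how a title-overlap score could work. This is not TextTeaser's actual code: the function names, the tiny stop-word list, and the exact formula are all assumptions made for illustration.

```python
import re

# Hypothetical, very short stop-word list; a real summarizer would use a longer one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def words(text):
    """Lowercase the text and return its tokens, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def title_score(title, sentence):
    """Fraction of the title's words that also appear in the sentence (assumed formula)."""
    title_words = set(words(title))
    if not title_words:
        return 0.0
    sentence_words = set(words(sentence))
    return len(title_words & sentence_words) / len(title_words)


title = "The Hunger Games"
with_title = "The Hunger Games aren't a beauty contest."
without_title = "I run toward the lake."
print(title_score(title, with_title), title_score(title, without_title))
```

Under a score like this, any sentence that contains both “Hunger” and “Games” gets the maximum title score, which fits the fact that all three extracted sentences mention the Hunger Games.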
I also mentioned earlier that the “learning” mechanism affects more than just the speed. In fact, it didn't affect the speed at all. I tried summarizing the book again with the “learning” mechanism, and the time to generate a summary was almost the same. But… the resulting summary is very different. It's bad.
Here’s the summary with the “learning” mechanism:
Yeah. What the hell. Katniss is too obsessed with shouting Peeta's name. She calls out to Peeta a lot during the game and throughout the whole story. While TextTeaser gives a lot of weight to the title, it gives a lot more weight to repeated words. And so it gives a lot of weight to the word “Peeta”. I checked the database where keywords are stored and saw that “Peeta” is at the top of the list and occurs more than 400 times. No other relevant keyword came close, and “Haymitch” is the only other name that is part of the list.
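Here is a rough sketch of why a heavily repeated word can take over. It assumes a sentence is scored by the average corpus frequency of its words, which is a simplification I made up for illustration and not TextTeaser's real formula. The sample text and the names are hypothetical too.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "at"}


def tokenize(text):
    """Lowercase the text and return its tokens, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def keyword_score(sentence, counts):
    """Average frequency of the sentence's words across the whole text (assumed formula)."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    return sum(counts[t] for t in tokens) / len(tokens)


# Made-up sample text, not a quote from the book.
text = "Peeta! Peeta! Peeta! The games begin at dawn."
counts = Counter(tokenize(text))

print(keyword_score("Peeta!", counts))
print(keyword_score("The games begin at dawn.", counts))
```

With a score like this, a one-word sentence made of the most frequent keyword beats a longer, more informative sentence every time.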
Another reason is that a call of “Peeta!” is considered a sentence because it ends with proper punctuation. While TextTeaser gives a low score to a sentence with a low word count, it also gives the word count itself little weight.
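The sketch below shows both halves of that point: a naive splitter that treats anything ending in punctuation as a sentence, and a length score that punishes very short sentences. The splitting rule, the ideal length of 20 words, and the formula are assumptions for illustration only, not what TextTeaser actually does.

```python
import re

# Assumed "ideal" sentence length in words; purely illustrative.
IDEAL_LENGTH = 20


def split_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def length_score(sentence, ideal=IDEAL_LENGTH):
    """Score from 0 to 1: 1 at the ideal word count, lower the further away."""
    n = len(sentence.split())
    return max(0.0, 1 - abs(ideal - n) / ideal)


# Made-up sample text, not a quote from the book.
sentences = split_sentences("Peeta! Peeta! I run toward the lake.")
for s in sentences:
    print(repr(s), round(length_score(s), 2))
```

Each “Peeta!” becomes its own sentence with a very low length score. But if the length score only counts for a small share of the final score, that penalty is not enough to push those sentences out of the summary.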
This single experiment doesn't really show the capability of TextTeaser in summarizing books. But it does show how summarizing a book differs from summarizing news articles. While TextTeaser can somehow be used to summarize books, it still needs heavy modifications to work decently.
Let's also not forget that I only extracted three sentences. What if I extracted the usual five sentences? Or maybe 20? I think we need to extract more sentences to determine the effectiveness and to call it a real summary.
"Peeta" ruins the "learning" mechanism. But what if we disregard the word "Peeta" and add it to the stop words? Will it affect the quality of the result?
Well, we don't know. But this is just an initial test. I'm looking forward to testing “The Hunger Games” again with a different setup. And I'm definitely looking forward to testing it on other books, or other documents as well.
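For that different setup, the stop-word idea could be tried along these lines. This is only a sketch under my own assumptions: the function, the base stop-word list, and the sample text are hypothetical and do not come from TextTeaser's code or keyword database.

```python
import re
from collections import Counter

# Hypothetical base stop-word list, kept tiny for the example.
BASE_STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "about"}


def top_keywords(text, extra_stop_words=(), limit=3):
    """Return the most frequent words, ignoring base and extra stop words."""
    stop = BASE_STOP_WORDS | {w.lower() for w in extra_stop_words}
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop).most_common(limit)


# Made-up sample text, not a quote from the book.
text = "Peeta! Peeta! Haymitch warns Peeta about the arena. Haymitch waits."

print(top_keywords(text, limit=1))
print(top_keywords(text, extra_stop_words=["Peeta"], limit=1))
```

Once “Peeta” is treated as a stop word, the next most frequent word moves to the top of the list. Whether that actually improves the summary is exactly the open question.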
I'm in the process of upgrading the backend of TextTeaser. It's coming soon!
Check out TextTeaser on its website.
It's also open source on GitHub.
April 14, 2014