Help on a book-titles algorithm

DiscussiePurely Programmers

Sluit je aan bij LibraryThing om te posten.

Help on a book-titles algorithm

Dit onderwerp is gemarkeerd als "slapend"—het laatste bericht is van meer dan 90 dagen geleden. Je kan het activeren door een een bericht toe te voegen.

1timspalding
jul 5, 2008, 1:00am

I'm trying to come up with a better algorithm for work titles. The previous algorithm was purely democratic, except that it favored library titles over Amazon titles.

I want to move away from the Amazon/library dichotomy, and add a concept of "shared core." For example, if LT knows these:

The Hobbit: Or there and back again (2 copies)
The Hobbit (Movie Tie-in Edition) (2 copies)
The Hobbit (1 copy)

I want it to pick "The Hobbit" as the title. Maybe it should pick that even if there wasn't a single copy without the title.

The trick is that it's hard to define "shared core" well. Take this example:

Harry Potter and the Sorcerer's Stone (Movie-Tie In Edition)
Harry Potter and the Sorcerer's Stone
Harry Potter and the Philosopher's Stone

The "shared core" is Harry Potter and the, period. Even if I required the title to already exist, it would probably end up with Harry Potter, since some user has no doubt titled their book that way.

There are some ways around this: I can require some book to have core. I can avoid looking for the core and merely create cores by chopping off parentheses. Of course, there are some books with parentheses in their title. And to prevent too many false positives I can log Harry Potter and the Sorcerer's Stone (Movie-Tie In Edition) as 10 copies in full and five without the parenthesis, so it would only choose it if the core was "pervasive" among all editions.

Anyone have a better, cleaner solution?

2dchaikin
jul 5, 2008, 1:39am

There are so many different naming variations out there, that I think any method will have lots of problems. As I see it, in the second example the books simply have two completely different names. There isn't a "shared core", the concept doesn't apply. How could you identify where shared core does and doesn't apply?

Whatever you come up with, you will probably want to give users the option to override the code and enter a "core" title.

3timspalding
jul 5, 2008, 2:03am

Whatever you come up with, you will probably want to give users the option to override the code and enter a "core" title.

Right. We allow that now—the "canonical title" field in Common Knowledge. I'm looking for a best-case algorithm.

4wimble
jul 5, 2008, 4:44am

I don't know if this helps, but what happens if you generate the common prefix from each original pair of titles (along with a weighting for that prefix). The obvious problem being that from N titles, there are (N-1)(N-2)/2 potential pairs. And then, possibly, repeat on these prefixes, etc, until you get everything down to empty strings.

So, in your Hobbit case:
1+2=The Hobbit (4). 1+3=The Hobbit(3) 2+3=The Hobbit(3)
You've now got The Hobbit(10), and as the only prefix, stop.

In the Harry Potter (assuming 1 count for each ;-)
1+2=HP&tSS (2) 1+3=HP&t (2) 2+3=HP&t(2)
So: HP&tSS(2) HP&t (4)
Repeat on those and you get HP&t (6).

Well, you get some interesting weightings. I'm guessing that the figures in the second example demonstrate that it's not appropriate. Oh well.

Presumably the problem with "and the" is resolvable: I've got code for handling the artists on my MP3s, which first splits the artist string on conjunctions (and, with, ft., vs., this being music ;-), and then moves leading articles to the end, in order to canonicalise. You could presumably just trim these trailing conjunctions and articles. Whether those are the only words you need to worry about, I've got less idea. Probably not :(

5andyl
jul 5, 2008, 5:26am

Firstly some recapping and musing.

1) It is going to be impossible to get a common core across all editions of some works - some of the titles will be in foreign (Harrius Potter et Philosophi Lapis) and some books just have completely different names between US English and Commonwealth English.

2) Some titles have been entered all in caps - do we need to correct for case in this?

3) Some titles have been entered with strange spacing. Comparisons should be done at word level and not character level.

In the Harry Potter and the Sorceror's Stone we presumably want to get the most popular title disregarding all the little bits of fluff that seem to attach to titles.

Because we cannot get a common core across all titles there must be a percentage at which the a common core wins. What that percentage is will need some trial and error.

In the case of Harry Potter there may well be a number of 'cores' that meet that percentage. For example "Harry Potter" might be common across 98% of works, "Harry Potter And The" across 95%, "Harry Potter And The Sorceror's Stone" across 70%. In that case if the more common cores are substrings (excluding spacing and punctuation) of a longer core that also meets the cuttoff then that longer core should win.

6shmjay
jul 5, 2008, 8:59am

>5 andyl:

But in the case of Harry Potter, all the books of the series start with "Harry Potter and the ...", so that would be a very bad choice for the core.

I think the best you can do is get rid of (Amazon parenthetical fluff).

7andyl
jul 5, 2008, 9:16am

#6

You are missing the point I think.

Nearly all of the editions begin "Harry Potter".
Just a few less than that begin "Harry Potter and the"
The majority (I guess) begin (or contain) "Harry Potter and the Sorceror's Stone".
A sizeable minority begin (or contain) "Harry Potter and the Philosopher's Stone".

If that third case has enough copies to reach the cutoff for consideration then it will win over the first two cases as they consist of substrings of the third. The most popular "core" is not the one that will win.

8timspalding
jul 5, 2008, 12:16pm

>6 shmjay:

I'm worried getting rid of the parenthesis will misfire somewhere else, though.

9timspalding
Bewerkt: jul 5, 2008, 1:45pm

So, I've gone with:

1. Collect all titles and their counts. I also know the language the titles went with book-by-book.
2. Go through all the titles and counts, and try to make a parenthetical-less one. If possible and that version was *already existing* then add 75% of the count of the title to the new parenthetical one.
3. Go through titles in reverse-count order. If the language for that title hasn't been found yet.

This seems to work pretty well. For example, it decided on Left Behind: A Novel of the Earth's Last Days although almost all editions add some series-level info like "(Left Behind No. 1)." If they had ALL added the same parenthetical info, that would have one, but the instability (Left Behind #1), (Left Behind Series, 1), etc. worked in our favor.

Of course, this isn't perfect, but it's a 90% solution, I think. And if it fails, there's always the canonical title option.

10timspalding
jul 5, 2008, 1:44pm

It's marching through the works now—at about work number 1,000. It will take a while to hit all of them.

11lilithcat
jul 5, 2008, 1:56pm

> 8

I am, too. I am currently trying to add canonical titles for books, and I am discovering that a great many of the parenthetical remarks cannot be removed without creating the danger that works will be improperly combined.

For example, there's "The Great Gatsby (Cliff notes)" - not to mention (York notes), (Spark notes), etc. I've seen parentheses use to indicate (audiotape) and (movie). Will the book "Harry Potter and the Sorcerer's Stone" end up combined with "Harry Potter and the Sorcerer's Stone (DVD)"?

12timspalding
jul 5, 2008, 2:07pm

>11 lilithcat:

This is about the work title. It does not affect your book titles or the titles of any editions.

That is, you can have two works with the same work title (in English or in any other language), composed of editions with different titles. That was true before too, although it's more likely now.

I'm playing with the Great Gatsby right now, so don't mess with it until I come back.

13lilithcat
jul 5, 2008, 2:18pm

Sorry, Tim, but I "messed" with it before I posted here.

14timspalding
jul 5, 2008, 2:18pm

Yeah, this was possible before. It only unscores how important it is for the combination page to show all the editions, I think.

15timspalding
jul 5, 2008, 2:20pm

>13 lilithcat:

:)

I didn't mean that in a bad way...

16lilithcat
jul 5, 2008, 2:38pm

> 15

Oh, I know that! But I did want you to be aware that I had made changes, as that might have affected what you were doing (or how you interpreted your data).