Help on a book-titles algorithm
Sluit je aan bij LibraryThing om te posten.
Dit onderwerp is gemarkeerd als "slapend"—het laatste bericht is van meer dan 90 dagen geleden. Je kan het activeren door een een bericht toe te voegen.
I want to move away from the Amazon/library dichotomy, and add a concept of "shared core." For example, if LT knows these:
The Hobbit: Or there and back again (2 copies)
The Hobbit (Movie Tie-in Edition) (2 copies)
The Hobbit (1 copy)
I want it to pick "The Hobbit" as the title. Maybe it should pick that even if there wasn't a single copy without the title.
The trick is that it's hard to define "shared core" well. Take this example:
Harry Potter and the Sorcerer's Stone (Movie-Tie In Edition)
Harry Potter and the Sorcerer's Stone
Harry Potter and the Philosopher's Stone
The "shared core" is Harry Potter and the, period. Even if I required the title to already exist, it would probably end up with Harry Potter, since some user has no doubt titled their book that way.
There are some ways around this: I can require some book to have core. I can avoid looking for the core and merely create cores by chopping off parentheses. Of course, there are some books with parentheses in their title. And to prevent too many false positives I can log Harry Potter and the Sorcerer's Stone (Movie-Tie In Edition) as 10 copies in full and five without the parenthesis, so it would only choose it if the core was "pervasive" among all editions.
Anyone have a better, cleaner solution?
Whatever you come up with, you will probably want to give users the option to override the code and enter a "core" title.
Right. We allow that now—the "canonical title" field in Common Knowledge. I'm looking for a best-case algorithm.
So, in your Hobbit case:
1+2=The Hobbit (4). 1+3=The Hobbit(3) 2+3=The Hobbit(3)
You've now got The Hobbit(10), and as the only prefix, stop.
In the Harry Potter (assuming 1 count for each ;-)
1+2=HP&tSS (2) 1+3=HP&t (2) 2+3=HP&t(2)
So: HP&tSS(2) HP&t (4)
Repeat on those and you get HP&t (6).
Well, you get some interesting weightings. I'm guessing that the figures in the second example demonstrate that it's not appropriate. Oh well.
Presumably the problem with "and the" is resolvable: I've got code for handling the artists on my MP3s, which first splits the artist string on conjunctions (and, with, ft., vs., this being music ;-), and then moves leading articles to the end, in order to canonicalise. You could presumably just trim these trailing conjunctions and articles. Whether those are the only words you need to worry about, I've got less idea. Probably not :(
1) It is going to be impossible to get a common core across all editions of some works - some of the titles will be in foreign (Harrius Potter et Philosophi Lapis) and some books just have completely different names between US English and Commonwealth English.
2) Some titles have been entered all in caps - do we need to correct for case in this?
3) Some titles have been entered with strange spacing. Comparisons should be done at word level and not character level.
In the Harry Potter and the Sorceror's Stone we presumably want to get the most popular title disregarding all the little bits of fluff that seem to attach to titles.
Because we cannot get a common core across all titles there must be a percentage at which the a common core wins. What that percentage is will need some trial and error.
In the case of Harry Potter there may well be a number of 'cores' that meet that percentage. For example "Harry Potter" might be common across 98% of works, "Harry Potter And The" across 95%, "Harry Potter And The Sorceror's Stone" across 70%. In that case if the more common cores are substrings (excluding spacing and punctuation) of a longer core that also meets the cuttoff then that longer core should win.
But in the case of Harry Potter, all the books of the series start with "Harry Potter and the ...", so that would be a very bad choice for the core.
I think the best you can do is get rid of (Amazon parenthetical fluff).
You are missing the point I think.
Nearly all of the editions begin "Harry Potter".
Just a few less than that begin "Harry Potter and the"
The majority (I guess) begin (or contain) "Harry Potter and the Sorceror's Stone".
A sizeable minority begin (or contain) "Harry Potter and the Philosopher's Stone".
If that third case has enough copies to reach the cutoff for consideration then it will win over the first two cases as they consist of substrings of the third. The most popular "core" is not the one that will win.
I'm worried getting rid of the parenthesis will misfire somewhere else, though.
1. Collect all titles and their counts. I also know the language the titles went with book-by-book.
2. Go through all the titles and counts, and try to make a parenthetical-less one. If possible and that version was *already existing* then add 75% of the count of the title to the new parenthetical one.
3. Go through titles in reverse-count order. If the language for that title hasn't been found yet.
This seems to work pretty well. For example, it decided on Left Behind: A Novel of the Earth's Last Days although almost all editions add some series-level info like "(Left Behind No. 1)." If they had ALL added the same parenthetical info, that would have one, but the instability (Left Behind #1), (Left Behind Series, 1), etc. worked in our favor.
Of course, this isn't perfect, but it's a 90% solution, I think. And if it fails, there's always the canonical title option.
I am, too. I am currently trying to add canonical titles for books, and I am discovering that a great many of the parenthetical remarks cannot be removed without creating the danger that works will be improperly combined.
For example, there's "The Great Gatsby (Cliff notes)" - not to mention (York notes), (Spark notes), etc. I've seen parentheses use to indicate (audiotape) and (movie). Will the book "Harry Potter and the Sorcerer's Stone" end up combined with "Harry Potter and the Sorcerer's Stone (DVD)"?
This is about the work title. It does not affect your book titles or the titles of any editions.
That is, you can have two works with the same work title (in English or in any other language), composed of editions with different titles. That was true before too, although it's more likely now.
I'm playing with the Great Gatsby right now, so don't mess with it until I come back.