Collaborative works: Joseph Conrad and Ford Madox Ford

Discussion: Literary Computing

1 Petroglyph
Edited: May 12, 2022, 9:17 pm

For today's Lunch Break Experiment (tm) I'll look at the novel The Inheritors (1901), which was a collaboration between Joseph Conrad and Ford Madox Ford. What I want to achieve at the end of this is two things:

  1. Learn how to produce graphs like in this Good Omens analysis by Callaway;
  2. Get a feel for how this works, and for how much the different algorithms differ from each other.
Ideally, a collaboration graph like Callaway's would lay out visually the contributions made by each author, as well as where the handovers occur. The result in this case? No, not really.

Can we identify each author's individual style well enough to arrive at a probabilistic estimate of which parts of The Inheritors were contributed by each? Spoiler:

Right from the get-go I need to admit that I've shamelessly stolen the idea for this experiment from Maciej Eder's own illustrations for this particular application of R:Stylo (Eder is the main developer behind Stylo), though I've done more tinkering and graph-producing. I've also based portions of this post on Rybicki et al. 2014, a paper that treats the issue in much more detail. Three reasons:

  1. My time is limited, so quick and dirty and mainly involving following someone else's lead is perfect right now.
  2. Didactic purposes for myself: I want to try out things and get a feel for them, in order to run more independent little explorations in the future (even in a post-Lunch-Break-Experiment-(tm)-era).
  3. Didactic purposes for others: a carefully and transparently documented process leaves room for only those criticisms that are made a) in bad faith, b) due to careless reading, and/or c) due to a superior understanding of stats and/or the software.
(That last comment is irrelevant for this forum.)

So if parts of this Lunch Break Experiment (tm) feel like "Petroglyph messing about with graphs just to see what they will do", then that is exactly what I was doing.


The method

  1. Feed the software some texts that Conrad wrote on his own; do the same for Ford. This is how each individual author's stylistic profile is generated
  2. Feed the software the collaborative novel
  3. Let the software slice the collaborative novel into overlapping slices (5000 words each, with the window moving forward 500 words at a time), determining for each slice an author profile
  4. Let the software compare each slice's profile against both individual authors
  5. Each slice ends up with two probabilistic estimates: X percent certainty the slice is by Conrad; Y percent certainty the slice is by Ford
  6. Put all this in a graph

For the nicest-looking graphs, we want the probability estimates for each slice to show a clear winner, but that cannot always be the case.


Corpus

I downloaded from ProjGut six novels from both Joseph Conrad and Ford Madox Ford (though see below), as well as The Inheritors. (See the filenames in the illustrations below.) At some point it became relevant to also download Romance (another collaboration between Conrad & Ford) and Conrad's Nostromo, which Ford claimed to have contributed prose to. These documents were given a rough cleaning: the ProjGut legal boilerplate was removed, as were any illustrations, Prefaces, Table of Contents, Notes to the First Edition, dedications, mottos, etc. All novels start at Chapter One (or Part One, Chapter One) and end with the last word of fiction: I also removed the dates and places where the books were written from the end.
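For anyone who wants to script the rough cleaning, here is a minimal sketch in base R. It is not what I actually ran: the helper names are hypothetical, the marker patterns are an assumption about how the ProjGut file at hand is laid out (older files word their markers differently), and the toy lines stand in for a real file read with readLines().

```r
# Toy stand-in for a downloaded file; a real file would come from readLines("some_file.txt")
raw_lines <- c(
  "The Project Gutenberg eBook of An Example",
  "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
  "PREFACE",
  "Some front matter.",
  "CHAPTER I",
  "It was a dark night.",
  "THE END",
  "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
  "Licence text follows."
)

# Hypothetical helper: keep only the lines strictly between the START and END markers.
# The marker patterns are an assumption; check them against your own files.
strip_boilerplate <- function(lines) {
  start <- grep("^\\*\\*\\* ?START OF", lines)
  end   <- grep("^\\*\\*\\* ?END OF", lines)
  if (length(start) == 0 || length(end) == 0) return(lines)  # no markers: leave untouched
  lines[(start[1] + 1):(end[1] - 1)]
}

# Hypothetical helper: drop prefaces etc. by starting at the first chapter heading.
from_first_chapter <- function(lines, pattern = "^CHAPTER (I|1|ONE)\\.?$") {
  first <- grep(pattern, lines)
  if (length(first) == 0) return(lines)  # no heading found: leave untouched
  lines[first[1]:length(lines)]
}

body <- from_first_chapter(strip_boilerplate(raw_lines))
print(body)
```

Anything after the last word of fiction (a closing "THE END", dates and places of writing) still has to be trimmed by hand or with a further pattern.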

Conrad: The nigger of the 'Narcissus' (1897), Heart of darkness (1899), Lord Jim (1900), Under western eyes (1911), Chance (1913), Victory (1915)

Ford: The fifth queen (1906), Privy Seal (1907), An English girl (1907), The fifth queen crowned (1908), The young Lovell (1913), The good soldier (1915)

In order to make this corpus a little more representative, I selected only novels (no non-fiction, no short stories, no plays) from a timeframe 1897-1915; The Inheritors is from 1901, Romance from 1903 and Nostromo from 1904.

Conrad's novels are the same as the ones used by Eder in his example corpus. For Ford's novels, however, I made some different choices, due to ProjGut availability. I will say that I spent a few minutes trying to track down good-quality machine-readable copies for the exact novels in Eder's example corpus, with no luck. I looked at ProjGut Canada and Australia, Wikisource, StandardEbooks and OpenLibrary, as well as a few other sites that a duckduckgo query brought up; the copies over at archive.org, while they are machine-readable, suffer from an atrocious number of OCR errors (seriously, ew, ew ew, look how they massacred my poor texts) and would require actual time and effort to clean up. So I took the easy way out and grabbed whatever novels from the right timeframe were available on ProjGut and from this corpus of 100 English novels. This is a Lunch Break Experiment (tm), not something I'll submit to peer review. (Note: if you compare my graphs with Eder's illustration, they'll look a little different, because we calibrated our algorithms on different training data).


Sanity check

First, just as a sanity check, let's see if the software can distinguish between Conrad's novels and Ford's. Here is a bootstrap consensus tree:



Conrad's novels (in green) are systematically separate from Ford's (in blue). I've also included two collaborative novels (in red). Of these, Romance is squarely categorized as one of Conrad's books, whereas The Inheritors fits snugly within Ford's (that novel is nowhere near the other collaborative work, nor anywhere near Nostromo). Within Conrad's novels, there's an early group to the left (published between 1899 and 1904), and a later group to the right (published 1911-1915). I have no idea why Narcissus is located where it is. There's some structure within Ford's books as well: the three Fifth Queen novels (Fifth Queen, Privy Seal, and Fifth Queen Crowned) are placed closely together, apart from the others.

But yes: the software recognizes the differences between Conrad and Ford. Now we can start thinking about the analyses and what kind of results we might get.
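To give a feel for what a tree like that is built from, here is a much-simplified sketch in base R. It is a single ordinary cluster analysis, not the bootstrap consensus procedure itself, and every number in it is invented: six hypothetical texts by two hypothetical authors (A and B), described by the relative frequencies of four common words.

```r
# Invented relative frequencies (per 100 words) for six hypothetical texts
freqs <- matrix(
  c(6.1, 3.2, 2.9, 2.4,
    6.0, 3.3, 3.0, 2.5,
    6.2, 3.1, 2.8, 2.3,
    4.9, 2.2, 3.9, 3.1,
    5.0, 2.1, 4.0, 3.0,
    4.8, 2.3, 3.8, 3.2),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("A_1", "A_2", "A_3", "B_1", "B_2", "B_3"),
                  c("the", "of", "and", "to"))
)

# Standardize each word's frequency so that very common words do not dominate
z <- scale(freqs)

# Pairwise distances between texts, then an ordinary hierarchical clustering
d    <- dist(z, method = "manhattan")
tree <- hclust(d, method = "average")

# Cut the tree into two groups and see which texts end up together
groups <- cutree(tree, k = 2)
print(groups)

# plot(tree) would draw the dendrogram
```

A consensus tree repeats this kind of clustering many times under varying settings and keeps only the groupings that turn up consistently, which is why it makes a reasonable sanity check.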


What kind of results can we expect?

For The Inheritors,
"{Ford} did most of the writing himself, though he discussed it extensively with Conrad, whose role, he said, was ‘to give each scene a final tap’ (Saunders 1996, pp. 135–36)" (Rybicki et al. 2014, p. 423; their quote is from Saunders' Ford Madox Ford: A Dual Life, Vol. 1).


In other words, what we have here is akin to a palimpsest -- both authors worked more or less extensively on nearly every scene, and, therefore, both authors' stylistic peculiarities ought to be present pretty much everywhere. Perhaps to the point where one drowns out the other's voice. This should be interesting!

A figure from Rybicki et al. 2014, p. 425 shows what this palimpsest-like style looks like (open in separate tab to embiggen):



The black lines are Conrad's novels; the grey lines are Ford's. The lines closest to the X-axis are the ones whose style is closest to that of The Inheritors; the ones furthest away are the least similar. The X-axis is chronological: 0 is the start of the book; 60,000 is the end.

Overall, Ford's style dominates, but there is a lot of Conrad present, too: his novels are often in second, third, and fourth place.

Let's do a Stylo run! I'll provide the code and the corpus!

What we need to do is a) come up with an authorial signature for Conrad, b) come up with an authorial signature for Ford, and c) slice the collaborative text into smaller samples, each of which will be compared to both signatures and assigned a probabilistic result (e.g. 80% likely Ford; 20% likely Conrad). Fortunately, R:Stylo does that for us with just a few lines of code, and the correct folder names for our corpus.

Create a folder on your computer that will serve as your working folder. In it, create two subfolders named test_set and reference_set (case sensitive!). The folder reference_set is for your training corpus -- the undisputed works from which the software will derive the individual authors' signatures. The folder test_set is where you put the collaborative or disputed text. (I've uploaded this corpus as a zip file here).
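If you'd rather set the folders up from within R, something like the following sketch would do it. The working folder name and the file names are made up, and the files contain placeholder text only. The author_title.txt naming pattern reflects my understanding that Stylo takes the author label from the part of the file name before the underscore, so treat that as an assumption and check the Stylo documentation.

```r
# Hypothetical working folder inside R's temporary directory
work_dir <- file.path(tempdir(), "inheritors_experiment")

# The two subfolder names are the ones described above (case sensitive)
dir.create(file.path(work_dir, "reference_set"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(work_dir, "test_set"), recursive = TRUE, showWarnings = FALSE)

# Placeholder files standing in for the cleaned novels (file names are invented)
writeLines("placeholder text", file.path(work_dir, "reference_set", "conrad_lordjim.txt"))
writeLines("placeholder text", file.path(work_dir, "reference_set", "ford_goodsoldier.txt"))
writeLines("placeholder text", file.path(work_dir, "test_set", "collab_inheritors.txt"))

# Show the resulting layout
print(list.files(work_dir, recursive = TRUE))
```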

Assuming you've installed both R and RStudio (you install stylo by typing install.packages("stylo")), and assuming your two subfolders are set up as described, all the code you need to run a collaborative stylometric check is this:
library(stylo)
setwd("path/to/folder/containing/both/subcorpora")
rolling.classify()

The first line loads the package Stylo into R. The function setwd on the second line sets the working directory to the folder where you've created test_set and reference_set (attn: use forward slashes) and tells R that you want to work within this directory. The function on the third line is all you need to run the entire process using R:Stylo defaults. These defaults are slice size: 5000, slice overlap: 4500, classification method: delta, no sampling in the training set, etc. If you want a different slice size, or a different stylometric technique, you can specify those explicitly. (If you want to know what all the defaults are, type help(rolling.classify) or stylo.default.settings().)
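To make the slice size and overlap concrete: with a slice size of 5000 and an overlap of 4500, each new slice starts 500 words after the previous one. The sketch below works out where the slices would start in a text of 60,000 words (a round figure picked for illustration). The function is hypothetical, and dropping any trailing partial slice is my own simplification; Stylo may treat the end of the text differently.

```r
# Hypothetical helper: start positions (in words) of each rolling slice
slice_starts <- function(n_words, slice_size = 5000, slice_overlap = 4500) {
  step <- slice_size - slice_overlap           # how far the window advances each time
  stopifnot(step > 0, n_words >= slice_size)
  # Assumption: only full-length slices are kept
  seq(from = 1, to = n_words - slice_size + 1, by = step)
}

starts <- slice_starts(60000)
ends   <- starts + 5000 - 1

print(head(data.frame(start = starts, end = ends)))
print(length(starts))  # number of slices, i.e. data points along the graph
```

Each of those slices gets its own classification, which is why the graphs show a running verdict along the length of the novel rather than one verdict per chapter.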

Eder only showed one figure, at 1000MFW. I decided to run five analyses, at 50MFW, 150, 300, 500 and 1000. The novels are split into chunks of 5000 words, and the overlap between chunks is 4500 words (i.e. the window moves forward by 500 words each time). Here are the code and the graphs (right-click to embiggen):

rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 50, classification.method = "Delta")
rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 150, classification.method = "Delta")
rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 300, classification.method = "Delta")
rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 500, classification.method = "Delta")
rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 1000, classification.method = "Delta")


(Note: Burrows' Delta is the default method, so if you run these calls without the argument classification.method you'll get the exact same result.)
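For the curious, here is a toy version of the idea behind Burrows' Delta in base R: standardize each word's frequency against the reference corpus, then take the mean absolute difference in z-scores between the slice and each reference text. All frequencies are invented, the text labels are hypothetical, and Stylo's own implementation surely differs in its details, so read this as a sketch of the principle only.

```r
# Invented relative frequencies (per 100 words) for four reference texts
ref <- matrix(
  c(6.1, 3.2, 2.9, 2.4,
    6.0, 3.3, 3.0, 2.5,
    4.9, 2.2, 3.9, 3.1,
    5.0, 2.1, 4.0, 3.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("conrad_a", "conrad_b", "ford_a", "ford_b"),
                  c("the", "of", "and", "to"))
)

# Invented frequencies for one slice of the disputed text
slice <- c(the = 5.1, of = 2.3, and = 3.8, to = 3.0)

# z-scores use the mean and standard deviation of each word across the reference texts
mu    <- colMeans(ref)
sigma <- apply(ref, 2, sd)
z_ref   <- sweep(sweep(ref, 2, mu, "-"), 2, sigma, "/")
z_slice <- (slice - mu) / sigma

# Delta: mean absolute z-score difference between the slice and each reference text
delta <- apply(z_ref, 1, function(z) mean(abs(z - z_slice)))
print(round(delta, 3))

# The reference text with the smallest Delta is the closest stylistic match
print(names(which.min(delta)))
```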





Conrad's contributions are marked in red; Ford's in green. The three stripes show three analyses at different levels of probability: the bottom stripe is the most probable one, and the second and third stripes are each less likely than the one below.

Even from the thumbnails it is immediately obvious that the results look different at different levels of magnification: when taking into account only the top 50 or 150 words, the software thinks that Conrad wrote most of the book. But as the number of words considered rises, Ford's segments grow to cover much more of the novel. Narrow sections of Ford show up in the second and third stripe, but they only come to the fore at higher MFW.

These may seem like ambiguous results. However, there is not necessarily any real contradiction between the 1000 MFW graph, which assigns great big chunks to Ford, and the low-MFW graphs, which assign large chunks to Conrad. Think about what it is they measure differently: the 50 or 150 most frequent words will, naturally, include a far greater proportion of function words (of, the, to, a, an, ...) than the 1000 most frequent words, which will be dominated by content words.
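You can see the function-word effect even in a silly toy example. The sentence below is made up, and the list of function words is a tiny illustrative one rather than any official inventory; the point is only that the very top of a frequency list is all function words, and that their share drops once you go further down the list.

```r
# Made-up sentence standing in for a novel
text <- "the captain of the ship spoke to the mate of the watch and the mate spoke of the storm to the crew"

# Split into lower-case word tokens and count them
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]
counts <- sort(table(tokens), decreasing = TRUE)
print(counts)

# Tiny illustrative list of function words (an assumption, not a standard list)
function_words <- c("the", "of", "to", "and", "a", "an")

# Hypothetical helper: share of function words among the n most frequent words
share_function <- function(n) {
  mean(names(counts)[seq_len(n)] %in% function_words)
}

print(share_function(2))               # the two most frequent words
print(share_function(length(counts)))  # the whole vocabulary
```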

If we compare this to what the authors said above about their collaboration -- that Ford wrote much of the initial draft, but that it was Conrad who was responsible for knitting the thing together -- then these graphs certainly fit that story.



Another method

Let's see what results we get using another method: Support-Vector Machine. That is a machine learning algorithm used for classifying data into two or more groups; it plots all the data, decides which group(s) of data belong together, and tries to find the best separator between the various groupings.
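To make "finding the best separator" a bit less abstract, here is a bare-bones linear SVM in base R, trained by plain subgradient descent on the hinge loss. This is a teaching sketch and not what Stylo runs internally: the data points are invented (imagine standardized frequencies of two words in six texts), and the function name, learning rate and regularization strength are arbitrary choices of mine.

```r
# Invented 2-feature data: three texts by author +1, three by author -1
X <- rbind(c( 2.0,  1.0), c( 1.5,  2.0), c( 2.5,  1.5),
           c(-1.0, -1.5), c(-2.0, -1.0), c(-1.5, -2.5))
y <- c(1, 1, 1, -1, -1, -1)

# Hypothetical helper: linear soft-margin SVM via full-batch subgradient descent
train_linear_svm <- function(X, y, lambda = 0.01, epochs = 200, lr = 0.1) {
  w <- rep(0, ncol(X))
  b <- 0
  for (i in seq_len(epochs)) {
    margins <- as.vector(y * (X %*% w + b))
    viol <- margins < 1  # points inside the margin or on the wrong side
    # Subgradient of: lambda/2 * |w|^2 + mean(hinge loss)
    grad_w <- lambda * w - colSums(X[viol, , drop = FALSE] * y[viol]) / nrow(X)
    grad_b <- -sum(y[viol]) / nrow(X)
    w <- w - lr * grad_w
    b <- b - lr * grad_b
  }
  list(w = w, b = b)
}

fit <- train_linear_svm(X, y)

# Classify the training texts and one new, invented slice
pred      <- as.vector(sign(X %*% fit$w + fit$b))
new_slice <- c(1.8, 1.2)
new_pred  <- sign(sum(new_slice * fit$w) + fit$b)

print(pred)
print(new_pred)
```

The separator here is the line where X %*% w + b equals zero; texts on one side are assigned to one author, texts on the other side to the other.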

Again, here is the code, and below that are five graphs:

rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 50, classification.method = "SVM")
rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 150, classification.method = "SVM")
rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 300, classification.method = "SVM")
rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 500, classification.method = "SVM")
rolling.classify(slice.size = 5000, slice.overlap = 4500, mfw = 1000, classification.method = "SVM")






We see a similar story here: at low MFW, it is Conrad's style that is judged to be most like that of The Inheritors in large stretches, and Ford's style comes more to the fore as the MFW count goes up.

Notice in the SVM graphs how tall both the green and the red bars are: even in those cases where either one is taller, the other one is not far behind. There are very few places in this novel where one author's signature completely dominates the other's.



Any conclusions?

There are some high-level trends that turn up in both classification methods, and that, therefore, are statements we can make about this novel with some confidence.

For one, the ending of the novel (the final third or so) appears to have been mainly by Conrad: that stretch is mostly red all the way through in both methods and at multiple levels of magnification. The Delta method, which ranks three analyses in decreasing order of probability, judges the final third to be almost entirely by Conrad at all MFW levels and in all stripes within those levels.

Secondly, a first draft of the first two thirds of the novel appears to have been mainly written by Ford (in green), with extensive contributions by Conrad.

That fits entirely with what the authors told us about their collaboration.

The authors' individual styles have, through extensive co-editing, become intertwined: there are few sections of this novel that were mainly written by one person and left largely untouched by the other. Or, as Rybicki et al. put it:
Perhaps of the greatest interest here is not the fact that this or that passage in the collaborative works has been shown to bear one or the other writer’s fingerprint, but, rather, the fact that the two authorial signals are so mixed in the collaborations (Rybicki et al. 2014, p. 430)




References

Eder, Maciej, Jan Rybicki, and Mike Kestemont. 2016. ‘Stylometry with R: A Package for Computational Text Analysis’. The R Journal 8 (1): 107. https://doi.org/10.32614/RJ-2016-007.

Eder, Maciej. 2016. ‘Rolling Stylometry’. Digital Scholarship in the Humanities 31 (3): 457–69. https://doi.org/10.1093/llc/fqv010.

Rybicki, J., D. Hoover, and M. Kestemont. 2014. ‘Collaborative Authorship: Conrad, Ford and Rolling Delta’. Literary and Linguistic Computing 29 (3): 422–31. https://doi.org/10.1093/llc/fqu016.

2 Petroglyph
May 12, 2022, 9:16 pm



Are the other two collaborations any less ambiguous?

Rybicki et al. also look at Romance, another collaboration between Conrad and Ford. What did that process look like?

For Romance, based on Ford’s earlier unfinished Seraphina, however, the consensus seems to be that it is about two-thirds Conrad and one-third Ford. According to the former, "We collaborated right through, but it may be said that the middle part of the book is mainly mine with bits by {Ford Madox Ford}—while the first part is wholly out of 'Seraphina': the second part is almost wholly so. The last part is certainly three quarters MS. F.M.H. with here and there a par. by me" (Karl 1997, p. 147). According to Ford, "parts one, two, three and five are a mosaic of alternately written passages, while part four is entirely Conrad’s work" (Karl 1997, p. 147). (Rybicki et al. 2014, p. 423)


Here is the graph (at 1000 mfw):



So yeah: the first third has portions that are closer to Ford's style than to Conrad's; then there is a large stretch that is mainly Conrad (the secondary green bars are tiny here); and the ending looks like it's mainly Ford's, with some bits by Conrad.

Neat! That fits exactly with the authors' recollections!

And as for Nostromo:

Brice quotes a letter from Ford to Keating (1923 or 1925), saying he wrote 10,000 words of Nostromo that he remembers and that he ‘could place my finger on fairly substantial passages’ (Brice 2004, p. 79), and another 20,000 that he only faintly remembers and would find difficult to trace. Later, in Return to Yesterday, Ford himself minimizes his contribution, saying that what he ‘wrote into Conrad’s books was by no means great in bulk’ (Brice 2004, p. 78) and was ‘so frequently emended out of sight that they could not make as much difference to the completion and glory of his prose as three drops of water poured into a butt of Malmsey’ (Brice 2004, p. 79). (Rybicki et al. 2014, p. 424)


Here's a graph (at 1000MFW):



There's a clear portion just past the halfway point where Stylo recognizes Ford's stylistic signature. Again, whenever Ford is not praising Conrad to high heaven, this graph matches his assessment of his contribution to Nostromo nicely.


Conclusion

This software is capable of recognizing relatively isolated contributions of one writer to another writer's work.

3 Petroglyph
May 12, 2022, 9:24 pm



Let's compare some nonsense

The graphs for The Inheritors may look very ambiguous: the visuals are not as clean and neat as we'd ideally like them to be, or at least not as stable across varying MFW. But that does not mean they are wrong -- it just means that Conrad and Ford collaborated to such an extent that many sections bear both imprints, and each becomes visible at different levels of granularity.

In order to clarify that point a little, I think it is good to throw some obviously wrong data at the problem and see what happens.

What would happen if we gave Stylo a corpus of Jane Austen novels, and a corpus of DH Lawrence novels, and told it to match The Inheritors to those two? What result would we expect?

(The actual novels are these: Austen: Sense (1811), Pride (1813), Mansfield (1814), Emma (1815), Northanger (1818), Susan (1871 posthumous; written 1794); Lawrence: Sons and Lovers (1913), The Rainbow (1915), Women in Love (1920), The Lost Girl (1920), Kangaroo (1923), The Plumed Serpent (1926), and Lady Chatterley's Lover (1928))

Spontaneously, I would expect the software to match The Inheritors much more to Lawrence than to Austen, i.e. the novelist's output that's closest in time to The Inheritors.

Here are four graphs, generated at 50, 150, 250 and 350 MFW; Austen is red, Lawrence is green.




At 50MFW (i.e. mostly function words) Austen crops up for a few significant stretches, but at every higher level of MFW it's all Lawrence. In the Lawrence-dominated stretches, Austen's estimated contribution is fairly low (the red bars are generally low). Clearly, the software matches The Inheritors mainly to Lawrence.

What would the results look like if we graphed The Inheritors as a collaboration between DH Lawrence and Charles Dickens? Here are, again, four graphs at 50, 150, 250 and 350MFW:




At lower MFW, there are big stretches where Dickens is the most likely author, but as the MFW tally goes up, so does Lawrence's estimated contribution. This is as expected: his subcorpus is contemporaneous with The Inheritors. And look at Dickens's red bars: they are taller during the Lawrence-dominated stretches than Austen's were (and Lawrence's are, naturally, a bit shorter: the green and red bars add up to 100%). So the software, while characterizing The Inheritors as more like Lawrence's style, also thinks that that novel looks more like Dickens's style than Austen's.

And finally, let's see what a combination of Jane Austen and Charlotte Brontë (published in the 1840s-50s) gives us (at 350MFW):



Again, it is the most recent author whose style is dominant. Austen's estimated contributions are barely there.

Now, I am not suggesting that Jane Austen, Charlotte Brontë, Charles Dickens or DH Lawrence wrote The Inheritors. But, just as conditions like aphasia can tell us a lot about healthy brains merely by highlighting where exactly things can go wrong, these graphs with garbage input can throw revealing light on the Conrad-and-Ford graphs that are, perhaps, more ambiguous.

The obviously wrong graphs that map out where DH Lawrence's style is more like The Inheritors than Austen's or Dickens's may look a lot cleaner and give straight, neatly-divided-looking answers: the writing style throughout the entire novel is much more like his novels than those of the other authors. But in the case of the kind of collaboration that produced The Inheritors such unambiguousness is precisely the wrong kind of certainty.

4 Petroglyph
May 12, 2022, 9:25 pm

Feel free to suggest future topics for a Lunch Break Experiment (tm) dealing with other known collaborative works that are in the public domain (e.g. Shakespeare and Fletcher, Shakespeare and Marlowe, Stratemeyer Syndicate authors, Ford Madox Ford and Isobel Violet Hunt, David Weber and one of his collaborators if Baen ebooks still exists, H. Rider Haggard and Andrew Lang's The world's desire...) or for which clean, machine-readable copies are readily available. I've a feeling that looking at Wattpad or AO3 might turn up some promising candidates, as well.

5 Crypto-Willobie
May 13, 2022, 2:16 pm

Thank you.

6 prosfilaes
Jul 1, 2023, 11:04 am

>4 Petroglyph: Unfortunately, the case I'm looking at doesn't have good etexts available, even for the PD works. G. D. H. Cole and Margaret Cole are credited as co-authors on a series of mysteries, but she claimed in his biography that most of the books were the work of one or the other of them working alone. It'd be interesting to see what these tools might say.