Authorship Proven by Mathematics

Burrows' Delta helps determine the real author of And Quiet Flows the Don


To mark the 115th anniversary of the birth of Mikhail Sholokhov (1905-1984), linguists Boris Orekhov of HSE and Natalya Velikanova of Moscow State University confirmed his authorship of the epic novel about the Don Cossacks. The researchers attributed the novel using the text distance measure proposed by John Burrows. Termed Burrows' Delta, it provides a simple and reliable method of attributing or confirming the authorship of various texts. Statistical analysis reveals that Burrows' Delta is minimal between And Quiet Flows the Don and Sholokhov's other work, meaning that they were almost certainly written by the same person, while the distances between the epic novel and other authors' texts are significantly greater.

The Case of the Unknown Author

'... And the man who wrote the note is a German. Do you note the peculiar construction of the sentence—“This account of you we have from all quarters received”? A Frenchman or Russian could not have written that. It is the German who is so uncourteous to his verbs'. This is an excerpt from one of Arthur Conan Doyle's early stories about the adventures of Sherlock Holmes and Dr. Watson. The private detective's trained eye immediately determines the nationality of the note’s author, the King of Bohemia, and a new investigation begins.

Attribution is the act of ascribing a work of literature or art to a particular author – or of establishing details about the author and the circumstances in which the work was created.

Usually, philologists and historians—rather than fictional detectives—are those who need to determine the author of a particular work. It is still unknown, for example, whether The Dream of the Red Chamber, one of the pinnacles of Chinese literature, was authored by Cao Xueqin alone, or with contributions from his editors Gao E and Cheng Weiyuan. In the U.S., disputes continue around the authorship of The Federalist Papers, the famous interpretation of the Constitution, and whether it was written by Alexander Hamilton or by James Madison. And perhaps the most famous case is that of Shakespeare: did he actually exist, who wrote his plays and how many authors were involved? 

Some cases are more recent. Literary circles spent a considerable amount of time trying to ascertain whether the famous J.K. Rowling and the little-known author Robert Galbraith were one and the same person. Eventually, linguists identified J.K. Rowling as the author of the crime novel The Cuckoo's Calling long before the official disclosure of her pseudonym. 

Along similar lines, debates have been ongoing for more than 80 years, first in the USSR and then in Russia, about the true authorship of And Quiet Flows the Don, which was written as part of a broader epic about the Don Cossacks and brought Mikhail Sholokhov the Nobel Prize in Literature in 1965. Did this prominent Soviet writer write the novel, or did he plagiarise a manuscript authored by Cossack writer Fyodor Kryukov?

Ghost Fight

Sholokhov was in his early twenties when he wrote the first volume of the epic series published between 1928 and 1940. The outstanding genius and maturity of this literary work, unexpected in an author as young as Sholokhov, stirred envy in many other Soviet writers of the era. Rumours of plagiarism emerged, alleging that the young writer had appropriated a manuscript written by an unknown White Army officer executed by the Bolsheviks. Some people hinted that there was a ghostwriter behind Sholokhov and that the novel was actually written by Alexander Serafimovich, author of The Iron Flood and editor-in-chief of Oktyabr literary magazine, who used Sholokhov's authorship as a cover to avoid damaging his own reputation if things did not go well.

Indeed, concerns about reputation appeared well-justified at the time. Immediately after the book's release, Sholokhov was the victim of ideological harassment. Proponents of 'ideological purity' of Soviet literature in the late 1920s were enraged at the young writer’s sympathetic depiction of the Don Cossacks, who had largely opposed the Communist Revolution. As a result, the release of the novel's third volume was delayed, fuelling even more rumours of plagiarism. But Maxim Gorky intervened on Sholokhov's behalf, and this new volume of the epic was eventually published in 1932. Meanwhile, Sholokhov continued his work on the fourth volume, despite attempts to challenge his authorship of the first three books.

Those who disputed Sholokhov’s authorship of the novel named a few other writers with varying degrees of talent as the potential author. By the early 21st century, three ‘suspects’ remained—Fyodor Kryukov, Alexander Serafimovich and Victor Sevsky (Veniamin Krasnushkin)—believed by some to have written the first volumes of the novel, while Sholokhov supposedly authored its last parts.

Cossack Fyodor Kryukov, a writer and a member of the White Movement, was first named as a potential author in 1937, long after his own death. Sholokhov at that time had only just avoided arrest on trumped-up charges of involvement in a counter-revolutionary conspiracy on the Don. Later, in the 1970s, Kryukov's alleged authorship surfaced once again in Rapids of the Quiet Don (Riddles of the Novel), a book published in Paris by philologist Irina Medvedeva-Tomashevskaya (Alexander Solzhenitsyn also supported her theory). Having analysed the plot of the novel, Medvedeva-Tomashevskaya found what she believed was evidence of dual authorship. According to some other researchers, the novel's real author was journalist Victor Sevsky, a first-hand witness of the Russian Civil War on the Don. Still others insisted that the book was written by Serafimovich.

The debate continued until the end of the 20th century, when the manuscripts of the novel's first two volumes, which had long been considered lost, were discovered. Sholokhov's authorship has largely been unquestioned since then, although a few scholars still insist that he copied his manuscripts from those written by someone else. Only robust and reliable mathematical methods could resolve this longstanding argument.

Enter Mathematicians

While philologists had initially assumed that mathematical statistics could help them avoid confirmation bias, or a tendency to select evidence which supports their existing beliefs, very soon they discovered that computational studies could just as easily be used to manipulate data to make it fit the desired conclusion. That said, computational methods still make it possible to apply a scientific approach by translating intangible literary concepts into measurable categories. 

Russian researchers were the first to apply mathematical methods to literature back in the early 20th century, when they worked to make philology more of an exact science using formal methods. Long before computers and computation, literary theorist Boris Yarkho used a pencil and paper to perform a statistical analysis of texts. Mathematical methods soon became part of the Sholokhov debate.

There have been many attempts to 'measure' Sholokhov's style. Textology, or text linguistics, studies writers’ literary work and accompanying texts such as diary entries, letters, speeches, and others. It also uses stylometry, where statistical analysis is applied to literature to examine the author’s style, including their use of words, sentence structure, punctuation, etc.

Style is something that can be measured by mathematics. Linguist Boris Orekhov demonstrated this with the help of neurolyrics. He used poems by famous authors ranging from Homer to Mandelstam to train a neural network to compose poetry imitating the style of a particular author. While the poems lacked meaning, their style was easily recognisable. For example, most Russians could identify the 'Vladimir Vysotsky style' with its forceful energy, abrupt phrases and expressive choice of words.

Not only literary scholars, but also historians (e.g. in a paper by Andrei Venkov) and mathematicians have attempted to 'test harmony by algebra'. Andrei Zenkov, a physicist and mathematician from Yekaterinburg, developed his own, rather controversial method of stylometry and used it to assert that And Quiet Flows the Don was not written by Sholokhov.

The parents of Anatoly Fomenko, a topologist notorious for his pseudoscientific New Chronology theory, also contributed to the debate. Valentina and Timofey Fomenko argued, based on their calculations, that Tales from the Don, Virgin Soil Upturned and other later works, as well as the last two parts of his Nobel-winning novel were, indeed, authored by Sholokhov, while the first, second and even the beginning of the third volume of And Quiet Flows the Don were written by someone else.

A group of Nordic researchers published a paper analysing sentence lengths in And Quiet Flows the Don, Sholokhov's other texts, and Kryukov's writings. They found that Kryukov's sentences average 13.9 words, while Sholokhov's average 12.9 words, which is closer to the 12.4-word average of the Don epic. The study's authors conclude that, in this respect, Sholokhov's style is closer than Kryukov's to that of And Quiet Flows the Don.

But according to linguist Orekhov, 'there are no guarantees that the average sentence length is even relevant for text attribution'.

The Nordic scholars made some other calculations as well. For example, they compared the distribution of sentence lengths (measured in words) and found that sentences of six to ten words were the most numerous in all texts. Their share was 33.2% in Sholokhov's writings and 32.8% in And Quiet Flows the Don, but just 26.1% in Kryukov's texts. Generally, the distribution curves of And Quiet Flows the Don and Sholokhov's texts closely track each other, while that of Kryukov's texts deviates from both.
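
The two sentence-length statistics described above can be sketched in a few lines of Python. This is only an illustration: the naive splitting on full stops, exclamation marks and question marks, the function names and the sample text are all assumptions, not the Nordic group's actual procedure.

```python
import re

def sentence_lengths(text):
    """Split on ., ! or ? (a naive assumption) and count the words in each sentence."""
    sentences = re.split(r"[.!?]+", text)
    return [len(s.split()) for s in sentences if s.split()]

def mean_length(lengths):
    """Average sentence length in words."""
    return sum(lengths) / len(lengths)

def share_in_range(lengths, low=6, high=10):
    """Fraction of sentences whose length is between low and high words, inclusive."""
    hits = sum(1 for n in lengths if low <= n <= high)
    return hits / len(lengths)

# Hypothetical sample text, not taken from any of the studied works.
sample = "The river was quiet. Nobody spoke for a long time that evening. Why?"
lengths = sentence_lengths(sample)
average = mean_length(lengths)
six_to_ten = share_in_range(lengths)
```

Running the same functions over each author's texts would give figures comparable to the averages and shares quoted above, although real studies need far more careful sentence segmentation.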

The problem with these studies is that none of them can prove that the metrics they use actually solve the attribution problem. Instead, the authors invite everyone to believe their findings because they have used mathematical statistics in their research.

'In their attempts to provide a quantitative answer to the question of the authorship of And Quiet Flows the Don, researchers have invented their own methods either before or during the process of proving their hypothesis', Orekhov explains. 'But it is extremely difficult to do both things at the same time, i.e. to design your method and to use it for text attribution – it's like fighting a war on two fronts simultaneously'.

Triumph of Mathematical Linguistics

Instead of reinventing the wheel, one can turn to the long-established arsenal of mathematical linguistics, in particular to one of its best-known, simplest and most thoroughly tested methods: Burrows' Delta. It was introduced in 2002 by John Burrows, a renowned expert in computational linguistics, in his paper 'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship.

The method is fairly straightforward. First, a list of the words used most frequently in the studied texts is compiled. It can include from 100 to 5000 lexical units, usually the most common pronouns, verbs, nouns, prepositions, conjunctions, etc.

Then the frequencies of these words in the studied text are calculated; together they represent a 'fingerprint' of the author's style. Next, the average frequency of each word across the whole sample of texts and its standard deviation are computed. Finally, the average is subtracted from the word's frequency in the studied text, and the difference is divided by the standard deviation. The result is a standardised score (z-score), or a certain weight assigned to each word.
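
The z-score step can be sketched as follows. The toy corpus, the function names and the use of the population standard deviation are assumptions made for illustration; the article does not say which variant the researchers used.

```python
from collections import Counter
from statistics import mean, pstdev

def relative_frequencies(tokens, vocabulary):
    """Share of each vocabulary word among all tokens of one text."""
    counts = Counter(tokens)
    return {w: counts[w] / len(tokens) for w in vocabulary}

def z_scores(corpus, vocabulary):
    """corpus maps a title to its token list; returns title -> word -> z-score."""
    freqs = {t: relative_frequencies(tok, vocabulary) for t, tok in corpus.items()}
    result = {t: {} for t in corpus}
    for w in vocabulary:
        values = [freqs[t][w] for t in corpus]
        mu = mean(values)          # average frequency of the word in the sample
        sigma = pstdev(values)     # population standard deviation (an assumption)
        for t in corpus:
            result[t][w] = (freqs[t][w] - mu) / sigma if sigma else 0.0
    return result

# Hypothetical three-text corpus, far too small for real attribution.
corpus = {
    "a": "the don and the steppe".split(),
    "b": "and the wind and the river".split(),
    "c": "the night".split(),
}
vocabulary = ["the", "and"]
z = z_scores(corpus, vocabulary)
```

By construction, the z-scores of each word sum to zero across the sample, which is a quick sanity check on any implementation.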

By knowing z-scores for the same set of words used in two different texts, a researcher can compare them by calculating the Manhattan distance, aka Burrows' Delta. Its mathematical formula is shown below.
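
In its commonly cited form, Burrows' Delta is written as the mean absolute difference between the z-scores of two texts. The notation below is an assumption chosen to match the description above: A and B are the two texts, n is the number of words in the list, f_i(T) is the frequency of word i in text T, and mu_i and sigma_i are that word's mean and standard deviation across the sample. Without the 1/n factor, the sum is exactly the Manhattan distance.

```latex
\Delta(A, B) = \frac{1}{n} \sum_{i=1}^{n} \left| z_i(A) - z_i(B) \right|,
\qquad
z_i(T) = \frac{f_i(T) - \mu_i}{\sigma_i}
```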

It may look sophisticated but is in fact fairly easy to understand. Imagine a chessboard with a rook in one of its squares. The Manhattan distance between two squares on the chessboard is equal to the minimum number of moves needed for a rook to travel from one square to another (assuming that one move takes the rook to an adjacent square).

We can now place certain frequent words from the two studied texts on the chessboard. If the distance between them is one to three moves, chances are that both texts are written by the same person. But if the distance is four to eight moves, the texts are almost certainly written by different authors. Successively comparing two sets of words makes it possible to calculate a multidimensional distance, and the shorter this is, the more similar the two texts are and the higher the probability that both can be attributed to one and the same author.
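
The chessboard picture and its multidimensional extension can be sketched in Python. The z-score values here are hypothetical, and averaging over the word list follows the commonly cited definition of Delta rather than anything stated in the article.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def burrows_delta(z_a, z_b):
    """Mean absolute difference between two texts' z-scores over the same word list."""
    if z_a.keys() != z_b.keys():
        raise ValueError("both texts must be scored on the same word list")
    words = sorted(z_a)
    total = manhattan([z_a[w] for w in words], [z_b[w] for w in words])
    return total / len(words)

# Rook moving one square at a time from (0, 0) to (2, 3): 2 + 3 single steps.
rook_steps = manhattan((0, 0), (2, 3))

# Hypothetical z-scores for three frequent words in two texts.
text_a = {"and": 0.5, "the": -1.0, "in": 0.2}
text_b = {"and": 0.1, "the": -0.4, "in": 1.0}
delta = burrows_delta(text_a, text_b)
```

Each word contributes one axis, so a list of 200 words gives a 200-dimensional board on which the same step-counting rule applies.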

Burrows' Delta measures stylistic differences and makes it possible to distinguish not only between different authors but also between different texts by the same author. It is one of the most frequently used and researched methods of literary attribution.

Burrows' Delta was used to determine Joanne Rowling's authorship of The Cuckoo's Calling. Dozens of studies have recently been published featuring successful application of this method in stylometry, in particular for authorship attribution.

Anniversary Delta

Boris Orekhov and Natalya Velikanova used Burrows' Delta to attribute the authorship of And Quiet Flows the Don. They formed a set of the 200 most frequent words, obtained z-scores and calculated the distances between the texts authored by Sholokhov (Tales from the Don, The Fate of Man, They Fought for Their Motherland, and Virgin Soil Upturned), by some of his prominent contemporaries (Mikhail Bulgakov, Leonid Leonov, Andrey Platonov, Vsevolod Ivanov, Nikolai Ostrovsky, Alexander Fadeev), and by the alleged authors of And Quiet Flows the Don, such as Fyodor Kryukov, Victor Sevsky and Alexander Serafimovich. Then the researchers compared the distances for each pair of texts.

Such distances are always shorter for different texts by the same author than for texts by different authors. The distance between Bulgakov's The Master and Margarita and The White Guard is quite small at 0.7, and the distance between Sholokhov's Tales from the Don and the first volume of And Quiet Flows the Don is even less at just 0.57, meaning that we can be confident that the same person authored both.

It was particularly important to compare the Nobel-winning novel with Tales from the Don, written at approximately the same time and set in the same context. In contrast, the distance between And Quiet Flows the Don and Fyodor Kryukov's writings is considerable, ranging from 0.89 to 1.27, which indicates totally different styles. Likewise, the distances from texts by Serafimovich (0.9 to 1.17) and Sevsky (1.09 to 1.29) are quite large.
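
The comparison above amounts to ranking candidates by their distance to the disputed novel. A minimal sketch using only the figures quoted in this article follows. Note that the 0.57 value refers specifically to Tales from the Don versus the first volume, and the variable names are illustrative.

```python
# (lowest, highest) Delta to And Quiet Flows the Don, as reported in the article.
reported = {
    "Sholokhov (Tales from the Don)": (0.57, 0.57),  # vs. the first volume
    "Kryukov": (0.89, 1.27),
    "Serafimovich": (0.90, 1.17),
    "Sevsky": (1.09, 1.29),
}

# Rank candidates by their smallest reported distance.
ranking = sorted(reported, key=lambda name: reported[name][0])
closest = ranking[0]

# Margin between the closest candidate and the best value of the runner-up.
gap = reported[ranking[1]][0] - reported[closest][1]
```

Even the most favourable value for any other candidate is well above the Sholokhov figure, which is what the ranking makes explicit.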

Tree of Writers

A dendrogram can be used to visualise distances between texts and authors. Texts which are similar in style are clustered together and similar works by different authors are close to one another. This dendrogram can be based on short texts as well as novels, as long as the total volume of text by one author is at least 10,000 words, the lowest threshold at which Burrows' Delta can produce reliable results.
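
The grouping behind a dendrogram can be sketched with a small agglomerative clustering routine. The article does not say which linkage rule the researchers used, so average linkage is an assumption here, and the four text labels and their distances are hypothetical.

```python
def average_linkage(cluster_a, cluster_b, dist):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(x, y) for x in cluster_a for y in cluster_b]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

def agglomerate(labels, dist):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical Delta values: A1 and A2 by one author, B1 and B2 by another.
labels = ["A1", "A2", "B1", "B2"]
dist = {
    frozenset(("A1", "A2")): 0.6,
    frozenset(("B1", "B2")): 0.7,
    frozenset(("A1", "B1")): 1.1,
    frozenset(("A1", "B2")): 1.2,
    frozenset(("A2", "B1")): 1.0,
    frozenset(("A2", "B2")): 1.3,
}
merges = agglomerate(labels, dist)
```

The order and height of the merges are what a dendrogram draws: texts joined early and at a low height sit on neighbouring branches, while the final merge links the two authors' groups.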

It can be seen on the dendrogram that Bulgakov’s novels, although very different, are still clustered together. The same is true of books written by Leonov and Vsevolod Ivanov. Kryukov’s texts form a separate cluster and are quite remote from And Quiet Flows the Don. Serafimovich’s and Sevsky’s texts are somewhat closer to And Quiet Flows the Don but not in the same cluster.

But Sholokhov's Tales from the Don and the disputed novel are in the same cluster. Their proximity provides the most reliable evidence of Sholokhov's authorship of both. There is one surprising discovery, however. 'We find it somewhat strange that Tales from the Don and Sholokhov's later texts, such as The Fate of Man and They Fought for Their Motherland, ended up in different clusters', Orekhov notes.

So here are the main findings from the study. And Quiet Flows the Don and Tales from the Don were written by the same person beyond almost any doubt. All the volumes of the And Quiet Flows the Don series have the same author. This author is definitely neither Sevsky nor Kryukov. The most likely author is Sholokhov. Other writers simply have no chance – at least, according to mathematics. And this comes as a welcome gift for the anniversary of the famous writer of the epic about the Don Cossacks!


Study authors:
Boris Orekhov, PhD in Literature, Associate Professor, HSE School of Linguistics
Natalia Velikanova, PhD in Literature, Lomonosov Moscow State University
Authors: Daniil Kuznetsov, Olga Sobolevskaya, June 12