BookID, Book IDunno

In my previous post, I raised a few questions about BookID regarding potential false positives and unjustified removals. Coker has provided numbers for Smashwords removals, but there was no data on the validity of each of those cases. In my search for that data, I found an article from Nate Hoffelder at The Digital Reader, comparing BookID to YouTube’s ContentID. Scribd’s system (“an automated that checks user uploads for pirated content”—an automated what? My dear, you seem to have forgotten something) works “too well,” he writes, and references Smashwords users who have posted details about receiving false positives on their books.

Hoffelder briefly comments on the implications I raised earlier, albeit with a slightly different focus, regarding my skepticism about treating the Smashwords document as the master text:

“this has led to more than a few problems. Whenever an author quoted a court document, public domain work, or other legitimately copyright-free document in their book, Scribd logs the quoted text as belonging to the author and their automated system flags and removes any user-uploaded documents that contain the same text.”

I don’t want to belabour this point, but it is an important side effect Scribd needs to take into account with future software updates.

Hoffelder spends the remainder of the post reflecting on YouTube’s numerous unfair removals. As interesting as the comparison is, he unfortunately devotes almost as many words to ContentID as he does to BookID. He concludes with a verdict on digital fingerprint software in general, based primarily on ContentID but reinforced by BookID: he is critical of systems that make decisions without any human element. While it is unfortunate that legitimate uploads are being taken down, what is the alternative? Hoffelder doesn’t offer any solution. He claims the human decision maker is necessary, but with 48,000 Smashwords removals alone, the time and labour commitment required would be astronomical. It is simply not feasible to get a person involved in every single removal. It would be far more practical to allow a user to file a dispute (Dammit Jim, I’m a writer, not a pirate) over a removal, and get a Scribd employee involved at that point.

One other aspect of BookID that concerns me (which Hoffelder does not address) is the focus on uploads, and not downloads. BookID detects pirated material being added to the site (false positives aside), but does not seem able to prevent users from downloading and illegally copying legitimate texts in the first place. I’m no authority on the finer points of DRM, but I wonder how well (if at all) the software can prevent book piracy at an even earlier stage. I am aware of the problematic nature of DRM itself, because it also limits legitimate usage, but perhaps BookID needs to take downloading into account just as much as uploading.

Hoffelder raises some important points about the weaknesses of BookID, but leaves me wanting. His post is brief and barely skims the surface. It’s easy to be critical of Scribd, especially with the bad press they’ve received over the last five years, but they’ve taken a step in the right direction. The software is far from flawless, but the fact remains they are on the front lines of the battle against book piracy.


Smashwords the System

On May 11th, 2014, Smashwords founder Mark Coker updated the official blog to address concerns from the Smashwords community regarding the December 2013 announcement that the company would be joining the Scribd subscription service. Coker briefly explains that the tension centres around Scribd users, many of whom are uploading content that violates copyright. Many of Smashwords’ authors expressed anxieties, having seen their books posted illegally, and criticized the decision to join Scribd. Coker emphasizes that “we wouldn’t have partnered with Scribd if we weren’t confident their heart was in the right place, and if we weren’t confident our relationship with Scribd would benefit all indie authors.”

Coker then moves into a summary and discussion of Scribd’s new digital fingerprint software, called BookID. He explains roughly how the software works:

“BookID automatically scans all Smashwords-delivered books, and analyzes the text for semantic data such as word count, letter frequency, phrases, and other elements. BookID then creates a digital fingerprint of the authorized Smashwords book, and uses this fingerprint to automatically detect and remove unauthorized versions. It proactively removes all files at Scribd that match the same fingerprint, and also uses this fingerprint to proactively block the upload of future unauthorized versions.”
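To make Coker's description a little more concrete, here is a toy sketch of what a text fingerprint built from word count, letter frequency, and phrases might look like. To be clear, this is not Scribd's implementation, which has not been published. The function names, the five-word phrase length, and the 0.8 match threshold are all my own assumptions for illustration.

```python
from collections import Counter
import re


def fingerprint(text, phrase_len=5):
    """Build a toy fingerprint from the elements Coker lists.

    phrase_len is an assumed value, not anything Scribd has documented.
    """
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    letters = Counter(ch for ch in lowered if ch.isalpha())
    # Every run of phrase_len consecutive words counts as one "phrase".
    phrases = {
        " ".join(words[i:i + phrase_len])
        for i in range(len(words) - phrase_len + 1)
    }
    return {"word_count": len(words), "letters": letters, "phrases": phrases}


def phrase_overlap(fp_a, fp_b):
    """Jaccard similarity of the two phrase sets, from 0.0 to 1.0."""
    a, b = fp_a["phrases"], fp_b["phrases"]
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_like_copy(master_fp, upload_fp, threshold=0.8):
    """Flag an upload whose phrases mostly match the master (assumed threshold)."""
    return phrase_overlap(master_fp, upload_fp) >= threshold
```

In this sketch the decision rests entirely on shared phrases, so any two documents with a lot of text in common will look alike, whoever wrote that text first.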

He then provides some hard data about the number of unauthorized uploads that have now been removed. As of his post, Scribd has taken down about 48,000 copies of Smashwords books alone. Sounds impressive. Except, that’s just the successful Smashwords takedowns. How many pirated copies still haven’t been recognized? How many pirated documents do they have across the board? How did those numbers get so high?

Coker acknowledges that “no automated scanning system will every [sic] be 100% accurate,” but remains confident that Scribd will continue to increase the number of cases they catch. Typos aside, Coker’s account of the matter is highly optimistic, and makes a valiant effort to defend his new business partner while reassuring his writers of the safety of their work. Actually, typos not aside, because the situational irony is too glorious to ignore. In an argument about the impossibility of accuracy with automation, he misspells a word he has manually typed.

As mentioned in my previous post, I don’t want to paint Scribd as the devil, but they have indirectly enabled piracy on a grand scale. It reflects well on Scribd, having developed BookID, but I am highly skeptical of Coker’s sunny outlook. There are two primary reasons this Always Look on the Bright Side of Life is illogical, Captain. First, the software does not allow for similarities in story arcs. Many stories, especially those passed down through oral culture and mythology, carry strong resemblances. I cannot imagine the system would be able to cope, and I suspect there have been and will be a high number of false positives. How will it cope with quotes, or stories within stories? If a character is retelling a myth, for example, will BookID flag it because that same myth has appeared in another book?

Second, the software uses the Smashwords document as the Rosetta Stone to find all the pirated versions. While I agree that this method is effective for locating illegal uploads of that original document, it is perhaps dangerous to put Smashwords on such a high pedestal. They are an ebook distributor primarily for independent and self-published authors. I do not by any means wish to imply the work is of a lower quality, but I have no way of knowing that every book has been checked for cases of plagiarism. If Scribd and Smashwords accept books without question, then use those books to create the fingerprints, what if that master text has stolen material from a book by a different publisher? In the future, if that other publisher uploads that book, it will be unjustly removed while the real case of plagiarism remains.
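The scenario I am worried about can be sketched in a few lines. This is a hypothetical model, not a description of how BookID actually behaves: the registry class, the four-word phrase length, and the 0.5 threshold are all assumptions of mine. The point is only that a system which trusts whichever file it is handed as the master has no way to tell the original from the copy.

```python
import re


def phrases(text, n=4):
    """Return the set of n-word phrases in the text (n is an assumed value)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


class FingerprintRegistry:
    """Hypothetical registry: any registered file is trusted as the master."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.masters = {}  # title -> set of phrases

    def register(self, title, text):
        self.masters[title] = phrases(text)

    def check_upload(self, text):
        """Return the title of the master this upload matches, or None."""
        upload = phrases(text)
        if not upload:
            return None
        for title, master in self.masters.items():
            # Share of the upload's phrases that also appear in the master.
            overlap = len(upload & master) / len(upload)
            if overlap >= self.threshold:
                return title
        return None


# The original work, published elsewhere and not yet known to the registry.
original = ("in the beginning the river ran backwards "
            "and the old gods slept beneath the stones")

registry = FingerprintRegistry()
# A plagiarized book arrives first and becomes the trusted master.
registry.register("Copied Book", original + " with a new ending tacked on")

# When the real author uploads later, the original is the one that gets flagged.
print(registry.check_upload(original))
```

Under these assumptions the upload of the original is reported as a match for "Copied Book", which is exactly the wrong way round.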

Are cases such as these even possible? Or likely? I haven’t a clue. The software is too new, and the public conversation is too little. There is simply not enough information to warrant the amount of confidence that Coker has. I, therefore, choose to remain a skeptic. That said, I really do give Scribd a pat on the back for trying, and will keep a sharp eye on their development.

Know Your Enemy

Since the launch of their eBook subscription service in 2013, Scribd has been receiving a lot of attention, both positive and negative. On January 9th, 2014, an article by Calvin Reid appeared in Publishers Weekly in response to a recent blog post by Michael Capobianco on Writer Beware.

Capobianco’s post was highly critical of the new model, in two primary respects. First, Capobianco reflects on the confusing royalty structure, which is based partly on standard eBook royalty rates of 25%, and partly on the percentage read for each book. That is, if a subscriber downloads a book, but reads less than 30% of it, no “sale” is counted, and the author does not receive any royalty (this is Capobianco’s analysis, not mine). Second, Capobianco criticizes piracy problems that have remained unresolved since the copyright infringement lawsuit launched in 2009 against Scribd.
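As a rough illustration of the rule as Capobianco describes it, the threshold works something like the sketch below. The 25% rate and the 30% cut-off come from his analysis. Everything else is my assumption, including the function name and the idea that the royalty is calculated on the book's list price, which may not be how the actual contracts work.

```python
def royalty_for_read(list_price, fraction_read,
                     royalty_rate=0.25, sale_threshold=0.30):
    """Royalty for one subscriber read, under the rule as Capobianco describes it.

    Assumption: the royalty is a flat share of the list price once the
    reader passes the threshold. The real terms may differ.
    """
    if not 0.0 <= fraction_read <= 1.0:
        raise ValueError("fraction_read must be between 0.0 and 1.0")
    if fraction_read < sale_threshold:
        return 0.0  # below the threshold, no "sale" is counted
    return round(list_price * royalty_rate, 2)


# A reader who stops just short of the threshold earns the author nothing.
print(royalty_for_read(8.00, 0.29))
# A reader who reaches it triggers a full "sale".
print(royalty_for_read(8.00, 0.30))
```

The all-or-nothing jump at the threshold is what makes the structure feel so arbitrary from an author's point of view.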

In his article, Reid recaps the response from Scribd’s VP of Content Acquisition, Andrew Weinstein. Weinstein acknowledges the existence of pirated material within their subscription service, but insists that Scribd is not encouraging illegal uploads. He explains that Scribd is working with publishers to develop digital fingerprint software that will prevent unauthorized content from appearing on their site. For Weinstein, the difficulty with this process is the sheer volume of works, most of which are not recognized by the software. Finally, he expresses a desire to change Scribd’s damaged reputation by becoming more public about their efforts to eliminate copyright infringement.

Reid’s article does two things very well. First, he gives voice to Scribd. Capobianco’s post, while in many respects justified, is harsh. It raises many concerns about the subscription’s implications for authors and readers. By contacting Scribd, Reid allows Weinstein a chance to defend Scribd against Capobianco’s accusations. Capobianco, after all, has limited access to Scribd’s operating policies. Reid gives Scribd a chance to show the steps they are taking to resolve many of the issues raised in the blog post, while demonstrating that Weinstein is being realistic about the problems he and his company are facing. Second, Reid exposes a major flaw in the proposed copyright fingerprint system. He recognizes the extreme limitations of building fingerprints from publishers’ files, because this means that Scribd cannot recognize infringed material unless the publisher has already sent them the original file.

While I support Reid’s efforts in the above regards, I take issue with two other aspects of the article. It concludes with a final quote from Capobianco in response to Weinstein’s defence:

“despite Scribd efforts, there are still many, many pirated works available through their service, and it’s troubling that they have included them, along with other works uploaded by users, in their paid subscription service.” Capobianco added, “I am amazed that they do this so boldly, and that the legitimate publishers who are offering books through their subscription service apparently don’t care. I hope Scribd re-evaluates its policies and removes all unauthorized versions of copyrighted works from its subscriber service.”

It is troubling, yes. I, too, hope Scribd removes all copyright violations. However, Capobianco’s secondary arguments are unjust. His twofold claim, that Scribd is deliberately and “boldly” capitalizing on piracy and that publishers “don’t care,” is an ungrounded and unfair accusation. Scribd is in the process of developing a kind of digital rights management software, which is a gruelling and expensive endeavour. They are working with publishers, using the publishers’ files to ensure those works are not illegally uploaded in the future. It is unlikely that publishers are turning a blind eye; it is more likely that they are taking proactive steps to prevent copyright violations of their published material.

The second concern I have about these articles is this: both Reid and Capobianco fail to acknowledge that piracy is not a new problem. Both blame Scribd for piracy, but in doing so, they are both misrepresenting the situation. Scribd is not the enemy. I agree that they are in the middle of a serious problem. Pirated material exists on their site, and they need to be held accountable for allowing this material to exist. However, the real problem is piracy itself. Book pirating has existed for centuries, and perhaps piracy, in general, will just always be a problem.

I’m certainly not arguing that Scribd’s hands are clean, but that we must remember they are fighting a much larger matter.