BookID, Book IDunno

In my previous post, I raised a few questions about BookID regarding potential false positives and unjustified removals. Coker has provided numbers for Smashwords removals, but there was no data on the validity of each of those cases. In my search for that data, I found an article from Nate Hoffelder at The Digital Reader, comparing BookID to YouTube’s ContentID. Scribd’s system (“an automated that checks user uploads for pirated content”—an automated what? My dear, you seem to have forgotten something) works “too well,” he writes, and references Smashwords users who have posted details about receiving false positives on their books.

Hoffelder briefly touches on the implications I raised earlier, albeit with a slightly different focus, regarding my skepticism about treating the Smashwords document as the master text:

“this has led to more than a few problems. Whenever an author quoted a court document, public domain work, or other legitimately copyright-free document in their book, Scribd logs the quoted text as belonging to the author and their automated system flags and removes any user-uploaded documents that contain the same text.”
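To make the problem Hoffelder describes concrete, here is a minimal sketch of how a quoted public domain passage could trip a text-matching system. I have no knowledge of BookID's internals, so everything here is an assumption for illustration: the four-word "shingle" comparison, the names `shingles` and `containment`, and the 0.8 threshold are all hypothetical.

```python
import re

def shingles(text, n=4):
    """Return the set of overlapping n-word sequences in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(upload, master, n=4):
    """Fraction of the upload's n-word shingles that also appear in the master."""
    up = shingles(upload, n)
    if not up:
        return 0.0
    return len(up & shingles(master, n)) / len(up)

# The opening line of Pride and Prejudice, long in the public domain.
PUBLIC_DOMAIN = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)

# A hypothetical author's book that quotes the line in full.
author_book = (
    "My heroine loved one opening line above all others. "
    + PUBLIC_DOMAIN
    + " She read it aloud every morning."
)

# A different user uploads only the public domain text.
user_upload = PUBLIC_DOMAIN

THRESHOLD = 0.8  # hypothetical cut-off, not a known BookID setting
score = containment(user_upload, author_book)
flagged = score >= THRESHOLD
print(f"containment={score:.2f} flagged={flagged}")
```

Every phrase in the upload also appears in the author's book, so a system that treats the author's book as the master text has no way to tell that the shared passage belongs to neither of them.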

I don’t want to belabour this point, but it is an important side effect Scribd needs to take into account with future software updates.

Hoffelder spends the remainder of the post reflecting on YouTube’s numerous unfair removals. As interesting as the comparison is, he unfortunately devotes almost as many words to ContentID as he does to BookID. He concludes with a verdict on digital fingerprint software in general, based primarily on ContentID and reinforced by BookID: he is critical of systems that make decisions without any human element. While it is unfortunate that legitimate uploads are being taken down, what is the alternative? Hoffelder doesn’t offer any solution. He claims a human decision maker is necessary, but with 48,000 Smashwords removals alone, the time and labour required would be astronomical. It is simply not feasible to get a person involved in every single removal. It would be far more practical to allow a user to file a dispute (Dammit Jim, I’m a writer, not a pirate) over a removal, and get a Scribd employee involved at that point.

One other aspect of BookID that concerns me (which Hoffelder does not address) is the focus on uploads, and not downloads. BookID detects pirated material being added to the site (false positives aside), but does not seem able to prevent users from downloading and illegally copying legitimate texts in the first place. I’m no authority on the finer points of DRM, but I wonder how well (if at all) the software can prevent book piracy at that earlier stage. I am aware of the problematic nature of DRM itself, because it also limits legitimate usage, but perhaps BookID needs to take downloading into account just as much as uploading.

Hoffelder raises some important points about the weaknesses of BookID, but leaves me wanting. His post is brief and barely skims the surface. It’s easy to be critical of Scribd, especially with the bad press they’ve received over the last five years, but they’ve taken a step in the right direction. The software is far from flawless, but the fact remains they are on the front lines of the battle against book piracy.


Smashwords the System

On May 11th, 2014, Smashwords founder Mark Coker updated the official blog to address concerns from the Smashwords community regarding the December 2013 announcement that the company would be joining the Scribd subscription service. Coker briefly explains that the tension centres around Scribd users, many of whom are uploading content that violates copyright. Many of Smashwords’ authors expressed anxieties, having seen their books posted illegally, and criticized the decision to join Scribd. Coker emphasizes that “we wouldn’t have partnered with Scribd if we weren’t confident their heart was in the right place, and if we weren’t confident our relationship with Scribd would benefit all indie authors.”

Coker then moves into a summary and discussion of Scribd’s new digital fingerprint software, called BookID. He explains roughly how the software works:

“BookID automatically scans all Smashwords-delivered books, and analyzes the text for semantic data such as word count, letter frequency, phrases, and other elements. BookID then creates a digital fingerprint of the authorized Smashwords book, and uses this fingerprint to automatically detect and remove unauthorized versions. It proactively removes all files at Scribd that match the same fingerprint, and also uses this fingerprint to proactively block the upload of future unauthorized versions.”
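Coker's description is only a rough outline, so the following is a minimal sketch of what a fingerprint built from the features he names (word count, letter frequency, phrases) might look like. It is not Scribd's implementation: the function names `fingerprint` and `phrase_overlap`, the three-word phrase length, and the use of Jaccard similarity to compare phrase sets are all my own assumptions.

```python
import re
from collections import Counter

def fingerprint(text, phrase_len=3):
    """Build a toy fingerprint from the features Coker lists (illustrative only)."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    letters = Counter(ch for ch in lowered if ch.isalpha())
    total_letters = sum(letters.values()) or 1  # avoid dividing by zero on empty text
    phrases = {
        " ".join(words[i:i + phrase_len])
        for i in range(len(words) - phrase_len + 1)
    }
    return {
        "word_count": len(words),
        "letter_freq": {ch: n / total_letters for ch, n in letters.items()},
        "phrases": phrases,
    }

def phrase_overlap(fp_a, fp_b):
    """Jaccard similarity of two fingerprints' phrase sets (0.0 to 1.0)."""
    a, b = fp_a["phrases"], fp_b["phrases"]
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical usage: an authorized book and a re-uploaded copy in capitals.
authorized = fingerprint("The quick brown fox jumps over the lazy dog")
reupload = fingerprint("THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG")
print(authorized["word_count"], phrase_overlap(authorized, reupload))
```

A real system would presumably need to tolerate reformatting, missing chapters, and scanning errors, which is exactly where the choice of features and thresholds starts to produce either missed copies or false positives.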

He then provides some hard data about the number of unauthorized uploads that have now been removed. As of his post, Scribd has taken down about 48,000 copies of Smashwords books alone. Sounds impressive. Except, that’s just the successful Smashwords takedowns. How many pirated copies still haven’t been recognized? How many pirated documents do they have across the board? How did those numbers get so high?

Coker acknowledges that “no automated scanning system will every [sic] be 100% accurate,” but remains confident that Scribd will continue to increase the number of cases they catch. Typos aside, Coker’s account of the matter is highly optimistic, and makes a valiant effort to defend his new business partner while reassuring his writers of the safety of their work. Actually, typos not aside, because the situational irony is too glorious to ignore. In an argument about the impossibility of accuracy with automation, he misspells a word he has manually typed.

As mentioned in my previous post, I don’t want to paint Scribd as the devil, but they have indirectly enabled piracy on a grand scale. Developing BookID reflects well on Scribd, but I am highly skeptical of Coker’s sunny outlook. There are two primary reasons this Always Look on the Bright Side of Life attitude is illogical, Captain. First, the software does not allow for similarities in story arcs. Many stories, especially those passed down through oral culture and mythology, carry strong resemblances. I cannot imagine the system would be able to cope, and I suspect there have been and will be a high number of false positives. How will it cope with quotes, or stories within stories? If a character is retelling a myth, for example, will BookID flag it because that same myth has appeared in another book?

Second, the software uses the Smashwords document as the Rosetta Stone to find all the pirated versions. While I agree that this method is effective for locating illegal uploads of that original document, it is perhaps dangerous to put Smashwords on such a high pedestal. They are an ebook distributor primarily for independent and self-published authors. I do not by any means wish to imply the work is of a lower quality, but I have no way of knowing that every book has been checked for cases of plagiarism. If Scribd and Smashwords accept books without question, then use those books to create the fingerprints, what if that master text has stolen material from a book by a different publisher? In the future, if that other publisher uploads that book, it will be unjustly removed while the real case of plagiarism remains.

Are cases such as these even possible? Or likely? I haven’t a clue. The software is too new, and there has been too little public conversation. There is simply not enough information to warrant the degree of confidence that Coker has. I, therefore, choose to remain a skeptic. That said, I really do give Scribd a pat on the back for trying, and will keep a sharp eye on their development.