I didn’t realize that the University of Toronto and MIT have been busy stealing books from Smashwords. From Publishers’ Marketplace:
“Compiled in 2014 by researchers at the University of Toronto and MIT, “BookCorpus” should have been called “Stolen from Smashwords”. The researchers apparently scraped posted, self-published ebooks posted by Smashwords that were being offered to read for free—even though doing so violated the terms of service. Adding insult to injury, the original researchers who compiled BookCorpus claimed in their research paper, “These are free books written by yet unpublished authors,” not recognizing that the authors had self-published, and presumably further rationalizing why they gave no thought to intellectual property rights.”
Multiple Large Language Models have been trained on BookCorpus, and there have been subsequent informal variants on BookCorpus that some authors report have included more recently-published books, also scraped from Smashwords.