This may not actually be true though. If it’s a Q&A interface, it’s very unlikely they are training the model on the entire work (since model training is extremely expensive and done extremely infrequently). Now sure, maybe they actually are training on NYT articles, but a similarly powerful LLM could exist without training on those articles and still answer questions about them.
Suppose you wanted to make your own Bing Chat. If you tried to answer questions based entirely on what the model was trained on, you’d get crap results, because the model may not have seen any new data in over 2 years. More likely, you’d use retrieval-augmented generation (RAG): selecting portions of articles, generally the ones that came back from your search results, and providing them as context to your LLM.
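To make that concrete, here’s a toy sketch of the RAG flow described above. The keyword-overlap scorer is a stand-in for a real search index, and the LLM call itself is omitted; the names (`retrieve`, `build_prompt`) are made up for illustration.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split text into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, articles: list[str], k: int = 2) -> list[str]:
    """Rank articles by word overlap with the query; return the top k.
    A real system would use a search engine or vector index instead."""
    q_words = tokenize(query)
    scored = sorted(articles,
                    key=lambda a: len(q_words & tokenize(a)),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, articles: list[str]) -> str:
    """Inject only the retrieved excerpts as context, rather than
    relying on whatever the model memorized during training."""
    context = "\n---\n".join(retrieve(query, articles))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

articles = [
    "The city council approved a new transit budget on Tuesday.",
    "A study finds coffee consumption is rising among young adults.",
    "Transit ridership rebounded sharply after the budget approval.",
]
prompt = build_prompt("What happened with the transit budget?", articles)
# The prompt now contains only the two transit-related excerpts;
# the (omitted) LLM call would answer from those, not from its weights.
```

The point is that the answer is grounded in whichever excerpts the retrieval step surfaces, not in the full text of any article baked into the model.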
Also, the argument that these are derivative works seems a bit iffy. Derivative works use substantial portions of the original work, but generally speaking a Q&A interface like this would be purely generative. With certain carefully crafted prompts, it may be able to reproduce portions of the original work, but assuming they’re using RAG, it’s extremely unlikely it would generate the exact same content that’s in the article, because it wouldn’t have the entirety of the article available for generation anyway.
How is this any different from a person scanning an article and writing their own summary based on what they read? Is doing so a violation of copyright, and if so, aren’t news outlets especially notorious for doing this (writing articles based on the articles put out by other news outlets)?
Edit: I should probably add as well that search engines have been indexing and training models on the content they crawl for years, and that never seemed to cause anyone to complain about copyright. It’s interesting to me that it’s suddenly a problem now.