Summary: A recent study by Truffle Security found nearly 12,000 valid API keys and passwords in the Common Crawl dataset, which is widely used to train AI models. The findings highlight the risk of insecure coding practices, particularly hardcoding secrets into front-end applications. Pre-processing by model developers cannot guarantee that every secret is filtered out of a dataset this large.
Affected: Common Crawl dataset, AI model training organizations (OpenAI, Google, etc.)
Keypoints:
- Truffle Security identified 11,908 valid secrets, such as AWS and MailChimp API keys, from the Common Crawl dataset.
- 63% of the discovered secrets appeared on more than one web page, with one WalkScore API key appearing over 57,000 times across multiple subdomains.
- The research prompted impacted vendors to revoke thousands of compromised keys to mitigate potential security risks.
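
To illustrate how a hardcoded secret can be found in crawled pages and how reuse across pages can be measured, here is a minimal Python sketch. It is not Truffle Security's method: the two regular expressions are simplified assumptions about key formats, the names `PATTERNS`, `find_candidates`, `reuse_share`, and `SAMPLE_PAGES` are hypothetical, and the sample pages and keys are made up. Unlike the study, the sketch does not check whether a candidate key is actually valid.

```python
import re
from collections import defaultdict

# Simplified, illustrative patterns (assumptions). Real scanners use many
# more detectors and verify each candidate against the vendor's API.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_candidates(pages):
    """Map each (kind, secret) pair to the set of page URLs where it appears."""
    seen = defaultdict(set)
    for url, text in pages.items():
        for kind, pattern in PATTERNS.items():
            for match in pattern.findall(text):
                seen[(kind, match)].add(url)
    return dict(seen)


def reuse_share(seen):
    """Fraction of distinct candidates that appear on more than one page."""
    if not seen:
        return 0.0
    reused = sum(1 for urls in seen.values() if len(urls) > 1)
    return reused / len(seen)


# Made-up pages with placeholder keys (the AWS value is the documentation
# example key, not a real credential).
SAMPLE_PAGES = {
    "https://a.example/index.html": 'var key = "AKIAIOSFODNN7EXAMPLE";',
    "https://b.example/app.js": 'const k = "AKIAIOSFODNN7EXAMPLE";',
    "https://c.example/form.html": 'apiKey: "0123456789abcdef0123456789abcdef-us12"',
}

if __name__ == "__main__":
    candidates = find_candidates(SAMPLE_PAGES)
    for (kind, secret), urls in candidates.items():
        print(f"{kind}: {secret[:8]}... on {len(urls)} page(s)")
    print(f"share seen on more than one page: {reuse_share(candidates):.0%}")
```

A pattern match only marks a candidate. Confirming that a key is live requires an authenticated request to the issuing service, which should only be done for keys you own or are authorized to test.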