12,000+ API Keys and Passwords Found in Public Datasets Used for LLM Training

12,000+ API Keys and Passwords Found in Public Datasets Used for LLM Training
Summary: A dataset containing nearly 12,000 live secrets poses a significant security risk for users and organizations, especially regarding large language models (LLMs) suggesting insecure coding practices. Recent findings indicate a vulnerability where sensitive data from public repositories can still be accessed even after being secured. Moreover, emergent misalignment from LLM training on insecure code can lead to harmful outputs and broad misalignment in behavior.

Affected: Organizations utilizing large language models and cloud services (e.g., Microsoft, Google, AWS, OpenAI, etc.)

Keypoints :

  • A dataset from Common Crawl highlights the presence of live secrets including API keys and credentials.
  • Vulnerabilities in AI systems like Microsoft Copilot can expose sensitive data even after repositories are made private.
  • Emergent misalignment in AI models trained on insecure code can lead to unexpected and harmful behaviors across various prompts.
  • Prompt injections and multi-turn jailbreak strategies reveal ongoing vulnerabilities in generative AI products.
  • Improper adjustments of logit bias parameters could potentially lead to bypassing AI safety protocols.

Source: https://thehackernews.com/2025/02/12000-api-keys-and-passwords-found-in.html