In August 2024, security researchers at Lasso uncovered a concerning issue involving OpenAI’s ChatGPT and Microsoft Copilot. A LinkedIn post suggested that ChatGPT was retrieving data from private GitHub repositories, raising serious questions about AI’s access to sensitive information. However, as Lasso’s investigation deepened, they discovered an even more alarming truth: data that was once public—even for a short period—can persist indefinitely in AI-driven systems.
When Private Isn’t Really Private
Lasso’s research began with the repository at the center of the controversy. A quick search confirmed that it had once been public—its pages were still indexed by Bing. But when researchers attempted to access it on GitHub, they were met with a 404 error, indicating that it had since been made private.
This raised a critical question: How was ChatGPT able to return results based on content that was no longer publicly available?
Through repeated queries, Lasso found that ChatGPT was not pulling live data from GitHub but rather generating responses based on metadata that Bing had indexed and cached before the repository was made private. This finding revealed a significant security risk: data that was once public could persist in search engine caches and AI models, even after users believed it had been removed.
When AI Knows Too Much
Expanding their investigation, Lasso researchers searched Bing for repositories under their own organization and found that one of their private repositories was still indexed. Attempting to access it directly on GitHub returned a 404 error, confirming that it had indeed been made private. Yet, Bing’s cached records still contained remnants of its contents.
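This kind of check is straightforward to reproduce. The sketch below is only an illustration of the pattern, not Lasso's tooling: it assumes access to the Bing Web Search API (an Azure subscription key for the v7.0 search endpoint) and relies on the fact that GitHub's unauthenticated REST API returns a 404 for repositories that are private or deleted; the organization name is a placeholder.

```python
import requests

BING_KEY = "YOUR-BING-SEARCH-API-KEY"  # placeholder Azure key for the Bing Web Search API
ORG = "example-org"                    # hypothetical GitHub organization

# Ask Bing which github.com pages it has indexed for the organization.
resp = requests.get(
    "https://api.bing.microsoft.com/v7.0/search",
    headers={"Ocp-Apim-Subscription-Key": BING_KEY},
    params={"q": f"site:github.com/{ORG}", "count": 50},
    timeout=30,
)
resp.raise_for_status()
indexed_urls = [page["url"] for page in resp.json().get("webPages", {}).get("value", [])]

for url in indexed_urls:
    # Derive the owner/repo slug from the indexed URL and ask GitHub whether it still resolves.
    parts = url.split("github.com/")[-1].split("/")
    if len(parts) < 2:
        continue
    owner, repo = parts[0], parts[1]
    gh = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=30)
    if gh.status_code == 404:
        # Indexed by Bing but no longer publicly reachable on GitHub --
        # the "zombie data" condition described above.
        print(f"Still indexed but private or deleted: {owner}/{repo}")
```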
While ChatGPT only acknowledged the repository’s existence, Lasso questioned whether Microsoft Copilot—a tool more deeply integrated into Microsoft’s ecosystem—could still retrieve its data. The results were alarming. Unlike ChatGPT, which merely recognized the repository, Copilot returned actual content that had been available while the repository was public. This suggested that even though the repository was no longer public, its contents remained accessible to AI tools leveraging Microsoft’s search index.
Unraveling the ‘Zombie Data’ Phenomenon
Determined to quantify the scope of the issue, Lasso launched a broader investigation. Using Google BigQuery’s GitHub archive dataset, they identified repositories that had been public at any point in 2024. Cross-referencing these repositories against Bing’s cache, they determined how much information was still accessible.
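The first step of such an investigation, enumerating repositories that were public at some point in 2024, can be approximated with a query against the GH Archive dataset on BigQuery, which records public GitHub events. The sketch below is a simplified illustration rather than Lasso's actual methodology; the `githubarchive.month.*` table naming and the `repo.name` field are assumptions about that public dataset's layout, and cross-referencing the results against Bing's cache would be a separate step.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Any repository that emitted a public event during 2024 was, by definition,
# public at that moment -- even if it has since been deleted or made private.
# Note: scanning a full year of GH Archive processes a large amount of data.
sql = """
    SELECT DISTINCT repo.name AS repo_name
    FROM `githubarchive.month.2024*`
"""

public_at_some_point = {row.repo_name for row in client.query(sql).result()}
print(f"Repositories observed as public during 2024: {len(public_at_some_point):,}")
```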
The findings were staggering:
20,580 repositories were retrievable through Bing’s cache despite being private or deleted.
16,290 organizations were affected, including major tech firms such as Microsoft, Google, Intel, Huawei, PayPal, IBM, and Tencent.
100+ internal Python and Node.js packages were exposed, opening the door to dependency confusion attacks (see the defensive check sketched after this list).
300+ private tokens, API keys, and secrets tied to GitHub, Hugging Face, Google Cloud, OpenAI, and other platforms were still accessible.
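The dependency confusion exposure in particular lends itself to a simple defensive check: if an internal package name has leaked and is unclaimed on the public registry, an attacker can register it there and wait for misconfigured resolvers to prefer the public copy. The sketch below covers only Python packages and uses PyPI's JSON API, which returns a 404 for names that are not registered; the package names are hypothetical.

```python
import requests

# Hypothetical internal package names that appeared in a once-public repository.
internal_packages = ["acme-billing-core", "acme-auth-utils"]

for name in internal_packages:
    resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=30)
    if resp.status_code == 404:
        # Unclaimed on the public index: an attacker could publish a package
        # under this name and poison builds that fall back to PyPI.
        print(f"UNCLAIMED on PyPI (squattable): {name}")
    else:
        # The name already exists publicly; confirm it is actually yours.
        print(f"Registered on PyPI -- verify ownership: {name}")
```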
Microsoft’s Response and the Partial Fix
Lasso reported these findings to Microsoft, warning that deleted or private GitHub repositories were still accessible through Bing’s cache and Copilot. Microsoft classified the issue as low severity, arguing that the cached data was of “low impact.” However, within two weeks of Lasso’s report, Microsoft disabled Bing’s cached link feature and blocked access to the cc.bingj.com domain.
Yet, the fix was only partial. While human users could no longer access cached pages through Bing, Copilot could still retrieve the data. The information hadn’t been erased—it had merely been hidden from direct user searches.
The Real-World Exploit: Testing Microsoft’s Fix
On January 14, 2025, Lasso researchers tested Microsoft’s fix using a real-world case: a TechCrunch article had revealed that Microsoft was taking legal action against a group allegedly bypassing its AI safety mechanisms, and the group’s GitHub repositories had since been removed.
Using Copilot, Lasso researchers attempted to retrieve the deleted repository’s contents. Despite the repository being removed, Copilot still returned its data. This confirmed that Microsoft’s fix had only obscured the data rather than eliminating it. Organizations may believe their data is private, but AI-powered retrieval tools can still unearth it.
Lessons for Organizations: How to Protect Your Data
Assume All Data Is Permanent: If data was ever public, even momentarily, it may have already been indexed and stored by search engines or AI models. Treat all past exposures as potential risks.
LLMs Are the New Threat Vectors: Traditional cybersecurity focused on monitoring the dark web for leaks. Now, LLMs and AI copilots add a new risk vector—data that was once public can be retrieved by sophisticated AI prompts.
Beware of Overly Helpful AI: AI models are designed to be helpful, sometimes to a fault. If they have access to data—even indirectly—they may provide more information than they should. Organizations must be cautious about what AI-powered tools can retrieve from past public records.
Practice Fundamental Data Hygiene: Regularly audit your codebase to ensure secrets, tokens, and internal packages are never pushed to public repositories (a minimal scanning sketch follows this list). If a mistake happens, assume the data is already compromised and take immediate remediation steps such as rotating keys and credentials.
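As one concrete piece of that hygiene, a lightweight scan can flag the most obvious credential formats before a push ever makes them public. The sketch below is a minimal illustration rather than a substitute for dedicated scanners such as gitleaks or truffleHog; the regular expressions are rough approximations of common token formats.

```python
import re
from pathlib import Path

# Rough, illustrative patterns for a few common credential formats.
PATTERNS = {
    "GitHub token": re.compile(r"gh[pousr]_[A-Za-z0-9]{36,}"),
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Private key block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan(root: str = ".") -> None:
    """Walk a working tree and flag strings that match the patterns above."""
    for path in Path(root).rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                print(f"{path}: possible {label}: {match.group(0)[:12]}...")

if __name__ == "__main__":
    scan()
```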
Conclusion: A Digital Footprint That Never Fades
Lasso’s findings highlight the evolving risks posed by AI-driven data retention. The race to train large models has led to aggressive data retention and indexing strategies that often prioritize accessibility over security. While Microsoft has taken steps to mitigate the exposure of zombie data, Lasso’s research suggests these measures are insufficient.
For organizations and developers, the takeaway is clear: The moment data is made public, it may never truly be private again. In a world where AI remembers everything, true deletion is an illusion.