Over 100,000 GitHub repos have leaked API and cryptographic keys, with thousands of new secrets exposed every day

Researchers at North Carolina State University (NCSU) have found that one of the most popular source code repositories in the world is still housing thousands of publicly accessible encryption keys.

Over 100,000 code repositories on source code management site GitHub contain secret access keys that can give attackers privileged access to those repositories (repos) or to online service providers’ services.

The researchers scanned almost 13% of GitHub’s public repositories over nearly six months. In a paper revealing the findings, they said:

“We find that not only is secret leakage pervasive – affecting over 100,000 repositories – but that thousands of new, unique secrets are leaked every day.”

Across the six-month period, researchers analyzed billions of files from millions of GitHub repositories.

In a research paper published last month, the three-man NCSU team said they captured and analyzed 4,394,476 files representing 681,784 repos using the GitHub Search API, and another 2,312,763,353 files from 3,374,973 repos that had been recorded in Google’s BigQuery database.

NCSU team scanned for API tokens from 11 companies

Inside this gigantic pile of files, researchers looked for text strings that were in the format of particular API tokens or cryptographic keys.

Since not all API tokens and cryptographic keys are in the same format, the NCSU team decided on 15 API token formats (from 15 services belonging to 11 companies, five of which were from the Alexa Top 50), and four cryptographic key formats.

This included API key formats used by Google, Amazon, Twitter, Facebook, Mailchimp, MailGun, Stripe, Twilio, Square, Braintree, and Picatic.
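To make the idea of format-based scanning concrete, here is a minimal sketch in Python. The three patterns are commonly cited shapes for an AWS access key ID, a Google API key, and a PEM-encoded RSA private key header; they are assumptions chosen for illustration, not the researchers' actual pattern list, and the function name find_candidates is hypothetical.

```python
import re

# Illustrative patterns only. These are assumed, commonly cited shapes,
# not the pattern list used in the NCSU study.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "rsa_private_key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}


def find_candidates(text):
    """Return sorted, de-duplicated (kind, matched_string) pairs found in text."""
    hits = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.add((kind, match.group(0)))
    return sorted(hits)


if __name__ == "__main__":
    # Obviously fake placeholder value, built at runtime.
    fake_key = "AKIA" + "A" * 16
    sample = f'aws_key = "{fake_key}"\n'
    for kind, value in find_candidates(sample):
        print(kind, value)
```

A match on format alone only yields a candidate: a string can fit a pattern without being a live credential, so a real scanner would need further filtering before counting it as a leak.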

NCSU GitHub scan tested APIs. Image: Meli et al.

Results came back right away, with thousands of leaked API and cryptographic keys being found every day of the research project.

In total, the NCSU team said they found 575,456 API and cryptographic keys, of which 201,642 were unique, all spread over more than 100,000 GitHub projects.

NCSU GitHub scan results. Image: Meli et al.

One observation the research team made in their academic paper was that the “secrets” found using the GitHub Search API and the ones found via the Google BigQuery dataset had little overlap.

“After joining both collections, we determined that 7,044 secrets, or 3.49% of the total, were seen in both datasets. This indicates that our approaches are largely complementary,” researchers said.
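The overlap figure can be sanity-checked with a few lines of Python. This sketch assumes that "the total" in the quote refers to the 201,642 unique secrets reported in the article; the overlap_share helper and the toy sets are hypothetical and only illustrate the join.

```python
def overlap_share(search_secrets, bigquery_secrets):
    """Return (count seen in both sets, that count as a fraction of the union)."""
    combined = search_secrets | bigquery_secrets
    both = search_secrets & bigquery_secrets
    if not combined:
        return 0, 0.0
    return len(both), len(both) / len(combined)


# Toy data: one secret out of four unique ones appears in both collections.
toy_count, toy_share = overlap_share({"a", "b", "c"}, {"c", "d"})

# Figures from the article: 7,044 shared secrets out of 201,642 unique ones
# (assumption: the quoted "total" is the unique count).
REPORTED_SHARE_PCT = round(7044 / 201642 * 100, 2)

if __name__ == "__main__":
    print(toy_count, toy_share)
    print(REPORTED_SHARE_PCT)
```

Under that assumption the arithmetic agrees with the 3.49% the researchers quote.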

Furthermore, most of the API tokens and cryptographic keys, 93.58 percent, came from single-owner repositories rather than multi-owner ones.

What this means is that the vast majority of API and cryptographic keys found by the NCSU team were most likely valid tokens and keys used in the real world, as multi-owner repositories usually tend to contain test tokens used for shared-testing environments and with in-dev code.

Over 12 percent of keys and tokens were removed within a day of being committed, and 19 percent had been removed within 16 days.

“This also means 81% of the secrets we discover were not removed,” researchers said. “It is likely that the developers for this 81% either do not know the secrets are being committed or are underestimating the risk of compromise.”

NCSU GitHub scan timeline. Image: Meli et al.

Research team uncovers some high-profile leaks

The quality of these scans became evident when researchers started looking at what some of these leaks were and where they originated.

“In one case, we found what we believe to be AWS credentials for a major website relied upon by millions of college applicants in the United States, possibly leaked by a contractor,” the NCSU team said.

“We also found AWS credentials for the website of a major government agency in a Western European country. In that case, we were able to verify the validity of the account, and even the specific developer who committed the secrets. This developer claims in their online presence to have nearly 10 years of development experience.”

In another case, researchers also found 564 Google API keys that were being used by an online site to skirt YouTube rate limits and download YouTube videos that they’d later host on another video sharing portal.

“Because the number of keys is so high, we suspect (but cannot confirm) that these keys may have been obtained fraudulently,” NCSU researchers said.

Last, but not least, researchers also found 7,280 RSA keys inside OpenVPN config files. By looking at the other settings found inside these configuration files, researchers said that the vast majority of the users had disabled password authentication and were relying solely on the RSA keys for authentication, meaning anyone who found these keys could have gained access to thousands of private networks.
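As a rough illustration of that kind of check, the Python sketch below flags an OpenVPN client config that carries a private key but never enables username/password authentication. It relies on two standard OpenVPN config conventions, the auth-user-pass directive and the key directive or inline <key> block. The heuristic and the function name relies_on_key_only are assumptions made for this example, not the researchers' method.

```python
def relies_on_key_only(config_text):
    """Heuristic: True if the config has a private key and no auth-user-pass."""
    has_key = False
    has_password_auth = False
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments ('#' or ';' in OpenVPN configs).
        if not line or line.startswith(("#", ";")):
            continue
        directive = line.split()[0]
        if directive in ("<key>", "key"):
            has_key = True
        elif directive == "auth-user-pass":
            has_password_auth = True
    return has_key and not has_password_auth


if __name__ == "__main__":
    # Placeholder config; the key body is deliberately not real key material.
    sample = "client\nremote vpn.example.org 1194\n<key>\n(placeholder)\n</key>\n"
    print(relies_on_key_only(sample))
```

A config that passes this check is protected by the key alone, which is why a key committed alongside it is enough to get in.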

The high quality of the scan results was also apparent when researchers us