Looking for secrets in source code

In software development, security deserves constant attention. One critical aspect is ensuring that sensitive data, such as passwords, API keys, and encryption keys, is not accidentally exposed in your source code. This sensitive data is usually called “secrets”, and for penetration testers it is low-hanging fruit, as secrets are usually among the first things we look for. In this article, I’d like to talk about some of the techniques and tools that can be used to scan source code for secrets.

Before we start talking about how to look for secrets, let’s first take a look at what secrets usually look like. I used to be, and in fact still am, a PHP developer, so I will use an example of PHP code. Trust me, you will see this piece of code frequently, as it appears in many tutorials and developers tend to copy and paste code a lot (sometimes without considering the security implications).

$username = "myusername";
$password = "mypassword";
$dbname = "mydatabase";
$host = "localhost";

// Connect to the database
$connection = mysqli_connect($host, $username, $password, $dbname);

// Check if the connection was successful
if (!$connection) {
    die("Connection failed: " . mysqli_connect_error());
}

The secrets can be spotted pretty easily in this example. They are all stated at the beginning of the file in plaintext. It’s every attacker’s wet dream! If I get my hands on this, I can easily connect to the database and exploit it: leak the data, drop it, halt production and much more. To make this code more secure, we could use environment variables to store sensitive information such as the database credentials. Let’s modify the above script to use environment variables.

$username = getenv('DB_USERNAME');
$password = getenv('DB_PASSWORD');
$dbname = getenv('DB_NAME');
$host = getenv('DB_HOST');

// Connect to the database
$connection = mysqli_connect($host, $username, $password, $dbname);

// Check if the connection was successful
if (!$connection) {
    die("Connection failed: " . mysqli_connect_error());
}

Alternatively, we could store the sensitive information in an external (with the emphasis on the word external) configuration file that is not tracked by version control, and read the values from that file in our code. If we opt for this approach, we have to make sure that the configuration file is properly secured and accessible only by authorized users. We could also use a key vault.
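To make the configuration-file approach concrete, here is a minimal sketch in Python (the language used for the scripts later in this article). It assumes a POSIX system and an INI file with a `[database]` section. The function name `load_db_config`, the section name and the key names are all hypothetical, not part of any standard.

```python
import configparser
import os
import stat


def load_db_config(path: str) -> dict:
    """Read database settings from an INI file kept outside the repository."""
    mode = os.stat(path).st_mode
    # Refuse files that group or other users can access (POSIX permission bits).
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} is accessible by group/others; try chmod 600")

    # Disable interpolation so a '%' inside a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    section = parser["database"]  # hypothetical section name
    return {
        "host": section["host"],
        "username": section["username"],
        "password": section["password"],
        "dbname": section["dbname"],
    }
```

The permission check is only one possible safeguard. The file itself should also be listed in `.gitignore` so it never ends up in the repository.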


Now we know what a secret is, but how do we detect one before it gets leaked? One way is to use regular expressions (regex), a pattern-matching language used to search and manipulate text. A regex is like a code that tells the computer what to look for in a piece of text, and the computer then finds everything that matches the pattern. For example, you could use regex to search for keywords such as "key", "pass", or "aws", which might appear in the names of variables that refer to secrets.
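As a rough illustration of the keyword approach, the sketch below flags lines where a variable whose name contains one of the keywords is assigned a quoted literal. The keyword list, the pattern and the function name `find_suspicious_lines` are my own assumptions for this example, and a real scanner would need far more tuning to keep false positives down.

```python
import re

# Hypothetical keyword list; extend it to match your codebase's naming habits.
ASSIGNMENT = re.compile(
    r"\w*(?:key|pass|secret|token|aws)\w*\s*(?:=>|=|:)\s*[\"'][^\"']+[\"']",
    re.IGNORECASE,
)


def find_suspicious_lines(source: str) -> list:
    """Return (line number, line) pairs that look like hardcoded secrets."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if ASSIGNMENT.search(line):
            hits.append((number, line.strip()))
    return hits
```

Run against the PHP example from earlier, this would flag the `$password = "mypassword";` line, but not a line that reads the password with `getenv()`, because that value is a function call rather than a quoted literal.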

Many API keys also follow a specific format. AWS access key IDs, for example, typically begin with the string AKIA, followed by 16 characters that can be either numbers (0-9) or capital letters (A-Z). Armed with this knowledge, we can craft the regex AKIA[0-9A-Z]{16} to search for AWS access key IDs.
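Here is a small sketch of how that pattern could be applied in Python. The `\b` word boundaries are my own addition, to avoid matching inside longer alphanumeric strings, and `find_aws_key_ids` is a hypothetical helper name. The key in the usage note is the placeholder value AWS uses in its own documentation, not a real credential.

```python
import re

# The pattern from the text, wrapped in word boundaries (an assumption of this sketch).
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_aws_key_ids(text: str) -> list:
    """Return every substring that looks like an AWS access key ID."""
    return AWS_ACCESS_KEY_ID.findall(text)
```

For example, `find_aws_key_ids("aws_access_key_id = AKIAIOSFODNN7EXAMPLE")` returns a list containing that one ID.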

Using this strategy, we can discover many hardcoded credentials, but we are limited to secrets that adhere to a known format, which is why it is time to talk about entropy.

Shannon entropy

Shannon entropy measures how random and unpredictable something is. To better understand it, imagine you have two bags of marbles. In one bag, all the marbles have the same color. In the other bag, each marble has a different color.

If you reach into the first bag and pick out a marble at random, you already know what color it will be (as there is only one color present). This bag has low entropy. However, if you reach into the second bag and pick out a marble at random, you have no way of predicting which color you will get, because each color has an equal chance of being chosen. In other words, the second bag has high entropy.
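For readers who like to see the numbers, the marble example can be written down with the standard Shannon entropy formula, where p_i is the probability of drawing color i. The bag size of 8 marbles is an arbitrary number I picked for illustration.

```latex
H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

% First bag: a single color, so p_1 = 1
H = -1 \cdot \log_2 1 = 0 \text{ bits}

% Second bag: 8 marbles, 8 different colors, so p_i = 1/8 for every i
H = -\sum_{i=1}^{8} \tfrac{1}{8} \log_2 \tfrac{1}{8} = \log_2 8 = 3 \text{ bits}
```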

In our context, Shannon entropy is often used to analyze strings of characters, such as passwords, to estimate how unpredictable, and therefore how secure, they are. Secrets are usually complex and randomized, so by measuring Shannon entropy we can discover suspicious strings of any format.

I have put together this simple Python script to calculate the entropy of a given string. If we were to use it for real purposes, however, we would have to make some changes to factor in the overall entropy of the whole codebase.

import math

def get_shannon_entropy(data: str) -> float:
    # Count the frequency of each character in the given string
    frequency_dict = dict.fromkeys(data, 0)
    for char in data:
        frequency_dict[char] += 1

    # Calculate the probability of each character by dividing its
    # frequency by the length of the input
    probability_dict = {}
    for key in frequency_dict:
        probability_dict[key] = frequency_dict[key] / len(data)

    # Calculate the Shannon entropy
    entropy = 0.0
    for key in probability_dict:
        probability = probability_dict[key]
        if probability > 0:
            entropy += -probability * math.log(probability, 2)

    return entropy
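To give an idea of how an entropy calculation like this could be applied to source code, here is a self-contained sketch that extracts long tokens from each line and flags the ones whose entropy exceeds a threshold. The minimum length, the threshold and the token character set are assumptions chosen for illustration and would need tuning on a real codebase. `shannon_entropy` is a compact stand-in for the function above so that the block can run on its own.

```python
import math
import re
from collections import Counter

# Assumed values for illustration; tune both against your own codebase.
MIN_LENGTH = 20
ENTROPY_THRESHOLD = 4.0  # bits per character

# Candidate tokens: long runs of characters commonly found in keys.
TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{%d,}" % MIN_LENGTH)


def shannon_entropy(data: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_high_entropy_tokens(source: str) -> list:
    """Return (line number, token) pairs whose entropy meets the threshold."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for token in TOKEN.findall(line):
            if shannon_entropy(token) >= ENTROPY_THRESHOLD:
                hits.append((number, token))
    return hits
```

A long string made of one repeated character has an entropy of 0 and is ignored, while a random-looking 32-character token is reported together with its line number.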

So now we know how to look for secrets using regex and Shannon entropy. We can also use SAST (static application security testing) tools, which scan your source code for vulnerabilities and weaknesses, including the presence of secrets. I personally use Snyk CLI as it provides a comprehensive security report.

Another option is git-secrets, an open source tool for detecting and preventing the committing of sensitive information to a Git repository. It works by leveraging Git’s pre-commit hooks to scan staged changes before they are committed. When a user attempts to commit changes that contain sensitive data, git-secrets alerts them with a warning message and prevents the commit from being completed until the sensitive data is removed or encrypted.
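To illustrate the general idea behind a pre-commit scan, here is a minimal sketch of such a hook in Python. It is not how git-secrets itself is implemented. The rule set and the names `scan_diff` and `main` are hypothetical, and the only Git behavior it relies on is `git diff --cached` for listing staged changes.

```python
import re
import subprocess

# Hypothetical rule set; git-secrets ships and manages its own patterns.
PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "hardcoded password": re.compile(r"(?i)pass\w*\s*=\s*[\"'][^\"']+[\"']"),
}


def scan_diff(diff_text: str) -> list:
    """Return (rule name, line) pairs for added lines that match a pattern."""
    findings = []
    for line in diff_text.splitlines():
        # Only inspect added lines; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((name, line[1:].strip()))
    return findings


def main() -> int:
    # "git diff --cached" shows the staged changes about to be committed.
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    findings = scan_diff(diff)
    for name, line in findings:
        print(f"Possible secret ({name}): {line}")
    # A non-zero exit status from a pre-commit hook aborts the commit.
    return 1 if findings else 0
```

To try it, you could save the script as `.git/hooks/pre-commit`, make it executable, and end the file with `raise SystemExit(main())`. Git then runs it before every commit and aborts the commit whenever the script exits with a non-zero status.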

If you have read this far, you might want to follow me here on Hashnode. Feel free to connect with me on LinkedIn or Mastodon.