# Workflow Syntax Reference

This document describes the syntax used to define workflows in BountyHub. Workflows are defined using YAML syntax, which is a human-readable data serialization format.

We will go through each field in the workflow definition, describing its purpose and type, with examples of how to use it.

# scans

Each workflow is made up of one or more scans. Each scan has a unique name, which is a key in the scans object. The name is written as [ID] throughout the rest of this document, since it is unique for each scan and therefore serves as an identifier.

You can have an unlimited number of scans, as long as their names are unique.

The names must only contain alphanumeric characters and the _ character. This constraint is in place to ensure that names are compatible with expressions and other parts of the system. Names cannot contain spaces or special characters (such as - or .).

| name | valid |
| --- | --- |
| example | true |
| test_1 | true |
| test-one | false |

# Example

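A minimal sketch of a workflow with two uniquely named scans (the scan names, schedules, and commands are illustrative):

```yaml
scans:
  example:
    on:
      cron: "0 15 * * *"
    steps:
      - run: |
          echo "first scan"
  test_1:
    on:
      cron: "30 15 * * *"
    steps:
      - run: |
          echo "second scan"
```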

# scans.[ID].on

The on field serves as a trigger for scan execution. This object contains the specification of the events that will trigger a scan.

To start the pipeline, some event must trigger it. Events are raised automatically (on cron) or manually (on dispatch). From there, you build the next stage by running on expr.

To put things into perspective, something needs to trigger the scan. After the scan is done, each expression-triggered scan is evaluated. If the expression evaluates to true, that scan is scheduled. This means that a single scan can trigger one or more scans.

During evaluation, the scan checks against its latest job time. More about expressions used to trigger expr scans can be found here.

# scans.[ID].on.cron

Cron defines a schedule based on which the workflows are executed.

Cron is described in the form of:

| minute | hour | day of month | month | day of week |
| --- | --- | --- | --- | --- |
| required | required | required | required | required |

Times are based on the UTC timezone, so please take that into account when writing your schedules.

If you need help specifying or testing your cron schedule, you can use crontab guru. This is a great site that will help you write and understand the spec you write for the scheduler.

As a limitation, the cron parser does not allow non-standard modifiers, such as @hourly. Currently, supported specifications include:

  • Wildcard (*)
  • Lists (,)
  • Ranges (-)
  • Steps (/)

Since the platform has its own parser, adding non-standard modifiers can be considered in the future, based on your feedback.

Example (every day at 15:00/3 PM UTC; the scan name and step below are illustrative):

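```yaml
scans:
  example:
    on:
      # minute 0, hour 15 -> every day at 15:00 UTC
      cron: "0 15 * * *"
    steps:
      - run: |
          echo "running the daily scan"
```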

# scans.[ID].on.expr

This field allows you to specify an expression that describes when the scan should be scheduled.

You can learn more about expressions by visiting the Workflow Expressions page.

Expressions describe a scan that should run only if the expression evaluates to true. After each successful scan, the server goes through the expression-triggered scans and evaluates whether each of them should be scheduled.

Let's build an example workflow to show how it can be used.

Goals:

  1. Run multiple subdomain discovery tools. For simplicity, let's pick subfinder and assetfinder.
  2. Since I would love to run probes only if there is a diff, I can create a "meta" scan that combines the results.
  3. The probe (httpx) can execute once the subdomain discovery stage is over and there is a diff.

We can combine the power of the needs field with expr to further describe when our probe should be executed.

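The sketch below is illustrative: the tool commands, artifact names, and the exact shapes of the needs field and the is_available() call are assumptions to adapt to your own setup.

```yaml
scans:
  assetfinder:
    on:
      cron: "0 3 * * *"
    steps:
      - run: |
          assetfinder --subs-only example.com | sort -u > assetfinder.txt
    artifacts:
      - name: assetfinder.zip
        paths:
          - assetfinder.txt

  subfinder:
    on:
      cron: "0 3 * * *"
    steps:
      - run: |
          subfinder -d example.com -silent | sort -u > subfinder.txt
    artifacts:
      - name: subfinder.zip
        paths:
          - subfinder.txt

  subdomaindiscovery:
    needs:
      - assetfinder
      - subfinder
    on:
      # Schedule once either discovery scan has a new successful job.
      expr: 'is_available("assetfinder") || is_available("subfinder")'
    steps:
      - if: 'is_available("assetfinder")'
        run: |
          # Download the latest assetfinder artifact here (download command omitted).
          echo "assetfinder results available"
      - if: 'is_available("subfinder")'
        run: |
          # Download the latest subfinder artifact here (download command omitted).
          echo "subfinder results available"
      - run: |
          # Combine whatever was downloaded into a single result.
          cat assetfinder.txt subfinder.txt 2>/dev/null | sort -u > subdomains.txt
    artifacts:
      - name: subdomains.zip
        paths:
          - subdomains.txt
        notify_on: "diff"

  httpx:
    needs:
      - subdomaindiscovery
    on:
      expr: 'is_available("subdomaindiscovery")'
    steps:
      - run: |
          # Download the latest subdomains artifact here (download command omitted),
          # then probe the newly discovered subdomains.
          httpx -l subdomains.txt -o alive.txt
    artifacts:
      - name: alive.zip
        paths:
          - alive.txt
```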

Let's summarize why and how this works. Periodically, we want to check if new domains are discovered. In order to do that, we run two subdomain discovery tools:

  1. assetfinder
  2. subfinder

The purpose of those scans is to find as many subdomains as we possibly can. But since these are just two tools to accomplish something, we create a "meta scan" that only combines these results. At the end of the day, we care about subdomains, not about the single tool's result. This can be done in a single scan to preserve the space, but at the expense of one tool failure resulting in the whole scan failure. Use your judgment to decide what is best for you.

To schedule a subdomaindiscovery scan, we need to have assetfinder or subfinder done and ready to be used. To accomplish that, we can leverage the expression engine, particularly, is_available() function. You can read more about this function on the Workflow Expressions page, but in summary, is_available() checks if there is at least one job that executed successfully after the last execution of the current scan.

But since the subdomaindiscovery can be triggered when only one of these scans is done, we need to conditionally download the results. This is done using the if field of the step. If the scan is not available, the step is skipped.

Next, for the sake of example, we want to run httpx on newly found subdomains. Since we only care about the latest result, we would download the latest version of the subdomaindiscovery scan, and run httpx on it.

# scans.[ID].on.dispatch

The dispatch trigger is a type of trigger that allows only manual invocations.

Examples where dispatch might be useful include:

  • Expensive scans: Some scans might be expensive to run, either because they consume API credits, or because they take a long time to run. In such cases, you might want to run them only when you need them.
  • On-demand scans: Some scans might be needed only on-demand, for example, when you discover a new asset, and you want to run a scan on it, leveraging dispatch.inputs to pass the asset information.
  • Ad-hoc scans: Some scans might be needed only for a specific purpose, and you don't want them to run periodically.

For example (the scan name, input, and step below are illustrative):

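```yaml
scans:
  expensive_scan:
    on:
      # No cron or expr: this scan runs only when it is dispatched manually.
      dispatch:
        inputs:
          domain:
            type: string
            required: true
    steps:
      - run: |
          echo "running the expensive scan against ${{ inputs.domain }}"
```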

# scans.[ID].on.dispatch.inputs

Inputs are arguments, added to the evaluation context, that exist only during the execution of the scan. They allow you to parametrize the scan, so if you want to invoke a tool on-demand, and you need to pass some parameters to it, you can use inputs.

There are two primary reasons to use dispatch:

  1. Let's say you want to run some tool when you discover something interesting. There is no need for anything else to trigger it, but you want to avoid running the command manually every time. You can use a dispatch scan and parametrize it with inputs to invoke the tool with the required data.
  2. You want to dispatch something from your scan. Let's say one of your tools discovered something, and you want to spawn a new pipeline to check things only on that particular domain.
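For example, a dispatch-triggered scan parametrized with inputs might look like this (the scan name, input names, and command are illustrative):

```yaml
scans:
  probe:
    on:
      dispatch:
        inputs:
          domain:
            type: string
            required: true
          verbose:
            type: bool
            default: "false"
    steps:
      - run: |
          echo "verbose=${{ inputs.verbose }}"
          httpx -u "${{ inputs.domain }}"
```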

# scans.[ID].on.dispatch.inputs.[KEY]

Key is a unique identifier for an input. It conforms to the same constraints as project variables (i.e. alphanumeric characters or _).

The key is used by the expression engine to reference the input value, so if you have an input with key domain, you can reference it using ${{ inputs.domain }}.

There are a few constraints you should be aware of:

  • Maximum number of inputs is 20.
  • Maximum size of dispatch values is 65535 characters. If you need a larger input, consider uploading assets to the blob storage and passing down a reference to them.

# scans.[ID].on.dispatch.inputs.[KEY].type

Input type can either be a string or a bool.

By default, if you don't specify the type, it is assumed to be a "string".

The default value for the type string is an empty string (""). For boolean type, the default value is false.

The type of the field is a string, so a valid way to specify it is either type: string or type: "string".

# scans.[ID].on.dispatch.inputs.[KEY].required

The required field specifies whether the input is required or not. If not specified, the field does not assume any default value.

Keep in mind, empty string is still a valid input. Once dispatched, every value is submitted to the server, and if the input is required, it must be present.

This limitation is mostly implemented to ensure that calls to bh job dispatch use the correct parameters.

# scans.[ID].on.dispatch.inputs.[KEY].default

Sets the default value for the input if the input is not specified.

The type of this field in YAML is "string", so the only valid values for boolean input fields are "true" and "false".

In other words, the default for a boolean input is still written as a string:

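```yaml
# The input names here are illustrative.
inputs:
  notify:
    type: bool
    # Written as a string, even though the input type is bool.
    default: "false"
  domain:
    type: string
    default: "example.com"
```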

# scans.[ID].if

This field allows you to specify a condition that must be met for the scan to be scheduled, even when it would normally be scheduled by some event.

Let's say you want to run a scan on cron, but only if some condition is met, for example that some other scan has produced some output.

You can use the if field to specify the condition (the scan names in the sketch below are illustrative):

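```yaml
scans:
  conditional_scan:
    on:
      # The cron trigger raises the event every day at 06:00 UTC...
      cron: "0 6 * * *"
    # ...but the scan is only scheduled if the condition below holds.
    if: 'is_available("subdomaindiscovery")'
    steps:
      - run: |
          echo "runs on cron, but only once subdomaindiscovery has produced output"
```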

In this example, conditional_scan is not scheduled by an expression. Expressions are evaluated every time a job is done; for a cron scan, the trigger is the cron itself.

However, our job should not run if some other job did not produce the output.

The best example I'm using (and why this field is introduced) is related to the liveness probe. I'm running httpx on subdomains periodically. However, it doesn't make sense to schedule it if no subdomains are discovered yet.

Therefore, I can create a cron scan for httpx, and skip it if the if condition is not met yet.

# scans.[ID].artifacts[]

The artifacts field specifies which files or directories should be uploaded to the server when the scan is successful.

If the scan fails, no files will be uploaded to the server.

If a single upload fails, the artifacts that were uploaded successfully are still persisted, so you can inspect them regardless of the result. However, the expression context does not include scans whose state is not succeeded, so this matters only for artifacts that other scans don't depend on. Some artifacts might not be uploaded, but the ones that were might contain useful information.

All artifacts are zipped and uploaded to the server under the artifacts[].name. The compression level is set to 9 to ensure that the size of the artifact is minimized. The size of the compressed file is counted against your storage quota, since that is the amount of data it occupies on the server.

Even though you don't have to name your artifacts with the .zip suffix, it is highly recommended to do so, since it indicates that the artifact is a zip file. I have had issues downloading output files without the .zip suffix on macOS, since the OS didn't recognize the file as a zip file.

::alert{type="warning"}
All artifact paths must be present in the root directory of your scan.

There are multiple reasons for this:

  1. Eliminate potential errors caused by uploading files from your local machine. You can always cp files from your local machine to the working directory of your job, but you would have to do it explicitly.
  2. Ensure that workflows are not written in a way that depends on the local machine. If you want, you can always work around this constraint by using the cp command inside the run step to copy files from anywhere on your filesystem into the working directory. However, this is not recommended, since it breaks the portability of the workflow.

::

Let's take a look at a simple artifact example (the scan name, command, and artifact name are illustrative):

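```yaml
scans:
  subfinder:
    on:
      cron: "0 3 * * *"
    steps:
      - run: |
          subfinder -d example.com -silent > subdomains.txt
    artifacts:
      - name: subdomains.zip
        paths:
          - subdomains.txt
        expires_in: "30d"
        notify_on: "diff"
```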

# scans.[ID].artifacts[].name

The name field specifies the name of the artifact. This is the name under which the artifact will be stored on the server.

The name must be unique for each scan, and it must not contain any special characters.

The reason artifacts is not an object (or map) is that it would be ugly to have keys named "artifact.zip" or "output.zip". Therefore, to avoid having to quote keys, using an array feels more appropriate.

In order to avoid issues downloading the output directly from the platform, it would be best if you add the .zip suffix to the artifact name. However, you don't have to do it.

The name of the artifact is the only reference used for downloading the artifact.

# scans.[ID].artifacts[].paths[]

Paths specifies a list of paths to files and/or directories that will be included in the artifact.

Currently, you cannot use glob patterns, but rather prefix paths for directories (e.g. outdir/), or direct file names.

Each path must start from the runner's working directory where the scan is executed. For example, suppose you have a run step that does the following (a minimal sketch):

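```bash
# Write a file into the scan's working directory.
echo "hello" > file.txt
```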

The file will be created under the directory of the scan, and the correct reference for the path field would be file.txt.

# scans.[ID].artifacts[].if

This field allows you to specify a condition for when the artifact should be included in the scan.

The condition is expressed as a CEL expression, just like the if and expr fields.

It must evaluate to a boolean value. String values are not treated as valid expressions (there are no "truthy" or "falsy" values).

This is especially important in situations where a conditional step produces the file.

The condition to run the step should be the same as the condition to upload the artifact.

Therefore, if the step is skipped, the artifact will not be uploaded.

# scans.[ID].artifacts[].expires_in

Expires in allows you to specify a duration after which the artifact should be removed.

This field is especially useful in case you run scans that produce large artifacts, such as screenshots. You most likely don't want to keep all those artifacts forever, especially since they consume your storage quota.

The duration is expressed as a string, and it must be a valid duration format, such as 1h, 30m, or 1d. You would specify it as a string, such as expires_in: "1d".

Valid time units are:

| Suffix | Description |
| --- | --- |
| m | minutes |
| h | hours |
| d | days |

You can combine them together, such as 1d12h30m for 1 day, 12 hours, and 30 minutes.

It is worth noting that the execution of the cleanups is not exact, i.e. there might be a delay between the specified time you want your artifact to expire and the time the background process picks it up.

Therefore, please do not rely on the exact timing of the expiration, but rather on the fact that it will be cleaned up eventually.

# scans.[ID].artifacts[].notify_on

A field that specifies when the notification (if configured) should be issued after a job finishes.

If no project notifications are configured, you will not receive any notifications, regardless of the value of this field.

The type of the notify_on field is string, and the valid values are:

  • "diff": Notify when the job has a diff compared to the previous job. The value of the diff is calculated by hashing the contents of every file uploaded in the artifact. If the hash is different from the previous job's artifact hash, the notification is issued. This is important to mention because if the order of the lines is different, it will produce a different hash. Keep that in mind. The platform does not assume anything about your file contents, and that includes the format.
  • "always": Notify when the job is executed, regardless of the diff.
  • "never": Notification will never be issued.

Sometimes, you cannot rely on nicely ordering your output. For example, nmap output, when the XML format is used, can sometimes be difficult to sort nicely.

You then have two options:

  1. Create two artifacts: one with the full output, and one with the normalized output, where you put only the important data you want to be notified about, in a stable order.
  2. Use external tools, such as notify to issue notifications based on your own logic.

Using two artifacts allows you to leverage native notification mechanism of the platform, but at the expense of increased storage (although you can expire the normalized artifact after an hour or so). On the other hand, using tools like notify is a bit harder to set up, and requires you to pass the required data through the workflow.

# scans.[ID].env

The env field on the scan specifies environment variables that will be available during every step of the scan. These environment variables will override the environment variables set on the runner.

You can expose environment variables using three main mechanisms:

  1. Runner environment variables - environment variables sourced by the runner during the run call, or during the service install call.
  2. Using this field - environment variables are available during the execution of every step in the scan.
  3. Inside the step - environment variables specified inside the step will override the scan-level environment variables. They can be exported inside the shell script, such as export VAR_NAME=${{ secrets.BOUNTYHUB_TOKEN }}.

The best way in almost all cases is to use the scan-level environment variables. This ensures that the environment variable is available during every step of the scan, and it is not tied to a specific step.

It is also a top level field, so it is easy to find and manage.

In this example, we will use an expression to set an environment variable on the scan (the scan name, variable names, and command below are illustrative):

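```yaml
scans:
  example:
    on:
      cron: "0 6 * * *"
    env:
      # Reference whichever secret your project defines; BOUNTYHUB_TOKEN is the
      # secret name used elsewhere in this document.
      BOUNTYHUB_TOKEN: ${{ secrets.BOUNTYHUB_TOKEN }}
      TARGET: example.com
    steps:
      - run: |
          echo "scanning ${TARGET}"
```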

# scans.[ID].steps[]

Every job is composed of one or more steps, executed sequentially. A job can be as simple as a single step running an existing script, or the entire script can be written directly in the workflow.

For every step, stdout and stderr streams are captured and sent to the server. These streams are stored for later inspection.

Do not rely on the stdout and stderr streams. The output will expire after 2 weeks, and you will be left with no data. Use artifacts to store important data, and use logs to troubleshoot the workflow if necessary.

If the step fails, the job fails.

Failure of the step is based on the exit code of the script.

Currently, the stream is propagated to the UI while the job is being executed.

# scans.[ID].steps[].if

Steps may be conditionally skipped. This is what I refer to as conditional steps, since they are executed only if a certain condition is met.

Let's try to illustrate this idea with an example.

This idea has already been demonstrated in the expr section, particularly in the subdomaindiscovery scan.

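A condensed sketch of that scan (the download commands are omitted, and the exact argument form of is_available() is an assumption):

```yaml
scans:
  subdomaindiscovery:
    on:
      expr: 'is_available("assetfinder") || is_available("subfinder")'
    steps:
      - if: 'is_available("assetfinder")'
        run: |
          # Download the assetfinder results here.
          echo "assetfinder results available"
      - if: 'is_available("subfinder")'
        run: |
          # Download the subfinder results here.
          echo "subfinder results available"
      - run: |
          echo "combine whatever was downloaded"
```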

Since the subdomaindiscovery scan can be scheduled when either assetfinder or subfinder is done, we need to conditionally download the results. Otherwise, our scan would fail if one of the scans is not available.

Therefore, we use the if field to conditionally execute the step.

On the other hand, if you want to run a step regardless of the fact that the job is going to fail, you can use the if: always expression to run the step.

# scans.[ID].steps[].run

Run specifies a script to be executed. The content of this field is first written to a temporary file, and then executed by the shell specified in the shell field.

The script is written to a temporary file with a random name, inside the working directory of the scan.

The working directory of a scan is created inside the workdir of the runner, and is unique per job.

There may be situations where the workflow syntax looks valid, but the server fails to parse it. The most surprising YAML parsing issue I have found is related to a : character inside a single-line run field.

Unless quoted, the parser would throw an error. Therefore, the following is invalid:

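```yaml
steps:
  # Invalid: the unquoted ": " inside the plain scalar confuses the YAML parser.
  - run: echo "count: 1"
```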

What I do in most cases is use the multi-line string syntax, even for single-line scripts:

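```yaml
steps:
  - run: |
      echo "count: 1"
```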

Keep in mind, this has nothing to do with the workflow syntax itself, but rather with the YAML parser used on the server side. I don't have time to re-implement a new parser, so please be aware of this limitation.

# scans.[ID].steps[].shell

Shell specifies the executable that is going to execute the script.

The way the runner evaluates the shell is by running a shlex split.

Shlex separates the shell string into arguments the same way a shell would evaluate positional arguments.

Then, the first argument is used as the entrypoint command, the rest are used as arguments to that command, and the script file is concatenated at the end of the arguments.

Let's demonstrate this with an example (the shell arguments and script are illustrative):

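```yaml
steps:
  - shell: /bin/bash -euo pipefail
    run: |
      echo "hello"
```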

This should be equivalent to the following:

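```bash
# <script-file> stands for the temporary file (with a random name, inside the
# scan's working directory) that the runner writes the run contents to.
/bin/bash -euo pipefail <script-file>
```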

# scans.[ID].steps[].allow_failure

Each step contains information about its state and outcome.

The state contains the information about whether the step executed successfully or not. The outcome contains the information about whether the step is considered successful or not.

By default, outcome is derived from the state. If the step executed successfully, the outcome is succeeded. If the step failed to execute (i.e. non-zero exit code), the outcome is failed.

However, by setting allow_failure: true, you can change this behavior.

When allow_failure: true is set, the outcome of the step will always be succeeded, regardless of the state.

Using allow_failure: true is useful in situations where you want to run a step, but you don't want the job to fail if the step fails. For example, you might want to notify an external service when the job is completed, but you don't want the job to fail if the notification fails.

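A sketch of a job that notifies an external service without failing the job if the notification fails (the commands and the NOTIFY_WEBHOOK_URL variable are illustrative):

```yaml
steps:
  - run: |
      ./run-the-scan.sh
  - allow_failure: true
    run: |
      # If this notification fails, the step's outcome is still "succeeded".
      curl -X POST -d "job finished" "$NOTIFY_WEBHOOK_URL"
```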

# Next Steps

Learn more about specific workflow components: