Workflows
Workflows are the way you describe automation. They capture, concisely and declaratively, how you work. A workflow should act as a replica of you, doing the parts of your work that can be automated. It should run as a continuous process, feeding you information in real time so you can do the best work you possibly can.
Pipelines, on the other hand, are sequences of steps executed from start to finish. A pipeline is much easier to reason about, but it is more natural to work in terms of workflows.
Workflows vs Pipelines
The main difference between workflows and pipelines is that pipelines are executed from start to finish. You can have parallelism in pipelines, but essentially, you are passing through stages. Let's take a simple example to understand the difference.
Pipeline example:
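A typical recon pipeline (the stages here are picked purely for illustration) might be:

```
subdomain discovery -> probe for live hosts -> port scan -> report
```

Each stage starts only after the previous one finishes, and the sequence runs from start to finish as a single unit.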
The main problem here is that you either execute the entire pipeline, or you don't. But that is not the way you would approach a target. You would use common data to execute multiple pipelines, and this is where workflows kick in.
Let's say you start with subdomain discovery. You run multiple subdomain discovery tools to maximize the target scope. From there, you can have multiple pipelines. For example:
- Screenshot all subdomain landing pages to quickly pick a target
- Run httprobe to see what subdomains are live
- Run port scan on each target.
- ...
Open ports are not that common, and a port scan is noisy, so we can run it every other day or once a week. Running httprobe, however, is relatively quick and rarely raises alarms, so we can do it every 6 hours.
This is an example of how a single stage can feed multiple pipelines, and you are probably already working this way. A rough sketch of that separation is shown below.
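In workflow terms (the syntax is explained later in this document; the scan names and schedules here are made up for illustration), the different cadences might look like:

```yaml
scans:
  httprobe:
    on:
      cron: 0 0 */6 * * *   # every 6 hours
    # ...
  portscan:
    on:
      cron: 0 0 0 * * 0     # once a week, Sunday at midnight
    # ...
```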
Terminology
| term | description |
| --- | --- |
| scan | A scan is a description of the job: it describes what you want to run. |
| job | A job is a single invocation of a scan. Each scan can have 0 or more jobs, while each job belongs to exactly one scan. |
| workflow | A workflow is a collection of scans, described in your workflow file. It is a self-contained unit, meaning one workflow cannot use another workflow. |
| pipeline | A collection of scans that depend on each other. It is a subset of a workflow. |
Writing workflows
Workflows are YAML definitions in which you describe what you want to do. YAML has proved to be a great fit for describing intent and is commonly used for workflow definitions.
Initially, workflows were written in a UI editor, but that proved much more complicated than the declarative approach.
The editor does, however, provide basic syntax support, so you can use autocomplete to see which fields are available.
Workflow components
scans
Each workflow is made of one or more scans. Each scan has a unique name, which is a key in the `scans` object. The name is written as `[ID]` for the rest of this document, since it is unique for each scan and therefore serves as an identifier.
You can have an unlimited number of scans, as long as their names are unique. Names may only contain alphanumeric characters and the `_` character.
| name | valid |
| --- | --- |
| example | true |
| test_1 | true |
| test-one | false |
Example:
```yaml
scans:
  assetfinder:
    on:
      cron: 0 0 * * * *
    steps:
      - run: assetfinder example.com
        shell: bash
```
scans.[ID].on
Serves as the trigger for scan execution. This object contains the specification of the event that triggers a scan.
To start the pipeline, there must exist some event to trigger it. Events are raised automatically (on `cron`) or manually (on `dispatch`). From there, you build the next stage by running on `expr`.
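To make the three trigger kinds concrete, here is a minimal sketch (the scan names and commands are made up for illustration; each trigger type is covered in detail below):

```yaml
scans:
  nightly:
    on:
      cron: 0 0 0 * * *   # raised automatically, every day at midnight
    steps:
      - run: echo "scheduled scan"
        shell: bash
  on_demand:
    on:
      dispatch: {}        # raised manually, on demand
    steps:
      - run: echo "manually triggered scan"
        shell: bash
  follow_up:
    on:
      expr: size(scans.nightly) > 0 && scans.nightly[0].status == 'succeeded'
    steps:
      - run: echo "next stage, scheduled off another scan's result"
        shell: bash
```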
scans.[ID].on.cron
Cron defines the schedule on which the scan is executed.
Cron is described in the form:

| second | minute | hour | day of month | month | day of week |
| --- | --- | --- | --- | --- | --- |
| required | required | required | required | required | required |
If you need help writing or testing your cron schedule, you can use crontab guru. It is a great site that will help you write and understand the spec you pass to the scheduler.
As a limitation, the cron parser does not allow non-standard modifiers such as `@hourly`. Currently supported specifications include:
- Wildcard (`*`)
- Lists (`,`)
- Ranges (`-`)
- Steps (`/`)
Example:
```yaml
scans:
  assetfinder:
    on:
      cron: 0 * * * * *
```
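A few more schedules, purely for illustration, showing the supported specifiers in use:

```
0 0 */6 * * *    # every 6 hours, at the top of the hour (step)
0 30 9 * * 1-5   # at 09:30:00 on weekdays, Monday through Friday (range)
0 0 0 1,15 * *   # at midnight on the 1st and 15th of the month (list)
```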
scans.[ID].on.expr
This field allows you to specify an expression that describes when the scan should be scheduled. Read more about expressions here.
An expression scan runs only if its expression evaluates to `true`. After each successful scan, the server goes through the expression scans and evaluates whether each of them should be scheduled. Only expression scans are evaluated at this time.
Let's build a somewhat complicated example to show how it can be used in the real world.
Goals:
- Run multiple subdomain discovery tools. For simplicity, let's pick `subfinder` and `assetfinder`.
- Since we only want to run probes if there is a diff, we create a "meta" scan that combines their results.
- `httprobe` can execute once the subdomain discovery stage is over and there is a diff.

We can combine the power of the `needs` field and use `expr` to further describe when our probe should be executed.
Example
```yaml
scans:
  assetfinder:
    on:
      cron: 0 1 0/12 * * *
    uploads:
      - output.txt
    steps:
      - run: assetfinder ${{ vars.DOMAIN }} > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash
  subfinder:
    on:
      cron: 1 15 0/12 * * *
    uploads:
      - output.txt
    steps:
      - run: subfinder -d ${{ vars.DOMAIN }} > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash
  subdomaindiscovery:
    needs:
      - subfinder
      - assetfinder
    on:
      expr: "true" # The needs field is good enough to describe the intent, so expr should always be true
    uploads:
      - output.txt
    steps:
      # Download the latest assetfinder result
      - run: |
          mkdir assetfinder
          cd assetfinder
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.assetfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Download the latest subfinder result
      - run: |
          mkdir subfinder
          cd subfinder
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Combine them into the output.txt file
      - run: |
          cat assetfinder/output.txt > output.txt
          cat subfinder/output.txt >> output.txt
          sort -o output.txt -u output.txt
        shell: bash
  probe:
    needs:
      - subdomaindiscovery
    on:
      expr: |
        size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1 // First time we combined the subdomain findings
        || scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].nonce != scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].nonce // There is a diff between the last 2 successful runs
    uploads:
      - output.txt
    steps:
      # Download the latest subdomain discovery result
      - run: |
          mkdir latest
          cd latest
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # If we got triggered on a diff, also download the previous successful result
      - if: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) > 1
        run: |
          mkdir older
          cd older
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # If we got triggered on a diff, find out what the diff is
      - if: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) > 1
        run: |
          cat latest/output.txt | anew older/output.txt -d > new-subdomains.txt
        shell: bash
      # Otherwise, if this is not a diff trigger, use the output of the latest scan instead
      - if: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1
        run: |
          cp latest/output.txt new-subdomains.txt
        shell: bash
      # The environment is now prepared. The subdomains we want to probe are written in
      # the `new-subdomains.txt` file. Let's finish the job
      - run: cat new-subdomains.txt | httprobe > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash
```
Let's summarize why and how this works. Periodically, we want to check whether a new domain has been discovered. For that, we run two tools: `assetfinder` and `subfinder`.
The purpose of those scans is to find as many subdomains as we possibly can. But since these are just two tools accomplishing the same thing, we create a meta scan that only combines their results. At the end of the day, we care about subdomain discovery as a whole, not about the result of a single tool.
To combine these two, we could write a somewhat complicated expression, or we can simplify things using the `needs` field.
The `needs` field was introduced so we can avoid writing multi-line expressions for a common task. And since the `needs` field accomplishes what we are trying to do, the `expr` field should simply evaluate to `true`.
With that out of the way, our `probe` scan requires `subdomaindiscovery`. We can express this using the `needs` field as well. But this time, the `needs` field alone is not good enough: we only want to run if `subdomaindiscovery` contains a diff. This is where the power of `expr` kicks in, so let's dive into it.
The `needs` field ensures that the `probe` scan has not yet seen the latest successful `subdomaindiscovery` result. That means we have at least one successful result. If there is exactly 1 successful result, the entire output is new to `probe`, which is captured by the expression `size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1`.
However, if there are multiple successful scans, the latest two successful scans should be examined. If their nonce values (the hash of the result) differ, the scan should be scheduled. Therefore: `scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].nonce != scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].nonce`.
To combine these two, we want to run when the `subdomaindiscovery` stage is done for the first time OR there is a diff.
So the resulting expression is:

```
size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1
|| scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].nonce != scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].nonce
```
scans.[ID].on.dispatch
This field allows you to specify a scan that can start a pipeline. The idea behind this scan type is that you may have expensive tools that you want to run on demand. They should not be executed periodically, but they can start a pipeline of other tools.
Example
```yaml
scans:
  expensive:
    on:
      dispatch: {} # Important to add '{}', since dispatch is an empty object in YAML
    steps:
      - run: echo "Executing expensive tool with API rate limit etc."
        shell: bash
  expensive_stage_2:
    on:
      expr: size(scans.expensive) > 0 && scans.expensive[0].status == 'succeeded'
    steps:
      - run: echo "You continue normally from here"
        shell: bash
```
scans.[ID].needs[]
The `needs` field was created to simplify the `on.expr` field. Let's describe the problem first, and then explain how to use this field properly.
When you chain scans, you continuously move through stages. First, you finish subdomain discovery. Then, you check for live hosts. Then, on those live hosts, you run directory brute-forcing, port scanning, etc.
To write such an expression properly, you would have to handle multiple corner cases. Since the platform evaluates every expression scan each time a scan finishes, we can end up running a scan multiple times off the same dependency.
To better understand this, let's write down a faulty workflow whose scans run multiple times for no good reason.
```yaml
scans:
  stage_one:
    on:
      dispatch: {}
    steps:
      - run: echo "Do something in the first stage"
        shell: bash
  stage_two:
    on:
      expr: |
        size(scans.stage_one) > 0 // stage_one must have at least one result
        && scans.stage_one[0].status == 'succeeded' // and the latest scan succeeded
    steps:
      - run: echo "Run stage two!"
        shell: bash
  stage_three:
    on:
      expr: |
        size(scans.stage_two) > 0 && scans.stage_two[0].status == 'succeeded'
    steps:
      - run: echo "Run stage three!"
        shell: bash
```
This looks okay, but it is not. Let's examine the scheduler and what it will do in each case.
First, we need an entrypoint. That is our `stage_one` scan. It is manually triggered for simplicity.
Once `stage_one` is done, the scheduler goes through all scans associated with this workflow and evaluates each expression. This ensures that a single scan can trigger multiple scans.
Let's say the server first evaluates `stage_two`. The expression evaluates to `true`, since we have 1 completed job and it is successful. So the scheduler creates a new job record, meaning it schedules the `stage_two` scan. So far so good.
Next, it evaluates `stage_three`. `stage_two` is not yet done, so the size of its jobs is 0. The scheduler does not create a new job record for `stage_three`.
Once `stage_two` is done, we go through all scans again to see if some expression matches, and we schedule `stage_three`.
Then `stage_three` is done, so we evaluate expressions again. Whoops! This is where the problem is. `stage_two` will evaluate to `true` again. It is still the case that `stage_one` is done (has more than 0 jobs) and its latest result is successful. Now `stage_two` is scheduled, and when done, it triggers `stage_three` again, and it goes in cycles.
To fix this issue using an expression, we can write `stage_two`'s expression like this:
```yaml
stage_two:
  on:
    expr: |
      size(scans.stage_one) > 0 // stage_one must have at least one result
      && (
        size(scans.stage_two) == 0 // stage_two doesn't have any job records, and stage_one is done
        || scans.stage_two[0].id < scans.stage_one[0].id // This means that the last job for stage_two
                                                         // occurred before the latest result of stage_one
      )
```
This can be even more complicated if you want to account for only successful scans:
```yaml
stage_two:
  on:
    expr: |
      size(scans.stage_one.filter(s, s.status == 'succeeded')) > 0
      && (
        size(scans.stage_two.filter(s, s.status == 'succeeded')) == 0
        || scans.stage_two.filter(s, s.status == 'succeeded')[0].id < scans.stage_one.filter(s, s.status == 'succeeded')[0].id
      )
```
This case is common enough that there should be a mechanism to simplify it. This is where the `needs` field kicks in.
The `needs` field effectively:

1. Ensures that the latest result of the dependency is successful.
2. Ensures that this latest result appeared after the latest invocation of the scan that specifies the `needs` field.
3. Ensures that all dependencies specified in `needs` are matched. That means that for each scan listed in `needs`, steps 1 and 2 are evaluated, and the result of `needs` is true only if every entry evaluates to true.
There is a nice side benefit as well: if you only depend on another scan and don't need to inspect anything further in order to have a scan scheduled, your expression can simply be:
```yaml
scans:
  stage_one:
    # ...
  stage_two:
    needs:
      - stage_one
    on:
      expr: "true"
    # ...
```
scans.[ID].uploads[]
The uploads field specifies which files or directories are uploaded to the server when the scan succeeds. If the scan fails, no files are uploaded.
Uploads are required if you have any dependencies between scans. Uploaded files do not show up in the UI, but there is a button to download them. In the UI, only `stdout` and `stderr` are displayed. The reader for `stdout` is limited to 2 MiB, so please do not try to print everything to `stdout` or `stderr`.
Example
```yaml
scans:
  example:
    on:
      cron: 0 0 0 * * *
    uploads:
      - file.txt
      - outdir/subdir
    steps:
      - run: echo "Contents of the file" > file.txt
        shell: bash
      - run: mkdir -p outdir/subdir
        shell: bash
      - run: |
          echo "Contents of file 1 in subdir" > outdir/subdir/file1
          echo "Contents of file 2 in subdir" > outdir/subdir/file2
          echo "Uploading directories is useful when you run a tool that writes its outputs to a directory"
        shell: bash
```
scans.[ID].steps[]
Every job is composed of one or more steps that are executed in order. A step can be as simple as a single command, or it can be a script written entirely in the workflow. Each step's `stdout` and `stderr` is streamed back to the server and later displayed as a result.
If the step fails, the job fails. Failure of the step is based on the exit code of the script.
If a step fails, its `stderr` is displayed in the UI instead of its `stdout`.
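As a small sketch of the failure semantics (the scan name and commands are made up for illustration), any step that exits with a non-zero code fails the job:

```yaml
scans:
  exit_code_demo:
    on:
      dispatch: {}
    steps:
      # Exits with 0, so the step succeeds
      - run: echo "this step passes"
        shell: bash
      # grep exits with a non-zero code when nothing matches,
      # so this step fails and the job is marked as failed
      - run: echo "no match here" | grep "needle"
        shell: bash
```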
scans.[ID].steps[].if
Steps may be conditionally skipped. This is useful when you want to execute a step only `if` a condition is met.
Let's illustrate this with a somewhat involved example.
Our objectives for this example are:
- Run `httprobe` if either `assetfinder` or `subfinder` is successful.
- Since one of our dependency scans can be done while the other one is not, we need to conditionally download each result.
- If both of them are done, we should combine their latest outputs.

Keep in mind that, in this example, we can have the following sequence of events:
1. `assetfinder` successfully finishes.
2. `subfinder` is not scheduled yet, so we schedule `httprobe`.
3. `subfinder` is scheduled after `httprobe`.
4. `httprobe` is scheduled again, and we re-use the latest data from `assetfinder`. We don't care that we already executed `httprobe` on the subdomains it found. Maybe, just maybe, some of those subdomains are up, and we want to test them.
```yaml
scans:
  assetfinder:
    uploads:
      - output.txt
    # ...
  subfinder:
    uploads:
      - output.txt
    # ...
  httprobe:
    on:
      expr: |
        (size(scans.assetfinder) > 0 && scans.assetfinder[0].status == 'succeeded') // If the latest assetfinder scan succeeded
        || (size(scans.subfinder) > 0 && scans.subfinder[0].status == 'succeeded') // If the latest subfinder scan succeeded
    steps:
      # Prepare the subdomains file
      - run: touch subdomains.txt
        shell: bash
      # Run if the latest assetfinder was successful
      - if: |
          size(scans.assetfinder) > 0
          && scans.assetfinder[0].status == 'succeeded'
        run: |
          echo "Downloading assetfinder results using bh"
          echo "Making directory to download the result to"
          mkdir assetfinder && cd assetfinder
          echo "Downloading the latest assetfinder result"
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.assetfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat assetfinder/output.txt >> subdomains.txt
        shell: bash
      # Run if the latest subfinder was successful
      - if: |
          size(scans.subfinder) > 0
          && scans.subfinder[0].status == 'succeeded'
        run: |
          echo "Downloading subfinder results using bh"
          echo "Making directory to download the result to"
          mkdir subfinder && cd subfinder
          echo "Downloading the latest subfinder result"
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat subfinder/output.txt >> subdomains.txt
        shell: bash
      - run: |
          echo "Ensuring results are unique and sorted"
          sort -o subdomains.txt -u subdomains.txt
        shell: bash
      - run: cat subdomains.txt | httprobe
        shell: bash
```
scans.[ID].steps[].run
Run specifies a script to be executed. It is written to a temporary file on disk and executed by the shell specified in `scans.[ID].steps[].shell`.
The directory where the script is written depends on how the runner is configured. Let's say the runner is located at `/home/runner` and the `workdir` configured for the runner is `_work`; then the script is temporarily written to `/home/runner/_work/{job_id}/{uuid}`.
The script lives only during the step. It will be cleaned up after each step.
To write a script, you can use template evaluation syntax. To read more about that, please see the "Template evaluation" section.
```yaml
- run: sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash
- run: |
    sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash
```
scans.[ID].steps[].shell
Shell specifies the executable that will execute the script. The runner passes the shell value as an argument to the `which` command. If the executable does not exist on the path, the step will fail. Right now, there is a limitation: you cannot provide arguments to the command. This may change in the future.
For example, let's say the step is the following:
- run: echo "example"
shell: bash
The steps the runner will take are:
- Call `which bash`. Let's say the output is `/bin/bash`.
- Create a temporary file with a random name. For the random name, a UUIDv4 is used. For simplicity, let's refer to it as `{UUID}`.
- Take the content of `run` and write it to that file.
- Execute `/bin/bash {UUID}`.
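Given that resolution procedure, it seems plausible that any executable on the path that accepts a script file as its first argument could be used as the shell. This is an assumption based on the steps above, not a documented guarantee; a hypothetical sketch:

```yaml
# Hypothetical: python3 would be resolved via `which python3` and then
# invoked as `/usr/bin/python3 {UUID}`, with the run body written to
# the temporary script file.
- run: |
    import json
    print(json.dumps({"status": "ok"}))
  shell: python3
```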