Workflows

Workflows are how you describe automation. They should capture the way you work in a concise, declarative form. A workflow should behave as if a replica of you were doing everything that can be automated: a continuous process that feeds you information in real time, so that you can do the best work you possibly can.

Pipelines, on the other hand, are sequences of steps that are executed from start to finish. A pipeline is much easier to reason about, but it is more natural to work in terms of workflows.

Workflows vs Pipelines

The main difference between workflows and pipelines is that pipelines are executed from start to finish. You can have parallelism in pipelines, but essentially, you are passing through stages. Let's take a simple example to understand the difference.

Pipeline example:

[Figure: pipeline]

Put simply, a pipeline is a series of steps in which the output of one stage is used as the input to the next.

A workflow is the continuous execution of multiple pipelines. A step in stage 4 may, for example, trigger a re-run of stage 1. It is a live thing, continuously automating what you usually do so that you don't have to. When something is triggered is completely up to you and depends on the way you do things (i.e. your workflow).

Terminology

| term | description |
| --- | --- |
| scan | A scan is the description of a job. It describes what you want to run. |
| job | A job is a single invocation of a scan. Each scan can have zero or more jobs, while each job belongs to exactly one scan. |
| workflow | A workflow is a collection of scans, described in your workflow file. It is a self-contained unit, meaning that one workflow cannot use another workflow. |
| pipeline | A collection of scans that depend on each other. It is a subset of the workflow. |

Writing workflows

Workflows are YAML definitions in which you describe what you want to do. YAML has proved to be a great tool for describing intent, and it is commonly used to describe workflows.

Initially, workflows were written in a UI editor, but that proved to be much more complicated than the declarative approach.

Right now, the experience of writing workflows is horrible. Hopefully, LSP support will be implemented soon; until then, please keep this reference open in case parsing fails. The server responds with an error specifying what is missing, but that is not good enough.

The editor does, however, include basic syntax support, so you can rely on autocompletion to see which fields are available.

Workflow components

scans

Each workflow is made of one or more scans. Each scan has a unique name, which is a key in the scans object. The name will be presented as [ID] for the rest of the document, since it is unique for each scan, and therefore, serves as an identifier.

You can have an unlimited number of scans, as long as their names are unique. Names may contain only alphanumeric characters and the _ character.

| name | valid |
| --- | --- |
| example | true |
| test_1 | true |
| test-one | false |
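
The naming rule can be checked locally before you submit a workflow. The following is a minimal sketch in Python. The function name is_valid_scan_name and the regular expression are assumptions derived from the rule above (ASCII letters, digits, and _); this is not the server's actual validation code.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
_SCAN_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_scan_name(name: str) -> bool:
    """Return True if the name contains only alphanumerics and underscores."""
    return _SCAN_NAME.fullmatch(name) is not None


# The three names from the table above.
for candidate in ("example", "test_1", "test-one"):
    print(candidate, is_valid_scan_name(candidate))
```

The loop prints the same verdicts as the table: the first two names are accepted, and test-one is rejected because of the hyphen.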

Example:

scans:
  assetfinder:
    on:
      cron: 0 * * * *
    steps:
      - run: assetfinder example.com
        shell: bash

scans.[ID].on

Serves as the trigger for scan execution. This object specifies the events that will trigger the scan.

To start a pipeline, some event must trigger it. Events are raised automatically (on cron) or manually (on dispatch). From there, you build the next stages by triggering them on expr.

scans.[ID].on.cron

Cron defines the schedule on which the scan is executed.

A cron schedule is described in the form:

| minute | hour | day of month | month | day of week |
| --- | --- | --- | --- | --- |
| required | required | required | required | required |

If you need help writing or testing your cron schedule, you can use crontab guru, a site that helps you write and understand the spec you give to the scheduler.

As a limitation, the cron parser does not accept non-standard modifiers such as @hourly. The currently supported specifications are:

  • Wildcard (*)
  • Lists (,)
  • Ranges (-)
  • Steps (/)
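
To make the supported syntax concrete, here is a rough Python sketch of a checker that accepts five fields built from wildcards, lists, ranges, and steps, and rejects modifiers such as @hourly. The function name is_supported_cron and the per-field value bounds are assumptions based on common cron conventions; the server's real parser may differ in details.

```python
import re

# Assumed bounds per field (minute, hour, day of month, month, day of week),
# following common cron conventions; not confirmed by this reference.
FIELD_BOUNDS = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]

# One list item: "*", "N" or "N-M", optionally followed by "/step".
ITEM = re.compile(r"(\*|\d+(?:-\d+)?)(?:/(\d+))?")


def is_supported_cron(spec: str) -> bool:
    """Check that spec has five fields using only *, lists, ranges and steps."""
    fields = spec.split()
    if len(fields) != 5:
        return False
    for field, (low, high) in zip(fields, FIELD_BOUNDS):
        for item in field.split(","):
            match = ITEM.fullmatch(item)
            if match is None:
                return False
            base, step = match.group(1), match.group(2)
            if step is not None and int(step) == 0:
                return False
            if base != "*":
                values = [int(value) for value in base.split("-")]
                if any(value < low or value > high for value in values):
                    return False
                if len(values) == 2 and values[0] > values[1]:
                    return False
    return True


print(is_supported_cron("0 * * * *"))
print(is_supported_cron("@hourly"))
```

The first call accepts the hourly schedule used in the examples, while the second rejects @hourly because it is not a five-field specification.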

Example:

scans:
  assetfinder:
    on:
      cron: 0 * * * *

scans.[ID].on.expr

This field lets you specify an expression that describes when the scan should be scheduled. Read more in the expressions reference.

An expression scan runs only if its expression evaluates to true. After each successful scan, the server goes through the expression scans and evaluates whether each of them should be scheduled. Only expression scans are evaluated at this point.

Let's build a somewhat complicated example to show how it can be used in the real world.

Goals:

  1. Run multiple subdomain discovery tools. For simplicity, let's pick subfinder and assetfinder.
  2. Since we want to run probes only if there is a diff, create a "meta" scan that combines the results.
  3. httprobe can run once the subdomain discovery stage is over and there is a diff.

We can combine the power of the needs field with expr to further describe when our probe should be executed.

Example

scans:
  assetfinder:
    on:
      cron: 0 0/12 * * *
    uploads:
      - output.txt
    steps:
      - run: assetfinder ${{ vars.DOMAIN }} > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash

  subfinder:
    on:
      cron: 15 0/12 * * *
    uploads:
      - output.txt
    steps:
      - run: subfinder -d ${{ vars.DOMAIN }} > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash

  # Consolidate results in a single place
  subdomaindiscovery:
    on:
      expr: |
        scans.subfinder.is_available() || scans.assetfinder.is_available()
    uploads:
      - output.txt
    steps:
      # Download the latest assetfinder result if available
      - if: scans.assetfinder.is_available()
        run: |
          mkdir assetfinder
          cd assetfinder
          bh job download -j ${{ scans.assetfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Download the latest subfinder result if available
      - if: scans.subfinder.is_available()
        run: |
          mkdir subfinder
          cd subfinder
          bh job download -j ${{ scans.subfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Combine them in the output.txt file
      - run: |
          cat assetfinder/output.txt > output.txt
          cat subfinder/output.txt >> output.txt
          sort -o output.txt -u output.txt
        shell: bash

  probe_subdomains:
    on:
      expr: |
        scans.subdomaindiscovery.is_available() && scans.subdomaindiscovery.has_diff()
    uploads:
      - output.txt
    steps:
      # Download the latest subdomain discovery result
      - run: |
          mkdir latest
          cd latest
          bh job download -j ${{ scans.subdomaindiscovery.filter(s, s.state == 'succeeded')[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # If this is not the first successful run, download the previous
      # subdomain discovery result so that we can calculate the diff
      - if: size(scans.subdomaindiscovery.filter(s, s.state == 'succeeded')) > 1
        run: |
          mkdir older
          cd older
          bh job download -j ${{ scans.subdomaindiscovery.filter(s, s.state == 'succeeded')[1].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # With two results available, keep only the subdomains that are new
      # in the latest run
      - if: size(scans.subdomaindiscovery.filter(s, s.state == 'succeeded')) > 1
        run: |
          cat latest/output.txt | anew older/output.txt -d > new-subdomains.txt
        shell: bash
      # Otherwise, if this is not a diff trigger, let's use the output of the latest scan instead
      - if: size(scans.subdomaindiscovery.filter(s, s.state == 'succeeded')) == 1
        run: |
          cp latest/output.txt new-subdomains.txt
        shell: bash
      # We prepared our environment now. Subdomains that we want to probe are written in
      # the `new-subdomains.txt` file. Let's finish the job
      - run: cat new-subdomains.txt | httprobe > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash

Let's summarize why and how this works. Periodically, we want to check whether new subdomains have been discovered. For that, we run two tools:

  1. assetfinder
  2. subfinder

The purpose of those scans is to find as many subdomains as we possibly can. But since they are just two tools for the same task, we create a meta scan that only combines their results. At this point, we care about subdomain discovery as a whole, not about any single tool's result. The combination can be done in place if you want to save storage.

To schedule the subdomaindiscovery scan, assetfinder or subfinder must be done and ready to be used. For that, a utility function called is_available exists on scans.[ID]. It checks whether there is a job in the "succeeded" state that was executed after the last execution of the subdomaindiscovery scan.

Next, for the sake of example, we want to run httprobe on newly found subdomains. There are two scenarios:

  1. Subdomain discovery is executed for the first time.
  2. Subdomain discovery has multiple executions, so we need to find the new ones.

This can be summarized with scans.subdomaindiscovery.is_available() && scans.subdomaindiscovery.has_diff(). The is_available() method checks that the subdomaindiscovery scan has successful executions later than the latest execution of the probe scan, while has_diff() checks that either this was the first successful execution (the diff between nothing and something is always something) or the last two successful subdomaindiscovery jobs differ in their outputs.

To prepare the environment, we run steps conditionally. If this is the first successful execution of subdomain discovery, the number of succeeded jobs is 1, so the whole output should be used. Otherwise, we download the two latest successful runs, calculate the diff, and use that as our input.
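
The two checks can be modelled in a few lines of Python. This is only a sketch of the semantics described above, not the server's implementation: the Job class, its finished_at timestamp, and the use of the job nonce (the hash of the uploads) to compare outputs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    state: str        # e.g. "succeeded" or "failed"
    finished_at: int  # hypothetical timestamp, larger means later
    nonce: str = ""   # hypothetical: hash of the uploaded content


def is_available(dependency_jobs: List[Job], last_run_at: Optional[int]) -> bool:
    """True if a dependency job succeeded after the dependent scan last ran."""
    return any(
        job.state == "succeeded"
        and (last_run_at is None or job.finished_at > last_run_at)
        for job in dependency_jobs
    )


def has_diff(dependency_jobs: List[Job]) -> bool:
    """True on the first successful job, or when the two latest outputs differ."""
    succeeded = sorted(
        (job for job in dependency_jobs if job.state == "succeeded"),
        key=lambda job: job.finished_at,
        reverse=True,
    )
    if not succeeded:
        return False
    if len(succeeded) == 1:
        return True  # the diff between nothing and something is something
    return succeeded[0].nonce != succeeded[1].nonce


jobs = [Job("succeeded", 10, "aaa"), Job("failed", 20), Job("succeeded", 30, "bbb")]
print(is_available(jobs, last_run_at=15), has_diff(jobs))
```

Here last_run_at stands for the latest execution of the scan whose expression is being evaluated; passing None models a scan that has never run.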

scans.[ID].on.dispatch

This field allows you to specify a scan that can start a pipeline. The idea behind this scan type is that you may have some expensive tools that you want to run on-demand. They should not be executed periodically, but they can start a pipeline of other tools.

Example

scans:
  expensive:
    on:
      dispatch: {} # '{}' is required, since dispatch is an empty object in YAML
    steps:
      - run: echo "Executing expensive tool with API rate limit etc."
        shell: bash

  expensive_stage_2:
    on:
      expr: scans.expensive.is_available()
    steps:
      - run: echo "You continue normally from here"
        shell: bash

scans.[ID].on.dispatch.inputs

Inputs act like project vars, but live only for a single invocation. Specifying inputs means that your scan acts differently depending on the input provided. For those who are also software engineers, inputs are like arguments to a function.

There are two primary reasons to use dispatch:

  1. From time to time, you want to run a scan on a specific program by triggering it manually. Inputs are completely defined by the program (for example, the program's domains, IPs, etc.).
  2. You want to dispatch something from your scan. Say you uncovered a new domain and want to spawn a new pipeline that runs checks only on that particular domain.

Example

scans:
  expensive:
    on:
      dispatch:
        inputs:
          domain:
            type: string
            required: true
    steps:
      - run: echo "Executing expensive tool with API rate limit etc."
        shell: bash
      - run: echo "Scanning ${{ inputs.domain }}"
        shell: bash

scans.[ID].on.dispatch.inputs.[KEY]

Key is a unique identifier for an input. It follows the same constraints as project variables, and it can be accessed from a workflow expression as ${{ inputs.[KEY] }}.

There are a few constraints you should be aware of:

  • Maximum number of inputs is 20.
  • Maximum size of dispatch values is 65535 characters. If you need a larger input, consider uploading assets and passing a reference to them instead.

scans.[ID].on.dispatch.inputs.[KEY].type

The type of this YAML key is string. The value is either string or bool. The default is string.

scans.[ID].on.dispatch.inputs.[KEY].required

The type of this YAML key is boolean. The default is false. If you specify required: true, you MUST provide a value when spawning the job.

As a side note, the UI provides all values by default, and an empty value still counts as a provided value. This constraint therefore matters mostly for API calls (e.g. bh scan dispatch ...).

scans.[ID].on.dispatch.inputs.[KEY].default

The type of this YAML key is string. If you set type to bool, the valid values are "true" and "false". Otherwise, since the only other supported type is string, any string will do. The default is "false" for bool and "" for string.
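
The interplay of type, required, and default can be summarized with a small Python sketch. The resolve_inputs function and its error messages are hypothetical; it only mirrors the rules stated in this section (at most 20 inputs, string or bool types, defaults of "" and "false") and is not the server's actual logic.

```python
from typing import Dict, Optional

MAX_INPUTS = 20  # limit stated in this reference


def resolve_inputs(
    spec: Dict[str, Optional[dict]], provided: Dict[str, str]
) -> Dict[str, str]:
    """Apply defaults and validate provided values against an inputs spec."""
    if len(spec) > MAX_INPUTS:
        raise ValueError("too many inputs")
    resolved = {}
    for key, options in spec.items():
        options = options or {}
        kind = options.get("type", "string")
        if kind not in ("string", "bool"):
            raise ValueError(f"unsupported type for {key}: {kind}")
        default = options.get("default", "false" if kind == "bool" else "")
        if key in provided:
            value = provided[key]  # an empty string still counts as provided
        elif options.get("required", False):
            raise ValueError(f"missing required input: {key}")
        else:
            value = default
        if kind == "bool" and value not in ("true", "false"):
            raise ValueError(f"invalid bool value for {key}: {value}")
        resolved[key] = value
    return resolved


spec = {
    "domain": {"type": "string", "required": True},
    "deep": {"type": "bool"},
}
print(resolve_inputs(spec, {"domain": "example.com"}))
```

In this sketch, values stay strings throughout, matching the statement that the default key is always of type string even when the input type is bool.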

scans.[ID].uploads[]

The uploads field specifies which files or directories are uploaded to the server when the scan is successful. If the scan fails, no files are uploaded to the server.

In most cases, you need uploads when you have dependencies between jobs. Do not confuse uploads with step output. People often write a step that tees the output to a file and then uploads that file. This is not necessary. You can simply upload the file, and it will be available for download.

Using tee is perfectly fine if you don't expect large output. However, you should be aware of the following limitations:

  1. The log has a maximum size of 10 MiB. If you exceed that, the log is truncated. If you want to see the full output, you can always download the file and inspect it.
  2. The log for each job expires after 2 weeks. After that, you won't be able to see it.
  3. You cannot download the stdout or stderr and use it as an input for another scan. Log is written in a way that is not suitable for passing it as input.

If a scan contains uploads, an internal step zips the content you specified and calculates its hash. That hash becomes the nonce value of the job. If the output changes, the hash should be different, which is how a diff in the output is detected.
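
The idea of deriving a nonce from the uploaded content can be sketched in Python. This is an illustration only: the real internal step zips the content, whereas this hypothetical uploads_digest function hashes relative paths and file contents directly, and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def uploads_digest(root: Path, uploads: Iterable[str]) -> str:
    """Hash the relative path and content of every uploaded file, in sorted order."""
    files = []
    for entry in uploads:
        path = root / entry
        if path.is_dir():
            # Directories contribute every regular file below them.
            files.extend(p for p in path.rglob("*") if p.is_file())
        else:
            files.append(path)
    digest = hashlib.sha256()
    for file in sorted(files):
        digest.update(file.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(file.read_bytes())
    return digest.hexdigest()
```

Calling uploads_digest(Path("."), ["file.txt", "outdir/subdir"]) twice on unchanged content returns the same value, while changing any uploaded file changes the result, which is the property the diff detection relies on.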

All uploads must be present in the root directory of your scan. There are multiple reasons for this:

  1. Eliminate potential errors of uploading files from your local machine. You can always cp files from your local machine to the working directory of your job, but you would have to do it explicitly.
  2. Ensure that workflows are not written in a way that they depend on the local machine. If you want, you can always work around this constraint by using the cp command inside the run step to copy files to your present working directory.

Example

scans:
  example:
    on:
      cron: 0 0 * * *
    uploads:
      - file.txt
      - outdir/subdir
    steps:
      - run: echo "Contents of the file" > file.txt
        shell: bash
      - run: mkdir -p outdir/subdir
        shell: bash
      - run: |
          echo "Contents of file 1 in subdir" > outdir/subdir/file1
          echo "Contents of file 2 in subdir" > outdir/subdir/file2
          echo "Directories are useful when you run a tool that writes its outputs to a directory"
        shell: bash

scans.[ID].steps[]

Every job is composed of one or more steps that are taken in order to execute the job. A step can be as simple as running a single script, or it can be a script written entirely in the workflow. Each step's stdout and stderr are streamed back to the server and displayed later.

If the step fails, the job fails. Failure of the step is based on the exit code of the script.

Currently, the stream is not propagated to the UI while the job is being executed. Hopefully, that functionality will be implemented soon. If you notice any issues with the output logs, please do not hesitate to report them.

scans.[ID].steps[].if

Steps may be skipped conditionally. This is useful when you want to execute a step only if a condition is met. Let's illustrate this with a more involved example.

Our objectives for this example are:

  • To run httprobe if either assetfinder is successful or subfinder is successful.
  • Since there can be an instance where one of our dependency scans is done while the other one is not, we need to conditionally download the result.
  • If both of them are done, we should combine their latest outputs.

Keep in mind, in this example scan, we can have the following sequence of events:

  1. assetfinder successfully finishes
  2. subfinder is not scheduled yet, so we schedule the httprobe.
  3. subfinder is scheduled after httprobe.
  4. httprobe is scheduled again and re-uses the latest data from assetfinder. We don't mind that httprobe has already run on the subdomains that assetfinder found. Maybe, just maybe, some of those subdomains are up now, and we want to test them.

scans:
  assetfinder:
    uploads:
      - output.txt
    # ...
  subfinder:
    uploads:
      - output.txt
    # ...
  httprobe:
    on:
      expr: |
        scans.assetfinder.is_available() || scans.subfinder.is_available()
    steps:
      # Prepare the subdomains file
      - run: touch subdomains.txt
        shell: bash
      # Run if the latest assetfinder was successful
      - if: scans.assetfinder.is_available()
        run: |
          echo "Downloading assetfinder results using bh"
          echo "Making directory to download the result to"
          mkdir assetfinder && cd assetfinder
          echo "Downloading the latest assetfinder result"
          bh job download -j ${{ scans.assetfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat assetfinder/output.txt >> subdomains.txt
        shell: bash
      - if: scans.subfinder.is_available()
        run: |
          echo "Downloading subfinder results using bh"
          echo "Making directory to download the result to"
          mkdir subfinder && cd subfinder
          echo "Downloading the latest subfinder result"
          bh job download -j ${{ scans.subfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat subfinder/output.txt >> subdomains.txt
        shell: bash
      - run: |
          echo "Ensuring results are unique and sorted"
          sort -o subdomains.txt -u subdomains.txt
        shell: bash
      - run: cat subdomains.txt | httprobe
        shell: bash

scans.[ID].steps[].run

Run specifies a script to be executed. It is written to a temporary file on disk and executed by the shell specified in scans.[ID].steps[].shell.

The directory where the script is written depends on the way the runner is configured. If the runner is located at /home/runner and the workdir configured for the runner is _work, the script is temporarily written to /home/runner/_work/{job_id}/{uuid}.

The script lives only for the duration of the step. It is cleaned up after each step.

To write a script, you can use template evaluation syntax. To read more about that, see the "Template evaluation" section.

There may be situations where the workflow syntax looks valid, but the server fails to parse it. The likely cause in the following example is the ": " sequence inside an unquoted value, which YAML reads as a key/value separator:

- run: sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash

To fix this problem, write the script as a block scalar:

- run: |
    sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash

scans.[ID].steps[].shell

Shell specifies the executable that runs the script. The runner evaluates the shell value with a shlex split, which separates the arguments the way a shell would split positional arguments.

Then, the first argument is used as the entrypoint command, the rest are used as arguments to the command, and the script file is appended at the end of the arguments.

Let's say you want to run:

- run: echo "example"
  shell: bash --noprofile --norc -eo pipefail

This should be equivalent to the following:

bash --noprofile --norc -eo pipefail {random_filename}
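
The same behaviour can be approximated in Python with the standard shlex module. This is a minimal sketch under stated assumptions: the helper names build_command and run_step are hypothetical, and the temporary file location differs from the runner's real work directory.

```python
import os
import shlex
import subprocess
import tempfile
from typing import List


def build_command(shell: str, script_path: str) -> List[str]:
    """Split the shell value like a POSIX shell would and append the script file."""
    return shlex.split(shell) + [script_path]


def run_step(script: str, shell: str) -> int:
    """Write the script to a temporary file, run it, and return the exit code."""
    with tempfile.NamedTemporaryFile("w", suffix=".step", delete=False) as handle:
        handle.write(script)
        script_path = handle.name
    try:
        return subprocess.run(build_command(shell, script_path)).returncode
    finally:
        os.remove(script_path)  # the script lives only for the duration of the step


print(build_command("bash --noprofile --norc -eo pipefail", "{random_filename}"))
```

The print call shows the argument list that would be executed: the entrypoint bash, its flags, and the script path as the last argument. A non-zero return value from run_step corresponds to a failed step, and therefore a failed job.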