Workflows

Workflows are the way you describe automation. They should describe how you do things in a concise and declarative way. A workflow should act as if there were a replica of you doing the work that can be automated. It should be a continuous process, feeding you information in real time and allowing you to do the best work you possibly can.

Pipelines, on the other hand, are sequences of steps that are executed from start to finish. It is much easier to reason about a pipeline, but it is more natural to work in terms of workflows.

Workflows vs Pipelines

The main difference between workflows and pipelines is that pipelines are executed from start to finish. You can have parallelism in pipelines, but essentially, you are passing through stages. Let's take a simple example to understand the difference.

Pipeline example:

[Diagram: a pipeline of stages executed in sequence from start to finish]

The main problem here is that you either execute the entire pipeline or you don't. But that is not how you would approach a target. You would use common data to feed multiple pipelines, and this is where workflows kick in.

Let's say you start with subdomain discovery. You run multiple subdomain discovery tools to maximize the target scope. From there, however, you can have multiple pipelines. For example:

  1. Screenshot all subdomain landing pages to quickly pick a target
  2. Run httprobe to see what subdomains are live
  3. Run a port scan on each target.
  4. ...

Open ports are not found that often, and a port scan is noisy, so we can run it every other day or once a week. Running httprobe, however, is relatively quick and rarely raises alarms, so we can do it every 6 hours.

This is an example of how a single stage can feed multiple pipelines. And I would guess you are already doing that right now.
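
For illustration, here is a minimal sketch of that idea, using the scan syntax described in the rest of this document (the names, schedules, and commands are purely illustrative):

scans:
  subdomains:
    on:
      # every 6 hours
      cron: 0 0 */6 * * *
    uploads:
      - output.txt
    steps:
      - run: assetfinder example.com > output.txt
        shell: bash

  # quick pipeline: probe right after every discovery run
  probe:
    needs:
      - subdomains
    on:
      expr: "true"
    steps:
      - run: echo "probe the discovered subdomains here"
        shell: bash

  # noisy pipeline: port scan on its own, slower schedule
  portscan:
    on:
      # roughly every other day at 04:00
      cron: 0 0 4 */2 * *
    steps:
      - run: echo "port scan the discovered subdomains here"
        shell: bash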

Terminology

term | description
---- | -----------
scan | A scan is the description of a job. It describes what you want to run.
job | A job is a single invocation of a scan. Each scan can have 0 or more jobs, while each job belongs to exactly one scan.
workflow | A workflow is a collection of scans, described in your workflow file. It is a self-contained unit, meaning one workflow cannot use another workflow.
pipeline | A collection of scans that depend on each other. It is a subset of a workflow.

Writing workflows

Workflows are YAML definitions in which you describe what you want to do. YAML has proven to be a good fit for declaring intent and is commonly used for describing workflows.

Initially, workflows were written with a UI editor, but that proved to be much more complicated than the declarative approach.

Right now, the experience of writing workflows is rough. Hopefully, a language client will be implemented soon, but until then, please keep this reference open in case parsing fails.

The editor does, however, ship with basic syntax support, so you can use autocomplete to see which fields are available.

Workflow components

scans

Each workflow is made up of one or more scans. Each scan has a unique name, which is a key in the scans object. The name is written as [ID] throughout the rest of this document; since it is unique for each scan, it serves as an identifier.

You can have an unlimited number of scans, as long as their names are unique. Names may only contain alphanumeric characters and the _ character.

name | valid
---- | -----
example | true
test_1 | true
test-one | false

Example:

scans:
  assetfinder:
    on:
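      # six-field spec: run at minute 0, second 0 of every hour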
      cron: 0 0 * * * *
    steps:
      - run: assetfinder example.com
        shell: bash

scans.[ID].on

Serves as the trigger for scan execution. This object contains the specification of the event that will trigger a scan.

To start a pipeline, there must be an event that triggers it. Events are raised automatically (on.cron) or manually (on.dispatch). From there, you build the next stages by running on on.expr.
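
As a rough sketch, the three trigger types look like this side by side (the scan names are illustrative; each field is described in detail below):

scans:
  scheduled:
    on:
      # raised automatically on a schedule
      cron: 0 0 * * * *
    # ...
  manual:
    on:
      # raised manually, on demand
      dispatch: {}
    # ...
  follow_up:
    on:
      # evaluated whenever other scans in the workflow finish
      expr: size(scans.manual) > 0 && scans.manual[0].status == 'succeeded'
    # ...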

scans.[ID].on.cron

Cron defines the schedule on which the scan is executed.

Cron is described in the following form:

second | minute | hour | day of month | month | day of week
------ | ------ | ---- | ------------ | ----- | -----------
required | required | required | required | required | required

If you need help specifying or testing your cron schedule, you can use crontab guru. It is a great site that helps you write and understand the spec for the scheduler. Keep in mind that crontab guru uses the standard five-field format, while the spec here takes a leading seconds field.

As a limitation, the cron parser does not allow non-standard macros such as @hourly. Currently supported specifications include:

  • Wildcard (*)
  • Lists (,)
  • Ranges (-)
  • Steps (/)
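
For example, assuming the six-field format above, these features can be combined into a single spec (the values are purely illustrative):

scans:
  workday_scan:
    on:
      # second minute hour day-of-month month day-of-week:
      # at second 0, every 15 minutes, between 09:00 and 17:59, on Mon, Wed and Fri
      cron: 0 */15 9-17 * * 1,3,5
    # ...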

Example:

scans:
  assetfinder:
    on:
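      # six-field spec: run at second 0 of every minute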
      cron: 0 * * * * *

scans.[ID].on.expr

This field allows you to specify an expression that describes when the scan should be scheduled. Read more about expressions here.

Expressions describe scans that should run only when the expression evaluates to true. After each successful scan, the server goes through all expression-triggered scans and evaluates whether each one should be scheduled. Only expression-triggered scans are evaluated at this point.

Let's build a somewhat complicated example to show how it can be used in the real world.

Goals:

  1. Run multiple subdomain discovery tools. For simplicity, let's pick subfinder and assetfinder.
  2. Since we want to run probes only when there is a diff, we create a "meta" scan that combines the results.
  3. The httprobe scan executes once the subdomain discovery stage is over and there is a diff.

We can combine the power of the needs field with expr to further describe when our probe should be executed.

Example

scans:
  assetfinder:
    on:
      cron: 0 1 0/12 * * *
    uploads:
      - output.txt
    steps:
    - run: assetfinder ${{ vars.DOMAIN }} > output.txt
      shell: bash
    - run: sort -o output.txt -u output.txt
      shell: bash

  subfinder:
    on:
      cron: 1 15 0/12 * * *
    uploads:
      - output.txt
    steps:
    - run: subfinder -d ${{ vars.DOMAIN }} > output.txt
      shell: bash
    - run: sort -o output.txt -u output.txt
      shell: bash

  subdomaindiscovery:
    needs:
    - subfinder
    - assetfinder
    on:
      expr: "true" // If needs field is good enough to describe the intention. So expr should always be true
    uploads:
      - output.txt
    steps:
      # Download the latest assetfinder result
      - run: |
          mkdir assetfinder
          cd assetfinder
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.assetfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Download the latest subfinder result
      - run: |
          mkdir subfinder
          cd subfinder
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Combine them in the output.txt file
      - run: |
          cat assetfinder/output.txt > output.txt
          cat subfinder/output.txt >> output.txt
          sort -o output.txt -u output.txt
        shell: bash

  probe:
    needs:
    - subdomaindiscovery
    on:
      expr: |
        size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1 // First time we combined the subdomain findings
        || scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].nonce != scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].nonce // There is a diff between last 2 successful runs
    uploads:
      - output.txt
    steps:
      # Download the latest subdomain discovery result
      - run: |
          mkdir latest
          cd latest
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # If we got triggered on diff
      - if: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) > 1
        run: |
          mkdir older
          cd older
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # If we got triggered on diff, find out what is the diff here
      - if: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) > 1
        run: |
          cat latest/output.txt | anew -d older/output.txt > new-subdomains.txt
        shell: bash
      # Otherwise, if this is not a diff trigger, let's use the output of the latest scan instead
      - if: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1
        run: |
          cp latest/output.txt new-subdomains.txt
        shell: bash
      # We prepared our environment now. Subdomains that we want to probe are written in
      # the `new-subdomains.txt` file. Let's finish the job
      - run: cat new-subdomains.txt | httprobe > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash

Let's summarize why and how this works. Periodically, we want to check whether a new subdomain has been discovered. For that, we run two tools:

  1. assetfinder
  2. subfinder

The purpose of those scans is to find as many subdomains as we possibly can. But since these are just two tools accomplishing the same thing, we create a meta scan that only combines their results. At the end of the day, we care about subdomain discovery, not about each individual tool's result at this point.

To combine these two, we could write a somewhat complicated expression, or we can simplify things using the needs field. The needs field was introduced so we can avoid writing multi-line expressions for a common task. And since the needs field accomplishes what we are trying to do, the expr field can simply evaluate to true.

With that out of the way, our probe scan requires subdomaindiscovery. We can express this using the needs field as well. But this time, the needs field alone is not good enough: we only want to run if subdomaindiscovery produced a diff. This is where the power of expr kicks in, so let's dive into it.

The needs field ensures that the probe scan has not yet run against the latest successful subdomaindiscovery result. That means we have at least one successful result. If there is exactly one successful result, then the entire output is new to probe, which is what the first part of the expression captures: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1.

However, if there are multiple successful scans, the latest two should be compared. If their nonce values (the hash of the result) differ, the scan should be scheduled. Hence: scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].nonce != scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].nonce.

To combine these two conditions, we want to run when the subdomaindiscovery stage is done for the first time OR there is a diff.

So the resulting expression is:

size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1
|| scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].nonce != scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].nonce

scans.[ID].on.dispatch

This field marks a scan as manually dispatchable, meaning it can be started on demand to kick off a pipeline. The idea behind this scan type is that you may have expensive tools that you only want to run on demand. They should not be executed periodically, but they can start a pipeline of other tools.

Example

scans:
  expensive:
    on:
      dispatch: {} # Important to add '{}', since dispatch is an empty object in YAML
    steps:
      - run: echo "Executing expensive tool with API rate limit etc."
        shell: bash

  expensive-stage-2:
    on:
      expr: size(scans.expensive) > 0 && scans.expensive[0].status == 'succeeded'
    steps:
      - run: echo "You continue normally from here"
        shell: bash

scans.[ID].needs[]

The needs field was created to simplify the on.expr field. Let's describe the problem first, and then explain how to use this field properly.

When you chain scans, you continuously move through stages. First, you finish subdomain discovery. Then, you check for live hosts. Then, on those live hosts, you do directory brute-forcing, port scanning, and so on.

To write such an expression properly, you would have to handle multiple corner cases. Since the platform, on each finished scan, tries to figure out whether it should schedule every scan that runs on an expression, we can end up running a scan multiple times off the same dependency.

To better understand this, let's write down a faulty workflow whose scans run multiple times for no good reason.

scans:
  stage_one:
    on:
      dispatch: {}
    steps:
      - run: echo "Do something in the first stage"
        shell: bash
  stage_two:
    on:
      expr: |
        size(scans.stage_one) > 0 // stage_one must have at least one result
        && scans.stage_one[0].status == 'succeeded' // and the latest scan is succeeded
    steps:
      - run: echo "Run stage two!"
        shell: bash
  stage_three:
    on:
      expr: |
        size(scans.stage_two) > 0 && scans.stage_two[0].status == 'succeeded'
    steps:
      - run: echo "Run stage three!"
        shell: bash

This looks okay, but it is not. Let's examine what the scheduler does at each point.

First, we need an entrypoint. That is our stage_one scan. It is manually triggered for simplicity.

Once stage_one is done, the scheduler goes through all scans associated with this workflow and evaluates each expression. This is what allows a single scan to trigger multiple scans.

Let's say the server first evaluates stage_two. The expression evaluates to true, since we have one completed job and it succeeded. So the scheduler creates a new job record, meaning it schedules the stage_two scan. So far so good.

Next, it evaluates stage_three. stage_two has not yet finished successfully, so the expression evaluates to false and the scheduler does not create a new job record for stage_three.

Once stage_two is done, we go through all the scans again to see if any expression matches, and we schedule stage_three.

Then stage_three finishes, so we evaluate the expressions again. Whoops! This is where the problem lies. stage_two evaluates to true once more: it is still the case that stage_one is done (has more than 0 jobs) and its latest result is successful. So stage_two is scheduled again, and when it finishes, it triggers stage_three, and the workflow goes in cycles.

To fix this issue using expressions alone, we can write stage_two's expression like this:

stage_two:
  on:
    expr: |
      size(scans.stage_one) > 0 // stage_one must have at least one result
      && (
        size(scans.stage_two) == 0 // stage two doesn't have any job records, and stage_one is done
        || scans.stage_two[0].id < scans.stage_one[0].id // This means that the last job for stage_two
                                                         // occurred before the latest result of stage_one.
      )

This can be even more complicated if you want to account for only successful scans:

stage_two:
  on:
    expr: |
      size(scans.stage_one.filter(s, s.status == 'succeeded')) > 0
      && (
        size(scans.stage_two.filter(s, s.status == 'succeeded')) == 0
        || scans.stage_two.filter(s, s.status == 'succeeded')[0].id < scans.stage_one.filter(s, s.status == 'succeeded')[0].id
      )

This case is common enough that there should be a mechanism to simplify it. This is where the needs field kicks in.

The needs field effectively:

  1. Ensures that the latest run of each dependency succeeded.
  2. Ensures that this latest successful result appeared after the latest invocation of the scan that specifies the needs field.
  3. Ensures that all dependencies specified in needs are satisfied. That means steps 1 and 2 are evaluated for each scan listed in needs, and needs evaluates to true only if every entry does.

There is a nice side benefit to this as well. Now, if you only depend on another scan, and you don't need to inspect anything further in order to have your scan scheduled, your expression can simply be:

scans:
  stage_one:
    # ...
  stage_two:
    needs:
      - stage_one
    on:
      expr: "true"
    # ...

scans.[ID].uploads[]

The uploads field specifies which files or directories are uploaded to the server when the scan succeeds. If the scan fails, no files are uploaded to the server.

Uploads are required if you have any dependencies between scans. Uploaded files do not show up in the UI, but there is a button to download them. In the UI, only stdout and stderr are displayed. The reader for stdout is limited to 2MiB, so please do not print everything to stdout or stderr.

Example

scans:
  example:
    on:
      cron: 0 0 0 * * *
    uploads:
      - file.txt
      - outdir/subdir
    steps:
      - run: echo "Contents of the file" > file.txt
        shell: bash
      - run: mkdir -p outdir/subdir
        shell: bash
      - run: |
          echo "Contents of file 1 in subdir" > outdir/subdir/file1
          echo "Contents of file 2 in subdir" > outdir/subdir/file2
          echo "Using directories are useful when you run a tool that results in outputs written to the directory"
        shell: bash

scans.[ID].steps[]

Every job is composed of one or more steps that are executed in order. A step can be as simple as a single command, or it can be a full script written directly in the workflow. Each step's stdout and stderr is streamed back to the server and later displayed as the result.

If a step fails, the job fails. Whether a step failed is determined by the exit code of the script.

If a step fails, its stderr is displayed in the UI.

The step output endpoint is limited to 2MiB, while the result endpoint is limited to 100MiB. So if the output of your tool is larger than 2MiB, avoid writing it to stdout and upload it as a file instead.
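
For example, here is a minimal sketch (the tool name is hypothetical) that keeps the large output off stdout and uploads it as a file instead:

scans:
  large_output:
    on:
      cron: 0 0 0 * * *
    uploads:
      - results.txt
    steps:
      # 'some-noisy-tool' is a placeholder for whatever produces the large output;
      # write it to a file instead of printing it
      - run: some-noisy-tool ${{ vars.DOMAIN }} > results.txt
        shell: bash
      # keep stdout small by printing only a summary
      - run: wc -l results.txt
        shell: bash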

scans.[ID].steps[].if

Steps may be conditionally skipped. This is useful when you want to execute a step only if a condition is met. Let's illustrate this with a more involved example.

Our objectives for this example are:

  • To run httprobe if either assetfinder is successful or subfinder is successful.
  • Since there can be an instance where one of our dependency scans is done while the other one is not, we need to conditionally download the results.
  • If both of them are done, we should combine their latest outputs.

Keep in mind that in this example we can have the following sequence of events:

  1. assetfinder successfully finishes
  2. subfinder is not scheduled yet so we schedule the httprobe.
  3. subfinder is scheduled after httprobe.
  4. httprobe is scheduled again, and we re-use the latest data from assetfinder. We don't care that we already ran httprobe against the subdomains it found. Maybe, just maybe, some of those subdomains are up now, and we want to test them.

scans:
  assetfinder:
    uploads:
      - output.txt
    # ...
  subfinder:
    uploads:
      - output.txt
    # ...
  httprobe:
    on:
      expr: |
        (size(scans.assetfinder) > 0 && scans.assetfinder[0].status == 'succeeded') // the latest assetfinder scan succeeded
        || (size(scans.subfinder) > 0 && scans.subfinder[0].status == 'succeeded') // the latest subfinder scan succeeded
    steps:
      # Prepare the subdomains file
      - run: touch subdomains.txt
        shell: bash
      # Run if the latest assetfinder was successful
      - if: |
          size(scans.assetfinder) > 0 
          && scans.assetfinder[0].status == 'succeeded'
        run: |
          echo "Downloading assetfinder results using bh"
          echo "Makind directory to download the result to"
          mkdir assetfinder && cd assetfinder
          echo "Downloading the latest assetfinder result"
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.assetfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat assetfinder/output.txt >> subdomains.txt
        shell: bash
      # Run if the latest subfinder was successful
      - if: |
          size(scans.subfinder) > 0
          && scans.subfinder[0].status == 'succeeded'
        run: |
          echo "Downloading subfinder results using bh"
          echo "Makind directory to download the result to"
          mkdir subfinder && cd subfinder
          echo "Downloading the latest subfinder result"
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat subfinder/output.txt >> subdomains.txt
        shell: bash
      - run: |
          echo "Ensuring results are unique and sorted"
          sort -o subdomains.txt -u subdomains.txt
        shell: bash
      - run: cat subdomains.txt | httprobe
        shell: bash

scans.[ID].steps[].run

Run specifies the script to be executed. It is written to a temporary file on disk and executed by the shell specified in scans.[ID].steps[].shell.

The directory where the script is written depends on how the runner is configured. Let's say the runner is located at /home/runner and the workdir configured for the runner is _work; then the script will temporarily be written to /home/runner/_work/{job_id}/{uuid}.

The script lives only during the step. It will be cleaned up after each step.

To write a script, you can use the template evaluation syntax. To read more about that, please visit the "Template evaluation" section.
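
For instance, reusing the vars.DOMAIN variable from the earlier examples, a step can interpolate a value into its script:

- run: echo "Scanning ${{ vars.DOMAIN }}"
  shell: bash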

There may be situations where the workflow syntax looks valid, but the server fails to parse it. For example:

- run: sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash

To fix this problem, you can move the command into a block scalar:

- run: |
    sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash

scans.[ID].steps[].shell

Shell specifies the executable that is going to execute the script. The runner uses the shell value as an argument to the which command. If the executable does not exist on the PATH, the step will fail. Right now, there is a limitation where you cannot provide arguments to the command. This may change in the future.

For example, let's say the step is the following:

- run: echo "example"
  shell: bash

The steps the runner will take are:

  1. Call which bash. Let's say the output is /bin/bash
  2. Create a temporary file with a random name. For the random name, UUIDv4 is used. For simplicity, let's just refer to it as {UUID}
  3. Take the content of run and write it to the temporary file.
  4. Execute /bin/bash {UUID}
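
The same mechanism should work with any interpreter that can run a script passed as its only argument. For example, a small sketch assuming python3 is installed on the runner:

- run: |
    import json
    print(json.dumps({"status": "ok"}))
  shell: python3

Since arguments cannot be passed to the shell executable, the interpreter must be able to run the script with nothing but the file path.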