Workflows
Workflows are the way you describe automation. They capture the way you do things in a concise, declarative form. A workflow should act as a replica of you, doing the things that can be automated. It is a continuous process, feeding you information in real time and allowing you to do the best work you possibly can.
Pipelines, on the other hand, are sequences of steps executed from start to finish. A pipeline is easier to reason about, but a workflow is the more natural way to describe how you actually work.
Workflows vs Pipelines
The main difference between workflows and pipelines is that a pipeline is executed from start to finish. Pipelines can contain parallelism, but essentially you are passing through stages. Let's take a simple example to understand the difference.
Put simply, a pipeline is a series of steps where the output of one stage is used as the input to the next.
A workflow is the continuous execution of multiple pipelines. One step in stage 4 may trigger a re-run of stage 1, for example. It is a live thing, continuously automating what you usually do, so you don't have to. When something is triggered is entirely up to you, based on the way you do things (i.e. your workflow).
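As a minimal sketch using the scan syntax described later in this document (the scan names are illustrative), a two-stage pipeline can be expressed as one scan triggering another:

```yaml
scans:
  # Stage 1: runs on a schedule
  discover:
    on:
      cron: 0 * * * *
    uploads:
      - output.txt
    steps:
      - run: assetfinder example.com > output.txt
        shell: bash
  # Stage 2: runs whenever stage 1 has a new successful result
  probe:
    on:
      expr: scans.discover.is_available()
    steps:
      - run: echo "consume the output of the discover scan here"
        shell: bash
```

A workflow is the collection of such scans running continuously; the sections below describe each field in detail.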
Terminology
| term | description |
|---|---|
| scan | A scan is the description of a job: it describes what you want to run. |
| job | A job is a single invocation of a scan. Each scan can have 0 or more jobs, while each job belongs to exactly one scan. |
| workflow | A workflow is a collection of scans, described in your workflow file. It is a self-contained unit, meaning one workflow cannot use another workflow. |
| pipeline | A collection of scans that depend on each other. It is a subset of a workflow. |
Writing workflows
Workflows are YAML definitions in which you describe what you want to do. YAML has proven to be a good fit for describing intent and is a commonly used language for describing workflows.
Initially, workflows were written with a UI editor, but that proved to be much more complicated than the declarative approach.
The editor does, however, provide basic syntax support, so you can use autocompletion to see which fields are available.
Workflow components
scans
Each workflow is made of one or more scans. Each scan has a unique name, which is a key in the `scans` object. For the rest of the document, the name will be presented as `[ID]`, since it is unique for each scan and therefore serves as an identifier.
You can have an unlimited number of scans, as long as their names are unique. Names must contain only alphanumeric characters and the `_` character.
| name | valid |
|---|---|
| example | true |
| test_1 | true |
| test-one | false |
Example:
```yaml
scans:
  assetfinder:
    on:
      cron: 0 * * * *
    steps:
      - run: assetfinder example.com
        shell: bash
```
scans.[ID].on
Serves as the trigger for scan execution. This object contains the specification of the event that will trigger a scan.
To start a pipeline, there must be some event that triggers it. Events are raised automatically (on `cron`) or manually (on `dispatch`). From there, you build the next stage by running on `expr`.
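As a rough sketch (the scan names are illustrative), the three trigger types look like this:

```yaml
scans:
  scheduled:
    on:
      cron: 0 * * * *    # raised automatically on a schedule
    # ...
  manual:
    on:
      dispatch: {}       # raised manually, on demand
    # ...
  follow_up:
    on:
      expr: scans.scheduled.is_available()   # next stage, scheduled based on other scans
    # ...
```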
scans.[ID].on.cron
Cron defines the schedule on which the scan is executed.
The cron spec is described in the form:
| minute | hour | day of month | month | day of week |
|---|---|---|---|---|
| required | required | required | required | required |
If you need help specifying or testing your cron schedule, you can use crontab guru, a great site that will help you write and understand the spec you pass to the scheduler.
As a limitation, the cron parser does not allow non-standard modifiers such as `@hourly`. The currently supported specifiers are listed below, followed by a few sample schedules:
- Wildcard (`*`)
- Lists (`,`)
- Ranges (`-`)
- Steps (`/`)
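A few sample schedules using these specifiers, assuming the standard five-field semantics described above:

```yaml
on:
  cron: 30 6 * * 1-5   # 06:30 on weekdays (range)
  # Other valid specs:
  #   0 0,12 * * *     -> midnight and noon (list)
  #   */15 * * * *     -> every 15 minutes (step)
  #   0 */6 * * *      -> every 6 hours (step over a wildcard)
```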
Example:
```yaml
scans:
  assetfinder:
    on:
      cron: 0 * * * *
```
scans.[ID].on.expr
This field allows you to specify an expression that describes when the scan should be scheduled. Read more about expressions here.
Expressions describe scans that should run only if the expression evaluates to `true`. After each successful scan, the server goes through the expression scans and evaluates whether each of them should be scheduled. Only expression scans are evaluated at this time.
Let's build a somewhat complicated example to show how this can be used in the real world.
Goals:
- Run multiple subdomain discovery tools. For simplicity, let's pick `subfinder` and `assetfinder`.
- Since I would love to run probes only if there is a diff, I can create a "meta" scan that combines the results.
- `httprobe` can execute once the subdomain discovery stage is over and there is a diff.

We can combine the power of the `needs` field and use `expr` to further describe when our probe should be executed.
Example
```yaml
scans:
  assetfinder:
    on:
      cron: 0 0/12 * * *
    uploads:
      - output.txt
    steps:
      - run: assetfinder ${{ vars.DOMAIN }} > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash
  subfinder:
    on:
      cron: 15 0/12 * * *
    uploads:
      - output.txt
    steps:
      - run: subfinder -d ${{ vars.DOMAIN }} > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash
  # Consolidate results in a single place
  subdomaindiscovery:
    on:
      expr: |
        scans.subfinder.is_available() || scans.assetfinder.is_available()
    uploads:
      - output.txt
    steps:
      # Download the latest assetfinder result if available
      - if: scans.assetfinder.is_available()
        run: |
          mkdir assetfinder
          cd assetfinder
          bh job download -j ${{ scans.assetfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Download the latest subfinder result if available
      - if: scans.subfinder.is_available()
        run: |
          mkdir subfinder
          cd subfinder
          bh job download -j ${{ scans.subfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Combine them in the output.txt file
      - run: |
          cat assetfinder/output.txt > output.txt
          cat subfinder/output.txt >> output.txt
          sort -o output.txt -u output.txt
        shell: bash
  probe-subdomains:
    on:
      expr: |
        scans.subdomaindiscovery.is_available() && scans.subdomaindiscovery.has_diff()
    uploads:
      - output.txt
    steps:
      # Download the latest subdomain discovery result
      - run: |
          mkdir latest
          cd latest
          bh job download -j ${{ scans.subdomaindiscovery.filter(s, s.state == 'succeeded')[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # If we got triggered on a diff and this is not the first invocation,
      # download the previous successful run so we can calculate the diff
      - if: size(scans.subdomaindiscovery.filter(s, s.state == 'succeeded')) > 1
        run: |
          mkdir older
          cd older
          bh job download -j ${{ scans.subdomaindiscovery.filter(s, s.state == 'succeeded')[1].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # If we got triggered on a diff, find out what the diff is
      - if: size(scans.subdomaindiscovery.filter(s, s.state == 'succeeded')) > 1
        run: |
          cat latest/output.txt | anew older/output.txt -d > new-subdomains.txt
        shell: bash
      # Otherwise, if this is not a diff trigger, use the output of the latest scan instead
      - if: size(scans.subdomaindiscovery.filter(s, s.state == 'succeeded')) == 1
        run: |
          cp latest/output.txt new-subdomains.txt
        shell: bash
      # The environment is now prepared. Subdomains that we want to probe are written in
      # the new-subdomains.txt file. Let's finish the job
      - run: cat new-subdomains.txt | httprobe > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash
```
Let's summarize why and how this works. Periodically, we want to check whether a new domain has been discovered. For that, we run two tools: `assetfinder` and `subfinder`.
The purpose of those scans is to find as many subdomains as we possibly can. But since these are just two tools used to accomplish one goal, we create a meta scan that simply combines their results. At the end of the day, we care about subdomain discovery, not about an individual tool's result at this point. This can be done in-place if you want to save storage.
To schedule a `subdomaindiscovery` scan, we need `assetfinder` or `subfinder` to be done and ready to be used.
To accomplish that, a utility function called `is_available` exists on `scans[].[ID]`. It effectively checks whether there exists a job with state `"succeeded"` that was executed after the last execution of the `subdomaindiscovery` scan.
Next, for the sake of example, we want to run `httprobe` on newly found subdomains. Here are two scenarios:
- Subdomain discovery is executed for the first time.
- Subdomain discovery has multiple executions, so we need to find the new ones.
This can be summarized with `scans.subdomaindiscovery.is_available() && scans.subdomaindiscovery.has_diff()`.
The `is_available()` method checks that the `subdomaindiscovery` scan has successful executions after the latest execution of the `probe-subdomains` scan, while `has_diff()` checks that either this was the first successful execution (the diff between nothing and something is always something), or that the last two successful `subdomaindiscovery` jobs have something different in their outputs.
To prepare the environment, we can run steps conditionally. If this is the first successful execution of subdomain discovery, the number of `succeeded` jobs will be 1, and therefore the whole output should be used. Otherwise, we download the last two successful runs, calculate the diff, and use that as our input.
scans.[ID].on.dispatch
This field lets you define a scan that is started on demand and can kick off a pipeline. The idea behind this scan type is that you may have some expensive tools that you want to run on demand. They should not be executed periodically, but they can start a pipeline of other tools.
Example
```yaml
scans:
  expensive:
    on:
      dispatch: {} # Important to add '{}', since dispatch is empty object in yaml
    steps:
      - run: echo "Executing expensive tool with API rate limit etc."
        shell: bash
  expensive-stage-2:
    on:
      expr: scans.expensive.is_available()
    steps:
      - run: echo "You continue normally from here"
        shell: bash
```
scans.[ID].on.dispatch.inputs
Inputs act like project vars but have the lifetime of a single invocation. Specifying inputs means that your scan should act differently depending on the input provided. For those who are also software engineers: inputs are like arguments to a function.
There are two primary reasons to use dispatch:
- From time to time, you want to run a scan on a specific program by triggering it manually. The inputs are completely defined by the program (for example, program domain(s), IPs, etc.).
- You want to dispatch something from your scan. Let's say you uncovered a new domain; you want to spawn a new pipeline to check things only on that particular domain.
Example
```yaml
scans:
  expensive:
    on:
      dispatch:
        inputs:
          domain:
            type: string
            required: true
    steps:
      - run: echo "Executing expensive tool with API rate limit etc."
        shell: bash
      - run: echo "Scanning ${{ inputs.domain }}"
        shell: bash
```
scans.[ID].on.dispatch.inputs.[KEY]
Key is a unique identifier for an input. It conforms to constraints related to project variables, but can be accessed from workflow expression
like ${{ inputs.[KEY] }}
.
There are few constraints you should be aware of:
- Maximum number of inputs is 20.
- Maximum size of dispatch values is 65535 characters. If you need larger input size, consider uploading assets and passing down the reference to it.
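For instance, instead of passing a large list as a dispatch value, you could pass a reference to a previous job and download its uploads inside the scan. This is only a sketch; the `source_job_id` input name is made up, and it assumes the referenced job uploaded its results:

```yaml
scans:
  big_input:
    on:
      dispatch:
        inputs:
          # Hypothetical input: a reference to a previously uploaded result
          source_job_id:
            type: string
            required: true
    steps:
      - run: |
          bh job download -j ${{ inputs.source_job_id }} -o input.zip
          unzip input.zip
          rm input.zip
        shell: bash
```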
scans.[ID].on.dispatch.inputs.[KEY].type
The type of this YAML key is `string`. The value is either `string` or `bool`. By default, it is of type `string`.
scans.[ID].on.dispatch.inputs.[KEY].required
The type of this YAML key is `boolean`. By default, it is `false`. If you specify `required: true`, you MUST provide a value when spawning the job.
As a side note, all values are provided by default from the UI; an empty value is still a provided value. This limitation is mostly important for API calls (e.g. `bh scan dispatch ...`).
scans.[ID].on.dispatch.inputs.[KEY].default
The type of this YAML key is `string`. However, if you specify `type` to be `bool`, the valid values are `"true"` or `"false"`. Otherwise, since the only other supported type is `string`, any string will do. By default, it is `"false"` for `bool` and `""` for `string`.
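Putting the input fields together, a sketch of a dispatch-triggered scan with typed inputs might look like this (the input names are illustrative):

```yaml
scans:
  example_dispatch:
    on:
      dispatch:
        inputs:
          domain:
            type: string
            required: true
          deep:
            type: bool
            default: "true"   # bool defaults are the strings "true"/"false"
          notes:
            type: string      # optional; defaults to ""
    steps:
      - run: echo "domain=${{ inputs.domain }} deep=${{ inputs.deep }} notes=${{ inputs.notes }}"
        shell: bash
```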
scans.[ID].uploads[]
The uploads field specifies which files or directories are uploaded to the server when the scan succeeds. If the scan fails, no files are uploaded to the server.
In most cases, you need uploads when you have dependencies between jobs. Please do not confuse uploads with step output. People often write a step that tees the output to a file and then upload that file. This is not necessary: you can simply write the output to a file, upload it, and it will be available for download.
Using `tee` is perfectly fine if you don't expect large output. However, you should be aware of the following limitations (a small example follows the list):
- The log has a maximum size of 10MiB; if you exceed that, the log will be truncated. If you want to see the full output, you can always download the uploaded file and inspect it.
- The log for each job expires after 2 weeks. After that, you won't be able to see it.
- You cannot download `stdout` or `stderr` and use it as an input for another scan. The log is written in a way that is not suitable for passing along as input.
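If you do want the results both in the step log and as an upload, a sketch like this (reusing `assetfinder` from the earlier examples) works within those limitations:

```yaml
scans:
  assetfinder:
    on:
      cron: 0 * * * *
    uploads:
      - output.txt
    steps:
      # tee prints the results to the step log and writes them to the uploaded file
      - run: assetfinder example.com | tee output.txt
        shell: bash
```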
If a scan contains `uploads`, an internal step is used to zip the content you specified and calculate a hash of it. That hash becomes the `nonce` value of the job. If the output changes, the hash will be different, and therefore the output contains a diff.
Upload paths are resolved relative to the job's working directory. This is done to:
- Eliminate potential errors from uploading files from your local machine. You can always `cp` files from your local machine into the working directory of your job, but you have to do it explicitly.
- Ensure that workflows are not written in a way that depends on the local machine. If you want, you can always work around this constraint by using the `cp` command inside a `run` step to copy files into your present working directory (see the sketch below).
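For example, a step could explicitly stage a file into the working directory before it is uploaded (a sketch; /opt/wordlists/custom.txt is a hypothetical path on the runner machine):

```yaml
uploads:
  - wordlist.txt
steps:
  # Explicitly copy a file from the runner machine into the job's working directory
  - run: cp /opt/wordlists/custom.txt wordlist.txt
    shell: bash
```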
Example
```yaml
scans:
  example:
    on:
      cron: 0 0 * * *
    uploads:
      - file.txt
      - outdir/subdir
    steps:
      - run: echo "Contents of the file" > file.txt
        shell: bash
      - run: mkdir -p outdir/subdir
        shell: bash
      - run: |
          echo "Contents of file 1 in subdir" > outdir/subdir/file1
          echo "Contents of file 2 in subdir" > outdir/subdir/file2
          echo "Using directories is useful when you run a tool that writes its outputs to a directory"
        shell: bash
```
scans.[ID].steps[]
Every job is composed of one or more steps that are executed in order. A job can be as simple as a single step running a script, or the script can be written entirely inside the workflow. Each step's `stdout` and `stderr` is streamed back to the server and displayed later.
If a step fails, the job fails. Whether a step failed is determined by the exit code of its script.
Currently, the stream is not propagated to the UI while the job is being executed. Hopefully that functionality will be implemented soon. If you notice any issues with the output logs, please do not hesitate to report them.
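A minimal sketch of the failure semantics (the commands are illustrative):

```yaml
scans:
  example:
    on:
      dispatch: {}
    steps:
      - run: echo "exit code 0, so this step succeeds"
        shell: bash
      # A non-zero exit code fails this step, and therefore the job
      - run: |
          echo "about to fail"
          exit 1
        shell: bash
```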
scans.[ID].steps[].if
Steps may be conditionally skipped. This is useful when you want to execute a step only `if` something meets a condition.
Let's illustrate this with a more involved example.
Our objectives for this example are:
- Run `httprobe` if either `assetfinder` or `subfinder` is successful.
- Since one of our dependency scans may be done while the other one is not, we need to conditionally download each result.
- If both of them are done, we should combine their latest outputs.
Keep in mind that in this example we can have the following sequence of events:
- `assetfinder` successfully finishes.
- `subfinder` is not scheduled yet, so we schedule `httprobe`.
- `subfinder` is scheduled after `httprobe`.
- `httprobe` is scheduled again, and we re-use the latest data from `assetfinder`. We don't care that we already executed `httprobe` on the subdomains it found. Maybe, just maybe, some of those subdomains are up now, and we want to test them.
```yaml
scans:
  assetfinder:
    uploads:
      - output.txt
    # ...
  subfinder:
    uploads:
      - output.txt
    # ...
  httprobe:
    on:
      expr: |
        scans.assetfinder.is_available() || scans.subfinder.is_available()
    steps:
      # Prepare the subdomains file
      - run: touch subdomains.txt
        shell: bash
      # Run if the latest assetfinder was successful
      - if: scans.assetfinder.is_available()
        run: |
          echo "Downloading assetfinder results using bh"
          echo "Making directory to download the result to"
          mkdir assetfinder && cd assetfinder
          echo "Downloading the latest assetfinder result"
          bh job download -j ${{ scans.assetfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat assetfinder/output.txt >> subdomains.txt
        shell: bash
      - if: scans.subfinder.is_available()
        run: |
          echo "Downloading subfinder results using bh"
          echo "Making directory to download the result to"
          mkdir subfinder && cd subfinder
          echo "Downloading the latest subfinder result"
          bh job download -j ${{ scans.subfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat subfinder/output.txt >> subdomains.txt
        shell: bash
      - run: |
          echo "Ensuring results are unique and sorted"
          sort -o subdomains.txt -u subdomains.txt
        shell: bash
      - run: cat subdomains.txt | httprobe
        shell: bash
```
scans.[ID].steps[].run
Run specifies a script to be executed. The script is written to a temporary file on disk and executed by the shell specified in `scans.[ID].steps[].shell`.
The directory where the script is written depends on how the runner is configured. Let's say the runner is located at `/home/runner` and the `workdir` configured for the runner is `_work`; then the script will be temporarily written to `/home/runner/_work/{job_id}/{uuid}`.
The script lives only for the duration of the step and is cleaned up after each step.
To write a script, you can use the template evaluation syntax. To read more about that, please visit the "Template evaluation" section.
```yaml
- run: sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash
- run: |
    sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash
```
scans.[ID].steps[].shell
Shell specifies the executable that will execute the script. The runner evaluates `shell` by running a `shlex` split, which separates the arguments the same way a shell would evaluate positional arguments. The first argument is used as the entrypoint command, the rest are passed as arguments to that command, and the script file is appended at the end of the arguments.
Let's say I want to run:
```yaml
- run: echo "example"
  shell: bash --noprofile --norc -eo pipefail
```
This is equivalent to the following:
```
bash --noprofile --norc -eo pipefail {random_filename}
```
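The same mechanism should apply to other interpreters as well. For example, assuming `python3` is installed on the runner, a step like the following would presumably be executed as `python3 {random_filename}`:

```yaml
- run: print("hello from a step")
  shell: python3
```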