Workflows
Workflows are the way you describe automation. They capture, concisely and declaratively, how you work. A workflow should act as a replica of you, doing the parts of your work that can be automated. It should run as a continuous process, feeding you information in real time so you can do the best work you possibly can.
Pipelines, on the other hand, are sequences of steps executed from start to finish. A pipeline is much easier to reason about, but it is more natural to work in terms of workflows.
Workflows vs Pipelines
The main difference between workflows and pipelines is that pipelines are executed from start to finish. You can have parallelism in pipelines, but essentially, you are passing through stages. Let's take a simple example to understand the difference.
Pipeline example:
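A typical recon pipeline (the stages here are picked purely for illustration) might be:

```
subdomain discovery -> probe for live hosts -> port scan -> report
```

Each stage starts only after the previous one finishes, and the sequence runs from start to finish as a single unit.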
The main problem here is that you either execute the entire pipeline, or you don't. But that is not the way you would approach a target. You would use common data to execute multiple pipelines, and this is where workflows kick in.
Let's say you start with subdomain discovery. You run multiple subdomain discovery tools to maximize the target scope. From there, you can have multiple pipelines. For example:
- Screenshot all subdomain landing pages to quickly pick a target
- Run httprobe to see what subdomains are live
- Run port scan on each target.
- ...
Open ports are not that common, and a port scan is noisy, so we can run it every other day or once a week. Running httprobe, however, is relatively quick and rarely raises alarms, so we can do it every 6 hours.
This is an example of how a single stage can feed multiple pipelines, and you are probably already working this way. A rough sketch of that separation is shown below.
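In workflow terms (the syntax is explained later in this document; the scan names and schedules here are made up for illustration), the different cadences might look like:

```yaml
scans:
  httprobe:
    on:
      cron: 0 0 */6 * * *   # every 6 hours
    # ...
  portscan:
    on:
      cron: 0 0 0 * * 0     # once a week, Sunday at midnight
    # ...
```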
Terminology
| term | description |
| --- | --- |
| scan | A scan is a description of the job: it describes what you want to run. |
| job | A job is a single invocation of a scan. Each scan can have 0 or more jobs, while each job belongs to exactly one scan. |
| workflow | A workflow is a collection of scans, described in your workflow file. It is a self-contained unit, meaning one workflow cannot use another workflow. |
| pipeline | A collection of scans that depend on each other. It is a subset of a workflow. |
Writing workflows
Workflows are YAML definitions in which you describe what you want to do. YAML has proved to be a great fit for describing intent and is commonly used for workflow definitions.
Initially, workflows were written in a UI editor, but that proved much more complicated than the declarative approach.
The editor does, however, provide basic syntax support, so you can use autocomplete to see which fields are available.
Workflow components
scans
Each workflow is made of one or more scans. Each scan has a unique name, which is a key in the `scans` object. The name is written as `[ID]` for the rest of this document, since it is unique for each scan and therefore serves as an identifier.
You can have an unlimited number of scans, as long as their names are unique. Names may only contain alphanumeric characters and the `_` character.
| name | valid |
| --- | --- |
| example | true |
| test_1 | true |
| test-one | false |
Example:
```yaml
scans:
  assetfinder:
    on:
      cron: 0 0 * * * *
    steps:
      - run: assetfinder example.com
        shell: bash
```
scans.[ID].on
Serves as the trigger for scan execution. This object contains the specification of the event that triggers a scan.
To start the pipeline, there must exist some event to trigger it. Events are raised automatically (on `cron`) or manually (on `dispatch`). From there, you build the next stage by running on `expr`.
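To make the three trigger kinds concrete, here is a minimal sketch (the scan names and commands are made up for illustration; each trigger type is covered in detail below):

```yaml
scans:
  nightly:
    on:
      cron: 0 0 0 * * *   # raised automatically, every day at midnight
    steps:
      - run: echo "scheduled scan"
        shell: bash
  on_demand:
    on:
      dispatch: {}        # raised manually, on demand
    steps:
      - run: echo "manually triggered scan"
        shell: bash
  follow_up:
    on:
      expr: size(scans.nightly) > 0 && scans.nightly[0].status == 'succeeded'
    steps:
      - run: echo "next stage, scheduled off another scan's result"
        shell: bash
```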
scans.[ID].on.cron
Cron defines the schedule on which the scan is executed.
Cron is described in the form:

| second | minute | hour | day of month | month | day of week |
| --- | --- | --- | --- | --- | --- |
| required | required | required | required | required | required |
If you need help writing or testing your cron schedule, you can use crontab guru. It is a great site that will help you write and understand the spec you pass to the scheduler.
As a limitation, the cron parser does not allow non-standard modifiers such as `@hourly`. Currently supported specifications include:
- Wildcard (`*`)
- Lists (`,`)
- Ranges (`-`)
- Steps (`/`)
Example:
```yaml
scans:
  assetfinder:
    on:
      cron: 0 * * * * *
```
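A few more schedules, purely for illustration, showing the supported specifiers in use:

```
0 0 */6 * * *    # every 6 hours, at the top of the hour (step)
0 30 9 * * 1-5   # at 09:30:00 on weekdays, Monday through Friday (range)
0 0 0 1,15 * *   # at midnight on the 1st and 15th of the month (list)
```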
scans.[ID].on.expr
This field allows you to specify an expression that describes when the scan should be scheduled. Read more about expressions here.
An expression scan runs only if its expression evaluates to `true`. After each successful scan, the server goes through the expression scans and evaluates whether each of them should be scheduled. Only expression scans are evaluated at this time.
Let's build a somewhat complicated example to show how it can be used in the real world.
Goals:
- Run multiple subdomain discovery tools. For simplicity, let's pick `subfinder` and `assetfinder`.
- Since we only want to run probes if there is a diff, we create a "meta" scan that combines their results.
- `httprobe` can execute once the subdomain discovery stage is over and there is a diff.

We can combine the power of the `needs` field and use `expr` to further describe when our probe should be executed.
Example
```yaml
scans:
  assetfinder:
    on:
      cron: 0 1 0/12 * * *
    uploads:
      - output.txt
    steps:
      - run: assetfinder ${{ vars.DOMAIN }} > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash
  subfinder:
    on:
      cron: 1 15 0/12 * * *
    uploads:
      - output.txt
    steps:
      - run: subfinder -d ${{ vars.DOMAIN }} > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash
  subdomaindiscovery:
    needs:
      - subfinder
      - assetfinder
    on:
      expr: "true" # The needs field is good enough to describe the intent, so expr should always be true
    uploads:
      - output.txt
    steps:
      # Download the latest assetfinder result
      - run: |
          mkdir assetfinder
          cd assetfinder
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.assetfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Download the latest subfinder result
      - run: |
          mkdir subfinder
          cd subfinder
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subfinder[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # Combine them into the output.txt file
      - run: |
          cat assetfinder/output.txt > output.txt
          cat subfinder/output.txt >> output.txt
          sort -o output.txt -u output.txt
        shell: bash
  probe:
    needs:
      - subdomaindiscovery
    on:
      expr: |
        size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1 // First time we combined the subdomain findings
        || scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].nonce != scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].nonce // There is a diff between the last 2 successful runs
    uploads:
      - output.txt
    steps:
      # Download the latest subdomain discovery result
      - run: |
          mkdir latest
          cd latest
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # If we got triggered on a diff, also download the previous successful result
      - if: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) > 1
        run: |
          mkdir older
          cd older
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].id }} -o output.zip
          unzip output.zip
          rm output.zip
        shell: bash
      # If we got triggered on a diff, find out what the diff is
      - if: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) > 1
        run: |
          cat latest/output.txt | anew older/output.txt -d > new-subdomains.txt
        shell: bash
      # Otherwise, if this is not a diff trigger, use the output of the latest scan instead
      - if: size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1
        run: |
          cp latest/output.txt new-subdomains.txt
        shell: bash
      # The environment is now prepared. The subdomains we want to probe are written in
      # the `new-subdomains.txt` file. Let's finish the job
      - run: cat new-subdomains.txt | httprobe > output.txt
        shell: bash
      - run: sort -o output.txt -u output.txt
        shell: bash
```
Let's summarize why and how this works. Periodically, we want to check whether a new domain has been discovered. For that, we run two tools: `assetfinder` and `subfinder`.
The purpose of those scans is to find as many subdomains as we possibly can. But since these are just two tools accomplishing the same thing, we create a meta scan that only combines their results. At the end of the day, we care about subdomain discovery as a whole, not about the result of a single tool.
To combine these two, we could write a somewhat complicated expression, or we can simplify things using the `needs` field.
The `needs` field was introduced so we can avoid writing multi-line expressions for a common task. And since the `needs` field accomplishes what we are trying to do, the `expr` field should simply evaluate to `true`.
With that out of the way, our `probe` scan requires `subdomaindiscovery`. We can express this using the `needs` field as well. But this time, the `needs` field alone is not good enough: we only want to run if `subdomaindiscovery` contains a diff. This is where the power of `expr` kicks in, so let's dive into it.
The `needs` field ensures that the `probe` scan has not yet seen the latest successful `subdomaindiscovery` result. That means we have at least one successful result. If there is exactly 1 successful result, the entire output is new to `probe`, which is captured by the expression `size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1`.
However, if there are multiple successful scans, the latest two successful scans should be examined. If their nonce values (the hash of the result) differ, the scan should be scheduled. Therefore: `scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].nonce != scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].nonce`.
To combine these two, we want to run when the `subdomaindiscovery` stage is done for the first time OR there is a diff.
So the resulting expression is:

```
size(scans.subdomaindiscovery.filter(s, s.status == 'succeeded')) == 1
|| scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[0].nonce != scans.subdomaindiscovery.filter(s, s.status == 'succeeded')[1].nonce
```
scans.[ID].on.dispatch
This field allows you to specify a scan that can start a pipeline. The idea behind this scan type is that you may have expensive tools that you want to run on demand. They should not be executed periodically, but they can start a pipeline of other tools.
Example
```yaml
scans:
  expensive:
    on:
      dispatch: {} # Important to add '{}', since dispatch is an empty object in YAML
    steps:
      - run: echo "Executing expensive tool with API rate limit etc."
        shell: bash
  expensive_stage_2:
    on:
      expr: size(scans.expensive) > 0 && scans.expensive[0].status == 'succeeded'
    steps:
      - run: echo "You continue normally from here"
        shell: bash
```
scans.[ID].needs[]
The `needs` field was created to simplify the `on.expr` field. Let's describe the problem first, and then explain how to use this field properly.
When you chain scans, you continuously move through stages. First, you finish subdomain discovery. Then, you check for live hosts. Then, on those live hosts, you run directory brute-forcing, port scanning, etc.
To write such an expression properly, you would have to handle multiple corner cases. Since the platform evaluates every expression scan each time a scan finishes, we can end up running a scan multiple times off the same dependency.
To better understand this, let's write down a faulty workflow whose scans run multiple times for no good reason.
```yaml
scans:
  stage_one:
    on:
      dispatch: {}
    steps:
      - run: echo "Do something in the first stage"
        shell: bash
  stage_two:
    on:
      expr: |
        size(scans.stage_one) > 0 // stage_one must have at least one result
        && scans.stage_one[0].status == 'succeeded' // and the latest scan succeeded
    steps:
      - run: echo "Run stage two!"
        shell: bash
  stage_three:
    on:
      expr: |
        size(scans.stage_two) > 0 && scans.stage_two[0].status == 'succeeded'
    steps:
      - run: echo "Run stage three!"
        shell: bash
```
This looks okay, but it is not. Let's examine the scheduler and what it will do in each case.
First, we need an entrypoint. That is our `stage_one` scan. It is manually triggered for simplicity.
Once `stage_one` is done, the scheduler goes through all scans associated with this workflow and evaluates each expression. This ensures that a single scan can trigger multiple scans.
Let's say the server first evaluates `stage_two`. The expression evaluates to `true`, since we have 1 completed job and it is successful. So the scheduler creates a new job record, meaning it schedules the `stage_two` scan. So far so good.
Next, it evaluates `stage_three`. `stage_two` is not yet done, so the size of its jobs is 0. The scheduler does not create a new job record for `stage_three`.
Once `stage_two` is done, we go through all scans again to see if some expression matches, and we schedule `stage_three`.
Then `stage_three` is done, so we evaluate expressions again. Whoops! This is where the problem is. `stage_two` will evaluate to `true` again. It is still the case that `stage_one` is done (has more than 0 jobs) and its latest result is successful. Now `stage_two` is scheduled, and when done, it triggers `stage_three` again, and it goes in cycles.
To fix this issue using an expression, we can write `stage_two`'s expression like this:
```yaml
stage_two:
  on:
    expr: |
      size(scans.stage_one) > 0 // stage_one must have at least one result
      && (
        size(scans.stage_two) == 0 // stage_two doesn't have any job records, and stage_one is done
        || scans.stage_two[0].id < scans.stage_one[0].id // This means that the last job for stage_two
                                                         // occurred before the latest result of stage_one
      )
```
This can be even more complicated if you want to account for only successful scans:
```yaml
stage_two:
  on:
    expr: |
      size(scans.stage_one.filter(s, s.status == 'succeeded')) > 0
      && (
        size(scans.stage_two.filter(s, s.status == 'succeeded')) == 0
        || scans.stage_two.filter(s, s.status == 'succeeded')[0].id < scans.stage_one.filter(s, s.status == 'succeeded')[0].id
      )
```
This case is common enough that there should be a mechanism to simplify it. This is where the `needs` field kicks in.
The `needs` field effectively:

1. Ensures that the latest result of the dependency is successful.
2. Ensures that this latest result appeared after the latest invocation of the scan that specifies the `needs` field.
3. Ensures that all dependencies specified in `needs` are matched. That means that for each scan listed in `needs`, steps 1 and 2 are evaluated, and the result of `needs` is true only if every entry evaluates to true.
There is a nice side benefit as well: if you only depend on another scan and don't need to inspect anything further in order to have a scan scheduled, your expression can simply be:
```yaml
scans:
  stage_one:
    # ...
  stage_two:
    needs:
      - stage_one
    on:
      expr: "true"
    # ...
```
scans.[ID].uploads[]
The uploads field specifies which files or directories are uploaded to the server when the scan succeeds. If the scan fails, no files are uploaded.
Uploads are required if you have any dependencies between scans. Uploaded files do not show up in the UI, but there is a button to download them. In the UI, only `stdout` and `stderr` are displayed. The reader for `stdout` is limited to 2 MiB, so please do not try to print everything to `stdout` or `stderr`.
Example
```yaml
scans:
  example:
    on:
      cron: 0 0 0 * * *
    uploads:
      - file.txt
      - outdir/subdir
    steps:
      - run: echo "Contents of the file" > file.txt
        shell: bash
      - run: mkdir -p outdir/subdir
        shell: bash
      - run: |
          echo "Contents of file 1 in subdir" > outdir/subdir/file1
          echo "Contents of file 2 in subdir" > outdir/subdir/file2
          echo "Uploading directories is useful when you run a tool that writes its outputs to a directory"
        shell: bash
```
scans.[ID].steps[]
Every job is composed of one or more steps that are executed in order. A step can be as simple as a single command, or it can be a script written entirely in the workflow. Each step's `stdout` and `stderr` is streamed back to the server and later displayed as a result.
If the step fails, the job fails. Failure of the step is based on the exit code of the script.
If a step fails, its `stderr` is displayed in the UI instead of its `stdout`.
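As a small sketch of the failure semantics (the scan name and commands are made up for illustration), any step that exits with a non-zero code fails the job:

```yaml
scans:
  exit_code_demo:
    on:
      dispatch: {}
    steps:
      # Exits with 0, so the step succeeds
      - run: echo "this step passes"
        shell: bash
      # grep exits with a non-zero code when nothing matches,
      # so this step fails and the job is marked as failed
      - run: echo "no match here" | grep "needle"
        shell: bash
```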
scans.[ID].steps[].if
Steps may be conditionally skipped. This is useful when you want to execute a step only `if` a condition is met.
Let's illustrate this with a somewhat involved example.
Our objectives for this example are:
- Run `httprobe` if either `assetfinder` or `subfinder` is successful.
- Since one of our dependency scans can be done while the other one is not, we need to conditionally download each result.
- If both of them are done, we should combine their latest outputs.

Keep in mind that, in this example, we can have the following sequence of events:
1. `assetfinder` successfully finishes.
2. `subfinder` is not scheduled yet, so we schedule `httprobe`.
3. `subfinder` is scheduled after `httprobe`.
4. `httprobe` is scheduled again, and we re-use the latest data from `assetfinder`. We don't care that we already executed `httprobe` on the subdomains it found. Maybe, just maybe, some of those subdomains are up, and we want to test them.
```yaml
scans:
  assetfinder:
    uploads:
      - output.txt
    # ...
  subfinder:
    uploads:
      - output.txt
    # ...
  httprobe:
    on:
      expr: |
        (size(scans.assetfinder) > 0 && scans.assetfinder[0].status == 'succeeded') // If the latest assetfinder scan succeeded
        || (size(scans.subfinder) > 0 && scans.subfinder[0].status == 'succeeded') // If the latest subfinder scan succeeded
    steps:
      # Prepare the subdomains file
      - run: touch subdomains.txt
        shell: bash
      # Run if the latest assetfinder was successful
      - if: |
          size(scans.assetfinder) > 0
          && scans.assetfinder[0].status == 'succeeded'
        run: |
          echo "Downloading assetfinder results using bh"
          echo "Making directory to download the result to"
          mkdir assetfinder && cd assetfinder
          echo "Downloading the latest assetfinder result"
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.assetfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat assetfinder/output.txt >> subdomains.txt
        shell: bash
      # Run if the latest subfinder was successful
      - if: |
          size(scans.subfinder) > 0
          && scans.subfinder[0].status == 'succeeded'
        run: |
          echo "Downloading subfinder results using bh"
          echo "Making directory to download the result to"
          mkdir subfinder && cd subfinder
          echo "Downloading the latest subfinder result"
          bh job download -p ${{ project.id }} -w ${{ workflow.id }} -r ${{ revision.id }} -j ${{ scans.subfinder[0].id }} -o output.zip
          echo "Unzipping the result"
          unzip output.zip
          rm output.zip
          echo "Writing the result into subdomains.txt"
          cd ..
          cat subfinder/output.txt >> subdomains.txt
        shell: bash
      - run: |
          echo "Ensuring results are unique and sorted"
          sort -o subdomains.txt -u subdomains.txt
        shell: bash
      - run: cat subdomains.txt | httprobe
        shell: bash
```
scans.[ID].steps[].run
Run specifies a script to be executed. It is written to a temporary file on disk and executed by the shell specified in `scans.[ID].steps[].shell`.
The directory where the script is written depends on how the runner is configured. Let's say the runner is located at `/home/runner` and the `workdir` configured for the runner is `_work`; then the script is temporarily written to `/home/runner/_work/{job_id}/{uuid}`.
The script lives only during the step. It will be cleaned up after each step.
To write a script, you can use template evaluation syntax. To read more about that, please see the "Template evaluation" section.
```yaml
- run: sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash
- run: |
    sleep $((sleep_time > 0 ? sleep_time : 0)); echo "test"
  shell: bash
```
scans.[ID].steps[].shell
Shell specifies the executable that will execute the script. The runner passes the shell value as an argument to the `which` command. If the executable does not exist on the path, the step will fail. Right now, there is a limitation: you cannot provide arguments to the command. This may change in the future.
For example, let's say the step is the following:
- run: echo "example"
shell: bash
The steps the runner will take are:
- Call `which bash`. Let's say the output is `/bin/bash`.
- Create a temporary file with a random name. For the random name, a UUIDv4 is used. For simplicity, let's refer to it as `{UUID}`.
- Take the content of `run` and write it to that file.
- Execute `/bin/bash {UUID}`.
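Given that resolution procedure, it seems plausible that any executable on the path that accepts a script file as its first argument could be used as the shell. This is an assumption based on the steps above, not a documented guarantee; a hypothetical sketch:

```yaml
# Hypothetical: python3 would be resolved via `which python3` and then
# invoked as `/usr/bin/python3 {UUID}`, with the run body written to
# the temporary script file.
- run: |
    import json
    print(json.dumps({"status": "ok"}))
  shell: python3
```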