Airflow vs Prefect vs Temporal vs Windmill
We compared Airflow, Prefect, Temporal and Windmill on the following use cases:
- One flow composed of 40 lightweight tasks.
- One flow composed of 10 long-running tasks.
For additional insights about this study, refer to our blog post.
We chose to compute Fibonacci numbers as a simple task that can easily be run on all four orchestrators. Given that Airflow has first-class support for Python, we used Python for all of them. The function in charge of computing the Fibonacci numbers was deliberately naive:
def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)
After some testing, we chose to compute fibo(10) for the lightweight tasks (taking around 10ms in our setup) and fibo(33) for what we called "long-running" tasks (taking at least a few hundred milliseconds, as seen in the results).
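As a sanity check, the relative cost of the two workloads can be measured locally. Here is a standalone sketch (absolute timings will vary by machine):

```python
import time

def fibo(n: int):
    if n <= 1:
        return n
    return fibo(n - 1) + fibo(n - 2)

for n in (10, 33):
    start = time.perf_counter()
    result = fibo(n)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"fibo({n}) = {result} in {elapsed_ms:.1f}ms")
```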
On the infrastructure side, we kept it simple and used the docker-compose.yml recommended in the documentation of each orchestrator. We deployed the orchestrators on AWS m4.large instances.
Airflow setup
We set up Airflow version 2.7.3 using the docker-compose.yaml referenced in Airflow's official documentation.
The DAG was the following:
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

ITER = 10  # respectively 40
FIBO_N = 33  # respectively 10

# fibo as defined above

with DAG(
    dag_id="bench_{}".format(ITER),
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["benchmark"],
) as dag:
    for i in range(ITER):
        @task(task_id=f"task_{i}")
        def task_module():
            return fibo(FIBO_N)

        fibo_task = task_module()
        if i > 0:
            previous_task >> fibo_task
        previous_task = fibo_task
Results
For the 10 long-running tasks run sequentially:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 4.347 | 6.910 |
task_01 | 7.315 | 9.690 | 16.387 |
task_02 | 16.545 | 18.361 | 20.077 |
task_03 | 20.130 | 21.785 | 23.487 |
task_04 | 23.869 | 25.319 | 27.463 |
task_05 | 28.061 | 29.665 | 32.354 |
task_06 | 33.210 | 34.996 | 37.498 |
task_07 | 38.378 | 39.938 | 41.754 |
task_08 | 42.366 | 43.933 | 45.887 |
task_09 | 46.281 | 50.179 | 54.668 |
For the 40 lightweight tasks run sequentially:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 4.335 | 4.752 |
task_01 | 6.236 | 8.710 | 8.923 |
task_02 | 9.792 | 11.117 | 11.320 |
task_03 | 12.157 | 13.513 | 13.733 |
task_04 | 13.804 | 15.413 | 15.622 |
task_05 | 16.201 | 17.587 | 17.849 |
task_06 | 18.902 | 20.227 | 20.432 |
task_07 | 21.262 | 22.691 | 22.958 |
task_08 | 24.015 | 25.349 | 25.558 |
task_09 | 26.368 | 28.158 | 28.635 |
task_10 | 29.361 | 31.035 | 31.357 |
task_11 | 31.861 | 36.245 | 37.062 |
task_12 | 38.868 | 42.180 | 42.388 |
task_13 | 42.641 | 44.027 | 44.280 |
task_14 | 45.321 | 46.676 | 46.877 |
task_15 | 47.676 | 49.073 | 49.298 |
task_16 | 50.432 | 51.786 | 51.999 |
task_17 | 52.415 | 53.852 | 54.051 |
task_18 | 54.155 | 55.564 | 55.771 |
task_19 | 56.575 | 58.346 | 58.781 |
task_20 | 59.254 | 60.999 | 61.355 |
task_21 | 62.071 | 63.671 | 64.079 |
task_22 | 64.366 | 66.011 | 66.442 |
task_23 | 67.061 | 68.619 | 68.866 |
task_24 | 69.601 | 71.842 | 72.303 |
task_25 | 73.373 | 77.495 | 78.212 |
task_26 | 78.428 | 79.896 | 80.134 |
task_27 | 81.199 | 82.495 | 82.741 |
task_28 | 83.665 | 84.958 | 85.153 |
task_29 | 85.205 | 86.561 | 86.766 |
task_30 | 87.690 | 89.357 | 89.778 |
task_31 | 90.419 | 91.970 | 92.282 |
task_32 | 93.024 | 94.610 | 95.031 |
task_33 | 95.636 | 97.495 | 97.745 |
task_34 | 98.857 | 100.626 | 100.877 |
task_35 | 101.926 | 103.271 | 103.477 |
task_36 | 103.915 | 105.523 | 105.875 |
task_37 | 105.996 | 107.412 | 107.622 |
task_38 | 108.409 | 112.610 | 113.214 |
task_39 | 114.054 | 115.998 | 116.221 |
Prefect setup
We set up Prefect version 2.14.4. We wrote our own simple docker-compose since we couldn't find a recommended one in Prefect's documentation. We chose PostgreSQL as the database, as it is the recommended option for production use cases.
version: '3.8'

services:
  postgres:
    image: postgres:14
    restart: unless-stopped
    volumes:
      - db_data:/var/lib/postgresql/data
    expose:
      - 5432
    environment:
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: prefect
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U postgres']
      interval: 10s
      timeout: 5s
      retries: 5

  prefect-server:
    image: prefecthq/prefect:2-latest
    command:
      - prefect
      - server
      - start
    ports:
      - 4200:4200
    depends_on:
      postgres:
        condition: service_started
    volumes:
      - ${PWD}/prefect:/root/.prefect
      - ${PWD}/flows:/flows
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://postgres:changeme@postgres:5432/prefect
      PREFECT_LOGGING_SERVER_LEVEL: INFO
      PREFECT_API_URL: http://localhost:4200/api

volumes:
  db_data: null
The flow was defined using the following Python file.
from prefect import flow, task

ITER = 10  # respectively 40
FIBO_N = 33  # respectively 10

def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

@task
def fibo_task():
    return fibo(FIBO_N)

@flow(name="bench_{}".format(ITER))
def benchmark_flow():
    for i in range(ITER):
        fibo_task()

if __name__ == "__main__":
    benchmark_flow.serve(name="bench_{}".format(ITER))
Results
For the 10 long-running tasks:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 1.270 | 2.629 |
task_01 | 2.673 | 2.703 | 4.059 |
task_02 | 4.095 | 4.121 | 5.475 |
task_03 | 5.508 | 5.534 | 6.916 |
task_04 | 6.951 | 6.979 | 8.337 |
task_05 | 8.373 | 8.401 | 9.816 |
task_06 | 9.849 | 9.874 | 11.253 |
task_07 | 11.287 | 11.313 | 12.675 |
task_08 | 12.710 | 12.737 | 14.070 |
task_09 | 14.102 | 14.129 | 15.489 |
For the 40 lightweight tasks run sequentially:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 1.213 | 1.257 |
task_01 | 1.294 | 1.321 | 1.362 |
task_02 | 1.394 | 1.423 | 1.463 |
task_03 | 1.496 | 1.522 | 1.558 |
task_04 | 1.587 | 1.612 | 1.647 |
task_05 | 1.676 | 1.700 | 1.738 |
task_06 | 1.767 | 1.791 | 1.828 |
task_07 | 1.858 | 1.882 | 1.943 |
task_08 | 1.974 | 1.998 | 2.037 |
task_09 | 2.068 | 2.093 | 2.131 |
task_10 | 2.162 | 2.188 | 2.228 |
task_11 | 2.260 | 2.292 | 2.330 |
task_12 | 2.359 | 2.382 | 2.420 |
task_13 | 2.449 | 2.476 | 2.517 |
task_14 | 2.548 | 2.573 | 2.612 |
task_15 | 2.640 | 2.670 | 2.713 |
task_16 | 2.742 | 2.765 | 2.800 |
task_17 | 2.828 | 2.851 | 2.886 |
task_18 | 2.916 | 2.940 | 2.975 |
task_19 | 3.004 | 3.028 | 3.066 |
task_20 | 3.095 | 3.119 | 3.156 |
task_21 | 3.187 | 3.211 | 3.247 |
task_22 | 3.276 | 3.299 | 3.335 |
task_23 | 3.364 | 3.389 | 3.427 |
task_24 | 3.462 | 3.489 | 3.528 |
task_25 | 3.557 | 3.579 | 3.613 |
task_26 | 3.641 | 3.664 | 3.699 |
task_27 | 3.726 | 3.751 | 3.788 |
task_28 | 3.817 | 3.839 | 3.873 |
task_29 | 3.900 | 3.921 | 4.004 |
task_30 | 4.033 | 4.059 | 4.094 |
task_31 | 4.123 | 4.151 | 4.185 |
task_32 | 4.211 | 4.234 | 4.267 |
task_33 | 4.293 | 4.315 | 4.349 |
task_34 | 4.377 | 4.404 | 4.442 |
task_35 | 4.470 | 4.492 | 4.526 |
task_36 | 4.555 | 4.577 | 4.611 |
task_37 | 4.638 | 4.661 | 4.696 |
task_38 | 4.726 | 4.749 | 4.784 |
task_39 | 4.814 | 4.838 | 4.872 |
Temporal setup
We set up Temporal version 2.19.0 using the docker-compose.yml from the official GitHub repository.
The flow was defined using the following Python file. We executed it on the EC2 instance, using Python 3.10.12.
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker

ITER = 10  # respectively 40
FIBO_N = 33  # respectively 10

# fibo as defined above

@activity.defn
async def fibo_activity(n: int) -> int:
    return fibo(n)

@workflow.defn
class BenchWorkflow:
    @workflow.run
    async def run(self) -> None:
        for i in range(ITER):
            await workflow.execute_activity(
                fibo_activity,
                FIBO_N,
                activity_id="task_{}".format(i),
                start_to_close_timeout=timedelta(seconds=60),
            )

async def main():
    client = await Client.connect("localhost:7233")
    flow_name = "bench-{}".format(ITER)
    async with Worker(
        client,
        task_queue=flow_name,
        workflows=[BenchWorkflow],
        activities=[fibo_activity],
    ):
        await client.execute_workflow(
            BenchWorkflow.run,
            id=flow_name,
            task_queue=flow_name,
        )

if __name__ == "__main__":
    asyncio.run(main())
Results
For the 10 long-running tasks:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.012 | 1.357 |
task_01 | 1.380 | 1.388 | 2.697 |
task_02 | 2.720 | 2.729 | 4.034 |
task_03 | 4.056 | 4.065 | 5.371 |
task_04 | 5.394 | 5.403 | 6.711 |
task_05 | 6.733 | 6.742 | 8.050 |
task_06 | 8.074 | 8.083 | 9.388 |
task_07 | 9.411 | 9.420 | 10.739 |
task_08 | 10.762 | 10.773 | 12.086 |
task_09 | 12.111 | 12.120 | 13.434 |
For the 40 lightweight tasks run sequentially:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.009 | 0.016 |
task_01 | 0.034 | 0.044 | 0.052 |
task_02 | 0.072 | 0.079 | 0.087 |
task_03 | 0.107 | 0.116 | 0.124 |
task_04 | 0.144 | 0.153 | 0.161 |
task_05 | 0.180 | 0.189 | 0.197 |
task_06 | 0.218 | 0.227 | 0.235 |
task_07 | 0.256 | 0.265 | 0.273 |
task_08 | 0.296 | 0.305 | 0.312 |
task_09 | 0.332 | 0.340 | 0.348 |
task_10 | 0.367 | 0.376 | 0.383 |
task_11 | 0.403 | 0.412 | 0.420 |
task_12 | 0.440 | 0.449 | 0.457 |
task_13 | 0.486 | 0.498 | 0.507 |
task_14 | 0.527 | 0.536 | 0.545 |
task_15 | 0.565 | 0.574 | 0.583 |
task_16 | 0.622 | 0.660 | 0.669 |
task_17 | 0.721 | 0.759 | 0.768 |
task_18 | 0.820 | 0.859 | 0.867 |
task_19 | 0.920 | 0.959 | 0.967 |
task_20 | 1.020 | 1.059 | 1.069 |
task_21 | 1.122 | 1.159 | 1.167 |
task_22 | 1.221 | 1.259 | 1.268 |
task_23 | 1.321 | 1.360 | 1.368 |
task_24 | 1.421 | 1.460 | 1.468 |
task_25 | 1.521 | 1.560 | 1.568 |
task_26 | 1.622 | 1.660 | 1.669 |
task_27 | 1.721 | 1.759 | 1.767 |
task_28 | 1.822 | 1.859 | 1.867 |
task_29 | 1.921 | 1.960 | 1.969 |
task_30 | 2.021 | 2.059 | 2.067 |
task_31 | 2.121 | 2.160 | 2.168 |
task_32 | 2.220 | 2.260 | 2.269 |
task_33 | 2.322 | 2.359 | 2.368 |
task_34 | 2.427 | 2.459 | 2.467 |
task_35 | 2.522 | 2.559 | 2.568 |
task_36 | 2.621 | 2.659 | 2.668 |
task_37 | 2.721 | 2.759 | 2.768 |
task_38 | 2.820 | 2.859 | 2.867 |
task_39 | 2.921 | 2.959 | 2.967 |
Windmill setup
We set up Windmill version 1.204.1 using the docker-compose.yml from the official GitHub repository. We made some adjustments to obtain a setup similar to the other orchestrators: we set the number of workers to one and removed the native workers, which would have been of no use in this benchmark.
We executed the Windmill benchmarks in both "normal" and "dedicated worker" mode. To implement the two flows in Windmill, we first created a script simply computing the Fibonacci numbers:
# WINDMILL script: `u/benchmarkuser/fibo_script`
def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

def main(
    n: int,
):
    return fibo(n)
And then we used this script in a simple flow composed of a for-loop sequentially executing the script. The YAML representation of the flow is as follows:
summary: Fibonacci benchmark flow
description: Flow running 10 (resp. 40) times Fibonacci of 33 (resp. 10)
value:
  modules:
    - id: a
      value:
        type: forloopflow
        modules:
          - id: b
            value:
              path: u/admin/fibo_script
              type: script
              input_transforms:
                n:
                  type: static
                  value: 33 # respectively 10
        iterator:
          expr: Array(10) # respectively 40
          type: javascript
        parallel: false
        skip_failures: true
schema:
  '$schema': https://json-schema.org/draft/2020-12/schema
  properties: {}
  required: []
  type: object
Results
For the 10 long-running tasks in normal mode:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.004 | 0.789 |
task_01 | 0.847 | 0.852 | 1.630 |
task_02 | 1.691 | 1.695 | 2.516 |
task_03 | 2.575 | 2.579 | 3.349 |
task_04 | 3.409 | 3.412 | 4.179 |
task_05 | 4.237 | 4.241 | 5.008 |
task_06 | 5.066 | 5.070 | 5.852 |
task_07 | 5.912 | 5.915 | 6.685 |
task_08 | 6.743 | 6.747 | 7.519 |
task_09 | 7.578 | 7.582 | 8.351 |
For the 40 lightweight tasks run sequentially in normal mode:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.004 | 0.052 |
task_01 | 0.111 | 0.115 | 0.163 |
task_02 | 0.220 | 0.224 | 0.272 |
task_03 | 0.330 | 0.334 | 0.382 |
task_04 | 0.440 | 0.443 | 0.490 |
task_05 | 0.547 | 0.551 | 0.598 |
task_06 | 0.655 | 0.659 | 0.706 |
task_07 | 0.763 | 0.767 | 0.813 |
task_08 | 0.872 | 0.875 | 0.925 |
task_09 | 0.982 | 0.987 | 1.036 |
task_10 | 1.093 | 1.097 | 1.144 |
task_11 | 1.202 | 1.205 | 1.252 |
task_12 | 1.313 | 1.317 | 1.373 |
task_13 | 1.432 | 1.436 | 1.488 |
task_14 | 1.545 | 1.548 | 1.595 |
task_15 | 1.656 | 1.659 | 1.704 |
task_16 | 1.762 | 1.766 | 1.812 |
task_17 | 1.869 | 1.873 | 1.920 |
task_18 | 1.978 | 1.982 | 2.029 |
task_19 | 2.087 | 2.091 | 2.141 |
task_20 | 2.198 | 2.201 | 2.251 |
task_21 | 2.310 | 2.313 | 2.360 |
task_22 | 2.417 | 2.420 | 2.466 |
task_23 | 2.524 | 2.528 | 2.574 |
task_24 | 2.631 | 2.634 | 2.680 |
task_25 | 2.739 | 2.743 | 2.789 |
task_26 | 2.846 | 2.851 | 2.897 |
task_27 | 2.954 | 2.958 | 3.005 |
task_28 | 3.063 | 3.066 | 3.112 |
task_29 | 3.168 | 3.172 | 3.218 |
task_30 | 3.275 | 3.279 | 3.326 |
task_31 | 3.383 | 3.386 | 3.432 |
task_32 | 3.489 | 3.493 | 3.539 |
task_33 | 3.596 | 3.600 | 3.646 |
task_34 | 3.704 | 3.707 | 3.753 |
task_35 | 3.812 | 3.815 | 3.863 |
task_36 | 3.920 | 3.923 | 3.972 |
task_37 | 4.030 | 4.034 | 4.083 |
task_38 | 4.140 | 4.143 | 4.190 |
task_39 | 4.248 | 4.252 | 4.300 |
In dedicated worker mode, we obtained the following results. For the 10 long-running tasks:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.004 | 0.738 |
task_01 | 0.802 | 0.809 | 1.543 |
task_02 | 1.601 | 1.605 | 2.334 |
task_03 | 2.392 | 2.396 | 3.124 |
task_04 | 3.187 | 3.191 | 3.945 |
task_05 | 3.980 | 3.985 | 4.744 |
task_06 | 4.771 | 4.774 | 5.506 |
task_07 | 5.561 | 5.565 | 6.291 |
task_08 | 6.350 | 6.354 | 7.082 |
task_09 | 7.136 | 7.140 | 7.885 |
And for the 40 lightweight tasks:
Details
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.003 | 0.005 |
task_01 | 0.062 | 0.065 | 0.067 |
task_02 | 0.123 | 0.126 | 0.128 |
task_03 | 0.184 | 0.188 | 0.190 |
task_04 | 0.247 | 0.251 | 0.253 |
task_05 | 0.310 | 0.314 | 0.316 |
task_06 | 0.372 | 0.376 | 0.378 |
task_07 | 0.434 | 0.437 | 0.439 |
task_08 | 0.496 | 0.500 | 0.502 |
task_09 | 0.559 | 0.563 | 0.565 |
task_10 | 0.622 | 0.625 | 0.627 |
task_11 | 0.684 | 0.687 | 0.689 |
task_12 | 0.746 | 0.750 | 0.752 |
task_13 | 0.809 | 0.813 | 0.815 |
task_14 | 0.873 | 0.877 | 0.879 |
task_15 | 0.934 | 0.938 | 0.940 |
task_16 | 0.997 | 1.000 | 1.002 |
task_17 | 1.059 | 1.062 | 1.064 |
task_18 | 1.120 | 1.124 | 1.128 |
task_19 | 1.182 | 1.186 | 1.188 |
task_20 | 1.244 | 1.248 | 1.250 |
task_21 | 1.306 | 1.309 | 1.311 |
task_22 | 1.368 | 1.371 | 1.373 |
task_23 | 1.429 | 1.432 | 1.434 |
task_24 | 1.491 | 1.494 | 1.496 |
task_25 | 1.552 | 1.555 | 1.557 |
task_26 | 1.614 | 1.618 | 1.620 |
task_27 | 1.677 | 1.681 | 1.683 |
task_28 | 1.740 | 1.744 | 1.746 |
task_29 | 1.802 | 1.806 | 1.808 |
task_30 | 1.864 | 1.867 | 1.869 |
task_31 | 1.926 | 1.930 | 1.932 |
task_32 | 1.988 | 1.992 | 1.994 |
task_33 | 2.050 | 2.054 | 2.056 |
task_34 | 2.112 | 2.116 | 2.118 |
task_35 | 2.174 | 2.178 | 2.181 |
task_36 | 2.237 | 2.240 | 2.242 |
task_37 | 2.300 | 2.303 | 2.305 |
task_38 | 2.362 | 2.366 | 2.368 |
task_39 | 2.424 | 2.427 | 2.429 |
Comparisons
At a macro level, Airflow took 54.668s to execute the 10 long-running tasks, whereas Prefect took 15.489s, Temporal 13.434s, and Windmill 8.351s in normal mode (7.885s in dedicated worker mode).
The same can be observed for the 40 lightweight tasks, where Airflow took a total of 116.221s, Prefect 4.872s, Temporal 2.967s, and Windmill 4.300s in normal mode (2.429s in dedicated worker mode).
By far, Airflow is the slowest. Temporal and Prefect are faster, but not as fast as Windmill. For the 40 lightweight tasks, Windmill in normal mode was on par with Prefect and slightly slower than Temporal. This can be explained by the fact that the way Temporal works is closer to the way Windmill works in dedicated worker mode: Windmill in normal mode does a cold start for each task, and when tasks are numerous and lightweight, most of the execution time ends up being spent in the cold start. In dedicated worker mode however, Windmill's behavior is closer to Temporal's, and we can see that performance is similar, with a slight advantage for Windmill.
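To make the cold-start argument concrete, here is a back-of-the-envelope model; all numbers are made-up round figures for illustration, not measurements from the benchmark:

```python
# Illustrative model: the share of a task's wall time eaten by a worker
# cold start, assuming a fixed cold-start cost per task in normal mode.
def overhead_share(exec_time: float, cold_start: float) -> float:
    """Fraction of a task's total time spent in the cold start."""
    return cold_start / (cold_start + exec_time)

# Assume a ~100 ms cold start per task (hypothetical value):
light = overhead_share(exec_time=0.01, cold_start=0.1)  # ~91% overhead for a 10 ms task
heavy = overhead_share(exec_time=1.0, cold_start=0.1)   # ~9% overhead for a 1 s task
```

This is why cold starts dominate the 40 lightweight tasks but barely register for the long-running ones.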
But we can dive a little deeper and compare the orchestrators across three categories:
- Execution time: the time it takes for the orchestrator to execute a task once it has been assigned to an executor.
- Assignment time: the time it takes for a task to be assigned to an executor once it has been created in the queue.
- Transition time: the time it takes to create the next task once the previous one is finished.
After looking at the macro numbers above, it's interesting to compare the time spent in each of the above categories, relative to the total time the orchestrator took to execute the flow.
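Concretely, these proportions can be derived from the per-task timestamp tables above. A minimal sketch, using a made-up three-task flow as input:

```python
def breakdown(rows):
    """rows: (created, started, completed) triples in seconds, one per task,
    in execution order. Returns (assignment, execution, transition) as
    fractions of the total flow duration."""
    assignment = sum(started - created for created, started, _ in rows)
    execution = sum(completed - started for _, started, completed in rows)
    # transition: gap between one task completing and the next being created
    transition = sum(rows[i + 1][0] - rows[i][2] for i in range(len(rows) - 1))
    total = rows[-1][2] - rows[0][0]
    return tuple(x / total for x in (assignment, execution, transition))

# Made-up example: three tasks with visible gaps between them
shares = breakdown([(0.0, 1.0, 2.0), (2.5, 3.0, 4.0), (4.5, 5.0, 6.0)])
```

Feeding it the rows of any table above reproduces the percentages shown below.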
For the 10 long running tasks flow, we see the following:
| | Airflow | Prefect | Temporal | Windmill Normal | Windmill Dedicated Worker |
|---|---|---|---|---|---|
| Total duration (in seconds) | 54.668 | 15.489 | 13.434 | 8.351 | 7.885 |
| Assignment | 40.36% | 9.77% | 0.71% | 0.47% | 0.55% |
| Execution | 51.72% | 88.18% | 97.74% | 93.17% | 93.46% |
| Transition | 7.93% | 2.05% | 1.55% | 6.36% | 6.00% |
The proportion of time spent in execution is important here since each task takes a long time to run. We see that Airflow and Prefect spend a lot of time assigning the tasks compared to the other two. (Looking at the actual numbers, both Prefect and Airflow spend a lot of time assigning the first tasks, but after that the assignment duration decreases. Airflow remains relatively slow though, while Prefect reaches decent performance. The exact same can be observed with the 40-task workflow below.) Temporal and Windmill in normal mode are pretty similar. Windmill in dedicated worker mode is incredibly fast at executing the jobs, at the cost of spending a little more time on transitions, but overall it is the fastest.
If we look at the 40 lightweight tasks flow, we have:
| | Airflow | Prefect | Temporal | Windmill Normal | Windmill Dedicated Worker |
|---|---|---|---|---|---|
| Total duration (in seconds) | 116.221 | 4.872 | 2.967 | 4.300 | 2.429 |
| Assignment | 64.63% | 44.62% | 35.58% | 3.42% | 5.89% |
| Execution | 10.77% | 31.73% | 11.26% | 44.19% | 3.42% |
| Transition | 24.60% | 23.65% | 53.16% | 52.40% | 90.70% |
Here we see that Windmill spends a greater portion of time executing the tasks, which can be explained by the fact that Windmill runs a "cold start" for each task submitted to the worker. However, it is by far the fastest at assigning tasks to executors. As observed above, Windmill in dedicated worker mode is lightning fast at executing the tasks, but takes more time transitioning from one task to the next.
Conclusion
Airflow is the slowest in all categories, followed by Prefect. If you're looking for a high-performance job orchestrator, they do not seem to be the best option. Temporal and Windmill perform better and are closer to each other, but in both modes Windmill comes out ahead. If you're looking for a job orchestrator for various long-running tasks, Windmill in normal mode will be the most performant solution, optimizing the duration of each task while transitions and assignments remain a small portion of the overall workload. To run lightweight tasks at a very fast pace, Windmill in dedicated worker mode should be your preferred choice, provided that the tasks are similar. It is lightning fast at execution and assignment.
Appendix: Scaling Windmill
We performed those benchmarks with a single worker, assuming the capacity to process jobs would scale linearly with the number of workers deployed on the stack. We haven't verified this assumption for Airflow, Prefect, and Temporal, but we scaled Windmill up to 100 virtual workers to verify it. And the conclusion is that it scales pretty linearly.
For this test, we deployed the same docker-compose as above on an AWS m4.xlarge instance (4 vCPUs, 16 GB of memory) and, to virtually increase the number of workers, we used the NUM_WORKERS environment variable Windmill accepts. Note that this is not strictly equivalent to adding real hardware to the stack, but until we reach the maximum capacity of the instance, both in terms of CPU and memory, we can assume it's a good approximation.
The other change we had to make was to bump max_connections to 1000 on PostgreSQL: as we add more and more workers, each worker needs to connect to the database, and we need to increase the maximum number of connections PostgreSQL allows.
The job we ran was a simple job sleeping for 100ms, which is a good average duration for a job running on an orchestrator.
import time

def main():
    time.sleep(0.1)
Finally, we ran it in Windmill's dedicated worker mode, and we used a specific endpoint to "bulk-create" the jobs before any worker could start pulling them from the queue. For this test to be representative, we had to measure the performance of Windmill processing a large number of jobs (10,000 in this case), and we quickly realised that the time it took to insert the jobs one by one into the queue was non-negligible and was affecting the measured performance of the workers.
The results are the following:
Details
Number of workers | Throughput (jobs/sec, batch of 10K jobs) |
---|---|
2 | 19.9 |
6 | 59.8 |
10 | 99.6 |
20 | 198 |
30 | 298 |
40 | 391 |
50 | 496 |
60 | 591 |
70 | 693 |
80 | 786 |
90 | 887 |
100 | 981 |
This proves that Windmill scales linearly with the number of workers (at least up to 100 workers). We can also see that the throughput is close to optimal: given that each job takes 100ms to execute, N workers processing the jobs in parallel can't go above 10×N jobs per second, and Windmill gets pretty close.
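This ceiling reasoning can be checked against the table above (worker counts and measured throughputs copied from a few of its rows):

```python
# Each job sleeps 100 ms, so one worker completes at most 10 jobs/s;
# N workers running in parallel are bounded by 10 * N jobs/s.
measured = {2: 19.9, 10: 99.6, 50: 496, 100: 981}  # workers -> jobs/s

for workers, throughput in measured.items():
    ceiling = 10 * workers
    print(f"{workers} workers: {throughput} jobs/s "
          f"({throughput / ceiling:.1%} of the {ceiling} jobs/s ceiling)")
```

Every measured value sits just below its theoretical bound, which is what near-linear scaling looks like.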