Apache Airflow by Cloudup, Quick Start

Apache Airflow v3.x with all providers enabled on Amazon Linux 2023 by Cloudup

This AMI will allow you to quickly deploy and integrate open-source Apache Airflow version 3.x (or, optionally, v2.x), with all the default providers that ship with Apache Airflow enabled, providing orchestration capabilities for AWS-based workloads.

Apache Airflow® is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Airflow’s extensible Python framework enables you to build workflows connecting with virtually any technology. A web-based UI helps you visualize, manage, and debug your workflows.

This is ideal for data engineers and data scientists getting familiar with enterprise orchestration concepts and experimenting with Apache Airflow in development environments on an EC2 instance. This image includes all up-to-date modules and prerequisites of the Apache Airflow v3.x and v2.x releases.

More comprehensive Apache Airflow deployment guidance and alternative do-it-yourself deployment options (e.g. Helm, Docker) can be found in the open source project documentation: https://airflow.apache.org/docs/.

More information regarding Apache Airflow can be found at https://airflow.apache.org.


Instructions


EC2 AMI Product Configuration

While configuring the product, please select the desired Apache Airflow version and the target region, and then proceed to Launch.

Make sure to choose an appropriately sized instance for your workload and place it in a VPC/subnet that you can access (e.g. a public subnet).


Make sure you attach a security group, or create a new one based on the seller settings. As a bare minimum, you will need SSH access (port 22) to be able to upload DAGs or modify the configuration files via SSH/SFTP, and port 8080 access to be able to interact with the Apache Airflow user interface. Please also ensure you attach a key pair so you can log in to the underlying Amazon Linux operating system as ec2-user.
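If you prefer to add the ingress rules from the command line, a minimal sketch using the AWS CLI is shown below; the security group ID and CIDR range are placeholders, and you should restrict the source range to your own network where possible:

# Allow SSH (22) and the Airflow UI (8080) from your own network range
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.0/24
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8080 --cidr 203.0.113.0/24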

Then proceed to launch the instance.



UI Login and Authentication

After the deployment, to access the Airflow interface, use a browser and point it to http://public_dns_name:8080 or https://public_dns_name:8080.
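As a quick check from your workstation, you can confirm the port is reachable before opening a browser; the host name below is a placeholder, and the response may simply be a redirect to the login page:

# Check that the Airflow UI port responds
curl -I http://<public_dns_name>:8080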

If you receive a warning about a non-secure connection prior to configuring a custom SSL certificate, please ignore it and continue to the site.

Depending on the version deployed, you should see the corresponding login prompt.

The default administrator username is "airflow", and the default password is your EC2 instance ID. Please allow a few minutes after provisioning before trying to log in for the first time, so that the default user can be created. Please don't forget to update the default password after the setup, and/or change the default authentication method. Refer to the documentation link provided in the login prompt to modify your password or set up an alternate authentication method.
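If you are unsure of your instance ID, you can copy it from the EC2 console, or read it from within the instance via the instance metadata service. The commands below are a minimal sketch assuming IMDSv2 (the default on Amazon Linux 2023):

# Request a session token, then read the instance ID from the metadata service
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id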


Sample DAGs and Adding Your Own DAGs

After logging in, you can see the sample DAGs and run operations as usual.

If you would like to upload new DAGs, please SSH/SFTP to the EC2 instance with the key pair you associated and the user name "ec2-user", and upload them to the /airflow/dags folder. Typically, DAG files should contain the "airflow" and "dag" keywords (Airflow's default DAG discovery only parses files containing these strings) and should have the ".py" file suffix.
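For example, a DAG file can be uploaded from your workstation with scp; the key pair file, host name, and file name below are placeholders:

# Copy a local DAG file to the DAG folder on the instance
scp -i my-keypair.pem my_airflow_dag.py ec2-user@<public_dns_name>:/airflow/dags/
# If ec2-user cannot write to /airflow/dags directly, copy the file to the home directory first and move it with sudo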


Connecting to the underlying EC2 / Linux

To connect to the operating system, use SSH and the username ec2-user.
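For example (the key pair file and host name are placeholders):

ssh -i my-keypair.pem ec2-user@<public_dns_name>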


Configuration Files and the Airflow Home Directory

Airflow configuration files are stored under the /airflow directory. For instance, to modify the Apache Airflow web server settings, you may update the [webserver] section of the configuration file /airflow/airflow.cfg.
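For example, you can inspect and edit the [webserver] section as sketched below; depending on file ownership on the instance, sudo may or may not be required:

# Show the current [webserver] settings
grep -A 20 '^\[webserver\]' /airflow/airflow.cfg
# Edit the file as needed, e.g. with vi
sudo vi /airflow/airflow.cfg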

More configuration guidance can be found in the Apache Airflow documentation.

After the configuration updates, in version 3.x, you may restart the relevant services via:
sudo /bin/systemctl restart airflow-webserver
sudo /bin/systemctl restart airflow-scheduler
sudo /bin/systemctl restart airflow-processor
sudo /bin/systemctl restart airflow-triggerer

For version 2.x, there are only the webserver and the scheduler:
sudo /bin/systemctl restart airflow-webserver
sudo /bin/systemctl restart airflow-scheduler

Disable the Default User


After adding or updating users, or changing the default authentication provider, please disable the default user:
sudo /bin/systemctl disable airflow-defaultpass
sudo /bin/systemctl stop airflow-defaultpass

 


Tips

Please use an Amazon RDS based metadata database (repository) configuration for your Airflow deployment to ensure smooth migrations later on, and make sure you apply the same configuration options when you migrate. Airflow configuration files are kept under the /airflow directory.
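As a minimal sketch, assuming an RDS PostgreSQL instance and an Airflow release where the metadata database connection lives in the [database] section of airflow.cfg (2.3+ and 3.x), the relevant setting can be checked and updated roughly as follows; the endpoint, user, password, and database name are placeholders:

# Show the metadata database connection Airflow is currently using
airflow config get-value database sql_alchemy_conn
# In /airflow/airflow.cfg, point the connection at your RDS instance, e.g.:
#   [database]
#   sql_alchemy_conn = postgresql+psycopg2://airflow:YOUR_PASSWORD@my-airflow-db.abc123.us-east-1.rds.amazonaws.com:5432/airflow
# Then migrate the schema and restart the services listed above
airflow db migrate
# (on older 2.x releases the equivalent command is: airflow db upgrade)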