List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
- Cron: a commond line utility introduced in the 1970s. Used to execute a particular command at a specified data and time.
- Syntax
- Example
- Example: scheduling data pipeline with Cron (Pure scheduling approach)
- Weakness: If one step failed → whole process failed
- But cron is still useful for simple and repetitive tasks (ex: regular data downloads) or in the prototyping phase (ex: testing aspects of your data pipeline)
- Dataswarm (Facebook, late 2000s) → oozie (2010s) → airbnb’s Airflow (2014, open sources) → Apache Airflow (2019)
- Airflow is used by a lot of teams. → should know
- Pros and Cons of Airflow
- Others
- DAG = Directed Acyclic Graph
- “directed” → data flows only in one direction
- “acyclic” → no circles or cycles
- Dependencies: the previous tasks are required to finish before the next task starts
- Basic concepts
- Orchestration in Airflow
- Writing tasks like this
- We can trigger based on time or event (eg: click)
- We can set up data quality checks like checking null values, range of values…
- We - users - only interact with DAG directory and UI
- Trigger: User initiates via UI or Scheduler automatically triggers.The Scheduler continuously monitors all DAGs in the DAG Directory (once per minute by default). When tasks are ready, it pushes them to a queue for the Workers. To manage tasks and dispatch them to Workers, the Scheduler employs an Executor.
- Status of a task: scheduled → queued → running → success → failed
The Scheduler and Workers store task statuses in the Metadata Database. The Web Server then retrieves this information to display in the UI.
- In MWAA (Amazon Managed Workflows for Apache Airflow), all these components are automatically created and managed for you.
Nothing to note about.
- Understand sheduling DAG: Data Interval,
execution_date, Timetables, Data-aware scheduling.
- Built-in variables like
ti,run_id,… incontext
- Defineing dependencies
- Check these codes.
- In the lab, we use
boto3to interact with S3 but in real life, we don’t do that because: - Use Airflow-specific operators for storage and processing interactions (see S3 operator documentation).
- Airflow should orchestrate, not process. Delegate workloads to appropriate tools like databases or Spark clusters (see Spark operators).
- Remember that the DAG's logical date can be obtained from
context["ds"]
- In the previous lab, we use S3 as an intermediate storage to pass data from one task to another task. ← this method is used for large data.
- Small data → XCom (cross-communication). Small = metadata, dates, single value metrics, simple computations.
context→ Airflow’s built-in variables in the currently running task.
xcom_push()is a method associated with a task instance (context[’ti’]), the same forxcom_pull()
- In Airflow UI, check XCom in Admin > XComs
- We can defined user variables and use them in the code via Airflow UI (Admin > Variables) and use them in the codes
- Keep tasks simple and atomic: Keep your tasks simple such that each task represents one operation
- Avoid top-level code: any code that isn’t part of your DAG or operator instantiations is considered to be top-level code. This type of code will be executed at the time when the DAG is parsed by the scheduler. Any code that is part of an operator is executed when the task runs, not when the DAG is parsed.
- Use variables (user-created variables, Airflow built-in variables and macros): variables are global and should be used for overall configs. To pass data from one task to another task, use XCom!
- Use built-in variables with Jinja templating syntax (like
{{ds}})
- Using Task groups
- Airflow is an orchestrator not an executor:
- Heavy processing should be done with other framework (eg. Spark).
- Large dataset → use intermediary data storage (eg. S3) instead of XCom.
- Keep any extra codes (not part of your DAG) in a separate file.
- Check these codes.
- Determinism means that the same input will always produce the same output.
- Idempotence means if you execute the same operation multiple times, you will obtain the same result.
- The
catchupparameter: defines whether the DAG will be executed for all the data intervals between thestart_dateand the current date. It is recommended that you set it toFalseto have more control over the execution of the DAG. You can also use the Backfill feature to execute the DAG for a specific date range.
- Some notable codes
- TaskFlow → use decorators to create the DAG and its tasks (use less names)
- DAG dependencies with TaskFlow
- XCom with TaskFlow
Or
- The decorator doesn’t replace all operators! ← use both paradigms.
- Example on branching
- Check these codes.
- Copy data to AWS S3 bucket and check
- Airflow branching → decide which task should be triggered based on some conditions.
- Dynamic DAG Generation — Airflow Documentation ← keep the DRY principle true for DAGs.
- Useful outil
protect_undefineds
- Host Airflow on AWS → more control vs more convenience
- Host open-source version on EC2 ← full control configs and scaling but you need to manage all infrastructure and integration yourself.
- Using Amazon MWAA ← host Apache Airflow for you + integrate with other AWS services.
- Alternative:
- AWS Glue Workflows → create, run and monitor complex ETL workflow
- AWS Step Functions
- Which one to choose? → based on requirements + keep up to date tools.