Tug of War - Apache Airflow vs Apache NiFi

They kinda overlap a little, as both serve as pipeline processing tools (conditional processing of jobs and streams).

Airflow is more of a programmatic scheduler (you need to write DAGs in code for every Airflow job), while NiFi has a UI for setting up processes (be it ETL, stream filtering, etc.) with minimal programming needed.

Use case:

Use NiFi if you are dealing with tons of different streams or data sources that need manual adjustment. The UI lets non-programmers set up pipelines with faster development time.

Use Airflow if you are dealing with scheduled tasks and jobs, or with data import/export that predefined operators already cover (Airflow currently has tons of pre-made operators, for Redshift, MySQL, S3, etc.).



They aren't really in the same space, though some of the high-level nonsense wording we all use to describe our projects might suggest they are. Where Apache NiFi aims to be extremely awesome is in helping you connect systems from wherever data is created or collected back to and through the various places where it will get consumed. We talk about this as 'data flow management'. It sometimes gets lumped in with 'enterprise integration', 'system integration', 'data integration' and other terms that get intertwined and muddied up. Bottom line here is that NiFi gives you an application which will act as a dataflow management broker. You tell it to listen for data or go grab data, you tell it to run various transforms, make web service calls, enrich this, filter that, combine these things, and ultimately deliver it to various destinations. But you're doing all this as a way to connect 'producers' and 'consumers', and principally you're doing it to address problems that simply having some message-based transport in the middle will not solve. This is explicit dataflow management rather than passive messaging, for example. I could go on... but to the topic at hand...
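The listen/transform/filter/deliver pattern described above can be sketched in plain Python. This is only a conceptual illustration, not NiFi's API (in NiFi you would wire processors together in the UI); the record fields, the destination names `archive` and `realtime`, and every function name are assumptions made up for the example.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def keep(record: Record) -> bool:
    # Filter step: drop records that do not say where they came from.
    return bool(record.get("source"))


def enrich(record: Record) -> Record:
    # Enrich step: add a derived field (stands in for a lookup or web service call).
    return {**record, "is_large": record["bytes"] > 1000}


def route(record: Record) -> str:
    # Routing step: pick a destination based on the record's content.
    return "archive" if record["is_large"] else "realtime"


def run_flow(incoming: Iterable[Record]) -> Dict[str, List[Record]]:
    # Connect "producers" (the incoming records) to "consumers" (the destinations).
    destinations: Dict[str, List[Record]] = {"archive": [], "realtime": []}
    for record in incoming:
        if not keep(record):
            continue
        enriched = enrich(record)
        destinations[route(enriched)].append(enriched)
    return destinations


if __name__ == "__main__":
    delivered = run_flow([
        {"source": "sensor-a", "bytes": 2048},
        {"source": "", "bytes": 10},
        {"source": "sensor-b", "bytes": 12},
    ])
    for name, records in delivered.items():
        print(name, records)
```

The point of the sketch is that the flow itself (what gets filtered, enriched, and sent where) is the thing being managed, rather than a schedule of jobs.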

Airflow and other systems like it sit, I believe, in the workflow space. The way I look at something like that is that the data has already arrived in some central place/cluster/etc. Usually this means it has arrived in a database or Kafka or something. Then a team knows they want to run a series of steps in a certain order, and those steps, when visualized, form a DAG and so on. Having a powerful workflow tool then is very awesome. Airflow appears to fit into this space, which is orchestrating some processing pipeline once data has made it to some back-end point. It can be a bit confusing here because NiFi is indeed used to do many of these things as well. That said, remember NiFi's focus and goal in life is to help you connect and manage the flow of data throughout the enterprise, whereas workflow systems are typically about managing the flow within some given domain/cluster.
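To make the "series of steps that form a DAG" idea concrete, here is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+). It is not Airflow code and does not use Airflow's API; the task names and the `dependencies` mapping are hypothetical, chosen only to show how a workflow tool can derive a valid run order from declared dependencies.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each key is a task; its value is the set of tasks that must finish first.
# The task names are made up for illustration.
dependencies: Dict[str, Set[str]] = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "load_warehouse": {"aggregate"},
    "send_report": {"aggregate"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    # static_order() yields every task after all of its prerequisites.
    # It raises graphlib.CycleError if the graph is not actually a DAG.
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for task in execution_order(dependencies):
        print("run", task)
```

The last two tasks depend only on `aggregate`, so either may come first. A real workflow system adds scheduling, retries, and monitoring on top of this ordering, which is where a tool like Airflow earns its keep.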

If your problem is more like that, where the data has already arrived and you want to control jobs and such, then Airflow is probably awesome at that. Frankly, looking at where it comes from, the members of their community, and their extremely nicely done documentation, they seem like a pretty legit option (again, I'm no expert there).

If your problem is more like the flow management aspects I described, then NiFi is probably a great choice. There will no doubt be some overlap, but ultimately it comes down to your use case and whether it is more like what Airflow aims to be great at or more like what NiFi aims to be great at.
