Collecting Data with Apache Airflow on a Raspberry Pi | by Dmitrii Eliuseev | Oct, 2023
Often, we need to collect some data over a certain period of time. It can be data from an IoT sensor, statistical data from social networks, or something else. For example, the YouTube Data API allows us to get the number of views and subscribers for any channel at the current moment, but the analytics and historical data are available only to the channel owner. Thus, if we want to get weekly or monthly summaries about these channels, we need to collect this data ourselves. In the case of an IoT sensor, there may be no API at all, and we also need to collect and save the data on our own. In this article, I’ll show how to configure Apache Airflow on a Raspberry Pi, which allows running tasks for a long period of time without involving any cloud provider.
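To make the idea of collecting the data ourselves more concrete, here is a minimal sketch of appending timestamped snapshots to a CSV file. It is only an illustration under my own assumptions: the function names, the column layout, and the channel ID are hypothetical, and the numbers would come from whatever source is being polled (an API call, a sensor reading), which is not shown here.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def append_snapshot(csv_path, channel_id, views, subscribers, now=None):
    """Append one timestamped row of channel statistics to a CSV file."""
    path = Path(csv_path)
    # Use the supplied time if given, otherwise the current UTC time.
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            # Write the header only once, when the file is created.
            writer.writerow(["timestamp", "channel_id", "views", "subscribers"])
        writer.writerow([timestamp, channel_id, views, subscribers])


def load_snapshots(csv_path):
    """Read all saved rows back as a list of dictionaries."""
    with Path(csv_path).open(newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    # Placeholder values; a real run would pass freshly fetched numbers.
    append_snapshot("channel_stats.csv", "UC_hypothetical_id", 1000, 50)
    print(load_snapshots("channel_stats.csv"))
```

Calling a function like this on a schedule builds up the history that the API itself does not provide; weekly or monthly summaries can then be computed from the saved rows.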
Obviously, if you’re working for a large company, you’ll probably not need a Raspberry Pi. In that case, if you need an extra cloud instance, just create a Jira ticket for your MLOps department 😉 But for a pet project or a low-budget startup, it can be an interesting solution.
Let’s see how it works.
Raspberry Pi
What actually is a Raspberry Pi? For those readers who have not been interested in hardware for the last 10 years (the first Raspberry Pi model was introduced in 2012), I can briefly explain that it is a single-board computer running full-fledged Linux. Usually, a Raspberry Pi has a 1GHz, 2–4-core ARM CPU and 1–8 GB of RAM. It’s small, cheap, and silent; it has no fans and no disk drive (the OS runs from a Micro SD card). A Raspberry Pi needs only a standard USB power supply; it can be connected to a network via Wi-Fi or Ethernet and run different tasks for months or even years.
For my data science pet project, I wanted to collect YouTube channel statistics over 2 weeks. For a task that requires only 30–60 seconds twice per day, a serverless architecture can be a good solution, and we can use something like Google Cloud Functions for that. But every tutorial from Google started with the phrase “enable billing for your project”. There is a free initial credit and free quotas provided by Google, but I didn’t want to have another headache of monitoring how much money I…