Backfilling Mastery: Elevating Data Engineering Expertise | by Naser Tamimi | Nov, 2023


A go-to guide for data engineers wading through the backfilling maze

Photo by Towfiqu barbhuiya on Unsplash

Imagine starting a new data pipeline and getting data from a source you’ve never parsed before (e.g., pulling data from an API or an existing Hive table). Now you’re on a mission to make it look like you collected this data ages ago. That’s one example of what we call data backfilling in data engineering.

But it’s not just about starting a new data pipeline or table. You might have a table that’s been collecting data for a while, and suddenly you need to change the data (for example, because of a new metric definition) or add more data from a new data source. Or maybe there’s an awkward gap in your data, and you just want to patch it up. All of these situations are examples of data backfilling. The common thread is going “back” in time and “filling” up your table with some historical data.

The following figure (Figure 1) shows a straightforward backfilling scenario. In this instance, a daily job retrieves data from two upstream sources (one for platform A and another for platform B). The dataset is structured with ‘ds’ as the main partition and the platform as the second partition (or sub-partition). Unfortunately, data for the period from 2023-10-03 to 2023-10-05 is missing due to certain issues. To address this gap, a backfilling operation was initiated (the backfilling job started on 2023-10-08).

Figure 1) A simple backfilling scenario
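To make the scenario concrete, here is a minimal Python sketch of how such a gap could be detected and filled. It is an illustration under stated assumptions, not the job described in the article: the set of existing partitions, the helper names `missing_partitions` and `backfill`, and the `load_partition` callback are all hypothetical, and the dates assumed to be present (2023-10-01, 2023-10-02, 2023-10-06, 2023-10-07) are invented for the example.

```python
from datetime import date, timedelta

# Second-level partition values from the scenario (platform A and platform B).
PLATFORMS = ["A", "B"]


def missing_partitions(start, end, existing):
    """Return the (ds, platform) pairs in [start, end] that are absent from `existing`."""
    gaps = []
    day = start
    while day <= end:
        ds = day.isoformat()  # e.g. "2023-10-03"
        for platform in PLATFORMS:
            if (ds, platform) not in existing:
                gaps.append((ds, platform))
        day += timedelta(days=1)
    return gaps


def backfill(gaps, load_partition):
    """Call the (hypothetical) loader once for every missing partition."""
    for ds, platform in gaps:
        load_partition(ds, platform)


# Assumed state of the table before the backfill: these dates are illustrative.
existing = {
    (ds, platform)
    for ds in ("2023-10-01", "2023-10-02", "2023-10-06", "2023-10-07")
    for platform in PLATFORMS
}

gaps = missing_partitions(date(2023, 10, 1), date(2023, 10, 7), existing)
```

With these assumed inputs, `gaps` holds the six partitions for 2023-10-03 through 2023-10-05 across both platforms. Passing it to `backfill` along with a real loader would reprocess only those partitions and leave the healthy ones untouched.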

A brief heads-up before proceeding further: in data engineering, we typically encounter two scenarios: “backfilling” a table or “restating” a table. These processes share some similarities but differ in subtle ways. Backfilling is about populating missing or incomplete data in a dataset. It is typically used to update historical data or close gaps. Conversely, restating a table entails effecting substantial…
