Generate a counterfactual analysis of corn response to nitrogen with Amazon SageMaker JumpStart solutions
In his book The Book of Why, Judea Pearl advocates for teaching cause and effect principles to machines in order to enhance their intelligence. The accomplishments of deep learning are essentially just a type of curve fitting, whereas causality could be used to uncover interactions between the systems of the world under various constraints without testing hypotheses directly. This could provide answers that lead us to AGI (artificial generalized intelligence).
This solution proposes a causal inference framework using Bayesian networks to represent causal dependencies and draw causal conclusions based on observed satellite imagery and experimental trial data in the form of simulated weather and soil conditions. The case study is the causal relationship between nitrogen-based fertilizer application and corn yields.
The satellite imagery is processed using purpose-built Amazon SageMaker geospatial capabilities and enriched with custom-built Amazon SageMaker Processing operations. The causal inference engine is deployed with Amazon SageMaker Asynchronous Inference.
In this post, we demonstrate how to create this counterfactual analysis using Amazon SageMaker JumpStart solutions.
Solution overview
The following diagram shows the architecture for the end-to-end workflow.
Prerequisites
You need an AWS account to use this solution.
To run this JumpStart 1P Solution and have the infrastructure deployed to your AWS account, you need to create an active Amazon SageMaker Studio instance (refer to Onboard to Amazon SageMaker Domain). When your Studio instance is ready, follow the instructions in SageMaker JumpStart to launch the Crop Yield Counterfactuals solution.
Note that this solution is currently available in the US West (Oregon) Region only.
Causal inference
Causality is all about understanding change, but how to formalize this in statistics and machine learning (ML) is not a trivial exercise.
In this crop yield study, the nitrogen added as fertilizer and the yield outcomes might be confounded. Similarly, the nitrogen added as fertilizer and the nitrogen leaching outcomes could be confounded as well, in the sense that a common cause can explain their association. However, association is not causation. If we know which observed factors confound the association, we account for them, but what if there are other hidden variables responsible for confounding? Reducing the amount of fertilizer won't necessarily reduce residual nitrogen; similarly, it might not drastically diminish the yield, whereas the soil and climatic conditions could be the observed factors that confound the association. How to handle confounding is the central problem of causal inference. A technique introduced by R. A. Fisher, the randomized controlled trial, aims to break possible confounding.
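The effect of confounding, and how randomization removes it, can be illustrated with a small simulation. The following is a minimal sketch, not part of the solution: the soil-quality confounder, the effect sizes, and the assignment probabilities are all made-up values chosen for illustration.

```python
import random

TRUE_EFFECT = 2.0  # assumed yield gain caused by applying fertilizer


def estimate_effect(n_fields, randomized, seed=0):
    """Difference in mean yield between fertilized and unfertilized fields."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_fields):
        soil = rng.random()  # hypothetical confounder: soil quality in [0, 1)
        if randomized:
            p_fert = 0.5  # a coin flip ignores soil quality
        else:
            p_fert = 0.8 if soil > 0.5 else 0.2  # good soil is fertilized more often
        fert = 1 if rng.random() < p_fert else 0
        crop_yield = TRUE_EFFECT * fert + 5.0 * soil + rng.gauss(0.0, 1.0)
        (treated if fert else control).append(crop_yield)
    return sum(treated) / len(treated) - sum(control) / len(control)


naive_estimate = estimate_effect(20000, randomized=False)
rct_estimate = estimate_effect(20000, randomized=True)
print(f"observational: {naive_estimate:.2f}, randomized: {rct_estimate:.2f}")
```

In the observational setting, soil quality drives both the fertilizer decision and the yield, so the naive difference in means overstates the effect. Under random assignment the estimate stays close to the assumed true effect.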
However, in the absence of randomized control trials, there is a need for causal inference purely from observational data. There are ways to connect the causal questions to data in observational studies by writing the causal graphical model on what we postulate as how things happen. This involves claiming the corresponding traverses will capture the corresponding dependencies, while satisfying the graphical criterion for conditional ignorability (to what extent we can treat causation as association based on the causal assumptions). After we have postulated the structure, we can use the implied invariances to learn from observational data and plug in causal questions, inferring causal claims without randomized control trials.
This solution uses both data from simulated randomized control trials (RCTs) and observational data from satellite imagery. A series of simulations conducted over thousands of fields and multiple years in Illinois (United States) is used to study the corn response to increasing nitrogen rates for a broad combination of weather and soil variation seen in the region. Crop simulations of various farming scenarios and geographies address the limitation of trial data, which can explore only a limited number of soils and years. The database was calibrated and validated using data from more than 400 trials in the region. Initial nitrogen concentration in the soil was set randomly within a reasonable range.
Additionally, the database is enhanced with observations from satellite imagery, whereas zonal statistics are derived from spectral indices in order to represent spatio-temporal changes in vegetation, seen across geographies and phenological stages.
Causal inference with Bayesian networks
Structural causal models (SCMs) use graphical models to represent causal dependencies by incorporating both data-driven and human inputs. A particular type of structural causal model called Bayesian networks is proposed to model the crop phenology dynamics using probabilistic expressions by representing variables as nodes and relationships between variables as edges. Nodes are indicators of crop growth, soil and weather conditions, and the edges between them represent spatio-temporal causal relationships. The parent nodes are field-related parameters (including the day of sowing and area planted), and the child nodes are yield, nitrogen uptake, and nitrogen leaching metrics.
For more information, refer to the database characterization and the guide for identifying the corn growth stages.
A few steps are required to build a Bayesian networks model (with CausalNex) before we can use it for counterfactual and interventional analysis. The structure of the causal model is initially learned from data, whereas subject matter expertise (trusted literature or empirical beliefs) is used to postulate additional dependencies and independencies between random variables and intervention variables, as well as asserting the structure is causal.
Using NO TEARS, a continuous optimization algorithm for structure learning, the graph structure describing conditional dependencies between variables is learned from data, with a set of constraints imposed on edges, parent nodes, and child nodes that are not allowed in the causal model. This preserves the temporal dependencies between variables. See the following code:
"""
tabu_edges: Imposing edges that are not allowed in the causal model
tabu_parents: Imposing parent nodes that are not allowed in the causal model
tabu_child: Imposing child nodes that are not allowed in the causal model
"""
from causalnex.structure.notears import from_pandas

g_learned = from_pandas(
    X,
    tabu_edges=tabu_edges,
    tabu_parent_nodes=tabu_parents,
    tabu_child_nodes=tabu_child,
    max_iter=100,
)
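One way to build such constraints is to group variables by the order in which they occur and forbid every edge that points backward in time. The following is a sketch under assumptions: apart from day_sow and Y_corn, the variable names and the stage grouping are hypothetical and are not taken from the solution.

```python
from itertools import product

# Hypothetical variables, grouped in the order in which they occur in a season.
stages = [
    ["day_sow", "area_planted"],                # field parameters, fixed at sowing
    ["ndvi_vegetative", "rain_vegetative"],     # vegetative stages
    ["ndvi_reproductive", "rain_reproductive"], # reproductive stages
    ["Y_corn", "N_leach"],                      # end-of-season outcomes
]


def temporal_tabu_edges(stages):
    """Return (cause, effect) pairs in which the cause occurs after the effect."""
    tabu = []
    for later in range(len(stages)):
        for earlier in range(later):
            tabu.extend(product(stages[later], stages[earlier]))
    return tabu


tabu_edges = temporal_tabu_edges(stages)
tabu_parents = stages[-1]  # outcome metrics are not allowed to be parents
tabu_child = stages[0]     # field parameters are not allowed to be children
print(len(tabu_edges), "backward-in-time edges are forbidden")
```

The resulting lists have the shape expected by the tabu_edges, tabu_parent_nodes, and tabu_child_nodes arguments shown above.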
The next step encodes domain knowledge in models and captures phenology dynamics, while avoiding spurious relationships. Multicollinearity analysis, variance inflation factor analysis, and global feature importance using SHAP analysis are conducted to extract insights and constraints on water stress variables (expansion, phenology, and photosynthesis around flowering), weather and soil variables, spectral indices, and the nitrogen-based indicators:
"""
edges: Modifying the structure by imposing constraints on edges
"""
from causalnex.structure import StructureModel

g = StructureModel()
g.add_edges_from(
    edges,
    origin="expert"
)
Bayesian networks in CausalNex support only discrete distributions. Any continuous features, or features with a large number of categories, are discretized prior to fitting the Bayesian network:
from causalnex.discretiser.discretiser_strategy import (
    DecisionTreeSupervisedDiscretiserMethod,
    MDLPSupervisedDiscretiserMethod
)

discretiser = DecisionTreeSupervisedDiscretiserMethod(
    mode="single",
    tree_params={"max_depth": 2, "random_state": 2022},
)
discretiser.fit(
    feat_names=features,
    dataframe=df,
    target_continuous=True,
    target=target,
)
After the structure is reviewed, the conditional probability distribution of each variable given its parents can be learned from data, in a step called likelihood estimation:
from causalnex.network import BayesianNetwork

bn = BayesianNetwork(g)
bn = bn.fit_node_states(discretised_data)
bn = bn.fit_cpds(
    train,
    method="BayesianEstimator",
    bayes_prior="K2",
)
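To make likelihood estimation concrete, the following sketch estimates one conditional probability table by hand. It assumes the common description of a K2 prior as a Dirichlet prior that adds a pseudo-count of 1 to every state; the toy rows and state names are invented, and this is an illustration of the idea rather than the CausalNex implementation.

```python
from collections import Counter


def fit_cpd_k2(rows, child, parent, child_states, parent_states):
    """Estimate P(child | parent) with one pseudo-count per child state."""
    counts = Counter((row[parent], row[child]) for row in rows)
    cpd = {}
    for p in parent_states:
        total = sum(counts[(p, c)] + 1 for c in child_states)
        cpd[p] = {c: (counts[(p, c)] + 1) / total for c in child_states}
    return cpd


# Toy discretized observations (invented for illustration).
rows = [
    {"N_fert": "low", "Y_corn": "low"},
    {"N_fert": "low", "Y_corn": "low"},
    {"N_fert": "low", "Y_corn": "high"},
    {"N_fert": "high", "Y_corn": "high"},
    {"N_fert": "high", "Y_corn": "high"},
]
cpd = fit_cpd_k2(rows, "Y_corn", "N_fert", ["low", "high"], ["low", "high"])
print(cpd)
```

The pseudo-counts keep unseen combinations, such as a low yield under high fertilizer in this toy data, from receiving a probability of exactly zero.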
Finally, the structure and likelihoods are used to perform observational inference on the fly, following a deterministic Junction Tree algorithm (JTA), and to make interventions using do-calculus. SageMaker Asynchronous Inference queues incoming requests and processes them asynchronously. This option is ideal for both observational and counterfactual inference scenarios: updating the probabilities throughout the network for a single query can't be parallelized and therefore takes significant time, although multiple queries can be run in parallel. See the following code:
"""
Query the marginal likelihood of states in the graph given some observations.
These observations can be made anywhere in the network,
and their impact will be propagated through to the node of interest.
"""
import multiprocessing

from causalnex.inference import InferenceEngine

ie = InferenceEngine(bn)
pseudo_observation = [{"day_sow": 0}, {"day_sow": 1}, {"day_sow": 2}]
marginals_multi = ie.query(
    pseudo_observation,
    parallel=True,
    num_cores=multiprocessing.cpu_count(),
)
# distribution before intervention
marginals_before = ie.query()["Y_corn"]
# updating a node distribution
ie.do_intervention("N_fert", 0)
# effect of do on marginals
marginals_after = ie.query()["Y_corn"]
# resetting the node distribution
ie.reset_do("N_fert")
For further details, refer to the inference script.
The causal model notebook is a step-by-step guide on running the preceding steps.
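The difference between observing a node and intervening on it can be checked by hand on a three-node network. The following sketch uses invented probabilities and a hypothetical soil confounder; it enumerates the joint distribution directly instead of using the Junction Tree algorithm or CausalNex.

```python
# Toy network: soil -> N_fert, soil -> Y_corn, N_fert -> Y_corn (all binary).
P_SOIL = {0: 0.5, 1: 0.5}             # P(soil)
P_FERT_GIVEN_SOIL = {0: 0.2, 1: 0.8}  # P(N_fert=1 | soil)
P_YIELD = {                           # P(Y_corn=1 | N_fert, soil)
    (0, 0): 0.2, (1, 0): 0.4,
    (0, 1): 0.6, (1, 1): 0.8,
}


def p_fert(n, soil):
    """P(N_fert=n | soil)."""
    p_one = P_FERT_GIVEN_SOIL[soil]
    return p_one if n == 1 else 1.0 - p_one


def p_yield_given_fert(n):
    """Observational query: P(Y_corn=1 | N_fert=n)."""
    joint = sum(P_SOIL[s] * p_fert(n, s) * P_YIELD[(n, s)] for s in P_SOIL)
    evidence = sum(P_SOIL[s] * p_fert(n, s) for s in P_SOIL)
    return joint / evidence


def p_yield_do_fert(n):
    """Interventional query: P(Y_corn=1 | do(N_fert=n)).

    The intervention removes the soil -> N_fert edge, so soil keeps its
    prior distribution instead of being updated by the observed N_fert.
    """
    return sum(P_SOIL[s] * P_YIELD[(n, s)] for s in P_SOIL)


print(p_yield_given_fert(1), p_yield_do_fert(1))
```

With these numbers, observing high fertilizer suggests a higher yield probability than forcing high fertilizer does, because the observation also carries information about the soil.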
Geospatial data processing
Earth Observation Jobs (EOJs) are chained together to acquire and transform satellite imagery, whereas purpose-built operations and pre-trained models are used for cloud removal, mosaicking, band math operations, and resampling. In this section, we discuss the geospatial processing steps in more detail.
Area of interest
In the following figure, green polygons are the selected counties, the orange grid is the database map (a grid of 10 km x 10 km cells where trials are conducted in the region), and the grid of grayscale squares is the 100 km x 100 km Sentinel-2 UTM tiling grid.
Spatial files are used to map the simulated database with corresponding satellite imagery, overlaying polygons of 10 km x 10 km cells that divide the state of Illinois (where trials are conducted in the region), county polygons, and 100 km x 100 km Sentinel-2 UTM tiles. To optimize the geospatial data processing pipeline, a few nearby Sentinel-2 tiles are first selected. Next, the aggregated geometries of tiles and cells are overlaid in order to obtain the region of interest (RoI). The counties and the cell IDs that are fully observed within the RoI are selected to form the polygon geometry passed on to the EOJs.
Time vary
For this exercise, the corn phenology cycle is divided into three stages: the vegetative stages V5 to R1 (emergence, leaf collars, and tasseling), the reproductive stages R1 to R4 (silking, blister, milk, and dough), and the reproductive stages R5 (dented) and R6 (physiological maturity). Consecutive satellite visits are acquired for each phenology stage within a time range of two weeks and a predefined area of interest (selected counties), enabling spatial and temporal analysis of satellite imagery. The following figure illustrates these metrics.
Cloud removal
Cloud removal for Sentinel-2 data uses an ML-based semantic segmentation model to identify clouds in the image, where cloudy pixels are replaced with the value -9999 (nodata value):
request_polygon_coordinates = [[(-90.571754, 39.839326), (-90.893651, 39.84092), (-90.916609, 39.845075), (-90.916071, 39.757168), (-91.147678, 39.75707), (-91.265848, 39.757258), (-91.365125, 39.758723), (-91.367962, 39.759124), (-91.365396, 39.777266), (-91.432919, 39.840554), (-91.446385, 39.870394), (-91.455887, 39.945538), (-91.460287, 39.980333), (-91.494865, 40.037421), (-91.510322, 40.127994), (-91.512974, 40.181062), (-91.510332, 40.201142), (-91.258828, 40.197299), (-90.911969, 40.193088), (-90.909756, 40.284394), (-90.450227, 40.276335), (-90.451502, 40.188892), (-90.199556, 40.183945), (-90.118966, 40.235263), (-90.033026, 40.377806), (-89.92468, 40.435921), (-89.717104, 40.435655), (-89.714927, 40.319218), (-89.602979, 40.320129), (-89.601604, 40.122432), (-89.578289, 39.976127), (-89.698259, 39.975309), (-89.701864, 39.916787), (-89.994506, 39.901925), (-89.994405, 39.87286), (-90.583534, 39.87675), (-90.582435, 39.854574), (-90.571754, 39.839326)]]
start_time="2018-08-15T00:00:00Z"
end_time="2018-09-15T00:00:00Z"
eoj_input_config = {
"RasterDataCollectionQuery": {
"RasterDataCollectionArn": 'arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8',
"AreaOfInterest": {
"AreaOfInterestGeometry": {
"PolygonGeometry": {"Coordinates": request_polygon_coordinates}
}
},
"TimeRangeFilter": {"StartTime": start_time, "EndTime": end_time},
"PropertyFilters": {
"Properties": [{"Property": {"EoCloudCover":
{"LowerBound": 0, "UpperBound": 10}}}],
"LogicalOperator": "AND",
},
}
}
eoj_config = {
"JobConfig": {
"CloudRemovalConfig": {
"AlgorithmName": "INTERPOLATION",
"InterpolationValue": "-9999",
"TargetBands": ["red", "green", "blue", "nir", "swir16"],
},
}
}
eojParams = {
    "Name": "cloudremoval",
    "InputConfig": eoj_input_config,
    **eoj_config,
    "ExecutionRoleArn": role_arn,
}
eoj_response = sg_client.start_earth_observation_job(**eojParams)
After the EOJ is created, the ARN is returned and used to perform the subsequent geomosaic operation.
To get the status of a job, you can run sg_client.get_earth_observation_job(Arn=eoj_response['Arn']).
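Because each EOJ takes the ARN of the previous one as input, it can help to wait for a job to finish before starting the next. The following is a minimal polling sketch. It assumes the response contains a "Status" field and that "COMPLETED", "FAILED", and "STOPPED" are the terminal states; the helper name wait_for_eoj and its parameters are hypothetical, so check them against the service documentation before relying on them.

```python
import time

# Assumed terminal states of an Earth Observation Job.
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_eoj(get_job, arn, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll a job until it reaches a terminal state and return that state.

    get_job is any callable that accepts Arn=... and returns a dict with a
    "Status" key, for example sg_client.get_earth_observation_job.
    """
    for _ in range(max_polls):
        status = get_job(Arn=arn)["Status"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Job {arn} did not finish after {max_polls} polls")
```

With the client from the preceding code, a call would look like wait_for_eoj(sg_client.get_earth_observation_job, eoj_response["Arn"]). Passing the getter as an argument keeps the helper testable without an AWS connection.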
Geomosaic
The geomosaic EOJ is used to merge images from multiple satellite visits into a large mosaic, by overwriting nodata or transparent pixels (including the cloudy pixels) with pixels from other timestamps:
eoj_config = {"JobConfig": {"GeoMosaicConfig": {"AlgorithmName": "NEAR"}}}
eojParams = {
    "Name": "geomosaic",
    "InputConfig": {"PreviousEarthObservationJobArn": eoj_arn},
    **eoj_config,
    "ExecutionRoleArn": role_arn,
}
eoj_response = sg_client.start_earth_observation_job(**eojParams)
After the EOJ is created, the ARN is returned and used to perform the subsequent resampling operation.
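The idea behind mosaicking can be shown on tiny rasters. The following sketch fills nodata pixels of one visit with valid pixels from another visit; it is a simplified illustration on nested lists with invented values, not the algorithm used by the geomosaic EOJ.

```python
NODATA = -9999


def mosaic(visits, nodata=NODATA):
    """Merge equally sized rasters; earlier visits in the list take priority."""
    rows, cols = len(visits[0]), len(visits[0][0])
    merged = [[nodata] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for visit in visits:
                if visit[r][c] != nodata:
                    merged[r][c] = visit[r][c]  # first valid pixel wins
                    break
    return merged


# Two visits of the same 2 x 2 area; -9999 marks cloudy pixels.
visit_a = [[0.31, NODATA], [NODATA, 0.42]]
visit_b = [[0.30, 0.28], [NODATA, 0.40]]
merged = mosaic([visit_a, visit_b])
print(merged)
```

A pixel that is cloudy in every visit stays nodata in the result, which is why several visits per phenology stage are acquired.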
Resampling
Resampling is used to downscale the resolution of the geospatial image in order to match the resolution of the crop masks (10–30 m resolution rescaling):
eoj_config = {
    "JobConfig": {
        "ResamplingConfig": {
            "OutputResolution": {"UserDefined": {"Value": 30, "Unit": "METERS"}},
            "AlgorithmName": "NEAR",
        },
    }
}
eojParams = {
    "Name": "resample",
    "InputConfig": {"PreviousEarthObservationJobArn": eoj_arn},
    **eoj_config,
    "ExecutionRoleArn": role_arn,
}
eoj_response = sg_client.start_earth_observation_job(**eojParams)
After the EOJ is created, the ARN is returned and used to perform the subsequent band math operation.
Band math
Band math operations are used for transforming the observations from multiple spectral bands to a single band. They include the following spectral indices:
- EVI2 – Two-Band Enhanced Vegetation Index
- GDVI – Generalized Difference Vegetation Index
- NDMI – Normalized Difference Moisture Index
- NDVI – Normalized Difference Vegetation Index
- NDWI – Normalized Difference Water Index
See the next code:
spectral_indices = [['EVI2', ' 2.5 * ( nir - red ) / ( nir + 2.4 * red + 1.0 ) '],
['GDVI', ' ( ( nir * * 2.0 ) - ( red * * 2.0 ) ) / ( ( nir * * 2.0 ) + ( red * * 2.0 ) ) '],
['NDMI', ' ( nir - swir16 ) / ( nir + swir16 ) '],
['NDVI', ' ( nir - red ) / ( nir + red ) '],
['NDWI', ' ( green - nir ) / ( green + nir ) ']]
eoj_config = {
"JobConfig": {
"BandMathConfig": {"CustomIndices": {"Operations": []}},
}
}
for indices in spectral_indices:
    eoj_config["JobConfig"]["BandMathConfig"]["CustomIndices"]["Operations"].append(
        {"Name": indices[0], "Equation": indices[1][1:-1]}
    )
eojParams = {
    "Name": "bandmath",
    "InputConfig": {"PreviousEarthObservationJobArn": eoj_arn},
    **eoj_config,
    "ExecutionRoleArn": role_arn,
}
eoj_response = sg_client.start_earth_observation_job(**eojParams)
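To see what these equations produce, the indices can be evaluated for a single pixel in plain Python. The following sketch copies the formulas from the preceding list, reading the spaced `* *` in the GDVI equation as exponentiation (an assumption); the reflectance values are invented.

```python
def spectral_indices(red, green, nir, swir16):
    """Evaluate the five indices for one pixel of surface reflectance."""
    return {
        "EVI2": 2.5 * (nir - red) / (nir + 2.4 * red + 1.0),
        "GDVI": (nir**2.0 - red**2.0) / (nir**2.0 + red**2.0),
        "NDMI": (nir - swir16) / (nir + swir16),
        "NDVI": (nir - red) / (nir + red),
        "NDWI": (green - nir) / (green + nir),
    }


# Invented reflectances for a densely vegetated pixel.
pixel = spectral_indices(red=0.1, green=0.1, nir=0.5, swir16=0.25)
for name, value in pixel.items():
    print(f"{name}: {value:.3f}")
```

Healthy vegetation reflects strongly in the near infrared and weakly in red, so NDVI, EVI2, and GDVI are high for this pixel while NDWI is negative.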
Zonal statistics
The spectral indices are further enriched using Amazon SageMaker Processing, where GDAL-based custom logic is used to do the following:
- Merge the spectral indices into a single multi-channel mosaic
- Reproject the mosaic to the crop mask's projection
- Apply the crop mask and reproject the mosaic to the cell polygons' CRS
- Calculate zonal statistics for selected polygons (10 km x 10 km cells)
With parallelized data distribution, manifest files (for each crop phenological stage) are distributed across several instances using the ShardedByS3Key S3 data distribution type. For further details, refer to the feature extraction script.
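The last step in the list, zonal statistics, reduces all valid pixels that fall inside a polygon to a few summary values. The following sketch shows the idea on a small grid where each pixel already carries the ID of its cell; the values and cell IDs are invented, and the solution itself uses GDAL-based logic rather than this pure-Python loop.

```python
from statistics import mean

NODATA = -9999


def zonal_stats(values, zones, nodata=NODATA):
    """Summarize valid pixel values per zone ID."""
    by_zone = {}
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if value != nodata:  # masked or cloudy pixels are skipped
                by_zone.setdefault(zone, []).append(value)
    return {
        zone: {"count": len(v), "mean": mean(v), "min": min(v), "max": max(v)}
        for zone, v in by_zone.items()
    }


# A 2 x 3 NDVI raster and the cell ID each pixel belongs to.
ndvi = [[0.2, 0.4, NODATA], [0.6, 0.8, 0.5]]
cells = [["A", "A", "B"], ["A", "B", "B"]]
stats = zonal_stats(ndvi, cells)
print(stats)
```

Each cell then contributes one row of tabular features per spectral index and phenology stage.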
The geospatial processing notebook is a step-by-step guide on running the preceding steps.
The following figure shows RGB channels of consecutive satellite visits representing the vegetative and reproductive stages of the corn phenology cycle, with (right) and without (left) crop masks (CW 20, 26 and 33, 2018 Central Illinois).
In the following figure, spectral indices (NDVI, EVI2, NDMI) of consecutive satellite visits represent the vegetative and reproductive stages of the corn phenology cycle (CW 20, 26 and 33, 2018 Central Illinois).
Clean up
If you no longer want to use this solution, you can delete the resources it created. After the solution is deployed in Studio, choose Delete all resources to automatically delete all standard resources that were created when launching the solution, including the S3 bucket.
Conclusion
This solution provides a blueprint for use cases where causal inference with Bayesian networks is the preferred methodology for answering causal questions from a combination of data and human inputs. The workflow includes an efficient implementation of the inference engine, which queues incoming queries and interventions and processes them asynchronously. The modular design enables the reuse of various components, including geospatial processing with purpose-built operations and pre-trained models, enrichment of satellite imagery with custom-built GDAL operations, and multimodal feature engineering (spectral indices and tabular data).
In addition, you can use this solution as a template for building gridded crop models where nitrogen fertilizer management and environmental policy analysis are conducted.
For more information, refer to Solution Templates and follow the guide to launch the Crop Yield Counterfactuals solution in the US West (Oregon) Region. The code is available in the GitHub repo.
Citations
German Mandrini, Sotirios V. Archontoulis, Cameron M. Pittelkow, Taro Mieno, Nicolas F. Martin,
Simulated dataset of corn response to nitrogen over thousands of fields and multiple years in Illinois,
Data in Brief, Volume 40, 2022, 107753, ISSN 2352-3409
About the Author
Paul Barna is a Senior Data Scientist with the Machine Learning Prototyping Labs at AWS.