How to Use JSON Schema to Validate Your ESG Data for Compliance
Technical deep dive on using JSON Schema to enforce data quality and consistency for regulatory filings.
Your sustainability team spent six months collecting ESG data. Two days before your CSRD filing deadline, you discover:
- 20 facilities reported energy in MWh instead of kWh
- 15 suppliers used different date formats (DD.MM.YYYY vs. YYYY-MM-DD)
- 8 entries have “null” for critical emissions fields
- 3 renewable percentages exceed 100%
These inconsistencies invalidate your entire submission. External auditors will reject it, you’ll miss the deadline, and investors will have questions.
JSON Schema validation prevents this by enforcing data quality and consistency when data enters your systems, rather than discovering problems months later.
The cost of inadequate data validation is substantial. Companies face audit failures, regulatory penalties, and restatement costs averaging €45,000 per incident. In 2023, 12% of EU companies restated ESG data due to validation errors.
What is JSON Schema?
JSON Schema defines the structure, types, and validation rules for JSON data. Think of it as a contract that data must satisfy to be accepted.
Here’s a simple example:
{
  "type": "object",
  "properties": {
    "energy_consumption_kwh": {
      "type": "number",
      "minimum": 0
    }
  },
  "required": ["energy_consumption_kwh"]
}
This schema requires that energy_consumption_kwh must exist, must be a number, and cannot be negative.
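Assuming the widely used jsonschema Python package, checking a payload against this schema takes only a few lines. The schema and helper below are illustrative, not part of any specific product API:

```python
from jsonschema import validate, ValidationError

# The simple schema from above
schema = {
    "type": "object",
    "properties": {
        "energy_consumption_kwh": {"type": "number", "minimum": 0}
    },
    "required": ["energy_consumption_kwh"],
}

def is_valid(payload: dict) -> bool:
    """Return True if the payload satisfies the schema."""
    try:
        validate(instance=payload, schema=schema)
        return True
    except ValidationError:
        return False

print(is_valid({"energy_consumption_kwh": 45230}))  # True: valid reading
print(is_valid({"energy_consumption_kwh": -100}))   # False: negative value
print(is_valid({}))                                 # False: required field missing
```
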
FIG 1.0 — JSON Schema validation flow ensuring data quality and compliance
Why JSON Schema Matters for ESG Compliance
Type Safety
Type mismatches cause calculation errors. JSON Schema catches these early:
- Numbers stored as strings (“15000” instead of 15000)
- Dates in wrong formats (text instead of ISO 8601)
- Booleans as strings (“true” instead of true)
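A quick sketch of how such type mismatches surface at validation time, again assuming the jsonschema package (field names here are illustrative):

```python
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "properties": {
        "energy_kwh": {"type": "number"},
        "grid_power": {"type": "boolean"},
    },
}

# "15000" and "true" are strings, not a number and a boolean
suspect = {"energy_kwh": "15000", "grid_power": "true"}

errors = sorted(Draft7Validator(schema).iter_errors(suspect),
                key=lambda e: list(e.path))
for err in errors:
    # err.path names the offending field; err.message explains the failure
    print(f"{'/'.join(map(str, err.path))}: {err.message}")
```

Both fields are flagged before the bad values reach any emissions calculation.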
Range Validation
Identify impossible values before they propagate:
- Negative energy consumption
- Renewable percentages over 100%
- Emissions data from before 1990 (when tracking began)
Completeness Enforcement
Require all mandatory fields:
- Facility IDs
- Billing periods
- Emissions values
Format Consistency
Standardize data across facilities and suppliers:
- All dates in ISO 8601 (YYYY-MM-DD)
- All currencies as ISO 4217 codes (EUR, USD, GBP)
- All units in standard formats (kWh, MWh, tCO2e)
Regulatory Alignment
Encode ESRS/CSRD requirements directly in your schema:
- ESRS E1 mandatory fields
- Reporting period constraints
- Data point requirements from regulatory frameworks
Research from the OntoMetric framework shows JSON Schema validation achieves 80-90% schema compliance and 65-90% semantic accuracy for complex ESG documents. Traditional extraction methods typically achieve only 3-10% accuracy. DataHen reports that validating at every pipeline stage catches errors early and increases efficiency through automation.
Building ESG JSON Schemas
Example 1: Utility Bill Schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Utility Bill ESG Data",
  "description": "Schema for extracting electricity, gas, and water utility bill data for CSRD compliance",
  "type": "object",
  "properties": {
    "document_type": {
      "description": "Type of utility bill",
      "type": "string",
      "enum": ["electricity", "gas", "water", "district_heating"]
    },
    "facility_id": {
      "description": "Unique facility identifier",
      "type": "string",
      "pattern": "^[A-Z]{3}-[0-9]{2,3}$",
      "$comment": "Enforces format: MUC-01, LON-12",
      "minLength": 5,
      "maxLength": 20
    },
    "supplier": {
      "description": "Utility supplier name",
      "type": "string",
      "minLength": 2
    },
    "account_number": {
      "description": "Supplier account number",
      "type": "string",
      "pattern": "^[A-Z0-9-]+$",
      "$comment": "Alphanumeric with hyphens"
    },
    "billing_period": {
      "description": "Billing period",
      "type": "object",
      "properties": {
        "start_date": {
          "description": "Billing period start date (ISO 8601)",
          "type": "string",
          "format": "date"
        },
        "end_date": {
          "description": "Billing period end date (ISO 8601)",
          "type": "string",
          "format": "date"
        }
      },
      "required": ["start_date", "end_date"]
    },
    "energy_consumption": {
      "description": "Energy consumption data",
      "type": "object",
      "properties": {
        "kwh": {
          "description": "Consumption in kilowatt-hours",
          "type": "number",
          "minimum": 0,
          "exclusiveMaximum": 10000000,
          "$comment": "Max 10M kWh per month"
        },
        "mwh": {
          "description": "Consumption in megawatt-hours (alternative unit)",
          "type": "number",
          "minimum": 0,
          "exclusiveMaximum": 10000
        }
      },
      "oneOf": [{ "required": ["kwh"] }, { "required": ["mwh"] }]
    },
    "renewable_energy": {
      "description": "Renewable energy breakdown",
      "type": "object",
      "properties": {
        "kwh": {
          "description": "Renewable consumption in kWh",
          "type": "number",
          "minimum": 0
        },
        "percentage": {
          "description": "Renewable energy percentage (0-100)",
          "type": "number",
          "minimum": 0,
          "maximum": 100
        },
        "source": {
          "description": "Renewable energy source",
          "type": "string",
          "enum": ["solar", "wind", "hydro", "biomass", "geothermal", "mixed", "unknown"]
        }
      }
    },
    "total_cost": {
      "description": "Total bill amount",
      "type": "object",
      "properties": {
        "amount": {
          "description": "Cost amount",
          "type": "number",
          "minimum": 0
        },
        "currency": {
          "description": "Currency code (ISO 4217)",
          "type": "string",
          "pattern": "^[A-Z]{3}$",
          "$comment": "3-letter ISO currency code",
          "enum": ["EUR", "GBP", "USD", "CHF", "SEK", "NOK", "DKK", "PLN"]
        }
      },
      "required": ["amount", "currency"]
    }
  },
  "required": ["document_type", "facility_id", "billing_period", "energy_consumption"]
}
FIG 2.0 — Structure of a CSRD-compliant utility bill schema
Example 2: Supplier Emissions Schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Supplier Emissions Data (Scope 3 Category 1)",
  "description": "Schema for supplier emissions data for CSRD Scope 3 reporting",
  "type": "object",
  "properties": {
    "supplier_id": {
      "description": "Unique supplier identifier",
      "type": "string",
      "pattern": "^SUP-[0-9]{4}$",
      "$comment": "Enforces format: SUP-0001"
    },
    "supplier_name": {
      "description": "Supplier legal name",
      "type": "string",
      "minLength": 2
    },
    "reporting_year": {
      "description": "Reporting year",
      "type": "integer",
      "minimum": 2020,
      "maximum": 2030,
      "$comment": "Future-proof to 2030"
    },
    "emissions": {
      "description": "Greenhouse gas emissions",
      "type": "object",
      "properties": {
        "scope1_tco2e": {
          "description": "Scope 1 emissions in metric tonnes CO2e",
          "type": "number",
          "minimum": 0
        },
        "scope2_tco2e": {
          "description": "Scope 2 emissions in metric tonnes CO2e",
          "type": "number",
          "minimum": 0
        },
        "scope3_tco2e": {
          "description": "Scope 3 emissions in metric tonnes CO2e",
          "type": "number",
          "minimum": 0
        },
        "total_tco2e": {
          "description": "Total emissions (Scope 1+2+3)",
          "type": "number",
          "minimum": 0
        }
      },
      "required": ["scope1_tco2e", "scope2_tco2e"]
    },
    "methodology": {
      "description": "Calculation methodology",
      "type": "string",
      "enum": [
        "GHG Protocol",
        "ISO 14064",
        "ISO 14067",
        "Spend-based",
        "Average-data",
        "Hybrid",
        "Custom"
      ]
    },
    "verification": {
      "description": "Third-party verification status",
      "type": "object",
      "properties": {
        "status": {
          "description": "Verification status",
          "type": "string",
          "enum": ["Third-party verified", "Self-assessed", "Not verified", "Partial"]
        },
        "verifier": {
          "description": "Verification organization name",
          "type": "string"
        },
        "verification_date": {
          "description": "Verification date (ISO 8601)",
          "type": "string",
          "format": "date"
        },
        "assurance_level": {
          "description": "Assurance level (ISAE 3000/ISO 14064)",
          "type": "string",
          "enum": ["Reasonable Assurance", "Limited Assurance", "None"]
        }
      }
    },
    "data_coverage": {
      "description": "Data completeness and coverage",
      "type": "object",
      "properties": {
        "percentage": {
          "description": "Percentage of total spend/emissions covered",
          "type": "number",
          "minimum": 0,
          "maximum": 100
        },
        "exclusions": {
          "description": "Categories or facilities excluded from data",
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "estimation_method": {
          "description": "Method used for uncovered data",
          "type": "string",
          "enum": ["Industry average", "Spend-based proxy", "Extrapolation", "Not estimated"]
        }
      }
    }
  },
  "required": ["supplier_id", "reporting_year", "emissions", "methodology"]
}
Advanced Validation Techniques
Conditional Validation
Require certain fields only when relevant:
{
  "type": "object",
  "properties": {
    "verification_status": {
      "type": "string",
      "enum": ["verified", "not_verified"]
    },
    "verifier_name": {
      "type": "string"
    },
    "verification_date": {
      "type": "string",
      "format": "date"
    }
  },
  "if": {
    "properties": {
      "verification_status": {
        "const": "verified"
      }
    }
  },
  "then": {
    "required": ["verifier_name", "verification_date"]
  }
}
If verification_status is “verified”, the schema requires verifier_name and verification_date. Otherwise, those fields are optional.
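The conditional behavior can be exercised directly with the jsonschema package (the verifier name below is just an example value):

```python
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "properties": {
        "verification_status": {"type": "string", "enum": ["verified", "not_verified"]},
        "verifier_name": {"type": "string"},
        "verification_date": {"type": "string", "format": "date"},
    },
    "if": {"properties": {"verification_status": {"const": "verified"}}},
    "then": {"required": ["verifier_name", "verification_date"]},
}
validator = Draft7Validator(schema)

# Unverified records need no verifier details
print(validator.is_valid({"verification_status": "not_verified"}))  # True

# Verified records must name the verifier and date
print(validator.is_valid({"verification_status": "verified"}))      # False
print(validator.is_valid({
    "verification_status": "verified",
    "verifier_name": "TUV SUD",
    "verification_date": "2024-03-15",
}))  # True
```
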
Cross-Field Validation
Ensure relationships between fields are valid:
{
  "type": "object",
  "properties": {
    "total_consumption_kwh": { "type": "number", "minimum": 0 },
    "renewable_consumption_kwh": { "type": "number", "minimum": 0 },
    "renewable_percentage": { "type": "number", "minimum": 0, "maximum": 100 }
  },
  "allOf": [
    {
      "if": {
        "properties": {
          "total_consumption_kwh": { "type": "number" },
          "renewable_consumption_kwh": { "type": "number" }
        },
        "required": ["total_consumption_kwh", "renewable_consumption_kwh"]
      },
      "then": {
        "properties": {
          "renewable_consumption_kwh": {
            "maximum": { "$data": "1/total_consumption_kwh" }
          }
        }
      }
    }
  ]
}
Note that the $data keyword above is an Ajv extension, not part of standard JSON Schema, so cross-field validation often requires custom logic beyond what a standards-compliant validator supports. Implement this in application code:
def validate_consumption_totals(data: dict) -> bool:
    """Validate renewable doesn't exceed total."""
    total = data.get("total_consumption_kwh", 0)
    renewable = data.get("renewable_consumption_kwh", 0)
    if renewable > total:
        raise ValueError(
            f"Renewable consumption ({renewable}) cannot exceed "
            f"total consumption ({total})"
        )
    return True
Pattern Matching for IDs
Enforce consistent ID formats across facilities:
{
  "facility_id": {
    "type": "string",
    "pattern": "^[A-Z]{3}-[0-9]{2}$",
    "$comment": "Matches MUC-01, LON-12, BER-05",
    "description": "Facility ID format: 3-letter city code, hyphen, 2-digit number"
  }
}
Valid: MUC-01, LON-12, BER-05
Invalid: MUC-1 (missing digit), Munich-01 (wrong format), muc-01 (lowercase)
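The same pattern can be tested in isolation with Python's standard re module before embedding it in a schema:

```python
import re

# Same regex as the schema's "pattern" keyword
FACILITY_ID = re.compile(r"^[A-Z]{3}-[0-9]{2}$")

for candidate in ["MUC-01", "LON-12", "BER-05", "MUC-1", "Munich-01", "muc-01"]:
    verdict = "valid" if FACILITY_ID.fullmatch(candidate) else "invalid"
    print(f"{candidate}: {verdict}")
```
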
Enum Constraints for Controlled Vocabularies
Limit values to approved lists. This is useful for emission sources, methodologies, and other standardized fields:
{
  "emission_source": {
    "type": "string",
    "enum": [
      "stationary_combustion",
      "mobile_combustion",
      "process_emissions",
      "fugitive_emissions",
      "electricity_imported",
      "refrigerants",
      "other"
    ]
  }
}
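Wrapped in an object schema, the enum rejects any free-text variant of an approved value, as this jsonschema sketch shows:

```python
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "properties": {
        "emission_source": {
            "type": "string",
            "enum": [
                "stationary_combustion", "mobile_combustion", "process_emissions",
                "fugitive_emissions", "electricity_imported", "refrigerants", "other",
            ],
        }
    },
}
validator = Draft7Validator(schema)

print(validator.is_valid({"emission_source": "mobile_combustion"}))   # True
print(validator.is_valid({"emission_source": "Mobile Combustion"}))   # False: free text
```
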
Implementing Validation with LeapOCR
Creating Schema-Validated Templates
import os

from leapocr import LeapOCR

# Define your schema
utility_bill_schema = {
    # ... (full schema from Example 1 above)
}

# Create template with schema
template = {
    "name": "esg-utility-bill-csrd",
    "description": "Extract utility bill data for CSRD compliance with JSON Schema validation",
    "format": "structured",
    "schema": utility_bill_schema,
    "instructions": """
Extract utility bill data with the following requirements:
- Facility ID must match pattern: XXX-## (e.g., MUC-01, LON-12)
- Dates must be in ISO 8601 format (YYYY-MM-DD)
- Energy consumption must be >= 0
- Renewable percentage must be 0-100
- Currency must be valid ISO 4217 code
If validation fails, flag the specific fields that don't meet schema requirements.
""",
}

# Create template via API
client = LeapOCR(api_key=os.getenv("LEAPOCR_API_KEY"))
created_template = client.templates.create(template)
print(f"Template created: {created_template['template_slug']}")
Processing with Validation
# Process document with schema validation
job = client.ocr.process_file(
    file_path="utility_bill.pdf",
    format="structured",
    template_slug="esg-utility-bill-csrd",
)
result = client.ocr.wait_until_done(job["job_id"])

if result["status"] == "completed":
    data = result["pages"][0]["result"]
    # Check if validation passed
    if result["pages"][0].get("schema_valid", True):
        print("✅ Schema validation passed")
        print(f"Extracted data: {data}")
    else:
        print("❌ Schema validation failed")
        print(f"Validation errors: {result['pages'][0].get('validation_errors', [])}")
else:
    print(f"Extraction failed: {result.get('error')}")
Handling Validation Errors
def handle_validation_errors(result: dict):
    """Handle schema validation errors gracefully."""
    validation_errors = result["pages"][0].get("validation_errors", [])
    if validation_errors:
        print("Validation errors detected:")
        for error in validation_errors:
            field = error.get("field", "unknown")
            message = error.get("message", "Invalid value")
            value = error.get("value", "N/A")
            print(f"  - {field}: {message} (got: {value})")
            # Flag for human review
            flag_for_review(result["job_id"], field, message)
        # Don't save invalid data to database
        return False
    else:
        # Save valid data
        save_to_database(result["pages"][0]["result"])
        return True
Testing Your Schemas
Unit Tests for Validation
import pytest
from jsonschema import validate, ValidationError

# load_schema is a project helper that reads the schema file from disk

def test_valid_utility_bill():
    """Test that valid data passes schema validation."""
    schema = load_schema("utility_bill.json")
    valid_data = {
        "document_type": "electricity",
        "facility_id": "MUC-01",
        "billing_period": {
            "start_date": "2024-01-01",
            "end_date": "2024-01-31"
        },
        "energy_consumption": {
            "kwh": 45230
        }
    }
    # Should not raise exception
    validate(instance=valid_data, schema=schema)

def test_invalid_facility_id():
    """Test that invalid facility ID fails validation."""
    schema = load_schema("utility_bill.json")
    invalid_data = {
        "document_type": "electricity",
        "facility_id": "Munich-01",  # Wrong format
        "billing_period": {
            "start_date": "2024-01-01",
            "end_date": "2024-01-31"
        },
        "energy_consumption": {
            "kwh": 45230
        }
    }
    # Should raise ValidationError
    with pytest.raises(ValidationError) as exc_info:
        validate(instance=invalid_data, schema=schema)
    assert "facility_id" in str(exc_info.value)
    assert "pattern" in str(exc_info.value)

def test_negative_consumption():
    """Test that negative consumption fails validation."""
    schema = load_schema("utility_bill.json")
    invalid_data = {
        "document_type": "electricity",
        "facility_id": "MUC-01",
        "billing_period": {
            "start_date": "2024-01-01",
            "end_date": "2024-01-31"
        },
        "energy_consumption": {
            "kwh": -100  # Negative!
        }
    }
    with pytest.raises(ValidationError) as exc_info:
        validate(instance=invalid_data, schema=schema)
    assert "minimum" in str(exc_info.value)
Integration Tests with Real Documents
def test_extraction_with_validation():
    """Test extraction pipeline with schema validation."""
    # Process real utility bill
    job = client.ocr.process_file(
        file_path="test_data/sample_bill.pdf",
        format="structured",
        template_slug="esg-utility-bill-csrd",
    )
    result = client.ocr.wait_until_done(job["job_id"])

    # Assert extraction succeeded
    assert result["status"] == "completed"

    # Assert schema validation passed
    assert result["pages"][0].get("schema_valid", True)

    # Assert required fields are present
    data = result["pages"][0]["result"]
    assert "facility_id" in data
    assert "energy_consumption" in data
    assert "billing_period" in data

    # Assert data types are correct
    assert isinstance(data["energy_consumption"]["kwh"], (int, float))
    assert data["energy_consumption"]["kwh"] >= 0
Common ESG Validation Patterns
Date Range Validation
Ensure start dates come before end dates:
{
  "reporting_period": {
    "type": "object",
    "properties": {
      "start_date": { "type": "string", "format": "date" },
      "end_date": { "type": "string", "format": "date" }
    },
    "required": ["start_date", "end_date"]
  }
}
Add application-layer validation for date logic:
from datetime import datetime

def validate_date_range(data: dict):
    """Ensure start_date <= end_date."""
    start = data.get("reporting_period", {}).get("start_date")
    end = data.get("reporting_period", {}).get("end_date")
    if start and end:
        if datetime.fromisoformat(start) > datetime.fromisoformat(end):
            raise ValueError(f"Start date ({start}) cannot be after end date ({end})")
Percentage Range Validation
Keep percentages within valid bounds:
{
  "renewable_percentage": {
    "type": "number",
    "minimum": 0,
    "maximum": 100
  }
}
Unit Consistency
Require exactly one unit type:
{
  "energy_consumption": {
    "type": "object",
    "oneOf": [{ "required": ["kwh"] }, { "required": ["mwh"] }],
    "properties": {
      "kwh": { "type": "number", "minimum": 0 },
      "mwh": { "type": "number", "minimum": 0 }
    }
  }
}
This enforces that exactly one unit is present: kWh or MWh, never both and never neither.
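All four cases can be verified with the jsonschema package against the inner object schema:

```python
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "oneOf": [{"required": ["kwh"]}, {"required": ["mwh"]}],
    "properties": {
        "kwh": {"type": "number", "minimum": 0},
        "mwh": {"type": "number", "minimum": 0},
    },
}
validator = Draft7Validator(schema)

print(validator.is_valid({"kwh": 45230}))                 # True: one unit
print(validator.is_valid({"mwh": 45.23}))                 # True: one unit
print(validator.is_valid({"kwh": 45230, "mwh": 45.23}))   # False: both units
print(validator.is_valid({}))                             # False: neither unit
```
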
Currency Validation
Accept only valid currency codes:
{
  "currency": {
    "type": "string",
    "pattern": "^[A-Z]{3}$",
    "enum": ["EUR", "GBP", "USD", "CHF", "SEK", "NOK", "DKK", "PLN"]
  }
}
Compliance Benefits
CSRD Readiness
Encode ESRS data requirements in your schema:
- ESRS E1: Mandatory emissions fields (Scope 1, 2, 3)
- ESRS E2: Pollution data with units (kg, tonnes, m³)
- ESRS E3: Water consumption (m³) and withdrawal data
- ESRS E5: Material flow data (tonnes, kg)
Audit Trail
Log every validation attempt for audit purposes:
{
  "validation_attempt": {
    "timestamp": "2024-02-01T14:23:11Z",
    "schema_version": "1.2.0",
    "result": "failed",
    "errors": [
      {
        "field": "renewable_percentage",
        "error": "maximum",
        "message": "Value 120 exceeds maximum 100",
        "value": 120
      }
    ]
  }
}
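One way to produce records in this shape is to map jsonschema errors onto the log fields. This is a sketch under assumptions: the schema_version constant and record layout mirror the example above and are not tied to any specific product API:

```python
import json
from datetime import datetime, timezone
from jsonschema import Draft7Validator

SCHEMA_VERSION = "1.2.0"  # assumed versioning convention

def audit_record(schema: dict, instance: dict) -> dict:
    """Build an audit-trail entry from one validation attempt."""
    errors = list(Draft7Validator(schema).iter_errors(instance))
    return {
        "validation_attempt": {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "schema_version": SCHEMA_VERSION,
            "result": "failed" if errors else "passed",
            "errors": [
                {
                    "field": "/".join(map(str, e.path)) or "(root)",
                    "error": e.validator,  # failing keyword, e.g. "maximum"
                    "message": e.message,
                    "value": e.instance,
                }
                for e in errors
            ],
        }
    }

schema = {"properties": {"renewable_percentage": {"type": "number", "maximum": 100}}}
record = audit_record(schema, {"renewable_percentage": 120})
print(json.dumps(record, indent=2, default=str))
```
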
Continuous Compliance
Run automated validation checks regularly:
import json

from jsonschema import validate, ValidationError

# Run nightly validation on all ESG data
def nightly_validation_check():
    """Validate all ESG data in the database (assumes an open DB cursor and a load_schema helper)."""
    cursor.execute("SELECT * FROM esg_data WHERE validation_status != 'validated'")
    for row in cursor.fetchall():
        data = json.loads(row["data"])
        schema = load_schema(row["data_type"])
        try:
            validate(instance=data, schema=schema)
            cursor.execute(
                "UPDATE esg_data SET validation_status = 'validated' WHERE id = %s",
                (row["id"],),
            )
        except ValidationError as e:
            cursor.execute(
                "UPDATE esg_data SET validation_status = 'failed', validation_error = %s WHERE id = %s",
                (str(e), row["id"]),
            )
Conclusion
JSON Schema validation prevents data quality problems before they become compliance issues. By enforcing type safety, range constraints, and format consistency at extraction time, you:
- Stop bad data from entering your systems
- Catch errors before regulatory filings
- Reduce manual review by 80%
- Encode regulatory requirements in schema
- Accelerate audits with pre-validated data
FIG 3.0 — The data quality funnel: from raw input to audit-ready output
Your sustainability data is only as reliable as your validation. Build schemas that enforce quality from the start.
Next Steps:
- Read Data Residency & EU Hosting for ESG
- Explore Pre-Built ESG Schemas
- Try Schema Validation
Try LeapOCR on your own documents
Start with 100 free credits and see how your workflow holds up on real files.
Eligible paid plans include a 3-day trial with 100 credits after you add a credit card, so you can test actual PDFs, scans, and forms before committing to a rollout.