
How to Use JSON Schema to Validate Your ESG Data for Compliance

Technical deep dive on using JSON Schema to enforce data quality and consistency for regulatory filings.

Tags: JSON Schema, validation, technical, compliance, ESG, data quality
Published: January 18, 2025 · Read time: 11 min · Word count: 2,348



Your sustainability team spent six months collecting ESG data. Two days before your CSRD filing deadline, you discover:

  • 20 facilities reported energy in MWh instead of kWh
  • 15 suppliers used different date formats (DD.MM.YYYY vs. YYYY-MM-DD)
  • 8 entries have “null” for critical emissions fields
  • 3 renewable percentages exceed 100%

These inconsistencies invalidate your entire submission. External auditors will reject it, you’ll miss the deadline, and investors will have questions.

JSON Schema validation prevents this by enforcing data quality and consistency when data enters your systems, rather than discovering problems months later.

The cost of inadequate data validation is substantial. Companies face audit failures, regulatory penalties, and restatement costs averaging €45,000 per incident. In 2023, 12% of EU companies restated ESG data due to validation errors.

What is JSON Schema?

JSON Schema defines the structure, types, and validation rules for JSON data. Think of it as a contract that data must satisfy to be accepted.

Here’s a simple example:

{
  "type": "object",
  "properties": {
    "energy_consumption_kwh": {
      "type": "number",
      "minimum": 0
    }
  },
  "required": ["energy_consumption_kwh"]
}

This schema requires that energy_consumption_kwh exists, is a number, and is not negative.
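To see the contract in action, here is a quick check with the Python jsonschema library (one of several draft-07 validators; any conforming validator behaves the same):

```python
from jsonschema import validate, ValidationError

schema = {
  "type": "object",
  "properties": {
    "energy_consumption_kwh": {"type": "number", "minimum": 0}
  },
  "required": ["energy_consumption_kwh"]
}

# Valid data passes silently
validate(instance={"energy_consumption_kwh": 45230}, schema=schema)

# Negative consumption raises a ValidationError
try:
  validate(instance={"energy_consumption_kwh": -100}, schema=schema)
except ValidationError as err:
  print(err.message)
```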

FIG 1.0 — JSON Schema validation flow: type checking, range validation, and format enforcement

Why JSON Schema Matters for ESG Compliance

Type Safety

Type mismatches cause calculation errors. JSON Schema catches these early:

  • Numbers stored as strings (“15000” instead of 15000)
  • Dates in wrong formats (text instead of ISO 8601)
  • Booleans as strings (“true” instead of true)
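Any draft-07 validator surfaces these mismatches as distinct errors; for example, with the Python jsonschema library:

```python
from jsonschema import Draft7Validator

schema = {
  "properties": {
    "energy_kwh": {"type": "number"},
    "is_renewable": {"type": "boolean"}
  }
}
validator = Draft7Validator(schema)

# Strings masquerading as numbers or booleans are each flagged
bad_record = {"energy_kwh": "15000", "is_renewable": "true"}
errors = list(validator.iter_errors(bad_record))
for err in errors:
  print(list(err.path), err.message)
```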

Range Validation

Identify impossible values before they propagate:

  • Negative energy consumption
  • Renewable percentages over 100%
  • Emissions data from before 1990 (when tracking began)

Completeness Enforcement

Require all mandatory fields:

  • Facility IDs
  • Billing periods
  • Emissions values

Format Consistency

Standardize data across facilities and suppliers:

  • All dates in ISO 8601 (YYYY-MM-DD)
  • All currencies as ISO 4217 codes (EUR, USD, GBP)
  • All units in standard formats (kWh, MWh, tCO2e)

Regulatory Alignment

Encode ESRS/CSRD requirements directly in your schema:

  • ESRS E1 mandatory fields
  • Reporting period constraints
  • Data point requirements from regulatory frameworks

Research from the OntoMetric framework shows JSON Schema validation achieves 80-90% schema compliance and 65-90% semantic accuracy for complex ESG documents. Traditional extraction methods typically achieve only 3-10% accuracy. DataHen reports that validating at every pipeline stage catches errors early and increases efficiency through automation.
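The stage-by-stage idea can be sketched in a few lines; the stage names and helper below are illustrative, not part of either framework:

```python
from jsonschema import Draft7Validator

def validate_at_stage(stage: str, record: dict, schema: dict) -> list:
  """Return human-readable validation errors, tagged with the pipeline stage."""
  return [
    f"[{stage}] {'/'.join(str(p) for p in err.path) or '(root)'}: {err.message}"
    for err in Draft7Validator(schema).iter_errors(record)
  ]

schema = {
  "type": "object",
  "properties": {"kwh": {"type": "number", "minimum": 0}},
  "required": ["kwh"]
}

# Run the same contract at ingestion, aggregation, and pre-filing
for stage in ("ingestion", "aggregation", "pre-filing"):
  errors = validate_at_stage(stage, {"kwh": -5}, schema)
  print(errors)
```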

Building ESG JSON Schemas

Example 1: Utility Bill Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Utility Bill ESG Data",
  "description": "Schema for extracting electricity, gas, and water utility bill data for CSRD compliance",
  "type": "object",
  "properties": {
    "document_type": {
      "description": "Type of utility bill",
      "type": "string",
      "enum": ["electricity", "gas", "water", "district_heating"]
    },
    "facility_id": {
      "description": "Unique facility identifier",
      "type": "string",
      "pattern": "^[A-Z]{3}-[0-9]{2,3}$",
      "$comment": "Enforces format: MUC-01, LON-12",
      "minLength": 5,
      "maxLength": 20
    },
    "supplier": {
      "description": "Utility supplier name",
      "type": "string",
      "minLength": 2
    },
    "account_number": {
      "description": "Supplier account number",
      "type": "string",
      "pattern": "^[A-Z0-9-]+$",
      "$comment": "Alphanumeric with hyphens"
    },
    "billing_period": {
      "description": "Billing period",
      "type": "object",
      "properties": {
        "start_date": {
          "description": "Billing period start date (ISO 8601)",
          "type": "string",
          "format": "date"
        },
        "end_date": {
          "description": "Billing period end date (ISO 8601)",
          "type": "string",
          "format": "date"
        }
      },
      "required": ["start_date", "end_date"]
    },
    "energy_consumption": {
      "description": "Energy consumption data",
      "type": "object",
      "properties": {
        "kwh": {
          "description": "Consumption in kilowatt-hours",
          "type": "number",
          "minimum": 0,
          "exclusiveMaximum": 10000000,
          "$comment": "Max 10M kWh per month"
        },
        "mwh": {
          "description": "Consumption in megawatt-hours (alternative unit)",
          "type": "number",
          "minimum": 0,
          "exclusiveMaximum": 10000
        }
      },
      "oneOf": [{ "required": ["kwh"] }, { "required": ["mwh"] }]
    },
    "renewable_energy": {
      "description": "Renewable energy breakdown",
      "type": "object",
      "properties": {
        "kwh": {
          "description": "Renewable consumption in kWh",
          "type": "number",
          "minimum": 0
        },
        "percentage": {
          "description": "Renewable energy percentage (0-100)",
          "type": "number",
          "minimum": 0,
          "maximum": 100
        },
        "source": {
          "description": "Renewable energy source",
          "type": "string",
          "enum": ["solar", "wind", "hydro", "biomass", "geothermal", "mixed", "unknown"]
        }
      }
    },
    "total_cost": {
      "description": "Total bill amount",
      "type": "object",
      "properties": {
        "amount": {
          "description": "Cost amount",
          "type": "number",
          "minimum": 0
        },
        "currency": {
          "description": "Currency code (ISO 4217)",
          "type": "string",
          "pattern": "^[A-Z]{3}$",
          "$comment": "3-letter ISO currency code",
          "enum": ["EUR", "GBP", "USD", "CHF", "SEK", "NOK", "DKK", "PLN"]
        }
      },
      "required": ["amount", "currency"]
    }
  },
  "required": ["document_type", "facility_id", "billing_period", "energy_consumption"]
}

FIG 2.0 — Structure of a CSRD-compliant utility bill schema: field types and validation rules

Example 2: Supplier Emissions Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Supplier Emissions Data (Scope 3 Category 1)",
  "description": "Schema for supplier emissions data for CSRD Scope 3 reporting",
  "type": "object",
  "properties": {
    "supplier_id": {
      "description": "Unique supplier identifier",
      "type": "string",
      "pattern": "^SUP-[0-9]{4}$",
      "$comment": "Enforces format: SUP-0001"
    },
    "supplier_name": {
      "description": "Supplier legal name",
      "type": "string",
      "minLength": 2
    },
    "reporting_year": {
      "description": "Reporting year",
      "type": "integer",
      "minimum": 2020,
      "maximum": 2030,
      "$comment": "Future-proof to 2030"
    },
    "emissions": {
      "description": "Greenhouse gas emissions",
      "type": "object",
      "properties": {
        "scope1_tco2e": {
          "description": "Scope 1 emissions in metric tonnes CO2e",
          "type": "number",
          "minimum": 0
        },
        "scope2_tco2e": {
          "description": "Scope 2 emissions in metric tonnes CO2e",
          "type": "number",
          "minimum": 0
        },
        "scope3_tco2e": {
          "description": "Scope 3 emissions in metric tonnes CO2e",
          "type": "number",
          "minimum": 0
        },
        "total_tco2e": {
          "description": "Total emissions (Scope 1+2+3)",
          "type": "number",
          "minimum": 0
        }
      },
      "required": ["scope1_tco2e", "scope2_tco2e"]
    },
    "methodology": {
      "description": "Calculation methodology",
      "type": "string",
      "enum": [
        "GHG Protocol",
        "ISO 14064",
        "ISO 14067",
        "Spend-based",
        "Average-data",
        "Hybrid",
        "Custom"
      ]
    },
    "verification": {
      "description": "Third-party verification status",
      "type": "object",
      "properties": {
        "status": {
          "description": "Verification status",
          "type": "string",
          "enum": ["Third-party verified", "Self-assessed", "Not verified", "Partial"]
        },
        "verifier": {
          "description": "Verification organization name",
          "type": "string"
        },
        "verification_date": {
          "description": "Verification date (ISO 8601)",
          "type": "string",
          "format": "date"
        },
        "assurance_level": {
          "description": "Assurance level (ISAE 3000/ISO 14064)",
          "type": "string",
          "enum": ["Reasonable Assurance", "Limited Assurance", "None"]
        }
      }
    },
    "data_coverage": {
      "description": "Data completeness and coverage",
      "type": "object",
      "properties": {
        "percentage": {
          "description": "Percentage of total spend/emissions covered",
          "type": "number",
          "minimum": 0,
          "maximum": 100
        },
        "exclusions": {
          "description": "Categories or facilities excluded from data",
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "estimation_method": {
          "description": "Method used for uncovered data",
          "type": "string",
          "enum": ["Industry average", "Spend-based proxy", "Extrapolation", "Not estimated"]
        }
      }
    }
  },
  "required": ["supplier_id", "reporting_year", "emissions", "methodology"]
}
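One thing this schema cannot express is that total_tco2e should equal the sum of the three scopes; that arithmetic belongs in application code. A sketch (the function name and tolerance are illustrative):

```python
def validate_scope_totals(emissions: dict, rel_tolerance: float = 0.005) -> None:
  """Raise if total_tco2e disagrees with the sum of reported scopes."""
  total = emissions.get("total_tco2e")
  if total is None:
    return  # total is optional in the schema above
  scope_sum = sum(
    emissions.get(k, 0.0)
    for k in ("scope1_tco2e", "scope2_tco2e", "scope3_tco2e")
  )
  if abs(total - scope_sum) > rel_tolerance * max(scope_sum, 1.0):
    raise ValueError(
      f"total_tco2e ({total}) does not match scope sum ({scope_sum})"
    )
```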

Advanced Validation Techniques

Conditional Validation

Require certain fields only when relevant:

{
  "type": "object",
  "properties": {
    "verification_status": {
      "type": "string",
      "enum": ["verified", "not_verified"]
    },
    "verifier_name": {
      "type": "string"
    },
    "verification_date": {
      "type": "string",
      "format": "date"
    }
  },
  "if": {
    "properties": {
      "verification_status": {
        "const": "verified"
      }
    }
  },
  "then": {
    "required": ["verifier_name", "verification_date"]
  }
}

If verification_status is “verified”, the schema requires verifier_name and verification_date. Otherwise, those fields are optional.
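You can confirm the conditional behaves this way with the Python jsonschema library, which supports draft-07 if/then:

```python
from jsonschema import Draft7Validator

schema = {
  "type": "object",
  "properties": {
    "verification_status": {"type": "string", "enum": ["verified", "not_verified"]},
    "verifier_name": {"type": "string"},
    "verification_date": {"type": "string", "format": "date"}
  },
  "if": {"properties": {"verification_status": {"const": "verified"}}},
  "then": {"required": ["verifier_name", "verification_date"]}
}
v = Draft7Validator(schema)

# Unverified records need no verifier details
assert not list(v.iter_errors({"verification_status": "not_verified"}))

# Verified records without them fail
assert list(v.iter_errors({"verification_status": "verified"}))
```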

Cross-Field Validation

Ensure relationships between fields are valid. Standard JSON Schema cannot compare one field against another; the example below relies on the $data extension supported by validators such as Ajv:

{
  "type": "object",
  "properties": {
    "total_consumption_kwh": { "type": "number", "minimum": 0 },
    "renewable_consumption_kwh": { "type": "number", "minimum": 0 },
    "renewable_percentage": { "type": "number", "minimum": 0, "maximum": 100 }
  },
  "allOf": [
    {
      "if": {
        "properties": {
          "total_consumption_kwh": { "type": "number" },
          "renewable_consumption_kwh": { "type": "number" }
        },
        "required": ["total_consumption_kwh", "renewable_consumption_kwh"]
      },
      "then": {
        "properties": {
          "renewable_consumption_kwh": {
            "maximum": { "$data": "1/total_consumption_kwh" }
          }
        }
      }
    }
  ]
}

Cross-field validation often requires custom logic beyond standard JSON Schema. Implement this in application code:

def validate_consumption_totals(data: dict) -> bool:
  """Validate renewable doesn't exceed total."""
  total = data.get("total_consumption_kwh", 0)
  renewable = data.get("renewable_consumption_kwh", 0)

  if renewable > total:
    raise ValueError(f"Renewable consumption ({renewable}) cannot exceed total consumption ({total})")

  return True

Pattern Matching for IDs

Enforce consistent ID formats across facilities:

{
  "facility_id": {
    "type": "string",
    "pattern": "^[A-Z]{3}-[0-9]{2}$",
    "$comment": "MUC-01, LON-12, BER-05",
    "description": "Facility ID format: 3-letter city code, hyphen, 2-digit number"
  }
}

Valid: MUC-01, LON-12, BER-05
Invalid: MUC-1 (missing digit), Munich-01 (wrong format), muc-01 (lowercase)
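Before embedding a pattern in a schema, it's worth sanity-checking the regex itself; a quick sketch in Python:

```python
import re

FACILITY_ID = re.compile(r"^[A-Z]{3}-[0-9]{2}$")

candidates = ["MUC-01", "LON-12", "BER-05", "MUC-1", "Munich-01", "muc-01"]
results = {c: bool(FACILITY_ID.match(c)) for c in candidates}
for candidate, ok in results.items():
  print(f"{candidate}: {'valid' if ok else 'invalid'}")
```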

Enum Constraints for Controlled Vocabularies

Limit values to approved lists. This is useful for emission sources, methodologies, and other standardized fields:

{
  "emission_source": {
    "type": "string",
    "enum": [
      "stationary_combustion",
      "mobile_combustion",
      "process_emissions",
      "fugitive_emissions",
      "electricity_imported",
      "refrigerants",
      "other"
    ]
  }
}

Implementing Validation with LeapOCR

Creating Schema-Validated Templates

import os

from leapocr import LeapOCR

# Define your schema
utility_bill_schema = {
  # ... (full schema from Example 1 above)
}

# Create template with schema
template = {
  "name": "esg-utility-bill-csrd",
  "description": "Extract utility bill data for CSRD compliance with JSON Schema validation",
  "format": "structured",
  "schema": utility_bill_schema,
  "instructions": """
  Extract utility bill data with the following requirements:
  - Facility ID must match pattern: XXX-## (e.g., MUC-01, LON-12)
  - Dates must be in ISO 8601 format (YYYY-MM-DD)
  - Energy consumption must be >= 0
  - Renewable percentage must be 0-100
  - Currency must be valid ISO 4217 code

  If validation fails, flag the specific fields that don't meet schema requirements.
  """
}

# Create template via API
client = LeapOCR(api_key=os.getenv("LEAPOCR_API_KEY"))
created_template = client.templates.create(template)
print(f"Template created: {created_template['template_slug']}")

Processing with Validation

# Process document with schema validation
job = client.ocr.process_file(
  file_path="utility_bill.pdf",
  format="structured",
  template_slug="esg-utility-bill-csrd"
)

result = client.ocr.wait_until_done(job["job_id"])

if result["status"] == "completed":
  data = result["pages"][0]["result"]

  # Check if validation passed
  if result["pages"][0].get("schema_valid", True):
    print("✅ Schema validation passed")
    print(f"Extracted data: {data}")
  else:
    print("❌ Schema validation failed")
    print(f"Validation errors: {result['pages'][0].get('validation_errors', [])}")
else:
  print(f"Extraction failed: {result.get('error')}")

Handling Validation Errors

def handle_validation_errors(result: dict):
  """Handle schema validation errors gracefully.

  Assumes flag_for_review() and save_to_database() helpers exist.
  """
  validation_errors = result["pages"][0].get("validation_errors", [])

  if validation_errors:
    print("Validation errors detected:")

    for error in validation_errors:
      field = error.get("field", "unknown")
      message = error.get("message", "Invalid value")
      value = error.get("value", "N/A")

      print(f"  - {field}: {message} (got: {value})")

      # Flag for human review
      flag_for_review(result["job_id"], field, message)

    # Don't save invalid data to database
    return False
  else:
    # Save valid data
    save_to_database(result["pages"][0]["result"])
    return True

Testing Your Schemas

Unit Tests for Validation

import json

import pytest
from jsonschema import validate, ValidationError

def load_schema(path: str) -> dict:
  """Load a JSON Schema from disk."""
  with open(path) as f:
    return json.load(f)

def test_valid_utility_bill():
  """Test that valid data passes schema validation."""
  schema = load_schema("utility_bill.json")

  valid_data = {
    "document_type": "electricity",
    "facility_id": "MUC-01",
    "billing_period": {
      "start_date": "2024-01-01",
      "end_date": "2024-01-31"
    },
    "energy_consumption": {
      "kwh": 45230
    }
  }

  # Should not raise exception
  validate(instance=valid_data, schema=schema)

def test_invalid_facility_id():
  """Test that invalid facility ID fails validation."""
  schema = load_schema("utility_bill.json")

  invalid_data = {
    "document_type": "electricity",
    "facility_id": "Munich-01",  # Wrong format
    "billing_period": {
      "start_date": "2024-01-01",
      "end_date": "2024-01-31"
    },
    "energy_consumption": {
      "kwh": 45230
    }
  }

  # Should raise ValidationError
  with pytest.raises(ValidationError) as exc_info:
    validate(instance=invalid_data, schema=schema)

  assert "facility_id" in str(exc_info.value)
  assert "pattern" in str(exc_info.value)

def test_negative_consumption():
  """Test that negative consumption fails validation."""
  schema = load_schema("utility_bill.json")

  invalid_data = {
    "document_type": "electricity",
    "facility_id": "MUC-01",
    "billing_period": {
      "start_date": "2024-01-01",
      "end_date": "2024-01-31"
    },
    "energy_consumption": {
      "kwh": -100  # Negative!
    }
  }

  with pytest.raises(ValidationError) as exc_info:
    validate(instance=invalid_data, schema=schema)

  assert "minimum" in str(exc_info.value)

Integration Tests with Real Documents

def test_extraction_with_validation():
  """Test extraction pipeline with schema validation."""
  # Process real utility bill
  job = client.ocr.process_file(
    file_path="test_data/sample_bill.pdf",
    format="structured",
    template_slug="esg-utility-bill-csrd"
  )

  result = client.ocr.wait_until_done(job["job_id"])

  # Assert extraction succeeded
  assert result["status"] == "completed"

  # Assert schema validation passed
  assert result["pages"][0].get("schema_valid", True)

  # Assert required fields are present
  data = result["pages"][0]["result"]
  assert "facility_id" in data
  assert "energy_consumption" in data
  assert "billing_period" in data

  # Assert data types are correct
  assert isinstance(data["energy_consumption"]["kwh"], (int, float))
  assert data["energy_consumption"]["kwh"] >= 0

Common ESG Validation Patterns

Date Range Validation

Ensure start dates come before end dates:

{
  "reporting_period": {
    "type": "object",
    "properties": {
      "start_date": { "type": "string", "format": "date" },
      "end_date": { "type": "string", "format": "date" }
    },
    "required": ["start_date", "end_date"]
  }
}
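One caveat: in draft-07, format is an annotation that validators may ignore by default. The Python jsonschema library, for instance, only enforces date strings when you pass a FormatChecker:

```python
from jsonschema import validate, FormatChecker, ValidationError

schema = {
  "properties": {
    "start_date": {"type": "string", "format": "date"}
  }
}

# Without a format checker, a malformed date slips through
validate(instance={"start_date": "31.01.2024"}, schema=schema)

# With one, the same value is rejected
try:
  validate(
    instance={"start_date": "31.01.2024"},
    schema=schema,
    format_checker=FormatChecker()
  )
except ValidationError as err:
  print(err.message)
```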

Add application-layer validation for date logic:

from datetime import datetime

def validate_date_range(data: dict):
  """Ensure start_date <= end_date."""
  start = data.get("reporting_period", {}).get("start_date")
  end = data.get("reporting_period", {}).get("end_date")

  if start and end:
    if datetime.fromisoformat(start) > datetime.fromisoformat(end):
      raise ValueError(f"Start date ({start}) cannot be after end date ({end})")

Percentage Range Validation

Keep percentages within valid bounds:

{
  "renewable_percentage": {
    "type": "number",
    "minimum": 0,
    "maximum": 100
  }
}

Unit Consistency

Require exactly one unit type:

{
  "energy_consumption": {
    "type": "object",
    "oneOf": [{ "required": ["kwh"] }, { "required": ["mwh"] }],
    "properties": {
      "kwh": { "type": "number", "minimum": 0 },
      "mwh": { "type": "number", "minimum": 0 }
    }
  }
}

This enforces that either kWh OR MWh must be present, but not both or neither.
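You can verify this behavior directly; with the Python jsonschema library, for example:

```python
from jsonschema import Draft7Validator

schema = {
  "type": "object",
  "oneOf": [{"required": ["kwh"]}, {"required": ["mwh"]}],
  "properties": {
    "kwh": {"type": "number", "minimum": 0},
    "mwh": {"type": "number", "minimum": 0}
  }
}
v = Draft7Validator(schema)

assert not list(v.iter_errors({"kwh": 45230}))            # one unit: valid
assert not list(v.iter_errors({"mwh": 45.2}))             # one unit: valid
assert list(v.iter_errors({"kwh": 45230, "mwh": 45.2}))   # both: rejected
assert list(v.iter_errors({}))                            # neither: rejected
```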

Currency Validation

Accept only valid currency codes:

{
  "currency": {
    "type": "string",
    "pattern": "^[A-Z]{3}$",
    "enum": ["EUR", "GBP", "USD", "CHF", "SEK", "NOK", "DKK", "PLN"]
  }
}

Compliance Benefits

CSRD Readiness

Encode ESRS data requirements in your schema:

  • ESRS E1: Mandatory emissions fields (Scope 1, 2, 3)
  • ESRS E2: Pollution data with units (kg, tonnes, m³)
  • ESRS E3: Water consumption (m³) and withdrawal data
  • ESRS E5: Material flow data (tonnes, kg)

Audit Trail

Log every validation attempt for audit purposes:

{
  "validation_attempt": {
    "timestamp": "2024-02-01T14:23:11Z",
    "schema_version": "1.2.0",
    "result": "failed",
    "errors": [
      {
        "field": "renewable_percentage",
        "error": "maximum",
        "message": "Value 120 exceeds maximum 100",
        "value": 120
      }
    ]
  }
}
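A sketch that produces entries in this shape from a jsonschema validation run (the SCHEMA_VERSION constant and helper name are illustrative; err.validator is the failing keyword):

```python
from datetime import datetime, timezone

from jsonschema import Draft7Validator

SCHEMA_VERSION = "1.2.0"  # assumed: versioned alongside your schema files

def build_audit_entry(data: dict, schema: dict) -> dict:
  """Record one validation attempt in the audit-log shape shown above."""
  errors = [
    {
      "field": "/".join(str(p) for p in err.path) or "(root)",
      "error": err.validator,   # failing keyword, e.g. "maximum"
      "message": err.message,
      "value": err.instance
    }
    for err in Draft7Validator(schema).iter_errors(data)
  ]
  return {
    "validation_attempt": {
      "timestamp": datetime.now(timezone.utc).isoformat(),
      "schema_version": SCHEMA_VERSION,
      "result": "failed" if errors else "passed",
      "errors": errors
    }
  }
```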

Continuous Compliance

Run automated validation checks regularly:

# Run nightly validation on all ESG data
# (assumes an open DB cursor and a load_schema() helper are in scope)
import json

from jsonschema import validate, ValidationError

def nightly_validation_check():
  """Validate all ESG data in database."""
  cursor.execute("SELECT * FROM esg_data WHERE validation_status != 'validated'")

  for row in cursor.fetchall():
    data = json.loads(row["data"])
    schema = load_schema(row["data_type"])

    try:
      validate(instance=data, schema=schema)
      cursor.execute("UPDATE esg_data SET validation_status = 'validated' WHERE id = %s", (row["id"],))
    except ValidationError as e:
      cursor.execute(
        "UPDATE esg_data SET validation_status = 'failed', validation_error = %s WHERE id = %s",
        (str(e), row["id"])
      )

Conclusion

JSON Schema validation prevents data quality problems before they become compliance issues. By enforcing type safety, range constraints, and format consistency at extraction time, you:

  • Stop bad data from entering your systems
  • Catch errors before regulatory filings
  • Reduce manual review by 80%
  • Encode regulatory requirements in schema
  • Accelerate audits with pre-validated data

FIG 3.0 — The data quality funnel: unstructured input becomes compliance-ready output via schema validation

Your sustainability data is only as reliable as your validation. Build schemas that enforce quality from the start.


Next Steps:

Try LeapOCR on your own documents

Start with 100 free credits and see how your workflow holds up on real files.

Eligible paid plans include a 3-day trial with 100 credits after you add a credit card, so you can test actual PDFs, scans, and forms before committing to a rollout.
