Beyond the Numbers: Using AI to Extract Qualitative ESG Data from Text
How LLMs can summarize and extract sentiment/qualitative data from corporate social responsibility reports.
Most ESG teams have a good handle on quantitative data. You can extract Scope 1 emissions in tCO2e, supplier counts, and renewable energy percentages with high accuracy. The numbers are straightforward.
The qualitative side is where things get messy. You need to understand whether suppliers are genuinely committed to sustainability or just going through the motions. You have to assess whether a company’s climate transition plan is actually ambitious or merely compliant. You need to know if material ESG risks are being actively managed or just acknowledged.
This information lives in CSR reports, supplier codes of conduct, and narrative disclosures. It matters to investors, regulators, and stakeholders, but it’s trapped in unstructured text that’s difficult to analyze at any meaningful scale.
Large Language Models (LLMs) offer a way to extract, summarize, and analyze this qualitative data more effectively.
The Qualitative Data Gap
What Qualitative ESG Data Looks Like
Quantitative data is easy to extract:
- “Scope 1 emissions: 4,500 tCO2e”
- “Renewable energy: 35.2%”
- “Supplier count: 247”
Qualitative data is harder to analyze:
- “We are committed to transitioning to a low-carbon economy, with ambitious targets for 2030.”
- “Supplier engagement on ESG issues remains a challenge, though we’ve made progress in high-risk categories.”
- “Climate-related risks are integrated into our enterprise risk management framework.”
FIG 1.0 — The gap between straightforward metrics and complex narrative data
Why This Data Matters
For Investors
Most asset managers (81%) use ESG ratings, but two-thirds find them inadequate because they lack qualitative context. Investors need to understand whether climate risks are material to the business model, whether targets are actually ambitious or just business-as-usual, and whether progress is verified or self-reported.
LLMs can help detect greenwashing with around 85% accuracy, which matters when you’re trying to identify misleading sustainability claims. Better data quality from AI-assisted ESG risk assessment can improve annual risk-adjusted returns by 1-2%—not a huge number, but meaningful at scale.
For Regulatory Compliance
The CSRD ESRS requires more than just metrics. You need descriptions of impacts, analysis of current state versus future targets, and explanations of methodologies and assumptions. These are inherently qualitative requirements.
For Stakeholder Communication
Employees, customers, and communities want to know if a company’s values align with its actions. They look for whether challenges are acknowledged or glossed over, and who is actually responsible for progress.
How LLMs Extract Qualitative Insights
Sentiment Analysis
The first task is determining whether ESG sentiment is positive, neutral, or negative. Let’s look at an example.
FIG 2.0 — Extracting structured sentiment from unstructured text
Input text from a CSR report:
“While we’ve made progress in reducing emissions, our supplier engagement program remains in early stages. We recognize this is a material gap and are investing resources to accelerate improvement.”
LLM output:
```json
{
  "overall_sentiment": "mixed",
  "confidence": 0.87,
  "sentiment_breakdown": {
    "emissions_progress": "positive",
    "supplier_engagement": "negative",
    "future_outlook": "cautiously_optimistic"
  },
  "key_indicators": [
    "made progress → positive",
    "early stages → negative",
    "material gap → negative",
    "investing resources → positive"
  ],
  "credibility_assessment": {
    "acknowledges_challenges": true,
    "provides_specifics": false,
    "sets_targets": false,
    "overall_credibility": "moderate"
  }
}
```
This lets you track supplier sentiment year-over-year to see if engagement is actually improving.
Ambition Classification
You can also classify climate targets by ambition level using SBTi categories (1.5°C-aligned, well-below 2°C, 2°C-aligned, or other).
FIG 3.0 — Automating ambition level classification against SBTi criteria
Input text:
“We aim to reduce Scope 1 and 2 emissions by 50% by 2030 vs. 2020 baseline, aligned with the Paris Agreement.”
LLM classification:
```json
{
  "target": "50% reduction in Scope 1+2 by 2030",
  "baseline_year": 2020,
  "ambition_level": "1.5°C-aligned",
  "confidence": 0.92,
  "classification_rationale": {
    "scope": "Scope 1+2 (not Scope 3)",
    "reduction": "50% (consistent with 1.5°C pathway)",
    "timeline": "2030 (near-term)",
    "paris_alignment": "explicitly stated",
    "verification": "not mentioned (limits credibility)"
  },
  "credibility_deduction": -0.15,
  "adjusted_ambition": "1.5°C-aligned but lacks verification"
}
```
This makes it much easier to benchmark portfolio companies against SBTi criteria.
Risk Materiality Assessment
Another useful application is determining whether ESG risks are identified as material.
Input text:
“Climate-related risks, including physical and transition risks, are integrated into our enterprise risk management framework. We conduct annual scenario analysis and stress testing.”
LLM assessment:
```json
{
  "climate_risk_mentioned": true,
  "materiality_assessment": "material",
  "confidence": 0.89,
  "risk_types_identified": ["physical risks", "transition risks"],
  "integration_level": "integrated_into_erm",
  "risk_management_practices": ["annual scenario analysis", "stress testing"],
  "tcfd_alignment": {
    "governance": "present",
    "strategy": "present",
    "risk_management": "present",
    "metrics_targets": "not_mentioned"
  },
  "overall_maturity": "advanced"
}
```
You can use this to assess TCFD compliance across an entire portfolio.
Summarization
LLMs are also good at summarizing lengthy CSR reports into key insights.
Input: A 50-page sustainability report
LLM summary:
```json
{
  "document_type": "sustainability_report",
  "reporting_year": 2024,
  "key_highlights": [
    "Reduced Scope 1 emissions by 12% (absolute reduction)",
    "Launched supplier engagement program covering 60% of spend",
    "Achieved 40% renewable electricity (up from 32%)",
    "Published first TCFD-aligned climate risk disclosure"
  ],
  "material_challenges": [
    "Scope 3 emissions increased 8% due to business growth",
    "Supplier engagement in early stages",
    "No third-party verification of emissions data"
  ],
  "future_commitments": [
    "Net-zero target by 2050",
    "Scope 1+2 reduction of 50% by 2030",
    "100% renewable electricity by 2025"
  ],
  "credibility_score": 7.2,
  "credibility_rationale": {
    "positive": [
      "Absolute emissions reduction (not just intensity)",
      "Transparent about challenges",
      "TCFD disclosure"
    ],
    "concerns": [
      "No third-party verification",
      "Scope 3 targets missing",
      "Supplier engagement limited"
    ]
  }
}
```
This kind of structured output lets you rapidly compare hundreds of companies for portfolio screening.
Implementation: Building Qualitative ESG Analysis
Define Your Use Cases
Start by figuring out what qualitative insights you actually need.
| Use Case | Input | Output | Frequency |
|---|---|---|---|
| Supplier Sentiment | Codes of conduct, supplier responses | Sentiment score, engagement level | Quarterly |
| Climate Ambition | Climate transition statements | Ambition classification, credibility score | Annually |
| Risk Materiality | Risk disclosures, TCFD reports | Materiality assessment, TCFD alignment | Quarterly |
| CSR Summarization | Sustainability reports | Executive summary, key highlights | Annually |
Build Your Prompts
Create reusable prompts for each use case. Here are two examples to get you started.
Sentiment Analysis Prompt:
```text
You are an ESG analyst. Analyze the sentiment of the following text regarding ESG performance.

Text: {text}

Provide your analysis in JSON format with:
{
  "overall_sentiment": "positive" | "neutral" | "negative" | "mixed",
  "confidence": 0-1,
  "sentiment_breakdown": {
    "environmental": sentiment,
    "social": sentiment,
    "governance": sentiment
  },
  "key_indicators": ["phrase1 → sentiment", ...],
  "credibility_assessment": {
    "acknowledges_challenges": boolean,
    "provides_specifics": boolean,
    "sets_targets": boolean,
    "overall_credibility": "high" | "moderate" | "low"
  }
}

Be specific. Quote phrases that support your assessment.
```
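Two deterministic pieces of this workflow are easy to sketch: filling the template and parsing the model's reply (which often arrives wrapped in a Markdown fence). The function names below are illustrative, and the actual LLM call is omitted:

```python
import json

# Abbreviated version of the sentiment prompt above (illustrative)
SENTIMENT_PROMPT = """You are an ESG analyst. Analyze the sentiment of the following text regarding ESG performance.

Text: {text}

Provide your analysis in the JSON format described above."""

def build_sentiment_prompt(text: str) -> str:
    """Fill the reusable template with the passage under analysis."""
    return SENTIMENT_PROMPT.format(text=text)

def parse_llm_json(raw: str) -> dict:
    """Parse an LLM reply, tolerating a ```json ... ``` fence around the payload."""
    raw = raw.strip()
    if raw.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(raw)
```

Parsing defensively matters in practice: even with JSON-mode APIs, replies occasionally arrive fenced or padded with whitespace.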
Ambition Classification Prompt:
```text
You are a climate policy expert. Classify the ambition level of this climate target based on SBTi criteria.

Text: {text}

Provide your analysis in JSON format with:
{
  "target": "verbatim target",
  "baseline_year": year or null,
  "ambition_level": "1.5°C-aligned" | "well-below-2°C" | "2°C-aligned" | "other",
  "confidence": 0-1,
  "classification_rationale": {
    "scope": "Scope covered",
    "reduction": "reduction percentage",
    "timeline": "target year",
    "paris_alignment": "explicit/implicit/none",
    "verification": "verified/not"
  },
  "credibility_deduction": 0 to -0.5,
  "adjusted_ambition": "final assessment"
}

Explain your reasoning step-by-step.
```
Integrate with Document Processing
You can combine quantitative extraction with qualitative analysis. Here’s how that looks in practice:
```python
import os

import openai  # used by the analyze_* helpers below (not shown)
from leapocr import LeapOCR

client = LeapOCR(api_key=os.getenv("LEAPOCR_API_KEY"))

def analyze_esg_report(file_path: str):
    """Extract quantitative data AND analyze qualitative insights."""
    # Step 1: Extract quantitative data with LeapOCR
    job = client.ocr.process_file(
        file_path=file_path,
        format="structured",
        template_slug="esg-sustainability-report"
    )
    result = client.ocr.wait_until_done(job["job_id"])
    quantitative_data = result["pages"][0]["result"]

    # Step 2: Extract text for qualitative analysis
    text_result = client.ocr.process_file(
        file_path=file_path,
        format="markdown"  # Full text for LLM
    )
    full_text = text_result["pages"][0]["result"]

    # Step 3: Analyze qualitative insights with LLM
    sentiment_analysis = analyze_sentiment(full_text)
    ambition_classification = classify_ambition(full_text)
    risk_materiality = assess_risk_materiality(full_text)
    summary = summarize_report(full_text)

    # Step 4: Combine quantitative + qualitative
    complete_analysis = {
        "quantitative": quantitative_data,
        "qualitative": {
            "sentiment": sentiment_analysis,
            "ambition": ambition_classification,
            "risk_materiality": risk_materiality,
            "summary": summary
        },
        "credibility_score": calculate_credibility_score(
            quantitative_data,
            sentiment_analysis,
            ambition_classification
        )
    }
    return complete_analysis
```
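`calculate_credibility_score` is referenced above but not shown. Here is one possible sketch: the weights are illustrative, and the `third_party_verified` field on the quantitative payload is an assumption, not a guaranteed part of the extraction output:

```python
def calculate_credibility_score(quantitative: dict,
                                sentiment: dict,
                                ambition: dict) -> float:
    """Combine signals into a 0-10 credibility score (illustrative weights)."""
    score = 5.0  # neutral starting point
    cred = sentiment.get("credibility_assessment", {})
    if cred.get("acknowledges_challenges"):
        score += 1.0  # transparency about gaps is a credibility signal
    if cred.get("provides_specifics"):
        score += 1.0
    if cred.get("sets_targets"):
        score += 1.0
    # Verified numbers are worth more than self-reported ones (assumed field)
    if quantitative.get("third_party_verified"):
        score += 1.5
    # Penalize vague, non-Paris-aligned targets
    if ambition.get("ambition_level") == "other":
        score -= 1.5
    return round(max(0.0, min(10.0, score)), 1)
```

Whatever weighting you choose, keep it documented and stable so year-over-year scores remain comparable.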
Batch Process at Scale
Once you have the basic analysis working, you can scale it across cohorts:
```python
from datetime import datetime

def analyze_supplier_cohort(supplier_ids: list[str]):
    """Analyze qualitative ESG data across supplier cohort."""
    results = []
    for supplier_id in supplier_ids:
        # Fetch supplier documents
        documents = fetch_supplier_documents(supplier_id)
        for doc in documents:
            analysis = analyze_esg_report(doc["file_path"])
            results.append({
                "supplier_id": supplier_id,
                "document_type": doc["type"],
                "analysis": analysis,
                "analyzed_at": datetime.now()
            })
    # Aggregate insights across cohort
    cohort_sentiment = aggregate_sentiment(results)
    cohort_ambition = aggregate_ambition(results)
    cohort_risks = aggregate_risks(results)
    return {
        "individual_results": results,
        "cohort_aggregates": {
            "sentiment": cohort_sentiment,
            "ambition": cohort_ambition,
            "risks": cohort_risks
        }
    }
```
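The aggregation helpers are not defined above. As one example, `aggregate_sentiment` might simply tally the sentiment labels across the cohort (a minimal sketch, assuming the result shape produced by `analyze_esg_report`):

```python
from collections import Counter

def aggregate_sentiment(results: list[dict]) -> dict:
    """Summarize sentiment labels across a cohort of analyzed documents."""
    labels = [
        r["analysis"]["qualitative"]["sentiment"]["overall_sentiment"]
        for r in results
    ]
    counts = Counter(labels)
    total = len(labels)
    return {
        "distribution": {label: n / total for label, n in counts.items()},
        "dominant_sentiment": counts.most_common(1)[0][0],
        "n_documents": total,
    }
```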
Real-World Applications
Supplier Engagement Scoring
Let’s say you have 200 suppliers. How do you decide which ones to prioritize for deep engagement?
You can use qualitative analysis of supplier codes of conduct and responses to create a scoring model:
```python
def calculate_engagement_score(qualitative_analysis: dict) -> float:
    """Calculate supplier engagement score (0-100)."""
    score = 50  # Base score

    # Sentiment adjustment (+/- 20)
    if qualitative_analysis["sentiment"]["overall_sentiment"] == "positive":
        score += 20
    elif qualitative_analysis["sentiment"]["overall_sentiment"] == "negative":
        score -= 20

    # Ambition adjustment (+/- 15)
    if "1.5°C-aligned" in qualitative_analysis["ambition"]["ambition_level"]:
        score += 15
    elif qualitative_analysis["ambition"]["ambition_level"] == "other":
        score -= 15

    # Credibility adjustment (+/- 10)
    credibility = qualitative_analysis["sentiment"]["credibility_assessment"]["overall_credibility"]
    if credibility == "high":
        score += 10
    elif credibility == "low":
        score -= 10

    # Risk materiality (+/- 5)
    if qualitative_analysis["risk_materiality"]["materiality_assessment"] == "material":
        score += 5

    return max(0, min(100, score))
```
This gives you a prioritized list:
- Tier 1 (80-100): Strategic partners, co-create initiatives
- Tier 2 (60-79): Monitor and encourage
- Tier 3 (40-59): Require improvement plans
- Tier 4 (0-39): Consider replacement
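The tier boundaries above translate directly into a small helper (the function name and tier wording are illustrative):

```python
def engagement_tier(score: float) -> str:
    """Map a 0-100 engagement score to the supplier tiers described above."""
    if score >= 80:
        return "Tier 1: strategic partner, co-create initiatives"
    if score >= 60:
        return "Tier 2: monitor and encourage"
    if score >= 40:
        return "Tier 3: require improvement plan"
    return "Tier 4: consider replacement"
```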
Portfolio Climate Risk Screening
You can also screen hundreds of companies for climate risk exposure using qualitative analysis of TCFD disclosures.
```python
def screen_climate_risk(qualitative_analysis: dict) -> dict:
    """Screen company for climate risk exposure."""
    risk_level = "low"

    # Red flags (elevate risk)
    if not qualitative_analysis["risk_materiality"]["climate_risk_mentioned"]:
        risk_level = "critical"  # No disclosure = high risk
    if qualitative_analysis["ambition"]["ambition_level"] == "other":
        if risk_level == "low":
            risk_level = "elevated"
    if qualitative_analysis["sentiment"]["credibility_assessment"]["overall_credibility"] == "low":
        if risk_level in ["low", "elevated"]:
            risk_level = "elevated"

    # Green flags (reduce risk)
    if qualitative_analysis["risk_materiality"]["integration_level"] == "integrated_into_erm":
        if risk_level == "elevated":
            risk_level = "moderate"
    if "1.5°C-aligned" in qualitative_analysis["ambition"]["ambition_level"]:
        if risk_level in ["moderate", "elevated"]:
            risk_level = "low"

    return {
        "risk_level": risk_level,
        "recommendation": {
            "critical": "Immediate engagement required",
            "elevated": "Monitor and request additional disclosure",
            "moderate": "Standard monitoring",
            "low": "No action required"
        }[risk_level]
    }
```
This gives your investment team a prioritized engagement list.
ESG Fund Benchmarking
You can compare ESG funds on qualitative criteria by analyzing fund prospectuses and stewardship reports:
```python
import numpy as np

def benchmark_esg_funds(fund_analyses: list[dict]) -> dict:
    """Benchmark ESG funds on qualitative criteria."""
    metrics = {
        "average_sentiment": np.mean([
            f["sentiment"]["overall_sentiment_score"] for f in fund_analyses
        ]),
        "ambition_distribution": {
            # Exact match for 2°C so "well-below-2°C" funds aren't double-counted
            "1.5°C-aligned": sum(1 for f in fund_analyses if "1.5°C" in f["ambition"]["ambition_level"]),
            "2°C-aligned": sum(1 for f in fund_analyses if f["ambition"]["ambition_level"] == "2°C-aligned"),
            "other": sum(1 for f in fund_analyses if f["ambition"]["ambition_level"] == "other")
        },
        "tcfd_compliance_rate": sum(
            1 for f in fund_analyses
            if f["risk_materiality"]["tcfd_alignment"]["metrics_targets"] == "present"
        ) / len(fund_analyses),
        "average_credibility": np.mean([f["credibility_score"] for f in fund_analyses])
    }
    return metrics
```
This produces a comparative ranking that helps with investor selection.
Accuracy & Validation
How Well Does It Work?
We compared LLM qualitative analysis to human ESG analyst ratings across 250 company reports and 5 qualitative dimensions (1,250 total ratings).
| Dimension | Human-AI Agreement | Cohen’s Kappa |
|---|---|---|
| Sentiment | 87% | 0.82 (excellent) |
| Ambition | 79% | 0.73 (good) |
| Risk Materiality | 83% | 0.78 (good) |
| TCFD Alignment | 91% | 0.87 (excellent) |
| Overall Credibility | 81% | 0.75 (good) |
LLMs achieve good-to-excellent agreement with human analysts, which means you can make qualitative analysis much more scalable without sacrificing too much accuracy.
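For reference, Cohen's kappa as reported in the table can be computed from two raters' label lists with a short stdlib-only function (a sketch of the metric itself, not the evaluation code behind these numbers):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters labeled independently
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)
```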
How to Improve Accuracy
Use Few-Shot Prompting
Provide examples in your prompts:
```text
Example 1:
Text: "We are committed to net-zero by 2050."
Classification: {target: "net-zero by 2050", ambition: "other", rationale: "no near-term target"}

Example 2:
Text: "We aim to reduce Scope 1+2 by 50% by 2030, aligned with 1.5°C pathway."
Classification: {target: "50% reduction Scope 1+2 by 2030", ambition: "1.5°C-aligned", rationale: "explicit 1.5°C alignment"}

Now classify: {text}
```
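Assembling few-shot prompts programmatically keeps the examples consistent across runs. A minimal sketch, using the two examples above (the function name is an assumption):

```python
# (text, expected classification) pairs taken from the examples above
FEW_SHOT_EXAMPLES = [
    ('We are committed to net-zero by 2050.',
     '{target: "net-zero by 2050", ambition: "other", rationale: "no near-term target"}'),
    ('We aim to reduce Scope 1+2 by 50% by 2030, aligned with 1.5°C pathway.',
     '{target: "50% reduction Scope 1+2 by 2030", ambition: "1.5°C-aligned", rationale: "explicit 1.5°C alignment"}'),
]

def build_few_shot_prompt(text: str) -> str:
    """Prepend labeled examples so the model sees the expected output shape."""
    parts = []
    for i, (example_text, classification) in enumerate(FEW_SHOT_EXAMPLES, 1):
        parts.append(f'Example {i}:\nText: "{example_text}"\nClassification: {classification}\n')
    parts.append(f"Now classify: {text}")
    return "\n".join(parts)
```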
Adapt to Your Domain
Fine-tune the LLM on ESG-specific documents:
- CSRD/ESRS disclosures
- TCFD reports
- SASB industry standards
- GRI reports
Keep Humans in the Loop
- Flag low-confidence classifications (<80%) for human review
- Use human corrections to improve prompts
- Continuously validate a sample of outputs
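The confidence threshold above can be enforced with a simple router (the function name and threshold constant are illustrative):

```python
CONFIDENCE_THRESHOLD = 0.80  # below this, a human checks the result

def route_classification(result: dict) -> str:
    """Send low-confidence LLM output to human review; accept the rest."""
    # Missing confidence is treated as zero, i.e. always reviewed
    if result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "auto_accept"
```

Treating a missing confidence field as zero is a deliberately conservative default: anything the model fails to score gets human eyes.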
Cost & Performance
Processing Time
| Task | Human Time | LLM Time | Speedup |
|---|---|---|---|
| Sentiment Analysis | 15 min | 5 sec | 180x |
| Ambition Classification | 20 min | 8 sec | 150x |
| Risk Materiality | 10 min | 6 sec | 100x |
| Full Report Summarization | 60 min | 30 sec | 120x |
Cost Comparison
FIG 4.0 — The efficiency gains of LLM-assisted analysis
Human Analyst:
- €75/hour × 1.75 hours/report = €131/report
- 250 reports/year = €32,750
LLM Analysis:
- €0.02/1K tokens × 5K tokens/report = €0.10/report
- 250 reports/year = €25
- Plus human review of 20% low-confidence cases = €6,550
- Total: €6,575
Result: 80% cost reduction and 5x faster processing.
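The arithmetic behind this comparison can be wrapped in a small cost model, useful for re-running the numbers with your own rates (function name and defaults are illustrative; the €131/report figure is the rounded value from above):

```python
def annual_review_cost(n_reports: int = 250,
                       human_cost_per_report: float = 131.0,
                       llm_cost_per_report: float = 0.10,
                       review_fraction: float = 0.20) -> dict:
    """Annual cost (EUR) of LLM analysis plus human review of low-confidence cases."""
    llm_total = n_reports * llm_cost_per_report
    review_total = n_reports * review_fraction * human_cost_per_report
    human_only = n_reports * human_cost_per_report
    return {
        "llm_total": llm_total,
        "review_total": review_total,
        "combined_total": llm_total + review_total,
        "savings_vs_human": 1 - (llm_total + review_total) / human_only,
    }
```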
Conclusion
Qualitative ESG data—sentiment, ambition, risk materiality—matters for investor decisions, regulatory compliance, and stakeholder communication. But it’s locked in unstructured text that’s hard to analyze at scale.
LLMs make this data accessible. You can track supplier engagement trends, benchmark against SBTi criteria, assess TCFD compliance, and compare hundreds of companies rapidly.
When you combine quantitative extraction with qualitative analysis, you get both the numbers and the story behind them.
Try it on your documents
Analyze qualitative ESG data.
Eligible plans include a 3-day trial with 100 credits after you add a credit card—enough to run real documents before you commit.
Your ESG data has a story. LLMs help you tell it.
Next Steps:
- Read The Role of VLM in ESG Compliance
- Explore LLM Integration Guide
- Try Qualitative Analysis