Your management team spends hours each month manually extracting action items from activity reports. "Bug fix", "version 1.0 deployed", "payment module overhaul". These repetitive, error-prone, time-consuming tasks are exactly the type of work AI can automate.
But using ChatGPT or Claude to process these internal reports poses a major problem: you expose your activity data to external servers. For a privacy-conscious company, that's unacceptable.
The solution: a local LLM agent that runs entirely on your servers. No data leaves your infrastructure. This article guides you step by step through building such a system, with optional Jira integration for end-to-end automated task management.
Why a local LLM instead of the cloud
Data security
Monthly reports contain sensitive information: client names, contract amounts, security bugs, strategic decisions. Sending this data to OpenAI or Anthropic, even through their APIs, means transferring data outside your control.
With a local LLM like Llama 3, Mistral, or Phi-3, your data stays on your infrastructure. You can even run the model on an air-gapped machine with no internet connection.
According to a 2025 Gartner survey, 67% of enterprises cite data privacy as the primary barrier to AI adoption. Local LLMs directly address this concern by keeping all processing within your security perimeter.
Predictable costs
Cloud APIs charge per token. For a company processing 50 monthly reports of 2000 words each, that represents roughly 130,000 to 150,000 input tokens per month, assuming about 1.3 to 1.5 tokens per word. At $3 per million tokens, the cost stays modest: well under a dollar a month. But if you scale to 500 reports, run several passes per report, or add more complex analyses, the bill grows in step with volume.
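The arithmetic behind that estimate can be sketched in a few lines. The tokens-per-word ratio is an assumption (a common rule of thumb for English text, not a figure measured on your reports), and the function name is purely illustrative.

```python
def estimate_monthly_input_cost(
    reports_per_month: int,
    words_per_report: int,
    tokens_per_word: float = 1.3,         # assumption: rough average for English text
    usd_per_million_tokens: float = 3.0,  # the per-token price quoted above
    passes: int = 1,                      # e.g. 3 if each report is analyzed three times
) -> tuple[int, float]:
    """Return (input tokens per month, input cost in USD)."""
    tokens = round(reports_per_month * words_per_report * tokens_per_word * passes)
    cost = tokens / 1_000_000 * usd_per_million_tokens
    return tokens, cost


tokens, cost = estimate_monthly_input_cost(50, 2000)
print(f"{tokens:,} input tokens, about ${cost:.2f} per month")
```

Changing `reports_per_month`, `passes`, or the price shows how quickly the figure moves. Output tokens, which are usually billed at a higher rate, are left out of this sketch.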
A local LLM has a fixed cost: hardware and electricity. Once infrastructure is in place, the marginal cost per report is near zero. For organizations with consistent AI workloads, this predictability simplifies budgeting and eliminates unexpected overages.
Latency and availability
No dependency on external API availability. No rate limiting. No performance degradation during peak hours. Your system works even if OpenAI has an outage.
For time-sensitive workflows, this reliability is crucial. When your month-end reporting process depends on AI extraction, you cannot afford to wait for an API that's throttling requests or experiencing downtime.
System architecture
The architecture breaks down into four components:
1. LLM inference server
The heart of the system is a server that hosts and runs the language model. Popular options:
- Ollama: simplest to deploy, CLI and REST API interface
- vLLM: optimized for production, better throughput
- llama.cpp: lightest weight, runs on CPU alone if needed
For SME deployment, Ollama is recommended. Installation on Ubuntu:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
The Llama 3.1 8B model offers an excellent balance between performance and resources. It runs comfortably on a machine with 16 GB RAM and a GPU with 8 GB VRAM.
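A rough way to check whether a model fits in VRAM is to multiply its parameter count by the bytes used per weight, then add some headroom. The sketch below assumes a 4-bit quantized build and a flat 20% overhead for the KV cache and runtime buffers. Both numbers are illustrative assumptions; real usage depends on the quantization, context length, and inference server.

```python
def estimate_model_memory_gb(
    params_billion: float,
    bits_per_weight: float = 4.0,  # assumption: 4-bit quantized weights
    overhead: float = 0.2,         # assumption: 20% extra for KV cache and buffers
) -> float:
    """Very rough memory footprint of a model, in gigabytes."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9


print(f"8B at 4-bit:  ~{estimate_model_memory_gb(8):.1f} GB")
print(f"8B at 16-bit: ~{estimate_model_memory_gb(8, bits_per_weight=16):.1f} GB")
```

Under these assumptions an 8B model quantized to 4 bits fits within 8 GB of VRAM, while the same model at 16-bit precision does not.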
2. Text extraction module
Reports often arrive as PDF or Word files. You need an extractor that converts these formats to plain text:
- pypdf (the maintained successor to PyPDF2) or pdfplumber for PDFs
- python-docx for Word files
- Unstructured for unified multi-format processing
Extraction quality directly affects LLM output quality. Of the PDF libraries, pdfplumber generally handles tables and complex layouts better than PyPDF2, which makes it the preferred choice for structured reports.
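Whichever library you pick, raw extracted text often carries layout noise such as words hyphenated across line breaks, runs of spaces, and stacks of blank lines. Below is a minimal cleanup sketch using only the standard library. `clean_extracted_text` is a hypothetical helper, not part of pdfplumber or python-docx, and its rules are assumptions to tune for your own reports.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Normalize whitespace and rejoin words split across line breaks."""
    # Rejoin "deploy-\nment" into "deployment". Caveat: this also merges
    # genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing spaces on every line
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Payment  module over-\nhaul\n\n\n\nDone "
print(clean_extracted_text(sample))
```

Cleaner input means fewer tokens to process and fewer broken words for the model to misread as separate items.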
3. Task extraction agent
This is the intelligent component that uses the LLM to identify and structure action items. The agent receives the report text and produces a structured list of tasks with:
- Task description
- Type (bug fix, feature, documentation, etc.)
- Estimated priority
- Assigned person (if mentioned)
The prompt engineering here is critical. A well-crafted prompt with clear output format specifications ensures consistent, parseable results across different report styles.
4. Jira integration (optional)
To fully automate the workflow, the agent can create Jira tickets directly via API. Each extracted task becomes a ticket with appropriate fields.
This integration closes the loop between reporting and project management. Tasks identified in reports automatically enter your backlog, ensuring nothing falls through the cracks.
Step-by-step implementation
Step 1: Python environment
Create a virtual environment and install dependencies:
python -m venv llm-agent
source llm-agent/bin/activate
pip install ollama pdfplumber python-docx jira pydantic
Step 2: Text extraction
import pdfplumber
from docx import Document
from pathlib import Path

def extract_text(file_path: str) -> str:
    """Return the plain text of a PDF, .docx, .txt, or .md report."""
    path = Path(file_path)
    suffix = path.suffix.lower()
    if suffix == '.pdf':
        with pdfplumber.open(path) as pdf:
            # extract_text() can return None for pages without a text layer
            return '\n'.join(page.extract_text() or '' for page in pdf.pages)
    if suffix == '.docx':
        # python-docx reads only .docx; convert legacy .doc files first
        doc = Document(str(path))
        return '\n'.join(para.text for para in doc.paragraphs)
    if suffix in ('.txt', '.md'):
        return path.read_text(encoding='utf-8')
    raise ValueError(f"Unsupported format: {path.suffix}")

This function handles the three most common report formats: PDF, Word (.docx), and plain text. Legacy .doc files cannot be read by python-docx and must be converted to .docx first. For organizations with more exotic formats, the Unstructured library provides broader coverage.
Step 3: Extraction prompt
The prompt is crucial for result quality. Here's a tested and optimized template:
EXTRACTION_PROMPT = """You are an assistant specialized in analyzing activity reports.
Analyze the following report and extract all tasks completed or mentioned.
For each task, provide:
- description: a concise description of the task
- type: bug_fix, feature, documentation, refactoring, deployment, meeting, other
- status: done, in_progress, planned
- assignee: the person's name if mentioned, otherwise null
Respond ONLY with valid JSON, no text before or after.
Expected format:
{
"tasks": [
{"description": "...", "type": "...", "status": "...", "assignee": "..."}
]
}
REPORT:
{report_text}
"""
The explicit JSON format requirement and enumerated type values ensure consistent, machine-parseable output. Without these constraints, LLMs tend to produce varied formats that break downstream processing.
Step 4: Local LLM call
import json
from typing import Optional

import ollama
from pydantic import BaseModel

class Task(BaseModel):
    description: str
    type: str
    status: str
    assignee: Optional[str] = None

class ExtractionResult(BaseModel):
    tasks: list[Task]

def extract_tasks(report_text: str) -> ExtractionResult:
    # str.format() would fail on the literal JSON braces in the template,
    # so substitute the placeholder directly
    prompt = EXTRACTION_PROMPT.replace('{report_text}', report_text)
    response = ollama.chat(
        model='llama3.1:8b',
        messages=[{'role': 'user', 'content': prompt}],
        options={'temperature': 0.1},  # Low temperature for consistency
    )
    content = response['message']['content']
    # Strip the Markdown code fence if the model added one
    if '```json' in content:
        content = content.split('```json')[1].split('```')[0]
    data = json.loads(content)
    return ExtractionResult(**data)
The low temperature setting (0.1) produces more deterministic outputs, reducing variability between runs. Pydantic validation ensures the response matches the expected schema.
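Two failure modes are worth guarding against in practice: the model wrapping its JSON in extra prose, and values outside the enumerated lists (a plain `str` field accepts anything). The sketch below uses only the standard library. `parse_llm_json` and `normalize_task` are hypothetical helpers, and the fallback values ("other", "planned") are assumptions you may want to change.

```python
import json

ALLOWED_TYPES = {"bug_fix", "feature", "documentation", "refactoring",
                 "deployment", "meeting", "other"}
ALLOWED_STATUSES = {"done", "in_progress", "planned"}


def parse_llm_json(content: str) -> dict:
    """Parse the first JSON object found in a model response."""
    start = content.find("{")
    if start == -1:
        raise ValueError("No JSON object found in model response")
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever follows it (closing fence, trailing commentary, ...)
    obj, _ = json.JSONDecoder().raw_decode(content, start)
    return obj


def normalize_task(task: dict) -> dict:
    """Coerce out-of-vocabulary values to fallbacks instead of failing."""
    task_type = task.get("type")
    status = task.get("status")
    return {
        "description": str(task.get("description", "")).strip(),
        "type": task_type if isinstance(task_type, str) and task_type in ALLOWED_TYPES else "other",
        "status": status if isinstance(status, str) and status in ALLOWED_STATUSES else "planned",
        "assignee": task.get("assignee") or None,
    }


raw = ('Here is the result:\n```json\n{"tasks": [{"description": "Fix login bug", '
       '"type": "bugfix", "status": "done", "assignee": null}]}\n```')
tasks = [normalize_task(t) for t in parse_llm_json(raw)["tasks"]]
print(tasks)
```

This approach still fails if the prose before the JSON contains a stray opening brace. In that case, logging the raw response and retrying is a reasonable fallback.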
Step 5: Jira integration
from jira import JIRA

def create_jira_tickets(tasks: list[Task], project_key: str, jira_client: JIRA):
    created_tickets = []
    # Map the LLM's task types to Jira issue types; unknown types fall back to 'Task'
    type_mapping = {
        'bug_fix': 'Bug',
        'feature': 'Story',
        'documentation': 'Task',
        'refactoring': 'Task',
        'deployment': 'Task',
        'other': 'Task'
    }
    for task in tasks:
        if task.status == 'done':
            continue  # Don't create tickets for completed tasks
        issue_dict = {
            'project': {'key': project_key},
            'summary': task.description[:255],  # Jira limit
            'issuetype': {'name': type_mapping.get(task.type, 'Task')},
            'description': f"Task automatically extracted from monthly report.\n\nType: {task.type}\nOriginal status: {task.status}"
        }
        if task.assignee:
            # Jira Cloud identifies users by accountId; take the first match
            users = jira_client.search_users(query=task.assignee)
            if users:
                issue_dict['assignee'] = {'accountId': users[0].accountId}
        new_issue = jira_client.create_issue(fields=issue_dict)
        created_tickets.append(new_issue.key)
    return created_tickets
The type mapping translates LLM output to Jira issue types. Customize this mapping to match your project's issue type scheme.
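Before letting the agent write to a live Jira project, it can help to build the issue payloads without sending them, so a person can review what would be created. The sketch below mirrors the fields used above and works on plain dictionaries. `build_issue_fields` and the "OPS" project key are hypothetical names for illustration.

```python
from typing import Optional

TYPE_MAPPING = {"bug_fix": "Bug", "feature": "Story"}
DEFAULT_ISSUE_TYPE = "Task"  # everything else, including unknown types


def build_issue_fields(task: dict, project_key: str) -> Optional[dict]:
    """Return the Jira fields for a task, or None if no ticket is needed."""
    if task["status"] == "done":
        return None  # completed work does not need a ticket
    return {
        "project": {"key": project_key},
        "summary": task["description"][:255],  # Jira summary length limit
        "issuetype": {"name": TYPE_MAPPING.get(task["type"], DEFAULT_ISSUE_TYPE)},
        "description": (
            "Task automatically extracted from monthly report.\n\n"
            f"Type: {task['type']}\nOriginal status: {task['status']}"
        ),
    }


tasks = [
    {"description": "Fix timeout in payment module", "type": "bug_fix", "status": "planned"},
    {"description": "Version 1.0 deployed", "type": "deployment", "status": "done"},
]
payloads = [p for p in (build_issue_fields(t, "OPS") for t in tasks) if p is not None]
for payload in payloads:
    print(payload["issuetype"]["name"], "-", payload["summary"])
```

Keeping payload construction separate from the API call also makes the mapping easy to unit test.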
Step 6: Main script
import os
from pathlib import Path
from typing import Optional

from jira import JIRA

def process_monthly_reports(reports_folder: str, jira_project: Optional[str] = None):
    # Connect to Jira only when a project key and a server URL are both available
    jira_client = None
    if jira_project and os.getenv('JIRA_URL'):
        jira_client = JIRA(
            server=os.getenv('JIRA_URL'),
            basic_auth=(os.getenv('JIRA_EMAIL'), os.getenv('JIRA_TOKEN'))
        )
    results = []
    for file_path in Path(reports_folder).glob('*'):
        # Skip formats the extractor cannot read (python-docx does not open legacy .doc)
        if file_path.suffix.lower() not in ['.pdf', '.docx', '.txt', '.md']:
            continue
        print(f"Processing {file_path.name}...")
        text = extract_text(str(file_path))
        extraction = extract_tasks(text)
        tickets = []
        if jira_client and jira_project:
            tickets = create_jira_tickets(extraction.tasks, jira_project, jira_client)
        results.append({
            'file': file_path.name,
            'tasks_found': len(extraction.tasks),
            'tickets_created': tickets
        })
    return results
Optimizations and best practices
Handling hallucinations
LLMs can invent tasks that don't exist in the report. To mitigate this risk:
- Low temperature: use temperature=0.1 for more deterministic responses
- Cross-validation: analyze the same report 3 times and only keep tasks present in at least 2 results
- Human review: display extracted tasks for validation before ticket creation
The cross-validation approach increases accuracy from roughly 87% to over 95% in our testing, at the cost of 3x processing time.
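The voting step can be sketched as follows. This version matches tasks by exact description after lowercasing and collapsing whitespace, which is a simplifying assumption: models often rephrase the same task between runs, so a real deployment would probably need fuzzy matching. `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def normalize(description: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(description.lower().split())


def majority_vote(runs: list[list[dict]], min_votes: int = 2) -> list[dict]:
    """Keep tasks whose description appears in at least `min_votes` runs."""
    votes = Counter()
    first_seen = {}
    for tasks in runs:
        keys_in_run = set()  # a task repeated within one run counts once
        for task in tasks:
            key = normalize(task["description"])
            if key not in keys_in_run:
                keys_in_run.add(key)
                votes[key] += 1
            first_seen.setdefault(key, task)
    return [task for key, task in first_seen.items() if votes[key] >= min_votes]


runs = [
    [{"description": "Fix login bug"}, {"description": "Deploy version 1.0"}],
    [{"description": "fix login  bug"}, {"description": "Migrate database"}],
    [{"description": "Fix login bug"}, {"description": "Deploy version 1.0"}],
]
kept = majority_vote(runs)
print([task["description"] for task in kept])
```

Here "Migrate database" appears in only one of the three runs and is dropped, while the other two tasks are kept.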
Scaling up
For processing large volumes:
- Use vLLM instead of Ollama for better throughput
- Parallelize report processing with asyncio or multiprocessing
- Consider a GPU cluster if processing more than 1000 reports per day
vLLM can achieve 10 to 20x higher throughput than Ollama for batch processing, making it essential for enterprise-scale deployments.
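As one possible shape for the parallel step, the sketch below uses a thread pool from `concurrent.futures` rather than asyncio or multiprocessing. Threads are a simpler stand-in here, on the assumption that each worker spends most of its time waiting for the inference server's response. `process_in_parallel` and `fake_worker` are hypothetical names; in practice the worker would wrap text extraction and the LLM call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def process_in_parallel(
    paths: list[str],
    worker: Callable[[str], dict],
    max_workers: int = 4,
) -> dict[str, dict]:
    """Run `worker` on every path concurrently and collect results by path."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One bad report should not abort the whole batch
                results[path] = {"error": str(exc)}
    return results


def fake_worker(path: str) -> dict:
    """Stand-in for the real extraction pipeline."""
    if path.endswith(".bad"):
        raise ValueError("unreadable")
    return {"tasks_found": len(path)}


out = process_in_parallel(["a.pdf", "b.bad"], fake_worker)
print(out)
```

Client-side parallelism only helps if the inference server can handle concurrent requests. Otherwise the requests simply queue up on the server.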
Continuous improvement
Keep a log of manual corrections made by users. This data will allow you to:
- Refine the prompt for better results
- Identify problematic report types
- Eventually fine-tune the model on your specific data
Fine-tuning typically improves accuracy by 5 to 10 percentage points for domain-specific extraction tasks.
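A correction log does not need to be elaborate. The sketch below appends one JSON object per line (JSONL) and counts corrections per report to surface the problematic ones. The field names and the action labels ("edited", "deleted") are assumptions, not an established schema.

```python
import json
import tempfile
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def log_correction(log_path, report_file, action, extracted=None, corrected=None):
    """Append one user correction to a JSONL log file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "report_file": report_file,
        "action": action,        # assumed labels: "edited", "deleted", "added"
        "extracted": extracted,  # what the model produced (None if it missed the task)
        "corrected": corrected,  # what the user changed it to (None if deleted)
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")


def corrections_per_report(log_path) -> Counter:
    """Count logged corrections for each report file."""
    counts = Counter()
    with Path(log_path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                counts[json.loads(line)["report_file"]] += 1
    return counts


# Demonstration in a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "corrections.jsonl"
    log_correction(log_file, "march_report.pdf", "edited",
                   extracted={"description": "Fix bug", "type": "feature"},
                   corrected={"description": "Fix bug", "type": "bug_fix"})
    log_correction(log_file, "march_report.pdf", "deleted",
                   extracted={"description": "Invented task", "type": "other"})
    counts = corrections_per_report(log_file)

print(counts.most_common(1))
```

The same file can later serve as raw material for prompt revisions or, with enough entries, as fine-tuning examples.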
Recommended hardware configuration
For an SME processing 50 to 200 reports per month:
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU | Intel i5 / AMD Ryzen 5 | Intel i7 / AMD Ryzen 7 |
| RAM | 16 GB | 32 GB |
| GPU | RTX 3060 (8 GB) | RTX 4070 (12 GB) |
| Storage | 500 GB SSD | 1 TB NVMe |
Total cost: between $800 and $1,500 for a new configuration. An investment recouped in a few months of savings on cloud APIs and manual processing time.
What this means for your business
Local LLM automation transforms a multi-hour manual process into a few-minute operation, without compromising data security.
This pattern applies well beyond monthly reports:
- Information extraction from contracts
- Support ticket analysis to identify trends
- Automatic meeting summaries (from transcriptions)
- Incoming document classification
At ClaroDigi, we design and deploy these types of solutions for Moroccan businesses. Our AI automation service includes process analysis, model selection, and integration with your existing tools.
If you're starting with automation and want to understand opportunities for your business, our digital transformation solution provides a comprehensive assessment of your automation potential.
FAQ
What is the extraction accuracy compared to manual extraction?
In our tests on structured reports, Llama 3.1 8B achieves 85 to 92% accuracy compared to human extraction. Errors are mainly omissions (undetected tasks) rather than false positives (invented tasks). Multi-pass cross-validation improves this rate to over 95%.
Can I use a smaller model to reduce hardware costs?
Yes. Phi-3 Mini (3.8B parameters) runs on 8 GB RAM without GPU and offers decent results for simple task extraction. Quality drops on complex or poorly structured reports. Start with Phi-3 and move to Llama 3.1 if results are insufficient.
How do I handle reports in multiple languages?
Recent models like Llama 3.1 and Mistral are multilingual. They natively handle French, English, and many other languages. Simply adapt the prompt to the report language or keep an English prompt (models understand English instructions even when analyzing French text).
Does Jira integration work with Jira Server or only Cloud?
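The jira-python library supports both. For Jira Cloud, authenticate with your account email and an API token created from your Atlassian account settings. For Jira Server or Data Center, use a personal access token, which jira-python accepts through its token_auth argument rather than basic_auth. The ticket-creation logic is otherwise the same, with one difference: Server identifies assignees by username rather than by accountId.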
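In jira-python, a Cloud connection typically passes `basic_auth=(email, api_token)`, while a Server or Data Center personal access token goes through `token_auth`. The sketch below builds the matching keyword arguments. `jira_connection_kwargs` and both example URLs are hypothetical.

```python
from typing import Optional


def jira_connection_kwargs(server_url: str, token: str, email: Optional[str] = None) -> dict:
    """Build keyword arguments for jira.JIRA(...).

    With an email: Jira Cloud style (email + API token via basic_auth).
    Without one: Server/Data Center style (personal access token via token_auth).
    """
    if email:
        return {"server": server_url, "basic_auth": (email, token)}
    return {"server": server_url, "token_auth": token}


cloud = jira_connection_kwargs("https://example.atlassian.net", "api-token", email="me@example.com")
server = jira_connection_kwargs("https://jira.example.internal", "personal-access-token")
print(sorted(cloud), sorted(server))

# Usage, assuming the jira package is installed:
#   from jira import JIRA
#   client = JIRA(**cloud)
```

On Server, the assignee field would also need a username (`{'name': ...}`) instead of an `accountId`.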
How long does it take to process a 10-page report?
With Llama 3.1 8B on an RTX 4070, expect 15 to 30 seconds per 2000-word report. Text extraction (PDF to text) adds 1 to 2 seconds. For 50 reports, complete processing takes under 30 minutes, versus several hours manually.
