From Scattered Data to Production AI: The A&S Soft Approach
There's a version of AI adoption that looks like this: a business signs up for an AI tool, connects it to a few documents, and calls it done. For a while, it seems to work. Then the cracks appear — wrong answers, missed context, frustrated users.
The problem isn't the AI. The problem is that raw business data is almost never ready to power a reliable AI system.
This post walks through how we approach the transformation from scattered, unstructured data to a production-grade AI system — and why each step matters.
Why Data Transformation Is the Core Differentiator
Most AI vendors focus on the model. We focus on the data.
The reason is simple: a state-of-the-art language model fed poor data will produce poor results. A simpler model fed clean, well-structured data will outperform it every time.
Data transformation is unglamorous work. It doesn't make for impressive demos. But it's the difference between an AI system that works in a controlled environment and one that works in your actual business.
Here's what we mean by "scattered data":
- Support tickets spread across three different helpdesk tools
- Product documentation in a mix of PDFs, Word documents, and Confluence pages
- Policy information that exists in email threads and Slack messages
- FAQs that were last updated eighteen months ago
- Onboarding guides that contradict each other
Sound familiar? This is the starting point for most of the businesses we work with.
The Transformation Pipeline
Stage 1: Ingestion
The first step is getting everything into one place in a format we can work with.
```typescript
// Example: reading a PDF and extracting its text content
import { PDFExtract } from 'pdf.js-extract'

async function ingestPDF(filePath: string): Promise<string> {
  const extractor = new PDFExtract()
  const data = await extractor.extract(filePath, {})
  // Flatten every page's text items into a single string
  return data.pages
    .flatMap(page => page.content)
    .map(item => item.str)
    .join(' ')
}
```

We handle:
- PDFs — product manuals, policy documents, contracts
- Word documents — internal guides, SOPs, training materials
- Spreadsheets — pricing tables, product catalogues, FAQs
- Web content — help centre articles, public documentation
- Structured data — support ticket histories, CRM records
The goal at this stage is completeness. We want everything in, even if it's messy.
Stage 2: Cleaning and Deduplication
Raw ingested content is full of noise. Headers, footers, navigation elements, repeated boilerplate, and duplicate records all degrade retrieval quality.
Common cleaning operations:
- Remove HTML tags and formatting artefacts
- Normalise whitespace and encoding
- Strip repeated headers and footers from paginated documents
- Identify and merge duplicate records
- Flag and remove content that is clearly outdated
Deduplication is particularly important. If the same policy is described in three slightly different ways across three documents, a retrieval system will surface all three — and they may contradict each other.
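To make that concrete, here's a minimal first-pass sketch in TypeScript. The regex rules and the exact-match hash check are deliberate simplifications for illustration; real pipelines need per-source rules and fuzzier matching for near-duplicates.

```typescript
import { createHash } from 'crypto'

// Strip HTML tags and normalise whitespace in one pass.
function cleanText(raw: string): string {
  return raw
    .replace(/<[^>]+>/g, ' ') // drop HTML tags and formatting artefacts
    .replace(/\s+/g, ' ')     // collapse runs of whitespace
    .trim()
}

// Exact-duplicate removal via content hashing. Near-duplicates
// (the same policy worded three ways) need fuzzier techniques,
// such as shingling or embedding similarity.
function dedupe(documents: string[]): string[] {
  const seen = new Set<string>()
  return documents.filter(doc => {
    const hash = createHash('sha256').update(cleanText(doc)).digest('hex')
    if (seen.has(hash)) return false
    seen.add(hash)
    return true
  })
}
```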
Stage 3: Structuring and Chunking
A retrieval system doesn't hand a language model a 50-page document. It retrieves chunks: small, semantically coherent units of text.
Chunking strategy matters enormously. Chunk too small and you lose context. Chunk too large and retrieval becomes imprecise.
We use a combination of approaches depending on the content type:
| Content Type | Chunking Strategy |
|---|---|
| FAQs | One question-answer pair per chunk |
| Long-form documentation | Paragraph-level chunks with section context |
| Support tickets | One ticket per chunk, with metadata |
| Policy documents | Section-level chunks with document title |
| Product catalogues | One product per chunk |
Each chunk is also enriched with metadata — source document, date, category, product line — so we can filter retrieval results before they reach the language model.
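As a rough sketch of the chunk-plus-metadata idea (the Chunk shape and the blank-line splitting heuristic here are illustrative assumptions, not a fixed format):

```typescript
interface Chunk {
  text: string
  metadata: {
    source: string    // originating document
    section?: string  // nearest heading, if known
    date?: string     // last updated
    category?: string // e.g. product line
  }
}

// Paragraph-level chunking: split on blank lines, prepend the
// section heading for context, and attach metadata for filtering.
function chunkDocument(text: string, source: string, section?: string): Chunk[] {
  return text
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0)
    .map(paragraph => ({
      text: section ? `${section}\n\n${paragraph}` : paragraph,
      metadata: { source, section },
    }))
}
```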
Stage 4: Vector Storage
Once content is chunked and cleaned, we embed it — converting each chunk into a numerical representation that captures its semantic meaning.
This is what enables semantic search: finding content that means the same thing even when the words are different.
User query: "How do I cancel my subscription?"
Semantically similar chunks retrieved:
→ "To terminate your account, navigate to Settings > Billing..."
→ "Subscription cancellation can be initiated from the account dashboard..."
→ "If you wish to end your plan, contact support at..."
Without semantic search, a keyword-based system would miss all three of these chunks: none of them contains the exact word "cancel."
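Under the hood, this is a nearest-neighbour lookup over embedding vectors. The sketch below assumes a generic embed() function standing in for whatever embedding model is used, and does a simple linear scan with cosine similarity; a production system would use a vector database instead.

```typescript
// Assumed: an embedding function backed by whichever model or API
// the system uses. It returns a fixed-length vector per text.
declare function embed(text: string): Promise<number[]>

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Linear-scan semantic search: fine for a demo; at scale this is
// replaced by a vector database with approximate nearest neighbours.
async function search(
  query: string,
  chunks: { text: string; vector: number[] }[],
  k = 3
) {
  const queryVector = await embed(query)
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}
```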
Stage 5: Retrieval and Ranking
Retrieval is where most systems get it wrong.
A naive retrieval system returns the top-N most similar chunks. But similarity isn't the same as relevance. A chunk can be semantically similar to a query without actually answering it.
We layer several signals to improve ranking:
- Semantic similarity — how closely the chunk matches the query meaning
- Recency — more recent content is weighted higher for time-sensitive topics
- Source authority — official documentation is weighted higher than informal notes
- Metadata filters — restrict retrieval to relevant product lines or categories
The result is a ranked list of chunks that are not just similar to the query, but likely to answer it.
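One simple way to combine those signals is a weighted score. The weights and the recency window below are placeholders chosen for illustration; in practice they're tuned per system against evaluation data.

```typescript
interface Candidate {
  similarity: number       // 0 to 1, from the vector search
  ageInDays: number
  isOfficialSource: boolean
}

// Blend semantic similarity with recency and source authority.
// The 0.7/0.15/0.15 weights and 365-day window are illustrative
// assumptions, not tuned values.
function rankScore(c: Candidate): number {
  const recency = Math.max(0, 1 - c.ageInDays / 365)
  const authority = c.isOfficialSource ? 1 : 0.5
  return 0.7 * c.similarity + 0.15 * recency + 0.15 * authority
}
```

Metadata filters sit in front of this scoring step: chunks from the wrong product line or category never enter the candidate set at all.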
Stage 6: The LLM Response Layer
Only at this stage does the language model come in.
The retrieved chunks are assembled into a context window and passed to the model with a carefully designed prompt. The model's job is to synthesise the retrieved information into a coherent, accurate response.
Key design decisions at this stage:
- Prompt engineering — how to instruct the model to use only the provided context (a sketch follows this list)
- Citation handling — whether to include source references in responses
- Fallback behaviour — what to do when no relevant content is found
- Tone and format — matching the response style to the use case
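To make the first of those decisions concrete, here's a stripped-down sketch of context assembly. The prompt wording is illustrative; real prompts are iterated against evaluation data.

```typescript
// Assemble retrieved chunks into a grounded prompt. The instructions
// below are a starting point, not a finished production prompt.
function buildPrompt(
  question: string,
  chunks: { text: string; source: string }[]
): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source})\n${c.text}`)
    .join('\n\n')

  return [
    'Answer the question using ONLY the context below.',
    'Cite sources by their [number].',
    'If the context does not contain the answer, say so instead of guessing.',
    '',
    `Context:\n${context}`,
    '',
    `Question: ${question}`,
  ].join('\n')
}
```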
Stage 7: Evaluation and Feedback
A system you can't measure is a system you can't improve.
We build evaluation into every system from the start. This includes:
- Automated accuracy testing against a set of known question-answer pairs (a minimal check is sketched below)
- Retrieval quality metrics — are the right chunks being surfaced?
- Response quality scoring — are the answers accurate and complete?
- User feedback collection — thumbs up/down, escalation rates, resolution rates
"You can't improve what you don't measure. And in AI systems, what you don't measure will quietly degrade."
Stage 8: Continuous Updates
Business data changes constantly. New products are launched. Policies are updated. Staff turn over and take institutional knowledge with them.
An AI system that isn't updated becomes a liability. It gives outdated answers with the same confidence as accurate ones.
We handle ongoing maintenance as part of every engagement:
- Scheduled re-ingestion of updated source documents (a change-detection sketch follows this list)
- Monitoring dashboards for accuracy and usage metrics
- Alert thresholds for accuracy drops below acceptable levels
- Quarterly reviews to identify new automation opportunities
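Scheduled re-ingestion doesn't have to reprocess everything. Here's a sketch of hash-based change detection, assuming per-document content hashes are persisted between runs:

```typescript
import { createHash } from 'crypto'

// Compare each source document's content hash against the last run;
// only changed or new documents are queued for re-ingestion.
function findChangedDocuments(
  current: { id: string; content: string }[],
  previousHashes: Map<string, string> // id -> hash from the last run
): string[] {
  return current
    .filter(doc => {
      const hash = createHash('sha256').update(doc.content).digest('hex')
      return previousHashes.get(doc.id) !== hash
    })
    .map(doc => doc.id)
}
```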
What This Looks Like in Practice
Here's a simplified example of how this pipeline transforms a real business problem.
The situation: A software company has a support team handling 400 tickets per week. 60% of those tickets are questions that could be answered from existing documentation. The documentation exists — it's just spread across a help centre, a Confluence wiki, and a folder of PDFs.
The problem: Support agents spend 30–40% of their time searching for answers they've already found before. New agents take 6–8 weeks to become productive. Response times average 4 hours.
After the transformation pipeline:
- All documentation is ingested, cleaned, and chunked into ~2,400 retrievable units
- An AI support assistant is deployed that retrieves relevant chunks and drafts responses
- Agents review and send AI-drafted responses rather than writing from scratch
- New agents are productive within their first week
- Response times drop to under 30 minutes
- The support team handles 40% more volume without additional headcount
The AI didn't replace the support team. It made them dramatically more effective.
Starting Points
Not every business is ready for a full AI system implementation. That's fine — and it's better to know before you invest.
Our AI Support Audit & Automation Plan is designed for businesses that want an honest assessment before committing to a larger project. In a short, fixed-scope engagement, we:
- Audit your current data sources and quality
- Map your support workflows and identify automation opportunities
- Assess your tooling and integration requirements
- Deliver a prioritised roadmap with realistic effort and ROI estimates
No jargon. No inflated promises. Just a clear picture of where you stand and what's actually possible.
Further Reading
If you want to go deeper on any of the topics covered here, these are worth reading:
- Retrieval-Augmented Generation (RAG) — original paper
- Chunking strategies for RAG systems
- Evaluating RAG systems
Ready to talk about your data? Get in touch — we'll start with an honest conversation about where you are and what's realistic.