From Scattered Data to Production AI: The A&S Soft Approach
There's a version of AI adoption that looks like this: a business signs up for an AI tool, connects it to a few documents, and calls it done. For a while, it seems to work. Then the cracks appear — wrong answers, missed context, frustrated users.
The problem isn't the AI. The problem is that raw business data is almost never ready to power a reliable AI system.
This post walks through how we approach the transformation from scattered, unstructured data to a production-grade AI system — and why each step matters.
Why Data Transformation Is the Core Differentiator
Most AI vendors focus on the model. We focus on the data.
The reason is simple: a state-of-the-art language model fed poor data will produce poor results. A simpler model fed clean, well-structured data will outperform it every time.
Data transformation is unglamorous work. It doesn't make for impressive demos. But it's the difference between an AI system that works in a controlled environment and one that works in your actual business.
Here's what we mean by "scattered data":
- Support tickets spread across three different helpdesk tools
- Product documentation in a mix of PDFs, Word documents, and Confluence pages
- Policy information that exists in email threads and Slack messages
- FAQs that were last updated eighteen months ago
- Onboarding guides that contradict each other
Sound familiar? This is the starting point for most of the businesses we work with.
The Transformation Pipeline
Stage 1: Ingestion
The first step is getting everything into one place in a format we can work with.
```typescript
// Example: reading a PDF and extracting its text content
import { PDFExtract } from 'pdf.js-extract'

async function ingestPDF(filePath: string): Promise<string> {
  const extractor = new PDFExtract()
  const data = await extractor.extract(filePath, {})
  // Flatten every page's text items into a single string
  return data.pages
    .flatMap(page => page.content)
    .map(item => item.str)
    .join(' ')
}
```

We handle:
- PDFs — product manuals, policy documents, contracts
- Word documents — internal guides, SOPs, training materials
- Spreadsheets — pricing tables, product catalogues, FAQs
- Web content — help centre articles, public documentation
- Structured data — support ticket histories, CRM records
The goal at this stage is completeness. We want everything in, even if it's messy.
Stage 2: Cleaning and Deduplication
Raw ingested content is full of noise. Headers, footers, navigation elements, repeated boilerplate, and duplicate records all degrade retrieval quality.
Common cleaning operations:
- Remove HTML tags and formatting artefacts
- Normalise whitespace and encoding
- Strip repeated headers and footers from paginated documents
- Identify and merge duplicate records
- Flag and remove content that is clearly outdated
Deduplication is particularly important. If the same policy is described in three slightly different ways across three documents, a retrieval system will surface all three — and they may contradict each other.
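To make that concrete, here's a minimal first-pass sketch in TypeScript. The regex rules and the exact-match hash check are deliberate simplifications for illustration; real pipelines need per-source rules and fuzzier matching for near-duplicates.

```typescript
import { createHash } from 'crypto'

// Strip HTML tags and normalise whitespace in one pass.
function cleanText(raw: string): string {
  return raw
    .replace(/<[^>]+>/g, ' ') // drop HTML tags and formatting artefacts
    .replace(/\s+/g, ' ')     // collapse runs of whitespace
    .trim()
}

// Exact-duplicate removal via content hashing. Near-duplicates
// (the same policy worded three ways) need fuzzier techniques,
// such as shingling or embedding similarity.
function dedupe(documents: string[]): string[] {
  const seen = new Set<string>()
  return documents.filter(doc => {
    const hash = createHash('sha256').update(cleanText(doc)).digest('hex')
    if (seen.has(hash)) return false
    seen.add(hash)
    return true
  })
}
```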
Stage 3: Structuring and Chunking
A retrieval system doesn't hand a language model a 50-page document. It retrieves chunks: small, semantically coherent units of text.
Chunking strategy matters enormously. Chunk too small and you lose context. Chunk too large and retrieval becomes imprecise.
We use a combination of approaches depending on the content type:
| Content Type | Chunking Strategy |
|---|---|
| FAQs | One question-answer pair per chunk |
| Long-form documentation | Paragraph-level chunks with section context |
| Support tickets | One ticket per chunk, with metadata |
| Policy documents | Section-level chunks with document title |
| Product catalogues | One product per chunk |
Each chunk is also enriched with metadata — source document, date, category, product line — so we can filter retrieval results before they reach the language model.
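As a rough sketch of the chunk-plus-metadata idea (the Chunk shape and the blank-line splitting heuristic here are illustrative assumptions, not a fixed format):

```typescript
interface Chunk {
  text: string
  metadata: {
    source: string    // originating document
    section?: string  // nearest heading, if known
    date?: string     // last updated
    category?: string // e.g. product line
  }
}

// Paragraph-level chunking: split on blank lines, prepend the
// section heading for context, and attach metadata for filtering.
function chunkDocument(text: string, source: string, section?: string): Chunk[] {
  return text
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0)
    .map(paragraph => ({
      text: section ? `${section}\n\n${paragraph}` : paragraph,
      metadata: { source, section },
    }))
}
```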
Stage 4: Vector Storage
Once content is chunked and cleaned, we embed it — converting each chunk into a numerical representation that captures its semantic meaning.
This is what enables semantic search: finding content that means the same thing even when the words are different.
User query: "How do I cancel my subscription?"
Semantically similar chunks retrieved:
→ "To terminate your account, navigate to Settings > Billing..."
→ "Subscription cancellation can be initiated from the account dashboard..."
→ "If you wish to end your plan, contact support at..."
Without semantic search, a keyword-based system would miss all three of these chunks: none of them contains the exact word "cancel."
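Under the hood, this is a nearest-neighbour lookup over embedding vectors. The sketch below assumes a generic embed() function standing in for whatever embedding model is used, and does a simple linear scan with cosine similarity; a production system would use a vector database instead.

```typescript
// Assumed: an embedding function backed by whichever model or API
// the system uses. It returns a fixed-length vector per text.
declare function embed(text: string): Promise<number[]>

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Linear-scan semantic search: fine for a demo; at scale this is
// replaced by a vector database with approximate nearest neighbours.
async function search(
  query: string,
  chunks: { text: string; vector: number[] }[],
  k = 3
) {
  const queryVector = await embed(query)
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}
```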
Stage 5: Retrieval and Ranking
Retrieval is where most systems get it wrong.
A naive retrieval system returns the top-N most similar chunks. But similarity isn't the same as relevance. A chunk can be semantically similar to a query without actually answering it.
We layer several signals to improve ranking:
- Semantic similarity — how closely the chunk matches the query meaning
- Recency — more recent content is weighted higher for time-sensitive topics
- Source authority — official documentation is weighted higher than informal notes
- Metadata filters — restrict retrieval to relevant product lines or categories
The result is a ranked list of chunks that are not just similar to the query, but likely to answer it.
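One simple way to combine those signals is a weighted score. The weights and the recency window below are placeholders chosen for illustration; in practice they're tuned per system against evaluation data.

```typescript
interface Candidate {
  similarity: number       // 0 to 1, from the vector search
  ageInDays: number
  isOfficialSource: boolean
}

// Blend semantic similarity with recency and source authority.
// The 0.7/0.15/0.15 weights and 365-day window are illustrative
// assumptions, not tuned values.
function rankScore(c: Candidate): number {
  const recency = Math.max(0, 1 - c.ageInDays / 365)
  const authority = c.isOfficialSource ? 1 : 0.5
  return 0.7 * c.similarity + 0.15 * recency + 0.15 * authority
}
```

Metadata filters sit in front of this scoring step: chunks from the wrong product line or category never enter the candidate set at all.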
Stage 6: The LLM Response Layer
Only at this stage does the language model come in.
The retrieved chunks are assembled into a context window and passed to the model with a carefully designed prompt. The model's job is to synthesise the retrieved information into a coherent, accurate response.
Key design decisions at this stage:
- Prompt engineering — how to instruct the model to use only the provided context (a sketch follows this list)
- Citation handling — whether to include source references in responses
- Fallback behaviour — what to do when no relevant content is found
- Tone and format — matching the response style to the use case
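To make the first of those decisions concrete, here's a stripped-down sketch of context assembly. The prompt wording is illustrative; real prompts are iterated against evaluation data.

```typescript
// Assemble retrieved chunks into a grounded prompt. The instructions
// below are a starting point, not a finished production prompt.
function buildPrompt(
  question: string,
  chunks: { text: string; source: string }[]
): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source})\n${c.text}`)
    .join('\n\n')

  return [
    'Answer the question using ONLY the context below.',
    'Cite sources by their [number].',
    'If the context does not contain the answer, say so instead of guessing.',
    '',
    `Context:\n${context}`,
    '',
    `Question: ${question}`,
  ].join('\n')
}
```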
Stage 7: Evaluation and Feedback
A system you can't measure is a system you can't improve.
We build evaluation into every system from the start. This includes:
- Automated accuracy testing against a set of known question-answer pairs (a minimal check is sketched below)
- Retrieval quality metrics — are the right chunks being surfaced?
- Response quality scoring — are the answers accurate and complete?
- User feedback collection — thumbs up/down, escalation rates, resolution rates
"You can't improve what you don't measure. And in AI systems, what you don't measure will quietly degrade."
Stage 8: Continuous Updates
Business data changes constantly. New products are launched. Policies are updated. Staff turn over and take institutional knowledge with them.
An AI system that isn't updated becomes a liability. It gives outdated answers with the same confidence as accurate ones.
We handle ongoing maintenance as part of every engagement:
- Scheduled re-ingestion of updated source documents (a change-detection sketch follows this list)
- Monitoring dashboards for accuracy and usage metrics
- Alert thresholds for accuracy drops below acceptable levels
- Quarterly reviews to identify new automation opportunities
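Scheduled re-ingestion doesn't have to reprocess everything. Here's a sketch of hash-based change detection, assuming per-document content hashes are persisted between runs:

```typescript
import { createHash } from 'crypto'

// Compare each source document's content hash against the last run;
// only changed or new documents are queued for re-ingestion.
function findChangedDocuments(
  current: { id: string; content: string }[],
  previousHashes: Map<string, string> // id -> hash from the last run
): string[] {
  return current
    .filter(doc => {
      const hash = createHash('sha256').update(doc.content).digest('hex')
      return previousHashes.get(doc.id) !== hash
    })
    .map(doc => doc.id)
}
```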
What This Looks Like in Practice
Here's a simplified example of how this pipeline transforms a real business problem.
The situation: A software company has a support team handling 400 tickets per week. 60% of those tickets are questions that could be answered from existing documentation. The documentation exists — it's just spread across a help centre, a Confluence wiki, and a folder of PDFs.
The problem: Support agents spend 30–40% of their time searching for answers they've already found before. New agents take 6–8 weeks to become productive. Response times average 4 hours.
After the transformation pipeline:
- All documentation is ingested, cleaned, and chunked into ~2,400 retrievable units
- An AI support assistant is deployed that retrieves relevant chunks and drafts responses
- Agents review and send AI-drafted responses rather than writing from scratch
- New agents are productive within their first week
- Response times drop to under 30 minutes
- The support team handles 40% more volume without additional headcount
The AI didn't replace the support team. It made them dramatically more effective.
Starting Points
Not every business is ready for a full AI system implementation. That's fine — and it's better to know before you invest.
Our AI Support Audit & Automation Plan is designed for businesses that want an honest assessment before committing to a larger project. In a short, fixed-scope engagement, we:
- Audit your current data sources and quality
- Map your support workflows and identify automation opportunities
- Assess your tooling and integration requirements
- Deliver a prioritised roadmap with realistic effort and ROI estimates
No jargon. No inflated promises. Just a clear picture of where you stand and what's actually possible.
Further Reading
If you want to go deeper on any of the topics covered here, these are worth reading:
- Retrieval-Augmented Generation (RAG) — original paper
- Chunking strategies for RAG systems
- Evaluating RAG systems
Ready to talk about your data? Get in touch — we'll start with an honest conversation about where you are and what's realistic.