Keyword search finds strings; vector search finds meaning. Build a semantic link-prospecting pipeline in 2026 — embeddings, a vector DB, runnable code and real costs.
Most link prospecting fails at relevance, not discovery. The tools are very good at handing you ten thousand domains; they are poor at telling you which forty of those domains are actually like the handful of placements that earned you a reply and a link last quarter. Keyword filters match strings — they will surface a marketing blog and miss a fintech newsletter that covers the same subject in different words, while flooding you with pages that share your keyword but nothing of your intent. The bottleneck has quietly moved from finding prospects to ranking them by genuine topical fit, and de-duplicating a list that three different sources returned three different ways.
Vector databases solve exactly that problem. They store the meaning of a page as a list of numbers — an embedding — so you can ask “find me prospects semantically like these ten” and get a ranked answer in milliseconds, regardless of which words each site happens to use. This is the same machinery that powers the RAG pipeline for personalised outreach covered earlier in this cluster — but pointed at a different job. RAG retrieves facts so a model can write; here we use embeddings and a vector store to find and rank who to write to in the first place. It is the data layer beneath the Claude-powered link prospecting agent.
The counter-intuitive part is the cost. Practitioners assume a vector pipeline is an expensive, heavyweight build reserved for enterprises. The figures say otherwise: embedding a corpus of 100,000 prospect pages costs well under £1 in API calls — embeddings are input-only, with no output tokens, which is why they are 10–100× cheaper per token than chat completions. The real cost is operational: assembling a clean corpus, keeping it fresh, and knowing when not to bother. This guide gives you the decision first, then the build, then the failure modes most tutorials skip.
| TL;DR: Semantic prospecting does four things keyword search cannot: lookalike discovery, near-duplicate detection, list clustering, and relevance ranking against a target page. The pipeline is five stages: build a corpus → embed it → store the vectors → query by meaning → rank and filter. The whole thing fits in roughly 150 lines of Python. Below ~5,000 prospects, skip it. A keyword-filtered spreadsheet or a single Claude pass beats the apparatus. Vector search earns its place at tens of thousands of prospects, recurring re-prospecting, or genuine lookalike/dedup needs. Embeddings are absurdly cheap (under £1 for 100k pages). Your recurring cost is the database host (£0–£40/month) and your time, not the API. Anthropic does not ship its own embedding model; Voyage AI is its recommended provider, with OpenAI text-embedding-3-small as the cheap default. |
The deliverable: the semantic prospecting pipeline
Here is the entire pipeline on one page. Five stages, each independent, each replaceable. Build them in order; you can stop and ship after stage four and add filtering later.
| Stage | What happens | Tooling | Cost driver |
| 1. Corpus | Assemble one short text record per prospect (title + meta + first content chunk + the anchor context that matters) | Your crawl / Ahrefs export / CSV | Your time |
| 2. Embed | Turn each record into a vector via an embedding model, in batches | Voyage or OpenAI embeddings | Tokens (tiny) |
| 3. Store | Upsert vectors + metadata (DR, traffic, niche, URL) into a vector DB | Qdrant / Pinecone / pgvector | DB host (monthly) |
| 4. Query | Search by meaning: “like these 10”, or “closest to this target page” | Vector DB client | Negligible |
| 5. Rank + filter | Combine similarity score with metadata filters (DR band, language, niche) | Vector DB filters | Negligible |
The principle behind it is the same one that runs through the whole link building strategies approach: relevance beats volume. A vector pipeline is simply the cheapest way to enforce relevance across a list too large to read.
| The 5,000-row line: when not to build this. If your prospect list is under roughly 5,000 rows and you re-prospect occasionally rather than continuously, a vector database is over-engineering. A keyword-filtered spreadsheet, or a single pass where you paste the list into Claude and ask it to rank by fit, will match the result for a fraction of the effort. Vector search pays back only when one of three things is true: the list runs to tens of thousands of rows; you re-prospect often enough that a reusable, queryable index saves real time; or you specifically need lookalike or de-duplication work that string matching cannot do. Below that line, the cheaper fallback wins. This is the same escalate-only-when-you-must discipline that governs every build in this cluster. |
What semantic prospecting does that keyword search cannot
Vector search is worth the setup because it unlocks four jobs that string matching simply cannot perform. Each maps to a concrete prospecting task.
1. Lookalike discovery
Take the ten placements that genuinely worked — the ones that earned a link and sent a trickle of relevant traffic — average their vectors into a single “ideal prospect” point, and ask the database for the 200 nearest neighbours. The result is a ranked list of sites that resemble your winners in subject, tone and audience, not just in keyword. This is the single highest-leverage move in the whole pipeline, because it turns your past success into a search query.
2. Near-duplicate detection
Run three exports — Ahrefs, a SERP scrape, a newsletter list — and you will have the same publisher three times under slightly different URLs and titles. Cosine similarity above roughly 0.95 flags these as near-identical so you de-duplicate before outreach, never pitching the same editor twice from two “different” rows.
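The near-duplicate check can be sketched in a few lines. This is a minimal illustration, assuming each row already carries its embedding under a hypothetical `vector` key; the URLs and three-dimensional vectors are toy data, and the 0.95 threshold is the starting point named above rather than a tuned value.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def near_duplicates(rows, threshold=0.95):
    """Return (url_a, url_b, score) for every pair at or above the threshold."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            score = cosine(rows[i]["vector"], rows[j]["vector"])
            if score >= threshold:
                pairs.append((rows[i]["url"], rows[j]["url"], round(score, 3)))
    return pairs

# Toy 3-dimensional vectors; real embeddings have 1,024 or 1,536 dimensions.
rows = [
    {"url": "https://example.com/saas-churn", "vector": [0.90, 0.10, 0.05]},
    {"url": "https://www.example.com/saas-churn/", "vector": [0.89, 0.11, 0.05]},
    {"url": "https://example.org/regional-news", "vector": [0.05, 0.20, 0.95]},
]
dupes = near_duplicates(rows)
print(dupes)
```

The all-pairs loop is quadratic, so it only suits small samples. On a full corpus the same idea would more plausibly run as one nearest-neighbour query per record against the vector store.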
3. Clustering a large list
Embeddings let you segment a 50,000-row list into natural topic clusters without reading it: SaaS review sites here, regional news there, hobbyist blogs in a third group. You then write one outreach angle per cluster instead of one per row, which is what makes personalisation at scale tractable.
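To make the clustering idea concrete, here is a small sketch of spherical k-means in plain Python. It is one common way to group embeddings, not necessarily the method used in the repository, and the function names, seeding rule and toy vectors are all assumptions for illustration. In practice a library implementation would be the sensible choice.

```python
import math

def normalise(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cluster(vectors, k, iterations=10):
    """Tiny spherical k-means; returns one cluster label per input vector."""
    unit = [normalise(v) for v in vectors]
    # Deterministic seeding: start from the first vector, then repeatedly add
    # the vector least similar to every centroid chosen so far.
    centroids = [unit[0]]
    while len(centroids) < k:
        centroids.append(min(unit, key=lambda v: max(dot(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assign each vector to its most similar centroid.
        labels = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in unit]
        # Move each centroid to the normalised mean of its members.
        for c in range(k):
            members = [v for v, label in zip(unit, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalise(mean)
    return labels

# Toy vectors standing in for three topics (two pages each).
vectors = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.0],   # e.g. SaaS review sites
    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0],   # e.g. regional news
    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9],   # e.g. hobbyist blogs
]
labels = cluster(vectors, k=3)
print(labels)
```

Each label then maps to one outreach angle. Choosing `k` is a judgement call: too few clusters blur the angles, too many recreate the row-by-row problem.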
4. Relevance ranking against a target page
Embed the specific page you are trying to build links to, then rank every prospect by closeness to it. A prospect can be high-DR and still a poor topical match; this scores fit directly, which is the signal that actually moves the needle. The wider evidence supports the emphasis: across AI-era studies, off-site presence and topical recognition correlate with visibility more than raw link type, and relevance is the lever earned-link channels keep returning to in our 2026 link building statistics.
What the data shows vs what practitioners believe
Three beliefs keep teams from adopting semantic prospecting, and all three are contradicted by the 2026 numbers. The first is that it is expensive; in fact embeddings are the cheapest call in the entire AI stack, 10 to 100 times cheaper per token than chat completions because they return a vector, not generated text, and so carry no output cost. The second is that you need Pinecone and a budget; in fact an open-source store on a ~£30 VPS handles 10M+ vectors, which is more than any single link builder will ever index.
The third belief is the costly one: that the prospecting problem is finding enough sites. It is not. Discovery has been a solved, commoditised problem for years — any tool will hand you more domains than you can ever pitch. The unsolved problem is relevance and consensus, and the AI-search data makes the point sharply: when models choose what to recommend, they aggregate consistency across independent sources rather than counting raw links. The same logic applies to your outreach. The win is not a longer list; it is a list ordered by genuine fit, which is precisely what a vector store produces and a keyword filter cannot. Believing the bottleneck is volume is what keeps teams optimising the wrong half of the funnel.
Choosing your vector database
The market consolidated in 2026 around a handful of serious options, and the decision is scale plus latency plus filtering needs, not philosophical preference. For link prospecting — which is read-heavy, modest in scale, and filter-dependent — the choice is unusually easy. Here is the honest comparison.
| Database | Licence / hosting | Best for prospecting when… | Indicative cost |
| Chroma | Open-source, local | You are prototyping or your list is under ~1M rows and lives on one machine | Free (local) |
| Qdrant | Open-source + cloud | You want the best price-performance and deep metadata filtering (DR, niche, language) | ~£25–£40/mo VPS |
| pgvector | Postgres extension | Your prospect data already lives in Postgres; ceiling is ~50M vectors on one node | Your existing DB |
| Weaviate | Open-source + cloud | You need native hybrid (keyword + vector) search in one query | From ~£25/mo cloud |
| Pinecone | Managed only | You refuse to run any infrastructure and will pay for that convenience | Usage-based; rises at scale |
For most link builders the answer is Chroma to prototype and Qdrant in production: Qdrant self-hosted on a ~£30/month VPS handles 10M+ vectors easily, with the metadata filtering that prospecting leans on. Reserve Pinecone for the case where you will not operate anything yourself, and note that its cost compounds at scale. The full stack of options sits alongside the rest of your kit in our comparison of link building tools.
| Failure threshold + fallback: pgvector’s practical ceiling is around 50M vectors on a well-provisioned single node before you must shard manually; past that, move to Qdrant or Weaviate. Chroma starts struggling well before 10M rows; treat it as a prototype tool, not a production store. If you are nowhere near these limits, the cheaper, simpler option is always correct. |
Choosing your embedding model
The vector is only as good as the model that produced it. Anthropic does not offer its own embedding model and recommends Voyage AI as its preferred provider, whose models pair well with Claude and handle long documents (a 32K-token context where OpenAI caps at ~8K). For prospecting you have three sensible choices.
| Model | List price / 1M tokens | Why pick it for prospecting |
| OpenAI text-embedding-3-small | $0.02 (batch $0.01) | Cheap default; integrates with every vector DB; fine for most lists |
| Voyage (voyage-3.5 / voyage-4 family) | $0.06 (lite tiers $0.02) | Anthropic’s recommended provider; stronger retrieval, 32K context for long pages |
| Self-hosted (e.g. BGE-M3 on a GPU) | ~$0.001 effective | Only if you already run GPU infrastructure; otherwise the overhead dwarfs the saving |
Two practical notes. First, use the Batch API for the one-off corpus embed: OpenAI gives 50% off and Voyage roughly a third off for non-real-time jobs, so there is no reason to pay standard rates to index a corpus overnight. Second, the asymmetric trick: embed your corpus once with a strong model, then run every query with a cheap lite model in the same family. This works only where the provider documents that the family shares a single embedding space; anywhere else, the rule that vectors from different models are not comparable applies and queries must use the corpus model. On the Voyage 4 family the lite tier runs about $0.02 per million tokens, so thousands of lookalike queries cost pennies. On Voyage, always set input_type to match the role of the text, “document” for corpus records and “query” for search queries, because mismatching them quietly degrades recall.
Where your prospect corpus comes from
The pipeline embeds whatever list you feed it, so it is worth being deliberate about sources. Four feed the best corpora, and combining them (then de-duplicating with the near-dup check) gives the richest index.
- Competitor backlink exports. The classic starting point: pull the referring domains of two or three close competitors from your backlink tool, keep the page-level data, and you have a corpus of sites that already link to businesses like yours.
- Your own placement history. Every site that has covered you, plus every site you have pitched, is signal. This is also where your seed set of winners comes from, so treat your outreach CRM as a first-class source, not an afterthought.
- SERP and topic scrapes. The pages ranking for the queries your target page targets are, almost by definition, topically adjacent. Scrape the top results for a basket of relevant queries and fold them in.
- Directories and niche indexes. Podcast directories, newsletter lists, industry association members and “best blogs” round-ups are dense, pre-qualified prospect sources that keyword tools tend to miss entirely.
Whatever the source, normalise to one row per page with the fields the corpus step needs (URL, title, meta, first chunk, DR, niche, language), then let near-duplicate detection collapse the inevitable overlap. A corpus built from several sources and de-duplicated semantically is both broader and cleaner than any single export, which feeds directly into the buyer-aware, revenue-oriented prospect selection the rest of your process depends on.
The build: a semantic prospecting pipeline in practice
Each snippet below is abridged for the page. The complete, runnable version of every file — imports, environment setup, the full function, a __main__ block and sample input/output — lives in the repository at github.com/linkbuildingjournal/snippets/tree/main/251-vector-database-prospecting. Clone it, drop in your two API keys, and the pipeline runs end to end.
Step 1 — assemble the corpus
The model embeds whatever text you give it, so this step decides the quality of everything downstream. For each prospect, build one compact record: page title, meta description, the first substantive content chunk, and the niche label if you have one. Strip boilerplate. A clean 300–500 token record beats a full-page dump, which dilutes the signal with nav menus and footers.
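```python
# build_corpus.py (full version: github.com/linkbuildingjournal/snippets/tree/main/251-vector-database-prospecting)

def to_record(row):
    """Build one compact, embeddable record plus filterable metadata for a prospect row."""
    # Title, meta description and the first ~1,200 characters of body text.
    parts = [row['title'], row['meta'], (row['first_chunk'] or '')[:1200]]
    text = ' \u2014 '.join(p for p in parts if p)
    return {
        'id': row['url_hash'],
        'text': text,
        # Metadata is stored as payload for filtering; it is never embedded.
        'meta': {
            'url': row['url'],
            'dr': row['dr'],
            'niche': row.get('niche'),
            'lang': row.get('lang', 'en'),
        },
    }
```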
Be concrete about what to keep and what to cut, because the corpus sets the quality ceiling for everything downstream. A good record is a clean, compact statement of what a page is about; a bad one buries that signal in chrome.
- Good record: “B2B SaaS retention metrics — a data-led breakdown of churn benchmarks by ARR band for UK subscription businesses”. Specific, topical, dense.
- Bad record: “Home | About | Blog | Contact — Welcome to our website. Subscribe to our newsletter. Cookie policy…”. Pure boilerplate; the vector encodes navigation, not subject.
If you only have thin metadata, embed the title and meta description rather than a full-page dump — a short, accurate record beats a long, noisy one every time. Where pages are genuinely substantive and long, this is exactly where Voyage’s 32K-token context earns its keep, because you avoid splitting a coherent page into fragments that each lose the thread. Keep the metadata (DR, niche, language, last-seen date) in the payload, not the embedded text: you filter on metadata, you do not embed it.
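The good-record versus bad-record distinction above can be approximated with a small cleaning pass. This is a rough sketch, not the repository's implementation: the marker phrases, the pipe-count heuristic for menu rows and the `max_words` cap (a crude proxy for the 300–500 token target) are all assumptions to tune against your own crawl.

```python
import re

# Hypothetical boilerplate markers; extend these from your own crawl samples.
BOILERPLATE = re.compile(
    r"cookie policy|subscribe to our|all rights reserved|privacy policy|welcome to our website",
    re.IGNORECASE,
)

def clean_chunk(raw_text, max_words=350):
    """Drop lines that look like navigation or footer chrome, then cap the length."""
    kept = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Menu rows are typically short fragments joined by pipes.
        looks_like_menu = line.count("|") >= 2
        if looks_like_menu or BOILERPLATE.search(line):
            continue
        kept.append(line)
    words = " ".join(kept).split()
    return " ".join(words[:max_words])

raw = (
    "Home | About | Blog | Contact\n"
    "Welcome to our website.\n"
    "B2B SaaS retention metrics: a data-led breakdown of churn benchmarks by ARR band.\n"
    "Subscribe to our newsletter.\n"
    "Cookie policy"
)
cleaned = clean_chunk(raw)
print(cleaned)
```

Only the topical sentence survives, which is the text worth embedding. A word cap is a blunt instrument; a tokenizer-based cap would be more precise if your embedding provider exposes one.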
Step 2 — embed in batches
Chunk the corpus into batches and embed with backoff (the robust version is under “Rate limits (HTTP 429)” in “Where this breaks in production”). Note the input_type=“document” flag for the corpus; queries later use “query”.

```python
# embed_batch.py (full version: github.com/linkbuildingjournal/snippets/tree/main/251-vector-database-prospecting)
import os

import voyageai

vo = voyageai.Client(api_key=os.environ['VOYAGE_API_KEY'])

def embed(texts, model='voyage-3.5', input_type='document'):
    """Embed texts in batches; pass input_type='query' for search queries."""
    out = []
    for i in range(0, len(texts), 128):  # stay under batch limits
        resp = vo.embed(texts[i:i + 128], model=model, input_type=input_type)
        out.extend(resp.embeddings)
    return out
```
Step 3 — store the vectors
Create a collection with the right dimensionality (1024 for voyage-3.5, 1536 for OpenAI 3-small) and cosine distance, then upsert vectors with their metadata payload so you can filter on it later.
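```python
# upsert_qdrant.py (full version: github.com/linkbuildingjournal/snippets/tree/main/251-vector-database-prospecting)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url='http://localhost:6333')

# Note: recreate_collection drops any existing 'prospects' collection first.
client.recreate_collection(
    collection_name='prospects',
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # 1024 = voyage-3.5
)

# records come from to_record() in Step 1, vectors from embed() in Step 2.
# Qdrant point IDs must be unsigned integers or UUID strings, so url_hash must be one of those.
points = [
    PointStruct(id=r['id'], vector=v, payload=r['meta'])
    for r, v in zip(records, vectors)
]
client.upsert(collection_name='prospects', points=points)
```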
Step 4 — query by meaning
The lookalike query: embed your best placements, average them into a centroid, and search — with a metadata filter applied so you only ever see prospects in a sensible DR band and the right language.
```python
# query_lookalike.py (full version: github.com/linkbuildingjournal/snippets/tree/main/251-vector-database-prospecting)
import numpy as np
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

# winners: the seed records for your best placements.
# embed() and client come from Steps 2 and 3.
seed = embed([w['text'] for w in winners], input_type='document')
centroid = np.mean(seed, axis=0).tolist()  # the "ideal prospect" point

hits = client.search(
    collection_name='prospects',
    query_vector=centroid,
    limit=200,
    query_filter=Filter(must=[
        FieldCondition(key='dr', range=Range(gte=20, lte=70)),     # DR band
        FieldCondition(key='lang', match=MatchValue(value='en')),  # language
    ]),
)
for h in hits:
    print(round(h.score, 3), h.payload['url'])
```
Reproducibility metadata
| Tested configuration: Embedding model: voyage-3.5 (Anthropic’s recommended provider); OpenAI text-embedding-3-small as the cheap-default alternative. Vector store: Qdrant (open-source), cosine distance; dimensions 1024 (voyage-3.5) / 1536 (OpenAI 3-small). SDKs tested: voyageai 0.3.x, openai 1.x, qdrant-client 1.x, numpy 2.x. Versions are pinned in requirements.txt in the repo. Date tested: June 2026. Re-check model names and pricing before a fresh run, because embedding model families turn over roughly every six months. |
A worked example: 200 prospects from 10 winners
Say you have a clean export of 60,000 UK and EU publisher pages and a shortlist of 10 placements from the last year that genuinely worked. The pipeline resolves like this. Embedding the 60,000 records at ~400 tokens each is 24M tokens; at OpenAI 3-small’s batch rate of $0.01 per million, that is $0.24 (roughly 19p). Storing 60,000 vectors at 1,024 dimensions (the voyage-3.5 size) is about 246MB as float32, or about 369MB at OpenAI 3-small’s 1,536 dimensions; either is trivial on a £30 VPS. You average your 10 winners into a centroid, search with a DR 20–70 filter, and Qdrant returns 200 ranked lookalikes in under 50ms. You then cluster those 200 into five topic groups and write five outreach angles instead of 200.
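The arithmetic in this worked example is easy to re-run for your own list size. The helper names below are hypothetical; the inputs (400 tokens per record, $0.01 per million tokens at batch rate, 4 bytes per float32 value, $1.27 to the pound) are the figures used in this guide, and the storage figure covers raw vectors only, not index overhead.

```python
def embed_cost_usd(records, tokens_per_record, usd_per_million_tokens):
    """One-off embedding cost in US dollars."""
    return records * tokens_per_record / 1_000_000 * usd_per_million_tokens

def storage_mb(records, dimensions, bytes_per_value=4):
    """Raw float32 vector size in decimal megabytes, excluding index overhead."""
    return records * dimensions * bytes_per_value / 1_000_000

cost = embed_cost_usd(60_000, 400, 0.01)   # 24M tokens at the batch rate
pence = cost / 1.27 * 100                  # converted at $1.27 to the pound
print(round(cost, 2), "USD, about", round(pence), "pence")
print(round(storage_mb(60_000, 1024), 1), "MB at 1,024 dimensions")
print(round(storage_mb(60_000, 1536), 1), "MB at 1,536 dimensions")
```

Swapping in 100,000 records gives the $0.40 embed cost and roughly 614MB of 1,536-dimension storage quoted in the economics table.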
The output is not “200 domains” — your keyword tool already gives you that. It is 200 domains ranked by resemblance to what has already worked for you, de-duplicated, segmented and filtered. That ranking is the entire value, and it is the thing a spreadsheet cannot produce.
Set it against the standard workflow to see the difference. The usual route is: export a competitor’s backlinks, filter by DR, eyeball a few hundred rows, and start pitching the ones that look plausible. That “looks plausible” step is where hours vanish and judgement drifts — by row 200 you are rubber-stamping. The semantic route replaces eyeballing with a score derived from your own successes, so the ranking is consistent from the first row to the last, and you spend your attention on outreach rather than triage.
The economics shift again when prospecting is continuous rather than one-off. An agency running outreach for a dozen clients re-prospects constantly; rebuilding a filtered spreadsheet each time is dead labour. With a standing index, each client’s seed set becomes a saved query, new prospects flow in on the quarterly re-embed, and “find this month’s 50 best lookalikes for client X” is a one-line call. The reusable index is what turns a one-off trick into an operating asset — and the reason the build pays back at agency scale where it would not for a single small list.
Knowing it works: a quick evaluation
Do not deploy on faith — the same evidence-first discipline that governs fine-tuning in this cluster applies here. Before you trust the index to drive outreach, run two cheap checks. First, a held-out test: pull five known-good placements out of the seed set, build the centroid from the rest, and confirm those five reappear high in the results. If your own winners do not rank near the top, the corpus or the model is wrong, and no amount of clever querying will fix it.
Second, a precision spot-check: take the top 25 lookalikes and judge by hand how many are genuinely pitchable — right topic, links out, plausible to cover you. If 20-plus of 25 pass, the pipeline is doing its job; if half are junk, tighten the corpus with more boilerplate stripping, tighten the filter (DR band, language), or add a reranking pass. Write the pass mark down before you look, so you are measuring the index rather than rationalising it. Both checks take ten minutes and save you from scaling a list that quietly does not work.
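Both checks reduce to a few lines of scoring once you have the ranked results. The sketch below assumes a ranked list of URLs from the lookalike query and a list of hand judgements in rank order; the function names and the `top_n=50` cut-off for "high in the results" are assumptions, while the 20-of-25 pass mark comes from the text above.

```python
def held_out_recall(ranked_urls, held_out, top_n=50):
    """Share of held-out winners that reappear in the top_n results."""
    top = set(ranked_urls[:top_n])
    return sum(1 for url in held_out if url in top) / len(held_out)

def precision_at_k(judgements, k=25):
    """judgements: booleans in rank order from the hand review (True = pitchable)."""
    top = judgements[:k]
    return sum(top) / len(top)

PASS_MARK = 20 / 25  # written down before looking at the results

# Toy data: 200 ranked results, five held-out winners, 25 hand judgements.
ranked_urls = [f"https://site{i}.example" for i in range(200)]
held_out = [f"https://site{i}.example" for i in (2, 7, 11, 30, 120)]
judgements = [True] * 21 + [False] * 4

recall = held_out_recall(ranked_urls, held_out)
precision = precision_at_k(judgements)
print(recall, precision, precision >= PASS_MARK)
```

Here four of the five held-out winners land in the top 50 and 21 of 25 spot-checked results pass, so the index clears the bar. A low recall points at the corpus or model; a low precision points at filtering or the need for a reranker.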
Hybrid search and reranking: the quality upgrades
Pure vector search has one blind spot: it can miss an exact term. If a prospect page is the only one that mentions a specific product name or a rare brand, semantic similarity may rank it below fuzzier but more “average” matches. The fix is hybrid search — combining dense vectors with keyword (BM25) matching so exact terms and meaning both count. Weaviate offers native BM25-plus-vector-plus-metadata in a single query; Qdrant supports sparse and dense vectors together. For most prospecting lists, pure vector plus metadata filtering is enough — reach for hybrid only when you keep losing prospects that hinge on a specific named term.
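If your store does not offer hybrid search natively, one widely used way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, not a description of how Weaviate or Qdrant fuse results internally; the constant `k=60` is the conventional default and the result lists are toy data.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of URLs; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy results: "d" is the only page naming the rare brand term.
vector_hits = ["a", "b", "c", "d"]   # ranked by semantic similarity
keyword_hits = ["d", "a", "e"]       # ranked by exact-term (BM25-style) match

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Page "d" was last in the vector ranking but rises to second once the keyword ranking counts, which is exactly the named-term case described above.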
Reranking: the cheap second pass
The highest-return upgrade is reranking. Vector search is fast but approximate; a reranker re-scores the top candidates against your query with a more precise model, dramatically tightening the order of the few results you will actually act on. The pattern is: retrieve 100 candidates by vector, rerank to the best 25, pitch those.
```python
# rerank.py (full version: github.com/linkbuildingjournal/snippets/tree/main/251-vector-database-prospecting)
import os

import cohere

co = cohere.Client(os.environ['COHERE_API_KEY'])

# hits: the vector-search results from Step 4.
# by_url: a dict mapping each URL to its corpus record from Step 1.
top = [h.payload['url'] for h in hits[:100]]
docs = [by_url[u]['text'] for u in top]
ranked = co.rerank(query=target_page_text, documents=docs,
                   model='rerank-v3.5', top_n=25)
best = [top[r.index] for r in ranked.results]
```

| Failure threshold + fallback: Reranking adds an API call and a little latency per query, so it earns its place only when you are choosing a small action set from a large candidate pool (the classic prospecting shape). If you are bulk-exporting thousands of rows rather than selecting a shortlist, skip the reranker; the marginal ordering gain is not worth the cost or the added dependency. As always, add the heavier component only when the lighter one demonstrably falls short. |
Where this breaks in production
Every tutorial shows the happy path. Here is the standing list of what actually goes wrong at volume, and the fix for each.
Rate limits (HTTP 429)
Embed 60,000 records in a tight loop and the provider will start returning 429s. Wrap every call in exponential backoff with jitter; never retry instantly in a flat loop.
```python
# robust_embed.py (full version: github.com/linkbuildingjournal/snippets/tree/main/251-vector-database-prospecting)
import random
import time

def embed_with_backoff(batch, model, tries=6):
    """Embed one batch, retrying rate-limit errors with exponential backoff and jitter."""
    for attempt in range(tries):
        try:
            # vo is the voyageai.Client created in embed_batch.py.
            return vo.embed(batch, model=model, input_type='document').embeddings
        except Exception as e:
            if '429' in str(e) or 'rate' in str(e).lower():
                # Wait 1s, 2s, 4s... plus up to 1s of jitter, capped at 60s.
                time.sleep(min(2 ** attempt + random.random(), 60))
            else:
                raise
    raise RuntimeError('exhausted retries')
```
Corpus drift and stale vectors
A vector is a snapshot. Publishers redesign, change focus, or die; six months on, a chunk of your index points at pages that no longer exist or no longer match. Re-embed on a schedule (quarterly is sane for prospecting) and timestamp every record so you can expire the oldest. Treat the index as perishable, not permanent.
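Expiring the oldest records is straightforward once every record is timestamped. This sketch assumes a hypothetical `embedded_at` field holding a timezone-aware datetime, and uses 90 days as a stand-in for the quarterly cadence suggested above.

```python
from datetime import datetime, timedelta, timezone

def stale_ids(records, max_age_days=90, now=None):
    """Return the IDs of records embedded more than max_age_days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [r["id"] for r in records if r["embedded_at"] < cutoff]

# Toy records; in practice the timestamp would live in the vector payload.
records = [
    {"id": "a", "embedded_at": datetime(2026, 1, 10, tzinfo=timezone.utc)},
    {"id": "b", "embedded_at": datetime(2026, 6, 1, tzinfo=timezone.utc)},
]
fixed_now = datetime(2026, 6, 30, tzinfo=timezone.utc)
to_refresh = stale_ids(records, now=fixed_now)
print(to_refresh)
```

The returned IDs are the ones to re-crawl and re-embed, or to delete if the page no longer exists.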
Embedding-model migration
Vectors from different models are not comparable — you cannot mix voyage-3.5 and OpenAI vectors in one collection, and upgrading models means re-embedding the whole corpus. Budget for it: at these prices a full re-embed is cheap in money, but it is a full pass, so script it rather than doing it by hand. Pin your model string (it is in the reproducibility block) so you always know which model produced a given collection.
Near-duplicate false positives
A 0.95 similarity threshold for de-duplication is a starting point, not a law. Two genuinely distinct regional editions of a title can score above it; a single site with two very different sections can score below. Spot-check the boundary before you auto-merge, and keep a human review on anything you are about to discard.
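One way to respect that boundary is to treat the threshold as a band rather than a line. In the sketch below, the 0.98 and 0.92 cut-offs are illustrative assumptions bracketing the 0.95 starting point, not values from any benchmark, and the pair data is invented.

```python
def triage(pairs, merge_at=0.98, review_at=0.92):
    """Split scored pairs into merge candidates, a review queue and distinct pairs."""
    buckets = {"merge": [], "review": [], "distinct": []}
    for url_a, url_b, score in pairs:
        if score >= merge_at:
            buckets["merge"].append((url_a, url_b))
        elif score >= review_at:
            buckets["review"].append((url_a, url_b))
        else:
            buckets["distinct"].append((url_a, url_b))
    return buckets

# Toy scored pairs, e.g. the output of a near-duplicate pass.
pairs = [
    ("example.com/a", "www.example.com/a/", 0.995),       # same page, two URLs
    ("title.example/uk", "title.example/ie", 0.955),      # regional editions
    ("site.example/news", "site.example/recipes", 0.80),  # one site, two sections
]
buckets = triage(pairs)
print({name: len(items) for name, items in buckets.items()})
```

Writing merge candidates to a log rather than deleting them outright keeps a human able to audit anything about to be discarded.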
Metadata-filter gotchas
Filtering on a field that is missing on some records silently drops those records from results — they are not “failing the filter”, they are invisible to it. Backfill defaults (a null DR becomes 0, a missing language becomes the explicit value you intend) before you rely on filters, or your best prospect quietly never appears.
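A backfill pass before upsert is the simplest guard. The sketch below is a minimal example; the `"unknown"` and `"unclassified"` defaults are assumptions standing in for whatever explicit value you intend, while the null-DR-becomes-0 rule is the one given above.

```python
# Explicit defaults so no record is invisible to a filter.
DEFAULTS = {"dr": 0, "lang": "unknown", "niche": "unclassified"}

def backfill(payload, defaults=DEFAULTS):
    """Return a copy of the payload with missing or null fields set to defaults."""
    filled = dict(payload)
    for field, default in defaults.items():
        if filled.get(field) is None:
            filled[field] = default
    return filled

sparse = {"url": "https://example.com/post", "dr": None}
complete = {"url": "https://example.org", "dr": 55, "lang": "de", "niche": "saas"}

print(backfill(sparse))
print(backfill(complete))
```

A record with DR 0 still fails a DR 20–70 filter, but it now fails visibly: you can query for `dr == 0` and see exactly which prospects need their metadata fixed.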
Prospect data is personal data
The moment your records contain editor names, personal email addresses or other identifying detail, you are processing personal data under UK GDPR — the same constraint that runs through the UK disclosure and data layer for this cluster. Embed the topical text, not the person: keep names and contact details in a separate, access-controlled store keyed by ID, never baked into the vector payload. Honour suppression and erasure requests at the source, and confirm your lawful basis with a qualified adviser. This is general best-practice guidance, not legal advice.
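The separation can be enforced at the point where rows are prepared for upsert. This is an illustrative sketch only: the field names in `PERSONAL_FIELDS` are hypothetical, and which fields count as personal data in your records is a question for your own review.

```python
# Hypothetical names for fields that identify a person.
PERSONAL_FIELDS = {"editor_name", "editor_email", "phone"}

def split_record(row):
    """Return (payload, contact): topical metadata vs personal data, sharing one ID."""
    payload = {k: v for k, v in row.items() if k not in PERSONAL_FIELDS}
    contact = {k: v for k, v in row.items() if k in PERSONAL_FIELDS}
    contact["id"] = row["id"]  # key for the separate, access-controlled store
    return payload, contact

row = {
    "id": 42,
    "url": "https://example.com/fintech-weekly",
    "dr": 48,
    "editor_name": "A. Editor",
    "editor_email": "editor@example.com",
}
payload, contact = split_record(row)
print(payload)
print(contact)
```

Only `payload` goes to the vector store. An erasure request then means deleting one row from the contact store by ID, with no need to touch the index.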
Three myths about vector prospecting
Three claims recur whenever vector search comes up, and each is wrong often enough to name.
- “A bigger embedding model always retrieves better.” Not for prospecting. The cheapest tier clears the bar for ranking publisher pages by topic; the premium models pay back on long, technical documents and specialist domains, not on short marketing pages. Match the model to the corpus, not to the leaderboard.
- “Higher similarity means a better prospect.” Similarity measures resemblance to your seed, nothing more. A 0.9-similar site can still be a poor placement if it never links out, sits in the wrong country, or is a content farm. Similarity is one input; metadata and human judgement decide.
- “Index it once and you are done.” An index is perishable. Publishers change and die, models get superseded, and your own definition of a good prospect evolves. A vector store you never refresh is a slowly rotting list dressed up as a system.
Failure modes that waste the build
- Embedding boilerplate. Full-page dumps bury the topic under navigation and footers; the vectors cluster on chrome, not subject. Clean the corpus first.
- Building it for a small list. Below the 5,000-row line the apparatus costs more attention than it saves. Read the list instead.
- Mixing models in one collection. Vectors from different models are not comparable; one collection, one model, always.
- Trusting filters on missing fields. Records lacking a filtered field vanish silently. Backfill defaults before you rely on a filter.
- Treating similarity as the whole decision. Rank by fit, then apply DR, language and link-likelihood, then have a human approve. The score narrows the field; it does not pick the winner.
The economics: what it actually costs
Put the numbers in one place and the picture is clear: the API is a rounding error and the host is the line item. All figures use list prices verified in 2026; convert at roughly $1.27 to the pound.
| Cost component | 5,000 prospects | 100,000 prospects | Notes |
| Embed corpus (3-small, batch) | ~$0.02 | ~$0.40 | One-off; input-only tokens |
| Vector storage | ~31MB | ~614MB | 1,536-dim float32; negligible |
| Lookalike queries (per 10k) | ~$0.04 | ~$0.04 | Lite model; pennies |
| Database host | £0 (Chroma local) | ~£25–£40/mo (Qdrant VPS) | The real recurring cost |
| Re-embed (quarterly) | ~$0.02 | ~$0.40 | Repeat the one-off |
The maths is the argument. If embedding 100,000 pages costs under £1 and the index lives on a £30 server, the only meaningful investment is the hour you spend building a clean corpus and the discipline to keep it fresh. That is also why the 5,000-row line matters: when the list is small, even this trivial cost cannot be justified against simply reading it, and the honest answer is to skip the build.
Your Monday-morning path
- Pull your last 12 months of placements and tag the 10–20 that genuinely worked — a link plus relevant traffic, not just a link. This is your seed set.
- Export your raw prospect list and build the corpus: one clean 300–500 token record per prospect, boilerplate stripped, metadata attached (DR, niche, language).
- If the list is under ~5,000 rows, stop here and rank in a spreadsheet instead; you do not need the build. Otherwise, stand up Chroma locally and embed the corpus with OpenAI 3-small via the Batch API.
- Run the lookalike query against your seed centroid with a DR-band filter, and pull the top 200.
- Cluster the 200 into topic groups, write one outreach angle per cluster, and feed the segmented list into your multi-agent outreach workflow with a human gate on every send.
- Once it earns its keep, promote the index from Chroma to a self-hosted Qdrant instance and set a quarterly re-embed reminder.
Done this way, semantic prospecting is the rare advanced technique that is genuinely cheap to run and genuinely hard to misuse — provided you respect the 5,000-row line and keep the corpus clean. It is the data layer that makes every other build in this cluster sharper: better prospects in means better outreach out, and the whole link building strategy compounds from there. For the fundamentals of why relevance and editorial fit matter more than raw volume in the first place, our primer on what backlinks are and how editorial links are earned sets the ground.
Frequently asked questions
Do I need a vector database for link prospecting?
Only above roughly 5,000 prospects, or when you re-prospect continuously, or when you specifically need lookalike or de-duplication work. Below that, a keyword-filtered spreadsheet or a single Claude pass over the list is faster and just as good. The vector pipeline is a scale-and-reuse tool, not a default.
Which embedding model should I use?
Default to OpenAI text-embedding-3-small for cost and universal support. Upgrade to Voyage (Anthropic’s recommended provider) when retrieval quality or long pages matter. Whatever you pick, use the Batch API for the one-off corpus embed; on Voyage, also set input_type to “document” for corpus records and “query” for search queries.
How much does it cost to embed a large prospect list?
Far less than people expect. Embeddings are input-only, so embedding 100,000 pages costs well under £1 at batch rates. Your recurring cost is the database host — £0 for Chroma running locally, around £25–£40 a month for a self-hosted Qdrant instance — plus the time to build a clean corpus.
What is the difference between this and RAG?
Same machinery, different job. RAG retrieves facts so a model can write a grounded message; semantic prospecting uses embeddings and a vector store to find and rank who to contact. In a mature stack you run both: prospect with vectors, then write with RAG-grounded facts.
Can I mix vectors from different embedding models?
No. Vectors from different models live in different spaces and are not comparable, so a single collection must use one model. Upgrading models means re-embedding the whole corpus — cheap in money at these prices, but a full pass, so script it and pin the model string per collection.
