Building an LLM by Extracting Data From the Universal Search API

TL;DR

Build a mini-LLM that uses Scrapingdog’s Universal Search API to pull deduped, real-time results from major engines in one call.
Flow: fetch snippets in Python, then a tiny Markov chain generates text.
Why: skip multiple SERP APIs and aggregation overhead.
Includes setup steps and 1,000 free credits to test.

When building applications with large language models (LLMs), one of the biggest challenges is retrieving fresh and reliable information. Training data goes stale, so developers turn to SERP APIs to keep their models relevant.

Developers often need to gather large amounts of data from multiple search APIs. To pull results from Google, Bing, or Yahoo, they end up calling several APIs, merging the results, and then feeding them into their system. This extra aggregation step adds overhead and ultimately reduces system efficiency.

That’s why we built the Universal Search API at Scrapingdog. One single request to this API pulls data from all major search engines. It’s the simplest way to add real-time, multi-engine search to your AI or data projects.

In this post, we will build a small "mini-LLM" that generates text from real-time data collected from the web.

Why Use Universal Search API

LLMs typically rely on static datasets, but integrating the Universal Search API gives them access to real-time search results.
You don’t have to integrate multiple APIs for scraping data from different search engines.
Deduplication is critical when assembling datasets. This API saves you hours by ensuring no repeated links clutter your results.
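
To make the deduplication point concrete, here is a rough sketch of the merge step you would otherwise write yourself when combining results from several engines. It is only an illustration: the merge_and_dedupe helper and the "link" field name are assumptions, not part of the Scrapingdog API.

```python
def merge_and_dedupe(*result_lists):
    """Merge result lists from several engines, keeping the first item per URL."""
    seen = set()
    merged = []
    for results in result_lists:
        for item in results:
            # "link" is an assumed field name; a trailing slash is ignored when comparing.
            link = item.get("link", "").rstrip("/")
            if link and link not in seen:
                seen.add(link)
                merged.append(item)
    return merged


# Made-up results standing in for two separate search APIs.
google_results = [{"link": "https://a.com/"}, {"link": "https://b.com"}]
bing_results = [{"link": "https://a.com"}, {"link": "https://c.com"}]

merged = merge_and_dedupe(google_results, bing_results)
print([item["link"] for item in merged])
```

With a single deduplicated response, this step and the bookkeeping around it are not needed.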


Prerequisites To Build an LLM Using Fresh Search Engine Data

You must have Python installed on your machine. If it is not, you can install it from python.org.
Create a folder with any name you like. We will keep our Python file inside this folder.
Create a Python file with any name you like. I am naming it llm.py.
Install the requests library with the command pip install requests.
Sign up for the free pack of Scrapingdog. You will get 1,000 free credits, which are enough for testing the service.
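
Once you have a key, one option is to keep it out of the source file by reading it from an environment variable. The sketch below is an optional extra, not a required step; the variable name SCRAPINGDOG_API_KEY and the load_api_key helper are assumptions made for illustration.

```python
import os


def load_api_key(var_name="SCRAPINGDOG_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage, after exporting the variable in your shell:
# API_KEY = load_api_key()
```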

Building a Mini-LLM

We will build a small Markov-chain-style model. We will fetch real-time snippets from Scrapingdog and use them to learn which word is likely to follow the current state (the last two words).

Pulling data from Scrapingdog

import requests

API_KEY = "YOUR_API_KEY"  # Replace with your Scrapingdog API key
query = "russia ukraine war"

url = f"https://api.scrapingdog.com/search/?api_key={API_KEY}&query={query}"

# Fetch the search results and parse the JSON reply into a dictionary
response = requests.get(url)
data = response.json()

# Scrapingdog returns "organic_results" for this endpoint
results = data.get("organic_results", [])

# Extract snippets for corpus
corpus = [item.get("snippet", "") for item in results if item.get("snippet")]

if not corpus:
    raise ValueError("No snippets found in API response.")

print(f"Got {len(corpus)} snippets from Scrapingdog API.")

Let me explain this code step-by-step.

We use the requests library to make HTTP calls.
API_KEY stores your Scrapingdog API key.
query = "russia ukraine war" is the search term.
The f-string builds the API endpoint URL from your key and query.
requests.get(url) fetches the search results.
response.json() converts the reply into a Python dictionary.
We then read organic_results from that dictionary (that’s where the search results are stored).
The list comprehension iterates over organic_results and collects the non-empty "snippet" values into a list.
Finally, we print the number of snippets retrieved from the Scrapingdog API.
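
The URL-building and snippet-extraction steps can also be written as two small functions that are easy to check without spending credits. This is a minimal sketch under a few assumptions: build_search_url and extract_snippets are hypothetical helper names, the sample dictionary is made up and only mirrors the organic_results and snippet fields used in this post, and urlencode from the standard library is used so that spaces and special characters in the query are encoded.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.scrapingdog.com/search/"  # endpoint used in this post


def build_search_url(api_key, query):
    """Return the request URL with the key and query safely encoded."""
    return f"{BASE_URL}?{urlencode({'api_key': api_key, 'query': query})}"


def extract_snippets(data):
    """Collect the non-empty "snippet" values from a parsed response dictionary."""
    return [
        item["snippet"]
        for item in data.get("organic_results", [])
        if item.get("snippet")
    ]


# Made-up response, trimmed to the fields this post relies on.
sample = {
    "organic_results": [
        {"snippet": "First result."},
        {"snippet": ""},
        {"title": "No snippet here"},
    ]
}

print(build_search_url("YOUR_API_KEY", "russia ukraine war"))
print(extract_snippets(sample))
```

Keeping the parsing in its own function means you can test it against a saved response before pointing it at the live API.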

Markov Chain

import random
from collections import defaultdict

def build_markov_model(corpus, n=2):
    """Builds a simple Markov chain model from text corpus"""
    model = defaultdict(list)
    for text in corpus:
        words = text.split()
        for i in range(len(words) - n):
            key = tuple(words[i:i+n])      # current state: n consecutive words
            next_word = words[i+n]         # the word that follows that state
            model[key].append(next_word)
    return model

def generate_text(model, length=60):
    """Generates text of given length using the Markov chain model"""
    start = random.choice(list(model.keys()))
    output = list(start)
    for _ in range(length):
        state = tuple(output[-2:])         # last two words (matches the default n=2)
        next_words = model.get(state)
        if not next_words:                 # dead end: no known continuation
            break
        output.append(random.choice(next_words))
    return " ".join(output)


markov_model = build_markov_model(corpus)
print(generate_text(markov_model, 800))

This code builds a very simple text generator using a Markov chain. First, it takes your collected snippets (the corpus) and learns which words tend to follow which. It does this by looking at pairs of words (like “Russia Ukraine”) and recording every word that comes next (for example, “war”). The result is a dictionary where each pair of words points to a list of possible continuations.

Then, when you want to generate new text, the program picks a random starting pair and keeps adding words by checking what words are likely to follow the last two. By repeating this process, it creates a new sequence of words that looks similar to the original snippets but isn’t copied directly. Essentially, it’s a toy model that mimics writing style and context based on word transitions from your data.
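
To see what the model dictionary actually looks like, here is a small worked example on a two-sentence toy corpus. It is a sketch rather than the exact code from this post: the toy corpus is made up, and generate_text takes an extra, hypothetical seed parameter so that runs can be repeated.

```python
import random
from collections import defaultdict


def build_markov_model(corpus, n=2):
    """Map each run of n words to the list of words seen right after it."""
    model = defaultdict(list)
    for text in corpus:
        words = text.split()
        for i in range(len(words) - n):
            model[tuple(words[i:i + n])].append(words[i + n])
    return model


def generate_text(model, length=10, seed=None):
    """Walk the chain from a random starting state; stop at a dead end."""
    rng = random.Random(seed)  # seed is an added, hypothetical parameter
    start = rng.choice(list(model.keys()))
    n = len(start)
    output = list(start)
    for _ in range(length):
        next_words = model.get(tuple(output[-n:]))
        if not next_words:
            break
        output.append(rng.choice(next_words))
    return " ".join(output)


toy_corpus = ["the cat sat on the mat", "the cat ran on the road"]
toy_model = build_markov_model(toy_corpus)

print(toy_model[("the", "cat")])  # ['sat', 'ran']
print(toy_model[("on", "the")])   # ['mat', 'road']
print(generate_text(toy_model, seed=0))
```

The pair ("the", "cat") appears in both sentences, so it has two possible continuations. That branching is what lets the generator produce sequences that never appeared verbatim in the corpus.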

Now let’s check the output of our code. You can run it with the command python llm.py.

So, we have built a lightweight mini-LLM on live search data without relying on multiple SERP APIs.

Complete Code

You could also swap this Markov model for a GPT model to get more coherent output. For now, here is the complete code.

import requests
import random
from collections import defaultdict

# ======================
# Step 1. Fetch Data
# ======================

API_KEY = "your-api-key"   # Replace with your Scrapingdog API key
query = "russia ukraine war"

url = f"https://api.scrapingdog.com/search/?api_key={API_KEY}&query={query}"

print(f"Fetching data for query: {query}...")
response = requests.get(url)
data = response.json()

# Scrapingdog returns "organic_results" for this endpoint
results = data.get("organic_results", [])

# Extract snippets for corpus
corpus = [item.get("snippet", "") for item in results if item.get("snippet")]

if not corpus:
    raise ValueError("No snippets found in API response.")

print(f"Got {len(corpus)} snippets from Scrapingdog API.")


# ======================
# Step 2. Build Mini Model
# ======================

def build_markov_model(corpus, n=2):
    """Builds a simple Markov chain model from text corpus"""
    model = defaultdict(list)
    for text in corpus:
        words = text.split()
        for i in range(len(words) - n):
            key = tuple(words[i:i+n])
            next_word = words[i+n]
            model[key].append(next_word)
    return model

def generate_text(model, length=60):
    """Generates text of given length using the Markov chain model"""
    start = random.choice(list(model.keys()))
    output = list(start)
    for _ in range(length):
        state = tuple(output[-2:])
        next_words = model.get(state)
        if not next_words:
            break
        output.append(random.choice(next_words))
    return " ".join(output)


# ======================
# Step 3. Generate Output
# ======================

markov_model = build_markov_model(corpus)

print("\n--- Mini LLM Output ---\n")
print(generate_text(markov_model, 800))
print("\n-----------------------")
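
If you do try a GPT-style model in place of the Markov chain, most of the work is turning the snippets into a prompt. The sketch below shows one possible way to do that. The build_prompt helper, its max_snippets parameter, and the prompt wording are all assumptions; the call to the model itself is left out because it depends on the provider you choose.

```python
def build_prompt(query, snippets, max_snippets=5):
    """Assemble a plain-text prompt from the first few search snippets."""
    numbered = "\n".join(
        f"{i}. {snippet}"
        for i, snippet in enumerate(snippets[:max_snippets], start=1)
    )
    return (
        f"Using only the search snippets below, write a short summary about: {query}\n\n"
        f"Snippets:\n{numbered}\n"
    )


# Made-up snippets standing in for the corpus list built earlier.
example_corpus = ["Snippet one.", "Snippet two.", "Snippet three."]
print(build_prompt("russia ukraine war", example_corpus, max_snippets=2))
```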

Conclusion

What we’ve built here is a fun, lightweight “mini-LLM” powered by live data from Scrapingdog’s Universal Search API. Instead of training on static datasets, this approach allows your model to generate fresh, topic-specific text using real search snippets from engines like Google, Bing, and Yahoo.

Of course, this isn’t a replacement for large-scale language models, but it shows how quickly developers can prototype search-aware applications without heavy infrastructure.

👉 Sign up for Scrapingdog and get free credits to start experimenting with your own mini-models today.