How To Parse HTML with PHP

TL;DR

Outlines the main PHP options for parsing HTML (DOMDocument, php-html-parser, and Goutte; skip regex) and walks through php-html-parser in detail.
Flow: fetch page → parse DOM → extract fields → save to CSV.
Example script parses a table and exports results.
At scale, use Scrapingdog to fetch reliably (proxies/CAPTCHAs/JS handled).

Parsing is the most critical task after scraping. Whether you’re building a web crawler, scraping data, or just extracting elements from a page, PHP offers some great tools for HTML parsing.

In this detailed guide, we’ll explore everything you need to know about parsing HTML with PHP, from the basics to advanced examples.

Parsing Methods in PHP

Before we go deeper, let’s outline the primary ways HTML can be parsed using PHP:

DOMDocument (Built-in)
Simple HTML DOM Parser (External Library)
Goutte (Symfony-based Web Scraper)
cURL + Regex (Not recommended, but used)
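
To make the first option concrete, here is a minimal sketch of the built-in DOMDocument class. The markup is invented for illustration and is not taken from the page used later in this guide. On Debian/Ubuntu, the DOM extension usually ships in the php-xml package, so you may need to install that first.

```php
<?php
// Sample markup invented for this sketch (an assumption, not the tutorial's target page).
$html = '<html><body><h1>Teams</h1><ul>'
      . '<li class="team">Boston Celtics</li>'
      . '<li class="team">Brooklyn Nets</li>'
      . '</ul></body></html>';

$doc = new DOMDocument();

// Real-world HTML is often malformed; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the text of every <li> element in document order.
$teams = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $teams[] = trim($li->textContent);
}

print_r($teams);
```

DOMDocument needs no Composer packages, but it has no CSS selector support: you navigate with methods such as getElementsByTagName() or with XPath queries.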

Installing PHP and Required Libraries

Before you begin scraping or parsing, ensure you have PHP installed on your system.

Install PHP

If you’re on macOS:

brew install php

For Ubuntu/Debian:

sudo apt update
sudo apt install php php-cli php-curl php-mbstring

For Windows:

Download PHP from php.net.
Extract and add the path to your system’s environment variables.

Install Composer (PHP Dependency Manager)

php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');"
php composer-setup.php
php -r "unlink('composer-setup.php');"

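Then, move the file to a directory on your PATH so you can run it as composer (prefix the command with sudo if you get a permission error):

mv composer.phar /usr/local/bin/composer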

Install paquettg/php-html-parser

This is one of the most popular PHP HTML parsing libraries.

composer require paquettg/php-html-parser

Scraping with PHP

For this tutorial, we are going to scrape and parse a web page that contains a table of team results. Create a PHP file with any name you like; I am naming mine scraper.php.

The script is very simple. Here is what it does, step by step.

We define a variable $url containing the URL of the web page we want to scrape.
We then use PHP’s built-in file_get_contents() function to send a GET request to that URL.
The HTML content of the page is stored in the $html variable. It’s a simple way to fetch raw HTML from a web page.
Next, the script checks whether the request failed (i.e., $html is false).
If the page couldn’t be loaded, the script stops and prints “Failed to fetch page”.
Otherwise, it writes the fetched HTML content into a new file called raw.html.
The file is created in the directory the script is run from (or overwritten if it already exists).
Finally, a success message is printed to confirm that the file has been saved.
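
The steps above can be sketched as follows. This is a reconstruction under assumptions, not the original script: the helper name saveRawHtml() and the placeholder URL in the usage comment are hypothetical, and the sketch throws an exception where the description says the script stops and prints a message.

```php
<?php
// Hypothetical helper (the name is an assumption): fetch a page and save it to disk.
function saveRawHtml(string $url, string $outFile = 'raw.html'): int
{
    // For http(s) URLs, file_get_contents() sends a GET request.
    // It returns false on failure; @ hides the accompanying warning.
    $html = @file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException('Failed to fetch page');
    }

    // Creates the file, or overwrites it if it already exists.
    $bytes = file_put_contents($outFile, $html);
    if ($bytes === false) {
        throw new RuntimeException("Could not write $outFile");
    }

    return $bytes;
}

// Usage (placeholder URL; replace it with the page you want to scrape):
// saveRawHtml('https://example.com/teams');
// echo "HTML saved to raw.html\n";
```

Wrapping the logic in a function keeps the failure check in one place, and the caller can decide whether to stop the script or retry.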

Now, let’s parse it.

Parsing with PHP

Now, let’s parse the raw HTML and extract the team name, year, wins, and losses.

<?php
// Load the libraries installed by Composer
require 'vendor/autoload.php';

use PHPHtmlParser\Dom;

// Parse the saved copy of the page
$dom = new Dom;
$dom->loadFromFile('raw.html');

$data = [];
$rows = $dom->find('.table tbody tr');

foreach ($rows as $row) {
    // Cells are zero-indexed: team, year, wins, losses
    $teamName = $row->find('td', 0)->text;
    $year = $row->find('td', 1)->text;
    $wins = $row->find('td', 2)->text;
    $losses = $row->find('td', 3)->text;

    $data[] = [
        'Team' => trim($teamName),
        'Year' => trim($year),
        'Wins' => trim($wins),
        'Losses' => trim($losses)
    ];
}

print_r($data);
?>

require 'vendor/autoload.php' loads all Composer-installed PHP libraries. This assumes you’ve installed paquettg/php-html-parser via Composer.
use PHPHtmlParser\Dom imports the Dom class from the library, which lets you parse and query HTML DOM elements.
new Dom creates a new DOM parser instance.
loadFromFile('raw.html') loads the HTML saved earlier (an offline copy of the page containing the table) for processing.
$data is initialized as an empty array to hold the parsed results.
find('.table tbody tr') uses a CSS selector to find all <tr> (table row) elements inside the <tbody> of a table with the class table.
The foreach loop iterates through each row of the table.
Inside the loop, find('td', n)->text reads the text of the cell at zero-based index n, so indexes 0 to 3 give the team name, year, wins, and losses.
Each row is then added to the $data array as an associative array.
trim() removes any leading/trailing whitespace from the extracted text.
print_r() outputs the final structured array in a human-readable format.

Once you run this code, it prints the parsed rows as a structured array:

Array
(
    [0] => Array
        (
            [Team] => Boston Celtics
            [Year] => 2013
            [Wins] => 41
            [Losses] => 40
        )

    [1] => Array
        (
            [Team] => Brooklyn Nets
            [Year] => 2013
            [Wins] => 49
            [Losses] => 33
        )

    [2] => Array
        (
            [Team] => New York Knicks
            [Year] => 2013
            [Wins] => 37
            [Losses] => 45
        )

    [3] => Array
        (
            [Team] => Philadelphia 76ers
            [Year] => 2013
            [Wins] => 19
            [Losses] => 63
        )

    [4] => Array
        (
            [Team] => Toronto Raptors
            [Year] => 2013
            [Wins] => 48
            [Losses] => 34
        )
)

Storing the Data in a CSV File

Let’s export this parsed data into a CSV file.

fopen('teams.csv', 'w') opens a file named teams.csv in write mode.
If the file doesn’t exist, it is created; if it does, its contents are overwritten.
$csvFile is the file handle used for writing to the file.
The first fputcsv() call writes the column headers (the first row) into the CSV file.
fputcsv() automatically formats an array as a comma-separated line, adding quotes where needed.
We then iterate through the $data array, which is an array of associative arrays, and write each entry as one row.
Finally, fclose() closes the file once writing is complete.
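
As a standalone sketch of the export step, the snippet below writes two sample rows shaped like the parsed $data array (values taken from the sample output shown earlier). The explicit separator, enclosure, and escape arguments are a choice made here, not something this guide requires; they spell out the traditional defaults, which recent PHP versions prefer to see stated explicitly.

```php
<?php
// Sample rows in the same shape as the parsed $data array.
$data = [
    ['Team' => 'Boston Celtics', 'Year' => '2013', 'Wins' => '41', 'Losses' => '40'],
    ['Team' => 'Brooklyn Nets',  'Year' => '2013', 'Wins' => '49', 'Losses' => '33'],
];

// 'w' creates the file, or truncates it if it already exists.
$csvFile = fopen('teams.csv', 'w');
if ($csvFile === false) {
    exit("Could not open teams.csv for writing\n");
}

// Header row first, then one line per team.
fputcsv($csvFile, ['Team', 'Year', 'Wins', 'Losses'], ',', '"', '\\');
foreach ($data as $line) {
    // fputcsv() writes the array values in order; the keys are ignored.
    fputcsv($csvFile, $line, ',', '"', '\\');
}

fclose($csvFile);
```

Because fputcsv() ignores array keys, the header row must list the columns in the same order as the keys in each $data entry.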

Complete Code

<?php
// Step 1: Load Composer's autoloader and import the Dom class
require 'vendor/autoload.php';

use PHPHtmlParser\Dom;

// Step 2: Load the saved HTML file
$dom = new Dom;
$dom->loadFromFile('raw.html');

// Step 3: Extract table rows
$data = [];
$rows = $dom->find('.table tbody tr');

foreach ($rows as $row) {
    $teamName = $row->find('td', 0)->text;
    $year     = $row->find('td', 1)->text;
    $wins     = $row->find('td', 2)->text;
    $losses   = $row->find('td', 3)->text;

    $data[] = [
        'Team'   => trim($teamName),
        'Year'   => trim($year),
        'Wins'   => trim($wins),
        'Losses' => trim($losses)
    ];
}

// Step 4: Write the data to a CSV file
$csvFile = fopen('teams.csv', 'w');

// Add header row
fputcsv($csvFile, ['Team', 'Year', 'Wins', 'Losses']);

// Write each data row
foreach ($data as $line) {
    fputcsv($csvFile, $line);
}

fclose($csvFile);
echo "Data written to teams.csv successfully.\n";
?>

Key Takeaways:

The tutorial explains how to parse HTML in PHP: fetch the raw HTML of a web page, then extract meaningful data from it.
It covers setting up PHP, Composer, and the php-html-parser library, then using CSS selectors to find table rows and extract the text of each cell.
The guide uses php-html-parser to turn the HTML into a DOM that PHP can query; the built-in DOMDocument class is an alternative.
Parsed HTML data can be exported or stored (here, as a CSV file), so you can use it downstream in reports, datasets, or other applications.

Conclusion

You just learned how to scrape and parse data from a real-world HTML page using PHP. From installing dependencies to writing your first HTML parsing logic and exporting the results, this guide covers the end-to-end workflow.

Whether you’re scraping for personal projects or commercial tools, PHP offers powerful solutions when used with the right libraries.

Frequently Asked Questions (FAQs)

1. What is the best way to parse HTML in PHP?

One of the easiest ways is to use DOMDocument or a library like php-html-parser. They let you load HTML, target elements, and extract the data you need in a structured way.

2. Can PHP scrape and parse HTML without external libraries?

Yes, PHP can fetch HTML using built-in functions like file_get_contents() and parse it with DOMDocument. But external libraries like php-html-parser make the process easier and cleaner.
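
As a rough illustration of the library-free route, the sketch below extracts the same four columns with DOMDocument and DOMXPath. The inline table is a stand-in written for this example; the real page’s markup may differ.

```php
<?php
// Stand-in markup for this sketch (an assumption about the page structure).
$html = <<<HTML
<table class="table">
  <tbody>
    <tr><td> Boston Celtics </td><td>2013</td><td>41</td><td>40</td></tr>
    <tr><td> Brooklyn Nets </td><td>2013</td><td>49</td><td>33</td></tr>
  </tbody>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$data = [];

// Rows of any table whose class attribute contains the word "table".
$rowQuery = '//table[contains(concat(" ", normalize-space(@class), " "), " table ")]/tbody/tr';

foreach ($xpath->query($rowQuery) as $row) {
    // Relative query: direct <td> children of the current row.
    $cells = $xpath->query('td', $row);
    if ($cells->length < 4) {
        continue; // skip rows that do not have all four columns
    }
    $data[] = [
        'Team'   => trim($cells->item(0)->textContent),
        'Year'   => trim($cells->item(1)->textContent),
        'Wins'   => trim($cells->item(2)->textContent),
        'Losses' => trim($cells->item(3)->textContent),
    ];
}

print_r($data);
```

The XPath expression is noticeably longer than the CSS selector '.table tbody tr', which is the main convenience an external library adds.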

3. How do you save parsed HTML data into a CSV file in PHP?

After extracting the data into an array, you can use fopen(), fputcsv(), and fclose() to write it into a CSV file. This is useful for exporting scraped data into reports or datasets.

4. Is regex a good option for parsing HTML in PHP?

Regex can be used, but it is not recommended for parsing HTML. HTML structures can be messy, so using a DOM-based parser is usually more reliable and easier to maintain.
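
A small sketch of why regex is brittle here: the two cells below (markup invented for this example) hold the same kind of data, but the second has an attribute and a nested tag. A simple pattern misses it, while a DOM parser reads both.

```php
<?php
// Two cells with the same meaning; the second has an attribute and a nested tag.
$html = '<table><tr>'
      . '<td>Boston Celtics</td>'
      . '<td class="name"><b>Brooklyn Nets</b></td>'
      . '</tr></table>';

// Naive regex: only matches a bare <td> tag that contains plain text.
preg_match_all('/<td>([^<]*)<\/td>/', $html, $matches);
$viaRegex = $matches[1];

// DOM parser: attributes and nested tags do not matter.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$viaDom = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $viaDom[] = trim($td->textContent);
}

print_r($viaRegex); // contains only "Boston Celtics"
print_r($viaDom);   // contains both team names
```

The pattern could be extended to cover this case, but each new variation in the markup needs another change, which is the maintenance cost the answer above refers to.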