Perl HTML Parsing Modules

WWW::Mechanize

This module is both an HTML parser and an HTTP user agent. It relies on several underlying modules for the parsing, and presents a high-level interface for interacting with a page, such as following links and submitting forms.
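
As a rough illustration of the two roles, the sketch below fetches a page and then queries it through the same object. To stay runnable offline it writes a made-up page to a temporary file and loads it through a file: URI; the page content, link targets, and variable names are all assumptions for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Write a small, invented page to disk so the sketch needs no network access.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><head><title>Demo page</title></head>
<body>
<a href="/docs">Documentation</a>
<a href="/download">Download</a>
</body></html>
HTML
close $fh;

# User-agent half: fetch the document (autocheck dies on a failed request).
my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get(URI::file->new_abs($path)->as_string);

# Parser half: ask high-level questions about the fetched page.
my $title = $mech->title;
my @links = $mech->links;                            # WWW::Mechanize::Link objects
my $link  = $mech->find_link(text => 'Download');    # first link with this exact text

print "Title: $title\n";
print "Found ", scalar(@links), " links\n";
print "Download link points to ", $link->url, "\n";
```

Against a real site, the same calls apply after `$mech->get('http://...')`, and `$mech->follow_link` or `$mech->submit_form` can then act on what was found.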

HTML::TreeBuilder

An excellent general-purpose module. It builds a tree view of the HTML, giving quick access to any part of it, and is well suited to finding tags at a specific location in the content.
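
A minimal sketch of finding tags by location, assuming an invented document with two div sections: `look_down` is first used to locate one section, and then called again on that element so the search is confined to its subtree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample document for the sketch.
my $html = <<'HTML';
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<div id="main"><p class="intro">First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element.
my $main = $tree->look_down(_tag => 'div', id => 'main');

# Searching from $main only considers elements inside that div.
my @paras = $main->look_down(_tag => 'p');
my $intro = $main->look_down(_tag => 'p', class => 'intro')->as_text;
my $para_count = @paras;

print "Paragraphs in main: $para_count\n";
print "Intro text: $intro\n";

# Release the tree once the extracted values are no longer tied to it.
$tree->delete;
```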

HTML::TableExtract

If you want all or part of the content of a table, this is the module to reach for. It can locate tables based on various criteria, such as column headers, depth, or position, and return all of the rows found. It can also be configured to return HTML::Element-compatible objects instead of plain text, making it even more flexible.
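
One way to locate a table is by its column headers. The sketch below assumes a small invented table and asks only for the columns under "Module" and "Style"; the header row itself is not returned by default.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample table for the sketch.
my $html = <<'HTML';
<table>
<tr><th>Module</th><th>Style</th><th>Notes</th></tr>
<tr><td>HTML::Parser</td><td>event</td><td>low level</td></tr>
<tr><td>HTML::TreeBuilder</td><td>tree</td><td>general purpose</td></tr>
</table>
HTML

# Match any table that has these headers; only those columns are kept.
my $te = HTML::TableExtract->new(headers => ['Module', 'Style']);
$te->parse($html);
$te->eof;

# Collect the data rows of every matching table; each row is an array ref.
my @rows;
for my $table ($te->tables) {
    push @rows, $table->rows;
}

print join(' | ', @$_), "\n" for @rows;
```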

HTML::SimpleLinkExtor

Simple and direct interface for extracting links (img src values, a href values, etc.) from a document. It's based on HTML::LinkExtor, but provides a more readable interface.
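
A short sketch of that interface, assuming an invented fragment with two anchors and one image. No base URL is supplied, so the links come back as written in the document.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Invented sample fragment for the sketch.
my $html = <<'HTML';
<html><body>
<a href="/one">One</a>
<img src="/logo.png" alt="logo">
<a href="/two">Two</a>
</body></html>
HTML

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

my @all     = $extor->links;   # every link the module recognizes
my @a_hrefs = $extor->a;       # links found in <a> tags only
my @img_src = $extor->img;     # links found in <img> tags only

print "All links: @all\n";
print "Anchors:   @a_hrefs\n";
print "Images:    @img_src\n";
```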

HTML::TreeBuilder::XPath

Adds XPath support to HTML::TreeBuilder. Excellent if XPath is how you like to approach your problem.
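
For comparison with plain HTML::TreeBuilder, here is a minimal sketch that pulls values out with XPath expressions. The document and the "post" class name are made up for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Invented sample document for the sketch.
my $html = <<'HTML';
<html><body>
<div class="post"><h2>First post</h2><a href="/posts/1">Read more</a></div>
<div class="post"><h2>Second post</h2><a href="/posts/2">Read more</a></div>
<div class="sidebar"><h2>About</h2></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findvalues returns the string value of each matching node.
my @titles = $tree->findvalues('//div[@class="post"]/h2');
my @hrefs  = $tree->findvalues('//div[@class="post"]//a/@href');

# findnodes returns the elements themselves for further processing.
my @posts = $tree->findnodes('//div[@class="post"]');

print "Titles: ", join(', ', @titles), "\n";
print "Links:  ", join(', ', @hrefs), "\n";

$tree->delete;
```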

HTML::TokeParser::Simple

A lower-level module, useful if you need to visit all, or nearly all, parts of an HTML document. It is less suited to finding specific elements, as its search criteria are more limited. It's based on HTML::TokeParser, but provides more readable methods for individual tokens.
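
The token-by-token style can be sketched as a single loop over the document. The example below, on an invented fragment, counts start tags by name and collects all text.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Invented sample fragment for the sketch.
my $html = '<p>Hello <b>bold</b> world</p><p>Bye</p>';

my $parser = HTML::TokeParser::Simple->new(string => $html);

my %tag_count;
my $text = '';

# Walk every token in document order and dispatch on its type.
while (my $token = $parser->get_token) {
    if ($token->is_start_tag) {
        $tag_count{ $token->get_tag }++;
    }
    elsif ($token->is_text) {
        $text .= $token->as_is;
    }
}

print "$_: $tag_count{$_}\n" for sort keys %tag_count;
print "Text: $text\n";
```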

HTML::Parser

Every parser listed here is ultimately built on this module, most of them as subclasses. If you are implementing your own HTML parsing module, you will probably want to inherit from this. It's event-based, and very awkward to use unless you truly do need to visit every part of a document.
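
To show what event-based means in practice, the sketch below registers handlers for start tags, end tags, and text using the version 3 API, then feeds in an invented fragment. The handler bodies and variable names are assumptions for the example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Invented sample fragment for the sketch.
my $html = '<p>Hi <em>there</em></p>';

my (@starts, @ends);
my $text = '';

# Each handler is a callback plus an argspec naming the values it receives.
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { my ($tag, $attr) = @_; push @starts, $tag }, 'tagname, attr' ],
    end_h   => [ sub { my ($tag) = @_; push @ends, $tag },          'tagname' ],
    text_h  => [ sub { my ($chunk) = @_; $text .= $chunk },         'dtext' ],
);

# Deliver each run of text in one piece rather than in arbitrary chunks.
$p->unbroken_text(1);

$p->parse($html);
$p->eof;    # signal end of input so buffered content is flushed

print "Start tags: @starts\n";
print "End tags:   @ends\n";
print "Text:       $text\n";
```

Even for this tiny task the state has to be tracked by hand across callbacks, which is why the higher-level modules above are usually the more comfortable choice.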