Xidel - HTML/XML data extraction tool

Xidel is a command-line tool to download and extract data from HTML/XML pages.

News

2013-03-26: New release
The 0.7 release adds JSONiq support, grouping of command-line options, and new input/output formats, fixes HTML parsing/serialization, changes the syntax of extended strings, and includes some other smaller changes.
2012-11-06: New release
The 0.6 release adds XQuery support and the form and match functions, improves the Windows command-line interface, merges the two old CGI services into a single one, and fixes several interpreter bugs.
2012-09-05: Initial release of Xidel
First release of the VideLibri backend as a stand-alone command-line tool.
2010-08-29: Initial release
Release of a tiny CLI example to test the pattern matching and XPath expressions of the Pascal Internet Tools / VideLibri backend on local files.

Features

It supports:

Examples

  1. Print all URLs found by a Google search:

    xidel "http://www.google.de/search?q=test" --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
  2. Print the titles of all pages found by a Google search and download them:

    xidel "http://www.google.de/search?q=test" --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'
  3. Generally follow all links on a page and print the titles of the linked pages:
  4. Another template example:

    If you have an example.xml file like <x><foo>ood</foo><bar>IMPORTANT!</bar></x>
    you can read the important part with: xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"
    (This also checks that the part with "ood" is present, and fails otherwise.)
  5. Calculate something with XPath:

    xidel -e "(1 + 2 + 3) * 4 + 5 + 6 + 7"
  6. Print all the newest Stack Overflow questions with title and URL:

    xidel http://stackoverflow.com -e "<A class='question-hyperlink'>{title:=text(), url:=@href}</A>*"
  7. Print all Reddit comments of a user, with HTML and URL:

    xidel "http://www.reddit.com/user/username/" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></div></div></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"
  8. Check if your Reddit letter is red:
  9. Use XQuery to create an HTML table of odd and even numbers:

    Windows: xidel --xquery "<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then 'even' else 'odd'}</td></tr>}</table>" --output-format xml
    Linux: xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml
    (Xidel itself accepts both ' and " quotes on all platforms, but ' does not escape <> in Windows cmd, and " does not escape $ in Linux shells.)
  10. Export variables to bash:

    eval $(xidel http://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)

    This sets the bash variable $title to the title of the page, and $links becomes an array of all links on it.
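
Once set, the variables can be used like any other bash variables. A minimal sketch, assuming bash (plain sh has no arrays) and assuming that the bash output format consists of plain variable assignments, as described above. print_page_summary is a hypothetical helper name, not part of Xidel:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Xidel): load the title and links of a page
# into bash variables via --output-format bash, then print them.
print_page_summary() {
    # Quoting the command substitution passes the generated assignments to eval unchanged.
    eval "$(xidel "$1" -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"
    printf 'Title: %s\n' "$title"
    # "${links[@]}" expands to one word per link, even if a link contains spaces.
    for link in "${links[@]}"; do
        printf 'Link: %s\n' "$link"
    done
}
```

Call it as print_page_summary http://site, with xidel available in your PATH.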

You may also want to read the readme file of Xidel and the documentation of my template language and XPath 2/XQuery library, or look at its results on the XQuery Test Suite (XPath 2 only, skipping tests that check for the rejection of invalid input).
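
Example 3 above is listed without a command. A minimal sketch of what it could look like, built only from the --follow and --extract options already used in example 2. The URL http://example.org is a placeholder, and print_linked_titles is a hypothetical helper name, not part of Xidel:

```shell
# Hypothetical helper (not part of Xidel): follow every link (//a) on the
# given page and print the title of each linked page.
print_linked_titles() {
    xidel "$1" --follow //a --extract //title
}
```

Call it as print_linked_titles http://example.org, with xidel available in your PATH.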

Downloads

The following Xidel downloads are available on the sourceforge download page:

Operating System         Filename                   Size
Windows: 32 Bit          xidel-0.7.win32.zip        556.7 kB
Universal Linux: 64 Bit  xidel-0.7.linux64.tar.gz   1.5 MB
Universal Linux: 32 Bit  xidel-0.7.linux32.tar.gz   1.2 MB
Source                   xidel-0.7.src.tar.gz       1.1 MB
Debian: 64 Bit           xidel_0.7-1_amd64.deb      1.5 MB
Debian: 32 Bit           xidel_0.7-1_i386.deb       1.2 MB
Mac 10.8                 externally prebuilt 0.7 version and compile instructions

Usually you can just extract the zip/deb and call Xidel, or copy it to a directory in your PATH,
because it consists of a single binary with no external dependencies other than the standard system libraries (i.e. the Windows API or libc, respectively).
However, HTTPS connections on Linux also require openssl and libcrypto (and possibly openssl-dev).

You can also test it online on a webpage, or directly by sending a request to the CGI service, for example: http://videlibri.sourceforge.net/cgi-bin/xidelcgi?data=<html><title>foobar</title></html>&extract=//title&raw=true
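
The request above contains characters such as < and > that should be URL-encoded when sent from a shell. A minimal sketch, assuming curl is installed. xidel_cgi is a hypothetical helper name, and the parameter names data, extract and raw are taken from the URL above:

```shell
# Hypothetical helper (not part of Xidel): send an inline document and an
# extract expression to the Xidel CGI service as a GET request.
xidel_cgi() {
    # --get appends the data to the URL as a query string;
    # --data-urlencode encodes the value after each "name=".
    curl --silent --get \
        --data-urlencode "data=$1" \
        --data-urlencode "extract=$2" \
        --data-urlencode "raw=true" \
        "http://videlibri.sourceforge.net/cgi-bin/xidelcgi"
}
```

Call it as xidel_cgi '<html><title>foobar</title></html>' '//title'. The response depends on the service being reachable.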


The source is stored in a Mercurial repository together with the VideLibri source.

The program is written in FreePascal/Lazarus, so you need these to compile it.

With the source from the repository, it should be possible to just open the project file xidel.lpi (in the programs/internet/xidel directory) in Lazarus and click on Run\Compile.
If that fails with complaints about missing packages, you first need to register the packages in Lazarus by opening the components/pascal/internettools.lpk and components/pascal/import/utf8tools/utf8tools.lpk files and clicking the "Use/Install" button.

In the source tarball, the paths are not fully correct (absolute instead of relative) and a package file is missing. So before compiling xidel.lpi, you need to open the "project options" dialog to add the "other unit file" paths to all subdirectories in the components/pascal directory, and open the "project inspector" to remove the dependency on "internet tools" (alternatively, you can download the internet tools package from my homepage, but that should not be necessary once the paths are set).

It is also possible to compile it on Mac. There you need the newest version of the Synapse library (the version included in the Xidel source is not Mac-compatible); then you can compile it as described here. A prebuilt Mac version is available there as well.

Xidelscript

There is also a Greasemonkey script to automatically generate templates by selecting the interesting values on a webpage. The script intercepts the selection and marks the elements as shown in the screenshots below:

Screenshot 1: Selections that create a template to read the Reddit front page (extracting title/author/time of each link).
Screenshot 2: Selections that create a template to read the newest Stack Overflow questions (votes/title/author are extracted for each question).

You can find the script in the Mercurial repository or on userscripts.org, together with a detailed description.



Contact

Author: Benito van der Zander, benito_NOSPAM_benibela.de, www.benibela.de