Xidel - HTML/XML data extraction tool

Xidel is a command-line tool to download and extract data from HTML/XML pages.

News

2014-03-24: New release
The 0.8 release improves the JSONiq support and our own JSON extensions, and adds arbitrary precision arithmetic, a trivial subset of XPath/XQuery 3, new functions for resolving URIs or HTML hrefs, and more.
2013-03-26: New release
The 0.7 release adds JSONiq support, grouping of command-line options and new input/output formats, fixes HTML parsing/serialization, and changes the syntax of extended strings, among other things.
2012-11-06: New release
The 0.6 release adds XQuery support and the form and match functions, improves the Windows command-line interface, merges the two old CGI services into a single one, and fixes several interpreter bugs.
2012-09-05: Initial release of Xidel
First release of the VideLibri backend as a stand-alone command-line tool.

Features

It supports:

Examples

  1. Print all urls found by a google search:

    xidel "http://www.google.de/search?q=test" --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
  2. Print the title of all pages found by a google search and download them:

    xidel "http://www.google.de/search?q=test" --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'
  3. Generally follow all links on a page and print the titles of the linked pages:
  4. Another template example:

    If you have an example.xml file like <x><foo>ood</foo><bar>IMPORTANT!</bar></x>
    you can read the important part with: xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"
    (this also checks that the part with the ood is there, and fails otherwise)
  5. Calculate something with XPath:

    xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"
  6. Print the newest Stack Overflow questions with title and url:

    xidel http://stackoverflow.com -e "<A class='question-hyperlink'>{title:=text(), url:=@href}</A>*"
  7. Print all reddit comments of a user, with html and url:

    xidel "http://www.reddit.com/user/username/" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></div></div></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"
  8. Check if your reddit letter is red:
  9. Use XQuery to create an html table of odd and even numbers:

    Windows: xidel --xquery "<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then 'even' else 'odd'}</td></tr>}</table>" --output-format xml
    Linux: xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml
    (Xidel itself accepts both ' and " quotes on all platforms, but ' does not escape <> in Windows cmd, and " does not escape $ in the Linux shells)
  10. Export variables to bash:

    eval "$(xidel http://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"

    This sets the bash variable $title to the title of the page, and $links becomes an array of all links on it.
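
The quoting note in example 9 can be tried without Xidel. The following is only a sketch for a POSIX shell; the variable i and its value are made up for the demonstration and have nothing to do with Xidel itself:

```shell
# Sketch: how a POSIX shell treats $ inside each quote style (no xidel needed).
i="EXPANDED"   # hypothetical shell variable, only here to make the expansion visible

single='for $i in 1 to 3 return $i'   # single quotes: $i is passed on literally
double="for $i in 1 to 3 return $i"   # double quotes: the shell substitutes $i first

printf '%s\n' "$single"
printf '%s\n' "$double"
```

The first line printed still contains $i, so an XQuery expression would arrive intact; the second one has been rewritten by the shell before any program could see it. This is why the Linux variant of example 9 uses ' around the query.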

You may also want to read the Xidel readme file and the documentation of my template language and XPath 2/XQuery library, or look at its results on the XQuery Testsuite (XPath 2 only; tests that check for the rejection of invalid input are skipped).

Downloads

The following Xidel downloads are available on the SourceForge download page:

Operating System         Filename                   Size
Windows: 32 Bit          xidel-0.8.win32.zip        602.5 kB
Universal Linux: 64 Bit  xidel-0.8.linux64.tar.gz   1.5 MB
Universal Linux: 32 Bit  xidel-0.8.linux32.tar.gz   1.2 MB
Source:                  xidel-0.8.src.tar.gz       1.3 MB
Debian: 64 Bit           xidel_0.8-1_amd64.deb      1.0 MB
Debian: 32 Bit           xidel_0.8-1_i386.deb       794.3 kB
Mac OS X 10.9            externally prebuilt version and compile instructions

Usually you can just extract the zip/deb and call Xidel, or copy it to some place in your PATH,
because it consists of a single binary without any external dependencies, except the standard system libraries (i.e. the Windows API or libc, respectively).
However, https connections on Linux also require openssl and libcrypto (or even openssl-dev?).

You can also test it online on a webpage, or directly by sending a request to the CGI service, like http://videlibri.sourceforge.net/cgi-bin/xidelcgi?data=<html><title>foobar</title></html>&extract=//title&raw=true.
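
The URL above contains characters like <, > and & that shells and some HTTP clients treat specially. Here is a minimal sketch of building the same request with percent-encoded parameters in a POSIX shell. Note that urlencode is a hypothetical helper defined right here, not part of Xidel; it only handles ASCII input, and the sketch assumes the CGI service decodes standard percent-encoding, as CGI query strings normally do:

```shell
# Hypothetical helper: percent-encode every character except A-Z a-z 0-9 . _ ~ -
# (ASCII input only; not part of Xidel).
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;   # "'c" makes printf use the character code
    esac
    s=$rest
  done
  printf '%s' "$out"
}

base='http://videlibri.sourceforge.net/cgi-bin/xidelcgi'
data='<html><title>foobar</title></html>'
query='//title'

url="$base?data=$(urlencode "$data")&extract=$(urlencode "$query")&raw=true"
printf '%s\n' "$url"
# The printed URL can then be handed to any HTTP client, e.g.: curl "$url"
```

The parameter names data, extract and raw are taken from the URL above; everything else (the variable names and the helper) is just for illustration.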


The source is stored in a Mercurial repository together with the VideLibri source.

The program is written in FreePascal/Lazarus, so you need these to compile it.

With the source from the repository, it should be possible to just open the project file xidel.lpi (in the programs/internet/xidel directory) in Lazarus and click on Run\Compile.
If that fails with a complaint about missing packages, you first need to register the packages in Lazarus by opening the components/pascal/internettools.lpk and components/pascal/import/utf8tools/utf8tools.lpk files and clicking the "Use/Install" button.


Xidelscript

There is also a Greasemonkey script to automatically generate templates by selecting the interesting values on a webpage. The script intercepts the selection and marks the elements as shown in the screenshots below:

Screenshot 1: Selections that create a template to read the reddit frontpage (extracting title/author/time of each link).
Screenshot 2: Selections that create a template to read the newest Stack Overflow questions (votes/title/author are extracted for each of the questions).

You can find the script in the Mercurial repository or on userscripts.org with a detailed description.



Contact

Author: Benito van der Zander, benito_NOSPAM_benibela.de, www.benibela.de