Xidel - HTML/XML data extraction tool

Xidel is a command line tool to download and extract data from HTML/XML pages.

News

2014-08-13: Minor release
The 0.8.4 version extends some standard XQuery expressions with pattern matching, adds options to set HTTP headers and read environment variables, and fixes some bugs...
2014-03-24: New release
The 0.8 release improves the JSONiq support and our own JSON extensions, adds arbitrary precision arithmetic, a trivial subset of XPath/XQuery 3, new functions for resolving uri or html hrefs, and more...
2013-03-26: New release
The 0.7 release adds JSONiq support, grouping of command line options, new input/output formats, fixes html parsing/serialization, changes the syntax of extended strings and some other stuff
2012-11-06: New release
The 0.6 release adds XQuery support, the form and match functions, improves the Windows command-line interface, merges the two old cgi services to a single one and fixes several interpreter bugs
2012-09-05: Initial release of Xidel
First release of the VideLibri backend as a stand-alone command-line tool

Features

It supports:

Examples

  1. Print all urls found by a google search.

    xidel http://www.google.de/search?q=test --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
  2. Print the title of all pages found by a google search and download them:

    xidel http://www.google.de/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'
  3. Generally follow all links on a page and print the titles of the linked pages:
  4. Another template example:

    If you have an example.xml file like <x><foo>ood</foo><bar>IMPORTANT!</bar></x>
    you can read the important part with: xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"
    (this also checks that the part with the ood is there, and fails otherwise)
  5. Calculate something with XPath:

    xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"
  6. Print all newest stackoverflow questions with title and url:

    xidel http://stackoverflow.com -e "<A class='question-hyperlink'>{title:=text(), url:=@href}</A>*"
  7. Print all reddit comments of a user, with html and url:

    xidel "http://www.reddit.com/user/username/" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></div></div></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"
  8. Check if your reddit letter is red:
  9. Use XQuery to create an HTML table of odd and even numbers:

    Windows: xidel --xquery "<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then 'even' else 'odd'}</td></tr>}</table>" --output-format xml
    Linux: xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml
    (xidel itself supports ' and "-quotes on all platforms, but ' does not escape <> in Windows cmd, and " does not escape $ in the Linux shells)
  10. Export variables to bash

    eval "$(xidel http://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"

    This sets the bash variable $title to the title of the page, and $links becomes an array of all links there.
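
The extract(@href, 'url[?]q=([^&]+)&', 1)[. != ''] expression used in examples 1 and 2 can be hard to read at first. The following Python sketch mirrors its logic outside of Xidel: take capture group 1 of the regular expression, and drop every link where nothing matched. It is only an illustration, not how Xidel works internally; the function name extract_target and the sample href values are made up for this example.

```python
import re

# Same regular expression as in examples 1 and 2.
PATTERN = re.compile(r"url[?]q=([^&]+)&")


def extract_target(href: str) -> str:
    """Return capture group 1, or '' when the pattern does not match."""
    match = PATTERN.search(href)
    return match.group(1) if match else ""


# Hypothetical href values, invented for illustration only.
hrefs = [
    "/url?q=http://example.com/&sa=U",
    "/preferences?hl=de",
    "/url?q=http://example.org/page&sa=U",
]

# Mirrors the [. != ''] predicate: keep only links where the pattern matched.
targets = [t for t in map(extract_target, hrefs) if t != ""]
print(targets)
```

The second href has no url?q= part, so it yields an empty string and is filtered out, just as the predicate removes empty results in the Xidel command.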

You may also want to read the readme file of Xidel, the documentation of my template language and XPath 2/XQuery library. You can also look at its results on the XQuery Testsuite (XPath 2 only, skipping the tests that check for the rejection of invalid input).
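
The calculation in example 5 mixes a large integer with a decimal that has many fractional digits, which is where exact decimal arithmetic matters. As a rough comparison outside of Xidel, the Python sketch below evaluates the same expression once with exact decimals and once with binary floating point. The variable names are chosen for this illustration, and the sketch makes no claim about how Xidel formats its output.

```python
from decimal import Decimal

# Exact decimal arithmetic: the literal 7.000000008 is kept digit for digit.
exact = (Decimal(1) + 2 + 3) * 1000000000000 + 4 + 5 + 6 + Decimal("7.000000008")

# Binary floating point: the small fractional part cannot be represented
# next to a value of this magnitude.
approx = (1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008

print(exact)
print(approx)
```

The exact result keeps all nine fractional digits, while the floating point result loses them.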

Downloads

The following Xidel downloads are available on the sourceforge download page:

Operating System          Filename                      Size
Windows: 32 Bit           xidel-0.8.4.win32.zip         609.9 kB
Universal Linux: 64 Bit   xidel-0.8.4.linux64.tar.gz    1.5 MB
Universal Linux: 32 Bit   xidel-0.8.4.linux32.tar.gz    1.2 MB
Source:                   xidel-0.8.4.src.tar.gz        1.3 MB
Debian: 64 Bit            xidel_0.8.4-1_amd64.deb       1.0 MB
Debian: 32 Bit            xidel_0.8.4-1_i386.deb        799.8 kB
Mac 10.8                  externally prebuilt version and compile instructions.

Usually you can just extract the zip/deb and call Xidel, or copy it to some place in your PATH,
because it consists of a single binary without any external dependencies other than the standard system libraries (i.e. the Windows API or libc, respectively).
However, https connections on Linux also require openssl and libcrypto (or even openssl-dev?).

You can also test it online on a webpage or directly by sending a request to the cgi service like http://videlibri.sourceforge.net/cgi-bin/xidelcgi?data=<html><title>foobar</title></html>&extract=//title&raw=true.
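
The request URL above contains characters such as < and > that usually have to be percent-encoded before they are sent. A minimal Python sketch of building such a URL follows. It assumes the service accepts the data, extract and raw parameters exactly as shown in the example URL; the helper name build_request_url is made up here, and the sketch only builds the URL without sending anything.

```python
from urllib.parse import urlencode

# CGI endpoint taken from the example URL above.
BASE = "http://videlibri.sourceforge.net/cgi-bin/xidelcgi"


def build_request_url(data: str, extract: str, raw: bool = True) -> str:
    """Build a percent-encoded request URL (hypothetical helper)."""
    params = {"data": data, "extract": extract}
    if raw:
        # Parameter name and value copied from the example URL.
        params["raw"] = "true"
    return BASE + "?" + urlencode(params)


url = build_request_url("<html><title>foobar</title></html>", "//title")
print(url)
```

The resulting string can be opened in a browser or passed to any HTTP client.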


The source is stored in a mercurial repository together with the VideLibri source.

The program is written in FreePascal/Lazarus, so you need these to compile it.

With the source from the repository, it should be possible to just open the project file xidel.lpi (in the programs/internet/xidel directory) in Lazarus and click on Run\Compile.
If that fails with a complaint about missing packages, you first need to register the packages in Lazarus by opening the components/pascal/internettools.lpk and components/pascal/import/utf8tools/utf8tools.lpk files and clicking the "Use/Install" button.


Pronunciation: To say the name "Xidel" in English, you say "excited" with a silent "C" and "D", followed by an "L". In German, you just say it as it is written.

Xidelscript

There is also a Greasemonkey script to automatically generate templates by selecting the interesting values on a webpage. The script intercepts the selection and marks the elements as shown in the screenshots below:

Selections that create a template to read the reddit frontpage (extracting title/author/time of each link)
Selections that create a template to read the newest stackoverflow question (votes/title/author are extracted for each of the questions)
Afterwards you can call Xidel with the templates and it will output the element that you have selected and related ones (e.g. in the screenshots the first reddit/stackoverflow post was selected, and Xidel will print all of them).

You can find the script in the mercurial repository or on userscripts.org with a detailed description. You need to change the name to "Webscraper / Xidelscript" (it is a multiscript that changes its behaviour depending on its name).



Contact

Author: Benito van der Zander, benito_NOSPAM_benibela.de, www.benibela.de