Xidel - HTML/XML/JSON data extraction tool

Xidel is a command line tool to download and extract data from HTML/XML pages as well as JSON APIs.

News

2018-04-02: New release 0.9.8
The 0.9.8 release improves cookie handling and module loading, adds pattern matching between sibling elements and functions to use multipage templates and the variable changelog dynamically, and includes minor bug fixes and performance improvements.
2016-11-20: New release 0.9.6
The 0.9.6 release brings new functions, performance improvements, bug fixes, and stricter default settings.
2016-06-08: New release 0.9.4
The 0.9.4 release completes the XPath/XQuery 3.0 support, implements the EXPath file module, uses a new regular expression library, and has various other improvements.
2016-02-25: New development snapshots
There is now a download folder with prereleases for upcoming versions. Also, the survey below is still running until the 1.0 version.
2015-08-16: Survey for later releases
There is now a survey running on whether the default language of Xidel should be XPath or XQuery (on Google Forms, so you need to log in). Both are Turing-complete, but have slightly incompatible string syntax, so the question is which one you prefer.
2015-06-28: New release
The 0.9 release adds support for most of the XPath/XQuery 3.0 syntax like anonymous and higher order functions, supports multipart HTTP requests for file uploads, changes the default output format, adds an (experimental) function for page modifications, fixes a large number of bugs mostly related to command line parsing and XPath/XQuery standard compatibility, and more...
2014-08-13: Minor release
The 0.8.4 version extends some standard XQuery expressions with pattern matching, adds options to set HTTP headers and read environment variables, and fixes some bugs...
2014-03-24: New release
The 0.8 release improves the JSONiq support and our own JSON extensions, adds arbitrary precision arithmetic, a trivial subset of XPath/XQuery 3, new functions for resolving uri or HTML hrefs, and more...
2013-03-26: New release
The 0.7 release adds JSONiq support, grouping of command line options, and new input/output formats, fixes HTML parsing/serialization, and changes the syntax of extended strings, among other things.
2012-11-06: New release
The 0.6 release adds XQuery support and the form and match functions, improves the Windows command-line interface, merges the two old cgi services into a single one, and fixes several interpreter bugs.
2012-09-05: Initial release of Xidel
First release of the VideLibri backend as a stand-alone command-line tool.

Features

It supports:

Examples

  1. Print all URLs found by a Google search:

    xidel https://www.google.de/search?q=test --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
  2. Print the title of all pages found by a Google search and download them:

    xidel https://www.google.de/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'
  3. Generally follow all links on a page and print the titles of the linked pages:
    • With XPath: xidel https://example.org -f //a -e //title
    • With CSS selectors: xidel https://example.org -f "css('a')" --css title
    • With pattern matching: xidel https://example.org -f "<a>{.}</a>*" -e "<title>{.}</title>"
  4. Another pattern matching example:

    If you have an example.xml file like <x><foo>ood</foo><bar>IMPORTANT!</bar></x>
    You can read the important part like: xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"
    (and this will also check if the element containing "ood" is there, and fail otherwise)
  5. Calculate something with XPath using arbitrary precision arithmetic:

    xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"
  6. Print all newest Stack Overflow questions with title and URL using pattern matching on their RSS feed:

    xidel http://stackoverflow.com/feeds -e "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"
  7. Print all Reddit comments of a user, with HTML and URL:

    xidel "https://www.reddit.com/user/username/" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></div></div></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"
  8. Check if your Reddit letter is red:
    • Webscraping, combining CSS, XPath, JSONiq, and automatic form evaluation:

      xidel https://reddit.com -f "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" -e "css('#mail')/@title"
    • Using the Reddit API:

      xidel -d "user=$your_username&passwd=$your_password&api_type=json" https://ssl.reddit.com/api/login --method GET 'https://www.reddit.com/api/me.json' -e '($json).data.has_mail'
  9. Use XQuery to create an HTML table of odd and even numbers:

    Windows cmd: xidel --xquery "<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then 'even' else 'odd'}</td></tr>}</table>" --output-format xml
    Linux/Powershell: xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml
    (Xidel itself supports ' and "-quotes on all platforms, but ' does not escape <> in Windows' cmd, and " does not escape $ in the Linux shells)
  10. Export variables to shell:

    Linux/bash: eval "$(xidel https://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"
    This sets the bash variable $title to the title of the page and $links becomes an array of all links there.

    Windows cmd: FOR /F "delims=" %%A IN ('xidel https://site -e "title:=//title" -e "links:=//a/@href" --output-format cmd') DO %%A
    This sets the batch variable %title% to the title of the page and %links% becomes an array of all links there.
  11. Reading JSON:
    • Read the 10th array element: xidel file.json -e '$json(10)'
    • Read all array elements: xidel file.json -e '$json()'
    • Read property "foo" and then "bar" with JSONiq notation: xidel file.json -e '$json("foo")("bar")'
    • Read property "foo" and then "bar" with dot notation: xidel file.json -e '($json).foo.bar'
    • Read property "foo" and then "bar" with XPath-like notation: xidel file.json -e '$json/foo/bar'
    • Mixed example: xidel file.json -e '$json("abc")()().xyz/(u,v)'
      This would read all the numbers from e.g. {"abc": [[{"xyz": {"u": 1, "v": 2}}], [{"xyz": {"u": 3}}, {"xyz": {"u": 4}} ]]}.
      All selectors are sequence-transparent, i.e. you can use the same selector to read something from one value as to read it from several values. Arrays are converted to sequences with ().
    Using XPath 3.1 syntax (requires Xidel 0.9.9):
    • Read the 10th array element: xidel file.json -e '$json?10'
    • Read all array elements: xidel file.json -e '$json?*'
    • Read property "foo" and then "bar" with 3.1 notation: xidel file.json -e '$json?foo?bar'
  12. Convert table rows and columns to a CSV-like format:

    xidel https://site -e '//tr / string-join(td, ",")'

    string-join((...)) can generally be used to output some values in a single line. In the example tr / string-join calls string-join for every row.
  13. Modify/Transform an HTML file, e.g. to mark all links as bold (requires Xidel 0.9.9):

    Windows cmd:
    xidel --html your-file.html --xquery "x:replace-nodes(/, //a, function($e) {
       $e/<a style='{string-join((@style, 'font-weight: bold'), '; ')}'>{@* except @style, node()}</a>
    })" > your-output-file.html
    Linux/Powershell:
    xidel --html your-file.html --xquery 'x:replace-nodes(/, //a, function($e) { 
       $e/<a style="{string-join((@style, "font-weight: bold"), "; ")}">{@* except @style, node()}</a> 
    })' > your-output-file.html

    This example combines three important syntaxes:
    • x:replace-nodes(/, //a, function($e) { .. }): This applies an anonymous function to every a-element (link) in the HTML document, whereby that element is stored in the variable $e and is replaced by the return value of the function.
    • <a>{@* except @style, node()}</a> : This creates a new a-element that has the same children, descendants and attributes as the current element, but removes the style-attribute.
    • style="{string-join((@style, "font-weight: bold"), "; ")}": This creates a new style-attribute by appending "font-weight: bold" to the old value of the attribute. A separating "; " is inserted, if (and only if) that attribute already existed.
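
Example 10 leaves $title and $links in the calling bash session. A minimal sketch of how those variables might be consumed afterwards is shown below; print_page_summary is a hypothetical helper name, and it assumes that $title is a plain string and $links a bash array, as described in that example.

```shell
#!/usr/bin/env bash
# Sketch only: print_page_summary is a hypothetical helper, not part of Xidel.
# It assumes the variables were set beforehand, e.g. with the line from example 10:
#   eval "$(xidel https://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"

print_page_summary() {
  # $title is expected to be a string, $links a bash array.
  printf 'Title: %s\n' "$title"
  printf 'Found %d links\n' "${#links[@]}"
  local link
  # Quoting "${links[@]}" keeps links containing spaces intact.
  for link in "${links[@]}"; do
    printf '  %s\n' "$link"
  done
}
```

Since bash treats a plain string as a one-element array in "${links[@]}", the loop also works if only a single value was assigned.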

Documentation

There is various documentation available:

Downloads

The last official release is Xidel 0.9.8, but a Xidel 0.9.9 development version is published irregularly as a preview for the next release, in one set of builds for Windows, Linux (>= Ubuntu 20.10), and Android, and another for Windows, Linux (Ubuntu 20.04), and Mac. It is recommended to use the 0.9.9 version, since it contains bug fixes, is more performant, and partially supports XPath/XQuery 3.1. In that version, most of the JSONiq syntax has been replaced by the XPath 3.1 JSON syntax. It will be published officially once all of XPath/XQuery 3.1 is implemented.

The following Xidel downloads are available on the SourceForge download page:

Operating System | Filename | Size | SHA-256
Windows: 32 Bit | xidel-0.9.8.win32.zip | 840.0 kB | 96854c2be1e3755f56fabb8f00d1fe567108461b9fab139039219a1b7c17e382
Windows: 32 Bit (needs OpenSSL) | xidel-0.9.8-openssl.win32.zip | 873.6 kB | 1b9f3e78897727fe3ea2a359ec9678d0b2e593792a3c10c468bec60d7a873b59
Universal Linux: 32 Bit | xidel-0.9.8.linux32.tar.gz | 848.5 kB | dcc80b3a1dbf437c98d94c8dcd9b4af5f709174892bf926f36ea8dd5cb55aaec
Universal Linux: 64 Bit | xidel-0.9.8.linux64.tar.gz | 1.3 MB | cf6d7391a73dbadf7c74e22206ea3f9f4f77f77d0e9d6e32d15ec400b1b843ef
Debian: 32 Bit | xidel_0.9.8-1_i386.deb | 665.1 kB | 8329c02512da430ef1f40f77e2676539a146b258c7201375337e7de8f4e16b2c
Debian: 64 Bit | xidel_0.9.8-1_amd64.deb | 991.9 kB | f6a6e29b77547d5ae38383440bd653b3eaf9eeb470def14cc48154a4f6925f69
Android ARM | xidel-0.9.8.androidarm.tar.gz | 2.1 MB | 3d19cf5e9a5bf9314e251aa14e0ac990fdd290aa5bfad9e0e5c6956800365fb5
Source | xidel-0.9.8.src.tar.gz | 1.9 MB | 72b5b1a2fc44a0a61831e268c45bc6a6c28e3533b5445151bfbdeaf1562af39c
Mac 10.8 | externally built old Xidel 0.9.6 and compile instructions
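
The SHA-256 values listed above can be used to check a download before unpacking it. The following is a minimal sketch of such a check; verify_sha256 is a hypothetical helper name, and it assumes the sha256sum tool from GNU coreutils is available (on other systems a different tool may be needed).

```shell
#!/usr/bin/env bash
# Sketch only: verify_sha256 is a hypothetical helper, not shipped with Xidel.
# Assumes sha256sum (GNU coreutils) is installed.

verify_sha256() {
  local file=$1 expected=$2 actual
  # Reading from stdin makes sha256sum print "<hash>  -"; keep only the hash.
  actual=$(sha256sum < "$file" | cut -d ' ' -f 1)
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got '$actual')" >&2
    return 1
  fi
}
```

For the 64 Bit Linux archive this could be called as verify_sha256 xidel-0.9.8.linux64.tar.gz cf6d7391a73dbadf7c74e22206ea3f9f4f77f77d0e9d6e32d15ec400b1b843ef, and the archive extracted only if the function succeeds.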

Usually, you can just extract the archive (or install the deb package) and call Xidel, or copy it to some place in your PATH,
because it consists of a single binary without any external dependencies except the standard system libraries (i.e. the Windows API or libc, respectively).
However, for HTTPS connections on Linux, OpenSSL (including openssl-dev) and libcrypto are also required. For Unicode collations, libicu is required.

There are already nightly preview builds of the next version, Xidel 0.9.9. There are builds compiled on a build server (on Ubuntu 20.04; for Linux GLIBC < 2.34, Windows and Mac) and compiled locally (on Ubuntu 21.10; for Linux GLIBC >= 2.34, Android, and Windows).

You can also test it online on a webpage or directly by sending a request to the cgi service like https://www.videlibri.de/cgi-bin/xidelcgi?data=<html><title>foobar</title></html>&extract=//title&raw=true.
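
When the cgi service is called from a shell, the characters <, > and / in the data and extract parameters generally need to be percent-encoded. The sketch below builds such a URL under that assumption; urlencode and xidelcgi_url are hypothetical helper names, the encoder is only intended for ASCII input, and only the parameters named above (data, extract, raw) are used.

```shell
#!/usr/bin/env bash
# Sketch only: urlencode and xidelcgi_url are hypothetical helpers.
# The encoder is meant for ASCII input; other input is untested.

urlencode() {
  local LC_ALL=C s=$1 out='' c i
  for ((i = 0; i < ${#s}; i++)); do
    c=${s:i:1}
    case $c in
      # Unreserved characters are copied unchanged.
      [a-zA-Z0-9.~_-]) out+=$c ;;
      # Everything else becomes %XX (hex code of the character).
      *) printf -v c '%%%02X' "'$c"; out+=$c ;;
    esac
  done
  printf '%s\n' "$out"
}

xidelcgi_url() {
  local data=$1 extract=$2
  printf 'https://www.videlibri.de/cgi-bin/xidelcgi?data=%s&extract=%s&raw=true\n' \
    "$(urlencode "$data")" "$(urlencode "$extract")"
}
```

The resulting URL could then be passed to any HTTP client, for example curl "$(xidelcgi_url '<html><title>foobar</title></html>' '//title')".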


The source history is stored in a Mercurial repository together with the VideLibri source and dependencies, licensed as GPLv3+. There are mirrors on GitHub and GitLab. These mirrors have the Xidel source only, so in order to compile it you need to download the dependencies from their own repositories first. Alternatively, use the above source tarball, which also contains the dependencies.

The source then needs to be compiled with FreePascal.
In a Unix-like shell you compile it by calling ./build.sh, which just calls FreePascal. If you want to call FreePascal directly yourself, you can use fpc xidel.pas, in which case you need to pass the paths to all directories of the source using the -Fu and -Fi options.
Alternatively, Xidel can be compiled using the Lazarus IDE. To do so, install components/pascal/internettools.lpk in Lazarus, then open programs/internet/xidel/xidel.lpi and click on Run\Compile.



Pronunciation: To say the name "Xidel" in English, you say "excited" with a silent "C" and "D", followed by an "L". In German, you just say it as it is written.

Contact

You can join the Xidel mailing list, have discussions on SourceForge or follow @bibliothekapp.

Author: Benito van der Zander, benito_NOSPAM_benibela.de, www.benibela.de
(Please do not ask me how to scrape your website. Ask how to do something with Xidel instead. I know Xidel, I do not know your website. The point of the tool is to make it easy for anyone to parse any webpage. Scraping every webpage myself does not scale well.)