Xidel - HTML/XML/JSON data extraction tool

2018-04-02: New release 0.9.8: The 0.9.8 release improves cookie handling and module loading, adds pattern matching between sibling elements and functions to use multipage templates and the variable changelog dynamically, and includes minor bug fixes and performance improvements.
2016-11-20: New release 0.9.6: The 0.9.6 release brings new functions, performance improvements, bug fixes, and stricter default settings.
2016-06-08: New release 0.9.4: The 0.9.4 release completes the XPath/XQuery 3.0 support, implements the EXPath file module, uses a new regular expression library, and has various other improvements.
2016-02-25: New development snapshots: There is now a download folder with prereleases for upcoming versions. Also, the survey below will keep running until version 1.0.

2015-08-16: Survey for later releases: There is now a survey running on whether the default language of Xidel should be XPath or XQuery (on Google Forms, so you need to log in). Both are Turing-complete, but have slightly incompatible string syntax, so the question is which one you prefer.
2015-06-28: New release: The 0.9 release adds support for most of the XPath/XQuery 3.0 syntax like anonymous and higher order functions, supports multipart HTTP requests for file uploads, changes the default output format, adds an (experimental) function for page modifications, fixes a large number of bugs mostly related to command line parsing and XPath/XQuery standard compatibility, and more...
2014-08-13: Minor release: The 0.8.4 version extends some standard XQuery expressions with pattern matching, adds options to set HTTP headers and read environment variables, and fixes some bugs...
2014-03-24: New release: The 0.8 release improves the JSONiq support and our own JSON extensions, adds arbitrary precision arithmetic, a trivial subset of XPath/XQuery 3, new functions for resolving uri or HTML hrefs, and more...
2013-03-26: New release: The 0.7 release adds JSONiq support, grouping of command line options, and new input/output formats, fixes HTML parsing/serialization, changes the syntax of extended strings, and makes some other changes.
2012-11-06: New release: The 0.6 release adds XQuery support and the form and match functions, improves the Windows command-line interface, merges the two old cgi services into a single one, and fixes several interpreter bugs.
2012-09-05: Initial release of Xidel: First release of the VideLibri backend as stand-alone command-line tool


Features

It supports:

Extract expressions:
- CSS 3 Selectors: to extract elements unchanged
- XPath 3.0: to extract values and calculate things with them.
- XQuery 3.0: to create new documents from the extracted values and to build Turing-complete scripts.
- Pattern matching: to extract several expressions in an easy way using an annotated version of the input page for pattern-matching.
- XPath 2.0/XQuery 1.0: compatibility mode for old XPath/XQuery versions.
- JSONiq: to work with JSON APIs (deprecated by XPath 3.1)
Following:
- HTTP Codes: Redirections like 30x are automatically followed, while keeping things like cookies.
- Links: It can follow (all) links on a page, meta refreshes, or any extracted value.
- HTML Forms: It can fill in arbitrary data in the input elements and submit the form.
- Arbitrary HTTP requests: In any query, you can call a function to make other requests.
Output formats:
- Adhoc: just prints the data in a human-readable format.
- XML: encodes the data as XML.
- HTML: encodes the data as HTML.
- JSON: encodes the data as JSON.
- bash/cmd: exports the data as shell variables.
- fn:serialize: implements the W3C XQuery Serialization standard.
Connections: HTTP / HTTPS as well as local files or stdin.
Systems: Windows (using wininet), Linux, Mac OSX, and Android (using synapse+OpenSSL)

Examples

Print all URLs found by a Google search:

xidel https://www.google.de/search?q=test --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
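
To make the regular expression in this command easier to follow, here is a minimal Python sketch of the same extraction. It assumes that extract returns capture group 1 of the first match and an empty string when nothing matches, which is what the [. != ''] filter suggests. The helper name extract_targets and the sample href values are made up for illustration.

```python
import re

# Same regular expression as in the xidel command above.
PATTERN = re.compile(r"url[?]q=([^&]+)&")


def extract_targets(hrefs):
    """Return capture group 1 for every href that matches, skipping the rest."""
    results = []
    for href in hrefs:
        match = PATTERN.search(href)
        if match:  # plays the role of the [. != ''] filter
            results.append(match.group(1))
    return results


# Hypothetical href values, not taken from a real result page.
sample_hrefs = [
    "/url?q=https://example.org/&sa=U",
    "/preferences?hl=de",
    "/url?q=https://example.com/page&sa=U",
]
print(extract_targets(sample_hrefs))
# ['https://example.org/', 'https://example.com/page']
```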
Print the title of all pages found by a Google search and download them:

xidel https://www.google.de/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'
Generally follow all links on a page and print the titles of the linked pages:
- With XPath: xidel https://example.org -f //a -e //title
- With CSS selectors: xidel https://example.org -f "css('a')" --css title
- With pattern matching: xidel https://example.org -f "<a>{.}</a>*" -e "<title>{.}</title>"
Another pattern matching example:

If you have an example.xml file like <x><foo>ood</foo><bar>IMPORTANT!</bar></x>
You can read the important part like: xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"
(this will also check that the element containing "ood" is there, and fail otherwise)
Calculate something with XPath using arbitrary precision arithmetic:

xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"
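
For comparison, the following Python sketch evaluates the same sum once with the standard decimal module and once with ordinary floating point. It only illustrates why arbitrary precision matters for this expression; it says nothing about how Xidel implements it internally.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # more than enough digits for this sum

# Exact decimal arithmetic: the fractional digits survive.
exact = (1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + Decimal("7.000000008")

# Binary floating point: a double cannot hold all 22 significant digits.
as_float = (1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008

print(exact)  # 6000000000022.000000008
print(Decimal(as_float) == exact)  # False
```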
Print all newest Stackoverflow questions with title and url using pattern matching on their RSS feed:

xidel http://stackoverflow.com/feeds -e "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"
Print all Reddit comments of a user, with HTML and URL:

xidel "https://www.reddit.com/user/username/" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></div></div></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"
Check if your Reddit letter is red:
- Webscraping, combining CSS, XPath, JSONiq, and automatic form evaluation:
  
  xidel https://reddit.com -f "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" -e "css('#mail')/@title"
- Using the Reddit API:
  
  xidel -d "user=$your_username&passwd=$your_password&api_type=json" https://ssl.reddit.com/api/login --method GET 'https://www.reddit.com/api/me.json' -e '($json).data.has_mail'
Use XQuery to create an HTML table of odd and even numbers:

Windows cmd: xidel --xquery "<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then 'even' else 'odd'}</td></tr>}</table>" --output-format xml
Linux/Powershell: xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml
(Xidel itself supports ' and "-quotes on all platforms, but ' does not escape <> in Windows' cmd, and " does not escape $ in the Linux shells)
Export variables to shell

Linux/bash: eval "$(xidel https://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"
This sets the bash variable $title to the title of the page and $links becomes an array of all links there.

Windows cmd: FOR /F "delims=" %%A IN ('xidel https://site -e "title:=//title" -e "links:=//a/@href" --output-format cmd') DO %%A
This sets the batch variable %title% to the title of the page and %links% becomes an array of all links there.
Reading JSON:
- Read the 10th array element: xidel file.json -e '$json(10)'
- Read all array elements: xidel file.json -e '$json()'
- Read property "foo" and then "bar" with JSONiq notation: xidel file.json -e '$json("foo")("bar")'
- Read property "foo" and then "bar" with dot notation: xidel file.json -e '($json).foo.bar'
- Read property "foo" and then "bar" with XPath-like notation: xidel file.json -e '$json/foo/bar'
- Mixed example: xidel file.json -e '$json("abc")()().xyz/(u,v)'
  
  This would read all the numbers from e.g. {"abc": [[{"xyz": {"u": 1, "v": 2}}], [{"xyz": {"u": 3}}, {"xyz": {"u": 4}} ]]}.
  All selectors are sequence-transparent, i.e. you can use the same selector to read something from one value as to read it from several values. Arrays are converted to sequences with ()
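
  The mixed example can be traced step by step with a small Python emulation. This is only a sketch of the sequence-transparent behaviour described above, not Xidel's implementation; the helpers members and prop are hypothetical names introduced for illustration.

```python
import json


def members(values):
    """Emulate (): replace every array in the sequence by its members."""
    return [m for v in values if isinstance(v, list) for m in v]


def prop(values, name):
    """Emulate a property lookup on every object in the sequence."""
    return [v[name] for v in values if isinstance(v, dict) and name in v]


doc = json.loads(
    '{"abc": [[{"xyz": {"u": 1, "v": 2}}], [{"xyz": {"u": 3}}, {"xyz": {"u": 4}}]]}'
)

seq = prop([doc], "abc")     # $json("abc")  -> the outer array
seq = members(members(seq))  # ()()          -> the three inner objects
seq = prop(seq, "xyz")       # .xyz          -> three objects with u and/or v
# /(u,v): for each object, look up u and then v, skipping missing properties
numbers = [n for obj in seq for key in ("u", "v") for n in prop([obj], key)]

print(numbers)  # [1, 2, 3, 4]
```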
Using XPath 3.1 syntax (requires Xidel 0.9.9):
- Read the 10th array element: xidel file.json -e '$json?10'
- Read all array elements: xidel file.json -e '$json?*'
- Read property "foo" and then "bar" with 3.1 notation: xidel file.json -e '$json?foo?bar'
Convert table rows and columns to a CSV-like format:

xidel https://site -e '//tr / string-join(td, ",")'

string-join((...)) can generally be used to output some values in a single line. In the example, tr / string-join calls string-join once for every row.
Modify/Transform an HTML file, e.g. to mark all links as bold (requires Xidel 0.9.9):

Windows cmd: xidel --html your-file.html --xquery "x:replace-nodes(/, //a, function($e) { $e/<a style='{string-join((@style, 'font-weight: bold'), '; ')}'>{@* except @style, node()}</a> })" > your-output-file.html
Linux/Powershell: xidel --html your-file.html --xquery 'x:replace-nodes(/, //a, function($e) { $e/<a style="{string-join((@style, "font-weight: bold"), "; ")}">{@* except @style, node()}</a> })' > your-output-file.html
This example combines three important syntaxes:
- x:replace-nodes(/, //a, function($e) { .. }): This applies an anonymous function to every a-element (link) in the HTML document, whereby that element is stored in the variable $e and is replaced by the return value of the function.
- <a>{@* except @style, node()}</a> : This creates a new a-element that has the same children, descendants and attributes as the current element, but removes the style-attribute.
- style="{string-join((@style, "font-weight: bold"), "; ")}": This creates a new style-attribute by appending "font-weight: bold" to the old value of the attribute. A separating "; " is inserted, if (and only if) that attribute already existed.
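
The separator behaviour of string-join in the last point can be mimicked in Python. This is a sketch under the assumption that a missing @style attribute contributes nothing to the joined sequence; merged_style is a hypothetical helper name.

```python
def merged_style(old_style):
    """Join the old style value (if any) with the bold declaration."""
    # A missing attribute (None) is dropped, so no separator is inserted for it.
    parts = [p for p in (old_style, "font-weight: bold") if p is not None]
    return "; ".join(parts)


print(merged_style(None))          # font-weight: bold
print(merged_style("color: red"))  # color: red; font-weight: bold
```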

Documentation

There is various documentation available:

List of available functions with examples,
XPath 3.1 standard explaining the basic query syntax for JSON and XML,
XQuery 3.1 standard explaining the more advanced query syntax like custom functions, iterating with grouping or partitioning, and XML constructors,
JSONiq specification explaining the deprecated JSON query syntax,
XPath/XQuery data model standard explaining the different types,
readme file included with Xidel explaining the command line arguments, e.g. input and output settings,
the XML/HTML pattern matching syntax as documented in the internal Pascal library,
the internally used Pascal XPath/XQuery library,
XQuery wikibook a general introduction to XQuery, not Xidel specific,
a Wiki,
XPath/XQuery Test Suite results.

Screenshots

Downloads

The last official release is Xidel 0.9.8, but a Xidel 0.9.9 development version is published irregularly as a preview for the next release, in two sets of builds: one for Windows, Linux (>= Ubuntu 20.10), and Android, and one for Windows, Linux (Ubuntu 20.04), and Mac. It is recommended to use the 0.9.9 version, since it contains bug fixes, is more performant, and partially supports XPath/XQuery 3.1. In 0.9.9, most of the JSONiq syntax has been replaced by the XPath 3.1 JSON syntax. It will be published officially once all of XPath/XQuery 3.1 is implemented.

The following Xidel downloads are available on the sourceforge download page:

Operating System	Filename	Size	SHA-256
Windows: 32 Bit	xidel-0.9.8.win32.zip	840.0 kB	96854c2be1e3755f56fabb8f00d1fe567108461b9fab139039219a1b7c17e382
Windows: 32 Bit (needs OpenSSL)	xidel-0.9.8-openssl.win32.zip	873.6 kB	1b9f3e78897727fe3ea2a359ec9678d0b2e593792a3c10c468bec60d7a873b59
Universal Linux: 32 Bit	xidel-0.9.8.linux32.tar.gz	848.5 kB	dcc80b3a1dbf437c98d94c8dcd9b4af5f709174892bf926f36ea8dd5cb55aaec
Universal Linux: 64 Bit	xidel-0.9.8.linux64.tar.gz	1.3 MB	cf6d7391a73dbadf7c74e22206ea3f9f4f77f77d0e9d6e32d15ec400b1b843ef
Debian: 32 Bit	xidel_0.9.8-1_i386.deb	665.1 kB	8329c02512da430ef1f40f77e2676539a146b258c7201375337e7de8f4e16b2c
Debian: 64 Bit	xidel_0.9.8-1_amd64.deb	991.9 kB	f6a6e29b77547d5ae38383440bd653b3eaf9eeb470def14cc48154a4f6925f69
Android ARM:	xidel-0.9.8.androidarm.tar.gz	2.1 MB	3d19cf5e9a5bf9314e251aa14e0ac990fdd290aa5bfad9e0e5c6956800365fb5
Source:	xidel-0.9.8.src.tar.gz	1.9 MB	72b5b1a2fc44a0a61831e268c45bc6a6c28e3533b5445151bfbdeaf1562af39c
Mac 10.8	externally built old Xidel 0.9.6 and compile instructions.

Usually, you can just extract the zip/deb and call Xidel, or copy it to some place in your PATH,
because it consists of a single binary without any external dependencies, except the standard system libraries (i.e. Windows API or respectively libc).
However, for HTTPS connections on Linux, OpenSSL (including openssl-dev) and libcrypto are also required. For Unicode collations, libicu is required.
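
Before extracting an archive, a download can be compared against the SHA-256 column of the table above. The following Python sketch shows one way to do that with the standard hashlib module. The function names are hypothetical, and the file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with a checksum copied from the table."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_checksum("xidel-0.9.8.linux64.tar.gz", "cf6d7391a73dbadf7c74e22206ea3f9f4f77f77d0e9d6e32d15ec400b1b843ef") should return True for an intact copy of the 64 Bit Linux archive.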

There are already nightly preview builds of the next version, Xidel 0.9.9. There are builds compiled on a build server (on Ubuntu 20.04; for Linux GLIBC < 2.34, Windows and Mac) and compiled locally (on Ubuntu 21.10; for Linux GLIBC >= 2.34, Android, and Windows).

You can also test it online on a webpage or directly by sending a request to the cgi service like https://www.videlibri.de/cgi-bin/xidelcgi?data=<html><title>foobar</title></html>&extract=//title&raw=true.
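
When calling the cgi service from a script, the HTML and the query should be percent-encoded, since characters like < and & are not safe in a URL. Here is a minimal Python sketch that builds such a URL using only the data, extract, and raw parameters shown above. The helper name build_cgi_url is hypothetical, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

CGI_ENDPOINT = "https://www.videlibri.de/cgi-bin/xidelcgi"


def build_cgi_url(data, extract, raw=True):
    """Build a percent-encoded request URL for the cgi service."""
    params = {"data": data, "extract": extract}
    if raw:
        params["raw"] = "true"
    return CGI_ENDPOINT + "?" + urlencode(params)


print(build_cgi_url("<html><title>foobar</title></html>", "//title"))
```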

The source history is stored in a mercurial repository together with the VideLibri source and dependencies, licensed as GPLv3+. There are mirrors on GitHub and GitLab. These mirrors have the Xidel source only, so to compile it you first need to download the dependencies from their own repositories. Alternatively, use the above source tarball, which also contains the dependencies.

The source then needs to be compiled with FreePascal.
In a Unix-like shell, you compile it by calling ./build.sh, which just calls FreePascal. If you want to call FreePascal directly yourself, you can use fpc xidel.pas, in which case you need to pass the paths to all directories of the source using the -Fu and -Fi options.
Alternatively, Xidel can be compiled using the Lazarus IDE. For this, install components/pascal/internettools.lpk in Lazarus, then open programs/internet/xidel/xidel.lpi and click on Run\Compile.

Pronunciation: To say the name "Xidel" in English, you say "excited" with a silent "C" and "D", followed by an "L". In German, you just say it as it is written.