Xidel - HTML/XML/JSON data extraction tool

Xidel is a command-line tool to download and extract data from HTML/XML pages as well as JSON APIs.

News

2018-04-02: New release 0.9.8
The 0.9.8 release improves cookie handling and module loading, adds pattern matching between sibling elements and functions to use multipage templates and the variable changelog dynamically, and includes minor bug fixes and performance improvements.
2016-11-20: New release 0.9.6
The 0.9.6 release brings new functions, performance improvements, bug fixes, and stricter default settings.
2016-06-08: New release 0.9.4
The 0.9.4 release completes the XPath/XQuery 3.0 support, implements the EXPath file module, uses a new regular expression library and has various other improvements.
2016-02-25: New development snapshots
There is now a download folder with prereleases for upcoming versions. Also, the survey below is still running until version 1.0.
2015-08-16: Survey for later releases
There is now a survey running on whether the default language of Xidel should be XPath or XQuery (on Google Forms, so you need to log in). Both are Turing-complete, but have slightly incompatible string syntax, so the question is which one you prefer.
2015-06-28: New release
The 0.9 release adds support for most of the XPath/XQuery 3.0 syntax like anonymous and higher order functions, supports multipart HTTP requests for file uploads, changes the default output format, adds an (experimental) function for page modifications, fixes a large number of bugs mostly related to command line parsing and XPath/XQuery standard compatibility, and more...
2014-08-13: Minor release
The 0.8.4 version extends some standard XQuery expressions with pattern matching, adds options to set HTTP headers and read environment variables, and fixes some bugs...
2014-03-24: New release
The 0.8 release improves the JSONiq support and our own JSON extensions, adds arbitrary precision arithmetic, a trivial subset of XPath/XQuery 3, new functions for resolving URIs or HTML hrefs, and more...
2013-03-26: New release
The 0.7 release adds JSONiq support, grouping of command-line options and new input/output formats, fixes HTML parsing/serialization, changes the syntax of extended strings, and makes some other changes.
2012-11-06: New release
The 0.6 release adds XQuery support and the form and match functions, improves the Windows command-line interface, merges the two old CGI services into a single one and fixes several interpreter bugs.
2012-09-05: Initial release of Xidel
First release of the VideLibri backend as a stand-alone command-line tool.

Features

It supports:

Examples

  1. Print all URLs found by a Google search:

    xidel http://www.google.de/search?q=test --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
  2. Print the titles of all pages found by a Google search and download them:

    xidel http://www.google.de/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'
  3. Generally follow all links on a page and print the titles of the linked pages:
  4. Another template example:

    If you have an example.xml file like <x><foo>ood</foo><bar>IMPORTANT!</bar></x>
    you can read the important part like this: xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"
    (This also checks that the element containing "ood" is present, and fails otherwise.)
  5. Calculate something with XPath using arbitrary precision arithmetic:

    xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"
  6. Print the newest Stack Overflow questions with title and URL:

    xidel http://stackoverflow.com -e "<A class='question-hyperlink'>{title:=text(), url:=@href}</A>*"
  7. Print all reddit comments of a user, with HTML and URL:

    xidel "http://www.reddit.com/user/username/" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></div></div></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"
  8. Check if your reddit letter is red:
  9. Use XQuery to create an HTML table of odd and even numbers:

    Windows cmd: xidel --xquery "<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then 'even' else 'odd'}</td></tr>}</table>" --output-format xml
    Linux/Powershell: xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml
    (Xidel itself supports ' and "-quotes on all platforms, but ' does not escape <> in Windows' cmd, and " does not escape $ in the Linux shells)
  10. Export variables to the shell:

    Linux/bash: eval "$(xidel http://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"
    This sets the bash variable $title to the title of the page and $links becomes an array of all links there.

    Windows cmd: FOR /F "delims=" %%A IN ('xidel http://site -e "title:=//title" -e "links:=//a/@href" --output-format cmd') DO %%A
    This sets the batch variable %title% to the title of the page and %links% becomes an array of all links there.
  11. Reading JSON:

  12. Convert table rows and columns to a CSV-like format:

    xidel http://site -e '//tr / join(td, ",")'

    join((...)) can generally be used to output some values in a single line. The function name is an abbreviation for the XPath function string-join. In the example tr / join calls join for every row.
  13. Modify/Transform an HTML file, e.g. to mark all links as bold:

    Windows cmd:
    xidel --html your-file.html --xquery "transform(/, function($e) { 
       $e / if (name() = 'a') then 
               <a style='{join((@style, 'font-weight: bold'), '; ')}'>{@* except @style, node()}</a> 
            else .
    })" > your-output-file.html 
    Linux/Powershell:
    xidel --html your-file.html --xquery 'transform(/, function($e) { 
       $e / if (name() = "a") then 
               <a style="{join((@style, "font-weight: bold"), "; ")}">{@* except @style, node()}</a> 
            else .
    })' > your-output-file.html

    This example combines three important syntaxes:

More examples are in the Wiki and the directory "example". You may also want to read the readme file of Xidel, the complete list of available functions, and the documentation of my template language and XPath/XQuery 3.0 library. Or look at its results on the XQuery Testsuite.

Screenshots

Xidel on Linux
Xidel on Windows

Downloads

The following Xidel downloads are available on the SourceForge download page:

Operating System                 Filename                       Size      SHA-1
Windows: 32 Bit                  xidel-0.9.8.win32.zip          839.7 kB  4ac0c11307fd9e6e5f2cc6e63b45fd87449b300c
Windows: 32 Bit (needs OpenSSL)  xidel-0.9.8-openssl.win32.zip  873.7 kB  e6b3962530ecb41dbd4bc94fb11345eb0d063ef3
Universal Linux: 32 Bit          xidel-0.9.8.linux32.tar.gz     848.4 kB  58a28c87ce18afcd7a80fa22b3408ffd64ccaba1
Universal Linux: 64 Bit          xidel-0.9.8.linux64.tar.gz     1.3 MB    90edeab3b80df0ef162031501c5f126f746aee13
Debian: 32 Bit                   xidel_0.9.8-1_i386.deb         665.2 kB  f94a637a99382418c88be28442ab50d8224425ca
Debian: 64 Bit                   xidel_0.9.8-1_amd64.deb        991.7 kB  b738a4782f10b29ca17f9a9be060aea3c2c246bc
Android ARM                      xidel-0.9.8.androidarm.tar.gz  2.1 MB    3952e942723e2803f325219423860a2f9e28d2fd
Source                           xidel-0.9.8.src.tar.gz         1.9 MB    a1a48f17af24a7832d65b238e09e3fd3e86ae2aa
Mac 10.8                         externally prebuilt version and compile instructions

Usually you can just extract the zip/deb and call Xidel, or copy it to some place in your PATH,
because it consists of a single binary without any external dependencies, except the standard system libraries (i.e. Windows API or respectively libc).
However, HTTPS connections on Linux also require OpenSSL (including openssl-dev) and libcrypto.
Development snapshots are occasionally provided in a special download folder.
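
The SHA-1 values listed for the downloads can be used to check an archive before unpacking it. The following is only a minimal sketch, assuming GNU sha1sum is installed (on macOS, shasum -a 1 would be the usual substitute); the helper name verify_sha1 is made up for this illustration and is not part of Xidel.

```shell
#!/bin/sh
# verify_sha1 FILE EXPECTED_HEX
# Hypothetical helper (not shipped with Xidel): succeeds only if the
# SHA-1 checksum of FILE equals EXPECTED_HEX.
verify_sha1() {
  file=$1
  expected=$2
  # sha1sum prints "<hash>  <filename>"; keep only the hash field.
  actual=$(sha1sum "$file" | cut -d ' ' -f 1)
  [ "$actual" = "$expected" ]
}

# Possible usage with the 64-bit Linux archive from the download list:
#   verify_sha1 xidel-0.9.8.linux64.tar.gz 90edeab3b80df0ef162031501c5f126f746aee13 \
#     && tar -xzf xidel-0.9.8.linux64.tar.gz
```

The archive is only unpacked if the checksum matches; a missing file or a mismatch makes the helper return a non-zero status.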

You can also test it online on a webpage or directly by sending a request to the cgi service like http://www.videlibri.de/cgi-bin/xidelcgi?data=<html><title>foobar</title></html>&extract=//title&raw=true.
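
When such a request is sent from a script, characters like < and > in the data parameter should be percent-encoded. The following sketch lets curl do the encoding. It assumes that the CGI service is still reachable at the address above and accepts the data, extract and raw parameters as a GET query, as the example URL suggests; the function name xidel_cgi is hypothetical.

```shell
#!/bin/sh
# xidel_cgi HTML XPATH
# Hypothetical wrapper: sends HTML and an extract expression to the
# Xidel CGI service and prints the raw response.
xidel_cgi() {
  # --get appends the parameters to the URL as a query string;
  # --data-urlencode percent-encodes each value after the "=".
  curl --silent --get \
    --data-urlencode "data=$1" \
    --data-urlencode "extract=$2" \
    --data-urlencode "raw=true" \
    "http://www.videlibri.de/cgi-bin/xidelcgi"
}

# Possible usage (requires network access):
#   xidel_cgi '<html><title>foobar</title></html>' '//title'
```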


The source history is stored in a Mercurial repository together with the VideLibri source and dependencies, licensed as GPLv3+. There are mirrors on GitHub, Bitbucket and GitLab. These mirrors contain the Xidel source only, so to compile it you first need to download the dependencies from their own repositories. Alternatively, use the source tarball above, which also contains the dependencies.

The source then needs to be compiled with FreePascal.
In a Unix-like shell you compile it by calling ./build.sh, which just calls FreePascal. If you want to call FreePascal directly yourself, you can use fpc xidel.pas, in which case you need to pass the paths to all directories of the source using the -Fu and -Fi options.
Alternatively, Xidel can be compiled using the Lazarus IDE. For this, install components/pascal/internettools.lpk and components/pascal/internettools_utf8.lpk in Lazarus, then open programs/internet/xidel/xidel.lpi and click on Run\Compile.
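
For the direct fpc call, the -Fu and -Fi options can be generated instead of typed by hand. This is only a sketch: it assumes that every directory below the source root should be searched for both units and include files, and that no path contains whitespace. The helper name fpc_path_flags is invented for this example; build.sh remains the supported way to compile.

```shell
#!/bin/sh
# fpc_path_flags ROOT
# Hypothetical helper: prints one -Fu (unit path) and one -Fi (include
# path) option for ROOT and for every directory below it.
fpc_path_flags() {
  find "$1" -type d | sort | while IFS= read -r dir; do
    printf '%s\n' "-Fu$dir" "-Fi$dir"
  done
}

# Possible usage from the directory that contains xidel.pas, with
# SRC_ROOT set to the top of the unpacked source (an assumption; adjust
# it to your layout). The unquoted substitution relies on word
# splitting, hence the no-whitespace assumption:
#   fpc $(fpc_path_flags "$SRC_ROOT") xidel.pas
```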



Pronunciation: To say the name "Xidel" in English, you say "excited" with a silent "C" and "D", followed by an "L". In German, you just say it as it is written.

Xidelscript

There is also a Greasemonkey 3 script to automatically generate templates by selecting the interesting values on a webpage. The script intercepts the selection and marks the elements as shown in the screenshots below:

Selections that create a template to read the reddit frontpage (title/author/time are extracted for each link)
Selections that create a template to read the newest stackoverflow questions (votes/title/author are extracted for each question)
Afterwards you can call Xidel with the templates, and it will output the elements that you have selected and related ones (e.g. in the screenshots the first reddit/stackoverflow post was selected, and Xidel will print all of them).

The script was written for Greasemonkey 3. Beware that it will not work properly in Greasemonkey 4.
You can find the script in the Mercurial repository or on userscripts.org (mirrored, as userscripts is dead) with a detailed description. You need to change the name to "Webscraper / Xidelscript" (it is a multiscript that changes its behaviour depending on its name).



Contact

You can join the Xidel mailing list or have discussions on SourceForge.

Author: Benito van der Zander, benito_NOSPAM_benibela.de, www.benibela.de
(Please do not ask me how to scrape your website. Ask how to do something with Xidel instead. I know Xidel, I do not know your website. The point of the tool is to make it easy for anyone to parse any webpage. Scraping every webpage myself does not scale well.)