
DjVu, an open PDF alternative

 5 years ago
source link: https://www.tuicool.com/articles/hit/Q3ieaie
DjVu

Filename extensions: .djvu, .djv
Internet media type: image/vnd.djvu, image/x-djvu
Developed by: AT&T Labs – Research
Initial release: 1998
Latest release: Version 26 (June 2006)
Type of format: Image file formats
Open format?: GNU GPLv2 for the DjVu Reference Library and DjVuLibre-3.5; license grants under the GNU GPL for several patents that cover aspects of the library
Website: djvu.org

DjVu (/ˌdeɪʒɑːˈvuː/ DAY-zhah-VOO, like English "déjà vu") is a computer file format designed primarily to store scanned documents, especially those containing a combination of text, line drawings, indexed color images, and photographs. It uses technologies such as image layer separation of text and background/images, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web.

DjVu has been promoted as an alternative to PDF, promising smaller files than PDF for most scanned documents. The DjVu developers report that color magazine pages compress to 40–70 kB, black-and-white technical papers compress to 15–40 kB, and ancient manuscripts compress to around 100 kB; a satisfactory JPEG image typically requires 500 kB. Like PDF, DjVu can contain an OCR text layer, making it easy to perform copy-and-paste and text search operations.

Free creators, manipulators, converters, browser plug-ins, and desktop viewers are available. DjVu is supported by a number of multi-format document viewers and e-book reader software on Linux (Okular, Evince) and Windows (SumatraPDF).

Despite its advantages, DjVu is not widely supported by scanning and viewing software.


History

The DjVu technology was originally developed by Yann LeCun, Léon Bottou, Patrick Haffner, and Paul G. Howard at AT&T Labs from 1996 to 2001.

Because of its declared higher compression ratio (and thus smaller file size), the ease of converting large volumes of text into DjVu format, and its status as an open file format, DjVu has been considered superior to PDF. Independent technologist Brewster Kahle discussed the benefits of allowing easier access to DjVu files in a 2004 talk on IT Conversations.

The DjVu library distributed as part of the open-source package DjVuLibre has become the reference implementation for the DjVu format. DjVuLibre has been maintained and updated by the original developers of DjVu since 2002.

The DjVu file format specification has gone through a number of revisions:

Revision history

| Support status | Version | Release date | Notes |
|---|---|---|---|
| Unsupported | 1–19 | 1996–1999 | Developmental versions by AT&T Labs preceding the sale of the format to LizardTech. |
| Unsupported | 20 | April 1999 | DjVu version 3. DjVu changed from a single-page format to a multipage format. |
| Older, still supported | 21 | September 1999 | Indirect storage format replaced. The searchable text layer was added. |
| Older, still supported | 22 | April 2001 | Page orientation, color JB2 |
| Unsupported | 23 | July 2002 | CID chunk |
| Unsupported | 24 | February 2003 | LTAnno chunk |
| Older, still supported | 25 | May 2003 | NAVM chunk. Support for DjVu bookmarks (outlines) was added. Changes made by Versions 23 and 24 were made obsolete. |
| Current | 26 | April 2005 | Text/line annotations |

Technical overview

File structure

The DjVu file format is based on the Interchange File Format (IFF) and is composed of hierarchically organized chunks. The IFF structure is preceded by the 4-byte magic number AT&T. It is followed by a single FORM chunk whose secondary identifier is either DJVU, for a single-page document, or DJVM, for a multi-page document.
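The header layout described above can be checked with a few lines of Python. This is a minimal sketch, not a full parser: the function name `identify_djvu` is made up for illustration, and the 4-byte big-endian length field after the FORM identifier is assumed from the general IFF chunk convention.

```python
import struct

MAGIC = b"AT&T"  # 4-byte magic number that precedes the IFF structure


def identify_djvu(data: bytes) -> str:
    """Classify the start of a DjVu file as 'single-page' or 'multi-page'.

    Hypothetical helper: it inspects only the first 16 bytes.
    """
    if len(data) < 16:
        raise ValueError("too short to hold a DjVu header")
    if data[:4] != MAGIC:
        raise ValueError("missing AT&T magic number")
    # Assumed IFF chunk header: 4-byte identifier, 4-byte big-endian length.
    chunk_id, _length = struct.unpack(">4sI", data[4:12])
    if chunk_id != b"FORM":
        raise ValueError("root chunk is not FORM")
    secondary = data[12:16]
    if secondary == b"DJVU":
        return "single-page"
    if secondary == b"DJVM":
        return "multi-page"
    raise ValueError("unexpected secondary identifier: %r" % secondary)
```

Calling `identify_djvu(open(path, "rb").read(16))` on a real file would be the typical use; only the first 16 bytes are needed.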

Chunk types

| Chunk identifier | Contained by | Description |
|---|---|---|
| FORM:DJVU | FORM:DJVM | Describes a single page. Can either be at the root of a document and be a single-page document, or be referred to from a DIRM chunk. |
| FORM:DJVM | N/A | Describes a multi-page document. Is the document's root chunk. |
| FORM:DJVI | FORM:DJVM | Contains data shared by multiple pages. |
| FORM:THUM | FORM:DJVM | Contains thumbnails. |
| INFO | FORM:DJVU | Must be the first chunk. Describes the page width, height, format version, resolution, gamma, and rotation. |
| DIRM | FORM:DJVM | Must be the first chunk. References other FORM chunks. These chunks can either follow this chunk inside the FORM:DJVM chunk or be contained in external files. These types of documents are referred to as bundled or indirect, respectively. |
| NAVM | FORM:DJVM | If present, must immediately follow the DIRM chunk. Contains a BZZ-compressed outline of the document. |
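To see which of these chunks a given file contains, one can walk the children of the root FORM chunk. The sketch below assumes the generic IFF layout (4-byte identifier, 4-byte big-endian length, payload, and one pad byte after an odd-length payload); the function names are hypothetical, and nested FORM chunks are reported as plain entries without descending into them.

```python
import struct
from typing import Iterator, List, Tuple


def iter_chunks(body: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (identifier, payload) for chunks stored back to back in body."""
    pos = 0
    while pos + 8 <= len(body):
        chunk_id, length = struct.unpack_from(">4sI", body, pos)
        start = pos + 8
        payload = body[start:start + length]
        if len(payload) < length:
            raise ValueError("truncated chunk %r" % chunk_id)
        yield chunk_id.decode("ascii"), payload
        # Assumed IFF rule: chunks start on even offsets, so skip a pad byte
        # after an odd-length payload.
        pos = start + length + (length & 1)


def list_form_chunks(data: bytes) -> Tuple[str, List[Tuple[str, int]]]:
    """Return the root secondary identifier and (id, size) of each child chunk."""
    if data[:4] != b"AT&T" or data[4:8] != b"FORM":
        raise ValueError("not a DjVu file")
    (form_len,) = struct.unpack_from(">I", data, 8)
    form_body = data[12:12 + form_len]
    secondary = form_body[:4].decode("ascii")
    children = [(cid, len(p)) for cid, p in iter_chunks(form_body[4:])]
    return secondary, children
```

For a single-page document the first child reported should be INFO, and for a multi-page document it should be DIRM, matching the "must be the first chunk" rows in the table.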

Compression

DjVu divides a single image into many different images, then compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100 dpi); the mask image is a high-resolution bilevel image (e.g., 300 dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. The mask image is compressed using a method called JB2 (similar to JBIG2). The JB2 encoding method identifies nearly identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. It compresses the bitmap of each unique shape separately, and then encodes the locations where each shape appears on the page. Thus, instead of compressing a letter "e" in a given font multiple times, it compresses the letter "e" once (as a compressed bit image) and then records every place on the page it occurs.
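The "store each shape once, then record its positions" idea can be illustrated with a toy encoder. This is only a sketch of the dictionary principle, not JB2 itself: it matches shapes exactly rather than approximately, skips the arithmetic coding stage, and all names (`encode_shapes`, `Bitmap`, `Placement`) are invented for the example.

```python
from typing import Dict, List, Tuple

Bitmap = Tuple[str, ...]          # rows of '0'/'1' characters, one string per row
Placement = Tuple[int, int, int]  # (index into the shape dictionary, x, y)


def encode_shapes(
    glyphs: List[Tuple[Bitmap, int, int]]
) -> Tuple[List[Bitmap], List[Placement]]:
    """Store each distinct bitmap once and record every position it occurs at."""
    dictionary: List[Bitmap] = []
    index_of: Dict[Bitmap, int] = {}
    placements: List[Placement] = []
    for bitmap, x, y in glyphs:
        if bitmap not in index_of:
            # First occurrence: add the shape to the dictionary.
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        # Every occurrence costs only an index and a position.
        placements.append((index_of[bitmap], x, y))
    return dictionary, placements
```

A page with thousands of glyphs but only a few dozen distinct shapes ends up as a short dictionary plus a list of small integer triples, which is where the savings on text-heavy scans come from.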

Optionally, these shapes may be mapped to UTF-8 codes (either by hand or potentially by a text recognition system) and stored in the DjVu file. If this mapping exists, it is possible to select and copy text.

Since JBIG2 was based on JB2, both compression methods have the same problems when performing lossy compression. Digits may be replaced with similar-looking digits (for example, a 6 replaced with an 8) if the text was scanned at a low resolution prior to lossy compression.
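The substitution problem follows directly from matching "nearly identical" shapes. The toy below is an illustration under stated assumptions, not the matching rule of any real encoder: it treats two bitmaps as the same shape when they differ in at most `tolerance` pixels, a hypothetical parameter, and the 3×5 digit bitmaps stand in for a low-resolution scan.

```python
from typing import List, Optional, Tuple

Bitmap = Tuple[str, ...]  # rows of '0'/'1' characters


def pixel_difference(a: Bitmap, b: Bitmap) -> Optional[int]:
    """Count differing pixels, or return None if the sizes differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return None
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def match_or_add(dictionary: List[Bitmap], bitmap: Bitmap, tolerance: int) -> int:
    """Reuse the first dictionary shape within tolerance, else add a new one."""
    for index, known in enumerate(dictionary):
        diff = pixel_difference(known, bitmap)
        if diff is not None and diff <= tolerance:
            return index  # lossy step: the page will show `known`, not `bitmap`
    dictionary.append(bitmap)
    return len(dictionary) - 1


# At this toy resolution the two digits differ by a single pixel.
EIGHT: Bitmap = ("111", "101", "111", "101", "111")
SIX: Bitmap = ("111", "100", "111", "101", "111")
```

With `tolerance=0` the 6 and the 8 get separate dictionary entries; with `tolerance=1` the 6 is silently encoded as a reference to the 8. At higher scan resolutions the same digits differ in many more pixels, so the same tolerance no longer merges them.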

Format licensing

DjVu is an open file format with patents. The file format specification is published, as is the source code for the reference library. The original authors distribute an open-source implementation named "DjVuLibre" under the GNU General Public License. The rights to the commercial development of the encoding software have been transferred to different companies over the years, including AT&T Corporation, LizardTech, Celartem, and Cuminas.

Support

Despite its advantages, DjVu is not widely supported by scanning and viewing software. While viewers can be downloaded, most operating systems cannot open DjVu files by default.

In 2002, the DjVu file format was chosen by the Internet Archive as a format in which its Million Book Project provides scanned public-domain books online (along with TIFF and PDF).

Wikimedia Commons, a media repository used by Wikipedia among others, conditionally permits PDF and DjVu media files.

Third-party tools can create and manipulate DjVu files through APIs and SDKs.
