Managing Gigabytes: Compressing and Indexing Documents and Images

Morgan Kaufmann, 1999 - Business & Economics - 519 pages

In this fully updated second edition of the highly acclaimed Managing Gigabytes, authors Witten, Moffat, and Bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. Whatever your field, if you work with large quantities of information, this book is essential reading--an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. It covers the latest developments in compression and indexing and their application on the Web and in digital libraries. It also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the Web.

  • Up-to-date coverage of new text compression algorithms such as block sorting, approximate arithmetic coding, and fast Huffman coding
  • New sections on content-based index compression and distributed querying, with two new data structures for fast indexing
  • New coverage of image coding, including descriptions of de facto standards in use on the Web (GIF and PNG), information on CALIC, the new proposed JPEG Lossless standard, and JBIG2
  • New information on the Internet and WWW, digital libraries, web search engines, and agent-based retrieval
  • Accompanied by a public-domain system called mg, a fully worked-out operational example of the advanced techniques developed and explained in the book
  • New appendix on an existing digital library system that uses the mg software
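A recurring theme of the book's index-compression chapters is storing each inverted list as the gaps between successive document numbers and coding those gaps with variable-length integer codes. A minimal sketch of one such code the book covers, the Elias gamma code, applied to a hypothetical posting list (the numbers are illustrative only, not taken from the book):

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code for n >= 1: a unary length prefix of
    (len-1) zeros, followed by the binary representation of n."""
    assert n >= 1
    b = bin(n)[2:]                      # e.g. 9 -> '1001'
    return '0' * (len(b) - 1) + b       # 9 -> '0001001'

def gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes back into integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == '0':           # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

# An inverted list stored as d-gaps between document numbers:
postings = [3, 7, 11, 12, 20]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
encoded = ''.join(gamma_encode(g) for g in gaps)
assert gamma_decode(encoded) == gaps    # round-trips to [3, 4, 4, 1, 8]
```

Small gaps get short codes, so frequent terms (whose postings are densely packed) compress especially well; the book develops this idea much further, with Golomb and other parameterized codes.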

What people are saying

User Review  - juha - LibraryThing

A hard-core approach to information retrieval. I didn't appreciate this book until recently, when I started to look for ways to reduce I/O. The use of compression in storing the text, integers, lexicon, and inverted list is detailed beautifully.

User Review

This is the only book there is that will actually teach you how to build an information retrieval system (aka search engine). It discusses all the algorithms and tradeoffs, and comes with free downloadable source code to experiment with. Some of the material is standard, but covered in more implementation detail here than anywhere else. Some of the material is novel: you won't find better coverage of compression unless you hand-assemble twenty research papers and reverse-engineer them to figure out how they're implemented. But with "Managing Gigabytes", it's all here. (Although, after a particularly invigorating discussion of how to string together a bunch of techniques to compress their corpus and save a couple hundred megabytes, I did a check and found you could buy 512MB of RAM for less than the cost of the book. Knowledge is Power, but sometimes a little cash is more powerful.) The only negative is that this book is not called "Managing Terabytes", as the first edition promised/threatened it might be. RAM and disk are cheap, but not that cheap, and for now terabytes (and sometimes petabytes) are managed only by NASA, Google, and a few others. I can't wait to see the third edition!


one Overview
two Text Compression
three Indexing
four Querying
five Index Construction
six Image Compression
seven Textual Images
eight Mixed Text and Images
Further reading
ten The Information Explosion
A Guide to the mg System
B Guide to the NZDL
About the Authors


About the author (1999)

Ian H. Witten is a professor of computer science at the University of Waikato in New Zealand. He directs the New Zealand Digital Library research project. His research interests include information retrieval, machine learning, text compression, and programming by demonstration. He received an MA in Mathematics from Cambridge University, England; an MSc in Computer Science from the University of Calgary, Canada; and a PhD in Electrical Engineering from Essex University, England. He is a fellow of the ACM and of the Royal Society of New Zealand. He has published widely on digital libraries, machine learning, text compression, hypertext, speech synthesis and signal processing, and computer typography. He has written several books, the latest being Managing Gigabytes (1999) and Data Mining (2000), both from Morgan Kaufmann.