 Welcome to the Cisco Academy for Vision Impaired Linux Wiki.

ArchiveAndCompressionUtilities

Background And History

Back in the old Unix days, there was compress and uncompress.

Uncompress was actually the same program; it just got called under a different name.

Compress took a file as input, and operating on the same file, produced a compressed result with a .Z extension.

Also back in that golden age of ancient Unix, two decades ago, tar (the tape archiver) was king. It bundled files into one large archive, preceding each file with a header that includes a checksum, and storing everything in sequential 512-byte blocks, since that was what fit conveniently onto magnetic tape.

You could tar up a single directory, an entire system or anything in between and the archive could span multiple tapes if needed.

In the early days of the internet, files were first bundled up in to tar archives, to avoid a user needing to download multiple little files to get a distribution. Then they were compressed to save storage space and download time on slow modem connections.

So a tar.Z file was common, and whole operating systems as well as individual software packages were distributed that way.

Compress had licensing restrictions, like all of the Unix source code, so when the GNU folks started writing utilities that were "free" they rewrote compress and improved on it. Their replacement is called gzip and gunzip, and gunzip can still uncompress old .Z files. The g in many of these utility names stands for GNU.

The gzip and gunzip utilities should not be confused with the old-time DOS favorite pkzip or its Windows cousins, winzip and pkzip for Windows. The zip archive compression algorithm also had licensing restrictions and originally could not be distributed as open source under the GPL license. Besides the licensing problems, the zip utilities didn't become popular on Linux because they couldn't, by default, save extended file attributes like a file's owner, group and permissions.

We see so many zip archives on the internet because this format became very popular with BBS sysops near the end of the 1980s, but it never caught on in the Unix world due to its inability to save those extended attributes.

However, open-source programmers were ready to reverse-engineer these algorithms and improve on them, so there is indeed a free zip for Linux. A later compression program, bzip2, typically compresses better than gzip. There are other compression programs floating around as well, many of them free implementations of compression schemes which had shown up on other operating systems first. These include formats like arj, zoo, arc and lzh.

With tar, things went a bit easier because it was designed at Berkeley, was originally part of the BSD distribution and therefore its source license was less restrictive. Tar also has always been able to preserve all of a file's information so it has remained to this day a favorite backup and file distribution solution. At some point, tar eventually became part of all open-source distributions, though it contains much of the original BSD code.

Tar developed some new tricks however in the 21st century. For one thing, it can now handle gzip files directly as well as bzip2 files. It can create compressed archives and uncompress them and extract the original files as well. But it can still handle old .tar archives from the 1970s, staying about as backward compatible as any utility has a right to be. And tar still preserves a file's full path, owner, group and full permissions.

If a file is merely a tar archive, it usually has the extension of .tar. If it is a compressed archive, it usually has the extension of .tar.gz. If you see a file with a .tgz extension that is also a tar archive, compressed with gzip, it's probably been in use for a while.

The three-letter extension facilitated easier unpacking on DOS and Windows 3.1. Slackware, for example, was a distro that got its start in the DOS days, and its packages have .tgz extensions.

Common Commands

To extract an archive with a .tar.gz extension, type
tar -xf archive.tar.gz

To compress and archive a group of text files, in the current directory, type
tar -czf archive.tar.gz *.txt

To compress and archive using bzip2, type
tar -cjf archive.tar.bz2 *.txt

Notice that the tar command is followed by the parameters, then by the archive name, and optionally by the names of the files you are adding or extracting. If you give no file names when extracting, tar simply extracts everything.
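As a rough illustration of that argument order, here is a small sketch that builds an archive and then pulls a single file back out of it. The file names and the restored directory are made up for the example, and the -C option (change to a directory before extracting) is assumed to be available, as it is in GNU tar.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create three small sample text files (hypothetical names).
printf 'alpha\n' > one.txt
printf 'beta\n'  > two.txt
printf 'gamma\n' > three.txt

# c = create, z = compress with gzip, f = archive name follows.
# The files to add come last.
tar -czf archive.tar.gz *.txt

# x = extract. Naming a file after the archive extracts only that file;
# -C sends it to a separate directory instead of the current one.
mkdir restored
tar -xzf archive.tar.gz -C restored two.txt

extracted=$(ls restored)
echo "$extracted"
```

Leaving off the trailing file name in the second tar command would restore all three files.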

Also the dash in the argument list is optional, due to the BSD heritage of tar. You can type
tar xzf archive.tar.gz

to extract an archive, with no dash and the parameter letters in a different order. However the argument letters must precede the filename of the archive.

Listing its Contents

Most of the time, if you download some tar archive, you can safely extract it anywhere, and it will create a directory with its files inside. But occasionally someone archives up files starting from their home directory, or worse, their root, and tar will happily overwrite files on your system with the same paths and names.

This is why the command
tar tzf archive.tar.gz

is so useful; it displays just a table of contents. You can also use the v (verbose) switch to tell tar to display every file it is extracting, as in
tar -xjvf archive.tar.bz2

which extracts from a tar archive compressed with bzip2 and shows you all filenames.

The v and t arguments therefore can generate a lot of output so it's a good idea to pipe them through less, more, head or tail, depending on your needs. For example
tar tf archive.tar | head

shows the first ten lines of the table of contents for a tar archive.
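The habit of looking before extracting can be sketched as a short script. This is only one possible approach, not a complete safety check: it builds a harmless sample archive (the names are invented for the example), lists it, and extracts into a fresh directory only if no entry starts with a slash or contains a .. component.

```shell
# Build a small sample archive in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir project
printf 'hello\n' > project/readme.txt
tar -czf sample.tar.gz project

# t = table of contents: list the archive without extracting anything.
listing=$(tar -tzf sample.tar.gz)
echo "$listing" | head

# Treat absolute paths and ".." components as a warning sign.
if echo "$listing" | grep -Eq '^/|(^|/)\.\.(/|$)'; then
    verdict="suspicious"
else
    verdict="ok"
    # Extract into its own directory rather than wherever we happen to be.
    mkdir unpacked
    tar -xzf sample.tar.gz -C unpacked
fi
echo "$verdict"
```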

Compressing and Uncompressing

As for gunzip and gzip, the syntax is very simple.
gzip file.txt

turns it into file.txt.gz, and
gunzip file.txt.gz

turns it back into file.txt.
zcat file.txt.gz

prints the contents of a file with a gz extension without uncompressing it on disk.
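A quick way to watch this happen is to run the round trip in a scratch directory. The sketch below uses a made-up file.txt and assumes the Linux gzip tools; it records what the directory contains after each step.

```shell
# Scratch directory with one small text file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'line one\nline two\n' > file.txt

# gzip replaces file.txt with file.txt.gz.
gzip file.txt
after_gzip=$(ls)

# zcat prints the uncompressed text; the .gz file stays as it is.
peek=$(zcat file.txt.gz | head -n 1)

# gunzip replaces file.txt.gz with the original file.txt.
gunzip file.txt.gz
after_gunzip=$(ls)

echo "$after_gzip / $peek / $after_gunzip"
```

On systems where zcat is not the gzip version, gzip -dc does the same job.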

Uses of Tar

Tar is very useful for backing up directory trees. It preserves permissions, so files can be extracted from the archive with everything intact, including the original pathname.

If you are used to copying from one drive to another, for example from c: to e: in Windows/DOS, using tar to back up your entire system to a different drive can seem confusing, because that drive has to be mounted somewhere within the system you are backing up.

The important thing here is to mount your backup partition but to exclude it from the tar archive.
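To make the idea concrete without touching a real system, the sketch below fakes a tiny file system under a temporary directory, with a backup directory standing in for the mounted backup partition. All of the paths are invented for the illustration; the point is only that the archive is written inside the tree being backed up while that one directory is excluded.

```shell
# A pretend root with a few files and a "mounted" backup directory.
root=$(mktemp -d)
mkdir -p "$root/etc" "$root/home/user" "$root/backup"
printf 'config\n' > "$root/etc/app.conf"
printf 'notes\n'  > "$root/home/user/notes.txt"

# -C changes into the pretend root first, so paths are stored as ./etc/...
# --exclude=./backup keeps the backup directory (and the archive being
# written into it) out of the archive. It is given before the "." operand.
tar -czf "$root/backup/system.tgz" -C "$root" --exclude=./backup .

# Check what actually went in.
contents=$(tar -tzf "$root/backup/system.tgz")
echo "$contents"
```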

Added later

Extra notes: These were added after more lectures. The t parameter for getting the table of contents (list of files) in a tar archive is important when you want to check out an archive before it gets extracted. Though this repeats information above, it is so important that I wish to review it again:
tar tzvf archive.tar.gz | head

is going to display the first ten lines of output. That output is a detailed listing, including the full pathname, of each file contained in archive.tar.gz. The t tells tar to list the table of contents, the v tells tar to be verbose, the z tells tar to use gunzip to decompress the archive in order to get the filename list, and the f says that the archive's file name follows.

If you'd specified the x parameter instead of the t parameter, you would have uncompressed and extracted the entire archive to the current directory, which might not be where you wanted it to go.

Also important for speech users: do not confuse the Windows pkzip, which creates zip archives, with the Linux gzip. PKUNZIP and PKZIP were originally DOS programs. Various patent and licensing issues still encumber this software, but these zip archives are handled by the Windows utilities 7-zip and pkzip for Windows. Windows XP also has built-in zip support. A zip file is an archive that is also compressed. (Note that 7-zip also handles its own 7z archive format.)

Linux, on the other hand, with its Unix heritage, didn't use pkzip, because it wasn't open-source. Today there are open-source tools for working with zip archives, but the standard Linux compression utility is G Zip, which, spelled gzip, sounds similar when read with speech. gzip is nothing like pkzip; it is modeled on the Unix compress, and gunzip can still uncompress old .Z files. Compress gave files a .Z extension and gzip gives them a .gz extension. Unlike a Windows zip archive, a Linux gz file is a single file that is simply compressed. So file.zip probably contains many files, whereas file.gz is a single file that has simply been compressed to save space.

In Windows, if you wish to compress an archive, you use a single tool: 7-zip, pkzip, or simply the zip support built in to Windows.

On Linux, to compress an archive, you use tar, which automatically calls gzip (if you specified the z parameter) to do the compression.
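One way to see that tar and gzip are separate tools working together is to do the job both ways and compare. In this sketch (sample names invented for the example), the first archive is made by tar with the z parameter, and the second by piping a plain tar archive through gzip by hand.

```shell
# Scratch directory with a small tree to archive.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir docs
printf 'hello\n' > docs/a.txt

# One step: the z parameter makes tar run the gzip compression itself.
tar -czf one-step.tar.gz docs

# Two steps: "f -" writes the plain archive to standard output,
# and gzip compresses that stream into a file.
tar -cf - docs | gzip > two-step.tar.gz

# Both archives list the same contents.
list_one=$(tar -tzf one-step.tar.gz)
list_two=$(gzip -dc two-step.tar.gz | tar -tf -)
echo "$list_one"
echo "$list_two"
```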

There are several commands for working directly with compressed files. zcat is like cat, zless is like less, zmore is like more, and zgrep and zegrep search compressed files. There is even a zdiff like diff and a zcmp like cmp, for comparing files with each other.
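As a small example of these helpers, the sketch below compresses an invented log file and then searches and reads it without ever uncompressing it on disk. It assumes zgrep and zcat from the gzip package are installed.

```shell
# Scratch directory with a tiny compressed log (made-up content).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'error: disk full\ninfo: all good\nerror: no route\n' > app.log
gzip app.log

# zgrep takes the same options as grep; -c counts matching lines.
matches=$(zgrep -c 'error' app.log.gz)

# zcat streams the text, so it can be piped like any other output.
first=$(zcat app.log.gz | head -n 1)

# The only file on disk is still the compressed one.
still_compressed=$(ls)

echo "$matches matching lines, first line: $first"
```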

Sample Script

Here is an old script I wrote to back up my server named sketti. It is well-documented and works so feel free to alter it for your needs.

    #!/bin/bash
    # backupsketti.sh
    # Purpose: create a pair of zipped tar'd files
    # containing a complete backup of Sketti.
    # rootbackup.tgz contains the slash (root partition) backup
    # varbackup.tgz contains the var partition's backup
    # It is expected that the user will create a directory to hold the backup
    # and change in to that directory before starting this script.
    # At this time, there is no error checking.
    # To do: implement that.
    # DA 17-May-2009.
    #
    tar -cvzf ./rootbackup.tgz --exclude=/proc --exclude=/lost+found \
        --exclude=/archive --exclude=/mnt --exclude=/media \
        --exclude=/sys --exclude=./rootbackup.tgz \
        --one-file-system /mnt/one/ > rootfilesbackuped.txt

    # -- c creates new tar archive
    # -- v tells tar to verbosely print what's happening
    # -- z use gzip to compress archive (gzip is gnu compress)
    # -- j (a parameter we might use) to create a bzip2 (better compression)
    #    archive. If we used j we'd want to change the file extension from
    #    tgz to bz2, and we'd want to remove the z parameter.
    # -- f following other parameters is file name of the archive
    # -- --exclude= followed immediately by directory to exclude from the
    #    backup.
    #    Note we've also excluded the archive's own filename
    #    in case we run this script from a directory in the same
    #    partition we are backing up.
    # -- --one-file-system tells tar to use only the local file system.
    #    We can remove this parameter if we want mounted file systems to
    #    be included in the tar archive. At this time, we don't.
    # -- The final parameter is the directory which we are backing up, in this
    #    case /mnt/one/ (the root partition).

    # Next back up the var partition
    tar -cvzf ./varbackup.tgz --exclude=/var/lib/mythtv/recordings \
        --one-file-system /mnt/three/ > varfilesbackupedup.txt
    # This script last edited DA 17-May-2009.
Page last modified on May 02, 2012, at 02:06 AM