NAME
  glimpse 3.0 - search quickly through entire file systems

OVERVIEW
  Glimpse (which stands for GLobal IMPlicit SEarch) is an indexing and query
  system that allows you to search through all your files very quickly.
  Glimpse supports most of agrep's options (agrep is our powerful version of
  grep) including approximate matching (e.g., finding misspelled words),
  Boolean queries, and even some limited forms of regular expressions. It is
  used in the same way, except that you don't have to specify file names.
  So, if you are looking for a needle anywhere in your file system, all you
  have to do is say glimpse needle and all lines containing needle will
  appear preceded by the file name.

  To use glimpse you first need to index your files with glimpseindex, which
  is typically run every night. glimpseindex -o ~  will index everything at
  or below your home directory.  See man glimpseindex for more details.

  Glimpse is also available for HTTP servers, to provide search of local
  data, as a set of tools called GlimpseHTTP.  See
  http://glimpse.cs.arizona.edu:1994/ghttp/ for more information.

  Glimpse includes all of agrep and can be used instead of agrep by giving
  one or more file names at the end of the command.  This will cause glimpse
  to ignore the index and run agrep as usual.  For example, glimpse -1
  pattern file is the same as agrep -1 pattern file.  We added a new option
  to agrep:  -r recursively searches the directory and everything below it
  (see agrep options below); it is used only when glimpse reverts to agrep.

  Mail glimpse-request@cs.arizona.edu to be added to the glimpse mailing
  list.  Mail glimpse@cs.arizona.edu to report bugs, ask questions, discuss
  tricks for using glimpse, etc. (this is a moderated mailing list with very
  little traffic, mostly announcements).  An HTML version of these manual
  pages can be found at http://glimpse.cs.arizona.edu:1994/glimpsehelp.html.
  Also, see the glimpse developers' home page at
  http://glimpse.cs.arizona.edu:1994/

SYNOPSIS
  glimpse [ -(agrep's options) -C -F file_pattern -H directory -J host_name
  -K port_number -L x -N -T directory -V -W -z ] pattern

INTRODUCTION
  We start with simple ways to use glimpse and describe all the options in
  detail later on.  Once an index is built, using glimpseindex, searching for
  pattern is as easy as saying

  glimpse pattern

  The output of glimpse is similar to that of agrep (or any other grep),
  except that the name of the file containing the match appears at the
  beginning of the line by default.  The pattern can be any agrep legal
  pattern including a regular expression or a Boolean query (e.g., searching
  for Tucson AND Arizona is done by glimpse 'Tucson;Arizona').

  Approximate matching is also supported: with the -# option (for example,
  -1), glimpse allows that many spelling errors in any of the patterns (each
  error being an insertion, deletion, or substitution).

  glimpse -w -i 'parent'

  specifies case insensitive (-i) and match on complete words (-w).  So
  'Parent' and 'PARENT' will match, 'parent/child' will match, but
  'parenthesis' or 'parents' will not match.  (Starting at version 3.0,
  glimpse can be much faster when these two options are specified, especially
  for very large indexes.  You may want to set an alias especially for
  "glimpse -w -i".)

  The -F option provides a pattern that must match the file name.  For exam-
  ple,

  glimpse -F '\.c$' needle

  will find the pattern needle in all files whose name ends with .c.
  (Glimpse will first check its index to determine which files may contain
  the pattern and then run agrep on the file names to further limit the
  search.) The -F option should not be put at the end after the main pattern
  (e.g., "glimpse needle -F hay" is incorrect).

DETAILED DESCRIPTION OF GLIMPSE

  The use of glimpse is similar to that of agrep (or any other grep), except
  that there is no need to specify file names.  Most of the options of agrep
  (and of other greps) are supported.  Keep in mind that the search is over
  many files.  Using very common patterns may lead to a huge
  number of matches.  Running glimpse a will work, but will take a long time
  and will probably output all of the indexed files.  We start with the new
  options, and then list all of agrep's original options (with some addi-
  tional comments when relevant).

The New Options of Glimpse

  -a   prints attribute names.  This option applies only to structured data
       (used with glimpseindex -s); this option was added to support the Har-
       vest project.  See STRUCTURED QUERIES below for more information and
       also http://harvest.cs.colorado.edu for more information about the
       Harvest project.

  -C   tells glimpse to send its queries to glimpseserver.  See man
       glimpseserver for more details.

  -E   prints the lines in the index (as they appear in the index) which
       match the pattern.  Used mostly for debugging and maintenance of the
       index.

  -F file_pattern
       limits the search to those files whose name (including the whole path)
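       matches file_pattern.  If file_pattern matches a directory, then all
       files below that directory are searched.  Agrep options can be given
       inside file_pattern.  For example,
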
       glimpse -F '-1 gopherc' pattern

       will allow one spelling error when matching gopherc to the file names
       (so "gopherrc" and "gopher" will be considered as well).

       glimpse -F '-v \.c$' counter

       will search for 'counter' in all files except for .c files.

  -H directory_name
       searches for the index and the other .glimpse files in directory_name.
       The default is the home directory.  This option is useful, for exam-
       ple, if several different indexes are maintained for different
       archives (e.g., one for mail messages, one for source code, one for
       articles).

  -J host_name
       used in conjunction with glimpseserver (-C) to connect to one particu-
       lar server.  See man glimpseserver for more details.

  -K port_number
       used in conjunction with glimpseserver (-C) to connect to one particu-
       lar server at the specified TCP port number.  See man glimpseserver
       for more details.

  -L x | x:y | x:y:z
       if one number is given, it is a limit on the total number of matches.
       Glimpse outputs only the first x matches. If -l is used (i.e., only
       file names are sought), then the limit is on the number of files; oth-
       erwise, the limit is on the number of records.  If two numbers are
       given (x:y), then y is an added limit on the total number of files.
       If three numbers are given (x:y:z), then z is an added limit on the
       number of matches per file.  If any of the x, y, or z is set to 0, it
       means to ignore it (in other words 0 = infinity in this case);  for
       example, -L 0:10 will output all matches to the first 10 files that
       contain a match.

  -N   searches only the index (so the search is faster).  If the index was
       built with glimpseindex -o or -b, then the result is the number of
       files that have a potential
       match plus a prompt to ask if you want to see the file names.  (If -y
       is used, then there is no prompt and the names of the files will be
       shown.) This could be a way to get the matching file names without
       even having access to the files themselves.  However, because only the
       index is searched, some potential matches may not be real matches.  In
       other words, with -N you will not miss any file but you may get extra
       files.  For example, since the index stores everything in lower case,
       a case-sensitive query may match a file that has only a case-
       insensitive match.  Boolean queries may match a file that has all the
       keywords but not in the same line (indexing with -b allows glimpse to
       figure out whether the keywords are close, but it cannot figure out
       from the index whether they are exactly on the same line).

  -T directory
       uses directory as the place for glimpse's temporary files.  This
       option is useful mainly in the context of structured queries for the
       Harvest project, where the temporary files may be non-trivial.

  -V   prints the current version of glimpse.

  -W   The default for Boolean AND queries is that they cover one record (the
       default for a record is one line) at a time. For example, glimpse
       'good;bad' will output all lines containing both 'good' and 'bad'.
       The -W option changes the scope of Booleans to be the whole file.
       Within a file glimpse will output all matches to any of the patterns.
       So, glimpse -W 'good;bad' will output all lines containing 'good' or
       'bad', but only in files that contain both patterns.  For structured
       queries, the scope is always the whole attribute or file.

  -z   Allow customizable filtering, using the file .glimpse_filters to run
       the programs listed there for each match.  The best example is
       compress/decompress.  If .glimpse_filters includes the line
       *.Z   uncompress <
       (separated by tabs) then before indexing any file that matches the
       pattern "*.Z" (same syntax as the one for .glimpse_exclude) the com-
       mand listed is executed first (assuming input is from stdin, which is
       why uncompress needs <) and its output (assuming it goes to stdout) is
       indexed.  The file itself is not changed (i.e., it stays compressed).
       Then if glimpse -z is used, the same program is used on these files on
       the fly.  Any program can be used (we run 'exec').  For example, one
       can filter out parts of files that should not be indexed.  Glimpsein-
       dex tries to apply all filters in .glimpse_filters in the order they
       are given.  For example, if you want to uncompress a file and then
       extract some part of it, put the decompression command (the example
       above) first and then another line that specifies the extraction.
       Note that this can slow down the search because the filters need to be
       run before files are searched.  (See also glimpseindex.)
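
  The way the three -L limits described above interact may be easier to see
  in code.  The following Python sketch only illustrates the stated rules; it
  is not glimpse's implementation.  The function name apply_limits and the
  input shape (a list of (file name, matching records) pairs, in output
  order) are assumptions made for this example, and the -l case is not
  modeled.

```python
def apply_limits(matches_by_file, x=0, y=0, z=0):
    """Illustrate -L x:y:z on a list of (file_name, [records]) pairs.

    x limits the total number of matches, y the number of files, and
    z the number of matches per file; 0 means "no limit".
    """
    out = []
    total = 0   # matches emitted so far
    files = 0   # files emitted so far
    for name, records in matches_by_file:
        if y and files >= y:
            break
        if x and total >= x:
            break
        # Per-file limit first, then whatever is left of the total limit.
        kept = records[:z] if z else list(records)
        if x:
            kept = kept[: x - total]
        if not kept:
            continue
        out.append((name, kept))
        total += len(kept)
        files += 1
    return out
```

  With this sketch, apply_limits(data, 0, 10) corresponds to -L 0:10: every
  match is kept, but only for the first 10 files that contain one.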

The Options of Agrep Supported by Glimpse

  -#   # is an integer between 1 and 8 specifying the maximum number of
       errors permitted in finding the approximate matches (the default is
       zero).  Generally, each insertion, deletion, or substitution counts as
       one error.  It is possible to adjust the relative cost of insertions,
       deletions and substitutions (see -I -D and -S options).  Since the
       index stores only lower case characters, errors of substituting upper
       case with lower case may be missed (see LIMITATIONS).

  -c   Display only the count of matching records.  Only files with count > 0
       are displayed.

  -d 'delim'
       Define delim to be the separator between two records.  The default
       value is '$', namely a record is by default a line.  delim can be a
       string of size at most 8 (with possible use of ^ and $), but not a
       regular expression.  Text between two delim's, before the first delim,
       and after the last delim is considered as one record.  Warning: if
       the delimiter is set to match mail messages, for example, and glimpse
       finds the pattern in a regular file, it may not find the delimiter and
       will therefore output the whole file.  (The -t option - see below -
       can be used to put the delim at the end of the record.)

  -e pattern
       Same as a simple pattern argument, but useful when the pattern begins
       with a `-'.

  -h   Do not display filenames.

  -i   Case-insensitive search - e.g., "A" and "a" are considered equivalent.
       Glimpse's index stores all patterns in lower case (see LIMITATIONS
       below).

  -k   No symbol in the pattern is treated as a meta character. For example,
       glimpse -k 'a(b|c)*d' will find the occurrences of a(b|c)*d whereas
       glimpse 'a(b|c)*d' will find substrings that match the regular expres-
       sion 'a(b|c)*d'.  (The only exception is ^ at the beginning of the
       pattern and $ at the end of the pattern, which are still interpreted
       in the usual way. Use \^ or \$ if you need them verbatim.)

  -l   Output only the names of files that contain a match.

  -n   Each matching record (line) is prefixed by its record (line) number in
       the file.

  -r   (This option is valid only when a file name is given and glimpse is
       used as agrep; it is a new agrep option.) If the file name is a direc-
       tory name, glimpse will search (recursively) the whole directory and
       everything below it.  Glimpse will not use its index.

  -s   Work silently, that is, display nothing except error messages.  This
       is useful for checking the error status.

  -t   Output the record starting from the end of delim to (and including)
       the next delim. This is useful for cases where delim should come at
       the end of the record.  (See warning for the -d option.)

  -w   Search for the pattern as a word - i.e., surrounded by non-
       alphanumeric characters.  For example, glimpse -w -1 car will match
       cars, but not characters and not car10.  The non-alphanumeric must
       surround the match;  they cannot be counted as errors.  This option
       does not work with regular expressions.

  -x   The pattern must match the whole line.  (This option is translated to
       -w when the index is searched and it is used only when the actual text
       is searched.  It is of limited use in glimpse.)

  -y   Do not prompt.  Proceed with the match as if the answer to any prompt
       is y.

  -Dk  Set the cost of a deletion to k (k is a positive integer).  This
       option does not currently work with regular expressions.

  -G   Output the (whole) files that contain a match.

  -Ik  Set the cost of an insertion to k (k is a positive integer).  This
       option does not currently work with regular expressions.

  -Sk  Set the cost of a substitution to k (k is a positive integer).  This
       option does not currently work with regular expressions.
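
  The error counting behind -# and the cost options -D, -I, and -S can be
  illustrated with a small dynamic-programming sketch in Python.  This is a
  textbook approximate substring match, not agrep's actual algorithm.  The
  function names are hypothetical, and the sketch assumes that a "deletion"
  is a pattern character missing from the text and an "insertion" is an extra
  character in the text.  Word boundaries (-w) are not modeled.

```python
def min_cost_match(pattern, text, ins=1, dele=1, sub=1):
    """Smallest total edit cost of matching pattern against any substring of text."""
    # prev[i]: cheapest alignment of pattern[:i] with a substring that ends
    # just before the current text character.
    prev = [i * dele for i in range(len(pattern) + 1)]
    best = prev[-1]
    for ch in text:
        cur = [0]  # cost 0 for the empty prefix: a match may start anywhere
        for i, p in enumerate(pattern, start=1):
            cur.append(min(
                prev[i - 1] + (0 if p == ch else sub),  # exact match or substitution
                prev[i] + ins,                          # extra character in the text
                cur[i - 1] + dele,                      # pattern character missing
            ))
        best = min(best, cur[-1])
        prev = cur
    return best


def matches(pattern, text, errors=0, **costs):
    """True if pattern occurs in text with total cost at most errors."""
    return min_cost_match(pattern, text, **costs) <= errors
```

  For instance, matches("gopherc", "gopherrc", errors=1) is true under these
  assumptions, while raising the deletion cost with dele=3 makes "gopher" too
  expensive a match for one error.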

  The characters `$', `^', `*', `[', `]', `|', `(', `)', `!', and `\' can
  cause unexpected results when included in the pattern, as these characters
  are also meaningful to the shell.  To avoid these problems, enclose the
  entire pattern in single quotes, i.e., 'pattern'.  Do not use double
  quotes (").

PATTERNS

  glimpse supports a large variety of patterns, including simple strings,
  strings with classes of characters, sets of strings, wild cards, and regu-
  lar expressions (see LIMITATIONS).

  Strings
       Strings are any sequence of characters, including the special symbols
       `^' for beginning of line and `$' for end of line.  The following
       special characters (`$', `^', `*', `[', `]', `|', `(', `)', `!', and
       `\') as well as the following meta characters special to glimpse (and
       agrep): `;', `,', `#', `<', `>', `-', and `.', should be preceded by
       `\' if they are to be matched as regular characters.  For example,
       \^abc\\ corresponds to the string ^abc\, whereas ^abc corresponds to
       the string abc at the beginning of a line.

  Classes of characters
       A list of characters inside [] (in order) corresponds to any character
       from the list.  For example, [a-ho-z] is any character between a and h
       or between o and z.  The symbol `^' inside [] complements the list.
       For example, [^i-n] denotes any character in the character set except
       the characters 'i' through 'n'.  The symbol `^' thus has two meanings,
       but this is consistent with egrep.  The symbol `.' (don't care) stands
       for any symbol (except for the newline symbol).

  Boolean operations
       Glimpse supports an `AND' operation denoted by the symbol `;', an `OR'
       operation denoted by the symbol `,', or any combination. For example,
       glimpse 'pizza;cheeseburger' will output all lines containing both
       patterns.  glimpse -F 'gnu;\.c$' 'define;DEFAULT' will output all
       lines containing both 'define' and 'DEFAULT' (anywhere in the line,
       not necessarily in order) in files whose name contains 'gnu' and ends
       with .c.  glimpse '{political,computer};science' will match 'political
       science' or 'science of computers'.


  Regular expressions
       Since the index is word based, a regular expression must match words
       that appear in the index for glimpse to find it.  Glimpse first strips
       all non-alphabetic characters from the regular expression, and
       searches the index for all remaining words.  It then applies the
       regular expression matching algorithm to the files found in the index.
       For example, glimpse 'abc.*xyz' will search the index for all files
       that contain both 'abc' and 'xyz', and then search directly for
       'abc.*xyz' in those files.  (If you use glimpse -w 'abc.*xyz', then
       'abcxyz' will not be found, because glimpse will think that abc and
       xyz need to match whole words.)  The syntax of regular expressions in
       glimpse is in general the same as that for agrep.  The union operation
       `|', Kleene closure `*', and parentheses () are all supported.
       Currently '+' is not supported.  Regular expressions are currently
       limited to approximately 30 characters (generally excluding meta
       characters).  Some options (-d, -w, -t, -x, -D, -I, -S) do not
       currently work with regular expressions.  The maximal number of errors
       for regular expressions that use '*' or '|' is 4. (See LIMITATIONS.)
  Structured queries
       Glimpse supports some form of structured queries using Harvest's SOIF
       format.  See STRUCTURED QUERIES below for details.
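
  The two-phase handling of regular expressions described above (look up the
  words in the index, then run the full expression on the candidate files)
  can be sketched as follows.  This is an illustration only.  A plain
  dictionary of file contents stands in for both the index and the file
  system, Python's re module stands in for agrep's matcher (their syntaxes
  differ), and the names index_words and regex_search are hypothetical.

```python
import re


def index_words(regex):
    """Words left after treating every non-alphabetic character as a separator."""
    return [w.lower() for w in re.split(r"[^A-Za-z]+", regex) if w]


def regex_search(regex, files):
    """files maps file name -> text; returns (name, line) pairs matching regex."""
    words = index_words(regex)
    # Phase 1: stand-in for the index lookup. Keep files containing every word.
    candidates = [name for name, text in files.items()
                  if all(w in text.lower() for w in words)]
    # Phase 2: apply the real regular expression to the candidate files only.
    rx = re.compile(regex)
    return [(name, line)
            for name in candidates
            for line in files[name].splitlines()
            if rx.search(line)]
```

  A file that contains both words, but never on one line in the right order,
  passes the first phase and is rejected by the second.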

EXAMPLES

  (Run "glimpse '^glimpse' this-file" to get a list of all examples, some of
  which were given earlier.)

  glimpse -F 'haystack.h$' needle
       finds all needles in all haystack.h files.

  glimpse -2 -F html Anestesiology
       outputs all occurrences of Anestesiology with two errors in files with
       html somewhere in their full name.

  glimpse -l -F '\.c$' variablename
       lists the names of all .c files that contain variablename (the -l
       option lists file names rather than output the matched lines).

  glimpse -F 'mail;1993' 'windsurfing;Arizona'
       finds all lines containing windsurfing and Arizona in all files having
       `mail' and '1993' somewhere in their full name.

  glimpse -F mail 't.j@#uk'
       finds all mail addresses (search only files with mail somewhere in
       their name) from the uk, where the login name ends with t.j, where the
       . stands for any one character. (This is very useful to find a login
       name of someone whose middle name you don't know.)

  glimpse -F mbox -h -G  . > MBOX
       concatenates all files whose name matches `mbox' into one big one.

FILES

  All files used by glimpse are located in the directory(ies) where the
  index(es) is (are) stored and have .glimpse_ as a prefix.  The first two
  files (.glimpse_exclude and .glimpse_include) are optionally supplied by
  the user.  The other files are built and read by glimpse.

  .glimpse_exclude
       contains a list of files that glimpseindex is explicitly told to
       ignore. In general, the syntax of .glimpse_exclude/include is the same
       as that of agrep (or any other grep).  The lines in the
       .glimpse_exclude file are matched to the file names, and if they
       match, the files are excluded.  Notice that agrep matches parts of
       the string; e.g., agrep /ftp/pub will match /home/ftp/pub and
       /ftp/pub/whatever.  So, if you want to exclude /ftp/pub/core, you just
       list it, as is, in the .glimpse_exclude file.  If you put
       "/home/ftp/pub/cdrom" in .glimpse_exclude, every file name that
       matches that string will be excluded, meaning all files below it.  You
       can use ^ to indicate the beginning of a file name, and $ to indicate
       the end of one, and you can use * and ? in the usual way.  For example
       /ftp/*html will exclude /ftp/pub/foo.html, but will also exclude
       /home/ftp/pub/html/whatever;  if you want to exclude files that start
       with /ftp and end with html, use ^/ftp*html$.  Notice that putting a *
       at the beginning or at the end is redundant (in fact, in this case
       glimpseindex will remove the * when it does the indexing).  No other
       meta characters are allowed in .glimpse_exclude (e.g., don't use .* or
       # or |).  Lines with * or ? must have no more than 30 characters.
       Notice that, although the index itself will not be indexed, the list
       of file names (.glimpse_filenames) will be indexed unless it is expli-
       citly listed in .glimpse_exclude.

  .glimpse_filters
       See the description above for the -z option.

  .glimpse_include
       contains a list of files that glimpseindex is explicitly told to
       include in the index even though they may look like non-text files.
       Symbolic links are followed by glimpseindex only if they are specifi-
       cally included here.  If a file is in both .glimpse_exclude and
       .glimpse_include it will be excluded.

  .glimpse_filenames
       contains the list of all indexed file names, one per line.  This is an
       ASCII file that can also be used with agrep to search for a file name
       leading to a fast find command.  For example,
       glimpse 'count#\.c$' ~/.glimpse_filenames
       will output the names of all (indexed) .c files that have 'count' in
       their name (including anywhere on the path from the index).  Setting
       the following alias in the .login file may be useful:
       alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'

  .glimpse_index
       contains the index.  The index consists of lines, each starting with a
       word followed by a list of block numbers (unless the -o or -b options
       are used, in which case each word is followed by an offset).

  .glimpse_turbo
       An added data structure (used under glimpseindex -o or -b only) that
       helps to speed up queries significantly for large indexes.  Its size
       is 0.25MB.  Glimpse will work without it if needed.
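
  The matching rules given above for .glimpse_exclude (substring match by
  default, ^ and $ as anchors, * and ? as wild cards) can be approximated in
  Python.  This is a sketch of the described behavior, not glimpseindex's
  code.  The names exclude_regex and is_excluded are hypothetical, and the
  sketch assumes that ? stands for any single character and that * may also
  match '/'.

```python
import re


def exclude_regex(pattern):
    """Translate a .glimpse_exclude-style line into a Python regular expression."""
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    core = pattern[1:] if anchored_start else pattern
    core = core[:-1] if anchored_end else core
    parts = []
    for ch in core:
        if ch == "*":
            parts.append(".*")           # any run of characters, including '/'
        elif ch == "?":
            parts.append(".")            # any single character (assumption)
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")


def is_excluded(path, patterns):
    """True if any pattern matches some part of the path."""
    return any(re.search(exclude_regex(p), path) for p in patterns)
```

  Under these assumptions, /ftp/*html excludes both /ftp/pub/foo.html and
  /home/ftp/pub/html/whatever, while ^/ftp*html$ excludes only the former.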

STRUCTURED QUERIES
  Glimpse can search for Boolean combinations of "attribute=value" terms by
  using the Harvest SOIF parser library (in glimpse/libtemplate). To search
  this way, the index must be made by using the -s option of glimpseindex
  (this can be used in conjunction with other glimpseindex options). For
  glimpse and glimpseindex to recognize "structured" files, they must be in
  SOIF format. In this format, each value is prefixed by an attribute-name
  with the size of the value (in bytes) present in "{}" after the name of the
  attribute. For example, the following lines are part of an SOIF file:

       type{17}:       Directory-Listing
       md5{32}:        3858c73d68616df0ed58a44d306b12ba

  Any string can serve as an attribute name.  The query glimpse
  'pattern;type=Directory-Listing' will search for "pattern" only in files
  whose type is "Directory-Listing".  The file itself is considered to be one
  "object" and its name/url appears as the first attribute with an "@"
  prefix; e.g., @FILE { http://xxx... }.  The scope of Boolean operations
  changes from records (lines) to whole files when structured queries are
  used in glimpse (since individual query terms can look at different
  attributes and they may not be "covered" by the record/line).  Note that
  glimpse can only search for patterns in the value parts of the SOIF file:
  there are some attributes (like the TTL, MD5, etc.) that are interpreted by
  Harvest's internal routines.  See
  http://harvest.cs.colorado.edu/harvest/user-manual/ for more detailed
  information on the SOIF format.
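
  The attribute lines shown above are simple enough to parse by hand.  The
  Python sketch below reads lines of the form name{size}: value and checks
  the declared size.  It is a simplified illustration, not the Harvest SOIF
  parser library: it assumes each value fits on one line, and the names ATTR
  and parse_attributes are hypothetical.

```python
import re

# name{size}: value   (whitespace after the colon is skipped)
ATTR = re.compile(r"^([^\s{]+)\{(\d+)\}:\s*(.*)$")


def parse_attributes(lines):
    """Return {attribute: value} for single-line attributes; verify sizes."""
    attrs = {}
    for line in lines:
        m = ATTR.match(line)
        if not m:
            continue  # not an attribute line in this simplified form
        name, size, value = m.group(1), int(m.group(2)), m.group(3)
        actual = len(value.encode())
        if actual != size:
            raise ValueError(f"{name}: declared {size} bytes, found {actual}")
        attrs[name] = value
    return attrs
```

  Applied to the two example lines, this yields the attributes type and md5
  with values of 17 and 32 bytes, matching the sizes given in braces.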

REFERENCES

  1.   U. Manber and S. Wu, "GLIMPSE: A Tool to Search Through Entire File
       Systems," Usenix Winter 1994 Technical Conference, San Francisco
       (January 1994), pp. 23-32.  Also, Technical Report #TR 93-34, Dept. of
       Computer Science, University of Arizona, October 1993 (a postscript
       file is available by anonymous ftp at
       cs.arizona.edu:reports/1993/TR93-34.ps).

  2.   S. Wu and U. Manber, "Fast Text Searching Allowing Errors," Communica-
       tions of the ACM 35 (October 1992), pp. 83-91.

SEE ALSO
  agrep(1), ed(1), ex(1), glimpseindex(1), glimpseserver(1), grep(1), sh(1),
  csh(1).

LIMITATIONS

  The index of glimpse is word based.  A pattern that contains more than one
  word cannot be found in the index.  The way glimpse overcomes this weakness
  is by splitting any multi-word pattern into its set of words and looking
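  for all of them in the index.  For example, glimpse 'linear programming'
  will first search the index for files containing both linear and
  programming, and then search those files for the combined pattern.

  The index of glimpse stores all patterns in lower case.  When glimpse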
  searches the index it first converts all patterns to lower case, finds the
  appropriate files, and then searches the actual files using the original
  patterns.  So, for example, glimpse ABCXYZ will first find all files
  containing abcxyz in any combination of lower and upper cases, and then
  search these files directly, so only the right cases will be found.  One
  problem with this approach is discovering misspellings that are caused by
  wrong cases.  For example, glimpse -B abcXYZ will first search the index
  for the best match to abcxyz (because the pattern is converted to lower
  case); it will find that there are matches with no errors, and will go to
  those files to search them directly, this time with the original upper
  cases. If the closest match is, say AbcXYZ, glimpse may miss it, because it
  doesn't expect an error.  Another problem is speed.  If you search for
  "ATT", it will look at the index for "att".  Unless you use -w to match the
  whole word, glimpse may have to search all files containing, for example,
  "Seattle" which has "att" in it.

  There is no size limit for simple patterns and simple patterns within
  Boolean expressions.  More complicated patterns, such as regular expres-
  sions, are currently limited to approximately 30 characters.  Lines are
  limited to 1024 characters.  Records are limited to 48K, and may be trun-
  cated if they are larger than that.  The limit of record length can be
  changed by modifying the parameter Max_record in agrep.h.

  Glimpseindex does not index words of size > 64.
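
  The effect of the lower-case index on a query such as "ATT" can be shown
  with a toy model.  In the Python sketch below, the index is a dictionary
  from lower-cased words to file names, which is an assumption made for
  illustration and not the format of .glimpse_index.  The function names are
  hypothetical.

```python
def index_lookup(index, pattern, whole_word=False):
    """index maps lower-cased word -> set of file names."""
    key = pattern.lower()
    hits = set()
    for word, names in index.items():
        # Without whole_word, any indexed word containing the key qualifies.
        if (word == key) if whole_word else (key in word):
            hits |= names
    return hits


def case_search(files, index, pattern, whole_word=False):
    """Check the original, case-sensitive pattern in the candidate files."""
    found = []
    for name in sorted(index_lookup(index, pattern, whole_word)):
        words = files[name].split()
        if (pattern in words) if whole_word else any(pattern in w for w in words):
            found.append(name)
    return found
```

  In this model a file containing only "Seattle" becomes a candidate for
  "ATT" and has to be read before it is rejected; with whole_word=True (the
  analogue of -w) it is never opened.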

BUGS

  A Boolean AND query that includes two patterns one of which is a prefix of
  the other (or equal to the other) may not work correctly.  Essentially
  glimpse will find the smallest pattern first, but will not backtrack to try
  to check again if it matches another pattern.  (We are not sure whether
  this is a bug or a feature, because there is no apparent reason to have
  patterns like that.)

  A Boolean query with a pattern of length 1 (i.e., one character only) may
  miss matches.

  In some rare cases, regular expressions using * or # may not match
  correctly.

  A query that contains no alphanumeric characters is not recommended (unless
  glimpse is used as agrep and the file names are provided).  This is an
  understatement.

  Please send bug reports or comments to glimpse@cs.arizona.edu.

DIAGNOSTICS
  Exit status is 0 if any matches are found, 1 if none, 2 for syntax errors
  or inaccessible files.
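
  A script that calls glimpse can branch on these exit codes.  The Python
  sketch below shows one way to do so.  The names STATUS, classify, and
  run_glimpse are hypothetical, and the program parameter exists only so that
  the wrapper can be tried with a different command.

```python
import subprocess

# Meanings of the exit codes listed above.
STATUS = {0: "match", 1: "no match", 2: "syntax error or inaccessible file"}


def classify(returncode):
    """Map an exit status to a short description."""
    return STATUS.get(returncode, "unknown")


def run_glimpse(args, program="glimpse"):
    """Run the command with args; return (status description, standard output)."""
    result = subprocess.run([program, *args], capture_output=True, text=True)
    return classify(result.returncode), result.stdout
```

  For example, run_glimpse(["-l", "needle"]) returns the description of the
  exit status together with whatever file names were printed.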