File locator web front-end in PHP

When my file server resided in the Windows environment, I made use of the Everything search engine to index the files and to search for them both locally and through Everything’s built-in web server.

This latter functionality is what I wanted to replicate once I built the ZFS-based FreeBSD file server and moved to it. All UNIX flavours have the locate command, which will use a pre-built database to quickly find a string in file names and paths on your server. So, the obvious solution was to install Apache and PHP and write a web front-end for locate.

An alternative option is to use Solr, a Lucene search engine front-end. I, however, wished to have something simpler and custom-made. This was also a good opportunity to explore a new programming language.

I had never written PHP code before, as my main area is ASP.NET and C#, but learning the ropes of PHP was an enjoyable task, and it is always good to learn another language. The result can be seen below.

The program will search for an arbitrary string in the locate database, optionally ignoring case. Another option lets the user restrict the search to the last segment of the path, which avoids a flood of nearly duplicate hits when the string occurs only in the directory portion of the path. The program will also highlight the hits.

This is what the simple UI of the program looks like:

Update 1

I’ve added some more desired functionality:

  • the program can now update the underlying locate database through the web interface
  • it now accepts ‘*’ and ‘?’ wildcards in the search string and highlights the results appropriately
  • it can now give direct links to the located content
  • it can now search for strings containing Unicode characters
  • highlighting is made Lynx-friendly

Forcing a database update involves running the update script as root, which will then su to user nobody. Apache (httpd), however, runs under a limited user www (or similar). To overcome this obstacle, I used a solution suggested in this Stack Overflow thread:

  1. Modify update.launcher.c (code below) to point to the update script, which is typically located in /etc/periodic/weekly/310.locate
  2. #gcc update.launcher.c -o update.launcher
  3. #chown root update.launcher
  4. #chmod u=rwx,go=xr,+s update.launcher
  5. Place the program on your server and modify the UPDATE_SCRIPT_LAUNCHER constant in locator.html to point to it
  6. Verify that the LOCATE_DB_FILE constant points to the database file, so that the program is able to report the state of the database
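
After step 4 it can be worth confirming that the launcher really carries the setuid bit, since the web interface only reports that the updater "failed to start" when it does not. The following is a minimal sketch, not part of the original program; the function name launcher_is_setuid is hypothetical, and the path in the usage comment is simply the one from the configuration constants.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original program):
# succeeds only if the given file is executable AND has the setuid bit set.
launcher_is_setuid() {
    [ -x "$1" ] && [ -u "$1" ]
}

# Usage, assuming the launcher location from UPDATE_SCRIPT_LAUNCHER:
#   launcher_is_setuid /usr/local/www/apache22/data/update.launcher \
#       || echo "setuid bit missing: repeat the chown/chmod steps"
```

The check does not verify that the owner is root; `ls -l` on the file shows both the owner and the `s` in the permission string.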

Remember to change the value of the SEARCH_ROOT constant, which limits the search to that directory tree.
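
To illustrate how SEARCH_ROOT limits the search: the program wraps the search string in wildcards and prefixes it with the root before handing it to locate. Here is a rough shell sketch of that pattern building; the function name build_locate_pattern is made up for illustration, and "report" is just a sample search term.

```shell
#!/bin/sh
# Hypothetical helper: builds the glob pattern that is passed to locate.
# $1 plays the role of SEARCH_ROOT, $2 is the user's search string.
build_locate_pattern() {
    printf '%s*%s*\n' "$1" "$2"
}

# Print the pattern for a sample search under /zstore/.
build_locate_pattern /zstore/ report

# The actual lookup would then be something like:
#   locate -i "$(build_locate_pattern /zstore/ report)"
```

Because the pattern starts with the root path, anything in the locate database outside that tree can never match.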

If you want the program to display direct links to the located content, perform the following 2 steps:

  1. Create a symbolic link to the root of your searchable content, as defined in SEARCH_ROOT
  2. Update the VIEW_SYMLINK_PREFIX constant to point to that symlink, relative to the web root or to the location of locator.html. (If this constant is not defined, the program will not generate any links.)

There are a few caveats and assumptions:

  • There is no thorough error checking involved
  • Unicode search is always case sensitive

locator.html


<?php
//Configuration constants:
// Constant names must be quoted strings; bare names fail on current PHP versions.
define("SEARCH_ROOT", "/zstore/");
define("UPDATE_SCRIPT_LAUNCHER", "/usr/local/www/apache22/data/update.launcher");
define("LOCATE_DB_FILE", "/var/db/locate.database");
define("VIEW_SYMLINK_PREFIX", "./zstore/");

$searchString = isset($_POST['searchString']) ? $_POST['searchString'] : '';
$ignoreCaseChecked = isset($_POST['caseIgnore']) ? 'checked' : '';
$lastSegmentChecked = isset($_POST['lastSegment']) ? 'checked' : '';

if(isset($_POST['clearButton']))
{
    $searchString = '';
    $ignoreCaseChecked = '';
    $lastSegmentChecked = '';
}
?>

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style type="text/css">
span.marker
{
    background:yellow;
    font-weight:bold;
}
span.error
{
    color:red;
    font-weight:bold;
}
span.info
{
    color:blue;
}
</style>
<title>File Locator</title>
</head>

<body>
<form action="<?php print htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
<table width="100%" border="0">
<tr><td width="80%">
Search for: 
<input type="text" name="searchString" value="<?php echo htmlspecialchars($searchString); ?>" /> 
(wildcards * and ? are allowed)<br>
<input type="checkbox" name="caseIgnore" <?php echo $ignoreCaseChecked; ?> /> 
Ignore case<br>
<input type="checkbox" name="lastSegment" <?php echo $lastSegmentChecked; ?> /> 
Search in last segment only<br>
<input type="submit" name="submitButton" value="Go" /> 
<input type="submit" name="clearButton" value="Clear" />
</td><td valign="top" align="right">
<input type="submit" name="updateButton" value="Update file name database" /><br><br>
<input type="submit" name="stateButton" value="Show database state" />
</td></tr>
</table>
</form>
<hr>

<?php
if(isset($_POST['updateButton']))
{
    updateDatabase();
}
else if(isset($_POST['stateButton']))
{
    showDatabaseState();
}
else if(isset($_POST['submitButton']))
{
    $ignoreCase = isset($_POST['caseIgnore']);
    $lastSegmentSearch = isset($_POST['lastSegment']);

    if(updateLocatorIsRunning())
    {
	print '<span class="info">File name database is currently being updated.<br>Search results may be inaccurate.<br><br></span>';
    }

    $ret = array();
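    // escapeshellarg() keeps the search string from breaking out of the shell command.
    // Note: under the C locale it may drop multibyte characters, so a UTF-8
    // LC_CTYPE may be needed for Unicode searches.
    $command = 'locate ' . ($ignoreCase ? '-i ' : '') .
		escapeshellarg(SEARCH_ROOT . '*' . $searchString . '*');

    exec($command, $ret);

    // Quote regex metacharacters first, then map the locate wildcards to regex equivalents.
    $word = str_replace(array('\?', '\*'), array('.', '.+'), preg_quote($searchString, '/'));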

    foreach ($ret as $line)
    {
	if($lastSegmentSearch && !foundInLastSegment($line, $word, $ignoreCase))
	{
	    continue;
	}
	
	$find = highlight($line, $word, $ignoreCase);
	if(defined("VIEW_SYMLINK_PREFIX"))
	{
	    $link = str_replace(SEARCH_ROOT, VIEW_SYMLINK_PREFIX, $line);
	    print '[<a href="' . $link . '">View</a>] ';
	}
	print "$find<br>\n";
    }
}

function showDatabaseState()
{
    if(updateLocatorIsRunning())
    {
	print '<span class="info">File name database is currently being updated.</span>';
	return;
    }
    else
    {
	clearstatcache();
	date_default_timezone_set('UTC');
	$dbtime = date("D, d.m.Y, H:i:s", filemtime(LOCATE_DB_FILE));
	print '<span class="info">File name database was last updated on ' .
		$dbtime . '</span>';
    }
}

function updateDatabase()
{
    if(updateLocatorIsRunning())
    {
	print '<span class="error">File name database is already being updated!</span>';
	return;
    }
    $command = UPDATE_SCRIPT_LAUNCHER . " > /dev/null 2>&1 &";
    exec($command);
    sleep(1);

    if(updateLocatorIsRunning())
    {
	print '<span class="info">Started updating file name database.</span>';
    }
    else
    {
	print '<span class="error">File name database updater failed to start.</span>';
    }
}

function updateLocatorIsRunning()
{
    $ret = array();
    $command = "ps -U nobody -o command";
    exec($command, $ret);
    foreach ($ret as $line)
    {
	if(strstr($line, "locate.updatedb"))
	{
	    return true;
	}
    }
    return false;
}

function foundInLastSegment($line, $searchString, $ignoreCase)
{
    $search = '/(?=[^\/]+$)' . $searchString . ($ignoreCase ? '/i' : '/');
    return preg_match($search, $line);
}

function highlight($text, $word, $ignoreCase)
{
    return preg_replace("/($word)/U" . ($ignoreCase ? "i" : ""),
                        "<span class=\"marker\"><b>$1</b></span>",
                         $text);
}
?>
</body>
</html>

update.launcher.c


#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

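int main (int argc, char *argv[])
{
    /* Become root; give up if the binary is not setuid root. */
    if (setuid (0) != 0)
        return 1;
    /* Run the locate update script (same path as in step 1 above). */
    system ("/bin/sh /etc/periodic/weekly/310.locate");
    return 0;
}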

RegEx to match a substring after a delimiter

They say that if you have a problem and want to use RegEx to solve it, then you have two problems. So true! 🙂

My specific problem was that I wanted to search for a string within a substring after a delimiter sign, more precisely, in the last segment of a path. Here is an example:

/some/test_path/to/search/with_a_Test_file.txt

The RegEx, searching without case sensitivity for “test”, should return a match only in the portion of the string after the last “/”.
All the suggestions I could find on Stack Overflow were concerned with matching the entire file name and not a portion of it, so I had to learn some advanced RegEx. Fast.

The answer was something called “lookahead”, which is well explained on the Regular-Expressions.info site.

The resulting RegEx string looks like some serious swearing in a cartoon bubble… 🙂 Here is the code, which is accepted by PHP’s preg_match() function:

/(?=[^\/]+$)test/i

According to my (rather limited) understanding of RegEx, the first portion in the parentheses, after the “?=”, is the lookahead. It asserts that everything from the current position to the end of the string is free of “/” characters, so a match can only start inside the last segment, i.e. the file name. Then comes the search substring, “test”, which is matched at that position and, finally, the “i” after the closing delimiter is the switch requesting a case-insensitive match.
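
For comparison, the same "last segment only" test can be done in plain shell without any RegEx. This is not what the PHP program does; it is a sketch of an alternative technique using parameter expansion to cut off the directory part first. The function name in_last_segment is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if term $2 occurs, ignoring case,
# in the last segment of path $1. Uses parameter expansion, not a lookahead.
in_last_segment() {
    last=${1##*/}    # strip everything up to and including the final "/"
    last_lc=$(printf '%s' "$last" | tr '[:upper:]' '[:lower:]')
    term_lc=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    case $last_lc in
        *"$term_lc"*) return 0 ;;   # quoted, so the term is matched literally
        *) return 1 ;;
    esac
}

# "test" occurs in the file name here, so this reports a match:
in_last_segment /some/test_path/to/search/with_a_Test_file.txt test && echo "match"
# Here "test" occurs only in a directory name, so the check fails:
in_last_segment /some/test_path/to/search/file.txt test || echo "no match"
```

The lookahead has the advantage that the match position refers to the full path, which is what the highlighting in the PHP program needs.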

Adding disks by label in ZFS and making those labels stick around

When I started building my new file server, I decided to add the disks to ZFS vdevs by label and not by the device id, i.e.:

#glabel label l1 /dev/ada0
#glabel label l2 /dev/ada1

After a reboot, those labelled disks suddenly started to show up as /dev/ada0 and /dev/ada1 again and the labels disappeared from /dev/label directory.

For the existing disks, I tried to offline each disk in turn and re-label it. A new problem turned up then: I could not replace the offlined /dev/adaX disks with the same disks under their labels, as zpool gave an error saying that the device “is part of active pool”.

After some further searching, I found out that I had to zero out the first and the last megabyte of the disk before labelling it and replacing it in the pool:

#dd if=/dev/zero of=/dev/ada0 bs=1m count=1
#dmesg | grep ada0
<read the block count value, subtract 2048 and provide the result to the seek switch below>
#dd if=/dev/zero of=/dev/ada0 seek=358746954
#glabel label l1 /dev/ada0
#zpool replace zstore /dev/ada0 label/l1
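
The seek value in the second dd call follows from simple arithmetic: dd uses 512-byte blocks by default, so the last megabyte is the last 2048 blocks. A small sketch of that calculation, where the function name last_mb_seek is hypothetical and 358749002 is an assumed sector count chosen so that the result matches the seek value used above:

```shell
#!/bin/sh
# Hypothetical helper: given the disk's total number of 512-byte sectors
# (as read from dmesg), print the sector at which the last megabyte starts.
# 2048 sectors * 512 bytes = 1 MiB.
last_mb_seek() {
    echo $(( $1 - 2048 ))
}

# Assumed sector count, for illustration only; read the real one from dmesg.
last_mb_seek 358749002
```

With that value passed to seek, dd writes zeros from that sector on and stops with an error when it reaches the end of the device, which is expected.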

At this point zpool status was again showing labels. However, after the next reboot, the labels were gone again and I was pretty frustrated. Back to the search engine.

On page 3 of some discussion of this matter, I noticed two additional steps, which should fix the problem. After performing the steps above and re-labelling and replacing the disks, I issued:

#zpool export zstore
#zpool import -d /dev/label zstore

The -d switch is what instructs zpool to read the disk references from a specific directory and it makes the labels stick around.

When I added subsequent new disks to the pool, I followed these steps to make the labels stick and to avoid re-labelling at a later point:

  1. Zero out the first and the last part of each disk that will comprise the new vdev (especially important if the disk has been in use before and does not come straight from the factory)
  2. Label each disk with glabel
  3. #zpool add zstore raidz label/l5 label/l6 etc….
  4. #zpool export zstore
  5. #zpool import -d /dev/label zstore

And the labels never disappeared again.
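
Steps 3 to 5 above can be wrapped in a small dry-run helper that only prints the zpool commands for a given pool and set of labels, so that they can be reviewed before being run by hand. This is a sketch; the function name print_add_vdev_steps is hypothetical, and nothing in it touches a disk or a pool.

```shell
#!/bin/sh
# Dry run: prints the zpool commands for adding one raidz vdev by label.
# $1 is the pool name; the remaining arguments are glabel names.
print_add_vdev_steps() {
    pool=$1
    shift
    members=""
    for label in "$@"; do
        members="$members label/$label"
    done
    echo "zpool add $pool raidz$members"
    echo "zpool export $pool"
    echo "zpool import -d /dev/label $pool"
}

# Pool and label names taken from the steps above.
print_add_vdev_steps zstore l5 l6
```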

This same procedure can be applied to labelling your ZIL and L2ARC devices.

Reporting correct space usage for Samba shared ZFS volumes

ZFS is all the rage now and there are lots of tutorials and how-tos out there covering most of the topics. There is one issue, though, for which I could not find any ready solution. When sharing a ZFS volume over Samba, Windows would report an incorrect total volume size. More precisely, Windows would always show the same value for both total size and free size, and both values would change as the volume got used.
This is obviously not what we want. Some digging uncovered that Samba relies internally on the result of the df program, which will report incorrect values for ZFS systems. More digging led to this page and to the man pages of smb.conf, showing that it is possible to override the space usage detection behaviour by creating a custom script and pointing the Samba server to it using the following entry in smb.conf:


[global]
dfree command = /usr/local/bin/dfree


The following shell script is where the magic lies (tested on FreeBSD):

#!/bin/sh

CUR_PATH=`pwd`

let USED=`zfs get -o value -Hp used $CUR_PATH` / 1024 > /dev/null
let AVAIL=`zfs get -o value -Hp available $CUR_PATH` / 1024 > /dev/null

let TOTAL = $USED + $AVAIL > /dev/null

echo $TOTAL $AVAIL

And the following is a variation that works on Linux (courtesy of commenter nem):

#!/bin/bash

CUR_PATH=`pwd`

USED=$((`zfs get -o value -Hp used $CUR_PATH` / 1024)) > /dev/null
AVAIL=$((`zfs get -o value -Hp available $CUR_PATH` / 1024)) > /dev/null

TOTAL=$(($USED+$AVAIL)) > /dev/null

echo $TOTAL $AVAIL

Make sure to check the comments section, as several variations of this script are posted there, for example one accounting for both ZFS and non-ZFS shares on the same system!

I can’t use zpool list as it reports the total size for the pool, including parity disks, so the total size might be greater than the real usable total size.
zfs list could have been used if there was a way to display the information in bytes and not in human-readable form of varying granularity.
The solution was to use zfs get and then normalise the values reported to Samba to the 1024 byte blocks. (I tried providing the third, optional, parameter of 1 as mentioned in the man pages, but Samba seemed to have trouble parsing really large byte values, so I ended up doing the normalisation in the script).

Also, I can’t rely on the $1 input parameter to the script, as it turned out to always be equal to ‘.’, which is usable for df, but not for zfs. This ‘.’ led me to check the working directory of the invocation and, bingo, it turned out to be the root path of the requested volume, so I could simply get the value from pwd and pass it to zfs.
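
The normalisation step can be checked in isolation, without a ZFS pool at hand. The sketch below takes the two byte values that `zfs get -Hp` would report and prints the "total available" pair in 1024-byte blocks, which is the format Samba expects from the dfree command. The function name dfree_blocks and the sample values (3 GiB used, 1 GiB available) are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: $1 = used bytes, $2 = available bytes.
# Prints "<total> <available>" in 1024-byte blocks.
dfree_blocks() {
    used=$(( $1 / 1024 ))
    avail=$(( $2 / 1024 ))
    echo "$(( used + avail )) $avail"
}

# Assumed sample values: 3 GiB used, 1 GiB available.
dfree_blocks 3221225472 1073741824
```

For the sample values this prints 4194304 1048576, i.e. a 4 GiB volume with 1 GiB free.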