Levent Serinol's Blog: 2009

Friday, September 11, 2009

I/O usage per process on Linux

Linux kernel 2.6.20 and later support per-process I/O accounting. You can read the I/O read/write counters of any process or thread through the /proc filesystem. To check whether your kernel was built with I/O accounting, simply look for the /proc/self/io file. If it exists, I/O accounting is built in.


$ cat /proc/self/io
rchar: 3809
wchar: 0
syscr: 10
syscw: 0
read_bytes: 0
write_bytes: 0
cancelled_write_bytes: 0


Field Descriptions:

rchar  - bytes read
wchar  - bytes written
syscr  - number of read syscalls
syscw  - number of write syscalls
read_bytes  - number of bytes this process caused to be read
            from underlying storage
write_bytes - number of bytes this process caused to be written
            to underlying storage
cancelled_write_bytes - number of bytes this process caused not to be
            written, for example by truncating dirty page cache

As you know, every process is represented by its pid under the /proc directory. You can read any process's I/O accounting values from its /proc/#pid/io file. A utility called iotop collects these values and displays them in a top-like view, so you can watch the I/O activity of your processes.
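As a minimal sketch of reading these counters from a script, the snippet below pulls single fields out of a /proc/#pid/io file. The helper name io_field is made up for this example, and reading another user's io file normally needs root.

```shell
#!/bin/sh
# Print one counter from /proc/<pid>/io-style text read on stdin.
# io_field is a hypothetical helper name, not a standard tool.
io_field() {
    awk -v key="$1" -F': *' '$1 == key { print $2 }'
}

# Summarise storage I/O for a PID (default: this shell itself).
pid="${1:-$$}"
if [ -r "/proc/$pid/io" ]; then
    rb=$(io_field read_bytes < "/proc/$pid/io")
    wb=$(io_field write_bytes < "/proc/$pid/io")
    echo "pid $pid: read_bytes=$rb write_bytes=$wb"
else
    echo "no I/O accounting for pid $pid (file missing or unreadable)"
fi
```

The exact match on the field name keeps write_bytes from also matching cancelled_write_bytes.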

Friday, September 4, 2009

problem compiling Mysql 5.1 with sphinx engine support

MySQL's built-in full-text search is not very powerful, and it is supported only by the MyISAM engine. Sphinx is a full-text search engine that can replace it for MySQL and PostgreSQL. It is also possible to use Sphinx full-text search with both the MyISAM and InnoDB engines. You can read detailed information about compiling MySQL with Sphinx on the Sphinx site.

I ran into the following error message while compiling MySQL with InnoDB and Sphinx engine support (./configure --prefix=/data/mysql3/ --with-plugins=sphinx,innobase). I used the InnoDB (innobase) that is bundled with MySQL. You can also compile the InnoDB plugin, which has additional features such as on-the-fly compression and speed improvements.


configure: error: unknown plugin: sphinx

The trick was to run the autorun.sh script in MySQL's source tree before running configure with Sphinx engine support. This script creates a new configure script that knows how to compile the Sphinx engine.


# ./BUILD/autorun.sh

Thursday, September 3, 2009

nethogs network utility

Nethogs is a small but very useful utility if you want to see your machine's bandwidth usage per process. Here is a simple screenshot from my Linux box.

Sunday, July 12, 2009

Fast Copy on Unix systems

I needed to copy a large number of small files and directories (1.2 TB) to another machine with netcat (nc). But I ran into a problem when I tried to run nc in the background under bash. Bash suspends standard input when a process is moved to a background job, and the nc processes died when their standard input was blocked. So I found another netcat-like utility called socat. Socat does not stop executing while it runs as a background job in the bash shell. I was able to saturate 100 Mbit Ethernet with a few socat processes on the server and client side. First, I ran socat processes on the destination server in the background, each listening on a specific port. The following command runs socat in the background, listening on TCP port 4123 and piping incoming data to the tar command.

# (socat -u tcp4-listen:4123 - | tar xzpf - -C .) &

On the source server, I ran socat to feed the outgoing tar data to the destination socat daemon listening on TCP port 4123. My destination server's IP address was 192.168.36.199.

# (tar -czpf - /mydir/1/ | socat -u - tcp4:192.168.3.199:4123)&

So, with a simple bash script I was able to run a few socat daemons on the destination server on different ports and feed them from the source server to achieve a higher data transfer rate.
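The script itself is not shown here, so the following is only a dry-run sketch of the idea: one listener and one sender per directory, each pair on its own port. It prints the commands instead of running them. The base port and destination address come from the text above; the extra directory names and the helper names are assumptions for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print one listener/sender command pair per directory.
# BASE_PORT and DEST follow the post; directories 2 and 3 are illustrative.
BASE_PORT=4123
DEST=192.168.36.199

# Command to run on the destination for stream number $1.
listener_cmd() {
    port=$((BASE_PORT + $1))
    echo "(socat -u tcp4-listen:$port - | tar xzpf - -C .) &"
}

# Command to run on the source for stream number $1 and directory $2.
sender_cmd() {
    port=$((BASE_PORT + $1))
    echo "(tar czpf - $2 | socat -u - tcp4:$DEST:$port) &"
}

n=0
for dir in /mydir/1/ /mydir/2/ /mydir/3/; do
    listener_cmd "$n"
    sender_cmd "$n" "$dir"
    n=$((n + 1))
done
```

Start all listeners on the destination before starting any sender, otherwise the connection attempts are refused.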

Monday, June 8, 2009

socket programming with Bash

I have many memcached servers around, so I needed a top-like tool to check status information on these servers. I decided to write it in bash (100% bash except for the printf, date and sleep commands :) ). You can easily enhance the script, for example by reading server names and ports from a config file or taking the display interval from the command line. Memcached supports the TCP and UDP protocols for communication with clients. I used TCP in my bash script. Here is information about memcached's supported server commands. My example script uses the "stats" command. Here is sample output of the "stats" command from one memcached server.

I used the information provided by the "stats" command. The script uses pure bash builtin commands to access a TCP server like memcached.

You can download the script from here.

Here is sample output from the script.

If you're using Ubuntu, you probably will not be able to run this script, because bash is not compiled with the "--enable-net-redirections" option on Ubuntu systems. Red Hat based distributions such as Fedora, CentOS and Red Hat ES have no problem with bash net redirections.
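The downloadable script is not reproduced here, so this is only a rough sketch of the technique: open a TCP connection through bash's /dev/tcp pseudo-device, send "stats" and pick one value out of the reply. Unlike the original it is not pure bash (it leans on tr, awk and cat), and the function names, host and port are assumptions.

```shell
#!/bin/bash
# Sketch only: helper names, host and port are assumptions for illustration.

# Extract one value from "STAT <name> <value>" lines read on stdin.
stat_value() {
    tr -d '\r' | awk -v name="$1" '$1 == "STAT" && $2 == name { print $3 }'
}

# Send "stats" to a memcached server at host $1, port $2.
# Needs a bash built with --enable-net-redirections.
memcached_stats() {
    {
        printf 'stats\r\nquit\r\n' >&3   # "quit" makes the server close the socket
        cat <&3                          # read the reply until the socket closes
    } 3<>"/dev/tcp/$1/$2"
}

# With a reachable server:
#   memcached_stats 127.0.0.1 11211 | stat_value curr_connections
# Parser demo on canned text, no server needed:
printf 'STAT uptime 42\r\nEND\r\n' | stat_value uptime
```

Wrapping the conversation in a brace group keeps file descriptor 3 scoped to the function, so a failed connection does not affect the rest of the script.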

Sunday, May 31, 2009

Mirroring your OpenSolaris root disk with ZFS

Solaris/OpenSolaris supports booting from a ZFS filesystem. For example, if you look at your OpenSolaris installation you'll notice the ZFS rpool (root pool). ZFS is the default filesystem on OpenSolaris (optional with Solaris). If you look into the details of that pool with the zpool utility, you'll see that it's a single disk or a partition (depending on how you partitioned your disk during the OpenSolaris installation). If that single disk crashes, you may lose your data and operating system. That's a single point of failure. The simple solution is to convert our zpool into RAID-1 with a second disk (of the same capacity or bigger).

As you can see, our zpool contains a single disk called c4t0d0. Let's add a new disk to our rpool to convert it into a mirror. zpool will do it automatically for us. One important point here is that Solaris doesn't support booting from devices with an EFI label. So, if your new disk has an EFI label, you need to convert it to an SMI label. I'll use the fdisk command to create a Solaris2 partition and remove any EFI labels if present. My new disk is named c4t1d0.

After creating a partition that covers the whole disk, I copied the partition map of the original rpool member c4t0d0 to our new disk c4t1d0.

Now it's time to attach our brand new disk to rpool. After attaching the disk, our rpool will be converted into a mirror (RAID-1). One thing is still missing: the GRUB boot manager. We need to install it on the new disk by hand using the installgrub command.

Now we have created our mirror on rpool. It will take time to sync the original disk to the new disk. You can watch this process with zpool's status command.
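The exact commands were shown as screenshots, so here is a hedged dry-run sketch of the sequence described above. It only prints the commands. The disk names come from the text; the slice and partition suffixes (p0, s0, s2) are assumptions, so check your own layout with zpool status before running anything like this.

```shell
#!/bin/sh
# Dry-run sketch: prints the steps instead of executing them.
# Suffixes p0/s0/s2 are assumptions; verify them against "zpool status rpool".
SRC=c4t0d0   # existing rpool member
NEW=c4t1d0   # disk being added

mirror_steps() {
    # 1. one Solaris2 fdisk partition spanning the whole new disk
    echo "fdisk -B /dev/rdsk/${NEW}p0"
    # 2. copy the slice table (VTOC) from the original disk
    echo "prtvtoc /dev/rdsk/${SRC}s2 | fmthard -s - /dev/rdsk/${NEW}s2"
    # 3. attach the new slice; this turns rpool into a mirror
    echo "zpool attach rpool ${SRC}s0 ${NEW}s0"
    # 4. make the new disk bootable
    echo "installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/${NEW}s0"
    # 5. watch the resilver
    echo "zpool status rpool"
}

mirror_steps
```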

Saturday, May 23, 2009

FreeBSD and procfs

Many Unix systems support the proc file system (process file system). Procfs is a pseudo filesystem. FreeBSD is one of those Unix systems. Unlike Linux, whose procfs exposes information other than processes, FreeBSD's procfs is only about the processes on the system. FreeBSD doesn't mount procfs at boot by default. You need to add it to fstab to mount it automatically at boot, or mount it by hand for temporary use. The common mount point for procfs is /proc on Unix systems.


echo "none            /proc        procfs  rw   0 0" >> /etc/fstab


mount -t procfs none /proc

Every process is represented as a directory named by its pid under the /proc mount point.
Procfs gives information on the running processes on the system, such as memory mappings, command line arguments, process resource limits and much more. The following is a sample procfs directory structure on a FreeBSD machine.

As you can see, every pid is represented as a directory in procfs. Every directory contains the following files. Some of the files are read-only or write-only, depending on whether you read information from the process or send information to it.



- status  (read-only) : returns process status
- mem     (read/write): virtual memory image of the process
- file    (depends)   : symbolic link to the executable of the running process
- regs    (read/write): process registers
- ctl     (write-only): used to send signal to process or 
                        attach/detach it for debugging
- cmdline (read-only) : command line arguments of running process
- rlimit  (read-only) : current resource limits of running process
- map     (read-only) : memory mappings of the running process.
- etype   (read-only) : type of the executable (eg. FreeBSD ELF32)
- fpregs  (read/write): floating point registers

Some of the information provided by these files is in binary format. For example, the "regs" and "fpregs" files are binary. They depend on the architecture of the underlying machine (i386, amd64, sparc64, etc.). The following is the format of the "regs" file on an i386 machine.



struct reg {
        unsigned int    r_fs;
        unsigned int    r_es;
        unsigned int    r_ds;
        unsigned int    r_edi;
        unsigned int    r_esi;
        unsigned int    r_ebp;
        unsigned int    r_isp;
        unsigned int    r_ebx;
        unsigned int    r_edx;
        unsigned int    r_ecx;
        unsigned int    r_eax;
        unsigned int    r_trapno;
        unsigned int    r_err;
        unsigned int    r_eip;
        unsigned int    r_cs;
        unsigned int    r_eflags;
        unsigned int    r_esp;
        unsigned int    r_ss;
        unsigned int    r_gs;
};

You can use the "cat" command to read the text-based information provided by procfs, unlike the binary files mentioned above such as "regs", "fpregs" and "mem".



[root@freebsd ]# cat cmdline 
/usr/sbin/moused-p/dev/ums0-tauto-I/var/run/moused.ums0.pid

You can check a running process's resource limits by looking at its rlimit file.



[root@freebsd ]# cat rlimit 
cpu -1 -1
fsize -1 -1
data 536870912 536870912
stack 67108864 67108864
core -1 -1
rss -1 -1
memlock -1 -1
nproc 5547 5547
nofile 11095 11095
sbsize -1 -1
vmem -1 -1

The first number is the current (soft) limit and the second is the maximum (hard) limit of the given resource. "-1" means unlimited. For example, the nofile (open file descriptors) limit for this process is 11095 for both the current and the maximum value.
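To make such a table easier to read, a small filter can replace -1 with a word. This is just a sketch: show_rlimits is a made-up helper name, the input lines are copied from the output above, and the column labels reflect the soft and hard limit that procfs prints.

```shell
#!/bin/sh
# Print a procfs rlimit table with -1 shown as "unlimited".
# show_rlimits is a hypothetical helper; it reads rlimit-format text on stdin.
show_rlimits() {
    awk '{
        cur = ($2 == -1) ? "unlimited" : $2
        max = ($3 == -1) ? "unlimited" : $3
        printf "%-8s current=%s max=%s\n", $1, cur, max
    }'
}

# On FreeBSD you would run: show_rlimits < /proc/<pid>/rlimit
# Here we feed it three lines from the sample output.
show_rlimits <<'EOF'
cpu -1 -1
data 536870912 536870912
nofile 11095 11095
EOF
```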

"status" file gives information about process status as follows.


- command name
- pid
- parent pid
- process group id
- session id
- major/minor of the terminal, "-" if no terminal is in action
- process flags
- process start time in seconds and microseconds separated by comma
- user time in seconds and microseconds separated by comma
- system time in seconds and microseconds separated by comma
- wait channel name
- effective userid and group lists separated by comma

The following is the "cat status" output for one process.

[root@freebsd ]# cat status 
svscan 28246 1 28245 0 ttyp0 noflags 1242921860,839572 0,318052 3,263576 nanslp 0 0 0,0,0,5 -
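As a sketch of how the field list maps onto that line, the filter below labels the first eleven space-separated fields. The helper name and the short labels are my own; the trailing credential fields are left out.

```shell
#!/bin/sh
# Label the leading fields of a FreeBSD procfs status line read on stdin.
# label_status and the label names are hypothetical, chosen for this example.
label_status() {
    awk '{
        split("command pid ppid pgid sid tty flags start utime stime wchan", name, " ")
        for (i = 1; i <= 11; i++)
            printf "%s=%s\n", name[i], $i
    }'
}

# Sample line taken from the output above.
echo 'svscan 28246 1 28245 0 ttyp0 noflags 1242921860,839572 0,318052 3,263576 nanslp 0 0 0,0,0,5 -' | label_status
```

This assumes the command name contains no spaces, since the fields are split on whitespace.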

Saturday, May 16, 2009

Configuring ILOM manager on Sun X4540 Thumper

The Sun X4540 Thumper storage server provides 48TB of hard disk capacity in one server. It has two quad-core AMD Opteron x64 processors and 32GB of memory. Thumper has 48 hard disk slots, and you can use a 1TB SATA hard disk in each slot. This gives you 48TB of disk space, as you can see in the following photo.

You can run Solaris, OpenSolaris, Linux and Windows on Thumper. Thumper includes Sun Microsystems' ILOM (Integrated Lights Out Manager). You can use ILOM to upgrade your server's firmware, access your server's console remotely from your web browser, view and modify sensors, inspect hardware components and so on. You can access ILOM through a serial console, an ssh client or a web browser. You need to set an IP address for ILOM if you want to use ssh or a web browser. If you don't have a DHCP server around, you'll need to configure ILOM with a static IP address. I used a serial console connected to Thumper's "Ser MGT" port, with a Linux desktop running minicom on the other side. You need to change your minicom settings to 9600 baud, 8 data bits, no parity and one stop bit (9600-8-N-1) with no hardware or software flow control.

Yes, that's right, it's Linux :) . Sun ILOM uses embedded Linux. When you connect to ILOM through a serial console, you need to log in with ILOM's factory default login name "root" and password "changeme". ILOM has a simple built-in command interface. You can use the "help" command for details once you are logged in to ILOM. Setting a static IP address for ILOM is simple. I captured the screenshot when I gave one of my local IP addresses to Thumper's ILOM.

After you commit your static IP network settings with the "set commitpending=true" command, you need to connect Thumper's "net mgt" Ethernet interface to your network. Now you can access ILOM with any ssh client or a web browser. My plan is to use OpenSolaris and a 48TB ZFS RAID-Z on this storage server and see the results in a production environment :)

Sunday, May 3, 2009

getting detailed process information on Freebsd

FreeBSD's procstat utility gives detailed information about all of the processes on the system, or about just a given process ID, such as virtual memory mappings, thread stacks, command line arguments and open files.

Running procstat with the "-a" argument prints information such as pid, ppid, login, process name and wchan (the event the process is waiting on) for all processes on the system.

# procstat -a

The procstat "-c" option shows the command line arguments of a process, and the "-f" option shows the files opened by the given process. The following sample output shows the command line arguments and the files currently opened by a vi process, along with their permissions.

You can get the virtual memory mappings of a process with the "-v" option. The following sample shows the virtual memory mappings of a vi process.

Finally, the "-k" option shows the kernel thread stacks of the given process.

watching interrupt usage on Freebsd

You can easily see how many interrupts each device has taken on Linux by simply looking at the /proc/interrupts file. FreeBSD has no such information in procfs, but you can get the same information with the "vmstat -i" command. However, you can't see which CPU is handling which IRQ, as Linux shows. The following is sample output of the "vmstat -i" command from a FreeBSD machine.

You can write a simple C shell script to watch interrupt usage like this.

Vmstat uses the sysctl interface to gather interrupt usage information through the hw.intrnames and hw.intrcnt OIDs. As the names suggest, intrnames holds all the interrupt names and intrcnt holds the count for each interrupt since system startup.
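The C shell script itself is not reproduced here. As a stand-in, this is a small Bourne shell sketch of the same watch-loop idea: run a command several times with a pause in between. repeat_cmd is a made-up helper name, and the demo uses date only so the snippet runs on any system.

```shell
#!/bin/sh
# Run a command a fixed number of times with a pause in between.
# repeat_cmd is a hypothetical helper, not a FreeBSD tool.
#   $1 = number of runs, $2 = seconds between runs, rest = command to run
repeat_cmd() {
    count=$1
    delay=$2
    shift 2
    while [ "$count" -gt 0 ]; do
        "$@"
        count=$((count - 1))
        if [ "$count" -gt 0 ]; then
            sleep "$delay"
        fi
    done
}

# On FreeBSD: repeat_cmd 10 5 vmstat -i
# Harmless demo that runs anywhere:
repeat_cmd 2 0 date
```

Comparing the totals of two consecutive vmstat -i snapshots shows which devices are raising interrupts right now rather than since boot.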

Friday, April 24, 2009

Linux NFS and "permission denied" mount problem

Today, I ran into the following strange NFS mount problem.


[root@nfsclient ~]# mount /mymount
mount: XXX.XXX.XXX.XXX:/mymount failed, 
reason given by server: Permission denied

My NFS client is FC4 and the NFS server is CentOS 4.6. First, I checked the server from the NFS client side with the "showmount -e myserver" command. The exports were okay. After that, I decided to check the RPC services on the server with the "rpcinfo -p" command.


[root@nfsclient ~]# rpcinfo -p myserver

  program vers proto   port
   100000    2   tcp    111  portmapper
   100000    2   udp    111  portmapper
   100024    1   udp    638  status
   100024    1   tcp    641  status
   100021    1   udp  32768  nlockmgr
   100021    3   udp  32768  nlockmgr
   100021    4   udp  32768  nlockmgr
   100021    1   tcp  32772  nlockmgr
   100021    3   tcp  32772  nlockmgr
   100021    4   tcp  32772  nlockmgr
   100011    1   udp    903  rquotad
   100011    2   udp    903  rquotad
   100011    1   tcp    906  rquotad
   100011    2   tcp    906  rquotad
   100003    2   udp   2049  nfs
   100003    3   udp   2049  nfs
   100003    4   udp   2049  nfs
   100003    2   tcp   2049  nfs
   100003    3   tcp   2049  nfs
   100003    4   tcp   2049  nfs
   100005    1   udp    925  mountd
   100005    1   tcp    928  mountd
   100005    2   udp    925  mountd
   100005    2   tcp    928  mountd
   100005    3   udp    925  mountd
   100005    3   tcp    928  mountd

The mandatory NFS services were working correctly, as seen in the RPC report above. I didn't have a firewall between these machines or iptables on either of them, so that was not the problem either. I didn't check the NFS client side because it was mounting other NFS servers' exports without any problem. So I focused on the NFS server side. Restarting the nfsd service or re-exporting the NFS file systems with exportfs didn't help. I could see the NFS clients that had mounted the server's exports by looking at the /var/lib/nfs/rmtab file (showmount -a) on the server. But I couldn't access the /proc/fs/nfsd/ directory: it was not mounted on the NFS server. So I mounted it manually with the following mount command.


mount -t nfsd nodev /proc/fs/nfsd

After mounting /proc/fs/nfsd manually, I was able to mount the NFS export from the client side again.

If you look at the /etc/modprobe.conf.dist file, you can see that when the kernel installs the nfsd module it mounts /proc/fs/nfsd, and when it installs the sunrpc module it mounts /var/lib/nfs/rpc_pipefs.


install nfsd /sbin/modprobe --first-time --ignore-install nfsd && { /bin/mount -t nfsd nfsd /proc/fs/nfsd > /dev/null 2>&1 || :; }
install sunrpc /sbin/modprobe --first-time --ignore-install sunrpc && { /bin/mount -t rpc_pipefs sunrpc /var/lib/nfs/rpc_pipefs > /dev/null 2>&1 || :; }
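A quick way to catch this condition on a Linux NFS server is to look for the nfsd filesystem in the mount table. The check below is a sketch: is_mounted is a made-up helper name, and it reads /proc/mounts-style text from standard input.

```shell
#!/bin/sh
# Succeed if a filesystem of type $2 is mounted at path $1.
# Reads /proc/mounts-format text on stdin; is_mounted is a hypothetical helper.
is_mounted() {
    awk -v mp="$1" -v fstype="$2" \
        '$2 == mp && $3 == fstype { found = 1 } END { exit (found ? 0 : 1) }'
}

if [ -r /proc/mounts ]; then
    if is_mounted /proc/fs/nfsd nfsd < /proc/mounts; then
        echo "nfsd filesystem is mounted"
    else
        echo "nfsd filesystem is NOT mounted; try: mount -t nfsd nodev /proc/fs/nfsd"
    fi
fi
```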

Tuesday, April 14, 2009

Process IO Top utility with Solaris DTrace

DTrace is a dynamic tracing framework created by Sun Microsystems for Solaris. DTrace lets any system administrator or developer get an internal view of how the operating system works, find system bottlenecks, examine a live system or process, and get information on system calls, network internals and many other things. For this purpose DTrace has many providers for specific tasks. You can list the DTrace providers and probe names by running the following command:

# dtrace -l

DTrace has also been ported to other operating systems such as FreeBSD and Mac OS X. DTrace uses the D language, which is modeled on a subset of the C programming language. You can find more information about DTrace on Sun's BigAdmin Portal.

The following example demonstrates how easy it is to find the information you want on Solaris with DTrace. I wrote it to see which files are read or written the most, and by which processes.


#!/usr/bin/bash
# process_io_top -  show processes by top read/write KB I/O per file
#      Written by Levent Serinol (lserinol@gmail.com)
#      http://lserinol.blogspot.com
# Apr/14/2009
#
# USAGE: process_io_top [-s interval] [-p pid]
#   
#  -s interval  # gather and show statistics in given interval (seconds)
#  -p pid  # show read/write KB I/O just for given PID
#  -h   # show usage information
#
# eg:
# process_io_top -s 10 
#
#
#
####################################################################################
interval=5
show_pid=0
pid=0

function usage()
{
echo "
USAGE: process_io_top [-s interval] [-p pid]
         -s             # set interval, default is 5 seconds
         -p pid         # pid
  eg, 
         process_io_top -p 630    # io activity of pid 630
         process_io_top -s 10     # refresh output in every 10 seconds";
}

while getopts hp:s: name
do
        case $name in
        p)      show_pid=1; pid=$OPTARG ;;
        s)      interval=$OPTARG ;;
        h|?)    usage;
               exit 1
        esac
done

/usr/sbin/dtrace -Cws <( cat <<EOF

 inline int PID  = $pid;
 inline int SHOW_PID    = $show_pid;
 inline int INTERVAL    = $interval;


#pragma D option quiet
#pragma D option aggsortrev


dtrace:::BEGIN 
{
  secs = INTERVAL;
  printf("Please wait....\n");
}

io:::start 
/ (SHOW_PID == 1) && ( pid == PID) /
{
 self->rw=args[0]->b_flags & B_READ ? "R" : "W";
 @files[pid,execname,self->rw,args[2]->fi_pathname] = sum (args[0]->b_bcount);
 @total_blks[self->rw]=count();
 @total_bytes[self->rw]=sum (args[0]->b_bcount);
 self->rw=0;
}

io:::start 
/ SHOW_PID == 0 /
/* SHOW_PID == 0 && args[2]->fi_pathname != "<none>" */
{
 self->rw=args[0]->b_flags & B_READ ? "R" : "W";
 @files[pid,execname,self->rw,args[2]->fi_pathname] = sum (args[0]->b_bcount);
 @total_blks[self->rw]=count();
 @total_bytes[self->rw]=sum (args[0]->b_bcount);
 self->rw=0;
}

profile:::tick-1s
{
        secs--;
}


profile:::tick-1s
/secs == 0/
{

 trunc(@files,30);
 normalize(@files,1024);
 system("/usr/bin/clear"); 
 printf("%Y ",walltimestamp);
 printa("%s %@11d blocks, ",@total_blks);
 printa("%s %@11d bytes, ",@total_bytes);
 printf("\n%6s %-12s %3s %8s %3s\n", "PID", "CMD","R/W", "KB", "FILE");
 printa("%6d %-12.12s %1s %@10d %s\n",@files);
 secs = INTERVAL;

}
dtrace:::END
{
        trunc(@files);
}
EOF
)

You can download it here: process_io_top. This script briefly shows how to pass arguments from the shell to DTrace, how to use aggregations and sort them in reverse order with the "aggsortrev" pragma, and how to define a probe (io:::start) more than once with different predicates. The following is sample output from the process_io_top script.

Monday, March 30, 2009

speeding up your nginx server with memcached

Nginx is a high-performance web and proxy (web and mail proxy) server. Generally, nginx is used as a front-end proxy server to the Apache web server. Nginx is known to be slow at serving dynamic pages like PHP. Normally, nginx uses the FastCGI method, which is slow. Therefore, it's a good idea to run Apache as a back-end server behind nginx and serve dynamic PHP pages from Apache. If your website's PHP pages are suitable for caching for a certain time, you can use the nginx proxy module and its proxy_store directive to automatically store the output of Apache-served PHP pages in nginx as HTML. Here, I'll show how to use nginx's memcached module and Danga Interactive's memcached daemon to store your content in memory and serve it from there. Serving content from memory is faster than serving it from disk. memcached's default listening port is 11211. You can find instructions on how to compile and run memcached on Danga's website.

Now, let's look at our nginx configuration for the memcached setup. Let's suppose we have two Apache web servers running on two different physical servers, with IP addresses 192.168.2.3 and 192.168.2.4. We'll use them as back-end servers. We have an nginx server in front of them at IP address 192.168.2.1. First of all, we have to tell nginx about those back-end servers. We use the nginx upstream module for this purpose. As you can see below, we defined an upstream named "backend" that lists the IP addresses of our two Apache web servers. The upstream module also lets you give a weight to each server. Our first server's hardware is better than the second one's, so we gave the first one a weight of 2. This configuration should be in the http section of the nginx configuration file (nginx.conf).


upstream backend {
     server 192.168.2.3 weight=2;
     server 192.168.2.4;
}

We have created our upstream configuration. Now we have to tell nginx which files will be served by the memcached module. I decided to serve only some image types from memcached. The following configuration should be in the server section of the nginx configuration. The "location" directive tells nginx to handle every URL that ends with one of the given extensions (.jpg, .png or .gif). As a first step, nginx looks up the URL in memcached. Memcached is a simple in-memory key-value store in which every entry has a unique key. In our case the key is our URL. If nginx finds the key (URL) in memcached, it gets the contents stored under that key and sends them back to the client. This operation runs completely from memory. If the key (URL) is not found, the lookup falls back to a 404 and, as you can see, we catch the 404 error and send the request to our back-end Apache servers. Nginx then sends Apache's response to the client.


location ~* \.(jpg|png|gif)$ {
access_log   off;
expires      max;
add_header   Last-Modified "Thu, 26 Mar 2000 17:35:45 GMT";
set $memcached_key $uri;
memcached_pass     127.0.0.1:11211;
error_page         404 = /fetch;
}

location /fetch {
internal;
access_log   off;
expires      max;
add_header   Last-Modified "Thu, 26 Mar 2000 17:35:45 GMT";
proxy_pass http://backend;
break;
}

Of course, there is a drawback here. Nginx's memcached module never puts anything into memcached automatically. You have to store your content in it yourself, for example with a script. In our example, if we forget to store a file in memcached, it will always be served by the back-end Apache servers. Here is a simple PHP script that finds the given image types and loads them into memcached for nginx.


<?php

function rscandir($base='', &$data=array()) {
$array = array_diff(scandir($base), array('.', '..'));

foreach($array as $value) :
  if (is_dir($base.$value)) {
    $data = rscandir($base.$value.'/', $data);

  }
  elseif (is_file($base.$value)) {
   $rest = substr($value, -4);
   if ((!strcmp($rest,'.jpg')) || (!strcmp($rest,'.png'))
                                || (!strcmp($rest,'.gif')) ){
         $data[] = $base.$value;
   }
 }

endforeach;
return $data;
}

// Trailing slash is required: rscandir() joins $base and the entry name directly.
$mylist=rscandir("/var/www/mysite/");

$srch = array('/var/www/mysite');
$newval = array('');

$memcache_obj = memcache_connect("192.168.2.1", 11211);

foreach ($mylist as $key => $val) {
  $url=str_replace($srch,$newval,$val);
  echo "$key => $val -> ".filesize($val)."\n";
  $value = file_get_contents($val);
  memcache_add($memcache_obj, $url, $value, false, 0);
}
?>

You need to run this script once; it will find all files of the given image types and store them in memcached. I ran it on one of the Apache back-end servers. The memcached instance it stores the data in is located on the nginx server, whose IP address is 192.168.2.1.
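The important detail is that the key stored in memcached must be exactly the URI that nginx will look up, meaning the file path with the document root stripped. Here is a small sketch of that mapping. uri_key is a made-up helper name and the image path is only an example.

```shell
#!/bin/sh
# Derive the memcached key (the request URI) from a file path by stripping
# the document root, mirroring the str_replace() call in the PHP script.
# uri_key is a hypothetical helper name.
uri_key() {
    docroot=${1%/}              # drop a trailing slash, if any
    path=$2
    case $path in
        "$docroot"/*) printf '%s\n' "${path#"$docroot"}" ;;
        *) return 1 ;;          # file is outside the document root
    esac
}

# Example path (illustrative only):
uri_key /var/www/mysite /var/www/mysite/images/logo.png
```

If the key and the requested URI differ, even by a missing leading slash, nginx gets a miss and every request falls through to Apache.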

Sunday, March 22, 2009

Hammer Filesystem

On Feb 17 2009, release 2.2 of DragonFly BSD came out, which includes a new filesystem called hammer. Hammer is intended to replace ffs on DragonFly systems. It has many advanced features compared to ffs, such as:

instant crash recovery
large file systems and multivolume support
data integrity checking
history and snapshots
reblocking (on the fly defragmentation)
mirroring

Let's start by creating a new hammer filesystem on a new disk device with hammer's newfs_hammer command. As you can see below, we used the '-L' parameter. Every hammer filesystem requires a label, so this parameter is mandatory.

Let's mount it under the /mnt directory with the "noatime" option. The mount_hammer command must be used for this operation.

Looking at the df output, you can see our new hammer filesystem labeled "mydata". Now, let's see where hammer differs from ffs.

History feature

The history metadata of a hammer filesystem is written every 30 seconds. You can use the hammer utility to access the history data of your files. Usage is simple: call the hammer utility with the history command and the name of the file whose history you want to see.

# hammer history /mnt/test.txt

You may have noticed that I used the "sync" command. I didn't want to wait for the kernel's periodic sync (30 seconds), so I forced it with sync to get the history result faster.

Snapshot feature

Now let's see hammer's snapshot feature. I created a directory called "/snapshots" where I want to keep all of my hammer filesystem snapshots. We'll use the hammer utility to take snapshots.

# hammer snapshot /mnt /snapshots/mnt_snapshot1

I created a file on the /mnt filesystem and took a snapshot. After a successful snapshot, I deleted the original file from /mnt and accessed it through the snapshot I took before deleting the file (/snapshots/mnt_snapshot1/test.txt).

Undo Feature

You can view and roll back every change between the snapshots you have taken and the original files with hammer's undo command. Use the "-d" option to get diff output showing, line by line, every change to the given file between each snapshot and the original. The "-a" option gives you the whole change history of the given file. Here, I created a test file called "myfile.txt" on my new hammer filesystem (/mnt) and then took a snapshot. After taking the snapshot, I added one more line to my test file and used "undo -d /mnt/myfile.txt" to get diff output and see what changes had been made to the file between snapshots.

Multivolume Feature

As I mentioned above, you can combine multiple volumes into one big disk volume with hammer. Now we'll create a single hammer volume from two disks (ad1 and ad3).
You just call the newfs_hammer and mount_hammer commands with two or more disk device names instead of one. The commands are straightforward. First, we need to create a new hammer filesystem on the disks.

# newfs_hammer -L big /dev/ad1 /dev/ad3

I used two disks of 8GB each. After creating the new hammer filesystem on the disks, we can mount them as one big volume. (I know 8GB is not big :) ).

# mount_hammer /dev/ad1 /dev/ad3 /mnt

You can see that the /mnt mount point now holds a 16GB disk, the total of the two 8GB disks.

Mirror Feature

Hammer uses pseudo file systems (PFS) to duplicate inode numbers to slaves. Therefore, the mount_null command is required for this operation. The duplication is triggered by running the hammer utility with the mirror-copy command. First we need to create two PFSs, one for the master and one for the slave. I created them as /mymaster and /myslave. The master must be created with the pfs-master command and the slave with pfs-slave, both through the hammer utility. One thing to take into account here is that the slave PFS must use the master's shared-uuid.

After creating the mymaster and myslave PFSs, we have to run mirror-copy on these PFS links to start the initial mirroring operation, because mount_null can't access them without this copy operation.

After running the first mirror-copy on our PFS links, we can mount our original hammer volumes wherever we want, associate them with the PFS links we created and sync (mirror-copy) our mounted hammer volumes.

You can use the mount_null mount points for your filesystem operations. You can see that the slave volume (myslave) is read-only and you don't have write access to it. You can only use the master volume (mymaster) for writes. If you're looking for an alternative to ffs on BSD systems, I suggest you evaluate both ZFS on FreeBSD and Hammer on DragonFly BSD.

Monday, March 16, 2009

xfs filesystem fragmentation

Many filesystems use various techniques to avoid fragmentation, and XFS is one of them. But after long use, you may still run into fragmentation problems. Here, I will show you how to find the fragmentation ratio of your xfs filesystem and how to reorganize it with the xfs_fsr utility to reduce fragmentation on a live filesystem. First, let's check our xfs filesystem's fragmentation ratio with the xfs_db (xfs debugger) utility. As shown below, we used two options. The first one is "-r", which runs the debugger in read-only mode. We do not want to accidentally do anything bad to our filesystem. This option is also useful when debugging a mounted filesystem. The second option is "-c", which takes its parameter as an xfs_db command and runs it. Without "-c", xfs_db starts its interactive shell, where you can run xfs_db commands.

We see that our xfs filesystem on the /dev/sdb device has 0.14% fragmentation. This is very low, so we don't need to defragment it. But here is how to do it with xfs_fsr. This utility reorganizes a mounted xfs filesystem file by file. As you can see in the sample below, it is run with the "-t" parameter, which tells xfs_fsr to run for at most the given number of seconds and then quit.
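If you want to check the ratio from a script, for example to decide whether a defragmentation run is worth scheduling, you can pull the percentage out of the frag report. This is a sketch: frag_percent is a made-up helper name, and the sample line only imitates the report format with illustrative numbers.

```shell
#!/bin/sh
# Extract the fragmentation percentage from "xfs_db -r -c frag" output.
# frag_percent is a hypothetical helper; the sample line below is illustrative.
frag_percent() {
    sed -n 's/.*fragmentation factor \([0-9.]*\)%.*/\1/p'
}

# On a real system (as root):
#   xfs_db -r -c frag /dev/sdb | frag_percent
# Demo on a canned line in the same format:
echo 'actual 51270, ideal 51198, fragmentation factor 0.14%' | frag_percent
```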

There is one more option you can use with an xfs filesystem to avoid future fragmentation. It's called "allocsize", and it preallocates disk space before writing to a file. You have to set this preallocation size when mounting your xfs filesystem. It will help prevent fragmentation if you choose a reasonable size based on your average file size.

mount -t xfs -o noatime,allocsize=8M /dev/sdb /mydata

You can also set this option in your /etc/fstab.

Wednesday, March 11, 2009

stopping and resuming a process run in Linux

You can stop (pause) a running process in Linux by sending it the STOP signal with the kill command ("kill -STOP"). You can resume the same stopped process by sending it the CONT signal ("kill -CONT"). Assume that the process we want to stop has pid 8674. The following command will stop it.

# kill -STOP 8674

Now the process is stopped. To resume it, you have to send the "-CONT" signal.

# kill -CONT 8674

In Solaris, you can do the same task with the pstop and prun utilities, or you can use the kill command with the signals "SIGSTOP" and "SIGCONT".
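The same stop/continue cycle can be driven from a script. The following Python sketch sends SIGSTOP and SIGCONT with os.kill and uses waitpid to confirm each state change. It only works on a direct child process (waitpid cannot observe unrelated pids), so the demo spawns its own sleeping child; the helper names are hypothetical.

```python
import os
import signal
import subprocess
import sys


def pause_and_resume(pid):
    """Stop a child with SIGSTOP, then resume it with SIGCONT.

    Returns (was_stopped, was_continued) as reported by waitpid().
    """
    os.kill(pid, signal.SIGSTOP)                # like: kill -STOP <pid>
    _, status = os.waitpid(pid, os.WUNTRACED)   # wait until it has stopped
    was_stopped = os.WIFSTOPPED(status)

    os.kill(pid, signal.SIGCONT)                # like: kill -CONT <pid>
    _, status = os.waitpid(pid, os.WCONTINUED)  # wait until it runs again
    was_continued = os.WIFCONTINUED(status)
    return was_stopped, was_continued


def demo():
    """Spawn a sleeping child, pause and resume it, then clean up."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        return pause_and_resume(child.pid)
    finally:
        child.kill()
        child.wait()
```

For a process that is not your child, sending the signals works the same way, but you would have to check its state by other means, for example with ps.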

Saturday, March 7, 2009

measuring your network throughput on Solaris

You can use the nicstat utility to measure your network device utilization and throughput. The utility itself uses Solaris kstat for network device statistics. Statistics include network reads and writes in kilobytes, packets, and device utilization. You can compile nicstat with gcc as shown below.

# gcc nicstat.c -o nicstat -lkstat -lgen -lsocket

The following image shows sample output from the nicstat utility.

getting network interface statistics on Solaris

There are many ways to get network interface status and statistics on Solaris. I'll show some of them. The first one is the well-known netstat command. netstat's "-i" parameter gives you statistics for devices used for IP traffic. Statistics include total input/output packets, errors, and collisions for each network device.

The next method uses the Solaris ndd command. First, we have to find our device name. This can be done with the ifconfig command.

My network device name is "e1000g0", as seen in the sample output above. Now, it's time to ask ndd which parameters this device supports. We do this by calling ndd with our device name and the "\?" parameter (ndd /dev/e1000g0 \?), which displays every supported parameter.

ndd lets you change some interface settings. They're marked as "read and write". The parameter names clearly show what they are for. As you can see in the sample output below, ndd can give us link speed, link status (cable plugged/unplugged), duplex mode, and autonegotiation status. The following sample output shows that my network interface's link status is 1, which means the network cable is plugged in, the network speed is 1000 Mbit, and autonegotiation is on.

Another command is dladm. dladm can give us the above results with one command and in an easily readable format. With the "-s" parameter, it can also give network interface statistics like netstat, except for collisions.

Saturday, February 28, 2009

iptables and connection limit

Linux iptables uses a module called ip_conntrack. The name of the module says exactly what it does: it tracks every connection attempt (every connection state) in a hash table. Every hash entry (bucket) contains a linked list, and each entry in that list represents a connection. You can see your machine's current connections through the ip_conntrack module's proc entry.

Of course, this table has a default size limit. You are likely to reach this limit on a machine with a high connection rate. The first clue will be the following message in your system logs (/var/log/messages).

ip_conntrack: table full, dropping packet

In this case, your machine (iptables) will drop every new incoming connection (udp or tcp) because there is no free entry left to store it. Fortunately, there is a configuration option to raise this connection limit. It's called ip_conntrack_max, and it lives in the /proc filesystem (/proc/sys/net/ipv4/ip_conntrack_max).

You can easily increase this number by simply echoing a new number into this file.

The conntrack table is divided into linked lists by hash value, as I mentioned above. The calculation for the list length is CONNTRACK_MAX/HASHSIZE. The number of hash entries is set by the hashsize parameter, located in /sys/module/ip_conntrack/parameters/hashsize. In our case, its default value is 4096. Remember that before we changed conntrack_max, its default value was 32768. The default hashsize of 4096 comes from 32768/8. You can tweak this number on your system to increase the performance of iptables. Don't forget to put your settings in /etc/sysctl.conf, because overriding these /proc/ and /sys/ values is only a temporary change. They will be gone when you reboot your machine. The hashsize parameter in /sys/ is called buckets in the sysctl configuration. Also, check the other conntrack parameters via "sysctl net.ipv4.netfilter". Setting some of them, such as the wait times (timeouts), can reduce the number of entries in your conntrack table.
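The arithmetic above is simple enough to sketch in a few lines of Python. This is only an illustration of the 32768/8 = 4096 relationship described in this post; the function names are made up, and the divisor of 8 is taken from the example rather than from any kernel source.

```python
def default_hashsize(conntrack_max, ratio=8):
    """Hash table size implied by a given conntrack_max (e.g. 32768 / 8)."""
    return conntrack_max // ratio


def average_chain_length(conntrack_max, hashsize):
    """Average linked-list length per hash entry when the table is full."""
    return conntrack_max / hashsize
```

For example, doubling conntrack_max to 65536 while leaving hashsize at 4096 doubles the average list length from 8 to 16, which is why it can pay off to raise hashsize together with the connection limit.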

Saturday, February 21, 2009

Solaris Interrupts

In this post, I'll show you how to access the irq table on a Solaris system and how to get statistics about the number of interrupts produced. We'll use two commands: intrstat and mdb.
The intrstat command shows how many interrupts are produced by each device and how many of them are handled by each cpu on the system. The "%tim" column shows the percentage of time spent by that driver's interrupt handler on the processing cpu. You can pass a parameter to intrstat to tell it to gather information at a given interval (in seconds). The following is sample output from intrstat gathering statistics at a 5-second interval.

Now that we have interrupt statistics, we can examine the irq table on the Solaris system. For this job, we'll use mdb (the Solaris modular debugger). The irq table is accessed through the kernel, so we'll pass the "-k" parameter to mdb, which tells it to enter kernel debugging mode. mdb will use /dev/ksyms and /dev/kmem to access the Solaris kernel. After running "mdb -k", type the "::interrupts" command to access the irq table. Here is sample output:

This table shows which driver(s) (ISR) use which irq. The Share column shows how many devices share the same irq. IPL gives the irq priority. The Type column shows the irq type: Fixed (legacy), MSI, or IPI (interprocessor interrupt, used for xcalls, i.e. cross calls).

disk error statistics with Solaris

The Solaris iostat command normally shows you disk activity statistics and cpu utilization. But it has other useful options that show disk error statistics: "-e" and "-E". The "-e" option shows summary disk error statistics for soft, hard, and transport errors. The "-E" option shows all errors on the disks along with information about the devices, such as vendor, model, size, and media errors.

Tuesday, February 17, 2009

Stopping and Running process on the fly with Solaris/OpenSolaris

Solaris/OpenSolaris has two utilities called pstop and prun, which can stop a running process and run it again. Both utilities take a pid or pid/lwp (light weight process, i.e. thread) as a parameter. As the name says, pstop stops a running process when given its pid. Giving the pid of a stopped process to prun makes the process runnable again.

Booting OpenSolaris in verbose mode

When OpenSolaris boots, it does not show kernel messages on the console by default. You can see them by passing an argument to the kernel at boot. OpenSolaris uses the grub boot loader. In the grub menu, press "e" to enter edit mode.

Use the cursor keys to move to the line containing the kernel parameters.

Press "e" again to edit that line. Then add the "-v" parameter to the end of the kernel parameters.

After adding the "-v" verbose parameter, press "enter" and then "b" to boot OpenSolaris with verbose mode enabled.

If you need this option to be permanent, you need to modify the grub config file in OpenSolaris. It's located in /boot/grub/menu.lst. If you use a zfs root, then it's located in /rpool/boot/grub/menu.lst. On a Sun Sparc machine with Solaris, you can do the same by pressing the "STOP-A" combination when the openboot procedure begins and typing "boot -v".

Friday, February 13, 2009

IRQ affinity in Linux

Some hardware components, like ethernet cards and disk controllers, produce interrupts when they need attention from the cpu, for example when an ethernet card receives a packet from the network. You can examine how your machine's interrupts are spread across cpus by looking at the /proc/interrupts proc entry.

# cat /proc/interrupts

This output shows which devices are on which irq and how many interrupts each cpu has processed for each device. In normal cases, you will not run into problems and have no need to change irq handling for any cpu. But in some cases, for example if you are running a linux box as a firewall with high incoming or outgoing traffic, this can cause problems. Suppose that you have 2 ethernet cards on your firewall and both are handling many packets. You may see high cpu usage on one of your cpus, which can be caused by the many interrupts produced by your network cards. You can check this by looking at /proc/interrupts to see whether that cpu is handling the interrupts of both cards. If this is the case, you can look for the most idle cpus on your system and assign each ethernet card's irq to be served by a separate cpu. But beware that you can only do this on a system with IO-APIC enabled device drivers. You can check whether your device supports IO-APIC by looking at /proc/interrupts.
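If you want to do this check from a script instead of reading the table by eye, a small parser is enough. The Python sketch below assumes the usual /proc/interrupts layout: a header row naming the cpus, then one row per irq with a count per cpu followed by the controller type and device name. The sample text is made up for illustration (it is not output from a real machine), and the function names are hypothetical.

```python
# Hypothetical sample in the usual /proc/interrupts layout.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 25:     900000         10          5          7   IO-APIC-fasteoi   eth0
 26:     850000         12          3          9   IO-APIC-fasteoi   eth1
ERR:          0
"""


def parse_interrupts(text):
    """Return {irq_label: {"counts": [per-cpu ints], "info": str}}."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())        # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:     # leading numeric columns only
            if not field.isdigit():
                break
            counts.append(int(field))
        info = " ".join(fields[len(counts):])
        table[label.strip()] = {"counts": counts, "info": info}
    return table


def busiest_cpu(counts):
    """Index of the cpu that handled the most interrupts for one irq."""
    return max(range(len(counts)), key=counts.__getitem__)
```

On a real system you would pass in the content of /proc/interrupts; in the sample, both eth0 and eth1 are served almost entirely by cpu0, which is exactly the situation described above.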

For ethernet cards, there is an implementation called NAPI, which can reduce interrupt usage for incoming network traffic.

You can see which irq is served by which cpu or cpus by looking in the /proc/irq directory. The directory layout is very simple: every irq used in the system is represented by a directory named after its irq number. Every directory contains a file called smp_affinity, where you can set the cpu settings. The file content shows which cpu is currently serving this irq. The value is in hex format. As you can see in the example figure, eth0 is on irq 25 and eth1 is on irq 26. Let's say we want irq 25 to be served only by cpu3. First of all, we have to calculate the hex value for cpu3. The calculation is shown below.

Calculation

           Binary       Hex
  CPU 0    0001         1
  CPU 1    0010         2
  CPU 2    0100         4
+ CPU 3    1000         8
--------------------------
  all      1111         f

The calculation is shown for a 4-cpu system for simplicity. Normally, the value for all cpus on a system is represented by an 8-digit hex value. As you can see in the binary format, every bit represents a cpu. We see that the mask for cpu3 is binary 1000, which is 8 in hex. Then we write it into the smp_affinity file for irq 25 as shown below.

# echo 8 > /proc/irq/25/smp_affinity

You can check the setting by looking at the file content.

# cat /proc/irq/25/smp_affinity
00000008

As another example, let's say we want irq 25 to be handled by cpu0 and cpu1.

  CPU 0    0001         1
+ CPU 1    0010         2
--------------------------
           0011         3

Setting the bits for cpu0 and cpu1 gives us the value 3. Likewise, if we need all cpus to handle our device's irq, we set every bit in our calculation and write it into smp_affinity as a hex value, which is f. There is also an implementation in Linux called irqbalance, a daemon that distributes interrupts automatically across the cpus in the system. But in some cases this gives bad performance, and you need to stop the service and set the affinities manually, as I described above, for higher performance. Also, the irqbalance configuration lets you exclude given irqs from balancing or restrict which cpus it uses. In this case, you can configure it not to touch your manually configured irqs and your preferred cpus, and let it automatically balance the rest of the irqs and cpus for you.
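The mask calculations in this post can also be done with a few lines of Python rather than by hand. This is just a sketch of the bit arithmetic; the function names are made up, and the comma handling is there because, as far as I know, larger systems print the mask in comma-separated groups.

```python
def mask_for_cpus(cpus):
    """Build the affinity bitmask: bit N is set when cpu N may serve the irq."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def format_mask(mask, digits=8):
    """Render the mask as zero-padded hex, like the smp_affinity file shows."""
    return format(mask, "0{}x".format(digits))


def cpus_in_mask(text):
    """Decode a hex mask string back into a sorted list of cpu numbers."""
    mask = int(text.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]
```

With these helpers, format_mask(mask_for_cpus([3])) gives the string to compare against the content of smp_affinity, and cpus_in_mask turns the file content back into cpu numbers.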