An assortment of indigestible things

Debug like a sysadmin: using strace and ltrace

Before I start, I should say this: I have nothing but admiration for good developers, and nothing in this article is meant to say otherwise. I started out as a C/C++ developer after I graduated, and after 15 months I figured out that I’m not very good at it. So I leave the code to the experts, and merely advise them as best I can. That is the spirit in which this article is written.

One more thing: this is all about Unix/Linux-based development (hereinafter referred to as Unix, irrespective of whether the capitalisation is right!). There might be equivalent techniques for Windows, but to be honest I really don’t care 🙂

Top-down vs bottom-up

We all approach what we do from the perspective of our own discipline. When I investigate a problem, I look at it from a systems perspective: what is the kernel doing? what files are being accessed? what does the network traffic look like? In my experience, developers tend to look at it from the perspective of their own code: what line does this happen on? what’s in these variables? what is the call stack at this point?

Both are perfectly valid approaches, and a blend of the two can often provide useful insights.

A simple worked example

Let’s say you’ve been given a bit of python written by someone far less talented than yourself. You run it, and get the message

Error: unable to open file

You search for that message in the code and find that it occurs no less than twelve times. You therefore have two questions:

  • What file? and
  • why can’t you open it?

Without delving in the code at all, you can find out both things really quickly. It also doesn’t matter whether your code is written in perl, Java, python, C# or Db. This is because opening a file on a Unix system involves the use of a system call: a primitive that serves as the interface between the operating system and what kernel programmers call ‘userland’ (i.e. your code). All system calls are in section 2 of the Unix manual, of which more shortly.

So back to your errant file. You’re going to use a command called strace to show all the system calls as they’re made (on Solaris, the equivalent is called truss).

strace -f -o /tmp/woo python mycode.py

Open /tmp/woo in your favourite editor and scroll to the bottom. What you’ll see is something like this:

4240  execve("/usr/bin/python", ["python", "mycode.py"], [/* 41 vars */]) = 0
4240  brk(0)                            = 0x25cd000
4240  access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
4240  mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f59f5e57000
4240  access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
4240  open("/etc/ld.so.cache", O_RDONLY) = 3
4240  fstat(3, {st_mode=S_IFREG|0644, st_size=172764, ...}) = 0
4240  mmap(NULL, 172764, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f59f5e2c000
...

Scary stuff! What you’re seeing is a list of system calls that were made by the python interpreter. The first number in each line is the PID of the process: this may change through the file, as the -f option to strace tells it to follow child processes.

If you look closely, each line is a C function call and its return value after the ‘=’ sign. If you’re interested, you can look up these functions in section 2 of the Unix manual (e.g. ‘man -s 2 access’); it’s important to specify the section number, as some system calls have namesakes in other sections. The manpage for each system call will give you some idea of what the arguments mean, and what the return value might be.

It’s likely that your file will be hundreds of lines long, so let’s scroll to the bottom and see what we can see:

4240  lseek(3, 97, SEEK_SET)            = 97
4240  read(3, "", 4096)                 = 0
4240  close(3)                          = 0
4240  munmap(0x7f59f5e56000, 4096)      = 0
4240  open("/usr/local/share/moink.dat", O_RDONLY) = -1 ENOENT (No such file or directory)
4240  fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
4240  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f59f5e56000
4240  write(1, "Error: unable to open file\n", 27) = 27
4240  rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f59f5a2ac60}, {0x431400, [], SA_RESTORER, 0x7f59f5a2ac60}, 8) = 0
4240  exit_group(0)                     = ?

Ah-hah! We can see our error message near the bottom of the trace, but a few lines above we can see the process attempting (and failing) to open /usr/local/share/moink.dat. We could look at the open(2) manpage, but it’s pretty obvious what’s happening: the process tried to open the file read-only, but it got an ENOENT, which is helpfully translated for us as ‘No such file or directory’.

If we look at the code, we can see how

FILE = open("/usr/local/share/moink.dat", "r")

is actually implemented by the system call

open("/usr/local/share/moink.dat", O_RDONLY)

The great thing about this technique is that you don’t even need the source code. If you just have a binary blob that someone’s given you, or the code is in INTERCAL and makes your eyes go funny, you can still use strace to see what’s going on under the bonnet, because any Unix program must use system calls to do anything useful.

It’s also possible to look at a program’s activity at a slightly higher level. Most Unix programs don’t make many system calls directly; instead, they call functions in dynamic libraries that then go on to do the donkey work. Using ltrace instead of strace, the relevant lines of the trace file are:


4483 fopen64("/usr/local/share/moink.dat", "r") = 0
4483 __errno_location() = 0x7fde4c52f6a0
4483 strerror(2) = "No such file or directory"
4483 memcpy(0x0277ba94, "No such file or directory", 25) = 0x0277ba94
4483 fwrite("Error: unable to open file", 1, 26, 0x7fde4b2968e0) = 26

The C library has even copied the error string to a place where the python interpreter can get to it… but sadly our crappy python code ignored all that and just gave a generic message (which is what brought us here in the first place). So this python code

FILE = open("/usr/local/share/moink.dat", "r")

caused this C library call to be made

fopen64("/usr/local/share/moink.dat", "r") = 0

which then made this system call

open("/usr/local/share/moink.dat", O_RDONLY)

(if you care, C library functions are in section 3 of the Unix manual; you’ll see from the fopen(3) manpage that NULL is returned on error, which is effectively zero)

The two tracer utilities have lots of options and are an invaluable part of my troubleshooting toolkit when I ask the question ‘what’s really going on here?’ There are lots of others, which I might cover in a future post if you’re really, really lucky.

Previous

Disk space on the cheap: my experiences with alternative storage vendors

Next

Should I let my iPhone send debugging information to Apple?

1 Comment

  1. Pevik

    {s,l}trace are great for debugging C code. For python you can also use python -mtrace –trace script.py :-).

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Powered by WordPress & Theme by Anders Norén