March 03, 2004
Brief Intro to Pointers
by peterb

I've been toying with the idea of writing an article about pointers and why we use them. When I was first learning about them, I understood the hows just fine, but couldn't ever find a lucid explanation of why one should use pointers. With an air of superiority, however, psu instructed me that the cool kids of today with their crazy dances don't condescend to worry about anything as low-level as mere pointers, and I'd merely be making myself look old and unfashionable.
But I haven't worried about looking unfashionable yet, so why start now?
What's a Pointer?
Lots of new programmers think they hate pointers. Don't hate pointers. Love pointers. Pointers are the difference between having someone's address written down in a notebook and carrying their house around with you wherever you go.
Let's limit the discussion to pointers in C and C++, since those are the implementations most of us are familiar with.
A pointer is a number (the object oriented guys hate it when I say that) that describes an address somewhere else in memory. We say that a pointer "points to" the address it contains. Strictly speaking, that definition also encompasses any variable on the left-hand side of an assignment (when you say "a = 5;", you're actually saying "put the value '5' in the memory pointed to by 'a'"), but in common usage, when we talk about pointers, we're talking about an additional layer of indirection over that.
This is the point where I'm supposed to include a picture of a bunch of little cells with arrows pointing from some to others, and that is somehow supposed to make you magically understand pointers. The problem is, it won't: it'll just annoy you when you actually go to use them and there are no friendly arrows or boxes and all you have is a command prompt where you tried to run your program and it promptly responded Segmentation fault (core dumped).
In C, the two operators that are most commonly associated with pointers are "&" (the "address of" operator) and "*" (the "dereference" operator, which should not be confused with the multiplication operator that uses the same glyph). Taking the address of a variable gives you its location in memory. Dereferencing a pointer lets you read or change the contents of whatever is stored in the address that pointer points to. Dereferencing a NULL pointer, or a pointer that does not point to valid storage, will usually result in your program crashing, typically with what we call a segmentation fault. Instead of giving you diagrams, I'm going to give you code. The code will be confusing too, but hopefully you can actually run it and experiment with it yourself to make things work.
Here's a tiny function to try to demonstrate these operators in use, in an extremely non-useful manner (note: don't try to cut and paste these examples; they are line-wrapped and abbreviated to make them more readable on the web. If you want to try to compile these yourself, click the link after each example.)
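It looks something like this -- a minimal sketch, where the names "a" and "a_ptr" are the ones used in the walkthrough below and the rest of the scaffolding is just one way to set it up:

#include <stdio.h>

int main(void)
{
    int a = 5;          /* an ordinary integer, living on the stack */
    int *a_ptr;         /* a pointer to an integer */

    a_ptr = &a;         /* take the address of a and store it in a_ptr */

    printf("a lives at %p and contains %d\n", (void *) &a, a);
    printf("a_ptr lives at %p and contains %p\n",
           (void *) &a_ptr, (void *) a_ptr);
    printf("dereferencing a_ptr (*a_ptr) gives %d\n", *a_ptr);

    return 0;
}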
When I run that, I get this:
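(The exact addresses will be different on your machine, and from run to run; these are the ones discussed below.)

a lives at 0xbfbff76c and contains 5
a_ptr lives at 0xbfbff768 and contains 0xbfbff76c
dereferencing a_ptr (*a_ptr) gives 5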
The ugly hex numbers in this case are the addresses of memory locations within my process's stack. "a" happened to be at 0xbfbff76c. "a_ptr" was at 0xbfbff768. After we set a_ptr = &a, a_ptr (i.e., the memory on the stack at location 0xbfbff768) contained the integer 0xbfbff76c. When we dereferenced a_ptr -- that would be "*a_ptr" -- our program fetched the value contained at 0xbfbff768, followed it to 0xbfbff76c, and told us what the contents of that memory were -- in this case, "5".
If you're confused, don't worry -- that's par for the course. I'm half-convinced that the reason people prefer to talk about objects instead of pointers is that the C syntax for pointers is so horrific. It really is awful. It's especially fun how the "*" glyph means one of several different things depending on whether it appears in a declaration or in a statement context. You're not hallucinating: that really does suck, and don't believe anyone who tells you otherwise. You will, in time, get used to it, but that doesn't mean we can't complain about it in the meantime.
You can also have pointers to pointers, and so on. I won't talk about those in this article.
But Why?
There are basically three reasons to use pointers:
- Data structure flexibility: allocation of memory at run-time rather than compile-time, and the ability to construct more elaborate data structures (linked lists, trees, hash tables, etc).
- Letting a function modify memory pointed to by its arguments.
- Passing data by reference instead of by value. (This is really the general form of the previous statement.)
There are actually more reasons than these, but they're the three I consider to be most important.
I'm not going to talk too much about data structures here, because that's a subject that is deep enough to warrant its own article (or book, or high-level University class). We can just summarize by saying that there are certain data structures that would be so difficult or expensive to implement without pointers that it is reasonable to describe them as impossible without pointers. It is no exaggeration to say that deciding what data structures to use is probably the single most significant choice you will make on any given program. Likewise, object oriented programming, under the hood, is all about doing clever things, systematically, with pointers (when we talk of a language with OO support, we mean "the language does the clever things with the pointers for me, so that I don't have to go on a killing spree when I put an ampersand in the wrong place.")
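To make that concrete, here's the classic example: a singly-linked list, where every element carries a pointer to the next one. This is only a sketch -- the names are made up for illustration:

#include <stdlib.h>

/* One element of a singly-linked list: some data, plus a pointer to
   the next element.  NULL in "next" marks the end of the list. */
struct node {
    int value;
    struct node *next;
};

/* Allocate a new node at run-time and push it onto the front of the
   list, returning the new head. */
struct node *push_front(struct node *head, int value)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return head;    /* allocation failed; leave the list untouched */
    n->value = value;
    n->next = head;
    return n;
}

The list can grow to whatever size the data demands while the program is running, which is exactly the kind of thing a fixed-size array declared at compile-time can't give you.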
Let's talk about passing data by reference instead of by value. There are two implications of passing a pointer around rather than the value itself. The first is a data access issue: if you pass a function a pointer to a variable, the function will be able to change the value stored in that variable, whereas if you pass the value, it can't. Here's an example.
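It goes roughly like this (a sketch; counter starts at 9, which is the value you'll see in the run below):

#include <stdio.h>

void val_incr(int counter) {
    printf("val_incr: start, counter is %d\n", counter);
    counter++;
    printf("val_incr: exit, counter is %d\n", counter);
}

void ref_incr(int *counter) {
    printf("ref_incr: start, counter is %d\n", *counter);
    (*counter)++;
    printf("ref_incr: exit, counter is %d\n", *counter);
}

int main(void) {
    int counter = 9;

    printf("main: counter starts out as %d\n", counter);
    val_incr(counter);       /* pass the value stored in counter */
    printf("main: after val_incr, counter is %d\n", counter);
    ref_incr(&counter);      /* pass a pointer to counter */
    printf("main: after ref_incr, counter is %d\n", counter);
    return 0;
}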
We pass the value of counter to the "val_incr()" function, and pass the pointer to counter ("&counter") to "ref_incr()". Let's run our little test:
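Running the sketch above produces:

main: counter starts out as 9
val_incr: start, counter is 9
val_incr: exit, counter is 10
main: after val_incr, counter is 9
ref_incr: start, counter is 9
ref_incr: exit, counter is 10
main: after ref_incr, counter is 10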
Note that after val_incr, when we returned to the scope of main(), counter was the same as what we started with -- 9. However, ref_incr was able to actually change the value stored in main's "counter" variable. The difference comes down to lexical scoping -- the "counter" in val_incr() is not the same as the "counter" in main(); that is to say, it refers to a different location in memory. The "*counter" in ref_incr() isn't the same either, but since it is a pointer, it can contain the address of the memory location referenced by main()'s "counter", and we can then dereference it and make changes.
Some of you may be scratching your heads right now and saying "But why not just declare 'counter' as a global variable, and then both val_incr() and ref_incr() would be able to change it without any gymnastics, and indeed without even passing it as a parameter?" That's true, and it's a bigger subject than I'm prepared to tackle right now, except to say that we gain significant advantages in maintainability, portability (in the sense of being able to reuse functions we write for one program in a later program), and provability by not using global variables. As a rule of thumb: don't use globals. That's definitely a rule you'll find good reasons to break, but it's to your advantage to learn discipline early.
The other reason we like pointers is for performance. When you pass by value (or do struct assignments) you are doing memory copies: if the variable you're passing is a large structure, then you're spending a lot of your time copying data. Let's look at an example.
Just for fun, I decided to run some numbers to see how expensive struct copies are versus simple pointer copies on a typical system (where "typical" is tautologically defined as "whichever machine I happen to be testing on at the time" -- I am not trying to do super-scientific measurements here, I'm just having fun.) I'm not normally a performance geek; as a developer, it's of much greater concern to me that a given piece of code run correctly, and that it be maintainable, than that it run fast. In my experience, it's usually easier to make readable, correct code faster than it is to make unreadable and buggy fast code correct. But, y'know. Sometimes you just want to play.
So here's some lame synthetic source code:
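Something in this spirit, at least -- the struct name and the payload size here are arbitrary and mine; the point is the difference between the two assignments:

#define PAYLOAD_INTS 1024              /* vary this to change the struct size */

struct big {
    int payload[PAYLOAD_INTS];
};

int main(void)
{
    struct big s1 = { { 0 } };
    struct big s2;
    struct big *p1, *p2;               /* deliberately left uninitialized --
                                          we only copy the pointer value and
                                          never dereference it */

    /* struct copy: every byte of s1 gets copied into s2 */
    s2 = s1;

    /* pointer copy: a single address gets copied, no matter how big
       the struct is */
    p2 = p1;

    return 0;
}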
That compiles down to this:
Doesn't look that different, does it? If we add some instrumentation to count the number of CPU ticks, though, we can see how long each of those segments takes. You can download the source of the instrumented version of this program to try yourself here. To compile it on most unix systems, try: cc -o ptrtest ptrtest.c. To generate and look at the assembly code (at least with gcc), do gcc -S ptrtest.c. The output will be in a file named ptrtest.s (and will look a little different from what you see above, since that's the instrumented version).
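If you're curious how that kind of tick-counting works, one common approach on x86 is to read the CPU's timestamp counter before and after the code you're measuring. This is only a sketch of the idea (the instrumented ptrtest.c may well do it differently), and it's gcc- and x86-specific:

#include <stdio.h>

/* Read the x86 timestamp counter (gcc-style inline assembly). */
static unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long) hi << 32) | lo;
}

int main(void)
{
    unsigned long long before, after;

    before = rdtsc();
    /* ... the code being measured goes here ... */
    after = rdtsc();

    printf("that took roughly %llu ticks\n", after - before);
    return 0;
}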
The number of cycles to do the struct copy increases linearly with the size of the structure being copied, as you'd expect. The pointer copy always takes about the same number of cycles.
Of course, this isn't a truly fair test. Maybe those egregiously uninitialized pointers would point to stack-allocated memory in a real program, but it's more likely that I'd have had to allocate and free the memory from the heap for them. What does it look like in that case? Let's change that first test to allocate the memory, do the pointer copy, and then free it.
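Roughly, the pointer-copy half of the test turns into this (a sketch again, reusing struct big from above; only the malloc and free are new):

#include <stdlib.h>

/* The pointer-copy test again, but this time paying full price for
   the memory: get it from the heap, copy the pointer, give it back. */
void heap_pointer_copy(void)
{
    struct big *p1, *p2;

    p1 = malloc(sizeof(struct big));
    if (p1 == NULL)
        return;            /* out of memory; nothing to measure */
    p2 = p1;               /* the pointer copy itself is still cheap... */
    free(p1);              /* ...but the malloc and free around it
                              are anything but */
}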
Suddenly, things look a lot different. That's with the standard FreeBSD malloc implementation; results might vary depending on what malloc you're using. Now you're in real trouble! 30,000 more ticks! Boy, is your boss going to be angry that you went over your tick budget. Those ticks were a precious resource, and you went and squandered them! So, naturally, you go on to implement a chunk allocator which preallocates all the memory your application is likely to use so you can quickly grab items off of a freelist, and then the next thing you know you're drunk, drinking sterno, unwashed, living under a bridge and whimpering "Garbage collection. Why can't I be programming in a language that has automatic garbage collection?" It's a hard knock life, thinking about performance, kid.
Really, though, this last case is the worst case for pointers. It's pretty rare that you allocate memory, copy a pointer once, and then free it right away; in a program of any size it's probable that you'll be doing many copies or assignments over the life of some variable. The bottom line is that (large) programs that don't use pointers will generally perform worse than programs that use them correctly.
I hope this article has proven useful. If you find any inaccuracies, please drop me a line, or comment below.
Additional references
You can begin your slow descent down to skid row by following these links:
- Wikipedia: automatic garbage collection
- Dynamic memory allocation is trickier on multiprocessor systems.
- A nice tutorial on pointers for the confused.
The issue with pointers isn't pointers, it's C. All programming languages have pointers of some sort (OK, all programming languages anybody uses for anything these days). But C delivers pointers in a format carefully optimized for blowing your foot off. I'm starting to think that CS should be taught with LISP and assembler, graduating to something with managed memory, and that C-style memory management should be a 300- or 400-level course rather than basic instruction. Either that or just force feed people the Boehm collector until they're old enough to know better. Meanwhile, back in the real world, I hear Python's nice.
Posted by Faisal N. Jawdat at March 4, 2004 07:24 AM

Yeah, it's pretty clear that the C syntax for pointers is pretty horrible. I was with you in your suggestions until you said "...and assembler," because then you end up teaching the students 36 different CPU-specific addressing modes, which is the same thing as teaching them C's pointer syntax except now they have 36 different ways to blow their own feet off, instead of just 2.
I'm pretty sure most Universities nowadays don't start with C, anyway, although I can't be sure, not being a student. I heard that the data structures and algorithms course at CMU, which used to be taught in modula-2 (kill me), is now taught in java, which makes sense to me.
Posted by peterb at March 4, 2004 07:44 AM

Modula-2? Wow. When I took the course, it was Common LISP. The next year, I think I heard it was SML (and people wanted to kill), and then switched up to Java shortly after that.
Posted by John Prevost at March 4, 2004 10:19 AM

Yeah, but when you start by learning assembler (like I did), you can't avoid learning the power of pointers, and all those addressing modes do weird things to your head that turn out to be fairly useful when programming.
The last three universities at which I worked all started their students with Scheme. I have no idea why.
I'd avoid the problem of teaching people addressing modes by just picking one form of assembler and sticking with it, probably on a "not real" processor. If you're teaching people direct and dangerous memory management, you might as well admit you're playing towers of hanoi on a grand scale instead of trying to pretend you're using a high level language.
Posted by Faisal N. Jawdat at March 4, 2004 09:30 PM

Good article. One issue, though: the first code sample in the "But Why" section contains an error. The example in the article is correct, but the download version repeats val_incr in the print statements for the ref_incr function (see below). Took some head scratching to figure out...
void val_incr(int counter) {
    printf("val_incr: start, counter is %d\n", counter);
    counter++;
    printf("val_incr: exit, counter is %d\n", counter);
}

void ref_incr(int *counter) {
    printf("val_incr: start, counter is %d\n", *counter);
    (*counter)++;
    printf("val_incr: exit, counter is %d\n", *counter);
}
Ouch! Nice catch. I've corrected the downloadable version. Thanks.
Posted by peterb at June 10, 2005 07:55 PM