CS140 Lecture notes -- Void Stars

  • Jim Plank
  • Directory: ~cs140/www-home/notes/VoidStar
  • Lecture notes: http://www.cs.utk.edu/~cs140/notes/VoidStar
  • Tue Sep 29 15:41:30 EDT 1998

    Type Casting

    There are times when you would like to take x bytes of memory of a certain type, and assign it to y bytes of memory of another type. This is called ``type casting''. A simple example is when you want to turn a char into an int, or an int into a float as in cast1.c:
    #include <stdio.h>
    
    int main()
    {
    char c;
    int i;
    float f;
    
    c = 'a';
    i = c;
    f = i;
    printf("c = %d (%c).   i = %d (%c).  f = %f\n", c, c, i, i, f);
    }
    
    The statement `i = c' is a type cast, as is the statement `f = i'.

    Note that if you type cast from a type with x bytes to a type with y bytes and x < y, then you don't lose any information. For example, if you cast a char to an int, or a float to a double, then you are not losing any information. However, if x > y, then you may lose information, since there are not enough bytes to hold all of the values of the original type.

    For example, look at cast2.c:

    #include <stdio.h>
    
    int main()
    {
    char c;
    int i;
    float f;
    double d;
    
    i = 1100;
    c = i;
    
    d = 11111111.11111;
    f = d;
    
    printf("c = %d   i = %d  f = %lf  d = %lf\n", c, i, f, d);
    printf("c = 0x%x   i = 0x%x\n", c, i);
    
    i = c;
    d = f;
    
    printf("c = %d   i = %d  f = %lf  d = %lf\n", c, i, f, d);
    printf("c = 0x%x   i = 0x%x\n", c, i);
    
    }
    
    When you set c to i, c cannot be 1100, because it is only one byte (its values go from -128 to 127). What happens is that it sets itself to the last (low-order) byte of i and ignores the others. Here is the output of cast2:
    UNIX> cast2
    c = 76   i = 1100  f = 11111111.000000  d = 11111111.111110
    c = 0x4c   i = 0x44c
    c = 76   i = 76  f = 11111111.000000  d = 11111111.000000
    c = 0x4c   i = 0x4c
    UNIX> 
    
    You'll see that the last byte of i is 0x4c. That is what c is set to. When you print that out as an integer, you see that it is 76. Therefore, setting c to i has turned 1100 into 76. Not what you thought, right?

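    Similarly, d is an 8 byte quantity, and f is a 4 byte quantity. Therefore, when you set f to d, you lose precision -- in this case the decimal portion. This is not always the case: a float holds roughly seven significant decimal digits, so a smaller number such as 1.11111 would keep its decimal portion, while 11111111.11111 uses up all of its digits on the integer part.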

    Now, when we set i to c and d to f, you see that i and d now equal 76 and 11111111.00000 respectively. Therefore, by setting the int to a char and back to an int we have lost information. This is a good thing to be aware of.


    Pointer Casting

    Some type castings, like the one above, are very natural. The C compiler will do these for you without complaining. Most others, however, the C compiler will complain about, unless you specifically tell it that you are doing a type cast (this is a way of telling the compiler ``Yes, I know what I'm doing.''). This is what we do when we call malloc():
    int *ip;
    
    ip = (int *) malloc(sizeof(int)*20);
    
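    malloc() doesn't know what you are going to store in the memory that it gives you, so it returns a generic pointer (a (void *), which we'll get to below). With the cast, you are telling the compiler ``that's ok -- I'm going to treat it as an (int *)''. In ANSI C the conversion from a (void *) happens automatically, so the cast is optional, but it does document what you intend.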

    Remember, all a pointer is is an index into a big array that we call memory. On our machines, all pointers are 4 bytes. On some machines (for example, the DEC Alpha), pointers are 8 bytes.

    This means that if for some reason we want to cast a pointer of one type to a pointer of another type, we will not lose information.

    Here is an example. Suppose we want to print out the four bytes of an integer. One way to do that is to simply print the integer in hexadecimal padded to eight characters. For example, 0xff00004a means that the integer consists of the bytes 0xff, 0x00, 0x00 and 0x4a. That's one of the nice features of hexadecimal.

    Another way to do this is to treat the address of the integer as a (char *) and then to print out the four characters pointed to by that (char *). This is done in cast3.c:

    #include <stdio.h>
    
    int main()
    {
    char *s;
    int i, j;
    int *ip;
    
    i = 1100;
    
    printf("i = 0x%08x\n", i);
    
    ip = &i;
    
    s = (char *) ip;
    
    for (j = 0; j < 4; j++) {
    printf("0x%02x ", s[j]);
    }
    printf("\n");
    }
    
    The first printf() statement prints out i in hexadecimal, padded to eight characters with zeros if necessary. The second printf() statement prints out each byte of i padded to two characters with zeros if necessary. You'll see that the output works as you'd think:
    UNIX> cast3
    i = 0x0000044c
    0x00 0x00 0x04 0x4c 
    UNIX> 
    
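    Make sure you look over that example until you understand what is going on. Our machines store the high-order byte of an integer first; a machine that stores the low-order byte first (an Intel PC, for example) would print the four bytes in the opposite order. Here are the four bytes of i at address &i: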
               |---------------------------|
    ip ------> | 0x00 | 0x00 | 0x04 | 0x4c |
               |---------------------------|
    
    When we set s to ip, then we can treat those four bytes as a character array and print out their values as char's.
                  |---------------------------|
    s, ip ------> | 0x00 | 0x00 | 0x04 | 0x4c |
                  |---------------------------|
    

    Unsigned chars/ints

    You can put the keyword unsigned in front of a char, short, int or long. What this means is that the value will never be negative. For char's, this means that their values go from 0 to 255 instead of from -128 to 127. Often this is extremely convenient. For example, look at cast4.c:
    #include <stdio.h>
    
    int main()
    {
    char *s;
    unsigned char *s2;
    int i, j;
    int *ip;
    
    i = 1000;
    
    printf("i = 0x%08x\n\n", i);
    
    ip = &i;
    
    s = (char *) ip;
    s2 = (unsigned char *) ip;
    
    for (j = 0; j < 4; j++) printf("%d ", s[j]);
    printf("\n");
    
    for (j = 0; j < 4; j++) printf("0x%02x ", s[j]);
    printf("\n\n");
    
    for (j = 0; j < 4; j++) printf("%d ", s2[j]);
    printf("\n");
    
    for (j = 0; j < 4; j++) printf("0x%02x ", s2[j]);
    printf("\n");
    
    }
    
    Here's the output:
    UNIX> cast4
    i = 0x000003e8
    
    0 0 3 -24 
    0x00 0x00 0x03 0xffffffe8 
    
    0 0 3 232 
    0x00 0x00 0x03 0xe8 
    UNIX> 
    
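    You'll note that since the highest bit in s[3] is set, it is a negative number, and prints as such. The char is converted to an int when it is passed to printf(), and the conversion keeps the value -24, which is 0xffffffe8 as a four-byte integer. When you use an unsigned char, then it prints out a positive value.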

    There will probably not be many occasions to use unsigned char's in this class (although for example, you could store PGM pixels as unsigned char's), but it is good for you to know what they are.


    (void *)'s

    In C, there is a type called (void) which stands for nothing. Why is this useful? Because it means that you can have a type called a (void *), which is a pointer to nothing. Since it is a pointer, we know that on our machines it is four bytes.

    (void *)'s are useful whenever you want to give someone a pointer, but you don't want them to know what it points to. We'll give a useful example of this in a bit. Until then, look at cast5.c:

    #include <stdio.h>
    
    int main()
    {
    int i, j;
    int *ip, *jp;
    void *v;
    
    i = 1000;
    
    ip = &i;
    
    v = (void *) ip;
    
    jp = (int *) v;
    
    printf("%d %d %d\n", i, *ip, *jp);
    printf("0x%lx 0x%lx 0x%lx\n", (unsigned long) ip, (unsigned long) v, (unsigned long) jp);
    }
    
    As you can see -- we don't lose any information setting v to ip, and then jp to v. This is because all three of them are 4-byte pointers. When we print out their values, they are all the same: 0xeffffa08.
    UNIX> cast5
    1000 1000 1000
    0xeffffa08 0xeffffa08 0xeffffa08
    UNIX> 
    

    TokenGen revisited

    Now, here's a useful example of using (void *). Remember the TokenGen stuff from the last lecture. We made use of two procedures, new_tokengen() and tokengen_get_token(), and we defined a TokenGen to be a struct in tg.h. However, when we use the TokenGen procedures (as in variance4.c), we never touch the TokenGen struct. We simply pass it to tokengen_get_token(). This is an ideal place for a (void *).

    Look at tokengen.h, variance5.c and tokengen.c. What I've done is define the TokenGen type to be a void. This means that a (TokenGen *) is simply a (void *). Variance5.c can use this just like it used the TokenGen in the last lecture. This is because all it does is pass the (TokenGen *) to tokengen_get_token().

    Now, tokengen.c is the tricky piece of code. It defines a new struct called a TrueTokenGen. This is exactly the same as the TokenGen from tg.h in the last lecture. It does the same thing as tg.c from the last lecture, working with TrueTokenGen's instead of TokenGen's. The only difference is that new_tokengen() casts its return value to a (TokenGen *). When tokengen_get_token() is called, it is called with a (TokenGen *), which is a (void *). This should be the pointer returned from new_tokengen(). Thus, tokengen_get_token() casts its argument to a (TrueTokenGen *) and uses that.

    It all works:

    UNIX> variance5 3
    10
    # HI!!!!
    3              5
    Average:  6.000000
    Variance: 8.666667
    UNIX> 
    
    Go over these three files very carefully so that you see how the (void *) works. This is what's known as ``information hiding'' in C. If users of a data structure (in this case, variance5.c) don't need to know how the data structure is implemented, then you can make them use (void *)'s, so that they truly don't know anything about the data structure. Often this is a nice thing. Had you instead exposed the data structure as in tg.h, then a user might mess with the data structure directly, and then if you wanted to change the implementation, you couldn't.

    Other languages (for example C++ and Java) do information hiding much, much better than C. However, it is extremely useful for you to know how it works in C, and quite frankly, well-defined data structures in C using (void *)'s are often easier to use than their analogs in the fancier languages.


    Hanging Yourself

    The big problem with C is that it lets you hang yourself, while other languages prevent it. Thus, there is nothing in C that prevents you from doing silly things like (in silly.c):
    #include <stdio.h>
    #include "tokengen.h"
    
    int main(int argc, char **argv)
    {
    char *s;
    
    s = tokengen_get_token((TokenGen *) "Jim");
    }
    
    What's going to happen? Well, tokengen_get_token() is going to cast the string "Jim", which is a (char *), into a (TrueTokenGen *), which of course should point to a struct. Who knows what it will get as ttg->field, but chances are that when it tries to get at ttg->is, either in the ``ttg->field >= ttg->is->NF'' part of the while() statement, or in the ``s = ttg->is->fields[ttg->field]'' statement, it will generate a segmentation violation. Try it out. This gives you good debugging practice:
    UNIX> silly
    Segmentation Fault
    UNIX> gdb silly
    GDB is free software and you are welcome to distribute copies of it
    under certain conditions; type "show copying" to see the conditions.
    There is absolutely no warranty for GDB; type "show warranty" for details.
    GDB 4.16 (sparc-sun-solaris2.5.1), 
    Copyright 1996 Free Software Foundation, Inc...
    (gdb) run 
    Starting program: /a/hasbro/lymon/homes/cs140/www-home/notes/VoidStar/silly 
    
    Program received signal SIGSEGV, Segmentation fault.
    0x10e68 in tokengen_get_token (tg=0x10fd8) at tokengen.c:29
    29        while(ttg->field == -1 || ttg->field >= ttg->is->NF) {
    (gdb) print ttg->field
    $1 = 0
    (gdb) print ttg->is->NF
    Cannot access memory at address 0x4a6974e0.
    (gdb)
    
    As you can see, we got a segmentation violation accessing ttg->is->NF. This is because ttg is really pointing to the string "Jim" rather than to a TrueTokenGen.

    The ultimate tokengen.c

    Finally, the files /home/cs140/spring-2004/include/token.h, /home/cs140/spring-2004/src/tokens/token.c and /home/cs140/spring-2004/objs/token.o hold the ultimate version of tokengen.c. See token.h for the procedures that are defined in them.

    Variance6.c implements the variance program using token.h and token.c. Note how it makes the code even simpler, and makes error checking very nice. Also, note the compilation procedure:
    UNIX> make
    gcc -g -I/home/cs140/spring-2004/include -c variance6.c
    gcc -g -I/home/cs140/spring-2004/include -o variance6 variance6.o /home/cs140/spring-2004/objs/token.o /home/cs140/spring-2004/objs/libfdr.a 
    UNIX> variance6
    usage: variance6 n
    UNIX> variance6 3
    1
    Not enough values
    UNIX> variance6 3
    1
    Jim
    Line 2: Bad double
    UNIX> variance6 3
    1
    # Jim
    2 3
    Average:  2.000000
    Variance: 0.666667
    UNIX>