CS360 Lecture notes -- Strings in C

  • James S. Plank
  • Directory: /home/plank/cs360/notes/Strings-In-C
  • Lecture notes: http://web.eecs.utk.edu/~plank/plank/classes/cs360/360/notes/Strings-In-C/index.html
  • Bitbucket: https://bitbucket.org/jimplank/cs360-lecture-notes.
  • Original lecture notes ("PointMalloc"): Fri Aug 31 10:39:16 EDT 2007.
  • Last modified: Wed Jan 17 16:45:00 EST 2018
    In C, we lose the ease of C++ strings, which is a pity. There are a lot of routines to help you create and manipulate strings in C. I go over many of them here. One important and inconvenient thing about C strings is that you have to manage your own memory, and that can lead to many pitfalls. One goal of this lecture is to help you avoid those pitfalls.

    strcpy()

    char *strcpy(char *s1, const char *s2);
    
    Strcpy() assumes that s2 is a null-terminated string, and that s1 is a (char *) with enough characters to hold s2, including the null character at the end. Strcpy() then copies s2 to s1. It also returns s1. Why would you return your first argument? The answer is historical -- I'll talk about it with strdup().

    Here's a simple program that uses strcpy() to initialize three strings and print them out (this is in strcpy.c):

    #include <stdio.h>
    #include <string.h>
    
    int main()
    {
      char give[5];
      char him[5];
      char six[5];
    
      strcpy(give, "Give");
      strcpy(him, "Him");
      strcpy(six, "Six!");
    
      printf("%s %s %s\n", give, him, six);
      return 0;
    }
    

    It runs fine:

    UNIX> ./strcpy
    Give Him Six!
    UNIX>
    
    Suppose I try to copy a string that's too big. For example, look at strcpy2.c:

    #include <stdio.h>
    #include <string.h>
    
    typedef unsigned long UL;
    
    int main()
    {
      char give[5];
      char him[5];
      char six[5];
    
      printf("give: 0x%lx  him: 0x%lx  six: 0x%lx\n", (UL) give, (UL) him, (UL) six);
    
      strcpy(give, "Give");
      strcpy(him, "Him");
      strcpy(six, "Six!");
    
      printf("%s %s %s\n", give, him, six);
    
      strcpy(him, "T.J. Houshmandzadeh");
    
      printf("%s %s %s\n", give, him, six);
      return 0;
    }
    

    Clearly there's a problem with this -- the string "T.J. Houshmandzadeh" is much larger than five characters. Some compilers will compile this, but others, like the one on my old Macintosh, take issue with it:

    UNIX> gcc -o strcpy2 strcpy2.c
    strcpy2.c: In function 'main':
    strcpy2.c:21: warning: call to __builtin___strcpy_chk will always overflow destination buffer
    UNIX> 
    
    That's a wise compiler. However, compilers are not all-seeing and all-knowing. Let's fool it by writing our own wrapper around strcpy() -- now it can't figure out the problem. This is in strcpy3.c:

    #include <stdio.h>
    #include <string.h>
    
    typedef unsigned long UL;
    
    void my_strcpy(char *s1, char *s2)
    {
      strcpy(s1, s2);
    }
    
    int main()
    {
      char give[5];
      char him[5];
      char six[5];
    
      printf("give: 0x%lx  him: 0x%lx  six: 0x%lx\n", (UL) give, (UL) him, (UL) six);
    
      strcpy(give, "Give");
      strcpy(him, "Him");
      strcpy(six, "Six!");
    
      printf("%s %s %s\n", give, him, six);
    
      my_strcpy(him, "T.J. Houshmandzadeh");
    
      printf("%s %s %s\n", give, him, six);
      return 0;
    }
    

    Now run it. Your memory addresses and your output may differ, but the interrelationship will be the same. I've compiled this one in 32-bit mode:

    UNIX> ./strcpy3
    give: 0xbfffe060  him: 0xbfffe050  six: 0xbfffe040
    Give Him Six!
    deh T.J. Houshmandzadeh Six!
    UNIX> 
    
    Take a minute and try to figure out what's going on. Look at the following picture of memory. When we start, space has been allocated for give, him and six:
                        |----4 bytes----|           
                   
                        |               |           
         six----------> |               | 0xbfffe040
                        |               | 0xbfffe044
                        |               | 0xbfffe048
                        |               | 0xbfffe04c
         him----------> |               | 0xbfffe050
                        |               | 0xbfffe054
                        |               | 0xbfffe058
                        |               | 0xbfffe05c
         give---------> |               | 0xbfffe060
                        |               | 0xbfffe064
                        |               | 0xbfffe068
                        |               | 0xbfffe06c
    
    Now, we make the first three strcpy() calls. At the point of the first printf() statement, memory looks like:
         six----------> |'S'|'i'|'x'|'!'| 0xbfffe040
                        | 0 |   |   |   | 0xbfffe044
                        |   |   |   |   | 0xbfffe048
                        |   |   |   |   | 0xbfffe04c
         him----------> |'H'|'i'|'m'| 0 | 0xbfffe050
                        |   |   |   |   | 0xbfffe054
                        |   |   |   |   | 0xbfffe058
                        |   |   |   |   | 0xbfffe05c
         give---------> |'G'|'i'|'v'|'e'| 0xbfffe060
                        | 0 |   |   |   | 0xbfffe064
                        |               | 0xbfffe068
                        |               | 0xbfffe06c
    
    Now, we make the call my_strcpy(him, "T.J. Houshmandzadeh"), which in turn calls strcpy(). What happens is that the entire string is copied to him, and this overruns the memory allocated for give:
         six----------> |'S'|'i'|'x'|'!'| 0xbfffe040
                        | 0 |   |   |   | 0xbfffe044
                        |   |   |   |   | 0xbfffe048
                        |   |   |   |   | 0xbfffe04c
         him----------> |'T'|'.'|'J'|'.'| 0xbfffe050
                        |' '|'H'|'o'|'u'| 0xbfffe054
                        |'s'|'h'|'m'|'a'| 0xbfffe058
                        |'n'|'d'|'z'|'a'| 0xbfffe05c
         give---------> |'d'|'e'|'h'| 0 | 0xbfffe060
                        | 0 |   |   |   | 0xbfffe064
                        |               | 0xbfffe068
                        |               | 0xbfffe06c
    
    So this means that him is indeed "T.J. Houshmandzadeh", but give has been modified as well, to be "deh". This accounts for the printout of:
    deh T.J. Houshmandzadeh Six!
    
    The bottom line is that when you modify memory that you have not allocated (as I did when I called strcpy(him, "T.J. Houshmandzadeh");), then strange things will happen. They have explanations, but until you figure it out, it will be confusing. If you're lucky, you get a segmentation violation or a bus error. If you're unlucky, you get weird, inexplicable output. A corollary of this is that when you get a segmentation violation, a bus error, or weird, inexplicable output, then chances are you have modified memory that you didn't allocate.

    strcat()

    char *strcat(char *s1, const char *s2);
    
    Strcat() assumes that s1 and s2 are both null-terminated strings. Strcat() then concatenates s2 to the end of s1. Like strcpy(), it returns s1. Strcat() assumes that there is enough space in s1 to hold these extra characters. Otherwise, you'll start stomping over memory that you didn't allocate. Here is a simple example (this is in strcat.c):

    #include <stdio.h>
    #include <string.h>
    
    int main()
    {
      char givehimsix[15];
    
      strcpy(givehimsix, "Give");
      printf("%s\n", givehimsix);
      strcat(givehimsix, " Him");
      printf("%s\n", givehimsix);
      strcat(givehimsix, " Six!");
      printf("%s\n", givehimsix);
      return 0;
    }
    

    The output is predictable:

    UNIX> ./strcat
    Give
    Give Him
    Give Him Six!
    UNIX> 
    
    Look at strcat2.c. Can you explain why the output is the way that it is? Try filling memory as in the strcpy2 example above.
    UNIX> ./strcat2
    give: 0xbfffe060  him: 0xbfffe050  six: 0xbfffe040
    Give Him Six!
    deh T.J. Houshmandzadeh Six!
    deh Help! T.J. Houshmandzadeh Help! Six!
    UNIX> 
    
    C-style strings are a little more difficult to handle than C++-style strings. For example, suppose you wanted to create a string with a given number of j's. In C++, you might write the following (makej.cpp):

    #include <cstdio>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    using namespace std;
    
    int main(int argc, char **argv)
    {
      int i, n;
      string s;
    
      if (argc != 2) { fprintf(stderr, "usage: makej number\n"); exit(1); }
      n = atoi(argv[1]);
    
      for (i = 0; i < n; i++) s += "j";
      cout << s << endl;
      return 0;
    }
    

    Suppose you want to write the equivalent in C. It's a little more difficult, as you need to call malloc() first, to allocate the string. However, here it is (strcat3.c):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main(int argc, char **argv)
    {
      char *s;
      int i;
      int n;
    
      if (argc != 2) { fprintf(stderr, "usage: strcat3 number\n"); exit(1); }
    
      n = atoi(argv[1]);
      s = (char *) malloc(sizeof(char)*(n+1));
      strcpy(s, "");
    
      for (i = 0; i < n; i++) strcat(s, "j");
      
      printf("%s\n", s);
      return 0;
    }
    

    When you run them on small numbers, they appear equivalent:

    UNIX> ./makej 50
    jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj
    UNIX> ./strcat3 50
    jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj
    UNIX> 
    
    However, try them on a really big number. Here, I'm going to redirect standard output to /dev/null, which throws it away, and I'm going to time it with time:
    UNIX> time sh -c "./makej 1000 > /dev/null"
    0.002u 0.004s 0:00.01 0.0%	0+0k 0+0io 0pf+0w
    UNIX> time sh -c "./makej 10000 > /dev/null"
    0.002u 0.004s 0:00.00 0.0%	0+0k 0+0io 0pf+0w
    UNIX> time sh -c "./makej 100000 > /dev/null"
    0.004u 0.004s 0:00.01 0.0%	0+0k 0+0io 0pf+0w
    UNIX> time sh -c "./strcat3 1000 > /dev/null"
    0.002u 0.004s 0:00.00 0.0%	0+0k 0+0io 0pf+0w
    UNIX> time sh -c "./strcat3 10000 > /dev/null"
    0.039u 0.004s 0:00.04 75.0%	0+0k 0+0io 0pf+0w
    UNIX> time sh -c "./strcat3 100000 > /dev/null"
    3.468u 0.005s 0:03.47 99.7%	0+0k 0+0io 0pf+0w
    UNIX> 
    
    See the problem? The C++ string maintains the string's length, so concatenation is fast. In contrast, strcat() has to find the end of the string at each call, which makes the program O(n^2). We can fix it, since we know where the end of the string is. This is in strcat4.c:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main(int argc, char **argv)
    {
      char *s;
      int i;
      int n;
    
      if (argc != 2) { fprintf(stderr, "usage: strcat4 number\n"); exit(1); }
    
      n = atoi(argv[1]);
      s = (char *) malloc(sizeof(char)*(n+1));
      strcpy(s, "");
    
      for (i = 0; i < n; i++) strcat(s+i, "j");  /* The only changed line */
      
      printf("%s\n", s);
      return 0;
    }
    

    UNIX> ./strcat4 50
    jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj
    UNIX> time sh -c "./strcat4 100000 > /dev/null"
    0.003u 0.004s 0:00.01 0.0%	0+0k 0+0io 0pf+0w
    UNIX> 
    
    Such is life in C.

    strlen()

    size_t strlen(const char *s);
    
    Strlen() assumes that s is a null-terminated string. It returns the number of characters before the null character. Strlen() is pretty obvious: (this is in strlen.c):

    #include <stdio.h>
    #include <string.h>
    
    int main()
    {
      char give[5];
      char him[5];
      char six[5];
    
      strcpy(give, "Give");
      strcpy(him, "Him");
      strcpy(six, "Six!");
    
      printf("%s %s %s\n", give, him, six);
      printf("%zu %zu %zu\n", strlen(give), strlen(him), strlen(six));
      return 0;
    }
    
    

    Output:

    UNIX> ./strlen
    Give Him Six!
    4 3 4
    

    strcmp() and strncmp()

    int strcmp(const char *s1, const char *s2);
    int strncmp(const char *s1, const char *s2, size_t n);
    
    Strcmp() performs a lexicographic comparison of two strings. It returns 0 if they are equal, a negative number if s1 is less than s2, and a positive number otherwise. You will use strcmp() quite a bit in this class, because it's the easiest way to compare two strings.

    Strncmp() stops comparing after n characters, if the null character has not been reached yet. It's a good exercise for you to do the D2 250-point problem from Topcoder SRM 683 as a standalone program in C, using strncmp() and strlen() rather than the C++ string library. I'll do it in class.


    strchr()

    char *strchr(const char *s, int c);
    
    Strchr() is how you perform "find" for single characters in C strings. It assumes that s is a null-terminated string. The argument c is an integer, but it is treated as a character. Strchr() returns a pointer to the first occurrence of the character equal to c in s. If s does not contain c, then it returns NULL.

    Here is a simple program that prints out whether each line of standard input contains a space (this is in strchr.c):

    #include <stdio.h>
    #include <string.h>
    
    int main()
    {
      char line[100];
      char *ptr;
    
      while (fgets(line, 100, stdin) != NULL) {
        ptr = strchr(line, ' ');
        if (ptr == NULL) {
          printf("No spaces\n");
        } else {
          printf("Space at character %ld\n", (long) (ptr-line));
        }
      }
    }
    

    Since you haven't seen fgets() before, go ahead and read the man page. The arguments are a buffer of chars, the size of the buffer, and a "stream" from which to read. stdin is a global variable, defined in stdio.h, that specifies standard input. fgets() reads a line of text from the stream, storing at most one fewer characters than the size specified, so that there is room for the null character. It will include the newline at the end of the line, which is often a pain. Not so here, though.

    I'm doing a little pointer arithmetic here -- ptr-line returns the number of characters between line and ptr. Here's an example of this running:

    UNIX> ./strchr
    Jim
    No spaces
    Jim Plank
    Space at character 3
    James Plank
    Space at character 5
     HI!
    Space at character 0
         HI!!
    Space at character 0
    <CNTL-D>
    UNIX> 
    
    We can modify this to print out where all the spaces are. Check out strchr2.c:
    UNIX> ./strchr2
    Jim
    No spaces
    Jim Plank
    Space at character 3
    Jim  Plank
    Space at character 3
    Space at character 4
      Give   Him   Six!!!
    Space at character 0
    Space at character 1
    Space at character 6
    Space at character 7
    Space at character 8
    Space at character 12
    Space at character 13
    Space at character 14
    <CNTL-D>
    UNIX> 
    
    Go over the code -- why do I say
            ptr = strchr(ptr+1, ' ');
    
    instead of
            ptr = strchr(ptr, ' ');
    
    If you don't know, copy the code, modify it, and see for yourself!

    If you want to find substrings rather than single characters, use strstr() (read the man page).


    Scanf()

    Scanf() is like printf() in that it takes a format string and some parameters. However, instead of writing the parameters to the terminal, it reads from the terminal (or whatever is standard input). Where scanf() confuses people is that there are no reference variables in C, so you have to use pointers. If you put "%d" in the format string, then scanf() will read an integer. The parameter that you have to pass is a pointer to the integer that you want read. The storage for the integer has to exist. Scanf() will read the integer from standard input, and will fill in the four bytes of the integer.

    Here's a simple example in scanf1.c:

    #include <stdio.h>
    #include <stdlib.h>
    
    int main()
    {
      int i;
     
      if (scanf("%d", &i) == 1) {
        printf("Just read i: %d (0x%x)\n", i, i);
      } else {
        printf("Scanf() failed for some reason.\n");
      }
      exit(0);
    }
    

    I have one integer, i. That's four bytes. They are located at i's pointer: &i. When I call scanf(), I say to read an integer from standard input, and fill in those four bytes with that integer. Scanf() returns the number of successful reads that it did. If our read is successful, the program prints i in decimal and in hexadecimal.

    UNIX> ./scanf1
    10
    Just read i: 10 (0xa)
    UNIX> ./scanf1
    Fred
    Scanf() failed for some reason.
    UNIX> ./scanf1
    15.999999999999
    Just read i: 15 (0xf)
    UNIX> ./scanf1
    -15.99999999999999
    Just read i: -15 (0xfffffff1)
    UNIX> ./scanf1
    <CNTL-D>
    Scanf() failed for some reason.
    UNIX> echo "" | ./scanf1
    Scanf() failed for some reason.
    UNIX> echo 15fred | ./scanf1
    Just read i: 15 (0xf)
    UNIX>
    
    Let's go over these examples.

    The program scanf2.c is buggy.

    int main()
    {
      int *i;
    
      printf("i = 0x%lx\n", (unsigned long) i);
      if (scanf("%d", i) == 1) {
        printf("Just read i: %d (0x%x)\n", *i, *i);
      } else {
        printf("Scanf() failed for some reason.\n");
      }
      exit(0);
    }
    

    It will compile (although some nosy compilers will figure out it's buggy and yell at you). Whether the bug manifests or not is a matter of luck. Here's the program on my Mac in 2015:

    UNIX> echo 10 | ./scanf2
    i = 0x7fff5fc01052
    Bus error
    UNIX>
    
    What happened? The answer is that i is an uninitialized variable. It randomly started with a value of 0x7fff5fc01052. When scanf() tried to stuff the value 10 into those four bytes, a hardware error was generated -- that's the bus error. If you're lucky, when your program has uninitialized variables, they lead to segmentation violations and bus errors. If you're unlucky, they won't, and you don't discover your bug until (potentially much) later.

    Just to test on some other machines, here it is on my Raspberry Pi in 2018:

    pi@raspberrypi:~/CS360/cs360-lecture-notes/CStuff$ echo 10 | ./scanf2
    i = 0x0
    Segmentation fault
    pi@raspberrypi:~/CS360/cs360-lecture-notes/CStuff$
    
    The fact that i was zero is good here -- the segmentation violation clues us into the fact that there is a bug.

    In 2018, my Mac gave me the disaster output:

    UNIX> echo 10 | ./scanf2
    i = 0x7fff57c662a0
    Just read i: 10 (0xa)
    UNIX> 
    
    The variable i just happens to be a legal and aligned address. The value 10 has been stuffed into bytes 0x7fff57c662a0 to 0x7fff57c662a3. Who knows what that is in my program. The fact that my program simply exits means that this bug is benign, but if I were to have lots more going on in my program, this bug would be extremely difficult to figure out. The reason is that when the error manifests, it will be much later in the program, when some other part of the program is using addresses 0x7fff57c662a0 to 0x7fff57c662a3. This is why it pays to be careful when you are programming.

    Strings and scanf

    As we know, a string in C is an array of chars. Recall that a char is a one-byte integer. When char is signed, as it is on most of the machines you will use, it has values between -128 and 127. The values from 0 to 127 correspond to ASCII characters (not all of them printable), with zero equaling the "null" character. A string is an array of chars that ends with the null character. The following program (scanf3.c) uses scanf() to read a string from standard input, and then to print the individual characters:

    #include <stdio.h>
    #include <stdlib.h>
    
    int main()
    {
      char s[10];
      int i;
    
      if (scanf("%s", s) != 1) exit(0);
    
      for (i = 0; s[i] != '\0'; i++) {
        printf("Character: %d: %3d %c\n", i, s[i], s[i]);
      }
      exit(0);
    }
    

    Since an array variable like s is equivalent to a pointer to the first element, we do not have to pass &s to scanf() -- we simply pass s.

    This program allows us to see the ASCII character codes for the characters in the string "Jim-Plank":

    UNIX> echo "Jim-Plank" | ./scanf3
    Character: 0:  74 J
    Character: 1: 105 i
    Character: 2: 109 m
    Character: 3:  45 -
    Character: 4:  80 P
    Character: 5: 108 l
    Character: 6:  97 a
    Character: 7: 110 n
    Character: 8: 107 k
    UNIX> 
    
    Scanf() with strings is problematic. Think about what happens when you enter a string with 10 or more characters: s only has room for nine characters plus the null character. Memory will get stomped on, just like the strcpy() and strcat() examples above with "T.J. Houshmandzadeh".

    Sscanf()

    Sscanf() is just like scanf(), except it takes an additional string as its first parameter, and it "reads" from that string instead of from standard input. It returns the number of correct matches that it made. Thus, it is quite convenient for converting strings to integers and doubles. It is far superior to atoi() and atof() because it lets you know when it fails, which is quite important.

    Here's an example program that reads lines of text from standard input, and attempts to convert them to ints and doubles. It is in sscanf1.c:

    #include <stdio.h>
    
    int main()
    {
      char buf[1000];
      int i, h;
      double d;
    
      while (fgets(buf, 1000, stdin) != NULL) {
        if (sscanf(buf, "%d", &i) == 1) {
          printf("When treated as an integer, the value is %d\n", i);
        } 
        if (sscanf(buf, "%x", &h) == 1) {
          printf("When treated as hex, the value is 0x%x (%d)\n", h, h);
        } 
        if (sscanf(buf, "%lf", &d) == 1) {
          printf("When treated as a double, the value is %lf\n", d);
        }
        if (sscanf(buf, "0x%x", &h) == 1) {
          printf("When treated as a hex with 0x%%x formatting, the value is 0x%x (%d)\n", h, h);
        }
        printf("\n");
      }
    }
    

    Here is an example of it running.

    UNIX> ./sscanf1
    10
    When treated as an integer, the value is 10
    When treated as hex, the value is 0x10 (16)
    When treated as a double, the value is 10.000000
    
    55.9
    When treated as an integer, the value is 55
    When treated as hex, the value is 0x55 (85)
    When treated as a double, the value is 55.900000
    
    .5679
    When treated as a double, the value is 0.567900
    
    a 
    When treated as hex, the value is 0xa (10)
    
    0x10
    When treated as an integer, the value is 0
    When treated as hex, the value is 0x10 (16)
    When treated as a double, the value is 16.000000
    When treated as a hex with 0x%x formatting, the value is 0x10 (16)
    
    UNIX> 
    
    The first four inputs should be straightforward. The last one is a little confusing. Both %x and %lf recognize "0x" in the input and perform the proper conversion in hex, while %d does not: it reads the 0 and stops at the x. This is standard behavior rather than a quirk of one machine. %x accepts the same input as strtoul() with base 16, which allows an optional "0x" prefix, and since C99, %lf accepts the same input as strtod(), which includes hexadecimal floating-point. Libraries that predate C99 may not accept hex for %lf, so I wouldn't rely on that part.

    Strdup()

    You'll be seeing more of strdup() in the Fields lecture, but I'll mention it now. The prototype of strdup() is:

    char *strdup(const char *s);
    

    It makes a copy of the string, allocating memory for the copy with malloc(). When you are finished with the copy, you should call free() on it, to avoid memory leaks. Again, we'll see more of that in the Fields lecture.

    Other useful procedures

    I don't go over these, but you'll use them from time to time. It's good to be aware of them. Read their man pages.