char *strcpy(char *s1, const char *s2);

Strcpy() assumes that s2 is a null-terminated string, and that s1 points to enough characters to hold s2, including the null character at the end. Strcpy() then copies s2 to s1. It also returns s1. Why would you return your first argument? The answer is historical -- I'll talk about it with strdup().
Here's a simple program that uses strcpy() to initialize three strings and print them out (this is in src/strcpy.c):
For those unfamiliar with "Give Him Six!", please see this, this or this.
/* Initialize three strings using strcpy() and print them. */

#include <stdio.h>
#include <string.h>

int main()
{
  char give[5];
  char him[5];
  char six[5];

  strcpy(give, "Give");
  strcpy(him, "Him");
  strcpy(six, "Six!");
  printf("%s %s %s\n", give, him, six);
  return 0;
}
It runs fine:
UNIX> bin/strcpy
Give Him Six!
UNIX>

Suppose I try to copy a string that's too big. For example, look at src/strcpy2.c:
/* What happens when you call strcpy and didn't allocate enough memory? */

#include <stdio.h>
#include <string.h>

typedef unsigned long UL;

int main()
{
  char give[5];
  char him[5];
  char six[5];

  /* Print the addresses of the three arrays. */

  printf("give: 0x%lx him: 0x%lx six: 0x%lx\n", (UL) give, (UL) him, (UL) six);

  /* This is the same as before -- nice strcpy() statements, and then print. */

  strcpy(give, "Give");
  strcpy(him, "Him");
  strcpy(six, "Six!");
  printf("%s %s %s\n", give, him, six);

  /* Now, this strcpy() is copying a string that is too big. */

  strcpy(him, "T.J. Houshmandzadeh");
  printf("%s %s %s\n", give, him, six);
  return 0;
}
Clearly there's a problem with this -- the string "T.J. Houshmandzadeh" is much larger than five characters. Some compilers, like the one on my new Macintosh, will compile this, but others, like the one on my old Macintosh, will take issue with it:
UNIX> gcc -o bin/strcpy2 src/strcpy2.c
src/strcpy2.c: In function 'main':
src/strcpy2.c:21: warning: call to __builtin___strcpy_chk will always overflow destination buffer
UNIX>

That's a wise compiler. However, compilers are not all-seeing and all-knowing. Let's fool it by writing our own wrapper around strcpy() -- now it can't figure out the problem. The code is in src/strcpy3.c.
/* This is the same as strcpy2.c, but I write a procedure to call strcpy(),
   so that even a smart compiler won't figure out that I have a problem. */

#include <stdio.h>
#include <string.h>

typedef unsigned long UL;

void my_strcpy(char *s1, char *s2)
{
  strcpy(s1, s2);
}

int main()
{
  char give[5];
  char him[5];
  char six[5];

  printf("give: 0x%lx him: 0x%lx six: 0x%lx\n", (UL) give, (UL) him, (UL) six);

  strcpy(give, "Give");
  strcpy(him, "Him");
  strcpy(six, "Six!");
  printf("%s %s %s\n", give, him, six);

  my_strcpy(him, "T.J. Houshmandzadeh");
  printf("%s %s %s\n", give, him, six);
  return 0;
}
Now run it. Your memory addresses may differ, and your output may differ, but the interrelationship will be the same. I ran this in 32-bit mode on my old Mac:
UNIX> bin/strcpy3
give: 0xbfffe060 him: 0xbfffe050 six: 0xbfffe040
Give Him Six!
deh T.J. Houshmandzadeh Six!
UNIX>

Take a minute and try to figure out what's going on. Look at the following picture of memory -- I'm drawing this in big-endian, because it makes the character strings easier to parse. When we start, space has been allocated for give, him and six:
                 |----4 bytes----|
                 |               |
                 | 0 | 1 | 2 | 3 |  (I'm drawing this in big endian)
                 |               |
six---------->   |               | 0xbfffe040
                 |               | 0xbfffe044
                 |               | 0xbfffe048
                 |               | 0xbfffe04c
him---------->   |               | 0xbfffe050
                 |               | 0xbfffe054
                 |               | 0xbfffe058
                 |               | 0xbfffe05c
give--------->   |               | 0xbfffe060
                 |               | 0xbfffe064
                 |               | 0xbfffe068
                 |               | 0xbfffe06c

Now, we make the first three strcpy() calls. At the point of the first printf() statement, memory looks like:
                 |----4 bytes----|
                 |               |
                 | 0 | 1 | 2 | 3 |  (I'm drawing this in big endian)
six---------->   |'S'|'i'|'x'|'!'| 0xbfffe040
                 | 0 |   |   |   | 0xbfffe044
                 |   |   |   |   | 0xbfffe048
                 |   |   |   |   | 0xbfffe04c
him---------->   |'H'|'i'|'m'| 0 | 0xbfffe050
                 |   |   |   |   | 0xbfffe054
                 |   |   |   |   | 0xbfffe058
                 |   |   |   |   | 0xbfffe05c
give--------->   |'G'|'i'|'v'|'e'| 0xbfffe060
                 | 0 |   |   |   | 0xbfffe064
                 |               | 0xbfffe068
                 |               | 0xbfffe06c

Now, we make the call strcpy(him, "T.J. Houshmandzadeh"). What happens is that the entire string is copied to him, and this overruns the memory allocated for give:
                 |----4 bytes----|
                 |               |
                 | 0 | 1 | 2 | 3 |  (I'm drawing this in big endian)
six---------->   |'S'|'i'|'x'|'!'| 0xbfffe040
                 | 0 |   |   |   | 0xbfffe044
                 |   |   |   |   | 0xbfffe048
                 |   |   |   |   | 0xbfffe04c
him---------->   |'T'|'.'|'J'|'.'| 0xbfffe050
                 |' '|'H'|'o'|'u'| 0xbfffe054
                 |'s'|'h'|'m'|'a'| 0xbfffe058
                 |'n'|'d'|'z'|'a'| 0xbfffe05c
give--------->   |'d'|'e'|'h'| 0 | 0xbfffe060
                 | 0 |   |   |   | 0xbfffe064
                 |               | 0xbfffe068
                 |               | 0xbfffe06c

So this means that him is indeed "T.J. Houshmandzadeh", but give has been modified as well, to be "deh". This accounts for the printout of:
deh T.J. Houshmandzadeh Six!

The bottom line is that when you modify memory that you have not allocated (as I did when I called strcpy(him, "T.J. Houshmandzadeh");), then strange things will happen. They have explanations, but until you figure them out, they will be confusing. If you're lucky, you get a segmentation violation or a bus error. If you're unlucky, you get weird, inexplicable output. A corollary of this is that when you get a segmentation violation, a bus error, or weird, inexplicable output, then chances are you have modified memory that you didn't allocate.
Here's the output on my Mac in 2021 -- I may well make this a clicker question, but see if you can figure out the output here.
UNIX> bin/strcpy3
give: 0x7ffeeea63197 him: 0x7ffeeea63192 six: 0x7ffeeea6318d
Give Him Six!
Houshmandzadeh T.J. Houshmandzadeh Six!
UNIX>
char *strcat(char *s1, const char *s2);

Strcat() assumes that s1 and s2 are both null-terminated strings. Strcat() then concatenates s2 to the end of s1. Like strcpy(), it returns s1. Strcat() assumes that there is enough space in s1 to hold these extra characters. Otherwise, you'll start stomping over memory that you didn't allocate. Here is a simple example (this is in src/strcat.c):
/* Using strcpy() and strcat() to create the string "Give Him Six!" incrementally. */

#include <stdio.h>
#include <string.h>

int main()
{
  char givehimsix[15];

  strcpy(givehimsix, "Give");
  printf("%s\n", givehimsix);
  strcat(givehimsix, " Him");
  printf("%s\n", givehimsix);
  strcat(givehimsix, " Six!");
  printf("%s\n", givehimsix);
  return 0;
}
The output is predictable:
UNIX> bin/strcat
Give
Give Him
Give Him Six!
UNIX>

Look at src/strcat2.c. Can you explain why the output is the way that it is? Try filling in memory as we did in the strcpy2 example above.
UNIX> bin/strcat2
give: 0xbfffe060 him: 0xbfffe050 six: 0xbfffe040
Give Him Six!
deh T.J. Houshmandzadeh Six!
deh Help! T.J. Houshmandzadeh Help! Six!
UNIX>

C-style strings are a little more difficult to handle than C++ style strings. For example, suppose you wanted to create a string with a given number of j's. In C++, you might write the following (src/makej.cpp):
/* Create a string with a given number of j's by using string concatenation. */

#include <iostream>
#include <string>
#include <cstdio>
#include <cstdlib>
using namespace std;

int main(int argc, char **argv)
{
  int i, n;
  string s;

  if (argc != 2) { fprintf(stderr, "usage: makej number\n"); exit(1); }
  n = atoi(argv[1]);

  for (i = 0; i < n; i++) s += "j";       // Here is the string concatenation.

  cout << s << endl;
  return 0;
}
Suppose you want to write the equivalent in C. It's a little more difficult, as you need to call malloc() first, to allocate the string. However, here it is (src/strcat3.c):
/* Trying to use strcat() like C++ string concatenation. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
  char *s;
  int i;
  int n;

  if (argc != 2) { fprintf(stderr, "usage: strcat3 number\n"); exit(1); }
  n = atoi(argv[1]);

  s = (char *) malloc(sizeof(char)*(n+1));
  strcpy(s, "");

  for (i = 0; i < n; i++) strcat(s, "j");    /* Here's the strcat() call, which is really inefficient. */

  printf("%s\n", s);
  return 0;
}
When you run them on small numbers, they appear equivalent:
UNIX> bin/makej 50
jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj
UNIX> bin/strcat3 50
jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj
UNIX>

However, try them on a really big number. Here, I'm going to redirect standard output to /dev/null, which throws it away, and I'm going to time it with time:
UNIX> time sh -c "bin/makej 1000 > /dev/null"
0.002u 0.004s 0:00.01 0.0% 0+0k 0+0io 0pf+0w          # Blink of an eye.
UNIX> time sh -c "bin/makej 10000 > /dev/null"
0.002u 0.004s 0:00.00 0.0% 0+0k 0+0io 0pf+0w          # Blink of an eye.
UNIX> time sh -c "bin/makej 100000 > /dev/null"
0.004u 0.004s 0:00.01 0.0% 0+0k 0+0io 0pf+0w          # Blink of an eye.
UNIX> time sh -c "bin/strcat3 1000 > /dev/null"
0.002u 0.004s 0:00.00 0.0% 0+0k 0+0io 0pf+0w          # Blink of an eye.
UNIX> time sh -c "bin/strcat3 10000 > /dev/null"
0.039u 0.004s 0:00.04 75.0% 0+0k 0+0io 0pf+0w         # A little slower
UNIX> time sh -c "bin/strcat3 100000 > /dev/null"
3.468u 0.005s 0:03.47 99.7% 0+0k 0+0io 0pf+0w         # Nearly 100 times slower!
UNIX>

See the problem? The C++ string maintains the string's length, so concatenation is fast. In contrast, strcat() has to find the end of the string at each call, which makes the program O(n^2). That matches the timings: ten times as many characters takes nearly 100 times as long. We can fix it, since we know where the end of the string is. This is in strcat4.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
  char *s;
  int i;
  int n;

  if (argc != 2) { fprintf(stderr, "usage: strcat4 number\n"); exit(1); }
  n = atoi(argv[1]);

  s = (char *) malloc(sizeof(char)*(n+1));
  strcpy(s, "");

  for (i = 0; i < n; i++) strcat(s+i, "j");    /* The only changed line */

  printf("%s\n", s);
  return 0;
}
UNIX> bin/strcat4 50
jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj
UNIX> time sh -c "bin/strcat4 100000 > /dev/null"
0.003u 0.004s 0:00.01 0.0% 0+0k 0+0io 0pf+0w          # Back to the blink of an eye
UNIX>

Such is life in C.
size_t strlen(const char *s);

Strlen() assumes that s is a null-terminated string. It returns the number of characters before the null character. Strlen() is pretty obvious (this is in src/strlen.c):
/* Print three strings and their lengths. */

#include <stdio.h>
#include <string.h>

int main()
{
  char give[5];
  char him[5];
  char six[5];

  strcpy(give, "Give");
  strcpy(him, "Him");
  strcpy(six, "Six!");
  printf("%s %s %s\n", give, him, six);

  /* strlen() returns a size_t, which is printed with %zu. */

  printf("%zu %zu %zu\n", strlen(give), strlen(him), strlen(six));
  return 0;
}
Output:
UNIX> bin/strlen
Give Him Six!
4 3 4
UNIX>
int strcmp(const char *s1, const char *s2);      /* We use ints as bools in C. */
int strncmp(const char *s1, const char *s2, size_t n);

Strcmp() performs a lexicographic comparison of two strings. It returns 0 if they are equal, a negative number if s1 is less than s2, and a positive number otherwise. You will use strcmp() quite a bit in this class, because it's the easiest way to compare two strings.
Strncmp() stops comparing after n characters, if the null character has not been reached yet. It's a good exercise for you to do the D2 250-point problem from Topcoder SRM 683 as a standalone program in C, using strncmp() and strlen() rather than the C++ string library. I'll probably do it in class.
char *strchr(const char *s, int c);

Strchr() is how you perform "find" for single characters in C strings. It assumes that s is a null-terminated string. The parameter c is an integer, but it is treated as a character. Strchr() returns a pointer to the first occurrence of the character equal to c in s. If s does not contain c, then it returns NULL.
Here is a simple program that prints out whether each line of standard input contains a space (this is in src/strchr.c):
/* Use strchr() to determine if each line of standard input has a space. */

#include <stdio.h>
#include <string.h>

int main()
{
  char line[100];
  char *ptr;

  while (fgets(line, 100, stdin) != NULL) {
    ptr = strchr(line, ' ');
    if (ptr == NULL) {
      printf("No spaces\n");
    } else {
      printf("Space at character %ld\n", (long) (ptr-line));
    }
  }
  return 0;
}
Since you haven't seen fgets() before, go ahead and read the man page. The arguments are a buffer of chars, the size of the buffer, and a "stream" from which to read. stdin is a global variable, defined in stdio.h, that refers to standard input. fgets() reads a line of text from the stream, stopping at one character fewer than the size specified, so that there is always room for the null character. It will include the newline at the end of the line, which is often a pain. Not so here, though.
I'm doing a little pointer arithmetic here -- ptr-line returns the number of characters between line and ptr. Here's an example of this running:
UNIX> bin/strchr
Jim
No spaces
Jim Plank
Space at character 3
James Plank
Space at character 5
 HI!
Space at character 0
 HI!!
Space at character 0
<CNTL-D>
UNIX>

We can modify this to print out where all the spaces are. Check out strchr2.c:
UNIX> bin/strchr2
Jim
No spaces
Jim Plank
Space at character 3
Jim  Plank
Space at character 3
Space at character 4
  Give   Him   Six!!!
Space at character 0
Space at character 1
Space at character 6
Space at character 7
Space at character 8
Space at character 12
Space at character 13
Space at character 14
<CNTL-D>
UNIX>

Go over the code -- why do I say

ptr = strchr(ptr+1, ' ');

instead of

ptr = strchr(ptr, ' ');

If you don't know, copy the code, modify it, and see for yourself!
If you want to find substrings rather than single characters, use strstr() (read the man page).
Here's a simple example in src/scanf1.c:
/* Read a single integer from standard input using scanf. */

#include <stdio.h>
#include <stdlib.h>

int main()
{
  int i;

  if (scanf("%d", &i) == 1) {
    printf("Just read i: %d (0x%x)\n", i, i);
  } else {
    printf("Scanf() failed for some reason.\n");
  }
  exit(0);
}
I have one integer, i. That's four bytes. They are located at i's pointer: &i. When I call scanf(), I say to read an integer from standard input, and fill in those four bytes with that integer. Scanf() returns the number of successful reads that it did. If our read is successful, the program prints i in decimal and in hexadecimal.
UNIX> bin/scanf1
10
Just read i: 10 (0xa)
UNIX> bin/scanf1
Fred
Scanf() failed for some reason.
UNIX> bin/scanf1
15.999999999999
Just read i: 15 (0xf)
UNIX> bin/scanf1
-15.99999999999999
Just read i: -15 (0xfffffff1)
UNIX> bin/scanf1
<CNTL-D>
Scanf() failed for some reason.
UNIX> echo "" | bin/scanf1
Scanf() failed for some reason.
UNIX> echo 15fred | bin/scanf1
Just read i: 15 (0xf)
UNIX>

Let's go over these examples.
The program scanf2.c is buggy.
#include <stdio.h>
#include <stdlib.h>

int main()
{
  int *i;     /* The bug: i is a pointer that is never initialized. */

  printf("i = 0x%lx\n", (unsigned long) i);

  if (scanf("%d", i) == 1) {
    printf("Just read i: %d (0x%x)\n", *i, *i);
  } else {
    printf("Scanf() failed for some reason.\n");
  }
  exit(0);
}
It will compile (although some nosy compilers will figure out it's buggy and yell at you). Whether the bug manifests or not is a matter of luck. Here's the program on my Mac in 2015:
UNIX> echo 10 | bin/scanf2
i = 0x7fff5fc01052
Bus error
UNIX>

What happened? The answer is that i is an uninitialized variable. It randomly started with a value of 0x7fff5fc01052. When scanf() tried to stuff the value 10 into those four bytes, a hardware error was generated -- that's the bus error. If you're lucky, when your program has uninitialized variables, they lead to segmentation violations and bus errors. If you're unlucky, they won't, and you don't discover your bug until (potentially much) later.
Just to test on some other machines, here it is on my Raspberry Pi in 2018:
pi@raspberrypi:~/CS360/cs360-lecture-notes/CStuff$ echo 10 | bin/scanf2
i = 0x0
Segmentation fault
pi@raspberrypi:~/CS360/cs360-lecture-notes/CStuff$

The fact that i was zero is good here -- the segmentation violation clues us in to the fact that there is a bug.
In 2018, my Mac gave me the disaster output:
UNIX> echo 10 | bin/scanf2
i = 0x7fff57c662a0
Just read i: 10 (0xa)
UNIX>

The variable i just happens to be a legal and aligned address. The value 10 has been stuffed into bytes 0x7fff57c662a0 to 0x7fff57c662a3. Who knows what that is in my program. The fact that my program simply exits means that this bug is benign, but if I were to have lots more going on in my program, this bug would be extremely difficult to figure out. The reason is that when the error manifests, it will be much later in the program, when some other part of the program is using addresses 0x7fff57c662a0 to 0x7fff57c662a3. This is why it pays to be careful when you are programming.
/* This program uses scanf and %s to read a string and print out the characters.
   You should *only* use scanf and %s if you are guaranteed that the string you are
   reading will not be bigger than the memory allocated to it.  Otherwise, you
   expose yourself to a buffer overflow attack. */

#include <stdio.h>
#include <stdlib.h>

int main()
{
  char s[10];
  int i;

  if (scanf("%s", s) != 1) exit(0);

  for (i = 0; s[i] != '\0'; i++) {
    printf("Character: %d: %3d %c\n", i, s[i], s[i]);
  }
  exit(0);
}
Since an array variable like s is equivalent to a pointer to the first element, we do not have to pass &s to scanf() -- we simply pass s.
This program allows us to see the ASCII character codes for the characters in the string "Jim-Plank":
UNIX> echo "Jim-Plank" | bin/scanf3
Character: 0:  74 J
Character: 1: 105 i
Character: 2: 109 m
Character: 3:  45 -
Character: 4:  80 P
Character: 5: 108 l
Character: 6:  97 a
Character: 7: 110 n
Character: 8: 107 k
UNIX>

Scanf() with strings is problematic. In particular, think about what happens when you enter a string with 10 or more characters (s only has room for nine characters plus the null character). Memory will get stomped on, just like the strcpy() and strcat() examples above with "T.J. Houshmandzadeh". For example, let's send a string with 80,000 'j' characters to bin/scanf3:
UNIX> bin/makej 80000 | bin/scanf3
Segmentation fault: 11
UNIX>

We were lucky to get a segmentation violation -- allowing your input to stomp on your memory is the heart of what's called a "buffer overflow attack". Using scanf() with strings is a very good way to expose yourself to a buffer overflow attack, unless you can guarantee that your input actually behaves. Using fgets() and subsequently calling sscanf() is a safer way to go.
Here's an example program that reads lines of text from standard input, and attempts to convert them to ints and doubles. It is in src/sscanf1.c:
/* Read lines with fgets() and try to convert each one with sscanf(). */

#include <stdio.h>

int main()
{
  char buf[1000];
  int i, h;
  double d;

  while (fgets(buf, 1000, stdin) != NULL) {
    if (sscanf(buf, "%d", &i) == 1) {
      printf("When treated as an integer, the value is %d\n", i);
    }
    if (sscanf(buf, "%x", &h) == 1) {
      printf("When treated as hex, the value is 0x%x (%d)\n", h, h);
    }
    if (sscanf(buf, "%lf", &d) == 1) {
      printf("When treated as a double, the value is %lf\n", d);
    }
    if (sscanf(buf, "0x%x", &h) == 1) {
      printf("When treated as a hex with 0x%%x formatting, the value is 0x%x (%d)\n", h, h);
    }
    printf("\n");
  }
  return 0;
}
Here is an example of it running.
UNIX> bin/sscanf1
10
When treated as an integer, the value is 10
When treated as hex, the value is 0x10 (16)
When treated as a double, the value is 10.000000

55.9
When treated as an integer, the value is 55
When treated as hex, the value is 0x55 (85)
When treated as a double, the value is 55.900000

.5679
When treated as a double, the value is 0.567900

a
When treated as hex, the value is 0xa (10)

0x10
When treated as an integer, the value is 0
When treated as hex, the value is 0x10 (16)
When treated as a double, the value is 16.000000
When treated as a hex with 0x%x formatting, the value is 0x10 (16)

UNIX>

The first four inputs should be straightforward. The last one is a little confusing, and the man page on sscanf() is not much help. %x and %lf recognize the "0x" in the input and perform the proper conversion in hex. %d does not: it reads the 0 and stops at the x. This is specified behavior rather than an accident. The C standard defines %x to accept the same input as strtoul() with a base of 16, where a leading "0x" is optional, and since C99 it defines %lf to accept the same input as strtod(), which understands hexadecimal. I still wouldn't rely on the %lf case, because a library that predates C99 may not implement it.
char *strdup(const char *s);
It is basically implemented as follows:
char *strdup(const char *s)
{
  return strcpy(malloc(strlen(s)+1), s);
}
In other words, it makes a copy of the string, allocating memory for the copy. Since it calls malloc(), if you are finished with the copy, you should call free() on it, to avoid memory leaks. See how it uses the return value of strcpy() that we all ignore? That's the only time you'll see that return value used. Again, we'll see more of strdup() in the Fields lecture.