CS360 Lecture notes -- C (Mostly Scanf)


Moving from C++ to C

This class is taught in C, rather than C++. The reasoning is as follows: Because C hides so much less from you than C++, you have a much easier time figuring out what's going on when you run one or more programs. This will be a little painful, because you lose so many of the wonderful things about C++ on which you have grown to rely, like cin, strings, objects with methods, and the standard template library. Sorry.

These lecture notes detail the parts of C++ that you lose when you migrate to C, and how you replace them.

You have to use gcc to compile programs in this class. You cannot use g++. Don't give the TAs C++ code and say you didn't know. You know.

Time to learn C.


Header files

As with C++, you include standard header files with #include. You include the file name in less-than/greater-than signs, and you include the .h extension. Instead of starting your programs with:

#include <iostream>
using namespace std;

you start them with:

#include <stdio.h>
#include <stdlib.h>

I never liked that "using namespace std" stuff anyway.


Comments

Comments in C are delimited by "/*" and "*/". The former starts the comment, which can span multiple lines, and the latter ends the comment. (C++-style commenting with "//" was added to the C standard in 1999, so you can use it, but I don't -- you never know when you're going to be running on that 1979 VM....)

Bye-bye, cin and cout

Frankly, this isn't too painful, and will be less so when you learn the fields library. I'm assuming that you already know printf() from previous classes. That handles output. For input, we'll focus on three procedures that are defined in stdio.h: scanf(), fscanf() and fgets().

Scanf() is like printf() in that it takes a format string and some parameters. However, instead of writing the parameters to the terminal, it reads from the terminal (or whatever is standard input). Where scanf() confuses people is that there are no reference variables in C, so you have to use pointers. If you put "%d" in the format string, then scanf() will read an integer. The parameter that you have to pass is a pointer to the integer that you want read. The storage for the integer has to exist. Scanf() will read the integer from standard input, and will fill in the four bytes of the integer.

Let's start with an example program in scanf1.c:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int i;
  
  if (scanf("%d", &i) == 1) {
    printf("Just read i: %d (0x%x)\n", i, i);
  } else {
    printf("Scanf() failed for some reason.\n");
  }
  exit(0);
}

I have one integer, i. That's four bytes. They are located at i's address, &i. When I call scanf(), I say to read an integer from standard input, and fill in those four bytes with that integer. Scanf() returns the number of items that it successfully read and stored, or EOF if the input runs out before it can read the first one. If our read is successful, the program prints i in decimal and in hexadecimal. You didn't forget hexadecimal, did you?

UNIX> scanf1
10
Just read i: 10 (0xa)
UNIX> scanf1
Fred
Scanf() failed for some reason.
UNIX> scanf1
15.999999999999
Just read i: 15 (0xf)
UNIX> scanf1
-15.99999999999999
Just read i: -15 (0xfffffff1)
UNIX> scanf1
<CNTL-D>
Scanf() failed for some reason.
UNIX> echo "" | scanf1
Scanf() failed for some reason.
UNIX> 
UNIX> echo 15fred | scanf1
Just read i: 15 (0xf)
UNIX> 
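
Let's go over these examples. When I enter 10, scanf() reads it and the program prints it in both bases. "Fred" cannot start an integer, so scanf() matches nothing and returns 0. With 15.999999999999, scanf() reads "15" and stops at the decimal point, which cannot be part of an integer; it does not round. The same thing happens with -15.99999999999999, and the hexadecimal output shows the two's complement representation of -15. With CNTL-D and with echo "", standard input ends before scanf() finds an integer, so it returns EOF rather than 1. Finally, with 15fred, scanf() reads 15 and leaves "fred" unread on standard input.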

The program scanf2.c is buggy.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int *i;
  
  printf("i = 0x%lx\n", (unsigned long) i);
  if (scanf("%d", i) == 1) {
    printf("Just read i: %d (0x%x)\n", *i, *i);
  } else {
    printf("Scanf() failed for some reason.\n");
  }
  exit(0);
}

It will compile (although some nosy compilers will figure out it's buggy and yell at you). However, it usually won't run without a problem. Here it is on my Mac:

UNIX> echo 10 | scanf2
i = 0x7fff5fc01052
Bus error
UNIX> 

What happened? The answer is that i is an uninitialized pointer. It happened to start with a value of 0x7fff5fc01052. When scanf() tried to stuff 10 into the four bytes at that address, a hardware error was generated -- that's the bus error. If you're lucky, when your program has uninitialized variables, they lead to segmentation violations and bus errors. If you're unlucky, they won't, and you don't discover your bug until (potentially much) later.

Why did I need the (unsigned long) typecast? Because the compiler knows that when I say "%lx" in the format string of printf(), I am promising to pass an unsigned long, and i is a pointer. Without the typecast, the compiler complains about the mismatch. The typecast says "Yes, I know this is a pointer -- treat it as an unsigned long anyway, please."


Strings and scanf

A string in C is an array of chars. Recall, a char is a one-byte integer. On most machines it is signed, which means that it has values between -128 and 127. The values from 0 to 127 are the ASCII character codes. Most of them match printable characters, and zero is the "null" character. A string is an array of chars that ends with the null character. The following program (scanf3.c) uses scanf() to read a string from standard input, and then to print the individual characters:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  char s[10];
  int i;
  
  if (scanf("%s", s) != 1) exit(0);

  for (i = 0; s[i] != '\0'; i++) {
    printf("Character: %d: %3d %c\n", i, s[i], s[i]);
  }
}

Since an array variable like s is treated as a pointer to its first element when you pass it to a procedure, we do not have to pass &s to scanf() -- we simply pass s.

This program allows us to see the ASCII character codes for the characters in the string "Jim-Plank":

UNIX> echo "Jim-Plank" | scanf3
Character: 0:  74 J
Character: 1: 105 i
Character: 2: 109 m
Character: 3:  45 -
Character: 4:  80 P
Character: 5: 108 l
Character: 6:  97 a
Character: 7: 110 n
Character: 8: 107 k
UNIX> 

Scanf() with strings is problematic, because scanf() has no idea how big the array is. Consider the following program, in scanf4.c:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  char s1[10];
  char s2[10];
  int i;
  
  printf("s1: 0x%lx\n", (unsigned long) s1);
  printf("s2: 0x%lx\n", (unsigned long) s2);

  printf("\nEnter s1 and s2:\n\n");
  
  if (scanf("%s", s1) != 1) exit(0);
  if (scanf("%s", s2) != 1) exit(0);

  printf("\n");
  printf("s1: %s\n", s1);
  printf("s2: %s\n", s2);
}

We run it, and for the second scanf() call, we enter a string that is much bigger than ten characters:

UNIX> scanf4
s1: 0x7fff5fbfdc60
s2: 0x7fff5fbfdc50

Enter s1 and s2:

Jim
0123456789abcdefghijk

s1: ghijk
s2: 0123456789abcdefghijk
UNIX> 

Take a look at what has happened. s2's address is 16 less than s1's address. So, when the scanf() call for s2 reads 21 characters, the first 16 go into addresses 0x7fff5fbfdc50 through 0x7fff5fbfdc5f, and the remaining 5 go into 0x7fff5fbfdc60 through 0x7fff5fbfdc64, followed by the null character at 0x7fff5fbfdc65. In other words, the remaining 5 go where s1 is pointing. That's why s1 is changed to "ghijk".

The following output is a little more confusing, so let's take a closer look:

UNIX> scanf4
s1: 0x7fff5fbfdc60
s2: 0x7fff5fbfdc50

Enter s1 and s2:

Jim
0123456789abcdef

s1: 
s2: 0123456789abcdef
UNIX> 

What happened to s1? Recall that C-style strings are arrays that end with the null character. So the second scanf() put 17 characters into s2: the 16 that we typed, plus the null character. That null character lands at 0x7fff5fbfdc60, which is the first character of s1. Hence, s1 becomes an empty string.

Finally, the output above was on my Macintosh (in 2015). When I ran it on my Linux box (in 2017), I got the following:

UNIX> scanf4
s1: 0x7ffcedf06110
s2: 0x7ffcedf06120

Enter s1 and s2:

Jim
0123456789abcdefghijk

s1: Jim
s2: 0123456789abcdefghijk
UNIX> scanf4
s1: 0x7ffd5d7ccee0
s2: 0x7ffd5d7ccef0

Enter s1 and s2:

0123456789abcdefghijk
Jim

s1: 0123456789abcdefJim
s2: Jim
UNIX> 

Fun facts. First, the addresses of s1 and s2 have changed from run to run. Really? You can ask me in class, but it may be more appropriate for you to ask your professor in CS361.

Second, you'll note that s2's address is 16 bytes greater than s1's, which is the opposite of what it was on my Mac. So, when I entered "Jim" and then "0123456789abcdefghijk", the second scanf() indeed overruns s2, but it doesn't affect s1, and you see no bug.

If you instead enter "0123456789abcdefghijk" and then "Jim", you'll see that the first scanf() overruns s1 and spills into s2, and then "Jim" and its null character overwrite the part that spilled. That's why s1 prints as "0123456789abcdefJim". Although the output of the program changes from machine to machine, the output is deterministic once you know what the addresses of s1 and s2 are. Put another way, on a test, I can tell you the first two lines of output, and what the input is. Then I can ask you what the rest of the output is, and you have enough information to tell me.


The bottom line with all of this is that scanf() with "%s" is dangerous, unless you know that your input is constrained to be safe. Serious problems can arise if you allow users to generate input that stamps on memory outside of the array that you set aside for it. This kind of program is open to what are known as "buffer overflow attacks," one of which (part of the Morris worm) famously crippled much of the Internet in 1988.

We'll continue with more C next lecture.