Sprintf, Sscanf, C strings




Introduction - C Style Strings

Sprintf() and sscanf() are string conversion procedures from the C Stdio Library. The functionality that you get with sprintf() and sscanf() is handled in C++ with stringstreams. However, sprintf() and sscanf() are far easier to use, so you should learn them.

Both of them work with "C style" strings. These are arrays of bytes (of type char). The convention with C-style strings is that the array contains printable characters, terminated with the null character (which you specify as '\0'; its actual value is zero). Note that I say this is a "convention." That's because it is up to anyone using and manipulating C-style strings to make sure that the array of bytes is in the proper format -- it is not handled for you automatically, the way it is with C++ strings.

Slight digression: I assume that you've had this information before, but a little review never hurts -- printable characters in C and C++ are simply bytes, which are small integers. The data type is char. (Whether a char holds values from -128 to 127 or from 0 to 255 is up to the compiler; the printable characters all fall between 32 and 126 either way.) You can print them as integers by using printf("%d", ...). When you print them as characters, you use printf("%c", ...). There is a mapping of integers to characters called "ASCII." You don't need to care what this mapping is, except you should know that:
  • Characters '0' through '9' are contiguous integers, and are smaller than 'A'-'Z' and 'a'-'z'.
  • Characters 'A' through 'Z' are contiguous integers, and are smaller than 'a'-'z'.
  • Characters 'a' through 'z' are contiguous integers.
When you print a string in C or C++, you are printing an array of bytes using the ASCII mapping. With printf("%s", s) in particular, you are passing a pointer to bytes, and printf() prints the character associated with each byte in succession, until it reaches the null character, specified as '\0', whose integer value is zero.

You access a C-style string with a pointer to its first character, of the type "char *". Or you use an array of characters (an array, not a C++ vector) that you allocate yourself, for example "char buf[10]", which is an array of 10 characters. Let's look at an example, in c_str.cpp:

/* Line  1 */    #include <string>
/* Line  2 */    #include <cstdio>
/* Line  3 */    #include <cstdlib>
/* Line  4 */    #include <iostream>
/* Line  5 */    using namespace std;
/* Line  6 */    
/* Line  7 */    int main()
/* Line  8 */    {
/* Line  9 */      char buf[10];
/* Line 10 */      char *str;
/* Line 11 */      int i;
/* Line 12 */      string cpps;
/* Line 13 */    
/* Line 14 */      str = buf;
/* Line 15 */    
/* Line 16 */      for (i = 0; i < 6; i++) buf[i] = 'A'+i;
/* Line 17 */      buf[i] = '\0';
/* Line 18 */    
/* Line 19 */      printf("When I print buf with percent s, I get: %s\n", buf);
/* Line 20 */      printf("When I print str with percent s, I get: %s\n", str);
/* Line 21 */    
/* Line 22 */      cpps = buf;
/* Line 23 */      str[0] = 'X';
/* Line 24 */      cpps[1] = 'Y';
/* Line 25 */    
/* Line 26 */      cout << "This is cpps: " << cpps << endl;
/* Line 27 */      cout << "This is str:  " << str << endl;
/* Line 28 */      cout << "This is buf:  " << buf << endl;
/* Line 29 */      return 0;
/* Line 30 */    }

On lines 9 and 10 of this program, I declare an array of ten characters called buf, and a character pointer called str. The very first action that I perform (on line 14) is to have str point to the first byte of buf. This looks as follows:

In other words, I have allocated ten bytes, and I may access them in two ways -- via the variable buf and via the variable str. Next (lines 16 - 20), I set the first six characters to 'A', 'B', 'C', 'D', 'E' and 'F'. I set the next character to the null character, and I print out both buf and str. At this point, they are:

I can see this when I run the program by taking a look at the first two lines of output:

UNIX> g++ -std=c++98 c_str.cpp
UNIX> a.out | head -n 2
When I print buf with percent s, I get: ABCDEF
When I print str with percent s, I get: ABCDEF
UNIX> 
Next (line 22), I assign buf to the C++ string cpps. This creates a new C++ string, which is a heavier-weight data structure, because it contains more information than simply an array of bytes. It copies the bytes of buf into its own data structure. Here is a picture after line 22:

Finally (lines 23 and 24), I change the first byte of str to 'X'. Because str is pointing to the same bytes as buf, that changes the first character of buf. I also change the second byte of cpps to 'Y'. Although this looks like the previous statement, it is a bit different, because the C++ string class overloads the bracket operators so that it finds the underlying bytes of the string, and changes the second one. Here's what they look like afterward:

This explains the last three lines of output, where I print out cpps, str and buf.

UNIX> a.out | tail -n 3
This is cpps: AYCDEF
This is str:  XBCDEF
This is buf:  XBCDEF
UNIX> 
Make sure you understand every line of this program, and in particular, why it is that str and buf utilize the same string buffer, and cpps utilizes a different string.

C++ strings

Strings in C++ are very nicely handled -- you may manipulate them, allocate them, read them and write them quite seamlessly. You may view a C++ string as a class that has, at its core, a C style string. It's something like the following (this is not exactly right, but for the purposes of this explanation, it's good enough):
class string {
  public:
    unsigned long size();
    const char *c_str();
    ...
  private:
    char *underlying_string;
    unsigned long Size;
    ...
};
When you create a string, for example:
  int i;
  string s;

  for (i = 'A'; i < 'F'; i++) s.push_back(i);
Then the string structure does some work to allocate memory for its string buffer, and when the loop is done, you'll have Size equaling 5, and underlying_string will point to a buffer of at least six bytes, the first 5 of which are 'A', 'B', 'C', 'D' and 'E', and the last of which is '\0'. We can view it as follows:

You'll note, in the drawing above, the amount of memory that underlying_string is pointing to is greater than the six characters required to store 'A' through 'E' and the null character. That can happen, and it will be different from machine to machine. However, let's go with this example. Let's suppose you do two more "push_back" commands:

  s.push_back('F');
  s.push_back('G');
Our string now looks as follows:

Now, suppose you do one more "push_back":

  s.push_back('H');
The buffer that holds the string has run out of room. So, what the string class does is allocate a new, larger buffer, and copy the string there. The state of our string will look something like this:

What I'm trying to convey here is that the old buffer is discarded and a new one is used. The old buffer will be released to the memory management system to be reused, and the string uses the new buffer until it fills up.


The c_str() method of a C++ string

The c_str() method of a C++ string returns a const char *. This is a pointer to the first byte of the C-style string that is held inside the C++ string. The keyword "const" means that you should not try to modify this string, and if you do, the compiler will exit with an error. We can get around this, by the way, which we'll do later. As I intimated above, C++ strings will keep filling in their underlying C-style string until they run out of room, at which point they allocate a new buffer. We can see that happen with the following program, in buffer_changes.cpp:

#include <string>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
  string s;
  const char *cs;
  int i;

  cs = s.c_str();

  for (i = 1; i <= 10000; i++) {
    s.push_back('A');
    if (s.c_str() != cs) {
      printf("The underlying buffer changed at size: %d\n", i);
      cs = s.c_str();
    }
  }
  return 0;
}

This keeps adding characters to a C++ string, s, and it notes when storage for the underlying C-style string changes. Check it out as it runs (on my mac):

UNIX> g++ -std=c++98 buffer_changes.cpp 
UNIX> a.out
The underlying buffer changed at size: 1
The underlying buffer changed at size: 2
The underlying buffer changed at size: 3
The underlying buffer changed at size: 5
The underlying buffer changed at size: 9
The underlying buffer changed at size: 17
The underlying buffer changed at size: 33
The underlying buffer changed at size: 65
The underlying buffer changed at size: 129
The underlying buffer changed at size: 257
The underlying buffer changed at size: 513
The underlying buffer changed at size: 1025
The underlying buffer changed at size: 2049
The underlying buffer changed at size: 8136
UNIX> 
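I think we can all figure out that, for the most part, the underlying buffers are allocated to be powers of two (plus one -- you need room for the null character). That growth policy belongs to this particular library, so expect different numbers on other machines.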

One thing that you should get out of this program is that you should not store c_str() pointers if you change the C++ string, because the underlying buffer can change.


Don't mess with the bytes that c_str() returns.

The const keyword typically keeps you out of danger, but you can get around it. The following program should show you why you shouldn't do that. Here, I "typecast" the return value of c_str() to a char * that doesn't have the "const" keyword. Then I modify the C-style string so that it puts the null character after the first character. Then I print out the C++ string's size, plus I print out the string using both printf() and cout. The program is in bad_c_str.cpp:

#include <string>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
  string s;
  char *cs;

  s = "ABCDE";
  cs = (char *) s.c_str();
  cs[1] = '\0';
  
  cout << s.size() << endl;
  cout << s << endl;
  printf("%s\n", s.c_str());

  s.push_back('F');

  cout << s.size() << endl;
  cout << s << endl;
  printf("%s\n", s.c_str());
  
  return 0;
}

When I run it (again on my mac), you see some pretty odd behavior:

UNIX> g++ -std=c++98 bad_c_str.cpp
UNIX> a.out
5
ACDE
A
6
ACDEF
A
UNIX> 
You'll note that putting the null character into s turns the C style string into a one-character string, but the C++ string retains its size. When you print it out with cout, it writes all size() bytes, including the null character, which the terminal simply doesn't display. When you print out the c_str() with printf(), it stops at the null character.

In other words, this is a program itching with bugs. Don't do what I've done here; however, it's good to see what's happening.


sprintf()

Sprintf() does what printf() does, only it takes as its first argument a pointer to a buffer of bytes, and instead of printing to the screen, it puts its output into that buffer. Here's a very simple example of putting 5 numbers into a string (sprintf1.cpp):

#include <string>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
  char buf[100];
  string s;
  int i;

  cin >> i;

  sprintf(buf, "%d %d %d %d %d", i, i+1, i+2, i+3, i+4);
  s = buf;

  cout << s << endl;

  return 0;
}

When we run it, we see that the string s is set to "1 2 3 4 5":

UNIX> g++ -std=c++98 sprintf1.cpp
UNIX> echo 1 | a.out
1 2 3 4 5
UNIX> 
You want to make sure that you allocate a buffer that is big enough. If you don't, the sprintf() call will overrun memory, and when you do that, odd things may happen. Here's an example, in sprintf2.cpp:

#include <string>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
  char buf1[8];
  char buf2[8];
  char buf3[8];
  int i;

  buf1[0] = '\0';
  buf2[0] = '\0';
  buf3[0] = '\0';

  cin >> i;

  printf("Before:\n");
  printf("buf1: %s\n", buf1);
  printf("buf2: %s\n", buf2);
  printf("buf3: %s\n", buf3);

  sprintf(buf2, "%d %d %d %d %d", i, i+1, i+2, i+3, i+4);

  printf("After:\n");
  printf("buf1: %s\n", buf1);
  printf("buf2: %s\n", buf2);
  printf("buf3: %s\n", buf3);

  return 0;
}

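If you're lucky, you get a seg-fault. Here, you're not so lucky (again, this is on my mac -- results of this program will differ from machine to machine). With an input of 1000, the string that sprintf() creates needs 25 bytes, counting the null character, but buf2 only has 8. The weird behavior is in buf1. Look at what it is before and after the sprintf() statement: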

UNIX> g++ -std=c++98 sprintf2.cpp 
UNIX> echo 1000 | a.out
Before:
buf1: 
buf2: 
buf3: 
After:
buf1: 003 1004
buf2: 1000 1001 1002 1003 1004
buf3: 
Abort
UNIX> 
We will explore this phenomenon in great detail in CS360. For now, just remember to make sure that your sprintf() buffers are big enough to hold the final strings.

sscanf()

Sscanf() does the opposite of sprintf(). It takes a C-style string as its first argument, and then a format string like sprintf(), and then it attempts to "read" from the first string, converting what it has read into the proper data structures. The variables into which it "reads" must be specified as pointers. Let's look at an example (in sscanf1.cpp):

#include <string>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
  string s;
  int i;

  s = "100";
  sscanf(s.c_str(), "%d", &i);
  printf("i = %d\n", i);

  return 0;
}

This "reads" the string s, and converts it to an integer i, which it then prints:

UNIX> g++ -std=c++98 sscanf1.cpp
UNIX> a.out
i = 100
UNIX> 
You can specify multiple inputs to read, and sscanf() will return the number of items that it read successfully. The program below (sscanf2.cpp) reads a line of text, and then tries to interpret that line as a double, followed by a space, and an int:

#include <string>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
  string s;
  int i, n;
  double d;
  
  getline(cin, s);
  n = sscanf(s.c_str(), "%lf %d", &d, &i);
  printf("n = %d.  d = %lf.  i = %d\n\n", n, d, i);
  return 0;
}

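Here it is running on a variety of inputs. Make sure you understand all of these outputs. One caveat: when sscanf() fails to read a value, it leaves the corresponding variable alone. Since d and i are never initialized, the zeros that you see below for unread values are a matter of luck, not something that sscanf() guarantees: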

UNIX> g++ -std=c++98 sscanf2.cpp
UNIX> echo 10.5 5 | a.out
n = 2.  d = 10.500000.  i = 5

UNIX> echo 10.5 Fred | a.out
n = 1.  d = 10.500000.  i = 0

UNIX> echo Fred 5 | a.out
n = 0.  d = 0.000000.  i = 0

UNIX> echo 10.5xyz 55 | a.out
n = 1.  d = 10.500000.  i = 0

UNIX> echo go vols | a.out
n = 0.  d = 0.000000.  i = 0

UNIX> 
Your input fields don't have to be separated by spaces. The following program (sscanf3.cpp) reads a line of text in the format "h:m:s":

#include <string>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
  string l;
  int h, m, s, n;
  
  getline(cin, l);
  n = sscanf(l.c_str(), "%d:%d:%d", &h, &m, &s);
  printf("n = %d.  h = %d.  m = %d.  s = %d.\n", n, h, m, s);
  return 0;
}

That may well be handy for lab 1.....

UNIX> g++ -std=c++98 sscanf3.cpp
UNIX> echo '55:33:22' | a.out
n = 3.  h = 55.  m = 33.  s = 22.
UNIX>