CS140 Lecture notes -- Strings

Directory: ~plank/cs140/notes/String

Lecture notes: http://www.cs.utk.edu/~plank/plank/classes/cs140/Notes/String

Last modification date: Mon Jan 30 14:38:26 EST 2012

Dealing with C style strings

At the end of this lecture, I give you some optional material about C and C++ strings. I am making that optional, though, because I think at this stage it is more confusing than good. Thus, you should simply use C++ style strings for everything, but be aware that you will have to deal with C-style strings in a few situations.

Some procedures and methods require them. You have already seen printf(). Opening files with fstreams also requires C-style strings rather than C++ strings, at least with pre-C++11 compilers (the C++11 standard allows C++ strings to be passed to fstreams). Simply be aware of it, and use the c_str() method when you need to convert a C++ string to a C-style string.
Argv is composed of C-style strings. When you encounter a C-style string, just copy it to a C++ string, and then you can use it happily.
sscanf(): You'll be exposed to the joys of sscanf() soon enough. Just keep it in the back of your mind.

Innocently consuming gobs of memory

Let's take a look at two really simple programs. The first, gendouble.cpp, adds up some number of random doubles and prints the result:

#include <iostream>
#include <sstream>
#include <cstdlib>
#include <cstdio>
using namespace std;

int main(int argc, char **argv)
{
  double d;
  int n, i;
  istringstream ss;

  if (argc != 2) { cerr << "usage: gendouble iterations\n"; exit(1); }
  ss.str(argv[1]);
  if (!(ss >> n)) { cerr << "usage: gendouble iterations\n"; exit(1); }

  d = 0;
  for (i = 0; i < n; i++) {
    d += drand48();
  }

  cout << d << endl;
}

When we run it, we expect the final sum to be roughly n/2, and it runs pretty quickly on my MacBook Pro. In fact, running it for roughly 1G iterations (10⁹) takes about 38 seconds (time prints out timing information from the operating system -- the third word is the wall-clock time):

UNIX> time gendouble 1000
497.784
0.000u 0.001s 0:00.00 0.0%	0+0k 0+0io 0pf+0w
UNIX> time gendouble 10000
4983.82
0.001u 0.001s 0:00.00 0.0%	0+0k 0+0io 0pf+0w
UNIX> time gendouble 100000
49964.4
0.006u 0.001s 0:00.00 0.0%	0+0k 0+0io 0pf+0w
UNIX> time gendouble 1000000
500184
0.047u 0.001s 0:00.05 80.0%	0+0k 0+0io 0pf+0w
UNIX> time gendouble 10000000
5.00124e+06
0.382u 0.002s 0:00.38 100.0%	0+0k 0+0io 0pf+0w
UNIX> time gendouble 100000000
5.00023e+07
3.794u 0.012s 0:03.82 99.4%	0+0k 0+0io 0pf+0w
UNIX> time gendouble 1000000000
4.99991e+08
37.854u 0.121s 0:38.13 99.5%	0+0k 0+0io 0pf+0w
UNIX>

Now, let's change the program slightly to append random doubles to a string. This is genstring.cpp:

#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>
#include <cstdio>
using namespace std;

int main(int argc, char **argv)
{
  string s;
  int n, i;
  istringstream ss;
  ostringstream so;

  if (argc != 2) { cerr << "usage: genstring iterations\n"; exit(1); }
  ss.str(argv[1]);
  if (!(ss >> n)) { cerr << "usage: genstring iterations\n"; exit(1); }

  s = "";
  for (i = 0; i < n; i++) {
    so.clear();
    so.str("");
    so << drand48() << endl;
    s += so.str();
  }
  if (n <= 10) cout << s;
}

When we run it with an argument of 10, we get ten random doubles:

UNIX> genstring 10
0.396465
0.840485
0.353336
0.446583
0.318693
0.886428
0.0155828
0.58409
0.159369
0.383716
UNIX>

And when we try to time it, we get much slower running times than with gendouble:

UNIX> time genstring 1000
0.002u 0.001s 0:00.00 0.0%	0+0k 0+0io 0pf+0w
UNIX> time genstring 10000
0.019u 0.001s 0:00.02 50.0%	0+0k 0+0io 0pf+0w
UNIX> time genstring 100000
0.150u 0.004s 0:00.15 100.0%	0+0k 0+0io 0pf+0w
UNIX> time genstring 1000000
1.417u 0.038s 0:01.46 98.6%	0+0k 0+0io 0pf+0w
UNIX> time genstring 10000000
14.000u 0.314s 0:14.37 99.5%	0+0k 0+0io 0pf+0w
UNIX> time genstring 100000000
136.256u 2.652s 2:25.52 95.4%	0+0k 0+0io 6pf+0w
UNIX>

That last output line -- where it says "6pf+0w" -- reports six page faults, which means that we're starting to have problems finding memory for the program. When we run it with twice that value, we actually run out of memory!

UNIX> time genstring 200000000
genstring(21164) malloc: *** mmap(size=1073745920) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
terminate called after throwing an instance of 'std::bad_alloc'
  what():  St9bad_alloc
Abort
162.765u 2.998s 2:50.55 97.1%	0+0k 0+0io 10pf+0w
UNIX>

Think about it -- each random number prints as about 8 characters, so including the newline, each double consumes about 9 bytes. With 100,000,000 doubles, that is nearly a gigabyte (roughly 900 MB) of memory. My Macbook has 1.5 GB, so it's not surprising that I would run out of memory when n equals 200,000,000.

The reason I go over this program is that with C++, it's really easy to write programs that behave pathologically concerning memory, and memory-burning programs are much harder on a computer than CPU-burning programs. Thus, I want you to start thinking about memory when you write your programs.

Two useful string methods -- find and substr

Find() is a method that allows you to find characters or substrings within strings. Read the reference page from www.cppreference.com. It defines four find functions, and these can have multiple sets of parameters. The program string-find.cpp illustrates all of those parameter combinations:

#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>
using namespace std;

int main()
{
  string a, b;
  int i;

  a = "Lighting Strikes.  Lightning Strikes Again.";
  b = "Light";

  printf("    ");
  for (i = 0; i < 43; i++) printf("%d", i%10);
  printf("\n");

  printf("a = %s\n", a.c_str());
  printf("b = %s\n", b.c_str());
  printf("a.find(b) = %ld\n", a.find(b));
  printf("a.find(b, 1) = %ld\n", a.find(b, 1));
  printf("a.find(b, 20) = %ld\n", a.find(b, 20));
  printf("a.find('g') = %ld\n", a.find('g'));
  printf("a.find('g', 20) = %ld\n", a.find('g', 20));
  printf("a.find(\"Strike\") = %ld\n", a.find("Strike"));
  printf("a.find(\"Strike\", 20) = %ld\n", a.find("Strike", 20));
  printf("a.find(\"Aging\", 0, 2) = %ld\n", a.find("Aging", 0, 2));
  printf("string::npos = %ld\n", string::npos);
}

The first three find() calls illustrate finding a C++ string within a string. It returns the index of the first occurrence of the substring. If you call find() with a second argument, it says to start looking at that index. The first occurrence of "Light" at or after character 1 is at character 19. If find() fails, it returns string::npos, which is the largest value that the unsigned type string::size_type can hold. It prints as -1 here because we print it with "%ld", which treats it as a signed number. You should compare against string::npos rather than -1 to make your programs more portable.

The next two find()'s show finding a character, and the next two show finding a C style substring. The last one shows that if you give it a C style substring, a starting index and a third argument -- the length -- it will only look for length characters of the substring. Thus, even though "Aging" doesn't appear in the string, we're only looking for the first two characters -- "Ag" -- which occur at index 37.

UNIX> string-find
    0123456789012345678901234567890123456789012
a = Lighting Strikes.  Lightning Strikes Again.
b = Light
a.find(b) = 0
a.find(b, 1) = 19
a.find(b, 20) = -1
a.find('g') = 2
a.find('g', 20) = 21
a.find("Strike") = 9
a.find("Strike", 20) = 29
a.find("Aging", 0, 2) = 37
string::npos = -1
UNIX>

The feature of C++ that lets you define multiple instances of a procedure or method that work on multiple types of arguments is called overloading (it is a simple form of polymorphism). If you give a combination of arguments that is not supported, then you will get a compilation error. For example, in bad-find.cpp we make a seemingly innocuous call of "a.find(b, 1, 3)":

#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>
using namespace std;

int main()
{
  string a, b;
  int i;

  a = "Lighting Strikes.  Lightning Strikes Again.";
  b = "Light";

  printf("    ");
  for (i = 0; i < 43; i++) printf("%d", i%10);
  printf("\n");

  printf("a = %s\n", a.c_str());
  printf("a.find(b, 1, 3) = %d\n", a.find(b, 1, 3));
}

This doesn't compile, because there is no definition of find(string, int, int). There are the following definitions:

find(string)
find(string, int)
find(char *)
find(char *, int)
find(char *, int, int)
find(char)
find(char, int)

None of them match, so you get the following, rather cryptic and voluminous error message:

UNIX> g++ -o bad-find bad-find.cpp
bad-find.cpp: In function 'int main()':
bad-find.cpp:20: error: no matching function for call to 'std::basic_string<char, std::char_traits<char>, std::allocator<char> >::find(std::string&, int, int)'
/usr/include/c++/4.4/bits/basic_string.tcc:714: note: candidates are: typename std::basic_string<_CharT, _Traits, _Alloc>::size_type std::basic_string<_CharT, _Traits, _Alloc>::find(const _CharT*, typename _Alloc::rebind<_CharT>::other::size_type, typename _Alloc::rebind<_CharT>::other::size_type) const [with _CharT = char, _Traits = std::char_traits<char>, _Alloc = std::allocator<char>]
/usr/include/c++/4.4/bits/basic_string.h:1660: note:                 typename _Alloc::rebind<_CharT>::other::size_type std::basic_string<_CharT, _Traits, _Alloc>::find(const std::basic_string<_CharT, _Traits, _Alloc>&, typename _Alloc::rebind<_CharT>::other::size_type) const [with _CharT = char, _Traits = std::char_traits<char>, _Alloc = std::allocator<char>]
/usr/include/c++/4.4/bits/basic_string.h:1674: note:                 typename _Alloc::rebind<_CharT>::other::size_type std::basic_string<_CharT, _Traits, _Alloc>::find(const _CharT*, typename _Alloc::rebind<_CharT>::other::size_type) const [with _CharT = char, _Traits = std::char_traits<char>, _Alloc = std::allocator<char>]
/usr/include/c++/4.4/bits/basic_string.tcc:737: note:                 typename std::basic_string<_CharT, _Traits, _Alloc>::size_type std::basic_string<_CharT, _Traits, _Alloc>::find(_CharT, typename _Alloc::rebind<_CharT>::other::size_type) const [with _CharT = char, _Traits = std::char_traits<char>, _Alloc = std::allocator<char>]
UNIX>

There are a bunch of other types of find(). Read the reference from www.cppreference.com to see how they all work.

Substr() is a method that takes a starting index and an optional count, and returns a substring of a string. The simple example program is string-sub.cpp:

#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>
using namespace std;

int main()
{
  string a;
  int i;

  a = "Lighting Strikes.  Lightning Strikes Again.";

  printf("    ");
  for (i = 0; i < 43; i++) printf("%d", i%10);
  printf("\n");

  printf("a = %s\n", a.c_str());
  printf("a.substr(19) = %s\n", a.substr(19).c_str());
  printf("a.substr(19, 13) = %s\n", a.substr(19, 13).c_str());
  printf("a.substr(19, 13).substr(5) = %s\n", a.substr(19, 13).substr(5).c_str());
}

When only one argument is given, it returns a substring from the given index to the end of the string. If two arguments are given, it returns the specified number of characters. Since the substring is a string, you can call its methods, such as c_str() and substr().

UNIX> string-sub
    0123456789012345678901234567890123456789012
a = Lighting Strikes.  Lightning Strikes Again.
a.substr(19) = Lightning Strikes Again.
a.substr(19, 13) = Lightning Str
a.substr(19, 13).substr(5) = ning Str
UNIX>

C style strings, C++ style strings, the const keyword and memory

These notes are optional.

One inconvenient fact of life is that we have to acknowledge and utilize a second representation of strings: their representation in C. This is for many reasons:

String literals are in fact C-style strings.
If you use printf(), you have to use C-style strings.
Argv is an array of C-style strings.
They are required in some C++ functions, like open in fstream.

A C-style string is an array of characters that ends with the null character ('\0' -- whose value is actually zero). Although there are libraries that let you manipulate C-style strings, they are far more cumbersome than C++ strings because you have to explicitly manage memory yourself. We'll see a lot more of this in CS360. For now, take a look at the program argv-mess.cpp:

#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>
using namespace std;

int main(int argc, char **argv)
{
  string a, b;
  char *ca, *ca2;
  const char *ca3;

  if (argc != 2) { cerr << "usage: argv-mess arg1\n"; exit(1); }

  a = argv[1];
  ca = argv[1];
  ca2 = ca;
  b = a;
  ca3 = a.c_str();

  printf("%-30s %7s %7s %7s %7s %7s %7s\n", "", "a", "b", "ca", "ca2", "ca3", "argv[1]");
  printf("%-30s %7s %7s %7s %7s %7s %7s\n", "", "-------", "-------", "-------", "--------", "-------", "-------");

  printf("%-30s %7s %7s %7s %7s %7s %7s\n", "Start:", 
         a.c_str(), b.c_str(), ca, ca2, ca3, argv[1]);
 
  a[0] = 'Y';
  printf("%-30s %7s %7s %7s %7s %7s %7s\n", "After setting a[0] to 'Y':", 
         a.c_str(), b.c_str(), ca, ca2, ca3, argv[1]);
  
  ca[0] = 'L';
  printf("%-30s %7s %7s %7s %7s %7s %7s\n", "After setting ca[0] to 'L':", 
         a.c_str(), b.c_str(), ca, ca2, ca3, argv[1]);
  
  a = "XX";
  printf("%-30s %7s %7s %7s %7s %7s %7s\n", "After setting a to \"XX\":",
         a.c_str(), b.c_str(), ca, ca2, ca3, argv[1]);
}

This program has two C++ strings (a and b), two (char *)'s (ca and ca2), and a const char * (ca3). It first sets a to equal argv[1]. This converts a C style string (argv[1]) to a C++ style string, which makes a copy. Second, it sets ca to equal argv[1]. This is different -- because ca is a pointer, it doesn't make a copy -- ca and argv[1] simply point to the same character array.

We next set ca2 to equal ca. Once again, that simply sets one pointer to another. It doesn't make a copy of the array's contents. The next statement, which sets b to equal a does make a copy -- when you set one string equal to another, the string library makes a copy.

Finally, we set ca3 to equal a.c_str() -- the C++ string library maintains strings as C-style strings with extra information. When you ask for c_str(), you get a pointer to the underlying C-style string. However, the compiler makes you declare the pointer as a const char *, which means that you cannot modify the string through it. That is for safety -- you can look at the string, but you can't mess with it.

We print everything out, and then we change a[0] to 'Y'. We print everything out again, and then we change ca[0] to 'L'. We print everything out again, and then we set a to "XX". We finish by printing everything out again:

UNIX> argv-mess Ho
                                     a       b      ca     ca2     ca3 argv[1]
                               ------- ------- ------- -------- ------- -------
Start:                              Ho      Ho      Ho      Ho      Ho      Ho
After setting a[0] to 'Y':          Yo      Ho      Ho      Ho      Ho      Ho
After setting ca[0] to 'L':         Yo      Ho      Lo      Lo      Ho      Lo
After setting a to "XX":            XX      Ho      Lo      Lo      Ho      Lo
UNIX>

It's important for you to understand this output. In the beginning, all strings are "Ho", but there are in actuality three copies of the string:

a and ca3 point to one copy.
b points to another copy.
ca, ca2 and argv[1] all point to the third copy.

When we set a[0] to 'Y', you can see that only a is affected. It should be clear that b, ca, ca2 and argv[1] are not affected. It's a little surprising that ca3 is unaffected. What happened is that the string library switched a to a different underlying character array when we modified it, so ca3 still points to the old one. Drag -- what you should learn from this is that you can't keep pointers around to the underlying strings -- if you modify a C++ string, any pointers that you had from c_str() may no longer be valid.

Now, when we set ca[0] to 'L', you see that ca, ca2 and argv[1] are all changed. That's because they all point to the same character array, and we just changed the first character in that array.

Finally, when we set a to "XX", again only a is changed. Once again -- ca3's contents cannot be relied upon.

Hammering home the point: C-style strings are simply arrays of characters. A C-style string will be a (char *), which points to the first element of the array. Assigning one C-style string to another does not copy the characters -- you are simply assigning a pointer.

C++ style strings, on the other hand, are heavyweight objects that maintain extra information like the size of the string. When you copy a C++ string, you make a copy of the contents. That's usually what you want.