CS202 Lecture notes -- The Basics of Strings and Vectors (plus getline)

  • James S. Plank
  • Directory: ~jplank/cs202/notes/SV-Basics
  • Lecture notes: http://web.eecs.utk.edu/~jplank/plank/classes/cs202/Notes/SV-Basics
  • Notes created in 2011
  • Last modification date: Tue Aug 24 16:59:34 EDT 2021

    String Basics

    Strings are a fundamental class supported by C++ to do text processing.

    They are very natural, so you can often write programs with them without thinking about it too much. For example, the following program (src/string-basic.cpp) illustrates many functionalities with strings:

    I'm assuming that all of this is review from CS102.

    /* This program illustrates some basic functionalities with strings. */
    
    #include <iostream>
    #include <string>
    #include <cstdio>
    using namespace std;
    
    int main()
    {
      string a, b, c;
      size_t i;
    
      /* String assignment from literals. */
    
      a = "LIGHTNING";
      b = "Lightning";
      c = "Strikes";
    
      /* Printing out strings and their sizes. */
    
      cout << "a: " << a << " -- " << a.size() << " characters." << endl;
      cout << "b: " << b << " -- " << b.size() << " characters." << endl;
      cout << "c: " << c << " -- " << c.size() << " characters." << endl;
      cout << endl;
    
      /* Modifying a string. */
    
      printf("Changing all but the first character of a to lower case:\n\n");
    
      for (i = 1; i < a.size(); i++) a[i] += ('a' - 'A');
      cout << "Changed a to: " << a << endl << endl;
    
      /* Testing equality and comparison. */
    
      printf("Testing equality: (a == b): %d.  (a == c): %d.  (b == c): %d\n", (a == b), (a == c), (b == c));
    
      printf("Comparison:       (a >= b): %d.  (a >= c): %d.  (b >= c): %d\n", (a >= b), (a >= c), (b >= c));
      printf("Comparison:       (a <= b): %d.  (a <= c): %d.  (b <= c): %d\n", (a <= b), (a <= c), (b <= c));
      printf("Comparison:       (a <  b): %d.  (a <  c): %d.  (b <  c): %d\n", (a <  b), (a <  c), (b <  c));
      printf("Comparison:       (a >  b): %d.  (a >  c): %d.  (b >  c): %d\n", (a >  b), (a >  c), (b >  c));
      cout << endl;
    
      /* Showing how addition is overloaded to do string concatenation. */
    
      a = b + c;
      cout << "a = b + c: a is now: " << a << endl;
    
      return 0;
    }
    

    Nothing surprising happens when we run this:

    UNIX> bin/string-basic
    a: LIGHTNING -- 9 characters.
    b: Lightning -- 9 characters.
    c: Strikes -- 7 characters.
    
    Changing all but the first character of a to lower case:
    
    Changed a to: Lightning
    
    Testing equality: (a == b): 1.  (a == c): 0.  (b == c): 0
    Comparison:       (a >= b): 1.  (a >= c): 0.  (b >= c): 0
    Comparison:       (a <= b): 1.  (a <= c): 1.  (b <= c): 1
    Comparison:       (a <  b): 0.  (a <  c): 1.  (b <  c): 1
    Comparison:       (a >  b): 0.  (a >  c): 0.  (b >  c): 0
    
    a = b + c: a is now: LightningStrikes
    UNIX> 
    
    Those last three functionalities (equality testing, comparison and concatenation) exploit a feature of C++ that you may learn more about someday. It's called "operator overloading," which allows you to redefine basic operators in the language to work on user-defined objects. In this case, '==', '<=', '>=', '<', '>' and '+' have all been defined to work on strings. It's pretty convenient in this case, but you should know that I'm not a fan of operator overloading in general. Put that in the back of your mind.
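
    To make the comparison results above less mysterious: the overloaded operators compare strings lexicographically, one character at a time, using the characters' numeric codes. That is why "Lightning" is less than "Strikes" ('L' comes before 'S'). Here is a minimal sketch. The helper order_of() is hypothetical, not part of the lecture's source files, and the remark about upper and lower case assumes an ASCII character set:

```cpp
#include <string>

/* Hypothetical helper (not in the lecture's src directory): reports how two
   strings compare under the overloaded operators. */
std::string order_of(const std::string &x, const std::string &y)
{
  if (x == y) return "==";
  return (x < y) ? "<" : ">";
}
```

    For example, order_of("Lightning", "Strikes") returns "<". So does order_of("LIGHTNING", "Lightning"), because in ASCII every upper-case letter has a smaller code than every lower-case letter. A string that is a proper prefix of another, such as "Light" and "Lightning", compares as smaller.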

    Getline()

    The procedure getline(cin, s) reads a line of input from standard input and puts it into the string s. Spaces are preserved. Later, we will use getline() and stringstreams to process input. However, here, we simply read lines of standard input with getline() (this is in src/getline.cpp):

    /* This program uses getline() to read lines of text, and print their line numbers and size. */
    
    #include <iostream>
    #include <string>
    #include <cstdio>
    using namespace std;
    
    int main()
    {
      string s;
      int ln, len;
    
      ln = 0;
      while (getline(cin, s)) {
        ln++;
        len = s.size();
        printf("Line %2d - Size: %3d - %s\n", ln, len, s.c_str());
      }
      return 0;
    }
    

    Running it, we see that it works as promised:

    UNIX> head -n 5 data/input.txt                        # head -n x prints the first x lines of a file.
    Give me a weapon of power, which no one else may hold,
    Defend the Gods with honor, To lead the BRAVE and BOLD
    
        LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
        LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
    UNIX> head -n 5 data/input.txt | bin/getline
    Line  1 - Size:  54 - Give me a weapon of power, which no one else may hold,
    Line  2 - Size:  54 - Defend the Gods with honor, To lead the BRAVE and BOLD
    Line  3 - Size:   0 - 
    Line  4 - Size:  45 -     LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
    Line  5 - Size:  45 -     LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
    UNIX> 
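
    To see what "spaces are preserved" means in practice, it helps to compare getline() with the >> operator, which skips leading whitespace and stops at the next whitespace. The sketch below is only an illustration. Both helper names are hypothetical, and it reads from an istringstream (a stream whose input comes from a string, covered later in the class) so that you can try it without typing input:

```cpp
#include <sstream>
#include <string>

/* Both helpers are hypothetical.  They read from an istringstream so that
   no terminal input is needed. */

/* Returns the first line of text, exactly as getline() delivers it:
   leading and trailing spaces are kept, the newline is dropped. */
std::string first_line(const std::string &text)
{
  std::istringstream in(text);
  std::string s;

  getline(in, s);
  return s;
}

/* Returns the first thing that the >> operator extracts from text:
   leading whitespace is skipped, and reading stops at the next whitespace. */
std::string first_word(const std::string &text)
{
  std::istringstream in(text);
  std::string s;

  in >> s;
  return s;
}
```

    Given the text "    LIGHTNING STRIKES AGAIN" followed by a newline, first_line() returns the whole line including its four leading spaces, while first_word() returns just "LIGHTNING".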
    

    Vector Basics

    Vectors are the simplest part of the Standard Template Library (STL). There's a lot of linguistic hocus pocus that has to go on for the STL to work as it does. We won't concern ourselves with that in this class, but we will teach you how to make use of the library effectively. Vectors are a general-purpose array class. To use a vector, you include <vector>, and then you declare a vector as follows:

    vector <TYPE> variable-name(size)  // The size is optional
    

    The TYPE can be a basic type like int or double, or it can be more complex, like a C++ class or even another vector. You can declare a vector as starting with a certain number of elements, or you can declare it to be empty. In either case, you can dynamically modify the vector's size with the resize() method. You can use the size() method to get the vector's current size.
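
    As a small illustration of a vector whose TYPE is itself a vector, here is a sketch that builds a two-dimensional table. The function name make_table() and its parameters are made up for this example. It starts with an empty vector and uses resize() and size() as described above:

```cpp
#include <cstddef>
#include <vector>

/* Hypothetical helper: builds a table with "rows" rows and "cols" columns,
   stored as a vector of vectors of ints. */
std::vector< std::vector<int> > make_table(std::size_t rows, std::size_t cols)
{
  std::vector< std::vector<int> > table;   // Declared empty.
  std::size_t i;

  table.resize(rows);                      // Now it holds "rows" empty vectors.
  for (i = 0; i < table.size(); i++) table[i].resize(cols);
  return table;
}
```

    After calling make_table(3, 4), the result's size() is 3, and each of its three elements is a vector whose size() is 4.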

    A very simple example program is in src/vec1.cpp:

    /* A simple program to show the basics of vectors:
         - Declaring them.
         - Checking their size
         - Resizing
         - Setting values;
     */
    
    #include <cstdio>
    #include <vector>
    #include <iostream>
    using namespace std;
    
    int main()
    {
      vector <int> v1;
      vector <double> v2(10);
      size_t i;
    
      /* Print out v1's size and v2's size.  It is unfortunate that size() returns
         a "size_t", which is an unsigned long on most 64-bit Unix systems, instead
         of an int, so you must use "%lu" instead of %d inside printf().  We'll talk
         about it in class.  */
    
      printf("V1's size: %lu.  V2's size: %lu\n", v1.size(), v2.size());
    
      /* Resize the vectors and print out the new sizes. */
    
      v1.resize(5);
      v2.resize(8);
      printf("V1's size: %lu.  V2's size: %lu\n", v1.size(), v2.size());
    
      /* Set the vectors' values, and print them out. */
    
      for (i = 0; i < v1.size(); i++) v1[i] = 10 + i;
      for (i = 0; i < v2.size(); i++) v2[i] = 20.3 + i;
    
      printf("V1:"); 
      for (i = 0; i < v1.size(); i++) printf(" %d", v1[i]); 
      printf("\n");
    
      printf("V2:"); 
      for (i = 0; i < v2.size(); i++) printf(" %.1lf", v2[i]);
      printf("\n");
    
      return 0;
    }
    

    This program declares an empty integer vector v1 and a ten-element double vector v2, and prints their sizes. It then resizes the vectors to five and eight and prints their sizes again. It then initializes the elements of v1 and v2 in two for loops and prints out the two vectors. Straightforward stuff:

    UNIX> bin/vec1
    V1's size: 0.  V2's size: 10
    V1's size: 5.  V2's size: 8
    V1: 10 11 12 13 14
    V2: 20.3 21.3 22.3 23.3 24.3 25.3 26.3 27.3
    UNIX> 
    
    One note -- when you print a size() with printf(), you need to specify "%lu" instead of "%d". This is because size() returns a size_t, which on most 64-bit Unix systems is a 64-bit unsigned long. If you don't do it, you will get a compiler warning. Similarly, you'll want to declare i to be a size_t. Otherwise, you'll get compiler warnings when you compare i to v1.size(), because you would be comparing a signed quantity to an unsigned one.
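
    If you would rather not depend on size_t being an unsigned long, there are two portable choices. Since C++11, printf() accepts "%zu", which is the conversion meant for a size_t. Alternatively, you can cast the size to an unsigned long and keep using "%lu". The sketch below shows both. The helper names are hypothetical, and they use snprintf() to build a string rather than printing, just so the result is easy to inspect:

```cpp
#include <cstdio>
#include <string>
#include <vector>

/* Hypothetical helper: formats a vector's size as text using "%zu", the
   printf() conversion for size_t (available since C++11). */
std::string size_as_text(const std::vector<int> &v)
{
  char buf[32];

  std::snprintf(buf, sizeof(buf), "%zu", v.size());
  return buf;
}

/* The same thing with an explicit cast, so that "%lu" always matches
   the type of the argument. */
std::string size_as_text_cast(const std::vector<int> &v)
{
  char buf[32];

  std::snprintf(buf, sizeof(buf), "%lu", (unsigned long) v.size());
  return buf;
}
```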

    When you create vector elements, default values are placed in them. For example, string vectors start with default empty strings. Numerical values are set to zero (the C++ standard guarantees this for vectors, although not for ordinary local variables or arrays). You can specify a different value for the new elements as a second parameter to the resize() method.
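
    Here is a short sketch of how you might check those default values for yourself. Both helper names are hypothetical. The first tests whether every element of a vector has a given value, and the second builds a vector by handing a fill value to resize():

```cpp
#include <cstddef>
#include <vector>

/* Hypothetical helper: returns true if every element of v equals val. */
bool all_equal(const std::vector<int> &v, int val)
{
  std::size_t i;

  for (i = 0; i < v.size(); i++) if (v[i] != val) return false;
  return true;
}

/* Hypothetical helper: makes a vector of n ints, all equal to val, by
   passing val as the second parameter of resize(). */
std::vector<int> filled(std::size_t n, int val)
{
  std::vector<int> v;

  v.resize(n, val);
  return v;
}
```

    With these, all_equal(vector <int>(10), 0) is true, because a vector declared with a size and no fill value holds zeros, and all_equal(filled(5, 22), 22) is true as well.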

    Take a careful look at src/vec2.cpp:

    /* This program shows some subtleties of resizing vectors. */
    
    #include <cstdio>
    #include <cstdlib>
    #include <vector>
    #include <iostream>
    using namespace std;
    
    int main()
    {
      vector <int> v1;
      size_t i;
    
      /* Start with two v1.resizes, setting new elements to 22 and 33. */
    
      v1.resize(5, 22);
      v1.resize(8, 33);
    
      printf("Initial V1:            ");
      for (i = 0; i < v1.size(); i++) printf(" %d", v1[i]); 
      printf("\n");
    
      /* Chop it down to six elements. */
    
      v1.resize(6);
      printf("v1.resize(6):          ");
      for (i = 0; i < v1.size(); i++) printf(" %d", v1[i]); 
      printf("\n");
    
      /* Now resize to 10 elements, setting the new ones to 44. */
    
      v1.resize(10, 44);
      printf("v1.resize(10, 44):     ");
      for (i = 0; i < v1.size(); i++) printf(" %d", v1[i]); 
      printf("\n");
    
      /* Does this add 5 new random elements, or 5 copies of one random element? */
    
      v1.resize(15, rand());
      printf("v1.resize(15, rand()): ");
      for (i = 0; i < v1.size(); i++) printf(" %d", v1[i]); 
      printf("\n");
    
      return 0;
    }
    

    We first resize v1 to hold five elements that are 22, and then we resize v1 to hold eight elements with a default of 33. This raises a question -- will only the new elements be 33, or will all of them? Look at the output:

    UNIX> bin/vec2
    Initial V1:             22 22 22 22 22 33 33 33
    v1.resize(6):           22 22 22 22 22 33
    v1.resize(10, 44):      22 22 22 22 22 33 44 44 44 44
    v1.resize(15, rand()):  22 22 22 22 22 33 44 44 44 44 16807 16807 16807 16807 16807
    UNIX> 
    
    Only the new ones are given values. The third resize() removes two elements, and then the fourth increases the size from 6 to 10, putting the value of 44 into the new elements.

    The last resize() adds 5 new elements and gives them a default of rand(), which is a random integer. You may wonder, will that create five different random numbers, or five copies of one random number? It is the latter, because rand() is called just once, when the arguments to resize() are evaluated, and its single return value is what gets passed to resize(). (BTW, I will discuss random numbers later in class).
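
    If you actually want a different value in each new element, you have to call the function once per element. The sketch below contrasts the two approaches. To keep the behavior predictable, it uses a hypothetical stand-in for rand() named next_value(), which returns 1, 2, 3, ... on successive calls. The other two function names are made up as well, and the second one uses push_back(), which is described in the next section:

```cpp
#include <cstddef>
#include <vector>

/* Hypothetical stand-in for rand(): returns 1, 2, 3, ... on successive
   calls, so that it is easy to see how many times it was called. */
int next_value()
{
  static int counter = 0;

  counter++;
  return counter;
}

/* Grows v by n elements with a single resize().  next_value() is called
   once, before resize() runs, so all n new elements are equal. */
void grow_with_resize(std::vector<int> &v, std::size_t n)
{
  v.resize(v.size() + n, next_value());
}

/* Grows v by n elements with n calls to push_back().  next_value() is
   called once per element, so each new element gets its own value. */
void grow_with_push_back(std::vector<int> &v, std::size_t n)
{
  std::size_t i;

  for (i = 0; i < n; i++) v.push_back(next_value());
}
```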


    push_back(), reverse, tail

    Two canonical vector programs are reverse.cpp and mytail.cpp. The first is really simple -- it prints out the lines of standard input in reverse order. You need a data structure like a vector to do this, because you can't print the first line of output until you've read the last line of input, and you have to store all of those lines somewhere.

    Here's src/reverse.cpp:

    /* This program uses a vector to print standard input in reverse order. */
    
    #include <vector>
    #include <string>
    #include <iostream>
    using namespace std;
    
    int main()
    {
      vector <string> lines;
      int i;
      string s;
    
      /* Read every line of standard input into a vector called lines. */
    
      while (getline(cin, s)) lines.push_back(s);
    
      /* Now print lines in reverse order. */
    
      for (i = lines.size()-1; i >= 0; i--) cout << lines[i] << endl;
    
      return 0;
    }
    

    It makes use of the vector method push_back(), which appends an element to a vector. push_back() is guaranteed to run quickly (in constant time when averaged over many calls), and it is much more convenient than resizing the vector.
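
    One way to convince yourself that push_back() is fast is to count how often the vector has to acquire a bigger block of memory. The sketch below does that with the capacity() method, which is a standard vector method that these notes don't otherwise cover. It reports how many elements the vector has room for before it must reallocate. The function name is hypothetical, and the exact count you get depends on your compiler's library:

```cpp
#include <cstddef>
#include <vector>

/* Hypothetical experiment: calls push_back() n times on an empty vector and
   returns how many times the vector's capacity changed, which is how many
   times it had to move to a bigger block of memory. */
std::size_t count_reallocations(std::size_t n)
{
  std::vector<int> v;
  std::size_t i, reallocations, last_capacity;

  reallocations = 0;
  last_capacity = v.capacity();
  for (i = 0; i < n; i++) {
    v.push_back((int) i);
    if (v.capacity() != last_capacity) {
      reallocations++;
      last_capacity = v.capacity();
    }
  }
  return reallocations;
}
```

    On the libraries I'm aware of, the capacity grows by a constant factor each time, so 100,000 calls to push_back() cause only a few dozen reallocations rather than 100,000.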

    To show reverse running, I first call cat -n on data/input.txt. That prints data/input.txt to the screen with line numbers.

    UNIX> cat -n data/input.txt
         1	Give me a weapon of power, which no one else may hold,
         2	Defend the Gods with honor, To lead the BRAVE and BOLD
         3	
         4	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
         5	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
         6	
         7	Bequeathed to me by Odin, Molded by the Dwarfs
         8	MINE! This shimmering mallet, The Symbol of the Norse
         9	
        10	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
        11	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
    UNIX> cat -n data/input.txt | bin/reverse
        11	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
        10	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
         9	
         8	MINE! This shimmering mallet, The Symbol of the Norse
         7	Bequeathed to me by Odin, Molded by the Dwarfs
         6	
         5	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
         4	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
         3	
         2	Defend the Gods with honor, To lead the BRAVE and BOLD
         1	Give me a weapon of power, which no one else may hold,
    UNIX> 
    
    You may have noticed that I didn't use a size_t for i, but instead used an int. That's because size_t's are unsigned quantities, and therefore cannot assume negative values. If you try to use a size_t for i, then the test i >= 0 can never be false, so the loop will not stop when it should, and you'll get a compiler warning:
    reverse.cpp:19:30: warning: comparison of unsigned expression >= 0 is always true
          [-Wtautological-compare]
      for (i = lines.size()-1; i >= 0; i--) cout << lines[i] << endl;
    
    Do you think it's a good thing that when i is a size_t, it can't be set to -1? I personally don't, but no one asked me. Be aware of it, and pay attention to the warnings emitted by the compiler.
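
    If you do want to keep i as a size_t, one common idiom is to start i at the size and decrement it inside the loop test. This is a sketch of that alternative, not what reverse.cpp does, and the function name reversed() is hypothetical:

```cpp
#include <cstddef>
#include <string>
#include <vector>

/* Hypothetical helper: returns the elements of lines in reverse order.
   The loop variable is a size_t.  The test "i-- > 0" compares i to 0 and
   then decrements it, so the loop body sees size-1, size-2, ..., 0, and
   the loop stops before the body can see an out-of-range index. */
std::vector<std::string> reversed(const std::vector<std::string> &lines)
{
  std::vector<std::string> rv;
  std::size_t i;

  for (i = lines.size(); i-- > 0; ) rv.push_back(lines[i]);
  return rv;
}
```

    This also works when the vector is empty: the test fails the first time, and the body never runs.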

    The second program performs the same functionality as the tail command -- it prints out the last ten lines of standard input. We can write a simple version of tail that is like reverse.cpp. It reads all of the lines into a vector and then prints out just the last ten lines. It's in src/mytail1.cpp:

    /* This program prints out the last ten lines of a file, (or the whole file if it 
       has fewer than ten lines).  It reads all of the lines into a vector, and then
       prints out the last ten entries. */
    
    #include <vector>
    #include <string>
    #include <iostream>
    using namespace std;
    
    int main()
    {
      vector <string> lines;
      size_t i;
      string s;
    
      /* Read each line into the vector */
    
      while (getline(cin, s)) lines.push_back(s);
    
      /* Compute the first line to print */
    
      if (lines.size() < 10) {
        i = 0;
      } else {
        i = lines.size()-10;
      }
    
      /* And then print the lines. */
    
      for ( ; i < lines.size(); i++) cout << lines[i] << endl;
    
      return 0;
    }
    

    We need the if statement to handle files that are smaller than 10 lines. If we didn't have the if statement, then setting i to lines.size()-10 would not produce a negative number, because size_t's can't have negative values. Instead, the subtraction wraps around to a gigantic positive value, the test i < lines.size() fails immediately, and the program silently prints nothing. Since we've put in that if statement, there is no bug:

    UNIX> cat -n data/input.txt | bin/mytail1
         2	Defend the Gods with honor, To lead the BRAVE and BOLD
         3	
         4	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
         5	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
         6	
         7	Bequeathed to me by Odin, Molded by the Dwarfs
         8	MINE! This shimmering mallet, The Symbol of the Norse
         9	
        10	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
        11	    LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
    UNIX> head -n 4 data/input.txt | bin/mytail1
    Give me a weapon of power, which no one else may hold,
    Defend the Gods with honor, To lead the BRAVE and BOLD
    
        LIGHTNING STRIKES LIGHTNING STRIKES AGAIN
    UNIX> 
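
    To see what the if statement protects against, here is a small sketch of the index calculation on its own. Both function names are hypothetical, and k plays the role of the 10 in mytail1.cpp. Arithmetic on unsigned values like size_t wraps around, so subtracting a larger number from a smaller one gives a very large positive result, not a negative one:

```cpp
#include <cstddef>

/* Hypothetical helper: the index of the first line to print when there are
   n lines and we want the last k of them.  This is the same test as the if
   statement in mytail1.cpp, with 10 replaced by k. */
std::size_t first_to_print(std::size_t n, std::size_t k)
{
  if (n < k) return 0;
  return n - k;
}

/* What the subtraction yields without the test.  Unsigned arithmetic wraps
   around, so when n < k the result is a huge positive number. */
std::size_t unchecked_first(std::size_t n, std::size_t k)
{
  return n - k;
}
```

    With four lines of input, first_to_print(4, 10) is 0, while unchecked_first(4, 10) is far larger than 4, so a loop that runs while i < 4 would not execute even once.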
    
    I'll contend, though, that src/mytail1.cpp is not as good a program as it could be. Why? Consider what happens if you call it on a file with 1,000,000 lines. You are storing all 1,000,000 lines, but you are only printing the last ten. That's a big waste of memory!

    This problem is fixed in src/mytail2.cpp:

    /* This program also prints out the last ten lines of standard input, however unlike
       mytail1.cpp, it only stores ten lines, rather than the entire file.  You keep track 
       of the total number of lines in the variable "ln", and you simply keep overwriting
       the strings in the "lines" vector, until you get to the end of the file. */
    
    #include <cstdio>
    #include <vector>
    #include <string>
    #include <iostream>
    using namespace std;
    
    int main()
    {
      vector <string> lines;
      int i, ln;
      string s;
    
      /* Read the lines into elements 0 through 9 of the vector "lines." */
    
      ln = 0;
      while (getline(cin, s)) {
        if (ln < 10) {
          lines.push_back(s);
        } else {
          lines[ln%10] = s;
        }    
        ln++;
      }
    
      /* Set i to be (ln-10), or 0 if we haven't read ten lines. */
    
      i = ln-10;
      if (i < 0) i = 0;
    
      /* Now print out the last ten lines. */
    
      for ( ; i < ln; i++) cout << lines[i%10] << endl;
      return 0;
    }
    

    Once lines becomes ten elements long, we no longer call push_back(), but instead replace the oldest element with s. When reading is done, we have the last ten lines, but not always starting at element 0. To print them out, we need to find the vector element for each of the last ten lines. Consider line x, where the first line of the file is line 0. If it is still in the vector, it will be in element x%10. Thus, if the file has ln total lines and ln >= 10, then we want to print out lines ln-10 to ln-1, which are in elements (ln-10)%10 through (ln-1)%10. The for loop that ends the program does just that.
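
    The same idea works for any number of lines, not just ten. Here is a sketch that generalizes mytail2.cpp to keep the last k lines. The function names and the parameter k are my own for this example, and the wrapper uses an istringstream (covered later in the class) so that you can feed it a string instead of a file:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

/* Hypothetical generalization of mytail2.cpp: returns the last k lines of
   "in", oldest first, while storing at most k lines at a time. */
std::vector<std::string> last_lines(std::istream &in, std::size_t k)
{
  std::vector<std::string> lines, rv;
  std::size_t i, ln;
  std::string s;

  if (k == 0) return rv;               // Avoid "ln % 0" below.

  ln = 0;
  while (getline(in, s)) {
    if (ln < k) {
      lines.push_back(s);
    } else {
      lines[ln%k] = s;                 // Overwrite the oldest stored line.
    }
    ln++;
  }

  i = (ln < k) ? 0 : ln-k;             // First line number to return.
  for ( ; i < ln; i++) rv.push_back(lines[i%k]);
  return rv;
}

/* Convenience wrapper for trying the function without a file: treats
   text as the input. */
std::vector<std::string> last_lines_of_text(const std::string &text, std::size_t k)
{
  std::istringstream in(text);

  return last_lines(in, k);
}
```

    For example, with the five lines "1" through "5" as input and k equal to 3, the vector ends up holding "4", "5", "3" in elements 0, 1 and 2, and the function returns "3", "4", "5" in that order.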