Finding out if I've Improved as a Programmer: Part 2

I've been looking into whether or not I've improved as a programmer by dusting off an old shell program I wrote for a college course and seeing if I can make it any better. The exercise has been eye-opening so far. I really have come a long way on my programmer's journey, with much longer to go, I'm sure. To quickly recap, all of the refactorings I'm doing are going up on my GitHub repo so you can follow along with the changes, and I identified these general issues with the code:
  • It doesn't compile
  • Minimal tests
  • Inconsistent indenting and formatting
  • One long function in main()
  • Poor structure and organization
  • Major memory management issues
In the first article, I tackled the first three items, so I now have the last three—breaking up main(), giving the program better structure, and improving memory management—to address before I'll be happy with the revision. Let's jump right in, shall we?

Argument Processing


When I wrote this program, the first thing I did in main() was declare a bunch of variables:
  // variable initialization
  struct command_t *command; // Holds parsed command
  char *commandLine;         // Holds command before being parsed
  char *shellPath;           // Holds SHELLPATH environment variable
  char *commandPath[1024];   // Holds the tokenized paths
  char *tempCmd;             // Holds a path to be cat with command and exec'd
  int LENGTH = 513;          // Maximum length of command
  FILE *inputSrc = stdin;    // Changes to batch file name in batch mode
  char *checkEOF;            // Set to NULL if Ctrl-D is pressed
  int error = -1;            // Used to find valid path to command
  bool echoLine = 0;         // If in batch mode and VERBOSE is set, echoLine=1
  bool execute = 1; // Set to false if command should no longer be executed
  static bool TRUE = 1;
  static char PROMPT[] = "koblensk> "; // Prompt to appear for user
This used to be a common way of writing C programs because C used to require that all local variables be declared at the beginning of a function. The problem with this structure is that most of these variable declarations are nowhere near where they are used, so even though they are well-commented, that doesn't help when you're looking at a variable that's used way down in a long function where the explanation isn't anywhere nearby. I compounded this problem by having a large number of long-lived variables in one long function. As we work through the rest of main() extracting functions, we'll pull these declarations down to their first use, and many of them will likely end up in the extracted functions. It will be much better in the end, but we're not going to do it all at once. It's easier to take care of the variables as we go along.

After these declarations, we have a comment that marks the following section as initialization, and another comment that describes the immediately following block as processing the program's arguments:
  // Shell Initialization

  // Checks whether to run in batch mode and echo commands
  if (argc == 1) {

  } else if (argc == 2) {
    inputSrc = fopen(argv[1], "r");
    if (inputSrc == NULL) {
      puts("\nFile not found.");
      puts("Check your file name and path.\n");
      exit(0);
    } // if
    if (getenv("VERBOSE") != NULL) {
      echoLine = 1;
    } // if

  } else {
    puts("\nToo many arguments for shell command.");
    puts("To run in batch mode use: shell .\n");
    exit(0);
  } // if - else if - else
A few things could be improved here. First, the comments are definitely calling out for some function definitions. Whenever you see comments describing sections of code, there's a good chance that functions can be extracted from that code to make it more understandable. We'll get to that soon. Second, the comments at the end of the if and else blocks are totally unnecessary. These are small if-else statements, and the comments just add line noise without adding any value. If an if block is so long that it would require these types of comments to sort it out, then it should be refactored to make it shorter. Last, this code isn't doing one thing; it's doing two things. One thing is obviously setting the mode of the shell based on the number of arguments, but then it's also checking for a VERBOSE environment variable to set a flag that marks whether the shell should echo input commands or not. This variable echoLine is only used once, and it would be much better served as a function, so let's extract it:
bool echo_command_from(const FILE *input_src) {
  if (input_src != stdin && getenv("VERBOSE") != NULL) {
    return true;
  }
}
Then, when we replace the usage with the new function:
    if (echo_command_from(inputSrc)) {
      printf("%s\n", commandLine);
    } // if
It looks like we can pull the print statement into the function and make the function perform the task of echoing commands instead of only checking if it should:
void echo_command_from(const FILE *input_src, const char *command_line) {
  if (input_src != stdin && getenv("VERBOSE") != NULL) {
    printf("%s\n", command_line);
  }
}
That simplifies the call site to this:
    echo_command_from(inputSrc, commandLine);
So, we've eliminated a variable and simplified some code. Let's keep going with the rest of the argument count checking. We can pull the whole if-else statement into a function like this:
FILE *get_input_source(int argc, const char *argv[]) {
  if (argc == 1) {
    return stdin;
  } else if (argc == 2) {
    FILE *input_src = fopen(argv[1], "r");
    if (input_src == NULL) {
      puts("\nFile not found.");
      puts("Check your file name and path.\n");
      exit(EXIT_FAILURE);
    }
    return input_src;
  } else {
    puts("\nToo many arguments for shell command.");
    puts("To run in batch mode use: shell .\n");
    exit(EXIT_FAILURE);
  }
}
Then the input_src variable can be assigned directly from this function:
  FILE *input_src = get_input_source(argc, argv);
Notice how I'm methodically changing camelCase names to snake_case? I find them easier to read because the '_' acts as a space. Since this is a small, self-contained program, I'm going all out on the refactorings and changing variable names with impunity. I also changed the arguments for exit() to be the proper named values instead of literals to make it more readable. The rest of the function seems fine the way it is. Using an else-if chain is probably unnecessary since each block either returns or exits, but I don't have a real preference here and the result is equivalent.

The next block of code gets the shell path from an environment variable:
  // Get SHELLPATH
  if (getenv("SHELLPATH") == NULL) {
    shellPath = NULL;

  } else {
    shellPath =
        (char *)malloc((strlen(getenv("SHELLPATH")) + 1) * sizeof(char));
    strcpy(shellPath, getenv("SHELLPATH"));
  } // else
This, too, can be extracted into a function, but upon further inspection of where shellPath is used, it can actually be replaced with calls to getenv("SHELLPATH"), and the above code can simply be removed. One of the calls is for a use in strtok(), which is a little dangerous because strtok() modifies the input string that's passed to it, but we're not keeping a reference to the shell path string anymore so it should be fine. With these changes, it's probably time to do a commit before moving on.

Reassessing the Tests


Before we continue on with the refactoring, we should address some issues with testing as it's currently done. First, I'm using the command "/bin/ls -a" in the batch file. Because I use Vim, and Vim creates some temporary files in the working directory, the output of ls changes sometimes. This causes the diff to be wrong when it shouldn't. To remedy that problem, I created a simple subdirectory with three files to create some static test output for ls:

test/
  |
  +-> a.txt
  +-> b.txt
  +-> c.txt

Then I can change the first line of batch.txt to:
/bin/ls -a test
The golden_output.txt file needs to be regenerated for the new program output, and since we just committed code changes, now is a good time to tweak the tests. We should try to avoid making code changes and test changes in the same commit, unless the code changes require test changes, as will happen later. Even in the case where both changes are required, we want to minimize code changes when changing tests to prevent bugs from sneaking in as much as possible.

The next thing we should take care of is a test hole that you may have noticed as we were changing code dealing with the VERBOSE environment variable. We never test this environment variable, but we should. To do that, we need to set the variable to something and then run the shell again so that it will print out the commands in the batch file:
$ VERBOSE="true" ./shell batch.txt > golden_output_verbose.txt
Now we can compare against verbose output, but our test runs are getting a bit cumbersome. We need to run four fairly long commands for a test run, plus the compile command. Let's make a short Makefile to simplify our lives:
TGT= shell
OBJECTS=
CFLAGS= -march=native -pipe -std=c11
CXXFLAGS= -march=native -pipe -std=c++11
CC= gcc
CXX= g++
unexport VERBOSE

all: ${TGT}

${TGT}:

.PHONY: clean test

clean:
 $(RM) $(TGT) $(TGT).out

test: ${TGT}
 SHELLPATH="/usr/bin:/bin" ./shell batch.txt > output.txt
 diff output.txt golden_output.txt
 SHELLPATH="/usr/bin:/bin" VERBOSE="true" ./shell batch.txt > output.txt
 diff output.txt golden_output_verbose.txt
Now we can run the tests by typing make test, and we're on our merry way. These kinds of development environment improvements are all part of the game in programming. Itches need to be scratched, pain points need to be relieved, and tests need to run with minimal effort. It theoretically saves time in the end, or at least gives us the satisfaction of control. I'm not sure which. Anyway, time for another commit before getting back to refactoring.

Parsing SHELLPATH


The next block of code is where the shell path is split into an array of paths that the program can use to search for where input commands live in the file system:
  // Find the full pathname for the file and execute command
  int pathCount = 0;

  if (getenv("SHELLPATH") != NULL) {

    commandPath[0] = (char *)malloc(LENGTH * sizeof(char));
    char *pathTmp = strtok(getenv("SHELLPATH"), ":");
    if (pathTmp) {
      strcpy(commandPath[0], pathTmp);
    } else {
      commandPath[0] = NULL;
      printf("OOPS\n");
      exit(1);
    }

    while (commandPath[pathCount] != NULL) {
      pathCount++;

      char *sTmp = strtok(NULL, ":");
      if (sTmp) {
        commandPath[pathCount] = (char *)malloc(256 * sizeof(char));
        strcpy(commandPath[pathCount], sTmp);
      } else {
        commandPath[pathCount] = NULL;
      }
    } // while

    // temp = strlen(commandPath[pathCount-1]) - 1;
    // commandPath[pathCount-1][temp] = '\0';
    commandPath[pathCount] = NULL;
  } // if
This code can also be extracted neatly into a function, but we're going to do some more work on it as well. The first thing we should do is fix the commandPath declaration. Currently it's declared as a 1024-element array of strings, but we can easily figure out exactly how big we need this array to be instead of statically allocating a fixed number of elements. I probably preferred the static allocation when I first wrote this program because I was afraid of dynamically allocated memory in C. To be sure, it can be dangerous if you're not careful, and I've wasted many hours debugging segmentation faults because of malloc(). But fixed-size arrays have their own issues, and dynamically allocated memory should be used when it's called for. To do the path counting we need to count how many colons are in the shell path with a simple function:
int count_char(const char *str, char ch) {
  int count = 0;
  for (; *str; str++) {
    if (*str == ch) count++;
  }
  return count;
}
Before calling this function in the function we're extracting to parse the shell path, we want to check if the shell path is empty because an empty shell path means zero paths, but finding zero colons means there's one shell path. If the shell path is empty, we can return NULL for the command_paths array. (The variable was changed to be plural because it's an array, and again, I prefer snake_case.) Otherwise, we can count up the colons, add one to get the number of command paths, and allocate memory for enough command paths plus one, leaving room for a NULL terminator in the array:
char **parse_shell_path() {
  char *shell_path = getenv("SHELLPATH");
  if (shell_path == NULL) return NULL;

  int path_count = count_char(shell_path, ':') + 1;
  char **command_paths = (char **)malloc((path_count + 1) * sizeof(char *));
  if (command_paths == NULL) {
    puts("ERROR: Out of memory in parse_shell_path() allocating command_paths");
    exit(EXIT_FAILURE);
  }
The tactic of exiting early from a function is one I like to use to avoid nested if statements. This practice makes functions more linear and easier to read. The error checking and special cases are at the top, and the main job of the function comes at the end without being pushed off the right end of the screen with indents. This preference does not invalidate my indifference to else-if chains as described above, because in that case, else-if chains are actually laid out linearly. In this case with error checking, if we had multiple error checks, multiple if statements would be nested inside each other, resulting in more levels of indentation.

The rest of the code in this function used to be a convoluted way of using strtok() to grab the paths out of shell_path. It used a preamble and a while loop with separate variables for pointing to the tokenized strings, and it had inconsistent error handling between the two parts. It looks like I didn't have a solid understanding of how strtok() worked because the code is kind of flailing around, trying to get the right stuff out of the string. My refactoring of this code uses a single for loop to get the job done, and it error checks the result of the malloc() as well:
  int i = 0;
  for (char *path = strtok(getenv("SHELLPATH"), ":"); path != NULL; path = strtok(NULL, ":"), i++) {
    command_paths[i] = (char *)malloc(strlen(path) * sizeof(char));
    if (command_paths[i] == NULL) {
      printf("ERROR: Out of memory in parse_shell_path() allocating command_paths[%d]\n", i);
      exit(EXIT_FAILURE);
    }
    strcpy(command_paths[i], path);
  }

  command_paths[i] = NULL;

  return command_paths;
}
The function wraps up by making sure the last element of command_paths is NULL, and returns it. Make sure you return the right values from functions! I initially forgot the final return statement, and was thoroughly confused when my tests started failing. It took a couple minutes to figure out what was going on because I thought I had messed up the character counting function. It turns out that was all correct, and what was actually happening was the command_paths in main() was always getting assigned to NULL. This is why we write tests.

Initialization is Done


That wraps up the initialization code. It has been dramatically simplified to this:
  // variable initialization
  struct command_t *command; // Holds parsed command
  char *commandLine;         // Holds command before being parsed
  char *tempCmd;             // Holds a path to be cat with command and exec'd
  int LENGTH = 513;          // Maximum length of command
  char *checkEOF;            // Set to NULL if Ctrl-D is pressed
  int error = -1;            // Used to find valid path to command
  bool execute = 1; // Set to false if command should no longer be executed
  static bool TRUE = 1;
  static char PROMPT[] = "koblensk> "; // Prompt to appear for user

  FILE *input_src = get_input_source(argc, argv);

  char **command_paths = parse_shell_path();
We removed a couple variables and extracted the work of initialization into functions that make initializing input_src and command_paths a snap. Most of the rest of the variables are going to fall out once we make it through the main loop, but that's a task for next time. The last changes made go into one more commit, and then we'll wait to see how the rest of the code refactors next time.

No comments:

Post a Comment