Sunday, October 31, 2010

get a line or two

Here is a function in the example program:

int getline(char s[], int lim)
{
int c, i;

i = 0;

while (--lim > 0 && (c=getchar()) != EOF && c != '\n')
s[i++] = c;

if (c == '\n')
s[i++] = c;

s[i] = '\0';
return i;
}

Unfortunately the indentations are stripped out, but that's ok.

In the whole program there is a function for getting a line, and it's called inside a function for reading lines. Yo dawg...

It's important to know that the readlines function keeps calling getline() until it returns 0, which is how it knows it's read as much as it can. If getchar() hits EOF right away, getline() skips over the parts where i gets incremented, so i is still 0 when it gets returned.

So getchar() is a function that has pissed me off since day one because it's so poorly elaborated on in any text I've found. It also has the peculiar property of being obviously Get Character and yet it returns values to an int (in the K&R examples, anyways). I believe that this is because of how throwing the returned value into a char would be interpreted once EOF is hit. I think it's just safer to use int... I'm just not sure why right now (although I recall reading this). Something about how EOF isn't the same in all operating systems, or how negative numbers are interpreted when they are char.
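Here is a minimal sketch of the collision that paragraph is groping toward. Both helper names are made up for illustration, and the char version assumes a platform where EOF is -1 and converting 255 to signed char gives -1, which is typical but not guaranteed.

```c
#include <stdio.h>

/* Hypothetical helper: compare a getchar()-style result held in an int.
   A real byte arrives as an unsigned char value (0..255 with 8-bit
   chars), so it can never equal the negative EOF. */
int is_eof_as_int(int c)
{
    return c == EOF;
}

/* Hypothetical helper: the same comparison after squeezing the value
   into a signed char first. Assumption: 255 narrows to -1 here, which
   is common but implementation-defined. */
int is_eof_as_char(int c)
{
    signed char narrowed = (signed char)c;
    return narrowed == EOF;
}
```

Under those assumptions, is_eof_as_int(0xFF) is 0 but is_eof_as_char(0xFF) is 1, so a legitimate 0xFF byte in the input would end the loop early. If char happens to be unsigned instead, the opposite problem shows up: the stored value can never compare equal to EOF and the loop never ends.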

Anyways, getline() is passed a character array (more on this at the end) and a number from the readlines() function. The integer lim is supposed to be how long of a line we're allowing. If the line gets too long (or getchar() pulls in EOF or a newline) our while loop dies and AT THE LEAST s[0] is a string terminator.

I'm getting ahead of myself.

We have this loop, and this loop essentially starts slurping in characters from the input stream. When you execute the program, that benign-looking getchar() buried in this while condition inside of a function inside of a function makes the program just sit there with a blank terminal. You have to enter something (sentences) and hit enter (newline) a few times to get the ball rolling (since the program is about sorting sentences by length). In Windows, you send EOF by pressing ctrl-z and then hitting enter.

Alright, so getchar() LOOKS like it's getting whole character strings based on how you're entering them in the console, but it's not. That threw me for a while. It's only getting one character at a time. What I'm not certain about (again...not well documented anywhere) is if getchar() sees characters as I'm entering them, or only after I hit enter. I believe it's only after I hit enter, since I have to hit enter after sending EOF with ctrl-z.

So with that knowledge...

I type in a sentence and hit enter. getchar() starts churning through the input one character at a time and passes the value (ASCII value?) as an integer to c. Every character that isn't EOF or a newline gets placed in an array s[]. A character array. Full of ints? Woof.
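On the "full of ints?" worry, a small sketch (the function name is invented for the example): the assignment converts the int down to a char on the way in, so the array really does hold chars.

```c
/* Hypothetical helper: store an int character code into a char slot,
   the way s[i] = c does, and read it back. */
int round_trip(int c)
{
    char slot[1];
    slot[0] = c;        /* int is converted to char by the assignment */
    return slot[0];     /* char is converted back to int on return */
}
```

round_trip('A') gives back 'A', and on an ASCII system that value is 65.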

This is one of those fun tidbits I didn't know before I started this book, but there is a difference between i++ and ++i. i++ evaluates i and THEN increments it. ++i increments first and then evaluates i. So when we're saying:

s[i++] = c;

we're saying TWO things. First, pass c into s[i], and then increment i. That was confusing to me for a while because when I was first learning you would make that two statements:

s[i] = c;
i = i + 1;
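A sketch of the difference, with made-up function names, in case it helps to see both forms side by side:

```c
/* Post-increment: s[i++] = c stores at the old i, then bumps i. */
int store_post(char s[], int i, char c)
{
    s[i++] = c;
    return i;           /* index of the next free slot */
}

/* Pre-increment: s[++i] = c bumps i first, then stores at the new i. */
int store_pre(char s[], int i, char c)
{
    s[++i] = c;
    return i;           /* index that was just written */
}
```

Starting from i = 0, store_post writes s[0] while store_pre writes s[1] and leaves s[0] untouched; both return 1.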

Moving on....

So we're throwing c into s until we reach our limit or c is EOF or newline. Let's say c is EOF.

If c is EOF, the while loop ends, and the next if statement is skipped over (since c isn't a newline) and then s[i] becomes the string-terminating zero. Again, remember that back in the while loop we passed c to s[i] and THEN incremented i, so once we bail out of our while loop we're already at the next i for placing our terminating zero. Cool beans.

If c is newline, then we go to the next if statement and I suppose check again if it's newline (I wonder if there is a better way to do that) and pass that newline into s[i], and then go to the next statement which adds the terminating zero.

Then the function returns how long the string is.
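To make the walkthrough concrete, here is a sketch of the same loop reading from any FILE * instead of always stdin, so it can be fed known input. The name get_line_from and the extra in parameter are assumptions for this example (the plain name getline can also clash with a POSIX function of the same name declared in <stdio.h> on some systems), and c is initialized so the newline check is safe even when lim is too small for the loop to run.

```c
#include <stdio.h>

/* Read one line from in into s, keeping the newline if there is room.
   Stores at most lim-1 characters plus the terminating '\0'.
   Returns the length of the string, or 0 at end of input.
   Assumes lim >= 1 so there is room for the '\0'. */
int get_line_from(FILE *in, char s[], int lim)
{
    int c = 0;   /* int, not char, so EOF stays distinguishable */
    int i = 0;

    while (--lim > 0 && (c = getc(in)) != EOF && c != '\n')
        s[i++] = c;

    if (c == '\n')
        s[i++] = c;

    s[i] = '\0';
    return i;
}
```

Fed the input "hi\nyo" with an 8-character buffer, successive calls return 3 (with s holding "hi\n"), then 2 (with s holding "yo"), then 0. Calling get_line_from(stdin, line, MAXLEN) should behave like the original.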

So. This function takes lines (defined as a string ending with a newline) from the input stream and puts them into an array with a proper terminating zero and then returns how long that string is. What I don't understand yet is why. Do those strings get kept? Sometimes pass by value and pass by reference still confuse me. Does my newly formed string s[] not get destroyed when the function is done because I passed where it's being stored by reference? At first glance it doesn't seem so, but I might be wrong. In the function calling this getline() function there is a character array:

char line[MAXLEN]

and getline() is called by:

getline(line, MAXLEN)

so does getline actually alter the char variable line[], or is it only working on s[]?

Tricky tricky. I could quickly write a program to test what happens (is line passed its value or reference), but I'm tired and I need to go to bed.
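For what it's worth, the quick test could look something like this sketch. fill_line is a hypothetical stand-in for getline() that just writes a fixed string through its array parameter.

```c
#include <string.h>

/* Hypothetical stand-in for getline(): writes into whatever array it is
   handed. s is a pointer to the caller's first element, not a copy of
   the array. Assumes lim >= 1. */
void fill_line(char s[], int lim)
{
    strncpy(s, "changed", lim - 1);
    s[lim - 1] = '\0';
}
```

If a caller declares char line[16] = "original"; and calls fill_line(line, 16), line holds "changed" afterwards, which would mean getline() really is working on line[] and not on a private copy.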

1 comment:

  1. Yep, the argument 'char s[]' is just another way of passing 'line[]' by reference into the getline function. When you modify 's[]' in getline, you are directly modifying the data in 'line[]'. That is to say, s==line and s[0]==line[0], (...).

    The reason getchar() returns 'int' is a little hackish, but pretty common in C. The function is supposed to return a 'char' from input AND let you know if there's no input left. But, it only has one return value to convey this information. Since there are 256 possible values an 8-bit 'char' can take and since one unique value is required to signify EOF, you have 257 unique return values. An 'int' type can hold this data just fine, so it was chosen instead of 'char'.

    Choosing 'int' is very seldom an arbitrary choice. Since 'int' is (by convention) the word size of whatever architecture you are programming for, it is normally the most efficient value for the processor to process. You'll see this a lot in C when the 'logical' return type is something smaller than 'int'.

    In cases where the return data is larger than 'int' (in this example, a string), you see the returned value(s) passed via memory (the string storage that starts at address '&line[0]') using a supplied reference (s[]==line[]).

    Finally, your hunch that something is destroyed upon leaving getline() is correct. 's[]', which is getline()'s local pointer copy of 'line[]', is deallocated when you leave getline(). However, since it's only a pointer, the data it once pointed to is not affected by its demise.
