Hey, Bob! I have a persistent difficulty with temporary variable names. What should I do??

I have a problem that comes up a lot in programming, and I’m sure the experts can help me out. It has to do with giving names to objects that I’m working with. Sometimes the naming is natural; for example, if I have a variable for sex in survey data, I can define “male” as 1 for men and 0 for women and then continue with names such as “male00” for data in the 2000 survey, “male04” for the 2004 survey, and so forth, or else use a consistent set of names and define things within dataframes. In either case, this is no problem.

What is more of a hassle is coming up with names for temporary variables. For example, should I use “i” for every looping index, with “j” for the nested loops, “k” for the next level, and so forth? This can get confusing when referring to array indexes (sometimes you end up needing things like a[j,i]). Or should I go for mnemonics, such as “i” to index survey respondents, “s” for states, “e” for ethnic groups, “r” for religious denominations, etc.? Sometimes I’ve tried to reserve “j” for the lowest level of poststratification cell, but that isn’t always convenient.

At other times I’ve tried more descriptive names, for example n.state for the number of states (50 or 51, depending on whether D.C. is included) and i.state for state indexes, thus giving loops such as “for (i.state in 1:n.state)” or “for (i.state in states),” where “states” has been pre-defined as the vector of state numbers. This approach is currently my favorite–I tried to stick to something like it in my book with Jennifer–but can also create its own difficulties, as I have to remember the names of all the indexing variables.
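A minimal sketch of that naming scheme, using made-up numbers for three states (the vector `state.pop` and the quantity `share` are hypothetical and only there to give the loop something to do):

```r
# Made-up populations (in millions) for three states; purely illustrative
state.pop <- c(AL = 5.0, AK = 0.7, AZ = 7.2)

n.state <- length(state.pop)   # number of states
share <- numeric(n.state)      # each state's share of the total

for (i.state in 1:n.state) {
  share[i.state] <- state.pop[i.state] / sum(state.pop)
}
```

The index name says what is being looped over, at the cost of one more name to remember.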

Beyond looping indexes, there are all sorts of temporary variables I’m creating all the time; for example, after fitting a multilevel model, pulling out coefficient estimates: “fix <- fixef (M2000.all).” There's always a tension in naming these temporary quantities: on one hand, I don't want meaningless names such as "temp" floating around; on the other, every new variable name is another thing to remember, which is annoying if it's only being used in two lines of the program. I guess the real solution is to ruthlessly compartmentalize: in R, this essentially means that you turn every paragraph of code into a function and get rid of all globally-defined variables. I haven't always had the discipline to do this, but maybe I should try.
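Here is a rough sketch of what that compartmentalizing could look like. The function `summarize_fit` is hypothetical, and a plain named vector with invented values stands in for the output of fixef() so that the example runs without a fitted model:

```r
# Hypothetical helper: the temporaries "fix" and "biggest" exist only
# while the function is running and never reach the global workspace.
summarize_fit <- function(estimates) {
  fix <- estimates                  # temporary copy of the coefficients
  biggest <- which.max(abs(fix))    # position of the largest coefficient
  list(name = names(fix)[biggest], value = unname(fix[biggest]))
}

# Invented coefficient values standing in for fixef(M2000.all)
coefs <- c(intercept = 0.2, male = -1.5, age = 0.7)
result <- summarize_fit(coefs)
```

After the call, only `coefs`, `result`, and the function itself are left to keep track of.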

17 thoughts on “Hey, Bob! I have a persistent difficulty with temporary variable names. What should I do??”

  1. I've always liked to name temporary variables with a "_" preceding them, like "_j", or "_age", etc. You can then get rid of them by deleting everything that starts with "_". If you know another language, you have doubled your variable names. Sometimes "age" is the variable in the dataset and "edad" is the helper. In words that are the same, the plural isn't — the variable "hospital" transforms the helper into "hospitales". Other ways: a counter always starts with "ct_" like "ct_groupid" so, again, you can delete all the counters since they start with "ct_". And you know what is the role of a variable that starts with "ct_". Indexes are "i_" and so on.
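    In R, a name cannot begin with an underscore unless it is wrapped in backquotes, so this sketch of the cleanup trick uses the "ct_" and "i_" prefixes instead; the object names and values are invented for illustration:

    ```r
    ct_groupid <- 0        # a counter
    i_row <- 1             # an index
    age <- c(31, 45, 28)   # real data, no prefix

    # Remove every object whose name starts with "ct_" or "i_"
    rm(list = ls(pattern = "^(ct_|i_)"))
    ```

    Only `age` survives the cleanup.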
    Hope this helps. I appreciate when you help others.

  2. Writing code is writing for two very different audiences. The first audience is the compiler (or interpreter, as the case may be). The second is made up of all future programmers who have to read your code (including the future you).

    If you're having trouble remembering variable names, then your intuition about compartmentalizing your code is exactly correct. The other important tool is to adopt a consistent set of coding standards. I would recommend looking for a published set specific to your main language. I'm not an R programmer, so I don't know if any exist for it.

    The key here is to internalize a set of practices that allow you to stay in the flow of programming. Having to stop and think about variable names leads to bugs.

  3. Yeah, if you're really overflowing with variables you definitely need to start using functions (and a more functional style, where there are usually no side effects, just arguments and return values).

    Some people try to use lines of code to gauge how long a typical function should be, but a more useful measure is cyclomatic complexity. There's no need to go into the details of what that means, but basically, any one function should only have a few 'decision points', like ifs, elses, et cetera. Ten decision points is getting into the realm of way too complex, and typically functions should only have four or five at most. These are just guidelines, of course. Another principle is that functions should encapsulate single aspects of functionality — for instance, if you feel you need to put an "and" in a function name to be accurate, it probably shouldn't be one function.

  4. Ruthless compartmentalization is almost surely the way to go. Apart from the decrease in cognitive load from tracking all your variables, a nice side-effect is that the higher-level logic starts to look more and more like pseudo-code, describing intentions rather than low-level details, e.g.

    <pre>
    for i=1:N
        do_that_thing(x,y,z)
        then_this(x)
        dont_forget_that_fiddly_bit(x,z)
    </pre>

    If it makes it any easier, the initial need for discipline can soon give way to a mild fetish that brings a great sense of satisfaction!

  5. Subscripts are the worst because anything other than single letters creates a lot of typing. There's heck to pay if you don't have some standards within your own work, such as i=individual, t=time, h=household, b=brand or product, f=county (FIPS) etc.

    My advice: stay away from "s" for states. I've often used "s" for the store subscript, but "s" has too many other uses and I've generally regretted using it in any subscript context.

  6. IMHO as an experienced programmer, the advice given above is very good. Except maybe the use of non-English variable names for open source programs or other programs with international development teams. I don't know what the R language coding style standard is, but in Python standard library identifiers "SHOULD use English words wherever feasible". The coding standard also states that for "Python coders from non-English speaking countries: please write your comments in English, unless you are 120% sure that the code will never be read by people who don't speak your language."

    But also keep in mind the #1 Python coding style guide rule, and applicable for all programming languages: "A Foolish Consistency is the Hobgoblin of Little Minds".

  7. In my code, I distinguish the use of my temporary variables.

    In general, any variable that is part of an interface gets a fully descriptive English name.

    Any variable that represents things from your data model, and is delivered by reference [local changes affect value in the global scope] also gets a fully descriptive name.

    Any local copies of things from the data model get a leading underscore, and a briefly descriptive name.

    Finally, anything that I use for convenience I give a leading underscore and don't really pay attention to what I call it, otherwise. Loop indices are a good example of this. If you were going to translate a loop to English, the semantic value of the index name is entirely context dependent, so the name itself doesn't do much for you.

  8. Andrew, you answered your own question w.r.t. modularity. You may also be able to answer many of your future questions by drawing an analogy between writing understandable code and writing understandable math. You should be able to apply your formidable skill in the latter to the former.

    William (of?) Ockham has the right perspective here on your two audiences. The compiler doesn't care what you name variables. But for another person to read it, even if that's yourself a few hours or days later, it helps to make it all presentable.

    Chris is getting at what's known as "literate programming". It's a kind of self documenting code because the names of routines are meaningful. Ultimately, you won't need any documentation in the code itself, just on the public functions you expose.

    Literate programming provides two huge reductions in cognitive load, which is what good programming practice is all about. First is what Chris mentioned — it lets you read the top-level code like pseudo-code. The other just as big benefit is that it lets you independently test and develop the modules.

    As Russel Dohan points out, when the pieces are more bite-sized, debugging is much easier. A good measure of complexity is nesting: highly nested code with big blocks inside the loops is hard to debug. The point is that complexity grows superlinearly in the size of your function.

    It's absolutely crucial to do top-down design and bottom-up development. Always build on top of pieces you trust.

    The only thing you need to do with variable names is keep them consistent. That is, if you index a matrix x[i,j], don't later use x[j,i]. I don't know if R has conventions/idioms, but if it does, follow them. This is absolutely not the place to innovate.

    David, in R, the functions are called with named arguments, as in "mean=x, cov=Sigma,…", so you definitely get named arguments. Variable names that are too long become hard to read. I disagree that variable names in loops don't matter: their identity doesn't matter much, but they should be consistent. As ZBicyclist pointed out, if you use x[i,j] at one point, you shouldn't switch to x[j,i] later. Usually in R, you have the convention of I and J being the dimensions and i,j being the indexing variables, so this all tends to be pretty consistent, e.g. for (i in 1:I) { for (j in 1:J) { … } }.
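    Written out in full R syntax, that convention might look like the sketch below; the matrix contents are arbitrary and chosen only so that each cell shows which (i, j) pair filled it:

    ```r
    I <- 3   # number of rows
    J <- 2   # number of columns
    x <- matrix(0, nrow = I, ncol = J)

    for (i in 1:I) {
      for (j in 1:J) {
        x[i, j] <- 10 * i + j   # row index always first, column index second
      }
    }
    ```

    Because i always means "row" and j always means "column", an accidental x[j, i] stands out when reading the code.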

    If you declare variables with local scope (that is, in the block where they're used, just before they're used), you don't have to scan over lots of code to find things. IDEs can help, but I don't know if there are any for R. Also, auto-completion in packages like emacs is a huge time-saver when you have long names.

  9. If we are talking R, lists and environments can go a long way to keeping variables neat and tidy.

    Lists many R programmers are familiar with, but environments tend to get skipped over. assign() takes a character string, a value, and optionally, an environment in which to make the assignment.

    > assign("a", 4)

    > a

    [1] 4

    assigns a value of 4 to the variable a in the global environment.

    > blog.environment <- new.env()

    > assign("b", 10, env=blog.environment)

    assigns 10 to b in the blog environment. Note that b is not available globally.

    > b

    Error: object "b" not found

    but can be accessed in two ways

    > blog.environment$b

    [1] 10

    > get("b", env=blog.environment)

    [1] 10

    Since assign() and get() take character strings, you can programmatically assign values to arbitrarily named variables.
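    As a small sketch of that last point, the loop below builds names of the "male00" and "male04" kind and stores a placeholder under each one; the environment name `survey.env` and the empty placeholder values are assumptions made for the example:

    ```r
    survey.env <- new.env()

    for (year in c("00", "04")) {
      # Build the name as a string, then store an empty placeholder under it
      assign(paste0("male", year), numeric(0), envir = survey.env)
    }

    ls(survey.env)   # names of the objects held in the environment
    ```

    None of these names appear in the global workspace; they live only inside `survey.env`.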

    I'll second the other useful comments, particularly concerning functional programming to compartmentalize code (make a function work only with its inputs without relying on global variables). Functional programming styles also help debugging.

  10. vaguely related: whenever i'm filling up a matrix with junk "data" i'm going to replace with real data later, i'll fill it with 77s or such. numbers i know are unlikely to pop up in ordinary usage, and if i see them in my output, i know something has gone wrong. if i created a blank matrix filled with zeros, and i see zeros at the end, i won't immediately know if i'm actually getting zeros in the output, or if my code is borked.
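    A minimal sketch of the sentinel idea, with a deliberate off-by-one bug so that one row never gets filled in (the names `est` and `unfilled` are made up for the example):

    ```r
    n.state <- 5
    est <- matrix(77, nrow = n.state, ncol = 2)   # 77 means "not filled in yet"

    for (i.state in 1:(n.state - 1)) {            # bug: the last state is skipped
      est[i.state, ] <- c(0, i.state / 10)        # real results can include zeros
    }

    # Any 77 still present marks a cell the code never touched
    unfilled <- which(est == 77, arr.ind = TRUE)
    ```

    Here `unfilled` points at row 5, whereas a zero-filled matrix would have hidden the bug behind the legitimate zeros in column 1. Filling with NA is another common way to get the same effect in R.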

  11. Not a direct answer … but matrix operations in R are pretty efficient, and often more efficient than loops. So looking for opportunities to use matrix ops can help in reducing loops and therefore array subscripts – and improve speeds.
    This link is for Scilab (bit like Matlab), but shows the approach.
    http://www.di.ens.fr/~brette/Scilab/efficientscil
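    The same idea in R, as a rough sketch with simulated numbers: a double loop that computes each row's weighted sum, next to the single matrix product that replaces it (the names `X`, `beta`, `yhat.loop`, and `yhat.mat` are invented for the example):

    ```r
    set.seed(1)
    X <- matrix(rnorm(20), nrow = 5, ncol = 4)   # simulated data
    beta <- c(1, -2, 0.5, 3)

    # Loop version: two indexes to keep straight
    yhat.loop <- numeric(nrow(X))
    for (i in 1:nrow(X)) {
      for (j in 1:ncol(X)) {
        yhat.loop[i] <- yhat.loop[i] + X[i, j] * beta[j]
      }
    }

    # Matrix version: no indexes at all
    yhat.mat <- as.vector(X %*% beta)

    all.equal(yhat.loop, yhat.mat)
    ```

    Keeping the loop version around as a check on the matrix version is one way to get the speed without losing confidence in the result.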

  12. I used to try to use matrix operations in R, but in recent years I've been doing more looping. The trouble for me is that I'm much more likely to make a mistake when I work with matrices. When I explicitly loop, I'm clearer on what's actually going on.

  13. I used to use i,j,k, etc until it started leading to problems. The goal of encapsulation so that this isn't an issue is admirable, but, in my experience writing research code, unrealistic. Proper encapsulation often requires more engineering than I'm willing to put into a piece of code. Plus, it can obfuscate important optimizations in code which can be vectorized. For the work I do, these performance differences can be the difference between results in a few minutes and results in a few days or weeks.

    My solution is to use relevant words (or abbreviations) followed by a capital I, as in Index. E.G., if I have an array called people, then I might index it with personI, persI or pI if I'm feeling really lazy. For a matrix I might index rows/columns with rowI/colI or simply rI/cI. This tends to also make indexing bugs more obvious as the code itself becomes a bit more meaningful.

  14. Right – don't write "R programs", write R functions.

    Use meaningful and even very long (if necessary) names for the functions. Use a consistent naming technique e.g. verb + object; getParameter, checkValue, plotArray. If you are consistent, then reading and understanding the code will be easy.

    If a function gets too long (say, more than one screenful) then apply this principle to this function, replacing pieces of code by function calls. Apply the principle recursively.

    Prepend a dot (".") to names of "helper" functions that are not real utilities but only used within one or two other functions, so you will immediately know the function is not really intended for use by the "end user."

    Recently I have noticed that even this is not enough to keep my sanity when writing large R packages. The answer is to write methods instead of plain functions.

    As a principle if your function returns a certain special data structure (such as list with named objects) then you should have your function return an object of some class instead.

    Such objects should be generated only by specific creator functions that check the parameter values thoroughly, so whenever you have a object of certain class you can be sure that the field values are valid. This way you can avoid repeating checking and validating arguments in your code. Further processing of the objects should be done by specialized methods only.
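    A minimal sketch of this pattern using S3 classes; the class name "survey", the creator function `makeSurvey`, and its fields are all hypothetical:

    ```r
    # Creator function: the only place where "survey" objects are built,
    # so the validity checks live here and nowhere else.
    makeSurvey <- function(year, n.respondents) {
      stopifnot(is.numeric(year), length(year) == 1,
                is.numeric(n.respondents), length(n.respondents) == 1,
                n.respondents >= 0)
      structure(list(year = year, n.respondents = n.respondents),
                class = "survey")
    }

    # Method for the class: it can trust the fields without rechecking them
    print.survey <- function(x, ...) {
      cat("Survey", x$year, "with", x$n.respondents, "respondents\n")
      invisible(x)
    }

    s <- makeSurvey(2004, 1200)
    print(s)
    ```

    A call such as makeSurvey("2004", 1200) stops with an error at creation time, so downstream methods never see an invalid object.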

    And, never assume that you'll never have to touch the code again. The chances are that you'll have to modify it in the future to do something similar or more complicated.

  15. To the degree that it is possible, avoid variables altogether. The apply family of functions in R, for example, is a good opportunity to cut down on temporary storage. R's other vectorizing operations offer similar opportunities for streamlining your code.
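    For instance, here is a small sketch in which apply() and sapply() replace the loop index and accumulator that an explicit loop would need; the data are made up:

    ```r
    scores <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

    # Mean of each row, with no index variable and no running total
    row.means <- apply(scores, 1, mean)

    # Length of each element of a list, again without a loop variable
    groups <- list(a = 1:3, b = 1:5)
    group.sizes <- sapply(groups, length)
    ```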

    On a related note, using functions to remember state is a good way to take it off your plate. For example, if I have some normalizing constant I want to remember, I can create a function that returns another function to handle that for me:

    normalizingFunctionMaker
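    The definition above appears to have been cut off. As a stand-in, here is a minimal sketch of the general technique: a function that returns another function which remembers a normalizing constant. The name `makeNormalizer` and its body are hypothetical, not the original code:

    ```r
    makeNormalizer <- function(x) {
      total <- sum(x)                   # normalizing constant, kept by the closure
      function(values) values / total   # returned function still "sees" total
    }

    normalize <- makeNormalizer(c(2, 3, 5))   # total is 10
    normalize(c(2, 5))
    ```

    The constant never needs a name in the global workspace; it travels with `normalize`.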

  16. The advice that's been given is much more sophisticated than what I'm going to say; having said that…

    One more advantage of breaking code into functions is that you can then run unit tests on the functions. It's usually possible to quickly write a slow but transparent version and then include a script that compares the output of both functions on a very small subset of the data. I've had good results using perl's test packages — I make system calls to R in batch mode, and the R files are basically

    source("fast_method_file.R")

    # define the slow method

    if (!isTRUE(all.equal(slowmethod(arguments), fastmethod(arguments))))
        stop("slow method and fast method don't match")

    These (obviously) end in an error if the functions don't match up — perl can then handle that error. Lately, I've started using make files for the same purpose, but it's a little more of a pain to keep everything straight. Having the slow version can be invaluable when debugging the fast version. This approach makes me much more confident that I've done vectorization correctly, for example.

    That's a little off topic though. More relevant: when looping k in 1:whatever, you can add a ".k" to the variables that are changing in each iteration. I've found this useful in heavily nested loops. If you're using an editor with autocomplete, this doesn't even add any typing.
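    A tiny sketch of the ".k" suffix, with invented quantities; anything ending in ".k" is recomputed on every pass, and anything without the suffix persists across iterations:

    ```r
    total <- 0                  # persists across iterations, so no suffix
    for (k in 1:3) {
      weight.k <- 1 / k         # changes with k
      value.k <- k^2            # changes with k
      total <- total + weight.k * value.k
    }
    ```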
