Tidyverse Skeptic

Why the tidyverse is not a good vehicle for teaching noncoders

Author

Norm Matloff

Author bio

Main Points
  • The focus of this essay is teaching R to noncoders. I believe that teaching R via the tidyverse is counterproductive for this group of learners.

  • The essay is not about teaching R to computer science students, nor is it about whether R is a good tool for experience coders (which is a matter of personal taste).

  • In teaching almost any topic, the KISS Principle – Keep It Simple, Stupid – must take first priority. In the case of teaching R, base-R should be used.

  • The tidyverse (“Tidy”) is more complex and abstract than base-R, hence harder to learn, directly antithetical to Tidy’s stated goals.

  • The Tidy functional programming (FP) philosophy replaces straightforward, easily grasped, loops with abstract use of functions. Since functions are the most difficult aspect for noncoder R learners, FP is clearly not the right path for such learners.

  • Debugging Tidy code is difficult, especially in the case of pipes. Yet noncoder learners are the ones who make the most mistakes, so it makes no sense to have them use a coding style that makes it difficult to track down their errors.

  • “Testimonials” regarding Tidy are misleading. Tidy-based courses restrict coverage to a small set of operations, mainly a few dplyr verbs, producing the illusion of “success.” The students are happy and the instructor is happy, but the students are unprepared to use R productively in their work.

  • Tidy-taught students then struggle with tasks in their daily work, trying to shoehorn a task into the Tidy framework.

  • Moreover, there is an unwritten rule that one stay with that framework, even for something as mundane as lm. There Tidy has its own summary functions (broom and methods for augment and glimpse), to be used in lieu of summary.

  • Accordingly, Tidy has a function for everything under the sun; dplyr alone has 297 functions! One of course need not know all of them, but when a novice R coder asks how to do X, they are referred to some specialized Tidy function for X, rather than pointing them to simpler base-R approaches. This results in cognitive overload.

  • Those knowing only Tidy are locked out from effectively using the vast majority of statistical packages on CRAN, which require knowledge of the ‘$’ operator, vectors, subscripts, summary and so on.

  • Using base-R as the vehicle of instruction brings students to a more skilled level, in shorter time, than does use of Tidy. My own online course has students doing useful operations starting in the very first lesson.

  • Accordingly, noncoder learners should be taught a mix of base-R and Tidy, starting first with base-R, but then with some coverage of Tidy, since the latter is so popular, and of data.table, as it is so fast. “Graduates” of mixed courses are then equipped to apply whichever tool they find most convenient in any given application.

  • Unfortunately, this middle-ground approach is unacceptable to many Tidy advocates, especially Posit.

  • The dominance of an open-source project by a commercial entity is in general a major concern. Posit’s overall contribution to R is highly praiseworthy, but it took a wrong turn with Tidy, which would just be a niche package without Posit’s aggressive promotion. It certainly would not be used in teaching R to noncoders, for the reasons given above – too complex, too abstract.

Interestingly, leading tidyverse promoters concede many of the above points.

  • In an article by six prominent Tidy advocates (An Educator’s Perspective, see below), they concede that their offered base-R code is simpler (“more compact”) than their preferred Tidy code.

  • Tidy promotes using FP instead of loops. But Hadley Wickham, inventor of the tidyverse (it was originally called the “Hadleyverse” by others), has written: “it may take a while to wrap your head around” FP.

  • One of the most prominent R advocates, Prof. Roger Peng, concedes that teaching R via the tidyverse greatly limits what “graduates” of such courses can do with R, and indirectly, concedes that it gives students a false sense of accomplishment.

    [The advent of Tidy] allowed a relatively small set of tools to provide a wide range of operations when it came to data wrangling. The opinionated nature of the tools naturally limited somewhat the flexibility of how things could be done. But this reduced complexity was what made the tools so appealing to new users. There were just fewer things to learn.

  • Tidyers agree that Tidy code, especially pipes, is harder to debug.

Prologue: The KISS Principle: “Keep It Simple, Stupid”

  • Among other things, I once was employed briefly as an instructor of English As a Second Language in San Francisco Chinatown.

  • I taught the lowest level, ESL 50, adult immigrant students who barely knew the alphabet. The overriding principle I found is to start simply. In teaching low level ESL, one does not start with Shakespeare.

  • The KISS Principle applies similarly to use of the tidyverse for teaching noncoders. Compared to base-R, the tidyverse (“Tidy”) is much more complex and abstract, the “Shakespeare” of R.

  • In other words, Tidy uses unnecessarily complicated machinery, an obvious drawback for teaching beginners, and yet only equips students to perform a narrow set of operations.

    It would be like teaching English As a Second Language as follows: The instructor teaches the students to say “Good Morning,” by providing them with a template like,

    It is my fervent wish that you find your morning unusually fulfilling.

    The students would be told to substitute morning with afternoon or evening. But they would have no idea what the other words mean, and it would not generalize much. The students may even rave about these English “lessons,” but they won’t have learned much at all.

Sticking to simple concepts is the overriding principle in this document. A number of concrete examples will be presented, in Section 3 and elsewhere in this document.

Preliminaries

Key base-r vs. Tidy aspects

  • By “base-R” I mean array indices, loops, if-then-else and so on.

  • I do NOT advocate a slow, boring approach in which, for instance, one starts with “R data can be numeric, character or logical…”

  • The main issue is how to teach R learners to do data analysis, practical stuff that they can skillfully apply.

  • I view Tidy-based courses as characterized by

    • focus on dplyr, purrr and pipes

    • a ban/near ban on the use of loops, the ‘$’ sign, brackets etc.

Function notation

It is common to denote functions with parentheses, as in “The main linear models function in R is lm().” I write ‘lm’ rather than ‘lm()’, as it matters crucially when one function has another as an argument; ‘lm()’ will produce an error, a common one among beginners.

Note on ggplot2

  • Many view ggplot2 as part of the tidyverse, but that is incorrect. That package predates Tidy, Posit’s marketing of Tidy as including ggplot2 only came later.

  • Hadley Wickham as stated that if he had to do it over, he would write ggplot2 differently, to follow Tidy principles.

  • This is important, since many who praise Tidy cite ggplot2 as their first reason for being pro-Tidy in teaching.

  • By the way, another excellent R graphics package is lattice. Unfortunately, lacking the heavy support of a commercial entity like Posit, lattice has not been developed nearly as much as ggplot2.

The role of Posit Corp.

The up side:

  • I like and admire the Posit people, including the tidyverse originator, Hadley Wickham, and have always supported them, both privately and publicly.

  • They and I have been interacting from the beginning, when the firm consisted only of founder JJ Allaire and ace developer Joe Cheng.

  • Hadley was the internal reviewer for my book (at his request), The Art of R Programming, and I was one of the internal reviewers for Hadley’s Advanced R (2nd. ed.).

  • Posit is to be praised for greatly expanding the size and diversity of the R community.

  • One positive result of Posit’s role as a commercial entity is that it has subsidized major, beneficial growth of packages such as ggplot2.

The down side:

  • I believe that Posit took a wrong turn when it decided to promote the tidyverse for beginning learners who have no prior coding background. Tidy makes it more difficult for noncoders to learn R, and leaves them less able to use it productively. As someone who is passionate about teaching and a longtime R enthusiast, this is quite troubling to me.

  • As a general principle, it can be unhealthy for a commercial entity to dominate an open-source project.

  • The high popularity of Posit’s own products have come at the expense of other excellent products, notably lattice, a more intuitive graphics package than ggplot2, and data.table, a faster and more memory-efficient library than dplyr.

  • Without the aggressive promotion by Posit, the tidyverse would just be a niche package. In particular, it would not be used to teach noncoder learners of R, for the very reasons shown in this document – Tidy is much too abstract and too complex for this type of student.

Examples

Case study: delayed learning (I)

Let’s look at an example in my online R tutorial, fasteR. Consider, for instance, an innocuous line like

> hist(Nile)

i.e. drawing a simple histogram of R’s built-in Nile River dataset.

This is in the very first lesson in my tutorial. Easy! By contrast, the Tidy crowd forbids use of base-R plots, insisting on using ggplot2. The latter is a great package (though again it should not be considered part of the tidyverse), but here we have a much more important goal – to give students an actual useful application of R right from the start. Tidy greatly impedes that goal.

To be Tidy the instructor would have to do something like

> library(ggplot2)
> dn <- data.frame(Nile)
> ggplot(dn) + geom_histogram(aes(Nile),dn)

Here the instructor would have a ton of things to explain – packages, data frames, ggplot(), the aes argument, the role of the ‘+’ (it’s not addition here) and so on – and thus she could definitely NOT present it in the first lesson.

Things would be simpler for Tidy instructors if they were to use qplot() in the ggplot2 package, but the ggplot2 documentation says ” ‘qplot()’ is now deprecated in order to encourage the users to learn ‘ggplot()’ as it makes it easier to create complex graphics.” Again, “Stay within the framework,” even though it’s more difficult for noncoder learners.

Also in my very first lesson, I do

> mean(Nile[80:100])

printing the mean Nile River flow during a certain range of years. Incredibly, not only would this NOT be in a first lesson with Tidy, the students in a Tidy course may actually never learn how to do this. Typical Tidyers don’t consider vectors very important for learners, let alone vector subscipts.

As a concrete example of this Tidy point of view, consider the book Getting Started with R, by Beckerman et al, Oxford University Press, second edition, 2017. The book makes a point of being “tidyverse compliant”. In 231 pages, vectors are mentioned just briefly, with no coverage of subscripts. Hadley, in his Tidy “bible,” R for Data Science”, actually does cover vectors–but not until Chapter 20. In Section 3.5, we will see a group of leading Tidy authors disparage some base-R code in that “it returns a vector.” Again, typical Tidy courses don’t cover vectors at all.

In other words, even after one lesson, the base-R learner would be way ahead of her Tidy counterparts.

Case study: delayed learning (II)

A researcher tweeted in December 2019 that an introductory statistics book by Peter Dalgaard is “now obsolete,” because it uses base-R rather than Tidy. Think of what an “update” to use Tidy would involve, how much extra complexity it would impose on the students. Here is an early lesson from the book:

> thue2 <- subset(thuesen,blood.glucose < 7)

This could easily be in the base-R instructor’s second lesson, if not the first. For Tidy, though, this would have to be changed to

> library(dplyr)
> thue2 <- thue2 %>% filter(blood.glucose < 7)

Here the instructor would first have to teach the pipe operator ‘%>%’, again extra complexity. And in so doing, she would probably emphasize the “left to right” flow of pipes, but the confused students would then wonder why, after that left-to-right flow, there is suddenly a right-to-left flow, with the ‘<-’. (For some reason, the Tidy people don’t seem to use R’s ‘->’ op.)

Again, the tidyverse is simply too complex for R learners without coding background. It slows down their learning process.

Case study: tapply (I)

One of the most commonly-used functions in base-R is tapply. It’s quite easy to teach beginners, by presenting its call form to them:

tapply(what to split, how to split it,what to apply to the resulting chunks)

However, for some reason Tidy advocates deeply resent this function. When the tidyverse was first developed, Prof. Roger Peng gave a thoughtful keynote talk, Teaching R to New Users – from tapply to the tidyverse, also presented as a Web page. Oddly, tapply is not mentioned in Dr. Peng’s talk or in the printed version, but the title says it all: One should teach Tidy, not tapply.

Surprisingly, though, in making his comparison of Tidy to base-R, his base-R example is aggregate, not tapply. The complexity of aggregate function makes for an unfair comparison, A STRAW MAN; tapply is much simpler, and produces simpler code than with Dr. Peng’s aggregate version.

Below is the example from the Peng talk, using the built-in R dataset airquality. We find the mean ozone level by month:

group_by(airquality, Month) %>% 
   summarize(o3 = mean(Ozone, na.rm = TRUE))

Here is the base-R version offered by Dr. Peng:

aggregate(airquality[, “Ozone”], 
      list(Month = airquality[, “Month”]), 
      mean, na.rm = TRUE)

Indeed, that base-R code would be difficult for R beginners. But here is the far easier base-R code, using tapply():

tapply(airquality$Ozone,airquality$Month,mean,na.rm=TRUE)

This fits clearly into our tapply call form shown above:

  • What to split: Ozone

  • How to split it: by Month

  • What to apply to the resulting chunks: mean

The Tidy version requires two function calls rather than one for base-R. The Tidy code is a bit wordier, and requires that one do an assignment (to o3). All in all, the base-R version is simpler, and thus easier for noncoder beginners.

Case study: tapply (II)

Of course, one can find instances in which each approach, base-R and Tidy, is simpler to use than the other. Indeed, I have strongly advocated that noncoder R learners should pick up some of both. (More on this point below.) But I think it’s worth repeating how simple it is for R learners to solve many basic problems using tapply().

Consider this example, a bit more advanced, that I like to use to motivate linear regression models. We wish to predict human weight from height, and wish to first check approximate linearity.

The dataset, involving heights and weights of professional baseball players, mlb, is from my regtools package. For each height value, there are various players of that height, so we plot mean weight against height.

Base-R:

htsAndWts <- tapply(mlb$Weight,mlb$Height,mean)        
plot(htsAndWts)

Tidy (plus ggplot2):

mlb %>% 
   group_by(Height) %>% 
   summarize(weights = mean(Weight)) %>%              
   ggplot(aes(x=Height,y=weights)) + geom_point()                

Two major dplyr functions, pipes, and a somewhat sophisticated usage of ggplot2! It would be out of the question to use this example early in a Tidy course, far too much machinery to cover. But easy to do so in a course using base-R, in the first or second lesson. That should be the goal, empowering students to work on real problems, early on.

Case study: tapply (III)

From the article An Educator’s Perspective, by leading tidyverse instructors (more on the article later in this essay):

The dataset is loan data available in the openintro package.


loans |>
   group_by(homeownership) |>
   summarize(
      num_applicants = n(),
      avg_loan_amount = mean(loan_amount)
   ) |>
   arrange(desc(avg loan amount))

This uses six function calls, including to two advanced functions. By contrast, they first offer an woefully indirect base-R version using once again a straw man, aggregate, followed by – hooray! – a tapply version:

sort(tapply(loans$loan_amount,loans$homeownership,mean))

This dismiss this as (a) it “returns a vector” and (b) it doesn’t include number of people in each category. Objection (a) stems from the Tidyers’ obsession with keeping Tidy analysis self-contained. Objection (b) is easily met by replacing the call to mean by

function(x) c(mean(x),length(x))
subset(loans,annual_income >= 10)

Dogmatic Teaching Is Harmful to Students

  • Most people, on most issues, avoid extremes.

  • But leading teachers of the tidyverse make it a polemic, with dogmatic calls for “purity,” with a Tidy-only approach.

  • The “purity” view is expressed in the Educator’s Perspective article, written by a number of prominent Tidy advocates. (One of them even was selling “tidyverse merch”.)

  • An anti-loops mentality in particular has become the Tidy advocates’ test of whether one is a “true believer” in the Tidy philosophy. I’ve seen Twitter posts in which R learners actually apologize for using loops.

  • That article does suggest that R courses include both the tidyverse and base-r – but with the tidyverse covered first. This of course violates the KISS principle. (I advocate teaching base-R first, then Tidy and data.table.)

The authors oppose teaching base-R and Tidy simultaneously. Yet I am not aware of anyone ever having made such a proposal.

“Don’t Use ‘$’, Don’t Use Brackets, Don’t Use Vectors, Don’t Use Loops, Don’t Use Anything Outside Tidy”

As with the Peng essay, the Educator’s Perspective article sets up a straw man, unnecessarily complicated code involving base-R’s aggregate. It is thus a fundamentally invalid basis of comparison of Tidy and base-R, but the issue of interest here is why they deem the code unacceptable (quoted directly from the article):

  1. It uses the ~, $, and [ operators.

  2. It uses a magic number (2) to hard-code the second variable.

  3. It creates two data frames that need to be merged together.

  4. It passes the name of a function as an argument to a function.

As I stated earlier, even the authors’ own tapply example does not have the issues 2, 3 and 4.

I have no idea what alleged “defect” they have in mind in Objection 4.

But the first is my main point anyway:

The Major Costs of Confining Code to Tidy
  • Base-R versions, using ‘$’ etc., are typically simpler and more accessible to novice R coders, than the Tidy counterparts. (Even the Educator’s Perspective’ authors concede that the tapply version is “more compact.”)

    Again, by banning constructs which would make R learners’ life easier, the Tidyers are acting directly contrary to their goal of inclusivity in the R community.

  • The vast majority of statistical packages on CRAN make heavy use of those base-R constructs proscribed by the Tidyers.

The article contends that use of ‘$’ is error-prone, say due to misspelling names of components. But one could make similar arguments for, say, dplyr::mutate being error-prone. Indeed, my contention here is that:

A common error among Tidy beginners is failure to reassign the result of something like mutate().

The simpler the code, the less chance of errors (and the greater ability to spot and fix errors when they do arise).

The best solution is for instructors to warn students of potential errors of various types.

Finally, as to brackets, Tidy essentially ignores vectors. Consider the following base-R code:

x <- c(5,12,13,1)
x[x > 8]
# or, if preferred:
subset(x, x > 8)

Probably the simplest way to do this in Tidy would be:

x <- c(5,12,13,1)
data.frame(x=x) %>% filter(x > 8)

That’s a lot of machinery to implement one little vector operation – conversion to a data frame, pipes, filter.

Consider this very simple example: Say we wish to add to the mtcars data a column consisting of the horsepower-to-weight ratio.

In base-R, it is extremely simple:

   mtcars$hwratio <- mtcars$hp / mtcars$wt

Want to add a column to a data frame? Hey, just add it!

But in Tidy:

   mtcars %>% mutate(hwratio=hp/wt) -> mtcars

Still not terribly complex, but again we’re invoking some machinery – piping, a function call, and reassigning to the original data frame.

Poor Readability, Unnecessary Cognitive Overload

As noted, R courses using the tidyverse often do rather little beyond their canonical

data %>% group_by %>% mutate %>% summarize

paradigm. That leaves the learners less equipped to put R to practical use, compared to “graduates” of standard base-R courses.

Even more important, once one gets past that simple paradigm, Tidy runs into real problems.

Again, let’s use an mtcars example taken from the online help page for map: The goal is to regress miles per gallon against weight, calculating R2 for each cylinder group. Here’s the Tidy solution, from the help page:

mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")

#  output
4         6         8 
0.5086326 0.4645102 0.4229655

There are several major points to note here:

  • The R learner here must learn two different FP map functions for this particular example, actually three functions when the differing types of arguments is accounted for.

  • This is an excellent example of Tidy’s cognitive overload problem. Actually, the tidyverse FP package, purrr, has 55 different map functions! (21 have names beginning with ‘map’, and then 34 more have that string otherwise in their names.)

  • In fact, the online help page for do says the function is being replaced by three new functions, reframe, nest_by and pick. No wonder the Tidy packages are getting larger and larger!

  • That “Tidy” code is a nightmare to read. For instance, the first ‘~’ in that first map call is highly nonintuitive. This is starkly counter to the Tidyers’ claim that Tidy is more intuitive and English-like.

  • Tidy, in its obsession to avoid R’s standard ‘$’ symbol, is causing all kinds of chaos and confusion here.

The hapless student would naturally ask, “Where does that expression ‘summary’ come from?” It would appear that map is being called on a nonexistent variable, summary. In actuality, base-R’s summary function is being called on the previous computation behind the scenes. Again, highly nonintuitive, and NOT stated in the online help page.

The poor student is further baffled by the call to map_dbl. Where did that ‘r.squared’ come from? Again, Tidy is hiding the fact that summary yields an S3 object with component r.squared. Yes, for experienced programmers, sometimes it is helpful to hide the details, but not if it confuses beginners.

The fact is, R beginners would be much better off writing a loop here, avoiding the conceptually more challenging FP. But even if the instructor believes the beginner must learn FP, the base-R version is far easier:

lmr2 <- function(mtcSubset) {
   lmout <- lm(mpg ~ wt,data=mtcSubset)
   summary(lmout)$r.squared
}
u <- split(mtcars,mtcars$cyl)
sapply(u,lmr2)

Here lmr2 is defined explicitly, as opposed to the Tidy version, with its inscrutable ‘~’ within the map call.

Newbies should write this as a plain loop, NOT using purrr or even base-R’s FP constructs. But the Tidy promoters don’t want learners to use loops. So the instructor using Tidy simply would avoid giving students such an example, whereas it would be easy for the base-R instructor to do so.

Again, there is a serious problem of cognitive overload. Tidyverse students are being asked to learn a much larger volume of material. This is ironic, in that the goal in developing Tidy for teaching was to limit a course to just a few verbs. The problem is that the verbs have dozens of variants. This is clearly bad pedagogy.

Tidy advocates say the uniformity of interface in all those functions makes learning them easier. Uniform syntax is nice, yes, but the fact remains that users must learn the semantics of the functions, i.e. what operations they perform. What, for example, is the difference between summarize, summarize_each, summarize_at and summarize_if? Under which circumstances should each be used? The user must sift through this.

As Matt Dowle, creator of data.table, pointed out about dplyr,

It isn’t one function mutate that you combine in a pipe. It’s mutate, mutate_, mutate_all, mutate_at, mutate_each, mutate_each_, mutate_if, transmute, transmute_, transmute_all, transmute_at and transmute_if. And you’re telling me [because of consistency of the user interfaces] you don’t need a manual to learn all those?

Having a common syntax thus does not compensate for this dizzying complexity. Every time a user needs some variant of an operation, she must sift through those hundreds of functions for one suited to her current need.

By contrast, if the user knows base-R (not difficult), she can handle any situation with just a few simple operations. The old adage applies: “Give a man a fish, and he can eat for a day. Teach him how to fish, and he can eat for a lifetime.”

Note that the above example uses the “dot notation,” used when the output of a pipe is fed into a multi-argument function. Here is the advice given at the official tidyverse site:

Placing lhs elsewhere in rhs call

Often you will want lhs to the rhs call at another position than the first. For this purpose you can use the dot (.) as placeholder. For example, y %>% f(x, .) is equivalent to f(x, y) and z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z).

This phrasing is likely beyond the comprehension of many R beginners. Seeing a few examples would help them, of course, but it is yet another example of how Tidy causes cognitive overload for learners.

Tidyers also claim readability quite literally – one can read Tidy code aloud and get the meaning. Fine, but once one understands the highly intuitive meaning of tapply

tapply(what to split, how to split it,what to apply to the resulting chunks)

one gets exactly the same benefit. Just as, for instance, Tidyers say that ‘%>%’ should be pronounced as “and then,” one can read

tapply(loans$loan_amount,loans$homeownership,mean)

aloud as “Split the loan amount according to ownership, then call ‘mean’ on the results.”

Tidy Obstacles to Debugging

Here is an example from Text Mining with R, by Julia Silge of Posit, and David Robinson. It’s a great introduction to the text analysis field, and I myself have found the book useful.

This example, which may be found here, is a bit more involved than the previous ones, and thus better illustrates my concerns about debugging Tidy code. Courses that teach using Tidy do very well for simple examples; this one is more complex, possibly in terms of writing the code, and definitely in terms of debugging.

Debugging is something I’ve given a fair amount of thought to. I even wrote a book on it, The Art of Debugging (NSP, 2008). As noted earlier, noncoder R learners are the ones most in need of debugging skills, as they make the most errors. Thus the example here is of high importance.

The R package janeaustenr brings in full corpuses of six Austen novels:


> library(janeaustenr)
> books <- austen_books()
> str(books)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       73422 obs. of  2 variables:
$ text: chr  "SENSE AND SENSIBILITY" "" "by Jane Austen" "" ...
$ book: Factor w/ 6 levels "Sense & Sensibility",..: 1 1 1 1 1 1 1 1 1 1 ...
...
> books$text[22225]
[1] "drink to our good journey."

The authors’ first goal is to put all the books together, one row per line of a book, with line and chapter number. Here are a couple of typical rows in the desired result:

"children, the old Gentleman's days were comfort… Sense & Sens…    25       1
...
"As we went along, Kitty and I drew up the blind… Pride & Prej…  7398      39

Here is the solution given by the authors:

original_books <- austen_books() %>%
   group_by(book) %>%
      mutate(line = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
               ignore_case = TRUE)))) %>%
         ungroup()

A base-R version within reach of R beginners would begin with split, to group by books, followed by loop over books. A more advanced base-R coder might use FP, first with lapply and then with Reduce to concatenate the data frames.

The R beginners version may be unfancy, even old-fashioned, but again suitable for beginners – especially from the point of view of debugging, a key point for noncoder R learners, as noted, and my main point in this section. Even the base-R FP version would be quite debuggable (providing the function called by lapply is named).

By contrast, this example epitomizes the problems with debugging Tidy code. Suppose, for instance, that the user had accidentally typed ‘sum’ rather than ‘cumsum’ in the above code. The output would look like this:

> original_books
   text                    book                 line chapter
   <chr>                   <fct>               <int>   <int>
 1 "SENSE AND SENSIBILITY" Sense & Sensibility     1      50
 2 ""                      Sense & Sensibility     2      50
 3 "by Jane Austen"        Sense & Sensibility     3      50
 4 ""                      Sense & Sensibility     4      50
 5 "(1811)"                Sense & Sensibility     5      50
 6 ""                      Sense & Sensibility     6      50
 7 ""                      Sense & Sensibility     7      50
 8 ""                      Sense & Sensibility     8      50
 9 ""                      Sense & Sensibility     9      50
10 "CHAPTER 1"             Sense & Sensibility    10      50
   with 73,412 more rows

The ‘chapter column’ should read 1, not 50. But there is not error message, hence no clue as to what the problem might be.

If the code had not used pipes, one could use the R debug or browser functions, or the Posit IDE debugging tool. Even use of print statements, which I do not recommend in general, would still be effective if the code had been nonpiped. (Details below in the special section on pipes.) But it won’t work in the Tidy, i.e. piped, version.

For instance, consider the code

debug(group_by)
mtcars %>% group_by(cyl)

This produces

debugging in: group_by(., cyl)
debug: {
    UseMethod("group_by")
}

Dealing with this is far beyond the skillset of R beginners.

Pipes are fundamentally unsuitable for use of debugging tools, and even just using print statements is impossible.

One partial remedy is the clever pipecleaner package for debugging pipes. It works only up to a point, and is probably a bit too involved for R beginners. Most signifcantly, the package writeup also notes that

Occasionally it is necessary to restructure code from a piped to an unpiped form.

This makes it clear that pipes (both the Magrittr pipes in Tidy, and the later native pipes added to base-R), were simply not designed with debugging in mind. As the package author says, sometimes the only solution is to convert the code to ordinary unpiped form. (The package has an aid for this.)

Should we teach using pipes or functional composition? Neither!

Pipes play a central role in the Tidy paradigm. According to the Tidy advocates, the claimed alternative – function composition – is confusing for those without coding background. That point is debatable, but once again, the Tidy people are presenting us with a false choice, pipes or function composition.

According to a member of the R Core Group, the latter added a native pipe to R as an “olive branch” to the Tidy community, not as a statement supporting use of pipes. They also wanted to “do pipes the right way.”

There is a “third way”–breaking a sequence of function calls into separate, intermediate results. In other words, instead of

y <- h(g(f(x)))  # functional composition

or

y <- x %>% f %>% g %>% h  # pipes

write

a <- f(x)
b <- g(a)
y <- h(b)

The Tidyers do sometimes mention this Third Way, but immediately dismiss it. Hadley, for instance, in R for Data Science feels that it creates “cluttered” code. Well, OK, each person has his/her own aesthetic values, but really that should pale in comparison to code writability, readability and debuggability.

For example, consider the Jane Austen novels example above.

original_books <- austen_books() %>%
   group_by(book) %>%
      mutate(line = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
               ignore_case = TRUE)))) %>%
         ungroup()

A non-piped version (but still using dplyr) would be, say,

austens <- austen_books() 
booksGrouped <- tibble(group_by(austens,book))
chapterNumbers <- 
   cumsum(str_detect(booksGrouped$text, regex("^chapter [\\divxlc]",
               ignore_case = TRUE)))
mutated <- mutate(booksGrouped,line = row_number(),chapter=chapterNumbers)
original_books <- ungroup(mutated)

Now, suppose the user had accidentally typed ‘sum’ instead of ‘cumsum’. In the unpiped version, the user could run this code one line at a time, pausing to check the result at each step, either via a debugging tool or just using print. She would quickly determine that the line in which chapterNumbers is assigned is the culprit. This is the essence of debugging – first narrow down where the error arises.

As noted, one could not do this with the piped version. Indeed, the user would likely convert to the non-piped version for debugging!

So why not do this in the first place? I assert that it would be easier to write nonpipe code to begin with. During the writing of the code, the breaking down the overall action into simple intermediate steps make it easier for the author to plan out the trajectory of the code. Each intermediate step allows the author to “catch one’s breath.”

For the same reasons, I assert that such code is easier to write and debug, but also easier for others to read.

Isn’t this worth a little bit of clutter?

Even I, with several decades of programming experience in numerous languages, almost never use functional composition (other than a nesting depth of 2). Nor do I use pipes. I use separate intermediate statements, as discussed above. It’s just clearer. And yes, worth the bit of clutter!

Use of Functional Programming (FP)

If there is one aspect of the Tidy-vs.-base-R debate that in my opinion demonstrates the problem with Tidy, it’s that Tidy advocates want R beginners to avoid loops. Indeed, many of those advocates do not even teach loops at all.

The Tidyers want R beginners to use functional programming (FP) instead of loops. But even the Tidyers agree that the concept of functions is one of the hardest for beginners to learn. Thus it makes no sense to force beginners to use the tough concept of FP in lieu of the much more natural loops approach. FP does have its place, but it should be taught as an advanced topic.

It is worth noting that top university Computer Science Departments have shifted away from teaching their introductory programming courses using the FP paradigm, in favor of the more traditional Python, as they deem FP to be more abstract and challenging. If FP is tough for CS students, it makes no sense to have nonprogrammer learners of R use it.

Python does include some FP operations, but this advanced material is usually not taught in first courses.

Even Hadley, in R for Data Science, says:

The idea of passing a function to another function is extremely powerful idea, and it’s one of the behaviours that makes R a functional programming language. It might take you a while to wrap your head around the idea, but it’s worth the investment.

Again, this is arguably correct for experienced coders, but it is bad policy to ask noncoder learners of R to “wrap their heads” around advanced constructs. The KISS principle should be paramount.