diff --git a/data.qmd b/data.qmd index 4747c7b..52d8d56 100644 --- a/data.qmd +++ b/data.qmd @@ -8,7 +8,7 @@ For that reason, deep-learning frameworks like `torch` include an input pipeline In this book, "dataset" (variable-width font, no parentheses), or just "the data", usually refers to things like R matrices, `data.frame`s, and what's contained therein. A `dataset()` (fixed-width font, parentheses), however, is a `torch` object that knows how to do one thing: *deliver to the caller a single item*. That item, usually, will be a list, consisting of one input and one target tensor. (It could be anything, though -- whatever makes sense for the task. For example, it could be a single tensor, if input and target are the same. Or more than two tensors, in case different inputs should be passed to different modules.) -As long as it fulfills the above-stated contract, a `dataset()` is free to do whatever needs to be done. It could, for example, download data from the internet, store them in some temporary location, do some pre-processing, and when asked, return bite-sized chunks of data in just the shape expected by a certain class of models. No matter what it does in the background, all its caller cares about is that it return a single item. Its caller, that's the `dataloader()`. +As long as it fulfills the above-stated contract, a `dataset()` is free to do whatever needs to be done. It could, for example, download data from the internet, store them in some temporary location, do some pre-processing, and when asked, return bite-sized chunks of data in just the shape expected by a certain class of models. No matter what it does in the background, all its caller cares about is that it return a single item. Its caller is the `dataloader()`. A `dataloader()`'s role is to feed input to the model in *batches*. One immediate reason is computer memory: Most `dataset()`s will be far too large to be passed to the model in one go. But there are additional benefits to batching.
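To see the contract in action, here is a minimal sketch of a `dataset()` wrapping an R matrix, handed to a `dataloader()` for batching. (The name `toy_dataset` and the random data are purely illustrative, not part of the diffs in this patch.)

```{r}
library(torch)

# A dataset() fulfills one contract: return a single item per index.
# Here, an item is a list of one input and one target tensor.
toy_dataset <- dataset(
  name = "toy_dataset",
  initialize = function(x, y) {
    self$x <- torch_tensor(x)
    self$y <- torch_tensor(y)
  },
  .getitem = function(i) {
    list(x = self$x[i, ], y = self$y[i])
  },
  .length = function() {
    self$x$size(1)
  }
)

ds <- toy_dataset(matrix(rnorm(100 * 3), ncol = 3), rnorm(100))
ds[1] # a single item

# The dataloader() is the caller: it assembles items into batches.
dl <- dataloader(ds, batch_size = 32, shuffle = TRUE)
```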
Since gradients are computed (and model weights updated) once per *batch*, there is an inherent stochasticity to the process, a stochasticity that helps with model training. We'll talk more about that in an upcoming chapter. diff --git a/modules.qmd b/modules.qmd index d9408d5..65266e1 100644 --- a/modules.qmd +++ b/modules.qmd @@ -92,7 +92,7 @@ output$size() [1] 50 16 -So that's the forward pass. How about gradient computation? Previously, when creating a tensor we wanted to figure as a "source" in gradient computation, we had to let `torch` know explicitly, passing `requires_grad = TRUE`. No such thing is required for built-in `nn_module()`s. We can immediately check that `output` knows what to do on `backward()`: +So that's the forward pass. How about gradient computation? Previously, when we created a tensor that we wanted to figure as a "source" in gradient computation, we had to let `torch` know explicitly, passing `requires_grad = TRUE`. No such thing is required for built-in `nn_module()`s. We can immediately check that `output` knows what to do on `backward()`: ```{r} output$grad_fn diff --git a/network_1.qmd b/network_1.qmd index fdd28da..9c35c85 100644 --- a/network_1.qmd +++ b/network_1.qmd @@ -80,7 +80,7 @@ Each unit has its own value for bias, too. b1 <- torch_zeros(1, 8, requires_grad = TRUE) ``` -Just like we saw before, the hidden layer will multiply the input it receives by the weights and add the bias. That is, it applies the function $f$ displayed above. Then, another function is applied.
This function receives its input from the hidden layer and produces the final output. In a nutshell, what is happening here is function composition: Calling the second function $g$, the overall transformation is $g(f(\mathbf{X}))$, or $g \circ f$. For $g$ to yield an output analogous to the single-layer architecture above, its weight matrix has to take the eight-column hidden layer to a single column. That is, `w2` looks like this: diff --git a/tensors.qmd b/tensors.qmd index ba724f3..fe359cd 100644 --- a/tensors.qmd +++ b/tensors.qmd @@ -686,7 +686,7 @@ Now, we have sums over rows. Did we misunderstand something about how `torch` or Instead, the conceptual difference is specific to aggregating, or "grouping", operations. In R, *grouping*, in fact, nicely characterizes what we have in mind: We group by row (dimension 1) for row summaries, by column (dimension 2) for column summaries. In `torch`, the thinking is different: We *collapse* the columns (dimension 2) to compute row summaries, the rows (dimension 1) for column summaries. -The same thinking applies in higher dimensions. Assume, for example, that we been recording time series data for four individuals. There are two features, and both of them have been measured at three times. If we were planning to train a recurrent neural network (much more on that later), we would arrange the measurements like so: +The same thinking applies in higher dimensions. Assume, for example, that we have been recording time series data for four individuals. There are two features, and both of them have been measured at three times. If we were planning to train a recurrent neural network (much more on that later), we would arrange the measurements like so: - Dimension 1: Runs over individuals. @@ -759,7 +759,7 @@ Both indexing and slicing work essentially as in R. There are a few syntactic ex This is because just as in R, indexing in `torch` is one-based. And just as in R, singleton dimensions are dropped. 
-In the below example, we ask for the first column of a two-dimensional tensor; the result is one-dimensional, i.e., a vector: +In the below example, we ask for the first row of a two-dimensional tensor; the result is one-dimensional, i.e., a vector: ```{r} t <- torch_tensor(matrix(1:9, ncol = 3, byrow = TRUE)) @@ -1095,7 +1095,7 @@ m + m2 Error in m + m2 : non-conformable arrays -Neither does it help if we make `m2` a vector. +Neither does it help if we make `m3` a vector. ```{r} m3 <- 1:5 @@ -1163,7 +1163,7 @@ Let's systematize these rules. ### Broadcasting rules -The rules are the following. The first, unspectactular though it may look, is the basis for everything else. +The rules are the following. The first, unspectacular though it may look, is the basis for everything else. (1) We align tensor shapes, *starting from the right*. @@ -1199,4 +1199,4 @@ torch_zeros(4, 3, 2, 1)$add(torch_ones(4, 3, 2)) # error! ------------------------------------------------------------------------ -Now, that was one of the longest, and least applied-seeming, perhaps, chapters in the book. But feeling comfortable with tensors is, I dare say, a precondition for being fluent in `torch`. The same goes for the topic covered in the next chapter, automatic differentiation. But the difference is, there `torch` does *all* the heavy lifting for us. We just need to understand what it's doing. +Now, that was one of the longest, and least applied-seeming, perhaps, chapters in the book. But feeling comfortable with tensors is, I dare say, a precondition for being fluent in `torch`. The same goes for the topic covered in the next chapter, automatic differentiation. But the difference is that `torch` does *all* the heavy lifting for us. We just need to understand what it's doing.
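To round off, the align-from-the-right rule in the broadcasting section can be illustrated with one more sketch (the shapes are chosen just for demonstration): the size-one dimension is expanded, and the missing leading dimension is added.

```{r}
library(torch)

t1 <- torch_tensor(matrix(1:3, ncol = 1)) # shape (3, 1)
t2 <- torch_tensor(c(10, 20))             # shape (2)

# Aligned from the right: (3, 1) vs (-, 2).
# The size-1 dimension expands to 2; the missing leading dimension becomes 3.
(t1 + t2)$size() # 3 2
```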