The default behaviour of many functions is to reject data containing missing values -- this is natural when the result would depend on the missing value, were it not missing. You should NOT use missing values in boolean tests: if you test whether two numbers are equal, and one or both of them are missing, then you cannot conclude: the result will be NA. A data frame may be seen as a list of vectors, each with the same length.

Usually, the table has one row for each subject in the experiment, and one column for each variable measured in the experiment -- as the different variables measure different things, they might have different types: some will be quantitative (numbers; each column may contain a measurement in a different unit), others will be qualitative (i.e., factors).

The "str" command prints out the structure of an object (any object) and displays a part of the data it contains. The "summary" command prints concise information about an object (here, a data frame). We can turn the columns of a data frame into variables with the "attach" command. Do not forget to "detach" the data frame afterwards. The "merge" command performs a join: more precisely, if you have two data frames, a with columns x, y, z and b with columns x1, x2, y, z, and certain observations (rows) of a correspond to certain observations of b, the command merges them to yield a data frame with columns x, x1, x2, y, z.

By default, the join is over the columns present in both data frames, but you can restrict it to a subset of them, with the "by" argument.
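A minimal sketch of such a join, with hypothetical data frames a and b as in the description above:

```r
# Hypothetical data frames: 'a' has columns x, y, z; 'b' has columns x1, x2, y, z.
a <- data.frame(x = 1:3,  y = c("u", "v", "w"), z = c(10, 20, 30))
b <- data.frame(x1 = 4:6, x2 = 7:9, y = c("u", "v", "w"), z = c(10, 20, 30))

merge(a, b)             # join on all common columns (here, y and z)
merge(a, b, by = "y")   # restrict the join to a subset of the common columns
```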

Data frames are often used to store data to be analyzed. We shall detail those examples later -- do not be frightened if you have never heard of "regression", we shall shortly demystify this notion. We shall see in a separate section how to transform data frames, because there are several ways of putting the result of an experiment in a table -- but usually, we shall prefer the one with the most rows and the fewest columns.

Some people may advise you to use the "subset" command to extract subsets of a data frame. Actually, you can do the same thing with the basic subsetting syntax -- which is more general: the "subset" function is but a convenience wrapper around it. Vectors only contain simple types (numbers, booleans or strings); lists, on the contrary, may contain anything, for instance data frames or other lists.
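For instance (a small sketch with made-up data):

```r
d <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))
subset(d, x > 2, select = y)    # the convenience wrapper
d[d$x > 2, "y", drop = FALSE]   # the same, with the basic subsetting syntax
```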

They can be used to store complex data, for instance, trees. They can also be used, simply, as hash tables. You can access one element with the "[[" operator, and several elements with the "[" operator. The result of most statistical analyses is not a simple number or array, but a list containing all the relevant values.
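A quick illustration of the two operators:

```r
x <- list(a = 1:3, b = "hello", d = list(TRUE, FALSE))
x[["a"]]         # one element: here, the vector 1:3
x$a              # the same element, accessed by name
x[c("a", "b")]   # several elements: the result is still a list
```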

Matrices are 2-dimensional tables but, contrary to data frames (whose type may vary from one column to the next), their elements all have the same type. We have already seen the "cbind" and "rbind" functions that put data frames side by side or on top of each other: they also work with matrices. Actually, one rarely needs the inverse of a matrix -- we usually just want to multiply a given vector by this inverse: this operation is simpler, faster and numerically more stable.

We do not really need it, because the pivot algorithm is already implemented in the "solve" command. We shall see later another application: the simulation of non-independent normal variables with a given variance-covariance matrix. When you look at them, matrices are rather complicated (there are a lot of coefficients). However, if you look at the way they act on vectors, it looks rather simple: they often seem to extend or shrink the vectors, depending on their direction.
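For instance, to solve a linear system without explicitly computing the inverse:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(1, 2)
solve(A, b)   # solves A %*% x == b directly: simpler, faster, more stable
solve(A)      # the inverse itself -- rarely needed
eigen(A)      # how A acts on vectors: eigenvalues and eigenvectors
```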

Diagonalizable matrices sound good, but there may still be a few problems. First, the eigenvalues and the eigenvectors can be complex -- if you want to interpret them as real-world quantities, it is a bad start. However, the matrices you will want to diagonalize are often symmetric real matrices: they are diagonalizable, with real eigenvalues and eigenvectors.

Second, not all matrices are diagonalizable. However, the set of non-diagonalizable matrices has zero measure: in particular, if you take a matrix at random, in some "reasonable" way (here, "reasonable" means "along a probability measure absolutely continuous with respect to the Lebesgue measure on the set of square matrices of size n"), the probability that it be diagonalizable over the complex numbers is 1 -- we say that matrices are almost surely diagonalizable.

Should you be interested in the rare cases when the matrices are not diagonalizable (for instance, if you are interested in matrices with integer, bounded coefficients), you can look into the Jordan decomposition, which generalizes the diagonalization and works with any matrix. This decomposition may also be seen as a sum of matrices of rank 1, such that the first matrices in this sum approximate the initial matrix "best". We can meet this decomposition in least squares estimation: when we try to minimize the absolute value of Ax-b, this amounts to solving the normal equations.

Contingency tables are computed with the "table" function; when there are more than two variables, the result is an array. One may attach meta-data to an object: these are called "attributes". For instance, the names of the elements of a list are stored in an attribute.
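For instance, with two made-up qualitative variables:

```r
x <- factor(c("a", "b", "a", "a", "b"))
y <- factor(c("u", "u", "v", "v", "u"))
table(x)      # counts for a single variable
table(x, y)   # a 2-dimensional contingency table; more variables give an array
```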

Some people even suggest using this to "hide" code -- but choosing an interpreted language is a very bad idea if you want to hide your code. Typically, when the data has a complex structure, you use a list; but when the bulk of the data has a very simple, table-like structure, you store it in an array or data frame and put the rest in the attributes. For instance, here is a chunk of an "lm" object (the result of a regression). We shall soon see another application of attributes: the notion of class -- the class of an object is just the value of its "class" attribute, if any.

If you want to use a complex object, obtained as the result of a certain command, by extracting some of its elements, or if you want to browse through it, the printing command is not enough: we need other means to peer inside an object. The "unclass" command removes the class of an object: only the underlying type (usually "list") remains. As a result, it is printed with the default "print" method. The "str" function prints the contents of an object and truncates all the vectors it encounters: thus, you can peer into large objects.
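For instance, with the result of a regression on the "cars" data set (which ships with R):

```r
r <- lm(dist ~ speed, data = cars)   # 'cars' ships with R
class(r)   # "lm"
names(r)   # the components you can extract, e.g., r$coefficients
str(r)     # the structure, with the vectors truncated
```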

Finally, to get an idea of what you can do with an object, you can always look at the code of its "print" or "summary" methods. The "deparse" command produces a character string whose evaluation will yield the initial object (the resulting syntax is a bit strange: if you were to build such an object from scratch, you would not proceed that way).

First example: we have measured several qualitative variables on several hundred subjects. The data may be written down as a table, one line per subject, one column per variable. We can also use a contingency table (it is only a good idea if there are few variables, otherwise the array would mainly contain zeroes; if there are k variables, the array would have k dimensions).

The "ftable" command presents the result in a slightly different way, more readable if there are more variables. Another example: we made the same experiment, with the same subjects, three times. We can represent the data with one row per subject, with several results for each. We can also use one row per experiment, with the number of the subject, the number of the experiment (1, 2 or 3) and the result.
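For instance, with the 3-dimensional "UCBAdmissions" contingency table that ships with R:

```r
UCBAdmissions          # a 3-dimensional contingency table (admissions data)
ftable(UCBAdmissions)  # the same counts, flattened into a more readable layout
```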

Exercise: Write a function to turn one representation into the other. Hint: you may use the "split" command, which separates data along a factor. Another example: same situation, but this time, the number of experiments per subject is not constant.

The first representation can no longer be a data frame: it can be a list of vectors (one vector for each subject). The second representation is unchanged. I never use those functions: feel free to skip to the next section, which presents more general and powerful alternatives. The same kind of computation exists in SQL (the language spoken by databases -- to simplify things, you can consider that a database is a set of data frames).

For instance, if you store your personal accounting in a database, giving, for each expense, the amount and the nature (rent, food, transportation, taxes, books, cinema, etc.), you can ask for the total expense for each nature. The "by" function assumes that you have a vector, that you want to cut into pieces, and on whose pieces you want to apply a function. Sometimes, it is not a vector, but several: all the columns of a data frame. You can then replace the "by" function by "aggregate". The "apply" function applies a function (mean, quartile, etc.) to each row or column of a table.
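A sketch of the accounting example, with made-up data:

```r
d <- data.frame(
  nature = c("rent", "food", "food", "books", "rent"),
  amount = c(500, 60, 45, 30, 500)
)
tapply(d$amount, d$nature, sum)            # total expense for each nature
aggregate(amount ~ nature, data = d, sum)  # the same, returned as a data frame
```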

It also works in higher dimensions. The second argument indicates the indices along which the program should loop, i.e., the dimensions that are kept in the result. The "tapply" function groups the observations along the values of one or several factors and applies a function (mean, etc.) to each group. The "by" command is similar. The "sapply" function applies a function to each element of a list or vector, simplifying the result if possible.

The "lapply" function is similar but returns a list. In particular, the "sapply" function can apply a function to each column of a data frame. The "split" command cuts the data, as the "tapply" function does, but does not apply any function afterwards.
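A few quick illustrations:

```r
l <- list(a = 1:3, b = 4:10)
sapply(l, mean)   # result simplified into a named vector
lapply(l, mean)   # same computation, but the result is a list
sapply(cars, mean)                            # each column of a data frame
split(1:6, c("a", "b", "a", "b", "a", "b"))   # cut the data along a factor
```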

At the beginning of this document, list the most important sections, and list what the reader is expected to be able to do after reading this document. In R, many commands handle vectors or tables, allowing an almost loop-less programming style -- parallel programming. Thus, the computations are faster than with an explicitly written loop, because R is an interpreted language.

The resulting programming style is very different from what you may be used to: here are a few exercises to warm you up. We shall need the table functions we have just introduced, in particular "apply". Many people consider the "apply" function as a loop: in the current implementation of R, it might be implemented as a loop, but if you run R on a parallel machine, it could be different -- all the operations could be run at once.

This really is parallelization. Exercise: Let x be a table. Compute the sum of its rows and the sum of each of its columns. If x is the contingency table of two qualitative variables, compute the theoretical contingency table under the hypothesis that the two variables are independent.
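A possible solution sketch for this exercise:

```r
x <- matrix(c(10, 20, 30, 40), nrow = 2)   # an arbitrary contingency table
apply(x, 1, sum)   # row sums (also rowSums(x))
apply(x, 2, sum)   # column sums (also colSums(x))
# Theoretical table under independence: outer product of the margins
outer(rowSums(x), colSums(x)) / sum(x)
```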

Exercise: Let x be a boolean vector. Count the number of sequences ("runs") of zeros (for instance, in the example vector, there are 6 runs: 00 0 00 0 0 0). Count the number of sequences of 1. Count the total number of sequences. Same question for a factor with more than two levels. Present the result as a table. Let r be the return of a financial asset. The clustered return is the accumulated return for a sequence of returns of the same sign.

The trend number is the number of steps in such a sequence. The average return is their ratio. Compute these quantities.

You do not print a string with the "print" function but with the "cat" function.
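Going back to the run-counting exercise, one possible sketch uses the "rle" (run-length encoding) function:

```r
x <- c(0, 0, 1, 0, 1, 1, 0)
r <- rle(x)
sum(r$values == 0)   # number of runs of zeros
sum(r$values == 1)   # number of runs of ones
length(r$values)     # total number of runs
```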

The "print" function only gives you the representation of the string. You can concatenate strings with the "paste" function. To get the desired result, you may have to play with the "sep" argument. Sometimes, you do not want to concatenate strings stored in different variables, but the elements of a vector of strings.

If you want the result to be a single string, and not a vector of strings, you must add a "collapse" argument. In some circumstances, you may even need both (the "cat" function does not accept this "collapse" argument). The "nchar" function gives the length of a string (I am often looking for a "strlen" function: there it is).
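A few examples:

```r
paste("x", 1:3, sep = "")                # "x1" "x2" "x3"
paste(c("a", "b", "c"), collapse = "-")  # a single string: "a-b-c"
cat("a", "b", "c", sep = "-")            # prints a-b-c (no "collapse" argument)
nchar("hello")                           # 5
```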

It may seem out of place to speak of regular expressions in a document about statistics: it is not. We shall see (well, not in the current version of this document, but soon -- I hope) that stochastic regular expressions are a generalization of Hidden Markov Models (HMM), which are the analogue of State Space Models for qualitative time series.

If you understood the last sentence, you probably should not be reading this. The "regexpr" function performs the same task as the "grep" function, but gives a different result: the position and length of the first match (or -1 if there is none). Sometimes, you want approximate matches, not exact ones, accounting for potential spelling or typing mistakes: the "agrep" function provides such a "fuzzy" matching.
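A few examples of these functions:

```r
x <- c("apple", "banana", "pineapple")
grep("apple", x)                       # indices of the matching strings
regexpr("apple", x)                    # position of the first match, or -1
agrep("bananna", x, max.distance = 1)  # fuzzy matching, tolerating one edit
gsub("a", "o", "banana")               # "bonono"
```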

It is used by the "help.search" function. The "gsub" function replaces each occurrence of a string (a regular expression, actually) by a string. When you read data from various sources, you often run into date format problems: different people and different software use different formats and different conventions. The only unambiguous, universal format is the ISO one, not really used by people but rather by programmers: dates are coded as YYYY-MM-DD.

The main rationale for this format is that when you write a numeric quantity, you start with the largest units and end with the smallest. Why should it be different for dates? We should start with the largest unit, the years, proceed with the next largest, the months, and end with the smallest, the days. This format has an advantage: if you want to sort data according to the date, your program just has to be able to sort strings, it need not be aware of dates.

It does not look ambiguous (hours, minutes, seconds, hundredths of seconds -- for some applications, you may even need thousandths of seconds), but the time zone is missing. Most of the problems you have with times come from those time zones. If you want to extract part of a date, you can use the "format" function.

For instance, if I want to aggregate my data by month, I can use the "format" function to extract the month from each date. There is another caveat about the use of dates as indices to arrays: as a date is actually a number, if you use it as an index, R will understand the number used to code the date as a row or column number, not as a row or column name.
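Such a by-month aggregation might look like this (a sketch, with made-up daily data):

```r
dates <- as.Date(c("2003-01-15", "2003-01-20", "2003-02-03"))
x <- c(10, 20, 30)
month <- format(dates, "%Y-%m")   # "2003-01" "2003-01" "2003-02"
tapply(x, month, sum)             # one aggregated value per month
```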

When using dates as indices, always convert them into strings. The two classes are interchangeable; only the internal representation changes (use the first, more compact, one in data frames).

In particular, if your data are not gaussian (i.e., not bell-shaped), centering and scaling them will not make them gaussian. Furthermore, the presence of even a single extreme value ("outlier") will change the value of the mean and the standard deviation and therefore change the scaling. Sometimes, other transformations might make the distribution closer to a gaussian one.

For instance, for skewed distributions, taking the logarithm or the square root is often a good idea (other sometimes-used transformations include power scales, arcsine, logit, probit, or the Fisher transformation).

In some situations, other transformations are meaningful: power scales, arcsine, logit, probit, Fisher, etc. Whatever the analysis you perform, it is very important to look at your data and to transform them if needed and possible. If you really want your distribution to be bell-shaped, you can "forcefully normalize" it -- but bear in mind that this discards relevant information: for instance, if the distribution was bimodal, i.e., had two peaks, this feature will be lost.
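One way of "forcefully normalizing" a sample (a sketch: replace each value by the gaussian quantile of its rank, keeping only the order of the observations):

```r
x <- rexp(100)                         # a skewed, decidedly non-gaussian sample
y <- qnorm(rank(x) / (length(x) + 1))  # rank-based transformation
hist(y)                                # bell-shaped, whatever the initial shape
```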

We have just seen that, for a centered statistical series (i.e., one whose mean is zero), the second moment is the variance. One can show (exercise) that:. The third moment of a centered statistical series is called skewness. For a symmetric series, it is zero. To check if a series is symmetric and to quantify the departure from symmetry, it suffices to compute the third moment of the normalized series.

The fourth moment tells if a series has fatter tails (i.e., more extreme values) than a gaussian one. The fourth moment of a gaussian random variable is 3; one defines the kurtosis as the fourth moment minus 3, so that the kurtosis of a gaussian distribution is zero, that of a fat-tailed one positive, and that of a no-tail one negative. You stumble upon this notion, for instance, when you study financial data: we often assume that the data we study follow a gaussian distribution, but in finance (more precisely, with high-frequency intra-day financial data), this is not the case.
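These quantities are straightforward to compute from the normalized series:

```r
x <- rnorm(1000)
z <- (x - mean(x)) / sd(x)   # the normalized series
mean(z^3)                    # skewness: 0 for a symmetric distribution
mean(z^4) - 3                # excess kurtosis: 0 for a gaussian, >0 for fat tails
```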

The problem is all the more serious as the data exhibit an abnormal number of extreme values ("outliers"). To see it, we have estimated the density of the returns and we overlay this curve with the density of a gaussian distribution. The vertical axis is logarithmic.

You can notice two things: first, the distribution has a higher, narrower peak; second, there are more extreme values.

You can do several simulations, as we have just done, and look at the distribution of the resulting values: by comparison, are the values that come from our data that large? We shall see again, later, that kind of measurement of departure from gaussianity -- the computation we just made can be called a "parametric bootstrap p-value computation". Moments allow you to spot non-gaussian features in your data, but they are very imprecise (they have a large variance) and are very sensitive to outliers -- simply because they are defined with powers, which amplify those problems.

One can define similar quantities without any power, with a simple linear combination of order statistics: the L-moments. L1 is the usual mean; L2 is a measure of dispersion: the average distance between two observations; L3 is a measure of asymmetry, similar to the skewness; L4 is a measure of tail thickness, similar to the kurtosis. We can see the data we are studying as an untidy bunch of numbers, in which we cannot see anything (that is why you will often see me using the "str" command, which only displays the beginning of the data: displaying everything would not be enlightening).

There is a simple way of seeing something in that bunch of numbers: just sort them. That is better, but we still have hundreds of numbers, and we still do not see anything. In those ordered numbers, you may remark that the first two digits are often the same. Furthermore, after those two digits, there is only one digit left. Thus, we can put the numbers in several classes or "bins" according to the first two digits and write, on each bin, the remaining digit.

This is called a "stem-and-leaf plot". It is just an orderly way of writing down our bunch of numbers: we have not summarized the data yet, we have not discarded any information or any number. We could also do that by hand -- before the advent of computers, people used to -- but actually, it is no longer done that way.
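In R, the "stem" function draws such a plot; for instance, with the "faithful" data set that ships with R:

```r
x <- faithful$eruptions   # eruption durations of the Old Faithful geyser
sort(x)[1:10]             # the first few ordered values
stem(x)                   # stem-and-leaf plot
```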

Yet, if there are many data points, or if there are several observations with the same value, the resulting graph is not very readable. We can add some noise so that the points do not end up on top of one another. Exercise: to familiarize yourself with the "rnorm" command (and a few others), try to do that yourself. The two horizontal parts correspond to the two peaks of the histogram, to the two modes of the distribution.

Actually, it is just a scatter plot with an added dimension. The "rug" function adds a scatter plot along an axis. It also helps to see that the data is discrete -- in a scatter plot with no added noise (no jitter), you would not see it.

In some cases, the observations (the subjects) are named: we can add the names to the plot (it is the same plot as above, unsorted and rotated by 90 degrees). From a theoretical point of view, the cumulative distribution curve is very important, even if its interpretation does not spring to the eye.

In the following examples, we present the cumulative distribution plot of several distributions. Box-and-whiskers plots are a simplified view of the cumulative distribution plot: they just contain the quartiles. A box-and-whiskers plot is a graphical representation of the 5 "quartiles" (minimum, first quartile, median, third quartile, maximum).

On this example, we can clearly see that the data are not symmetric: thus, we know that it would be a bad idea to apply statistical procedures that assume they are symmetric -- or even normal. This graphical representation of the quartiles is simpler and more directly understandable than the following one, in terms of areas. In this example, the four areas are equal; this highlights the often-claimed fact that the human eye cannot compare areas. In this example, there are no outliers.

If there are only a few outliers, really isolated, they might be errors -- yes, in real life, the data are "dirty". Then, we usually transform the data, by applying a simple and well-chosen function, so that they become gaussian (more about this later).

This was a second use of box-and-whiskers plots: spotting outliers. Their presence may be perfectly normal (but you must beware that they might bias later computations -- unless you choose robust algorithms); they may be due to errors, which are to be corrected; they may also reveal that the distribution is not gaussian and naturally contains many outliers ("fat tails") -- more about this later, when we mention the "extreme distributions" and high-frequency intra-day financial data.
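A quick sketch of the quartiles and the corresponding box-and-whiskers plot, with a skewed sample:

```r
x <- rexp(100)                 # a skewed sample, with a few large values
quantile(x)                    # minimum, quartiles, median, maximum
boxplot(x, horizontal = TRUE)  # the same five numbers, plus flagged outliers
```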

You could plot boxes whose whiskers would extend farther for larger samples, but beware: even if the presence of extreme values in larger samples is normal, it can have an important leverage effect, an important influence on the results of your computations.

You can also represent those data with a histogram: put each observation in a class (the computer can do this) and, for each class, plot a vertical bar whose height or area is proportional to the number of elements. There is a big, unavoidable problem with histograms: a different choice of classes can lead to a completely different histogram.

But their position, as well, can completely change the histogram and make it look sometimes symmetric, sometimes not. For instance, in neither of the following histograms does the peak look symmetric, but the asymmetry is not in the same direction.
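To see this sensitivity yourself (a sketch using the "faithful" data set):

```r
x <- faithful$eruptions
op <- par(mfrow = c(1, 2))   # two plots side by side
hist(x, breaks = 5)          # few classes
hist(x, breaks = 30)         # many classes: a rather different picture
par(op)
```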

You can replace the histogram with a curve, a "density estimation". If you see the data as a sum of Dirac masses, you can obtain such a function by convolving this sum with a well-chosen "kernel", e.g., a gaussian density. This density estimation can be adaptive: the bandwidth of this gaussian kernel can change along the sample, being larger where the point density is lower (the "density" function does not use an adaptive kernel -- check the akj function in the quantreg package if you want one).

Density estimations still have the first problem of histograms: a different kernel may yield a completely different curve -- but the second problem disappears. One can add many other elements to a histogram.

For instance, a scatterplot, or a gaussian density to compare with the estimated density. When you look at your data, one of the first questions you may ask is "are they symmetric?"
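For instance (a sketch):

```r
d <- faithful$eruptions
hist(d, probability = TRUE, col = "light blue")
lines(density(d), lwd = 2)                            # kernel density estimation
curve(dnorm(x, mean(d), sd(d)), add = TRUE, lty = 2)  # gaussian, for comparison
rug(d)                                                # the data themselves
```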

The following plot simply sorts the data and tries to pair the first value with the last, the second with the second from the end, etc.

The following is a plot of the distance to the median of the (n-i)-th point versus that of the i-th point. The problem is that it is rather hard to see if you are "far away" from the line: the more points the sample has, the more the plot looks like a line.

Here, even with a large number of points, we are still far away from a line.

In this part, after quickly listing the main characteristics of the language, we present the basic data types, how to create them, how to explore them, how to extract pieces of them, and how to modify them. We then jump to more advanced subjects (most of which can -- should? -- be skipped on first reading).

Actually, R is a programming language: as such, it has the usual control structures (loops, conditionals, recursion, etc.). Switch (I do not like this command -- this is probably the last time you see it in this document):. In particular, if you need it, you can write functions that take other functions as arguments -- and in case you wonder, yes, you need it.

When you call a function, you can use the argument names, without any regard to their order (this is very useful for functions that expect many arguments -- in particular arguments with default values). After the arguments, in the definition of a function, you can put three dots representing the arguments that have not been specified and that can be passed through to another function (very often, the "plot" function).

In particular, you cannot write a function that modifies a global variable. Well, if you really want, you can: see the "Dirty Tricks" part -- but you should not. But sometimes, it does not work that well: if we want to peer inside the "predict" function that we use for predictions of linear models, we get.
This is a generic function: we can use the same function on different types of objects.