Summary statistics jmp

I do appreciate the other answers, but it seems to me that some topological background would give a much-needed structure to the responses. Let's start by establishing the definitions of the domains:

A categorical variable is one whose domain contains elements with no known relationship between them (thus we have only categories). Examples depend on the context, but I'd say that in the general case it is difficult to compare days of the week: is Monday before Sunday, and if so, what about next Monday? Maybe an easier, but less used, example is pieces of clothing: without some context that would make sense of an order, it is difficult to say whether trousers come before jumpers or vice versa.

An ordinal variable is one that has a total order defined over the domain, i.e. for every two elements of the domain, we can tell that either they are identical, or one is bigger than the other. A Likert scale is a good example of an ordinal variable: "somewhat agree" is definitely closer to "strongly agree" than "disagree" is.

An interval variable is one whose domain defines distances between elements (a metric), thus allowing us to define intervals.

As the most common sets that we use, the natural and real numbers have a standard total order and metric. This is why we need to be careful when we assign numbers to our categories. If we are not careful to disregard order and distance, we practically convert our categorical data into interval data. When one uses a machine learning algorithm without knowing how it works, one risks making such assumptions unwittingly, thus potentially invalidating one's own results. For example, most popular deep learning algorithms work with real numbers, taking advantage of their interval and continuous properties. As another example, think of 5-point Likert scales, and how the analysis we apply to them assumes that the distance between "strongly agree" and "agree" is the same as between "disagree" and "neither agree nor disagree". It is hard to make a case for such a relationship.

Another set that we often work with is strings. There are a number of string similarity metrics that come in handy when working with strings, but they do not always fit the problem. For example, for addresses, John Smith Street and John Smith Road are quite close in terms of string similarity, but obviously represent two different entities that could be miles apart.

Ok, now let's see how some summary statistics fit into this. Since statistics works with numbers, its functions are well defined over intervals. But let's see examples of whether and how we could generalise them to categorical or ordinal data:

- mode: with both categorical and ordinal data, we can tell which element is most frequently used. Then we can also derive all the other measures listed in another answer. A confidence interval could also be useful.
- median: as another answer says, as long as you have an order, you can derive your median.
- mean, but also standard deviation, percentiles, etc.: you get these only with interval data, due to the need for a distance metric.

At the end, I want to stress again that the order and metrics you define on your data are very contextual. This should be obvious by now, but let me give you a last example. When working with geographical locations, we have lots of different ways to approach them:

- if we are interested in the distance between them, we can work with their geolocation, which basically gives us a two-dimensional numerical space, thus interval data.
- if we are interested in their part-of relationship, we can define an order (e.g. a street is part of a city, two cities are equal, a continent contains a country).
- if we are interested in whether two strings represent the same address, we could work with some string distance that tolerates spelling mistakes and swapped word positions, but makes sure to distinguish different terms and names.
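To make the three levels concrete, here is a minimal Python sketch of which summary statistic each level supports. The data and the helper names (`mode_of`, `ordinal_median`, `LIKERT_ORDER`) are made up for illustration, and the sketch assumes that the Likert labels are ordered as listed. Note that the ordinal median picks an actual element of the data rather than averaging ranks, since averaging would already assume equal distances between the labels.

```python
from collections import Counter
from statistics import mean, median_low

# Hypothetical data, invented for this example.
clothes = ["trousers", "jumper", "trousers", "socks"]            # categorical
LIKERT_ORDER = [                                                  # ordinal (assumed order)
    "strongly disagree",
    "disagree",
    "neither agree nor disagree",
    "agree",
    "strongly agree",
]
responses = [
    "agree",
    "strongly agree",
    "disagree",
    "agree",
    "neither agree nor disagree",
]
distances_km = [1.0, 2.0, 6.0]                                    # interval


def mode_of(values):
    """Mode only needs equality between elements, so it works for categories."""
    return Counter(values).most_common(1)[0][0]


def ordinal_median(values, order):
    """Median only needs an order; no distances between labels are used.

    median_low returns an element that is present in the data, so we never
    have to average two labels.
    """
    rank = {label: position for position, label in enumerate(order)}
    middle_rank = median_low(rank[v] for v in values)
    return order[middle_rank]


print(mode_of(clothes))                          # trousers
print(ordinal_median(responses, LIKERT_ORDER))   # agree
print(mean(distances_km))                        # 3.0 (needs a metric)
```

Calling `mean` on the Likert ranks would run without error, which is exactly the trap described above: the numbers silently bring a distance with them that the original labels never had.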
