statsutils is a suite of command-line tools to perform simple statistical data manipulations and analyses from data stored in text files.
The programs that can currently be built with this package are:
name | purpose | progress |
---|---|---|
randint | generate random integers in an interval | 100% |
runif | generate random numbers uniformly on [0, 1) | 100% |
rnorm | generate random numbers following a normal distribution | 100% |
hist | create a histogram | 50% |
stats | basic descriptive statistics | 90% |
pair | compare two paired variables | 90% |
maketrix | format text file into columns | 100% |
transpose | transpose a matrix | 80% |
Examples:

```
$ rnorm -m 10 -s 3 30 | stats
n=30
min=4.208317
max=14.964540
median=9.248378
mean=10.035583
sd=3.033801
absdev=2.535955
mad=3.137011
skew=0.214240
kurtosis=-1.120684
conf_int_95_inf=8.902743
conf_int_95_sup=11.168423
```

```
$ rnorm 200 | hist
-2.22   3 ***
-1.68  12 ************
-1.14  21 *********************
-0.61  36 ************************************
-0.07  48 ************************************************
 0.47  39 ***************************************
 1.01  26 **************************
 1.55   9 *********
 2.09   3 ***
 2.62   2 **
```

```
$ rnorm -m 10 -s 3 30 | maketrix 2 | pair
Analysis for 15 points:
             Column1    Column2    Difference
Means        11.1890    10.5444    -0.6446
SDs           3.1579     2.8622     4.1277
t(14)        13.7228    14.2683    -0.6048
p          1.642e-09  9.845e-10     0.555
Correlation  r-squared  t(13)      p
0.0623       0.0039     0.2251     0.8254
Intercept    Slope
9.9126       0.0565
```
Note that this is a work in progress. More tools are planned (suggestions and contributions are welcome!).
This package takes inspiration from |STAT. |STAT embodies the Unix philosophy according to which each program does one thing and does it well: “To do a new job, build afresh rather than complicate old programs by adding new ‘features’. Expect the output of every program to become the input to another, as yet unknown, program.” (Wikipedia)
It heavily relies on the piping concept of Unix. In the words of |STAT’s author:

> An analysis consists of an extraction of data, optional transformations, and some analysis. Pictorially, this can be shown as:
>
> data → extract → transform → format → analysis → results
>
> where a copy of a subset of the data has been extracted, transformed, reformatted, and analyzed by chaining several programs. Data manipulation functions, sometimes built into analysis programs in other packages, are distinct programs in |STAT. The use of pipelines, signaled with the pipe symbol, |, is the reason for the name |STAT.
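The tools in this package are meant to be chained in exactly the same way. A quick sketch, reusing only the programs and options already shown in the examples above:

```
# generate data → reshape it into two columns → analyze the pairs,
# each step being a separate program connected by a pipe
$ rnorm -m 10 -s 3 40 | maketrix 2 | pair
```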
See https://garyperlman.com/stat/example.html for some examples of usage.
Gary Perlman, the author, wrote a very fine handbook, which is quite readable by newbies.
The original code can be obtained by contacting the author, Gary Perlman. It still compiles, yet, having been written in C in the 80s (predating C89), the code has a few weaknesses, notably the use of many hard-coded limits; `indent` does not manage to parse it because of the macros, and gcc spits out many warnings (e.g., about the use of the dangerous function `gets`).
Before the advent of R, I used to perform most of my data analyses using |STAT, in conjunction with AWK and gnuplot (and yes, it is possible to produce good-looking graphics with gnuplot). I believe there is still a use case for such tools: “quick and dirty” data exploration from the command line, saving the need to launch R.
The original suite features the following data manipulation tools:
name | purpose |
---|---|
abut | join data files beside each other |
colex | column extraction/formatting |
dm | conditional data extraction/transformation |
dsort | multiple key data sorting filter |
linex | line extraction |
maketrix | make matrix format from free-format input |
perm | random/numerical/alphabetical permutation |
probdist | probability distribution functions |
ranksort | convert data to ranks |
repeat | repeat strings or lines in files |
reverse | reverse lines, columns, or characters |
series | generate an additive series of numbers |
transpose | transpose matrix format input |
validata | verify data file consistency |
as well as the following analysis tools:

name | purpose |
---|---|
anova | multi-factor analysis of variance, plots |
calc | interactive algebraic modeling calculator |
contab | contingency tables and chi-square |
desc | descriptions, histograms, frequency tables |
dprime | signal detection d’ and beta calculations |
features | tabulate features of items |
oneway | one-way anova/t-test, error-bar plots |
pair | paired data statistics, regression, plots |
rankind | independent conditions rank order analysis |
rankrel | related conditions rank order analysis |
regress | multiple linear regression and correlation |
stats | simple summary statistics |
ts | time series analysis, plots |
Some tools already have good replacements:

- `abut`, `colex`, `series` and `perm` can be replaced by `paste`, `cut`, `seq` and `shuf` from GNU coreutils;
- `dm` and `linex` can be replaced by the more powerful awk (e.g., GNU Awk);
- `calc` can be replaced by the standalone `calc` command-line calculator.
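For reference, here is a rough sketch of what those replacements look like in practice (file names such as `a.txt` and `b.txt` are just placeholders):

```
$ seq 1 10                            # like series: generate the numbers 1 to 10
$ shuf a.txt                          # like perm: randomly permute the lines of a.txt
$ paste a.txt b.txt                   # like abut: join two files side by side
$ cut -f2 a.txt                       # like colex: extract the second (tab-separated) column
$ awk '$3 > 0 {print $1, $3}' a.txt   # like dm/linex: conditional line and column extraction
```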
We do not aim for an exact replacement: we may also seek inspiration from R's names and outputs when they are clearer. We start from scratch, so as not to infringe Gary Perlman's copyright. Ideally, one day, the very fine handbook of Gary Perlman could be modified to reflect the new tools, but the license forbids it.
If you do not already have a C compiler and the `make` tool, you need to install them. Under Debian/Ubuntu, this is achieved with the command `sudo apt install build-essential`.

Some programs rely on functions from the GNU Scientific Library (GSL). You must therefore install its headers and libraries. Under Ubuntu, this is achieved with `sudo apt install libgsl-dev`.
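To check that the GSL development files are in place, you can query `gsl-config`, which ships with the development package (a quick sanity check, not required for the build):

```
$ gsl-config --version   # prints the installed GSL version
$ gsl-config --libs      # prints the linker flags the build will use
```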
To compile the programs and do a system-wide install, run:

```
./configure
make
sudo make install
```

To install only for the current user, run:

```
./configure --prefix=$HOME
make
make install
```
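With a `$HOME` prefix, the binaries normally end up in `$HOME/bin` (the usual autoconf layout), so make sure that directory is on your `PATH`, for example:

```
export PATH="$HOME/bin:$PATH"
```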
Check out the file `INSTALL` for detailed instructions.
The code is distributed under the GNU General Public License v3 (see the LICENSE file in this folder).
(Looking for Contributors to add to the list!)