Data frames in Julia
Data frames are a data structure for manipulating and analysing tabular data. This article explains, using Julia, how to create data frames from Excel tables, how to perform calculations on them and how to add new columns.
First, download the XLS files used in this OpenLearn course and convert them to the newer XLSX format, since the XLSX.jl package used below does not read legacy XLS files.
We can create a data frame from an XLSX table like so:
using DataFrames, XLSX
df = DataFrame(XLSX.readtable("WHO POP TB some.xlsx", 1))
The output should be:
12×3 DataFrame
 Row │ Country                Population (1000s)  TB deaths
     │ Any                    Any                 Any
─────┼──────────────────────────────────────────────────────
   1 │ Angola                 21472               6900
   2 │ Brazil                 200362              4400
   3 │ China                  1393337             41000
   4 │ Equatorial Guinea      757                 67
   5 │ Guinea-Bissau          1704                1200
   6 │ India                  1252140             240000
   7 │ Mozambique             25834               18000
   8 │ Portugal               10608               140
   9 │ Russian Federation     142834              17000
  10 │ Sao Tome and Principe  193                 18
  11 │ South Africa           52776               25000
  12 │ Timor-Leste            1133                990
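For readers who want to follow along without the spreadsheet, a data frame can also be built directly in memory. The following is only a sketch: the name df_small is hypothetical, the three rows are copied from the table above, and the columns are deliberately created as Vector{Any} to mimic the untyped columns produced by the import.

```julia
using DataFrames

# Hypothetical in-memory stand-in for the spreadsheet (three rows copied
# from the table above). Any[...] keeps the element type as Any on
# purpose, to mimic what the XLSX import produced.
df_small = DataFrame(
    "Country" => Any["Angola", "Brazil", "China"],
    "Population (1000s)" => Any[21472, 200362, 1393337],
    "TB deaths" => Any[6900, 4400, 41000],
)

println(size(df_small))   # number of rows and columns as a tuple
println(names(df_small))  # column names as strings
```

Passing column name => vector pairs keeps the columns in the order given, which is convenient when the names contain spaces or parentheses, as they do here.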
Having the data in memory, we can, for example, sort the rows by TB deaths.
sort(df, "TB deaths")
The result is:
12×3 DataFrame
 Row │ Country                Population (1000s)  TB deaths
     │ Any                    Any                 Any
─────┼──────────────────────────────────────────────────────
   1 │ Sao Tome and Principe  193                 18
   2 │ Equatorial Guinea      757                 67
   3 │ Portugal               10608               140
   4 │ Timor-Leste            1133                990
   5 │ Guinea-Bissau          1704                1200
   6 │ Brazil                 200362              4400
   7 │ Angola                 21472               6900
   8 │ Russian Federation     142834              17000
   9 │ Mozambique             25834               18000
  10 │ South Africa           52776               25000
  11 │ China                  1393337             41000
  12 │ India                  1252140             240000
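Note that sort returns a new data frame and leaves df unchanged. As a small sketch of two common variations (descending order and in-place sorting), here is an example on a hypothetical three-row frame named demo, with values taken from the table above:

```julia
using DataFrames

# Hypothetical three-row frame; the values come from the table above.
demo = DataFrame(
    "Country" => ["Angola", "Brazil", "China"],
    "TB deaths" => [6900, 4400, 41000],
)

# rev=true sorts in descending order; sort returns a new data frame
# and leaves demo itself untouched.
by_deaths_desc = sort(demo, "TB deaths", rev=true)

# sort! (with the exclamation mark) reorders the rows of demo in place.
sort!(demo, "TB deaths")
```

The trailing exclamation mark follows the usual Julia convention for functions that modify their argument.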
As the second header row shows, the columns are not properly typed: every column has the element type Any. The following code converts the two numeric columns to Int64 (the Country column is left as Any):
df[!, "Population (1000s)"] = convert.(Int64, df[!, "Population (1000s)"])
df[!, "TB deaths"] = convert.(Int64, df[!, "TB deaths"])
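When there are many columns, naming a target type for each one becomes tedious. One possible alternative, sketched here on a hypothetical frame named typed, is to broadcast identity over every column: broadcasting rebuilds the vector with the narrowest element type that fits its values. This is a different technique from the convert calls above, not a replacement prescribed by either package.

```julia
using DataFrames

# Hypothetical frame whose columns are untyped (Vector{Any}), as after
# the XLSX import.
typed = DataFrame(
    "Country" => Any["Angola", "Brazil"],
    "TB deaths" => Any[6900, 4400],
)

# identity.(col) returns a new vector whose element type is narrowed to
# fit the values it holds (String and Int here).
for name in names(typed)
    typed[!, name] = identity.(typed[!, name])
end

println(eltype.(eachcol(typed)))
```

Unlike convert.(Int64, ...), this approach does not raise an error when a column holds mixed values; such a column simply keeps a wider element type, so it is worth checking the result.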
Finally, we can add a new column to the data frame like so:
df[!, "TB deaths (per 100,000)"] = df[!, "TB deaths"] * 100 ./ df[!, "Population (1000s)"]
The data frame now looks as follows:
12×4 DataFrame
 Row │ Country                Population (1000s)  TB deaths  TB deaths (per 100,000)
     │ Any                    Int64               Int64      Float64
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │ Angola                              21472       6900                 32.1349
   2 │ Brazil                             200362       4400                  2.19603
   3 │ China                             1393337      41000                  2.94258
   4 │ Equatorial Guinea                     757         67                  8.85073
   5 │ Guinea-Bissau                        1704       1200                 70.4225
   6 │ India                             1252140     240000                 19.1672
   7 │ Mozambique                          25834      18000                 69.6756
   8 │ Portugal                            10608        140                  1.31976
   9 │ Russian Federation                 142834      17000                 11.9019
  10 │ Sao Tome and Principe                 193         18                  9.32642
  11 │ South Africa                        52776      25000                 47.37
  12 │ Timor-Leste                          1133        990                 87.3786
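The factor of 100 comes from the units: the population is given in thousands, so deaths / (population × 1000) × 100,000 simplifies to deaths × 100 / population. The same column can also be derived with transform! and ByRow from DataFrames.jl. The sketch below uses a hypothetical two-row frame named rates, with values copied from the table above.

```julia
using DataFrames

# Hypothetical two-row frame; the values come from the table above.
rates = DataFrame(
    "Country" => ["Angola", "Brazil"],
    "Population (1000s)" => [21472, 200362],
    "TB deaths" => [6900, 4400],
)

# Population is in thousands, so
#   deaths / (population * 1000) * 100_000 == deaths * 100 / population
# ByRow applies the function to each row; transform! appends the result
# to rates as a new column.
transform!(rates,
    ["TB deaths", "Population (1000s)"] =>
        ByRow((deaths, pop) -> deaths * 100 / pop) =>
        "TB deaths (per 100,000)")
```

Writing the calculation as a source => function => target pair keeps the input columns, the formula and the new column name together in one expression.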
This is just a brief introduction to data frames in Julia. Stay tuned :)