Data frames in Julia

A data frame is a data structure for manipulating and analysing tabular data. This article shows, using Julia, how to create a data frame from an Excel table, how to sort and retype its columns, and how to add a new column computed from existing ones.

First download the XLS files used in this OpenLearn course and convert them to the "modern" XLSX format, which is the format the XLSX.jl package used below reads.

We can create a data frame from an XLSX table like so:

using DataFrames, XLSX

# Read the first worksheet of the workbook into a data frame.
df = DataFrame(XLSX.readtable("WHO POP TB some.xlsx", 1))

The output should be:

12×3 DataFrame
 Row │ Country                Population (1000s)  TB deaths 
     │ Any                    Any                 Any       
─────┼──────────────────────────────────────────────────────
   1 │ Angola                 21472               6900
   2 │ Brazil                 200362              4400
   3 │ China                  1393337             41000
   4 │ Equatorial Guinea      757                 67
   5 │ Guinea-Bissau          1704                1200
   6 │ India                  1252140             240000
   7 │ Mozambique             25834               18000
   8 │ Portugal               10608               140
   9 │ Russian Federation     142834              17000
  10 │ Sao Tome and Principe  193                 18
  11 │ South Africa           52776               25000
  12 │ Timor-Leste            1133                990
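
For readers who do not have the spreadsheet at hand, here is a minimal sketch that builds the same table in memory. The values are copied from the output above; using `Any[...]` vectors is an assumption made here so that the columns stay untyped, as they are after the import.

```julia
using DataFrames

# Values copied from the table above. Any[...] keeps the columns untyped,
# mirroring what the spreadsheet import produced.
df = DataFrame(
    "Country" => Any["Angola", "Brazil", "China", "Equatorial Guinea",
                     "Guinea-Bissau", "India", "Mozambique", "Portugal",
                     "Russian Federation", "Sao Tome and Principe",
                     "South Africa", "Timor-Leste"],
    "Population (1000s)" => Any[21472, 200362, 1393337, 757, 1704, 1252140,
                                25834, 10608, 142834, 193, 52776, 1133],
    "TB deaths" => Any[6900, 4400, 41000, 67, 1200, 240000,
                       18000, 140, 17000, 18, 25000, 990],
)

println(size(df))  # (12, 3)
```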

With the data in memory, we can, for example, sort the rows by the number of TB deaths:

sort(df, "TB deaths")

The result is:

12×3 DataFrame
 Row │ Country                Population (1000s)  TB deaths 
     │ Any                    Any                 Any       
─────┼──────────────────────────────────────────────────────
   1 │ Sao Tome and Principe  193                 18
   2 │ Equatorial Guinea      757                 67
   3 │ Portugal               10608               140
   4 │ Timor-Leste            1133                990
   5 │ Guinea-Bissau          1704                1200
   6 │ Brazil                 200362              4400
   7 │ Angola                 21472               6900
   8 │ Russian Federation     142834              17000
   9 │ Mozambique             25834               18000
  10 │ South Africa           52776               25000
  11 │ China                  1393337             41000
  12 │ India                  1252140             240000
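
As a small variation, the sketch below sorts in descending order and shows that `sort` returns a new data frame while `sort!` reorders the existing one. It uses a shortened, hypothetical four-row table with typed columns so that it runs on its own.

```julia
using DataFrames

# A shortened stand-in for the table above (four rows only).
df = DataFrame(
    "Country" => ["Angola", "Brazil", "China", "India"],
    "TB deaths" => [6900, 4400, 41000, 240000],
)

# rev=true sorts from the largest to the smallest value.
by_deaths_desc = sort(df, "TB deaths", rev=true)
println(by_deaths_desc[1, "Country"])  # India

# sort returned a copy, so the original row order is unchanged.
println(df[1, "Country"])  # Angola

# sort! reorders df itself, here in ascending order.
sort!(df, "TB deaths")
println(df[1, "Country"])  # Brazil
```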

As the second header line shows, every column has the element type Any, so the columns aren't properly typed. The following code converts the two numeric columns to Int64:

df[!, "Population (1000s)"] = convert.(Int64, df[!, "Population (1000s)"])
df[!, "TB deaths"] = convert.(Int64, df[!, "TB deaths"])
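
To check that the conversion took effect, one can inspect the element type of a column with `eltype`. The sketch below does this on a hypothetical two-row table, and also narrows the Country column with `identity.`, which is an alternative not used in this article: broadcasting `identity` lets Julia pick the narrowest element type that fits the values.

```julia
using DataFrames

# A two-row stand-in with untyped columns, as produced by the import.
df = DataFrame(
    "Country" => Any["Angola", "Brazil"],
    "TB deaths" => Any[6900, 4400],
)

println(eltype(df[!, "TB deaths"]))  # Any

# Explicit conversion, as in the article.
df[!, "TB deaths"] = convert.(Int64, df[!, "TB deaths"])

# Alternative: let broadcasting infer the narrowest element type.
df[!, "Country"] = identity.(df[!, "Country"])

println(eltype(df[!, "TB deaths"]))  # Int64
println(eltype(df[!, "Country"]))    # String
```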

Finally, we can add a new column to the data frame like so:

df[!, "TB deaths (per 100,000)"] = df[!, "TB deaths"] * 100 ./ df[!, "Population (1000s)"]

The data frame now looks as follows:

12×4 DataFrame
 Row │ Country                Population (1000s)  TB deaths  TB deaths (per 100,000) 
     │ Any                    Int64               Int64      Float64                 
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │ Angola                              21472       6900                 32.1349
   2 │ Brazil                             200362       4400                  2.19603
   3 │ China                             1393337      41000                  2.94258
   4 │ Equatorial Guinea                     757         67                  8.85073
   5 │ Guinea-Bissau                        1704       1200                 70.4225
   6 │ India                             1252140     240000                 19.1672
   7 │ Mozambique                          25834      18000                 69.6756
   8 │ Portugal                            10608        140                  1.31976
   9 │ Russian Federation                 142834      17000                 11.9019
  10 │ Sao Tome and Principe                 193         18                  9.32642
  11 │ South Africa                        52776      25000                 47.37
  12 │ Timor-Leste                          1133        990                 87.3786
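
The factor of 100 follows from the units: the population is given in thousands, so deaths per 100,000 people is deaths / (population × 1000) × 100,000 = deaths × 100 / population. For Angola this gives 6900 × 100 / 21472 ≈ 32.13. The same column could also be added with `transform!` and `ByRow` from DataFrames.jl; the sketch below is one possible formulation on a hypothetical two-row table, not the approach used in this article.

```julia
using DataFrames

# A two-row stand-in with the values for Angola and Portugal.
df = DataFrame(
    "Country" => ["Angola", "Portugal"],
    "Population (1000s)" => [21472, 10608],
    "TB deaths" => [6900, 140],
)

# ByRow applies the function to each row's (deaths, population) pair
# and stores the result in a new column.
transform!(df,
    ["TB deaths", "Population (1000s)"] =>
        ByRow((deaths, pop) -> deaths * 100 / pop) =>
        "TB deaths (per 100,000)")

println(df)
```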

This is just a brief introduction to data frames in Julia. Stay tuned :)


Written by Petr Homola

Studied physics & CS; PhD in NLP; interested in AI, HPC & PLT
