Plotting and Data Visualization

Overview

Questions:

  • How do I visualize data by making graphs?

Objectives:

  • Plot data from a ‘pandas dataframe’.

  • Label and customize graph.

  • Save figures to files.

  • Plot multiple graphs on one figure.

  • Create multiple figures using a ‘for’ loop.

Plotting is one of the most effective methods of representing numerical data and illustrating their patterns and trends. There are a number of applications that facilitate graph creation (Excel, Origin, SciDavis, etc.), but these methods can be time consuming, tedious, and at times inflexible. We have already seen the potential of coding for reading/editing/saving multiple files at once, as well as in creating tables from raw data in a ‘CSV’ file. Taking what we have learned so far, we will focus in this module on creating plots from the data in the previous lesson, customizing the plots with color, design, labels and legends, and using loops to create multiple figures at once.

Prepare data for plotting

First, we need to import pandas and load our data into variables. These lines should look familiar from the previous lesson.

import pandas as pd

distance_file = "data/distance_data_headers.csv"

distances = pd.read_csv(distance_file)

distances
      Frame  THR4_ATP  THR4_ASP  TYR6_ATP  TYR6_ASP
0         1    8.9542    5.8024   11.5478    9.9557
1         2    8.6181    6.0942   13.9594   11.6945
2         3    9.0066    6.0637   13.0924   11.3043
3         4    9.2002    6.0227   14.5282   10.1763
4         5    9.1294    5.9365   13.5321   10.6279
...     ...       ...       ...       ...       ...
9995   9996    8.5083    7.7587    9.1789   10.6715
9996   9997    8.9524    7.4681    9.5132   10.9945
9997   9998    8.6625    7.7306    9.5469   10.3063
9998   9999    9.2456    7.8886    9.8151   10.7564
9999  10000    8.8135    7.9170    9.9517   10.7848

10000 rows × 5 columns

Plotting Data

A data set is plotted using the ‘plot()’ function of matplotlib.pyplot. By using the pandas DataFrame object for our tabular data, we can refer to the desired column by its header name. If you want to see how this simplifies working with our data, check out the “Prepare for Plotting” section at the top of the original lesson from which this one was adapted.

import matplotlib.pyplot as plt
plt.figure()     #This initializes a new figure
plt.plot(distances["THR4_ATP"])
[<matplotlib.lines.Line2D at 0x7f2270ac4c10>]
[Figure: THR4_ATP against row index for all 10000 frames]

Check Your Understanding

How would you make the same plot using ‘iloc’ instead of the column name?
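One possible answer, sketched on a small synthetic stand-in for the distances table (the values are made up; only the column order follows the lesson, where THR4_ATP sits at column position 1):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's table (values are made up).
rng = np.random.default_rng(0)
distances = pd.DataFrame({
    "Frame": np.arange(1, 11),
    "THR4_ATP": rng.normal(9.0, 0.3, 10),
    "THR4_ASP": rng.normal(6.0, 0.3, 10),
})

by_name = distances["THR4_ATP"]      # select the column by header
by_position = distances.iloc[:, 1]   # all rows, column at position 1

plt.figure()
(line,) = plt.plot(by_position)      # same line as plt.plot(by_name)
```

Selecting by position is shorter to type but breaks silently if the column order changes, which is one reason the lesson prefers header names.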

Plotting with x and y

The data here are relatively straightforward, with the Frame column serving as a simple x value for the rest of the columns. Often, however, it is necessary to show trends or patterns in data in relation to a variety of independent variables. If your data contain several candidate x columns, it helps to specify the x and y values explicitly. With ‘pyplot’, the first two positional arguments to ‘plot()’ are interpreted as ‘x’ and ‘y’; when only one is given, it is treated as ‘y’ and plotted against its index.

plt.figure()
plt.plot(distances["Frame"], distances["TYR6_ASP"])
[<matplotlib.lines.Line2D at 0x7f2270984a90>]
[Figure: TYR6_ASP against Frame for all 10000 frames]
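To see what changes when x is given explicitly, the following sketch plots the same column twice and inspects the x values matplotlib actually used. The five-row table is a stand-in built from the first rows shown earlier, rounded to two decimals.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in table: first five rows of the lesson's data, rounded.
distances = pd.DataFrame({
    "Frame": np.arange(1, 6),
    "TYR6_ASP": [9.96, 11.69, 11.30, 10.18, 10.63],
})

plt.figure()
# Only y given: x falls back to the row index (0, 1, 2, ...).
(line_default,) = plt.plot(distances["TYR6_ASP"])
# x and y given: x is the Frame column (1, 2, 3, ...).
(line_frame,) = plt.plot(distances["Frame"], distances["TYR6_ASP"])

x_default = list(line_default.get_xdata())
x_frame = list(line_frame.get_xdata())
print(x_default)
print(x_frame)
```

For this data set the two curves differ only by a shift of one along x, because Frame starts at 1 while the row index starts at 0.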

It may be a little hard to see the effects of changing the style of the plot at this scale. Let’s slice our data so that we see only the first 100 points.

plot_data = distances.iloc[0:100,:]

plt.figure()
plt.plot(plot_data["Frame"], plot_data["TYR6_ASP"])
[<matplotlib.lines.Line2D at 0x7f22709e9910>]
[Figure: TYR6_ASP against Frame for the first 100 frames]

Labels and Legends

If we are creating a plot to add to a report or presentation, we will want to add labels to our axes and a legend to our plot. We can do this using the ‘xlabel()’, ‘ylabel()’, and ‘legend()’ functions. The ‘legend()’ function can take a list of strings, which are used to label the lines in the order they were plotted. Alternatively, as in the code that follows, give each line a ‘label’ when it is plotted and call ‘legend()’ with no arguments.

plt.figure()
plt.xlabel('Frame')
plt.ylabel('Distance (angstrom)') 

plt.plot(plot_data["Frame"], plot_data["THR4_ATP"], label="THR4_ATP")
plt.legend()
<matplotlib.legend.Legend at 0x7f227084a550>
[Figure: THR4_ATP for the first 100 frames, with axis labels and a legend]
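The two ways of supplying legend text can be compared side by side. This is a minimal sketch on made-up values; the variable names `frames`, `thr4_atp`, and `thr4_asp` are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# Made-up values standing in for two distance columns.
frames = [1, 2, 3, 4, 5]
thr4_atp = [8.95, 8.62, 9.01, 9.20, 9.13]
thr4_asp = [5.80, 6.09, 6.06, 6.02, 5.94]

# Option 1: pass a list of strings to legend(), in plotting order.
plt.figure()
plt.plot(frames, thr4_atp)
plt.plot(frames, thr4_asp)
legend_from_list = plt.legend(["THR4_ATP", "THR4_ASP"])

# Option 2: attach a label to each line, then call legend() with no arguments.
plt.figure()
plt.plot(frames, thr4_atp, label="THR4_ATP")
plt.plot(frames, thr4_asp, label="THR4_ASP")
legend_from_labels = plt.legend()

texts_list = [t.get_text() for t in legend_from_list.get_texts()]
texts_labels = [t.get_text() for t in legend_from_labels.get_texts()]
```

Option 2 keeps each label next to the data it describes, so reordering the plot calls cannot mislabel a line.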

Plotting more than one data set at a time

To plot more than one data set at a time, we can pass multiple columns to the ‘plot()’ function as ‘y’; each column is drawn as its own line. We can label the lines with the headers of the dataframe by passing the column names to the ‘label’ keyword argument.

x_column = plot_data["Frame"]
y_columns = plot_data.iloc[:,1:]
labels = y_columns.columns
plt.figure()
plt.xlabel('Frame')
plt.ylabel('Distance (angstrom)') 

plt.plot(x_column, y_columns, label=labels)
plt.legend()
<matplotlib.legend.Legend at 0x7f22683ca550>
[Figure: all four distance columns for the first 100 frames, with a legend]
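If the installed matplotlib does not accept a sequence of labels in a single ‘plot()’ call, an alternative is to plot one column per loop iteration. The sketch below assumes a synthetic stand-in for ‘plot_data’ (made-up values, same column names as the lesson).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the first 100 frames (values are made up).
rng = np.random.default_rng(1)
plot_data = pd.DataFrame({
    "Frame": np.arange(1, 101),
    "THR4_ATP": rng.normal(9.0, 0.3, 100),
    "THR4_ASP": rng.normal(6.0, 0.3, 100),
    "TYR6_ATP": rng.normal(13.0, 0.8, 100),
    "TYR6_ASP": rng.normal(10.5, 0.6, 100),
})

x_column = plot_data["Frame"]
y_columns = plot_data.iloc[:, 1:]

plt.figure()
plt.xlabel("Frame")
plt.ylabel("Distance (angstrom)")
# One plot() call per column, each with its own label.
for name in y_columns.columns:
    plt.plot(x_column, y_columns[name], label=name)
plt.legend()

ax = plt.gca()
drawn_labels = [line.get_label() for line in ax.get_lines()]
```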

Different kinds of charts

Matplotlib can make a variety of plots. If you’re using plt.plot, you can change the line and marker style by passing a format string as the third positional argument, after x and y and before any keyword arguments such as ‘label’.

plt.figure()
plt.xlabel('Frame')
plt.ylabel('Distance (angstrom)') 

plt.plot(x_column, y_columns, 'o', label=labels)
plt.legend()
<matplotlib.legend.Legend at 0x7f2268326390>
[Figure: the four distance columns for the first 100 frames, drawn with circle markers]
plt.figure()
plt.xlabel('Frame')
plt.ylabel('Distance (angstrom)') 

plt.plot(x_column, y_columns, '--*', label=labels)
plt.legend()
<matplotlib.legend.Legend at 0x7f2270b1a550>
[Figure: the same data drawn with dashed lines and star markers]

Here is a list of options you can use.

character    description
'-'          solid line style
'--'         dashed line style
'-.'         dash-dot line style
':'          dotted line style
'.'          point marker
','          pixel marker
'o'          circle marker
'v'          triangle_down marker
'^'          triangle_up marker
'<'          triangle_left marker
'>'          triangle_right marker
'1'          tri_down marker
'2'          tri_up marker
'3'          tri_left marker
'4'          tri_right marker
's'          square marker
'p'          pentagon marker
'*'          star marker
'h'          hexagon1 marker
'H'          hexagon2 marker
'+'          plus marker
'x'          x marker
'D'          diamond marker
'd'          thin_diamond marker
'_'          hline marker
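A format string combines one line-style character sequence and one marker character from the list above. As a small illustration on made-up data, the format string '--*' should produce the same style as setting the ‘linestyle’ and ‘marker’ keyword arguments separately:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [8.95, 8.62, 9.01, 9.20, 9.13]

plt.figure()
# Format string: dashed line ('--') with star markers ('*').
(line_fmt,) = plt.plot(x, y, "--*")
# The same style spelled out with keyword arguments.
(line_kw,) = plt.plot(x, y, linestyle="--", marker="*")

print(line_fmt.get_linestyle(), line_fmt.get_marker())
print(line_kw.get_linestyle(), line_kw.get_marker())
```

The keyword form is longer but easier to read when several style settings are combined.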

Changing color and image size

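Before saving, note that the color and image size of a plot can be changed: ‘plot()’ accepts a ‘color’ keyword argument, and ‘figure()’ accepts ‘figsize’ (width and height in inches). Both are used in the loop example later in this lesson. For more customizations, see the matplotlib.pyplot.plot function API.

The axes can also be customized using the plt.axes API.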
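As a minimal sketch of both settings, the following uses the same ‘figsize’ and hex color that appear in the loop example later in the lesson; the data values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [8.95, 8.62, 9.01, 9.20, 9.13]

# figsize is (width, height) in inches.
fig = plt.figure(figsize=(10, 6))
# color accepts, among other forms, a hex string '#RRGGBB'.
(line,) = plt.plot(x, y, color="#FFCE00", linewidth=1)

width_in, height_in = fig.get_size_inches()
print(width_in, height_in, line.get_color())
```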

Plotting with pandas

pandas has plotting functionality as well, using the syntax dataframe.plot(). It is particularly convenient for creating normal, stacked, or nested bar graphs and pie charts. If this is useful to you, check out the DataFrame.plot API.
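A brief sketch of the pandas route, assuming a synthetic stand-in table (made-up values): a line plot of two columns against Frame, and a bar chart of their means selected with the ‘kind’ keyword. Explicit axes are passed via ‘ax’ so each chart lands in its own panel.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's table (values are made up).
rng = np.random.default_rng(2)
distances = pd.DataFrame({
    "Frame": np.arange(1, 51),
    "THR4_ATP": rng.normal(9.0, 0.3, 50),
    "THR4_ASP": rng.normal(6.0, 0.3, 50),
})

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Line plot: pandas labels each line with its column header.
distances.plot(x="Frame", y=["THR4_ATP", "THR4_ASP"], ax=ax_line)
ax_line.set_ylabel("Distance (angstrom)")

# Bar chart of the column means, chosen with kind="bar".
distances[["THR4_ATP", "THR4_ASP"]].mean().plot(kind="bar", ax=ax_bar)
ax_bar.set_ylabel("Mean distance (angstrom)")
```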

Saving the Figure and Setting the Resolution

Set the resolution when you save the image by passing the ‘dpi’ (dots per inch) keyword argument, using the syntax ‘plt.savefig(filename, dpi=300)’.

plt.figure()
plt.xlabel('Frame')
plt.ylabel('Distance (angstrom)') 

plt.plot(x_column, y_columns, label=labels)
plt.legend()
plt.savefig('first_100_frames.png', dpi=300)
[Figure: all four distance columns for the first 100 frames, as saved to first_100_frames.png]
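With default settings, the pixel size of the saved image is the figure size in inches multiplied by ‘dpi’. The sketch below checks this by saving a small figure to a temporary directory and reading it back; the file name ‘dpi_demo.png’ and the data values are made up for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# A 6 x 4 inch figure with made-up data.
plt.figure(figsize=(6, 4))
plt.plot([1, 2, 3], [8.95, 8.62, 9.01])

# Save at 100 dots per inch into a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "dpi_demo.png")
plt.savefig(out_path, dpi=100)

# Read the PNG back to check its pixel dimensions (rows = height, columns = width).
pixels = plt.imread(out_path)
height_px, width_px = pixels.shape[:2]
print(width_px, height_px)
```

Doubling ‘dpi’ doubles both pixel dimensions, which is why a 300 dpi image is much larger on disk than the on-screen default.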

Multiple Figures Using Loops

Suppose instead in our reporting we wanted to discuss each plot in turn. As long as the settings are consistent, we can generate, modify, and save a series of figures with a single block of code!

# top holds the column headers of the dataframe; index 0 is 'Frame'
top = list(distances.columns)

# loop through the list of headers, creating a new figure with each cycle
# (start at 1 to skip 'Frame', which is used as the x values)
for i in range(1,len(top)): 
    fname = f"{top[i]}.png"
    plt.figure(figsize=(10, 6))
    plt.xlabel('Frame')
    plt.ylabel('distance (\u212B)')
    plt.title(f'{top[i]} per 100 Frames')
    plt.plot(
        distances.iloc[::100,0],   # every 100th Frame value
        distances.iloc[::100,i],   # every 100th value of column i
        '--s', 
        color='#FFCE00',
        markersize=4, 
        markeredgecolor='#000000',
        markeredgewidth=1,
        linewidth=1, 
        label=top[i]
    )
    plt.legend()
    plt.savefig(fname, dpi=300)
[Figures: one plot per distance column (THR4_ATP, THR4_ASP, TYR6_ATP, TYR6_ASP), sampled every 100 frames]

range(len(list))

There are five headers in our data, so ‘len(top)’ returns 5. Passing that length to ‘range()’ returns a range object that iterates over the indices of the list. So: len(top) == 5, list(range(len(top))) == [0, 1, 2, 3, 4], and list(range(1, len(top))) == [1, 2, 3, 4]. Starting the range at 1 skips the ‘Frame’ column at index 0.
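The same relationships can be checked directly. Here ‘top’ is written out by hand as the list of headers from the table shown earlier.

```python
# The five column headers, written out by hand.
top = ["Frame", "THR4_ATP", "THR4_ASP", "TYR6_ATP", "TYR6_ASP"]

n_headers = len(top)
all_indices = list(range(n_headers))       # every index, starting at 0
data_indices = list(range(1, n_headers))   # skip index 0 ('Frame')

# Print each data column's index alongside its header.
for i in data_indices:
    print(i, top[i])
```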

Challenge time!

In our data reporting, it has been decided that we need to overlay an average of all the samples on each of the plots for comparison. We’ll need to create a column holding the average of the 4 existing distance columns and graph it on top of each sample plot. This will be too busy, though, so we only want to graph one point for every hundred. We don’t have a lot of time either, so it is best to use a for loop.
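One possible approach is sketched below; it is not the only solution. It runs on a synthetic stand-in table with 1000 made-up rows and writes to a temporary directory, and the column name ‘Average’ and the output file names are choices made for this sketch.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's table (values are made up).
rng = np.random.default_rng(3)
distances = pd.DataFrame({
    "Frame": np.arange(1, 1001),
    "THR4_ATP": rng.normal(9.0, 0.3, 1000),
    "THR4_ASP": rng.normal(6.0, 0.3, 1000),
    "TYR6_ATP": rng.normal(13.0, 0.8, 1000),
    "TYR6_ASP": rng.normal(10.5, 0.6, 1000),
})

sample_names = list(distances.columns[1:5])
# Row-wise mean of the four distance columns.
distances["Average"] = distances[sample_names].mean(axis=1)

# Keep one row in every hundred.
sampled = distances.iloc[::100, :]

out_dir = tempfile.mkdtemp()
saved = []
for name in sample_names:
    plt.figure()
    plt.xlabel("Frame")
    plt.ylabel("Distance (angstrom)")
    plt.plot(sampled["Frame"], sampled[name], "-o", label=name)
    plt.plot(sampled["Frame"], sampled["Average"], "--", label="Average")
    plt.legend()
    path = os.path.join(out_dir, f"{name}_with_average.png")
    plt.savefig(path, dpi=300)
    plt.close()  # release the figure before starting the next one
    saved.append(path)
```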

Key Points

  • Use the matplotlib.pyplot ‘plot()’ function to generate figures from tabular data held in a pandas DataFrame

  • Change line and marker styles with a format string in ‘plt.plot()’, or create other chart types with pandas’ ‘DataFrame.plot()’ and its ‘kind’ keyword

  • Add labels, legends, color, and other stylistic choices to figures by passing parameters to plot

  • Work with multiple data sets, either with the ‘iloc[]’ syntax, ‘for’ loops, or simple overlay

  • Use the matplotlib.pyplot function ‘savefig()’ with the ‘dpi’ keyword to save the figure to a file