Processing Multiple Files and Writing Files
Overview
Questions:
How do I analyze multiple files at once?
Objectives:
Process multiple files using a for loop.
Print output to a new text file.
In our previous lesson, we parsed values from output files. While you might have seen the utility of doing such a thing, you might have also wondered why we didn’t just search the file and cut and paste the values we wanted into a spreadsheet. If you only have 1 or 2 files, this might be a very reasonable thing to do. But what if you had 100 files to analyze? What if you had 1000? In such a case the cutting and pasting method would be very tedious and time consuming.
One of the real powers of writing a program to analyze your data is that you can just as easily analyze 100 files as 1 file. In this example, we are going to read the output files for a whole series of aliphatic alcohol compounds and parse the energy value from each one. The output files are all saved in a folder called outfiles that you should have downloaded in the setup for this lesson. The code in this lesson looks for that folder at data/outfiles, relative to the directory where you are writing and executing your code, so make sure it is there.
In this lesson, we will be using the glob library, which will help us read in multiple files from our computer.
import glob
We will make a file path which points to where our .out files are.
For this file path, we will use * to stand in for any file name (instead of something specific like ethanol).
When we use this with the glob function in the glob library, we will get all of the files that match that pattern.
file_location = "data/outfiles/*.out"
print(file_location)
data/outfiles/*.out
This specifies that we want to look for all the files in a directory called data/outfiles that end in “.out”. The * is the wildcard character, which matches any sequence of characters in a file name.
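To see what the wildcard accepts without touching the disk, the sketch below checks a few file names against the pattern with fnmatch.fnmatchcase from the standard library, which applies the same shell-style wildcard rules to a single name. The names in candidates are made up for illustration and are not part of the lesson data.

```python
import fnmatch

# Hypothetical file names, made up for illustration.
candidates = ["ethanol.out", "methanol.out", "ethanol.log", "butanol.out.bak"]

pattern = "*.out"

# * matches any run of characters, but the name must still end in ".out".
matches = [name for name in candidates if fnmatch.fnmatchcase(name, pattern)]
print(matches)
```

Only the two names that end in .out are kept; ethanol.log and butanol.out.bak are rejected because the pattern must match the whole name.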
Next we are going to use a function called glob in the library called glob. It is a little confusing since the function and the library have the same name, but we will see other examples where this is not the case later. The output of the function glob is a list of all the filenames that fit the pattern specified in the input. The input is the file location.
filenames = glob.glob(file_location)
print(filenames)
['data/outfiles/butanol.out', 'data/outfiles/methanol.out', 'data/outfiles/ethanol.out', 'data/outfiles/octanol.out', 'data/outfiles/heptanol.out', 'data/outfiles/nonanol.out', 'data/outfiles/propanol.out', 'data/outfiles/hexanol.out', 'data/outfiles/decanol.out', 'data/outfiles/pentanol.out']
The glob call gives us a list of all the files in the outfiles directory whose names end in .out. Now if we want to parse every file we just found, we will use a for loop to go through each file.
for f in filenames:
    outfile = open(f,'r')
    data = outfile.readlines()
    outfile.close()
    for line in data:
        if 'Final Energy' in line:
            energy_line = line
            words = energy_line.split()
            energy = float(words[3])
            print(energy)
-232.1655798347283
-115.04800861868374
-154.09130176573018
-388.3110864554743
-349.27397687072676
-427.3465180082815
-193.12836249728798
-310.2385332251633
-466.3836241400086
-271.20138119895074
Notice that in this code we actually used two for loops, one nested inside the other. The outer for loop iterates over the filenames we found earlier. The inner for loop iterates over the lines in each file, just as we did in our previous file parsing lesson.
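One way to make the two levels easier to see is to move the inner loop into a small function. The sketch below does this with in-memory lines instead of real files so that it runs anywhere. The function name find_energy, the dictionary fake_files, and the layout of the sample lines are assumptions made for illustration; the sample energy line is only modeled on the words[3] index used above.

```python
def find_energy(lines):
    """Return the energy from the last 'Final Energy' line, or None if absent."""
    energy = None
    for line in lines:                      # inner loop: one pass per line
        if 'Final Energy' in line:
            words = line.split()
            energy = float(words[3])
    return energy


# Hypothetical file contents, keyed by made-up file names.
fake_files = {
    "ethanol.out": ["  Geometry converged\n", "  @DF-RHF Final Energy:  -154.09\n"],
    "empty.out": ["  Calculation failed\n"],
}

for name, lines in fake_files.items():      # outer loop: one pass per file
    print(name, find_energy(lines))
```

Returning None when no line contains 'Final Energy' makes it easy to spot a file that did not produce an energy.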
The output our code currently generates is not that useful. It doesn’t show us which file each energy value came from.
We want to print the name of the molecule with the energy. We can use os.path.basename, a function in the os.path module of the os library, to get just the name of the file.
import os
first_file = filenames[0]
print(first_file)
file_name = os.path.basename(first_file)
print(file_name)
data/outfiles/butanol.out
butanol.out
Exercise
How would you extract the molecule name from the example above?
Solution
You can use the split function introduced in the last lesson, and split at the . character.
split_filename = file_name.split('.')
molecule_name = split_filename[0]
print(molecule_name)
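As an alternative to splitting at the first dot, the standard library also provides os.path.splitext, which removes only the final extension. The sketch below compares the two approaches; the second path, with an extra dot in its name, is hypothetical and is there only to show where they differ.

```python
import os

# Hypothetical paths; the second one has an extra dot in the file name.
paths = ["data/outfiles/butanol.out", "data/outfiles/butanol.opt.out"]

results = []
for path in paths:
    file_name = os.path.basename(path)
    by_split = file_name.split('.')[0]            # everything before the first dot
    by_splitext = os.path.splitext(file_name)[0]  # everything before the last dot
    results.append((file_name, by_split, by_splitext))
    print(file_name, by_split, by_splitext)
```

For the files in this lesson both approaches give the same molecule name, so the choice only matters when file names contain more than one dot.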
Using the solution from the previous exercise, we can modify our loop so that it prints the file name along with each energy value.
for f in filenames:
    # Get the molecule name
    file_name = os.path.basename(f)
    split_filename = file_name.split('.')
    molecule_name = split_filename[0]

    # Read the data
    with open(f) as outfile:
        data = outfile.readlines()

    # Loop through the data
    for line in data:
        if 'Final Energy' in line:
            energy_line = line
            words = energy_line.split()
            energy = float(words[3])
            print(molecule_name, energy)
butanol -232.1655798347283
methanol -115.04800861868374
ethanol -154.09130176573018
octanol -388.3110864554743
heptanol -349.27397687072676
nonanol -427.3465180082815
propanol -193.12836249728798
hexanol -310.2385332251633
decanol -466.3836241400086
pentanol -271.20138119895074
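The molecules above appear in whatever order glob returned them, and glob.glob does not guarantee any particular order. If a predictable order matters, one option is to wrap the call in sorted. The sketch below creates a temporary directory with a few empty, made-up .out files so that it can run without the lesson data.

```python
import glob
import os
import tempfile

# Build a throwaway directory with empty, made-up .out files.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ["propanol", "butanol", "methanol"]:
        open(os.path.join(tmpdir, name + ".out"), "w").close()

    # sorted() returns a new list in alphabetical order.
    filenames = sorted(glob.glob(os.path.join(tmpdir, "*.out")))
    molecule_names = [os.path.basename(f).split('.')[0] for f in filenames]

print(molecule_names)
```

In the lesson code, the same effect could be had by writing filenames = sorted(glob.glob(file_location)).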
Printing to a file
Finally, it might be useful to print our results in a new file, so that we could share our results with colleagues or e-mail them to our advisor. Much like when we read in a file, the first step to writing output to a file is opening that file for writing. Opening a file for writing is similar to opening it for reading, except that you use w instead of r in the open function.
You can also use a to add to an existing file. The a mode stands for append, and when you use it, lines will be added to the end of the file. If you use w, all existing file contents are overwritten if the file already exists. Adding + to a mode (w+ or a+) opens the file for reading as well as writing.
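The difference between the w and a modes can be checked with a short experiment. The sketch below writes to a scratch file in a temporary directory (the name demo.txt is made up) and reads the contents back after each step.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")  # hypothetical file name

    with open(path, 'w') as handle:
        handle.write("first\n")

    # 'a' keeps the existing line and adds the new one after it.
    with open(path, 'a') as handle:
        handle.write("second\n")
    with open(path) as handle:
        after_append = handle.read()

    # 'w' discards everything that was in the file.
    with open(path, 'w') as handle:
        handle.write("third\n")
    with open(path) as handle:
        after_overwrite = handle.read()

print(repr(after_append))
print(repr(after_overwrite))
```

After the append step the file holds both lines; after reopening with w only the newest line remains.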
When a file is opened in text mode, its write() method accepts only strings. Our current print statement is given two Python variables rather than a single string. To build one string from them, we can use an f-string: place an f (or F) directly in front of the opening quote of the string, and put each Python variable inside braces. Then you can either print the string (as we have done before) or use the filehandle.write() command to write it to a file.
To make the printing neater, we will separate the file name from the energy using a tab. To insert a tab, we use the special character \t.
with open('energies.txt','w+') as datafile:
    for f in filenames:
        # Get the molecule name
        file_name = os.path.basename(f)
        split_filename = file_name.split('.')
        molecule_name = split_filename[0]

        # Read the data
        with open(f) as outfile:
            data = outfile.readlines()

        # Loop through the data
        for line in data:
            if 'Final Energy' in line:
                energy_line = line
                words = energy_line.split()
                energy = float(words[3])
                datafile.write(F'{molecule_name} \t {energy} \n')
After you run this command, look in the directory where you ran your code and find the “energies.txt” file. Open it in a text editor and look at the file.
In the file writing line, notice the \n at the end of the line. This is the newline character. Without it, the text in our file would just be all smushed together on one line.
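The string passed to write can be examined on its own. The sketch below builds it for one molecule, using the ethanol energy from the output earlier in this lesson, and prints it with repr so that the tab and newline characters are visible. The second string, tidy_line, is an optional variation and not part of the lesson code: it drops the padding spaces and uses a format specifier to fix the number of decimal places.

```python
molecule_name = "ethanol"
energy = -154.09130176573018

# Same format as the write call in the lesson.
line = f'{molecule_name} \t {energy} \n'
print(repr(line))

# Optional variation: no padding spaces, six decimal places.
tidy_line = f'{molecule_name}\t{energy:.6f}\n'
print(repr(tidy_line))
```

Inside the braces, anything after the colon is a format specification; .6f asks for fixed-point notation with six digits after the decimal point.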
Your Turn
Exercise
Extend your project from the previous lesson to analyze all of the mdout files in the folder data/mdout. Write a new file for each analyzed file called filename_Etot.txt.
Solution
Solution here
Key Points
Use the glob function in the glob library to get files that have similar names.
You can nest for loops inside of each other (also called a double for loop).