Writing scripts in python

"Learning the basics of python scripts’ structure. Turning jupyter notebooks into scripts that can be run from anywhere. Introduction to the argument parser."

Information

The estimated time to complete this training module is 2h.

The prerequisites to take this module are:

the installation module.

Contact François Paugam if you have questions on this module, or if you want to check that you successfully completed all the exercises.

Resources

This module was presented by Greg Kiar during the QLSC 612 course in 2020, with slides from Joel Grus’ talk at JupyterCon 2018.

The video is available below:

Exercise

In a new directory, create a file useful_functions.py.
In this file, implement a function change_case that takes as input a string and that returns this string with the last character in upper case and the rest in lower case (e.g. "myInputString" -> "myinputstrinG"). Hint : use the .upper() and .lower() methods of the string type.
In the same file, implement another function get_words that takes a string as input and returns a list of strings. The list should contain the words in the input string (e.g. "My input string" -> ["My", "input", "string"]). Hint : use the .split() method.
In the same file, implement a third function join_words that does the reverse of the get_words function, i.e. it takes a list of words and returns one string with all the words (e.g. ["My", "input", "string"] -> "My input string").
Create a new file main_script.py. In this file use the argparse library introduced in the video so that a user can call it with two arguments : a --text argument containing a string and a --case argument which can take the value “upper”, “lower” or “mixed”. The script should print the text argument in the requested case (mixed being the one generated by the change_case function). For example :

./main_script.py --text "My awesome text." --case mixed
mY awesomE texT.

./main_script.py --text "My awesome text." --case upper
MY AWESOME TEXT.

Hint : import and use the functions from the useful_functions.py file. Don't forget the if __name__ == "__main__" even though in this example it won't make a difference, it is never too early to get used to good practices.
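As a reminder of the general shape of such a script, here is a minimal sketch on a different toy problem, so it does not give away the exercise. The file name greet.py, the function names and the --name and --style arguments are all hypothetical; only argparse itself and the choices parameter come from the standard library.

```python
#!/usr/bin/env python
import argparse


def format_greeting(name, style):
    """Return a greeting, in upper case when style is 'loud'."""
    greeting = f"Hello, {name}"
    if style == "loud":
        return greeting.upper()
    return greeting


def build_parser():
    parser = argparse.ArgumentParser(description="Print a greeting.")
    parser.add_argument("--name", type=str, default="world")
    # choices makes argparse reject any value that is not in the list.
    parser.add_argument("--style", choices=["plain", "loud"], default="plain")
    return parser


def main(argv=None):
    # With argv=None, argparse reads the command line (sys.argv[1:]).
    args = build_parser().parse_args(argv)
    print(format_greeting(args.name, args.style))


if __name__ == "__main__":
    main()
```

Called as python greet.py --name Ada --style loud, this sketch prints HELLO, ADA. Keeping the logic in functions and the parsing under the if __name__ == "__main__": line means another script could import format_greeting without triggering the argument parsing.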

Follow up with François Paugam to validate you completed the exercise correctly.
🎉 🎉 🎉 you completed this training module! 🎉 🎉 🎉

Bonus exercise

In this exercise we will code a key-based encryption and decryption system. The principle is similar to the Vigenère cipher, but instead of using the key only to shift the letters, we will use a slightly more complex transformation.

The Vigenère cipher consists in shifting each letter of the message to encrypt by the index of the corresponding letter in the key. For example, encrypting the letter B with the key D results in the letter of index = index(B) + index(D) = 2 + 4 = 6, so it will be F.

⚠ Note that here by index I mean the index of the letter in the alphabet (or in the character table you use) and not the index of the letter in the message string.
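The shift described above can be sketched as follows. This is only an illustration of the B and D example: the function names are made up, the numbering starts at 1 as in the text, and wrapping around after Z is an assumption borrowed from the classic cipher.

```python
import string

ALPHABET = string.ascii_uppercase  # "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def alphabet_index(letter):
    """1-based position of an upper-case letter in the alphabet (A=1, B=2, ...)."""
    return ALPHABET.index(letter) + 1


def shift_letter(message_letter, key_letter):
    """Shift message_letter by the alphabet index of key_letter, wrapping after Z."""
    new_index = alphabet_index(message_letter) + alphabet_index(key_letter)
    return ALPHABET[(new_index - 1) % 26]


print(shift_letter("B", "D"))  # 2 + 4 = 6 -> F
```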

You pair up the letters of the message with those of the key one by one, repeating the key if it is shorter than the message. For example if the message is Myawesomemessage and the key is mykey, the pairs will be :

(M,m),
(y,y),
(a,k),
(w,e),
(e,y),
(s,m),
(o,y),
(m,k) ...

and so on.
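One possible way to build these pairs in python, shown here as a sketch rather than the required approach, is to combine zip with itertools.cycle from the standard library:

```python
from itertools import cycle

message = "Myawesomemessage"
key = "mykey"

# cycle(key) repeats the key forever; zip stops at the end of the message.
pairs = list(zip(message, cycle(key)))

print(pairs[:8])
# [('M', 'm'), ('y', 'y'), ('a', 'k'), ('w', 'e'), ('e', 'y'), ('s', 'm'), ('o', 'y'), ('m', 'k')]
```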

Here we will follow this principle, but instead of directly using index(message_letter) + index(key_letter) as the index of the encrypted letter, we will write this sum in another base, which depends on the remainder of the division of the key letter's index by 5. The base k is : k = 5 + (index(key_letter) % 5). The digits of the sum written in base k are then read as an ordinary base 10 number. For example if the index of the message letter is 47 and the index of the key letter is 19, the sum is 47 + 19 = 66 and the base is k = 5 + (19 % 5) = 5 + 4 = 9. Written in base 9, 66 becomes 73 (indeed 7*9 + 3 = 66), so the index of the encrypted letter is 73.
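The arithmetic of this example can be checked step by step. The sketch below only handles this particular two-digit case and is not a general base conversion; the variable names are illustrative.

```python
message_index = 47
key_index = 19

k = 5 + (key_index % 5)            # 5 + 4 = 9
total = message_index + key_index  # 66

# 66 has two digits in base 9: divmod returns the quotient and the remainder.
high_digit, low_digit = divmod(total, k)  # (7, 3)

# Reading the base 9 digits 7 and 3 as a decimal number gives 73.
encrypted_index = int(f"{high_digit}{low_digit}")

print(k, total, encrypted_index)  # 9 66 73
```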

For the indices of the letters, we will not use the position of the letter in the alphabet, but the unicode index (code point) of the letter, which is easily obtained with the native python function ord. The reverse operation, getting a letter from its unicode index, is done with the native python function chr. There are 1114112 unicode code points handled by python, so we'll have to make sure our indices are in the range 0 to 1114111. To ensure that, we can use values modulo 1114112, i.e. encrypted_index = base_k(ord(message_letter) + ord(key_letter), k) % 1114112.
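A short illustration of ord, chr and the upper limit, using sys.maxunicode from the standard library to avoid hard-coding the bound:

```python
import sys

print(ord("l"))        # 108
print(chr(108))        # l
print(sys.maxunicode)  # 1114111, the largest index chr accepts

try:
    chr(sys.maxunicode + 1)
except ValueError as error:
    print("out of range:", error)

# Taking the index modulo 1114112 always gives a value chr accepts.
wrapped = (sys.maxunicode + 1) % 1114112
print(wrapped)  # 0
```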

In a file cipher.py, you'll implement the following functions :

base_k(n, k) : return the number n in the base of k. E.g. base_k(78, 7) should return 141.
base_10(n, k) : return in base 10 the number n considered in base k, e.g. base_10(141, 7) should return 78.
encrypt_letter(letter, key) : return the encrypted letter with the key, e.g. encrypt_letter("l", "k") should return 'Ʃ'.
decrypt_letter(letter, key) : return the decrypted letter with the key, e.g. decrypt_letter("Ʃ", "k") should return 'l'.
process_message(message, key, encrypt): return the encrypted message using the letters in key if encrypt is True, and the decrypted message if encrypt is False. For example :

process_message('coucou', 'clef', True)
'ðōϪƕýŕ'
process_message('ðōϪƕýŕ', 'clef', False)
'coucou'
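To sanity-check your results without relying on your own code, one option is python's built-in int, which reads a string of digits in a given base (from 2 to 36). The sketch below verifies the examples given above by hand; it is a check, not an implementation of the requested functions.

```python
# int(text, base) reads the digits in text as a number written in that base.
print(int("141", 7))  # 78, matching base_10(141, 7)
print(int("73", 9))   # 66

# The same check for the encrypt_letter("l", "k") example:
k = 5 + (ord("k") % 5)         # 5 + 2 = 7
total = ord("l") + ord("k")    # 108 + 107 = 215
print(int("425", k) == total)  # True: 425 read in base 7 is 215
print(chr(425))                # Ʃ
```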

Then complete the script so that you can call your script with arguments as follows :

python cipher.py --message "coucou" --key "clef" --mode enc
'ðōϪƕýŕ'

python cipher.py --message "ðōϪƕýŕ" --key "clef" --mode dec
'coucou'

Finally, decrypt the following text :

Ô÷ԼВzϾ֍ćЁ¡ȦóІԩţϭĭВƐÉÉչȧôđȒЀĮƩȒЉƛìāԼњyչչĮƔöȖĬЇՀϽϩyԪţƝćքϩöČȖƔóƮȕƝţāԨԬúϾӅÒǾ¡ƧyЊչϷϼĬѷƦàąԳȕöþƓϫþưȝВèßĂԬѢąԟԲþȩ±ȖĄѡѡȠϼþԽƮëÈӈȝýöƛƳĶȜ϶ƦäĎэԵþԠ

with the following key :

This is the (not so) secret key.

Bonus in the bonus (bonusception)

Modify your script so that the message and key arguments can be paths to text files. To do this I suggest you use the isfile function from the os.path module :

import os
os.path.isfile(mystring)

It will return True if mystring is a valid path to a file and False otherwise.

Then if your script detects arguments that are paths to files, it should use the text contained in the file. Also, if the message argument is a path to a file, the processed message should be saved to a new file with the same name appended with “_encrypted” or “_decrypted” depending on the mode argument.
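The file handling could look like the sketch below. The helper names read_text_or_literal and output_path_for are hypothetical, and placing the suffix before the file extension (message.txt becoming message_encrypted.txt) is an assumption about what "same name" means here. Opening files with encoding="utf-8" matters because the encrypted text contains many non-ASCII characters.

```python
import os


def read_text_or_literal(value):
    """Return the file's content if value is a path to a file, else value itself."""
    if os.path.isfile(value):
        with open(value, encoding="utf-8") as f:
            return f.read()
    return value


def output_path_for(message_path, mode):
    """Build the output file name, e.g. message.txt -> message_encrypted.txt."""
    suffix = "_encrypted" if mode == "enc" else "_decrypted"
    root, extension = os.path.splitext(message_path)
    return root + suffix + extension


print(output_path_for("message.txt", "enc"))  # message_encrypted.txt
```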

After that you can use the decrypted text from earlier as a key to decrypt the file obtained with :

wget https://raw.githubusercontent.com/BrainhackMTL/psy6983_2021/master/content/en/modules/python_scripts/message_encrypted.txt

On the usefulness of "if __name__ == '__main__':"

It is not obvious why you should put the if __name__ == "__main__": line in your script. Indeed, in a lot of cases, including it or not changes nothing about how your code runs. But in settings where multiple scripts import from each other, leaving it out can quickly lead to a nightmare. To give you an insight into how and why it is useful, here is an example (if you don't want to read or if you want complementary explanations, here is a nice youtube video about it).

Suppose you have a script to fit a Ridge model on provided data, judiciously named fit_Ridge.py, which looks like this :

#!/usr/bin/env python
import argparse
import pickle  # pickle is a library to save and load python objects.
import numpy as np
from sklearn.linear_model import Ridge

def fit_Ridge_model(X, Y):
    model = Ridge()
    model.fit(X, Y)
    return model

parser = argparse.ArgumentParser()
parser.add_argument("--X_data_path", type=str)
parser.add_argument("--Y_data_path", type=str)
parser.add_argument("--output_path", type=str)
args = parser.parse_args()

X = np.load(args.X_data_path)
Y = np.load(args.Y_data_path)
model = fit_Ridge_model(X, Y)
pickle.dump(model, open(args.output_path, 'wb'))

This script allows the user to provide the paths to two numpy files as data to fit a Ridge model, and to save the model to the provided path with a command like :

python fit_Ridge.py --X_data_path data_folder/X.npy --Y_data_path data_folder/Y.npy --output_path models/Ridge.pk

There is no if __name__ == "__main__": to be seen but, used on its own, the script works fine.

But later, you write another script compare_to_Lasso.py that compares Ridge and Lasso models on the same data, so you need to fit a Ridge model again. Eager to apply good programming practices, you judiciously decide not to duplicate the code for fitting a Ridge model, but to import the fit_Ridge_model function from fit_Ridge.py. Thus your second script looks like this :

#!/usr/bin/env python
import numpy as np
import argparse
from sklearn.linear_model import Lasso
from fit_Ridge import fit_Ridge_model

parser = argparse.ArgumentParser()
parser.add_argument("--X_data_path", type=str)
parser.add_argument("--Y_data_path", type=str)
args = parser.parse_args()

X = np.load(args.X_data_path)
Y = np.load(args.Y_data_path)

ridge_model = fit_Ridge_model(X, Y)
lasso_model = Lasso()
lasso_model.fit(X, Y)

ridge_score = ridge_model.score(X, Y)
lasso_score = lasso_model.score(X, Y)

if ridge_score > lasso_score:
    print("Ridge model is better.")
else:
    print("Lasso model is better.")

It seems fine but here when you try to call

python compare_to_Lasso.py --X_data_path data_folder/X.npy --Y_data_path data_folder/Y.npy

you get an error :

Traceback (most recent call last):
  File "compare_to_Lasso.py", line 5, in <module>
    from fit_Ridge import fit_Ridge_model
  File "/Users/francois/scratch/fit_Ridge.py", line 21, in <module>
    pickle.dump(model, open(args.output_path, 'wb'))
TypeError: expected str, bytes or os.PathLike object, not NoneType

The error shows that the script tried to save a model to the path args.output_path, which was not provided on the command line, so it was set to None and raised a TypeError. But our compare_to_Lasso.py script never tries to save a model! Looking at the other lines of the error message, we see that the error comes from the import. What happens is that when we import the fit_Ridge_model function from the fit_Ridge.py file, python reads the entire file and executes everything written in it: it parses the same command line (which has no --output_path), fits a Ridge model and tries to save it. But we don't want python to execute everything, we just want it to read the definition of the fit_Ridge_model function. That is why we need the if __name__ == "__main__": here, so we modify the fit_Ridge.py script like this :

#!/usr/bin/env python
import argparse
import pickle  # pickle is a library to save and load python objects.
import numpy as np
from sklearn.linear_model import Ridge

def fit_Ridge_model(X, Y):
    model = Ridge()
    model.fit(X, Y)
    return model

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--X_data_path", type=str)
    parser.add_argument("--Y_data_path", type=str)
    parser.add_argument("--output_path", type=str)
    args = parser.parse_args()

    X = np.load(args.X_data_path)
    Y = np.load(args.Y_data_path)
    model = fit_Ridge_model(X, Y)
    pickle.dump(model, open(args.output_path, 'wb'))

Now when importing from this script, python will read the definition of the function, but after that it will not execute the rest, since during the import the variable __name__ is not set to "__main__" but to "fit_Ridge".
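To see the two values of __name__ for yourself, here is a small self-contained sketch. It writes a throwaway module (the name demo_module and its content are made up for the demonstration), loads it once the way an import would, and runs it once as a script using runpy from the standard library.

```python
import importlib.util
import os
import runpy
import tempfile

MODULE_SOURCE = '''
def double(x):
    return 2 * x

seen_name = __name__  # the value of __name__ while this file is executed
guard_ran = False
if __name__ == "__main__":
    guard_ran = True
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_module.py")
    with open(path, "w") as f:
        f.write(MODULE_SOURCE)

    # Case 1: load the file the way an import statement would.
    spec = importlib.util.spec_from_file_location("demo_module", path)
    imported = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(imported)

    # Case 2: run the file as a script, like "python demo_module.py".
    as_script = runpy.run_path(path, run_name="__main__")

print(imported.seen_name, imported.guard_ran)          # demo_module False
print(as_script["seen_name"], as_script["guard_ran"])  # __main__ True
```

In both cases the function double is defined, but the block under the guard only runs in the second case.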

In the end, using if __name__ == "__main__": is the standard way to make it safe to import functions from a script, and since you never know for sure that you won't have to import something from a script in the future, putting it in all of your scripts by default is not a bad idea.

More resources

If you are curious to learn more advanced capabilities for the Argparse library, you can check this Argparse tutorial.

To learn more about python in general, you can check the tutorials of the official python documentation and choose the topic you want to learn. I also recommend the Programiz tutorials, which have nice videos. Finally, for even nicer and fancier videos, there is the excellent python programming playlist from the youtube channel Socratica.
