Python Sins and How to Fix Them

Published July 21, 2022

In my time as a software engineer at AWS, I reviewed a surprising amount of bad code.

Amazon's culture typically sacrifices quality for velocity, which naturally selects for engineers who bias in the same direction. This strategy has paid dividends at the company level, allowing Amazon to grab critical market share early. But I have seen product failures, burnout, poor hiring decisions, and even major business opportunities missed due in no small part to unapologetic velocity-focused development.

I don't blame the business or management for this, nor do I really blame the culture. In my view, this is largely a self-inflicted wound by engineers. Managers don't get credit for quality (quality is also hard to measure), they get credit for delivery. Thus, the onus lies with engineers to insist on quality throughout the stack that will keep the team and their product flexible and resilient.

I dare not claim causality, but in my experience, there is a perfect correlation between teams who embody quality and successful teams. There are undoubtedly counterexamples out there, but I've never seen one myself.

I was once part of a team who's enduring disregard for quality mired it in so much technical debt as to render it immobile, resulting in almost no product delivery for 10 months. While on that team, I wrote the following document for junior engineers in an attempt to upskill the team and nudge the culture in a more positive direction.

I'm agnostic as to whether it did any good, but I hope junior engineers out there will find this useful.

I have kept this doc mostly unaltered from its original internal version, save for cleansing some naming. It is intended for beginner software engineers to level-up their code quality.




Python Sins and How to Fix Them

Idiomatic, clear, digestible, testable, clean, communicative code is important, especially when collaborating on code with others (and your future self).

In this doc, I highlight common Python mistakes and offer alternatives.

Note that some (not all) of the recommendations below are opinions that experienced developers may disagree on. Consider, however, that more Pythonic is almost always better than less Pythonic. Speaking a common language rooted in best practices recognized by the community keeps the expectations of the code's readers intact and promotes a more robust, longevity-oriented code base. (See also: the principle of least astonishment).

Many of the below infractions are not critical enough to warrant comments on code reviews, so they can go uncorrected and sneak into the codebase during high-velocity projects. Try to avoid these sins before submitting a review so that reviews can focus on more important business domain decisions.

Sin 1: dict.get and None

None is confusing and obfuscatory and bug-prone and bad! In fact, None is considered a "billion dollar mistake" by its inventor! Modern languages like Rust and Typescript don't let you play fast and loose with None for a reason. I argue that None checks in your code (if x is None) are a great indication that there could be improvements made to the control flow. Don't believe me? Try and think of a case where you need None; it's harder than you think!*

dict.get is a one-way ticket to None-land! dict.get should ONLY be used when BOTH

  1. a missing key is an expected component of the control flow, AND
  2. you need to assign a default value.

Even then - if both are indeed true (which is almost never the case), it may indicate larger architectural or code design problems. There are almost no cases where dict.get is the right choice. Instead, use in checks (if x in dict) or subscripting, and handle the KeyError case if there's a chance that a key is missing (like when you fetch data from an external system or receive user input).

# bonus: avoid "magic numbers", even if the number is 0.
# Unpack tuples where it makes sense.

url = mirrorlist.get(region)[0]    # sad emoji!
url = mirrorlist[region][0]        # better!
url, *_ = mirrorlist[region]       # best!

*Or try explaining None to anyone who isn't either a computer scientist or an Ontologist.

Sin 2: Unnecessary “else” clauses

return and raise ALWAYS stop execution. Don't put them in else blocks!

# AVOID 😢

if _:
    return True
else:
    return False

if _:
    raise Exception
else:
    return False


# GOOD 🥳

if _:
    return True
return False

if _:
    raise Exception
return False

Sin 3: Bad function naming

The get_<thing> function naming pattern is a good indication that your naming could be improved. Prefer verbs like "extract", "construct", "parse", "fetch" "aggregate", "generate", "build", "create", etc.

These alternatives at least tell me something about what the function may or may not do. get_<thing> is truly a black box.

def get_env(): # I literally have no idea what this func does
    ...


def load_configuration_from_json_file(): # I'm 99.9% sure what this func does
    ...


# bonus: don't be afraid to use long function names!
# It's not the 1970s; there's no cost for bigger files.
# Your code is written once and read many times.
# Help your reader to make sense of your code using BFS, not DFS

Sin 4: One-liner functions

Expressions rarely need to be functions. As professional python developers, we know what json.dumps() means - no need to encapsulate that behavior in a named utility function.

When I see json_output() called in the codebase, the first thing I wonder is: "What else could that function be doing besides just serializing the json?"

# avoid these 😢:

def json_output(data):
    return json.dumps(data, sort_keys=True, default=str)

def get_sts_client():
    return boto3.client("sts")

# there is no unit testing benefit here;
# `boto3.client` can easily be patched from the unit test class

# bonus: basic expressions should not be unit tested.
# Unit testing the python language*, standard library,
# or even 3rd party dependencies like boto3 is not our job

*While we shouldn’t unit test the language, it may sometimes make sense to unit test that a particular function is called during execution, especially if that call has important, non-obvious side effects (like emitting a metric). If someone alters the code in the future and eliminates this code path, the test will fail, making them think twice about what they are doing. To write this type of unit test, make use of unittest.mock.Mock.assert_called and its variants.

Sin 5: Not using List Comprehension and Generators

Know how and when to use them! Don’t forget about dict and set comprehension as well (and plain old generators!). Leaving out these constructions does not improve readability in many cases. Professional python programmers are expected to be able to read and understand these syntactical features; they are part of the language for a reason.

Generators are memory efficient and idiomatic. Python programmers can generally get away with not using them, since they are often writing scripts in environments where performance and memory efficiency is not mission critical. Still, it’s good form to use them, and they raise the bar on code quality.

Sin 6: Improper formatting

This can be easily enforced by a common formatter config across our codebases; I may try and do something like this soon.

A standardized format allows others to make properly formatted code changes without having to worry about excess diffs on lines that others worked on. It might seem trivial, but it all contributes to the Developer Experience.

If your code is formatted improperly, I wonder “what else is the author taking shortcuts on?”

Recommended formating / style:

  • flake8 compliant
  • isort
  • black --preview

The --preview flag above (referred to as --experimental-string-processing in earlier versions of black) will break up long strings and can help with auto formatting longer strings that flake8 will flag. I recommend adding this flag if you’re using an autoformatter.

Sin 7: Not using type hinting in python3

Python is dynamically typed, which is great for scripting, but not great for production. Use mypy and typehints everywhere! If you’re using a python version <3.8, you can still use type hints! Just add the following line at the top: from __future__ import annotations

Sin 8: Over-use of constants.py and utils.py

Developers often cite ease of access and reusability as a reason for chucking all kinds of state and behavior from different domains of their application into package-level constants.py and utils.py. I see these types of files all over code bases, and I don't think it aids in code re-use as advertised.

  • constants.py
    • It’s in the name! If it’s a constant, we probably don’t expect it to change, so we don't need "easy" access to it.
    • If we do expect it to change, it should be treated as configuration and managed elsewhere.
    • I argue that it's better to keep constants in the file where they're used; maybe even inline for strings (as opposed to at the top of the file in the global namespace).
    • If the constant is a number, it's reasonable to name it (see: "no magic numbers").
  • utils.py / helpers.py / tools.py (the beast goes by many names)
    • If a function can be reused only within a module, keep it in that module.
    • If a function can truly be used by different modules in your app, someone has probably already created a library for it.

In summary, proximity matters. Keep state and behavior biased towards the leaves of the module dependency tree. This keeps logically separable domains, well... separated. You can always hoist these things up the tree later without much headache.

Embracing tight coupling between local state and the code which it parameterizes is what makes frameworks like React and Tailwind so great. (Ok, frontend is a different universe, but I think the analogy holds).

Sin 9: One-off names and functions

Don't name variables if you don't have to. Good naming is really hard. It's not just assigning a label to a memory location in the namespace - it's communicating to the reader what this thing is / does. If this can already be inferred from context, you're not adding anything by naming it. In fact, you create ambiguity by making the reader wonder if that variable is used again farther down in the local scope.

# :(

filename = "config/auth.json"
with open(filename) as f:
    ...

# :)

with open("config/auth.json") as f:
    ...



# :(

def foo():
    # ... function logic
    thing = construct_thing()
    return thing

# :)

def foo():
    # ... function logic
    return construct_thing()

# :(

foo = bar.baz()
xyz = foo.qux

# :)

xyz = bar.baz().qux

This goes for functions too - don't name the behavior* - just do the behavior. Don't be afraid of longer functions with a descriptive docstring + comments**

# bonus: use "_" in lambdas when you don't need to name the variable

# 😐: "repo" is ambiguous and does not help the reader
# understand anything beyond what the syntax already implies:
# "this thing is an element of `repolist`"

filter(lambda repo: repo.repo_name == repo_config["name"], repolist)

# 😏: great! One less word to keep track of - now I'm free
# to think about this code fragment as an atomic,
# gestalt behavior rather than a coordinated
# interaction between several named variables

filter(lambda _: _.repo_name == repo_config["name"], repolist)

# another way to think about this: if you describe the behavior
# of the above code fragment to a colleague in english,
# you could easily do so # without using the word "repo".
# You might say something like:
#    "This code filters repolist elements
#        whose name matches the name in repo_config"

*One notable exception here is that there may be a chunk of behavior that you want to unit test in isolation. Ideally, you would mock whatever is necessary on the parent function, but this can get tricky. Also, python makes it difficult to unit test nested functions, a good solution in other languages.

In this case, I recommend splitting the behavior into another function, but making that function private with a leading “_”, indicating that this function is really a “sub-function” that is only used locally.

For example:

def handler(context: dict, event: dict):
    # ... complex, hard-to-mock stuff
    tricky_regex_i_wanna_unittest = re.sub(r"\s+", " ", func.__doc__)
    # ... complex, hard-to-mock stuff

# the above func may be hard to unittest in its entirety,
# so we split out the behavior,
# but keep it close in the file to its parent and inidcate its locality with `_`.

def _condense_whitespace(s: str) -> str:
    return re.sub(r"\s+", " ", s)

def handler(context, event: dict):
    # ... complex, hard-to-mock stuff
    tricky_regex_i_wanna_unittest = _condense_whitespace(func.__doc__)
    # ... complex, hard-to-mock stuff

**Comments are a controversial topic in Programming, but for a dynamically typed scripting language like Python (where the guardrails are off), I think a tasteful amount of comments can be a great way to communicate tricky logic to the reader and save them some precious brain cycles.

Sin 10: Poor requests library error handling

The precise behavior of requests.get is important to understand. requests.get will raise exceptions, but only if it can't complete a connection to the url supplied. A successful connection round trip (even if a 4xx or 5xx status_code is returned) will not raise. You can introduce this behavior by calling response.raise_for_status, but that will only raise for http codes >= 400.

Most likely, you want this part of your code to fail if you get any response other than 200. Surround your requests in a try block!

# here's one example, but there are multiple ways to go about this
def fetch_data_from_api(url: str) -> bytes:
    try:
        res = requests.get(url, timeout=5)
    except requests.exceptions.RequestException as e:
        raise MyDomainSpecificException("reason1") from e
    if res.status_code != 200:
        raise MyDomainSpecificException("reason2")
    return res.content

# bonus: consider using a url library to contruct your url, and include a
# timeout value in your request so the behavior of your code in production
# is better understood

# bonus bonus: consider what parts of your code are making external requests.
# Can you separate your model / domain logic from adapters that know about
# external systems?
# Architecture patterns like the Clean Architecture can be helpful here
# https://www.techtarget.com/searchapparchitecture/tip/A-primer-on-the-clean-architecture-pattern-and-its-principles

Sin 11: Not using the standard library

There are some really great utilities in the standard library that are performant, safe, and idiomatic. ALWAYS use them over your own implementation. Here are a few of my favorites:

  • itertools.chain
  • collections.defaultdict
  • collections.Counter
  • itertools.product
  • collections.deque
  • bisect.bisect_left (ok probably wont use this one outside of Leetcode, but I still love it)
  • sys.getsizeof

Sin 12: Not using sets and set comprehension

Python sets provide O(1) lookups and automatic deduplication. Consider using them over list or tuple where it makes sense. If you’re using the in keyword, it’s a good indication you should be using a set.

# 😢
a = [1, 2, 3]
if 2 in a: continue
if 3 in a: break

# 🥳
a = set(1, 2, 3)
if 2 in a: continue
if 3 in a: break

# Note the nuance, however:
# creating a set costs O(n) up-front.
# Thus, set lookups technically only make sense
# when repeated searching occurs.
# However, I argue its good form to just use sets every time

Sin 13: Not understanding Python concurrency models: Event Loop (asyncio) vs Multithreading vs Multiprocessing

Understand the 3 levels of concurrent computing

  1. Event loop - parallelizes I/O work with coroutines. Provided by asyncio
  2. Multithreading - parallelizes I/O work with separate threads. Provided by multithreading
  3. Multiprocessing - parallelizes compute work with separate unix processes. Provided by multiprocessing

Good rule of thumb: use asyncio unless you know for a fact you cant. If you can’t, prefer multithreading unless your workload is CPU bound.

If you really want to use multithreading, that’s fine, but prefer asyncio. No need to worry about thread safety with asyncio!

asyncio / multithreading work great for I/O bound tasks like reaching out to the internet for data. There are very few use cases RPA has for multiprocessing. Ask yourself: why specifically do you need a unix process instead of a thread or coroutine?

Many commonly used libraries like boto3 are thread safe.

Remember that python is single threaded but can accomplish tasks asynchronously. multiprocessing is best used for tasks that are compute bound, which probably isn't true for your application.

Sin 14: Arbitrary top level code

Any code not in a function will execute when its parent module is imported. Avoid over-reliance on global variables, especially if they are not shared by 2 or more functions. Some global variables are quasi-unavoidable and expected in a python context, such as initializing a global logger.

Sin 15: Not being aware of logging module quirks

Python’s logging module is really good, albeit a bit tricky. The source code is worth a read. Here are some important details:

  • There is only one root logger. Don’t touch it unless you have a really good reason to. Instead, always get a logger with the name set to the module name: log = logging.getLogger(__name__)
  • There are two sets of log levels that can affect an emitted log event. One for the logger, and an additional one for each handler (there’s always at least one, including a fallback handler!). These can be different levels; read the docs carefully to avoid missing logs.
  • Be wary of logging.basicConfig; there are lots of gotchas. It’s best to create your own handlers. It’s not as pretty, but you’ll know exactly where your log events are dispatched and how they are filtered.
  • Logging with lambda can be tricky, because the lambda runtime messes with the root logger (which is annoying).

This fantastic video on modern python logging came out after I wrote v1 of this essay, and it's the video I wish I'd had years ago. mcoding's channel is great in general for all things python.

Sin 16: Using standard classes as a data store

Let data just be data!

If your data is pure, there isn’t really any reason to turn it into a class, which fundamentally associates state and behavior. One of my favorite tech videos of all time has a great deep-dive on this topic; it's worth a watch for any programmer.

If you really need to bundle data into a class, consider using an Enum, dataclass, or namedtuple. These constructs are underused and very powerful.

One of my several controversial python takes is that many classes are better off as dataclasses, even if they also have methods. A weird rule of thumb I came up with: If your class has more attributes than user-defined methods, it should be a dataclass.

# if your class looks like this:

class MyThing:
    def __init__(self, param1, param2):
        self.param1 = param1
        self.param2 = param2
    
    def do_stuff(self):
        ...

# I argue that it should be a dataclass!
# Make use of `__post_init___` for any setup work needed upon instantiation

# bonus: this probably shouldn't be a class at all!
#        does `do_stuff` really need to be *associated*
#        with the data (param1,2)?

Sin 17: Not using newer versions of python

It’s 2023! Python 3.9 came out 3 years ago! Use the newer versions; there’s no reason not to. Fantastic new features and performance improvements are added with each release; stay up to date and use them in your code.

Sin 18: Always importing functions and classes instead of modules

This is basically lifted straight from Google’s Python Style Guide: https://google.github.io/styleguide/pyguide.html#22-imports

When you always import specific functions and classes, it may be unclear to a reader where a function or class is coming from. It’s better to be verbose and include the module in the reference!

Remember, you do not hinder performance or increase namespace size by importing the whole module. You do, however, improve readability!

# avoid:

from datetime.datetime import fromtimestamp
...
fromtimestamp(1665769614) # where does this come from?

# better:

import datetime
...
datetime.datetime.fromtimestamp(1665769614)

Other sins and quick-hits

  • Sin: Not using f-strings in Python3. Just do it! They are awesome, readable, and of course, Pythonic! This goes for any of the wonderful features introduced in later versions of python.
  • Sin: Not documenting your code. We are all guilty of this from time to time, but it’s invaluable as a reader to be able to accelerate my understanding of a function / module. This is especially important if you are writing a library which will be consumed by others. LSP implementations read doctrings to annotate hints, and they are discoverable through the help builtin.
  • Regular expression compilations are automatically cached; no need to compile up front and store in a variable
  • open defaults to "r" mode, no need to specify it
  • Sin: Over-parameterized function signatures. If a certain parameter of a function is only ever going to be one thing, remove it! Unless you’re vending the function as a library (API) where your function signature is a contract, you can add more parameters later without too much headache.
  • Pass around native data structures or interfaces between functions. Dependency inject where possible.
  • Application inputs and outputs should be strictly enforced with versioned schemas. Use things like Jinja2, jsonschema, or pydantic for things like templating and schema / input validation; don't write a custom function to check simple rules like string length, presence of keys, value types, etc.
  • Sin: Not using moto!
  • When using a context manager for reading files, if you are reading the file fully into memory, exit as soon as you are done loading the file. No need to keep the file open while you perform other work - this could lead to unexpected behavior. It’s another one of those “it will probably never cause me a problem, but why take the risk?”
❌
with open("hi.txt") as f:
    content = f.read()
    do_stuff(content)

✅
with open("hi.txt") as f:
    # this loads the file into memory entirely, no need to keep the file open!
    content = f.read()
do_stuff(content)

Topics I plan on adding in the future:

  • Modeling the business domain
  • Dependency inversion
  • Interfaces and abstractmethods
  • Clean(er) Architecture

Š 2018-2024 Brendan Schlaman. All rights reserved.