Infra Runbook

This is the infrastructure team runbook. Please refer to this document for the practices our team uses to support the gCS organization. Here you can find a summary of our work and references to the complete documentation.

Table of Contents

WIP: To be completed

  1. On-Call

On-Call

We use an on-call schedule to dedicate one teammate each week to addressing all infra requests. This practice lets the others focus on their tasks and gives our customers a shorter response time.

Responsibilities

  1. Monitor and respond to support questions posted in the ecdi-ace-infra-support Slack channel.
  2. Monitor the On-Call ZenHub Issues Board, respond to any new tickets within 24 hours, and address them in accordance with their priority rating:
    1. P0-P1: These tickets represent outages or blockages with no workaround and should be prioritized above everything else, including meetings.
    2. P2: These tickets are less severe than P0-P1, but the on-call engineer should attempt to resolve them during their shift, time permitting.
    3. P3-P4: These tickets represent standard requests and are NOT the responsibility of the on-call engineer to resolve, but they should attempt to triage them within their shift if they have enough information and knowledge to do so. Where they need more information, they should reach out to the ticket submitter with questions. Where they need triage assistance, they should take time during the Wednesday infrastructure meeting to collaborate with team SMEs to assess level of effort and timebox.
  3. Perform an explicit Monday morning handoff to the succeeding on-call engineer, during which they will provide context necessary to ensure a smooth transition with no dropped issues.
  4. Reassign any partially completed issues from gredacs1_roche to themselves for completion. When reassigning, also remove the oncall tag so the issue no longer shows up on the on-call board.
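
The priority rules above can be sketched as a small lookup. This is only an illustration of the policy as written in this runbook: the names ONCALL_POLICY, oncall_action and response_due are hypothetical and do not correspond to any ZenHub feature or existing team tooling.

```python
from datetime import datetime, timedelta

# Hypothetical encoding of the on-call priority rules from this runbook.
# The wording of each action paraphrases the responsibilities list.
ONCALL_POLICY = {
    "P0": "resolve now, ahead of everything else including meetings",
    "P1": "resolve now, ahead of everything else including meetings",
    "P2": "attempt to resolve during the shift, time permitting",
    "P3": "triage during the shift if possible; resolution is not required",
    "P4": "triage during the shift if possible; resolution is not required",
}

# New tickets should get a response within 24 hours.
RESPONSE_WINDOW = timedelta(hours=24)


def oncall_action(priority: str) -> str:
    """Return what the on-call engineer is expected to do for a priority."""
    key = priority.strip().upper()
    if key not in ONCALL_POLICY:
        raise ValueError(f"unknown priority: {priority!r}")
    return ONCALL_POLICY[key]


def response_due(created_at: datetime) -> datetime:
    """Return the latest time by which a new ticket should get a response."""
    return created_at + RESPONSE_WINDOW
```

For example, oncall_action("p2") returns the P2 rule, and response_due(datetime(2023, 4, 20, 9, 0)) gives 9:00 on the following day.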

Add link to dedicated document on the procedure.

Priorities Defined

Workflow

We use two-week sprints to plan and progress our work. There are four regular meetings that we are all required to attend:

Weekly Sync (aka Standup)

We meet every Wednesday and Thursday at 9:30 (PST) to sync up with other engineers. These meetings are designed to update everyone about our progress and let each other know if we need help on a task.

Planning Sessions

We meet every other week on Thursday at 11:00 (PST) to plan our next sprint. During this meeting we check on our priorities for the year. Please check out this doc for 2023 priorities. Each area has an owner who talks about the progress in that area and the plan for the future. After this meeting, engineers clean up their boards and add the corresponding tasks based on the plan.

When do we discuss the infra requests? Move the rest into a dedicated page!

Capacity Model

This is in progress and we should discuss it!

Estimating Level of Effort (LOE) with Story Points

Matrix

Below is our first pass at defining a story point cheatsheet. We agreed on April 20, 2023 that we’re going to take it out for a walk and see how it feels. We’ll collectively decide if/when to make adjustments based on feedback. Our objective here is simply to get us all working off the same script. We agreed to the points-to-time conversion before I added the three dimensions (uncertainty, complexity, effort) to the table. As a result, it’s somewhat incomplete, as it doesn’t yet account for mixed sizes across the dimensions. I will work on this later.

Story Points  Uncertainty  Complexity  Effort    Estimated Time Conversion^1^
1             2x-small     2x-small    2x-small  about 30 minutes
2             x-small      x-small     x-small   about an hour
3             small        small       small     about 1/2 day
5             small        small       medium    about 1 day
8             medium       medium      medium    2-3 days
13            large        large       large     about a week (Perhaps should be broken into smaller tasks)
21            x-large      x-large     x-large   about two weeks (Really should be broken into smaller tasks)
40            unused

^1^ Story points are intended as a unit of measurement of the uncertainty, complexity and effort of a task. They are not intended to convert directly to time and some Agile experts consider it an antipattern to do so. That said, we’ve not quite shaken this urge so I’ve added an Estimated Time Conversion column for reference.
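
As a rough illustration, the matrix above can be written as a lookup that also flags tasks large enough to split. The names STORY_POINT_GUIDE, SPLIT_THRESHOLD and describe_estimate are hypothetical; the time strings are copied from the matrix, and the threshold of 13 reflects the two rows that carry a "should be broken into smaller tasks" note.

```python
# Estimated time per story point value, copied from the matrix.
# 40 is left out because the matrix marks it as unused.
STORY_POINT_GUIDE = {
    1: "about 30 minutes",
    2: "about an hour",
    3: "about 1/2 day",
    5: "about 1 day",
    8: "2-3 days",
    13: "about a week",
    21: "about two weeks",
}

# Assumption: tasks at or above this size are candidates for splitting,
# matching the notes on the 13 and 21 rows.
SPLIT_THRESHOLD = 13


def describe_estimate(points: int) -> str:
    """Return the reference time for a story point value on our scale."""
    if points not in STORY_POINT_GUIDE:
        raise ValueError(f"{points} is not on the story point scale")
    note = STORY_POINT_GUIDE[points]
    if points >= SPLIT_THRESHOLD:
        note += " (consider breaking into smaller tasks)"
    return note
```

Calling describe_estimate(5) returns "about 1 day", while describe_estimate(4) raises an error because 4 is not on the scale.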

Theory

Work in progress

Todd wants to share some chunks of knowledge from his research on this topic as there’s still a bit of contention on the topic of story points.

What is a story point

A story point is a unitless measurement of size based on:

  • complexity
  • uncertainty
  • effort

It is not intended to serve as an estimate of the amount of time it will take. Time gets figured out over time via velocity charts.
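
To make the velocity idea concrete, here is a minimal sketch assuming velocity is taken as the average number of story points completed per sprint. The function names and the sample numbers are made up for illustration and are not our real sprint data.

```python
from math import ceil


def velocity(completed_points_per_sprint):
    """Average story points completed per sprint."""
    if not completed_points_per_sprint:
        raise ValueError("need at least one completed sprint")
    return sum(completed_points_per_sprint) / len(completed_points_per_sprint)


def sprints_needed(backlog_points, completed_points_per_sprint):
    """Number of whole sprints needed to finish a backlog at the past velocity."""
    v = velocity(completed_points_per_sprint)
    if v <= 0:
        raise ValueError("velocity must be positive to forecast")
    return ceil(backlog_points / v)


# Made-up history: points completed in the last three sprints.
history = [21, 18, 24]
print(velocity(history))            # 21.0
print(sprints_needed(50, history))  # 3
```

With two-week sprints, three sprints is roughly six weeks. The time estimate comes from observed throughput, not from converting individual story points to hours.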

Reference

For great insight on the topic please read the following articles:

  1. Estimating Work Using Story Points from ZenHub: Includes lessons learned from an award-winning research paper that Microsoft published on this topic.
  2. TeamHood Story Point Estimation: Provides tips and examples for doing story point estimation.
  3. CheatSheet for Story Point Sizing:
  4. Fibonacci scale: Primer on the story point number system used by ZenHub.

Weekly Design Reviews

Every week on Wednesdays at 11:00 (PST) we have our design sessions. More information on this later.

Biweekly Demos

Every other week on Fridays at 9:00 (PST) we have our demo sessions. More information on this later.

Biweekly Retrospectives

Every other week on Fridays at 9:00 (PST) we have our retrospective sessions. More information on this later.

CIDM

The infrastructure team is not the owner of CIDM, but it is connected to Okta, and all group access is controlled there. Working with CIDM can be confusing, so I’m going to provide some information here to speed up our process.

1. To check which groups a user belongs to: Please go to cidm.roche.com. On the page, search for all groups where the user is a member, using the user ID or email address of the user. You can use the Employee search on your Chrome extension to find this information.

2. To check groups: Please go to cidm.roche.com. On the page, search by group name.

3. If you want to request access to a group for yourself: Please follow these steps:

  • Go to cidm.roche.com
  • Navigate to the left-hand side of the page and click on “Manage Access”
  • Select “Request Access”
  • Search for the group you want to join
  • If you see a “No Search Result” error, double-check the name of the group you are looking for. If the name is correct, the group is not requestable and its owners must add you manually. To find the owners, follow step 2 and look at the group’s “Owner” tab. There you can find the owners’ contact information and send them your request via Slack.