codestyle — tnpl.me

Config vs Setting vs Feature flag

Sat, 04 Dec 2021 04:32:42 +0000

Where to log

Sat, 20 Nov 2021 04:27:58 +0000

I would like to write in more detail about where to put logs, the benefits and disadvantages of doing so, and how to mitigate the drawbacks.

For context, assume we're developing web services, whether publicly accessible or for internal use. The protocol doesn't matter: HTTP with JSON, gRPC, Thrift, etc. I think this guide can be adapted to all of these technologies.

So, where to put logs?

1. Incoming requests – aka. access log

Usually, there are two types of requests: read and write. You can write a middleware, interceptor, AOP aspect, or whatever suits your situation, and log all the necessary data such as HTTP request data, response status, processing time, caller information, controller and handler name, etc.

This is very useful in many situations:

Monitoring anomalies in 4xx or 5xx HTTP statuses
Investigating which endpoint is slow for which user
Knowing who performed what action and when

Some people take it to the extreme by always logging the entire request and response body. It depends on the context and the company, but I suggest logging those only when there is a serious error such as HTTP 500. That way you avoid violating user privacy, reduce security risk, and save cost.
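A minimal sketch of such an access log in Python. The Request and Response shapes and the handler signature are hypothetical, made up for illustration rather than taken from any framework. It writes one JSON line per request and attaches the bodies only for 5xx responses:

```python
import json
import logging
import time
from dataclasses import dataclass

access_logger = logging.getLogger("access")


@dataclass
class Request:  # hypothetical request shape, for illustration only
    method: str
    path: str
    caller: str
    body: str = ""


@dataclass
class Response:  # hypothetical response shape, for illustration only
    status: int
    body: str = ""


def build_access_entry(req, resp, duration_ms):
    """Collect the fields of one access log line."""
    entry = {
        "method": req.method,
        "path": req.path,
        "caller": req.caller,
        "status": resp.status,
        "duration_ms": round(duration_ms, 2),
    }
    # Bodies are attached only for server errors, to limit privacy exposure and cost.
    if resp.status >= 500:
        entry["request_body"] = req.body
        entry["response_body"] = resp.body
    return entry


def with_access_log(handler):
    """Wrap a handler so that every call produces one access log line."""
    def wrapped(req):
        started = time.perf_counter()
        resp = handler(req)
        duration_ms = (time.perf_counter() - started) * 1000
        access_logger.info(json.dumps(build_access_entry(req, resp, duration_ms)))
        return resp
    return wrapped
```

In a real service the wrapper would also need to handle a handler that raises, for example by logging the request as a 500 before re-raising.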

2. At the end of handlers that are not read-only

You should put a log to record an event (what happened) at the end of every handler that is not read-only, such as POST, PUT, PATCH, DELETE, or the equivalent in the RPC technology you are using.

I don't suggest putting it on read-only handlers because most systems are very read-heavy, and you already have the access log from above, so it doesn't add much value to do that on all handlers.

What should be logged on those non-read-only handlers?

Who performed the action
The name of the event, which should be unique across all of your systems
The request or input data of the action
The result, or the data that has been changed
Any metrics or numerical values

To log this much information, a structured log format such as JSON is a must.
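As a rough illustration, an event log line carrying the fields listed above could be built like this in Python. The function names, the field names, and the "order.created" event name are assumptions made up for this sketch:

```python
import json
import logging
from datetime import datetime, timezone

event_logger = logging.getLogger("events")


def build_event(name, actor, request, result, metrics):
    """Assemble one business event as a plain dict."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": name,        # should be unique across all systems
        "actor": actor,       # who performed the action
        "request": request,   # input data of the action
        "result": result,     # what was changed
        "metrics": metrics,   # numerical values for dashboards
    }


def log_event(name, actor, request, result, metrics):
    """Serialize the event as a single JSON line and log it."""
    line = json.dumps(build_event(name, actor, request, result, metrics), default=str)
    event_logger.info(line)
    return line


# Example call at the end of a hypothetical "create order" handler.
log_event(
    "order.created",
    actor="U1",
    request={"items": 3},
    result={"orderId": "ORD1"},
    metrics={"total_amount": 1250.0},
)
```

Keeping numbers in a dedicated field makes it easier to aggregate them later without parsing the message text.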

The benefit of doing this is that you can build real-time business metrics to show to all stakeholders. It is pretty exciting to feed in the data, visualize it, and see the impact of what you did in real time.

3. When a handler hits an error or exception

The access log with the response status is not enough for investigation when a request hits an error or exception. Handling errors and writing a log in each handler is very helpful for knowing what went wrong with that request.

You should log the error message, the stack trace, why the error happened, and in which block of code. Don't write a generic error log or catch a generic exception, because you won't get enough detail to understand the problem.
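A small sketch of this in Python, using the standard logging module. The OrderExpiredError class, the order dictionary, and the status codes are hypothetical and only serve the example:

```python
import logging

logger = logging.getLogger("orders")


class OrderExpiredError(Exception):
    """Hypothetical domain error raised when an expired order is modified."""


def start_processing(order):
    if order["expired"]:
        raise OrderExpiredError(order["id"])
    order["state"] = "PROCESSING"


def handle_start_processing(order):
    try:
        start_processing(order)
    except OrderExpiredError:
        # Catch the specific exception, not a bare Exception.
        # logger.exception logs at ERROR level and attaches the stack trace.
        logger.exception(
            "Order expired before processing",
            extra={
                "order_id": order["id"],
                "state_from": order["state"],
                "state_to": "PROCESSING",
            },
        )
        return 409
    return 200
```

The specific except clause documents why the error happened, and the extra fields say which order and which transition were involved.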

4. When the system calls external dependencies

In a microservice architecture, it is common for your system to make API calls to other services. Recall that there are two types of requests: read and write. If you send a request to get data back without any side effects, it is a read, which you may also call a query request.

As your system is the client calling another service, I suggest you write a log for every request sent out from your service. Again, the information to log is much the same as in the access log.

There are disadvantages to logging every call, especially for query requests. The most significant drawback is overhead. Around 80% of requests are query requests; logging all of them will cost you storage, compute power, and response time. In this case, metrics are a better solution, as you can count the number of requests and record a duration histogram.

That said, if a query request fails, I still suggest logging it for investigation. You may also apply log sampling or a rate limit to mitigate log flooding in case 100% of requests are errors.
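One possible way to rate-limit error logs is a fixed-window counter, sketched below in Python. The class name and parameters are made up for this example, and other strategies (random sampling, token buckets) would work as well:

```python
import time


class LogRateLimiter:
    """Allows at most `max_per_window` log lines per fixed time window."""

    def __init__(self, max_per_window, window_seconds, clock=time.monotonic):
        self._max = max_per_window
        self._window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._window_start = clock()
        self._count = 0
        self.suppressed = 0  # total number of lines dropped so far

    def allow(self):
        now = self._clock()
        if now - self._window_start >= self._window:
            # A new window begins: reset the budget.
            self._window_start = now
            self._count = 0
        if self._count < self._max:
            self._count += 1
            return True
        self.suppressed += 1
        return False
```

A caller would check `limiter.allow()` before writing the error log for a failed outbound call, and could periodically log the `suppressed` count so the dropped volume stays visible.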

5. After the system did something important

Even though you have event logging at the end of handlers, there is still a gap: things like cron jobs, watchdogs, and housekeeping tasks are not logged. In many cases, it is useful to know what the system was doing so you can correlate events when something bad happens.

I once worked on a system where cron jobs ran in the same process as the web server. The application was deployed to many physical servers. When it was time to run a cron job, every server triggered it simultaneously, which caused a sudden spike in system resource usage and database load.

I was lucky: once I looked at the centralized log system around that time, it was obvious that the cron job was the problem.

Here's a list of what should be logged:

When a background job or cron job starts and finishes
Events related to database connections, such as connection loss or failover
Garbage collection events
Setup steps involving other systems, e.g., declaring a queue, registering with service discovery, configuration updates
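For the first item, a context manager is one convenient way to make sure both the start and the end of a job are logged. This is only a sketch; the `logged_job` name and the log messages are assumptions:

```python
import logging
import time
from contextlib import contextmanager

job_logger = logging.getLogger("jobs")
job_logger.setLevel(logging.INFO)


@contextmanager
def logged_job(name):
    """Log when a background job starts, finishes, or fails."""
    started = time.perf_counter()
    job_logger.info("Job started", extra={"job": name})
    try:
        yield
    except Exception:
        # Record the failure with its stack trace, then let it propagate.
        job_logger.exception("Job failed", extra={"job": name})
        raise
    duration_ms = (time.perf_counter() - started) * 1000
    job_logger.info("Job finished", extra={"job": name, "duration_ms": duration_ms})


# Usage: wrap the body of a cron job.
with logged_job("cleanup-expired-sessions"):
    pass  # the actual housekeeping work would go here
```

With start and finish timestamps from every server in one place, a simultaneous trigger like the one described above shows up as a cluster of "Job started" lines at the same moment.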

6. When you want to record what happened

You can log anywhere in the code, so don't limit yourself to logging only inside handlers. I usually log security-related information, such as someone trying to log in with the wrong password, requesting an OTP, or trying to delete a record without permission.

You can also log for audit purposes: database records before and after a change, login success, email verified, etc.

This might sound like a duplicate of No. 2, but it is not. As a single request can perform many tasks, the log at the end of the handler records the overall event. This kind of log is more granular, so it records what happened within a single request.
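For the before-and-after audit case, a small helper can reduce a record to only the fields that changed. This is a sketch assuming records are flat dictionaries with string keys; `diff_record` is a made-up name:

```python
def diff_record(before, after):
    """Return only the fields whose values differ between two versions of a record."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = {"before": old, "after": new}
    return changes


# Example: a user record whose email was just verified.
before = {"email": "a@example.com", "verified": False}
after = {"email": "a@example.com", "verified": True}
audit_fields = diff_record(before, after)
```

Logging only the changed fields keeps audit lines small, though sensitive fields would still need masking before they are written.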

Closing

I have found that logging is more helpful than metrics when you need very detailed information, but it comes with writing, transit, processing, and storage costs. For a system under heavy load, you should reduce the number of logs by using metrics or by sampling the logs. But for non-read-only requests, you should always write logs, because they pay off when you investigate problems and when you build real-time business metrics for monitoring.

These guidelines on where to log are what I think you should always do. Beyond that, feel free to put any log, at any log level, in any location where it might be helpful to you.

#codestyle

Opinionated guide on how to write a log

Wed, 21 Apr 2021 04:13:08 +0000

Have you ever tried opening the log messages you wrote yourself? How do you feel about those logs? Are they as useful as you wanted? And when the system has a problem, try reading the log messages again: how much do you think those logs help you?

I have tried the following techniques when writing logs, and they made the logs more useful.

1. Use structured log

Personally, I recommend logging in JSON format, because it can hold many kinds of data and has a clear structure. Other structured formats work too, as long as the data in the log is easy to parse. What you should not do is log without a clear structure.

Example:

{"timestamp":"2021-04-21T10:30:15.412Z","level":"INFO","message":"Order created","orderId":"ORD1","userId":"U1","subsystem":"SvcName.Package.Class.Method"}
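A log line of this shape can be produced with a custom formatter on top of Python's standard logging module. This is a sketch under assumptions: the `JsonFormatter` class and the convention of passing fields under a `context` key in `extra` are made up for this example, not something the logging module prescribes:

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
            "subsystem": f"{record.name}.{record.funcName}",
        }
        # Fields passed as extra={"context": {...}} are merged into the line.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


logger = logging.getLogger("OrderService")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def create_order():
    logger.info("Order created", extra={"context": {"orderId": "ORD1", "userId": "U1"}})
```

Calling `create_order()` writes one JSON object per line to stderr, which a log shipper can parse without custom patterns.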

2. Log with context

A centralized log system holds a huge number of logs from many systems. Even though you can filter by service, a single service may be serving many requests at the same time, which makes the surrounding logs hard to read.

So log all the supporting data in a single line. We should be able to decide on or identify the problem without relying on other log lines. Also write a message that explains things fully. Don't worry about the size of a single log line at first: wasting storage on a large volume of logs that are not actually useful is worse than storing one large line that is useful.

Examples:

For an order system, include the userId, orderId, and the action performed
For a login system, include the attempt, ip, sessionId, which userId the login was attempted as, and whether it succeeded
For an HTTP access log, include the client ip, HTTP method, full path, domain, response code, etc.

3. Prefer static log message

I used to write very human-friendly log messages, such as

Order id 1001 cannot transition from CREATED to PROCESSING state due to order expired

The problems with this log message are

It is not searchable
Data cannot be parsed out of it
CPU time is spent formatting the log message

I find static log messages simple and free of the problems above, and you don't have to spend time checking grammar either.

The new message looks like this

Order state transition error - expired

As for the order id, state before, state after, and expired time, put them in the structured log together with the context.

4. Include the subsystem

Once you start using concise static log messages, there is a chance that several places in the code will use the same message, so we can't tell which place in the code wrote the log.

Therefore, include subsystem information in the log so you know where in the code it came from. The information to include, in order, is:

Hostname
Application name
Module name
Package name
Class name
Method name

You can drop the items that are already available. For example, if you use Filebeat + Logstash + Docker + Kubernetes, the Hostname and Application name are already sent along.

Log frameworks such as log4j already add the Package and Class names, so you don't need to add them yourself. But if your log framework doesn't, you have to add them.

If log messages are duplicated within a class or package, you should add the method name as well (I recommend always adding it, even when messages are not duplicated).

An example subsystem is com.example.app.service.OrderService.updateState. If that looks too long, you can shorten it to c.e.a.s.OrderService.updateState or OrderService.updateState, but be careful: if the code has OrderService classes in different packages, this can cause problems.
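The shortening of package names to their first letters can be sketched as a small Python helper. The function name and the `keep` parameter are assumptions for illustration; many log frameworks offer a similar built-in option:

```python
def abbreviate_subsystem(subsystem, keep=2):
    """Shorten every dotted segment to its first letter, except the last `keep` segments."""
    parts = subsystem.split(".")
    if len(parts) <= keep:
        return subsystem
    head = [part[:1] for part in parts[:-keep]]
    return ".".join(head + parts[-keep:])


short = abbreviate_subsystem("com.example.app.service.OrderService.updateState")
```

Keeping the initials of each package, rather than dropping the packages entirely, reduces the chance that two classes with the same name become indistinguishable.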

5. Don't use the wrong log level

Log levels are normally FATAL, ERROR, WARN, INFO, DEBUG, and TRACE, ordered from most to least critical/serious.

Not using the wrong level means not using a critical level for a log that is not critical. For example, if you just want to log some data to look at, don't use FATAL, ERROR, or WARN. As for whether to use INFO, DEBUG, or TRACE, I have no particular opinion.

6. Use a centralized log system

I recommend the ELK stack: you can parse JSON logs directly in Logstash, and all logs can easily be searched through Elasticsearch and Kibana. You can filter by field, such as hostname, application name, or subsystem, or do a full-text search on the message.

7. Log to a file or stdout/stderr only

By default, log to a file. The exception is Docker, where you can log to stdout/stderr (it is fast and has low overhead).

Reasons

If you ship logs over TCP, the app can freeze when the centralized log system has a problem, because the TCP buffer on the log sender's side fills up
If you ship logs over UDP, log messages get lost when they are larger than one UDP packet
If you write to stdout/stderr without Docker, the app can slow down, because stdout/stderr is often unbuffered or flushed on every write

These problems can be solved, but doing so takes extra care, and they are common mistakes you are bound to run into. So it is better to choose the safe option from the start. One more thing to watch out for: when writing to a file or to the Docker log, beware of the disk filling up, and rotate logs appropriately.
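As one way to keep file logs from filling the disk, Python's standard library provides RotatingFileHandler, which caps the size of each file and the number of old files kept. The helper name and its parameters below are assumptions for this sketch, and the right limits depend on your disk and log volume:

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes, backup_count, name="app"):
    """Create a logger that writes to `path` and rotates it by size."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines through the root logger
    # Once the file would exceed max_bytes it is renamed to path.1, path.2, ...
    # and at most backup_count old files are kept.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger
```

With this setup the disk space used by the log is bounded by roughly `max_bytes * (backup_count + 1)`.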

8. Always mask sensitive data

Suppose you build a login system based on phone numbers. You should not log the full phone number: it risks data theft and violates privacy.

Mask data as appropriate. For example, turn the phone number 029876543 into 02xxxx543. This still tells us part of the data and keeps the log searchable. Some kinds of data, however, should be masked entirely, such as passwords, authentication tokens, and session ids.
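The masking described above can be sketched in Python as follows. The field names (`phone`, `password`, `token`, `session_id`) and the helper names are assumptions for illustration; a real system would maintain its own list of sensitive fields:

```python
FULLY_MASKED_KEYS = {"password", "token", "session_id"}  # assumed field names


def mask_middle(value, keep_start=2, keep_end=3, mask_char="x"):
    """Keep the first and last few characters and mask everything in between."""
    if len(value) <= keep_start + keep_end:
        # Too short to reveal anything safely: mask the whole value.
        return mask_char * len(value)
    end = len(value) - keep_end
    return value[:keep_start] + mask_char * (end - keep_start) + value[end:]


def mask_fields(fields):
    """Return a copy of the log fields with sensitive values masked."""
    masked = {}
    for key, value in fields.items():
        if key in FULLY_MASKED_KEYS:
            masked[key] = "***"
        elif key == "phone":
            masked[key] = mask_middle(value)
        else:
            masked[key] = value
    return masked


safe = mask_fields({"phone": "029876543", "password": "hunter2", "userId": "U1"})
```

Applying the masking in one place, just before the fields are handed to the logger, makes it harder to forget than masking at each call site.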

Finally

These logging systems cost a lot of money to operate, whether for log storage, log parser, log query service, log shipper, or data transfer.

But a good logging system, together with well-written logs, makes engineers' lives much better, raises productivity, and lets you build enormously on the log data in the future.

#codestyle