Skip to content

The --exact-rowcount option is buggy on large tables #830

@zhxnlai

Description

@zhxnlai

Context

During a schema change, gho-st copies every row from the original table to a ghost table. This results in a lock on the original table and a long running query similar to the following

mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database   | Table                | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table             |      1 |           0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)

mysql> show full processlist;
...
| 1594199 | my_service_ddl | 10.137.166.91:35626  | my_service         | Query       |    0 | Sending data                                                  | select  /* gh-ost `my_service`.`my_table` iteration:108 */
						`id`
					from
						`my_service`.`my_table`
					where ((`id` > _binary'1131891')) and ((`id` < _binary'310834793') or ((`id` = _binary'310834793')))
					order by
						`id` asc
					limit 1
					offset 9999 |

It appears that when --exact-rowcount is enabled, ghost grabs another lock on the original table and starts an addition query to calculate a more accurate progress ETA. By default, row counting happens in parallel to row copying.

mysql> show open tables where in_use>0;
+------------+----------------------+--------+-------------+
| Database   | Table                | In_use | Name_locked |
+------------+----------------------+--------+-------------+
| my_service | my_table             |      2 |           0 |
+------------+----------------------+--------+-------------+
1 row in set (0.00 sec)

mysql> show full processlist;
...
| 1594046 | my_service_ddl | 10.137.166.91:57666  | my_service         | Query       |   71 | Sending data                  | select /* gh-ost */ count(*) as rows from `my_service`.`my_table`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
...

Problem

After all the rows are copied to the ghost table, gh-ost attempts to cut-over the original table.

Usually, the row counting query finishes before all the rows are copied to the ghost table. At the time of cut-over, because both queries have finished, there is no lock on the origin table. Hence the schema change is successful.

However, we repeatedly found that on a large table (> 400GB), row counting query takes longer than row copying. At the time of cut-over, the row counting query is still running and holding a lock on the original table.

This causes the schema change to fail with this error: "Error 1205: Lock wait timeout exceeded; try restarting transaction", which was described in #773

Proposals

If you run into this issue, try disabling --exact-rowcount.

Improvements that we could make

  • Document this bug; add a warning.
  • Teach gh-ost to terminate the row counting query before cutting over the original table.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions