8000 HDFS-16939. Fix the thread safety bug in LowRedundancyBlocks. by zhangshuyan0 · Pull Request #5450 · apache/hadoop · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

HDFS-16939. Fix the thread safety bug in LowRedundancyBlocks. #5450

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you 8000 account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 6, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -86,10 +86,10 @@ class LowRedundancyBlocks implements Iterable<BlockInfo> {
private final List<LightWeightLinkedSet<BlockInfo>> priorityQueues
= new ArrayList<>(LEVEL);

/** The number of corrupt blocks with replication factor 1 */

private final LongAdder lowRedundancyBlocks = new LongAdder();
private final LongAdder corruptBlocks = new LongAdder();
/** The number of corrupt blocks with replication factor 1 */
private final LongAdder corruptReplicationOneBlocks = new LongAdder();
private final LongAdder lowRedundancyECBlockGroups = new LongAdder();
private final LongAdder corruptECBlockGroups = new LongAdder();
Expand Down Expand Up @@ -369,11 +369,11 @@ synchronized boolean remove(BlockInfo block,
* @return true if the block was found and removed from one of the priority
* queues
*/
boolean remove(BlockInfo block, int priLevel) {
synchronized boolean remove(BlockInfo block, int priLevel) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) be made synchronized instead. Its other callers are synchronized methods:

Copy link
Contributor
@saxenapranav saxenapranav Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just ensure only one thread getting into the remove method. But what if there is one thread trying to remove, other trying to add or size(). I feel following should help:

private Object syncOnBlockInfoSet(int priority, BlockInfo blockInfo, String operation) {
    LightWeightLinkedSet<BlockInfo> blockInfos =priorityQueues.get(priority); 
    synchronized (blockInfos) {
      if("size".equalsIgnoreCase(operation)) {
        return blockInfos.size();
      }
      if("remove".equalsIgnoreCase(operation)) {
        return blockInfos.remove(blockInfo);
      }
      //implement other required methods.
      else {
        return null;
      }
    }
  }

all code-pieces where we do operation on an element in priorityQueues, we call syncOnBlockInfoSet to do that operation. for ex: size=syncOnBlockInfoSet(priority, null, "size")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) be made synchronized instead. Its other callers are synchronized methods:

Sorry I don't quite understand. Since the callers are already synchronized, why is it necessary to made boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) synchronized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just ensure only one thread getting into the remove method. But what if there is one thread trying to remove, other trying to add or size().

add() and size() are already synchronized in the current code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) be made synchronized instead. Its other callers are synchronized methods:

Sorry I don't quite understand. Since the callers are already synchronized, why is it necessary to made boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) synchronized?

Since boolean remove(BlockInfo block, int priLevel) 10000 forwards the call to boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas), I am suggesting that lets make boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) synchronized instead.
Callers of boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) are synchronized means that we can have boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) as synchronized without any perf-loss.

Copy link
Contributor
@saxenapranav saxenapranav Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just ensure only one thread getting into the remove method. But what if there is one thread trying to remove, other trying to add or size().

add() and size() are already synchronized in the current code.

these method being synchronized means that only one thread can enter into the synchronized method, but doesn't make the object its working on synchronized. There could be one thread calling size which leads to priorityQueues.get(i).size(), and other calling add which leads to priorityQueues.get(priLevel).add(blockInfo) simultaneously.
Example of probable issue is:
for adding element

protected boolean addElem(final T element) {
// validate element
if (element == null) {
throw new IllegalArgumentException("Null element is not supported.");
}
// find hashCode & index
final int hashCode = element.hashCode();
final int index = getIndex(hashCode);
// return false if already present
if (getContainedElem(index, element, hashCode) != null) {
return false;
}
modification++;
size++;
// update bucket linked list
DoubleLinkedElement<T> le = new DoubleLinkedElement<T>(element, hashCode);
le.next = entries[index];
entries[index] = le;
// insert to the end of the all-element linked list
le.after = null;
le.before = tail;
if (tail != null) {
tail.after = le;
}
tail = le;
if (head == null) {
head = le;
bookmark.next = head;
}
// Update bookmark, if necessary.
if (bookmark.next == null) {
bookmark.next = le;
}
return true;
}
is the code. let say thread goes till and then context switch happens and the thread doesn't get CPU in future for some time. Now size has been incremented but addition has not happend. the thread calling size() gets the CPU, it will get size which is more than what exactly is in the set. Thats why feel that priorityQueues is still not thread-safe and hence suggesting the change.
Please feel free to disagree.

Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion, I'll make both boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) and boolean remove(BlockInfo block, int priLevel) synchronized to ensure correctness.

However, I think you misunderstood the semantics of synchronized a method. Refer to java doc:

When one thread is executing a synchronized method for an object, all other threads that invoke synchronized methods for the same object block (suspend execution) until the first thread is done with the object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saxenapranav Thanks for your pretty review comment. I am not sure if get your points totally. IMO, this improvement is safe and self-contained, because synchronized is reentrant and exclusive. So I am confused if it could involve other consistency issues.
I would like to give my +1 if you were worried about perf-loss only for synchronized-synchronized, for this case I think it could be acceptable, anyway totally agree that both changes about performance we should given the benchmark comparison.
Thanks @zhangshuyan0 and @saxenapranav , Please feel free to correct me if something wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreeing with both @Hexiaoqiao @zhangshuyan0.
Thanks @zhangshuyan0 for taking suggestion.

return remove(block, priLevel, block.getReplication());
}

boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) {
synchronized boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) {
if(priLevel >= 0 && priLevel < LEVEL
&& priorityQueues.get(priLevel).remove(block)) {
NameNode.blockStateChangeLog.debug(
Expand Down
0