8000 HDFS-16939. Fix the thread safety bug in LowRedundancyBlocks. by zhangshuyan0 · Pull Request #5450 · apache/hadoop · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

HDFS-16939. Fix the thread safety bug in LowRedundancyBlocks. #5450

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 6, 2023

Conversation

zhangshuyan0
Copy link
Contributor

The remove method in LowRedundancyBlocks is not protected by synchronized. This method is private and is called by BlockManager. As a result, priorityQueues has the risk of being accessed concurrently by multiple threads.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ docker 1m 9s Docker failed to build run-specific yetus/hadoop:tp-3985}.
Subsystem Report/Notes
GITHUB PR #5450
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5450/1/console
versions git=2.17.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@slfan1989
Copy link
Contributor

LGTM.

@@ -369,7 +369,7 @@ synchronized boolean remove(BlockInfo block,
* @return true if the block was found and removed from one of the priority
* queues
*/
boolean remove(BlockInfo block, int priLevel) {
synchronized boolean remove(BlockInfo block, int priLevel) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) be made synchronized instead. Its other callers are synchronized methods:

Copy link
Contributor
@saxenapranav saxenapranav Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just ensure only one thread getting into the remove method. But what if there is one thread trying to remove, other trying to add or size(). I feel following should help:

private Object syncOnBlockInfoSet(int priority, BlockInfo blockInfo, String operation) {
    LightWeightLinkedSet<BlockInfo> blockInfos =priorityQueues.get(priority); 
    synchronized (blockInfos) {
      if("size".equalsIgnoreCase(operation)) {
        return blockInfos.size();
      }
      if("remove".equalsIgnoreCase(operation)) {
        return blockInfos.remove(blockInfo);
      }
      //implement other required methods.
      else {
        return null;
      }
    }
  }

all code-pieces where we do operation on an element in priorityQueues, we call syncOnBlockInfoSet to do that operation. for ex: size=syncOnBlockInfoSet(priority, null, "size")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) be made synchronized instead. Its other callers are synchronized methods:

Sorry I don't quite understand. Since the callers are already synchronized, why is it necessary to made boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) synchronized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just ensure only one thread getting into the remove method. But what if there is one thread trying to remove, other trying to add or size().

add() and size() are already synchronized in the current code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) be made synchronized instead. Its other callers are synchronized methods:

Sorry I don't quite understand. Since the callers are already synchronized, why is it necessary to made boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) synchronized?

Since boolean remove(BlockInfo block, int priLevel) forwards the call to boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas), I am suggesting that lets make boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) synchronized instead.
Callers of boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) are synchronized means that we can have boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) as synchronized without any perf-loss.

Copy link
Contributor
@saxenapranav saxenapranav Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just ensure only one thread getting into the remove method. But what if there is one thread trying to remove, other trying to add or size().

add() and size() are already synchronized in the current code.

these method being synchronized means that only one thread can enter into the synchronized method, but doesn't make the object its working on synchronized. There could be one thread calling size which leads to priorityQueues.get(i).size(), and other calling add which leads to priorityQueues.get(priLevel).add(blockInfo) simultaneously.
Example of probable issue is:
for adding element

protected boolean addElem(final T element) {
// validate element
if (element == null) {
throw new IllegalArgumentException("Null element is not supported.");
}
// find hashCode & index
final int hashCode = element.hashCode();
final int index = getIndex(hashCode);
// return false if already present
if (getContainedElem(index, element, hashCode) != null) {
return false;
}
modification++;
size++;
// update bucket linked list
DoubleLinkedElement<T> le = new DoubleLinkedElement<T>(element, hashCode);
le.next = entries[index];
entries[index] = le;
// insert to the end of the all-element linked list
le.after = null;
le.before = tail;
if (tail != null) {
tail.after = le;
}
tail = le;
if (head == null) {
head = le;
bookmark.next = head;
}
// Update bookmark, if necessary.
if (bookmark.next == null) {
bookmark.next = le;
}
return true;
}
is the code. let say thread goes till and then context switch happens and the thread doesn't get CPU in future for some time. Now size has been incremented but addition has not happend. the thread calling size() gets the CPU, it will get size which is more than what exactly is in the set. Thats why feel that priorityQueues is still not thread-safe and hence suggesting the change.
Please feel free to disagree.

Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion, I'll make both boolean remove(BlockInfo block, int priLevel, int oldExpectedReplicas) and boolean remove(BlockInfo block, int priLevel) synchronized to ensure correctness.

However, I think you misunderstood the semantics of synchronized a method. Refer to java doc:

When one thread is executing a synchronized method for an object, all other threads that invoke synchronized methods for the same object block (suspend execution) until the first thread is done with the object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saxenapranav Thanks for your pretty review comment. I am not sure if get your points totally. IMO, this improvement is safe and self-contained, because synchronized is reentrant and exclusive. So I am confused if it could involve other consistency issues.
I would like to give my +1 if you were worried about perf-loss only for synchronized-synchronized, for this case I think it could be acceptable, anyway totally agree that both changes about performance we should given the benchmark comparison.
Thanks @zhangshuyan0 and @saxenapranav , Please feel free to correct me if something wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreeing with both @Hexiaoqiao @zhangshuyan0.
Thanks @zhangshuyan0 for taking suggestion.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 18m 32s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 42m 32s trunk passed
+1 💚 compile 1m 35s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 1m 24s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 1m 7s trunk passed
+1 💚 mvnsite 1m 33s trunk passed
+1 💚 javadoc 1m 18s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 1m 31s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 3m 44s trunk passed
+1 💚 shadedclient 25m 49s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 20s the patch passed
+1 💚 compile 1m 25s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 1m 25s the patch passed
+1 💚 compile 1m 21s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 1m 21s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 53s the patch passed
+1 💚 mvnsite 1m 22s the patch passed
+1 💚 javadoc 0m 53s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 1m 28s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 3m 31s the patch passed
+1 💚 shadedclient 25m 38s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 233m 9s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 43s The patch does not generate ASF License warnings.
367m 45s
Reason Tests
Failed junit tests hadoop.hdfs.server.datanode.TestDirectoryScanner
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5450/2/artifact/out/Dockerfile
GITHUB PR #5450
Optio 8000 nal Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 618ebff1f792 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c1ab8ce
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5450/2/testReport/
Max. process+thread count 2488 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5450/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 56s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 33s trunk passed
+1 💚 compile 1m 37s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 1m 24s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 1m 4s trunk passed
+1 💚 mvnsite 1m 30s trunk passed
+1 💚 javadoc 1m 10s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 1m 37s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 3m 46s trunk passed
+1 💚 shadedclient 23m 29s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 23s the patch passed
+1 💚 compile 1m 28s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 1m 28s the patch passed
+1 💚 compile 1m 17s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 1m 17s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 50s the patch passed
+1 💚 mvnsite 1m 22s the patch passed
+1 💚 javadoc 0m 52s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 1m 27s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 3m 33s the patch passed
+1 💚 shadedclient 23m 23s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 206m 1s hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 47s The patch does not generate ASF License warnings.
318m 23s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5450/3/artifact/out/Dockerfile
GITHUB PR #5450
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 157688d5dfb2 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 6f00130
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5450/3/testReport/
Max. process+thread count 3945 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5450/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Copy link
Contributor
@Hexiaoqiao Hexiaoqiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 from my side.

@Hexiaoqiao Hexiaoqiao merged commit 2cb0c35 into apache:trunk Mar 6, 2023
@Hexiaoqiao
Copy link
Contributor

Committed to trunk. @zhangshuyan0 Thanks for your contributions. @saxenapranav And thanks for your reviews.

@Hexiaoqiao
Copy link
Contributor

@zhangshuyan0 Would you mind to check if we need to backport to other active branches.

zhangshuyan0 added a commit to zhangshuyan0/hadoop that referenced this pull request Mar 10, 2023
…#5450). Contributed by Shuyan Zhang.

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
Hexiaoqiao pushed a commit that referenced this pull request Mar 11, 2023
…5471). Contributed by Shuyan Zhang.

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
Hexiaoqiao pushed a commit that referenced this pull request Mar 11, 2023
…5471). Contributed by Shuyan Zhang.

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
(cherry picked from commit 8cc57f5)
@zhangshuyan0 zhangshuyan0 deleted the fixbug branch March 17, 2023 09:29
ferdelyi pushed a commit to ferdelyi/hadoop that referenced this pull request May 26, 2023
…#5450). Contributed by Shuyan Zhang.

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
ShimmerGlass pushed a commit to ShimmerGlass/hadoop that referenced this pull request Oct 30, 2023
…ks. (apache#5450 apache#5471).

Contributed by Shuyan Zhang.

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
(cherry picked from commit 8cc57f5)
ShimmerGlass added a commit to criteo-forks/hadoop that referenced this pull request Oct 30, 2023
…ks. (apache#5450 apache#5471). (#60)

Contributed by Shuyan Zhang.

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
(cherry picked from commit 8cc57f5)

Co-authored-by: zhangshuyan <81411509+zhangshuyan0@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0