Description
Alluxio Version:
v2.9.4
Describe the bug
First, I deployed Alluxio with Helm in a Kubernetes cluster that has 1 master node and 7 worker nodes.
Second, from inside an Alluxio worker pod, I ran "alluxio fs setReplication --max 3 --min 3 /test_ufs.txt", and it worked well the first time. However, when I then ran "alluxio fs setReplication --max 4 --min 4 /test_ufs.txt", it had no effect: the replication count stayed at 3.
Third, I found the following warnings in the Alluxio master logs:
2025-03-30 06:33:39,802 WARN Master Replication Check - Unexpected exception encountered when starting a REPLICATE job (uri=/test_ufs.txt, block ID=16777216, num replicas=5) : alluxio.exception.status.NotFoundException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:44:39,783 WARN Master Replication Check - Unexpected exception encountered when starting a REPLICATE job (uri=/test_ufs.txt, block ID=16777216, num replicas=5) : alluxio.exception.status.NotFoundException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
and the following warnings in the Alluxio job master logs:
2025-03-30 06:43:39,782 WARN grpc-default-executor-0 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:44:39,782 WARN grpc-default-executor-3 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:45:39,783 WARN grpc-default-executor-3 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:46:39,783 WARN grpc-default-executor-3 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:47:39,782 WARN grpc-default-executor-5 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
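The jobConfig value in these entries is a Java-serialized alluxio.job.plan.replicate.SetReplicaConfig printed with octal escapes, which makes it hard to read. The following is a minimal Python sketch for decoding it, not an Alluxio tool: the function names are hypothetical, and the parser assumes exactly the field layout visible in the log (a long mBlockId, an int mReplicas, then a String mPath) rather than implementing Java deserialization in general.

```python
import struct

# jobConfig value copied verbatim from the job master log above
# (protobuf text-format escaping: octal \ooo plus \b, \t, \r).
ESCAPED = r"\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"


def unescape(text: str) -> bytes:
    """Turn the escaped log text back into the raw serialized bytes."""
    return text.encode("ascii").decode("unicode_escape").encode("latin-1")


def decode_set_replica_config(raw: bytes) -> dict:
    """Extract block ID, replica count and path (hypothetical helper).

    Assumes the class descriptor ends with the mPath field type followed by
    TC_ENDBLOCKDATA ('x') and TC_NULL ('p'), as in the log entry above.
    """
    if raw[:2] != b"\xac\xed":
        raise ValueError("not a Java serialization stream")
    marker = b"Ljava/lang/String;xp"
    pos = raw.index(marker) + len(marker)
    # Primitive field values follow in descriptor order: long, then int (big-endian).
    block_id, replicas = struct.unpack_from(">qi", raw, pos)
    pos += 12
    if raw[pos:pos + 1] != b"t":  # TC_STRING tag expected for mPath
        raise ValueError("unexpected layout: mPath is not an inline string")
    (length,) = struct.unpack_from(">H", raw, pos + 1)
    path = raw[pos + 3:pos + 3 + length].decode("utf-8")
    return {"block_id": block_id, "replicas": replicas, "path": path}


if __name__ == "__main__":
    print(decode_set_replica_config(unescape(ESCAPED)))
```

For the entries above this yields block ID 16777216, 5 replicas and the path /test_ufs.txt, which matches the "num replicas=5" reported in the master log.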
Fourth, I entered an Alluxio worker pod and checked the Alluxio job list:
sh-4.2# alluxio job ls
1743316123474 Persist COMPLETED
1743316123475 Replicate COMPLETED
It indicated that all jobs were completed.
My question is: why does the job list say all jobs are completed while the logs still report a running SetReplica job? This problem prevents me from repeatedly adjusting the number of replicas for a file in Alluxio.
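To make the job list check repeatable while retrying setReplication, the output of "alluxio job ls" can be scanned for jobs that have not finished. This is only a sketch under assumptions: the helper name is hypothetical, the line format (ID, name, status separated by whitespace) is taken from the output shown above, and treating CREATED and RUNNING as the unfinished statuses is my assumption.

```python
# Statuses treated as "not finished yet" (assumption, see the note above).
ACTIVE_STATUSES = {"CREATED", "RUNNING"}


def unfinished_jobs(job_ls_output: str) -> list:
    """Return (job_id, name, status) for every job that is still active."""
    jobs = []
    for line in job_ls_output.splitlines():
        parts = line.split()
        # Skip anything that does not look like "<numeric id> <name> <status>".
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        job_id, name, status = parts
        if status in ACTIVE_STATUSES:
            jobs.append((int(job_id), name, status))
    return jobs


# Output copied from the "alluxio job ls" call shown above.
SAMPLE = """1743316123474 Persist COMPLETED
1743316123475 Replicate COMPLETED"""

if __name__ == "__main__":
    print(unfinished_jobs(SAMPLE))
```

With the output shown above the result is an empty list, so the job list itself reports no active job even though the job master keeps rejecting new SetReplica requests.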
To Reproduce
See the steps listed under "Describe the bug" above.
Expected behavior
The second setReplication command should take effect, and /test_ufs.txt should end up with 4 replicas.
Urgency
Describe the impact and urgency of the bug.
Are you planning to fix it
Yes.
Additional context
Properties in values.yaml:
properties:
  alluxio.security.stale.channel.purge.interval: 365d
  alluxio.conf.dynamic.update.enabled: true
  alluxio.user.file.metadata.sync.interval: 0
  alluxio.master.mount.table.root.ufs: "hdfs://<hadoop-ip>:9001/alluxio/ufs"
  alluxio.underfs.address: "hdfs://<hadoop-ip>:9001/alluxio/ufs"
  alluxio.underfs.hdfs.configuration: "/secrets/hdfsConfig/core-site.xml:/secrets/hdfsConfig/hdfs-site.xml"
  alluxio.master.journal.ufs.option.alluxio.underfs.hdfs.configuration: "/secrets/hdfsConfig/core-site.xml:/secrets/hdfsConfig/hdfs-site.xml"
  alluxio.master.journal.ufs.folder: "hdfs://<hadoop-ip>:9001/alluxio/journal"
  alluxio.security.authentication.type: "NOSASL"
  alluxio.security.authorization.permission.enabled: false
  alluxio.debug: true
  alluxio.proxy.s3.v2.version.enabled: false
  alluxio.proxy.s3.v2.async.processing.enabled: false
  alluxio.underfs.hdfs.user: "root"
  alluxio.user.metadata.cache.enabled: true
  alluxio.security.login.username: "root"