Allow multiple private keys and extract all files from streams.
Place files in the folder with `.enc` removed.
Do basic checks so streams cannot traverse outside of the folder.
This commit adds the `MINIO_KMS_REPLICATE_KEYID` env. variable.
By default - if not specified or not set to `off` - MinIO will
replicate the KMS key ID of an object.
If `MINIO_KMS_REPLICATE_KEYID=off`, MinIO does not include the
object's KMS Key ID when replicating an object. However, it always
sets the SSE-KMS encryption header. This ensures that the object
gets encrypted using SSE-KMS. The target site chooses the KMS key
ID that gets used based on the site and bucket config.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
This commit allows clients to provide a set of intermediate CA
certificates (up to `MaxIntermediateCAs`) that the server will
use as intermediate CAs when verifying the trust chain from the
client leaf certificate up to one trusted root CA.
This is required if the client leaf certificate is not issued by
a trusted CA directly but by an intermediate CA. Without this commit,
MinIO rejects such certificates.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
If object is uploaded with tags, the internal tagging-timestamp tracked
for replication will be missing. Default to ModTime in such cases to
allow tags to be synced correctly.
Also fixing a regression in fetching tags and tag comparison
* fix: proxy requests to honor global transport
Load the globalProxyEndpoint properly
also, currently, the proxy requests will fail silently for batch cancel
even if the proxy fails; instead,d properly send the corresponding error back
for such proxy failures if opted
* pass the transport to the GetProxyEnpoints function
---------
Co-authored-by: Praveen raj Mani <praveen@minio.io>
Reject new lock requests immediately when 1000 goroutines are queued
for the local lock mutex.
We do not reject unlocking, refreshing, or maintenance; they add to the count.
The limit is set to allow for bursty behavior but prevent requests from
overloading the server completely.
Currently, DeleteObjects() tries to find the object's pool before
sending a delete request. This only works well when an object has
multiple versions in different pools since looking for the pool does
not consider the version-id. When an S3 client wants to
remove a version-id that exists in pool 2, the delete request will be
directed to pool one because it has another version of the same object.
This commit will remove looking for pool logic and will send a delete
request to all pools in parallel. This should not cause any performance
regression in most of the cases since the object will unlikely exist
in only one pool, and the performance price will be similar to
getPoolIndex() in that case.
Earlier, cluster and bucket metrics were named
`minio_usage_last_activity_nano_seconds`.
The bucket level is now named as
`minio_bucket_usage_last_activity_nano_seconds`
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
When compression is enabled, the final object size is not calculated in
that case, we need to make sure that the provided buffer is always
more significant than the shard size, the bitrot will always calculate
the hash of blocks with shard size, except the last block.
Before https://github.com/minio/minio/pull/20575, files could pick up indices
from unrelated files if no index was added.
This would result in these files not being consistent across a set.
When loading, search for the compression indicators and check if they
are within the problematic date range, and clean up any parts that have
an index but shouldn't.
The test validates that the signature matches the one in files stored without an index.
Bumps xlMetaVersion, so this check doesn't have to be made for future versions.
It is possible delete marker was received on old pool as decom
move in progress, this PR allows decom retry to ensure these
delete markers are moved to new pool so that decommission can
be completed.
Fixes#20819
Add profiling potential crash wourkaround
Using admin traces could potentially crash the server (or handler more likely) due to upstream divide by 0: https://github.com/felixge/fgprof/pull/34
Ensure the profile always runs 100ms before stopping, so sample count isn't 0 (default sample rate ~10ms/sample, but allow for cpu starvation)
If one object has many parts where all parts are readable but some parts
are missing from some drives, this object can be sometimes un-healable,
which is wrong.
This commit will avoid reading from drives that have missing, corrupted or
outdated xl.meta. It will also check if any part is unreadable to avoid
healing in that case.
we do not need to hold the read locks at the higher
layer instead before reading the body, instead hold
the read locks properly at the time of renamePart()
for protection from racy part overwrites to compete
with concurrent completeMultipart().
CheckParts call can take time to verify 10k parts of a single object in a single drive.
To avoid an internal dealine of one minute in the single handler RPC, this commit will
switch to streaming RPC instead.
This API had missing permissions checking, allowing a user to change
their policy mapping by:
1. Craft iam-info.zip file: Update own user permission in
user_mappings.json
2. Upload it via `mc admin cluster iam import nobody iam-info.zip`
Here `nobody` can be a user with pretty much any kind of permission (but
not anonymous) and this ends up working.
Some more detailed steps - start from a fresh setup:
```
./minio server /tmp/d{1...4} &
mc alias set myminio http://localhost:9000 minioadmin minioadmin
mc admin user add myminio nobody nobody123
mc admin policy attach myminio readwrite nobody nobody123
mc alias set nobody http://localhost:9000 nobody nobody123
mc admin cluster iam export myminio
mkdir /tmp/x && mv myminio-iam-info.zip /tmp/x
cd /tmp/x
unzip myminio-iam-info.zip
echo '{"nobody":{"version":1,"policy":"consoleAdmin","updatedAt":"2024-08-13T19:47:10.1Z"}}' > \
iam-assets/user_mappings.json
zip -r myminio-iam-info-updated.zip iam-assets/
mc admin cluster iam import nobody ./myminio-iam-info-updated.zip
mc admin service restart nobody
```
`mc admin heal ALIAS/bucket/object` does not have any flag to heal
object noncurrent versions, this commit will make healing of the object
noncurrent versions implicitly asked.
This also fixes the 'mc admin heal ALIAS/bucket/object' that does not work
correctly when the bucket is versioned. This has been broken since Apr 2023.
Golang http.Server will call SetReadDeadline overwriting the previous
deadline configuration set after a new connection Accept in the custom
listener code. Therefore, --idle-timeout was not correctly respected.
Make http.Server read/write timeout similar to --idle-timeout.
The code assigns corrupted state to a drive for any unexpected error,
which is confusing for users. This change will make sure to assign
corrupted state only for corrupted parts or xl.meta. Use unknown state
with a explanation for any unexpected error, like canceled, deadline
errors, drive timeout, ...
Also make sure to return the bucket/object name when the object is not
found or marked not found by the heal dangling code.
* Add the policy name to the audit log tags when doing policy-based API calls
* Audit log the retention settings requested in the API call
* Audit log of retention on PutObjectRetention API path too
The RemoveUser API only removes internal users, and it reports success
when it didnt find the internal user account for deletion. When provided
with a service account, it should not report success as that is misleading.
Update github.com/cosnicolaou/pbzip2 to latest version for
significant performance improvements. This update brings a 45%
reduction in processing time.
Previously, not setting http.Config.HTTPTimeout for logger webhook
was resulting in a timeout of 0, and causing "deadline exceeded"
errors in log webhook.
This change introduces a new env variable for configuring log webhook
timeout and more importantly sets it when the config is initialised.
Manual heal can return XMinioHealInvalidClientToken if the manual
healing is started in the first node, and the next mc call to get the
heal status is landed on another node. The reason is that redirection
based on the token ID is not able to redirect requests to the first node
due to a typo.
This also affects the batch cancel command if the batch is being done in
the first node, the user will never be able to cancel it due to the same
bug.
HTTP likes to slap an infinite read deadline on a connection and
do a blocking read while the response is being written.
This effectively means that a reading deadline becomes the
request-response deadline.
Instead of enforcing our timeout, we pass it through and keep
"infinite deadline" is sticky on connections.
However, we still "record" when reads are aborted, so we never overwrite that.
The HTTP server should have `ReadTimeout` and `IdleTimeout` set for the deadline to be effective.
Use --idle-timeout for incoming connections.
Since DeadlineConn would send deadline updates directly upstream,
it would race with Read/Write operations. The stdlib will perform a read,
but do an async SetReadDeadLine(unix(1)) to cancel the Read in
`abortPendingRead`. In this case, the Read may override the
deadline intended to cancel the read.
Stop updating deadlines if a deadline in the past is seen and when Close is called.
A mutex now protects all upstream deadline calls to avoid races.
This should fix the short-term buildup of...
```
365 @ 0x44112e 0x4756b9 0x475699 0x483525 0x732286 0x737407 0x73816b 0x479601
# 0x475698 sync.runtime_notifyListWait+0x138 runtime/sema.go:569
# 0x483524 sync.(*Cond).Wait+0x84 sync/cond.go:70
# 0x732285 net/http.(*connReader).abortPendingRead+0xa5 net/http/server.go:729
# 0x737406 net/http.(*response).finishRequest+0x86 net/http/server.go:1676
# 0x73816a net/http.(*conn).serve+0x62a net/http/server.go:2050
```
AFAICT Only affects internode calls that create a connection (non-grid).
These are needed checks for the functions to be un-crashable with any input
given to `msgUnPath` (tested with fuzzing).
Both conditions would result in a crash, which prevents that. Some
additional upstream checks are needed.
Fixes#20610
This PR fixes a regression introduced in https://github.com/minio/minio/pull/19797
by restoring the healing ability of transitioned objects
Bonus: support for transitioned objects to carry original
The object name is for future reverse lookups if necessary.
Also fix parity calculation for tiered objects to n/2 for n/2 == (parity)
Existing implementation runs IAM purge routines for expired LDAP and
OIDC accounts with a probability of 0.25 after every IAM refresh. This
change ensures that they are run once in each hour.
xlStorage.Healing() returns nil if there is an error reading
.healing.bin or if this latter is empty. healing.bin update()
call returns early if .healing.bin is empty; hence, no further update
of .healing.bin is possible.
A .healing.bin can be empty if os.Open() with O_TRUNC is successful
but the next Write returns an error.
To avoid this weird situation, avoid making healingTracker.update()
to return early if .healing.bin is empty, so write again.
This commit also fixes wrong error log printing when an object is
healed in another drive in the same erasure set but not in the drive
that is actively healing by fresh drive healing code. Currently, it prints
<nil> instead of a factual error.
* heal: Scan .minio.sys metadata only during site-wide heal (#137)
mc admin heal always invoke .minio.sys heal, but sometimes, this latter
contains a lot of data, many service accounts, STS accounts etc, which
makes mc admin heal command very slow.
Only invoke .minio.sys healing when no bucket was specified in `mc admin
heal` command.
Healing a large object with a normal scan mode where no parts read
is involved can still fail after 30 seconds if an object has
There are too many parts when hard disks are being used mainly.
The reason is there is a general deadline that checks for all parts we
do a deadline per part.
When listing objects with metadata, avoid returning an "expires" time
metadata value when its value is the zero time as this means that no
expires value is set on the object.
The condition were incorrect as we were comparing the filter
value against the modification time object.
For example if created after filter date is after modification
time of object, that means object was created before the filter
time and should be skipped while replication because as per the
filter we need only the objects created after the filter date.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
When Keycloak vendor is set, the code will start to clean up service
accounts that parents do not exist anymore. However, the code will also
look for the parent user of site-replicator-0, MINIO_ROOT_USER, which
obviously does not exist in Keycloak. Therefore, the site-replicator-0
will be removed automatically.
This commit will avoid cleaning up service accounts generated from
the root user.
The verify file handler response format was changed from gob to msgp
since two months but we forgot updating the verify handler client.
VerifyFile is only called during a heal deep scan (bitrot check).
HealObject() will fail in that case and will mark all disks corrupted and
will return early (as unrecoverable object but it will also not be
removed)
It is a bit rare for HealObject to be called with a deep scan flag. It
is called when a HealObject with a normal scan (e.g. new drive healing)
detects a bitrot corruption, therefore healing objects with a detected
bitrot corruption will fail.
in cases where we cannot possibly know a way to read and
construct the object, it is impossible to achieve any form of
quorum via xl.meta while we have sufficient responses from
all the drives, we should return object not found.
- PutObjectMetadata()
- PutObjectTags()
- DeleteObjectTags()
- TransitionObject()
- RestoreTransitionObject()
Also improve the behavior of multipart code across
pool locks, hold locks only once per upload ID for
- CompleteMultipartUpload()
- AbortMultipartUpload()
- ListObjectParts() (read-lock)
- GetMultipartInfo() (read-lock)
- PutObjectPart() (read-lock)
This avoids lock attempts across pools for no
reason, this increases O(n) when there are n-pools.
multi-object deletion may or may not compete with locks
granted for other callers, causing concurrent operations
to succeed on each other.
A continuation of the PR https://github.com/minio/minio/pull/20356
When custom authorization via plugin is enabled, the console will now
render the UI as if all actions are allowed. Since server cannot
determine the exact policy allowed for a user via the plugin, this is
acceptable to do. If a particular action is actually not allowed by the
plugin the call will result in an error.
Previously the server was evaluating a policy when custom authZ is
enabled - this is fixed now.
The "unlimited" value on PPC wasn't exactly the same as amd64.
Instead compare against an "unreasonably big value".
Would cause OOM in anything using the concurrent request limit.
Add TTFB for all requests in metrics-v3 in addition to the existing
GetObject. Also for the requests that do not return a body in the
response, calculate TTFB as the HTTP status code and the headers are
sent.
A batch job will fail if the retry attempt is not provided. The reason
is that the code mistakenly gets the retry attempts from the job status
rather than the job yaml file.
This will also set a default empty prefix for batch expiration.
Also this will avoid trimming the prefix since the yaml decoder already
does that if no quotes were provided, and we should not trim if quotes
were provided and the user provided a leading or a trailing space.
Tests if imported service accounts have
required access to buckets and objects.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
- PutObject() for multi-pooled was holding large
region locks, which was not necessary. This affects
almost all slowpoke clients and lengthy uploads.
- Re-arrange locks for CompleteMultipart, PutObject
to be close to rename()
Currently, it is not possible to remove a tier if it is not accessible
or contains some data, add a force flag to make the removal successful
in that case.
When the encryption and compression are both enabled, the
the server will avoid compressing the data for no apparent reason
This commit will enable it and update unit tests.
Download static cURL into release Docker image for all supported architectures.
Currently, the static cURL executable is only downloaded for the `amd64` architecture. However, `arm64` and `ppc64le` variants are also [available](https://github.com/moparisthebest/static-curl/releases/tag/v8.7.1).
postUpload() incorrectly saves actual size as '-1'
we should save correct size when its possible.
Bonus: fix the PutObjectPart() write locker, instead
of holding a lock before we read the client stream.
We should hold it only when we need to commit the parts.
this cache will be honored only when `prefix=""` while
performing ListMultipartUploads() operation.
This is mainly to satisfy applications like alluxio
for their underfs implementation and tests.
replaces https://github.com/minio/minio/pull/20181
AFAICT we send a canceled context to unlock (and thereby releaseAll). This will cause network calls to fail.
Instead use background and add 30s timeout.
Current implementation retries forever until our
log buffer is full, and we start dropping events.
This PR allows you to set a value until we give
up on existing audit/logger batches to proceed to
process the new ones.
Bonus:
- do not blow up buffers beyond batchSize value
- do not leak the ticker if the worker returns
The items will be saved per target batch and will
be committed to the queue store when the batch is full
Also, periodically commit the batched items to the queue store
based on configured commit_timeout; default is 30s;
Bonus: compress queue store multi writes
rebalance metadata is good to have only,
if it cannot be loaded when starting MinIO
for some reason we can possibly ignore it
and move on and let user start rebalance
again if needed.
readParts requires that both part.N and part.N.meta files be present.
This change addresses an issue with how an error to return to the upper
layers was picked from most drives where a UploadPart operation
had failed.
By default, even if MINIO_BROWSER=off set code tries to get free
port available for the console.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
S3 spec does not accept an ILM XML document containing both <Filter>
and <Prefix> XML tags, even if both are empty. That is why we added
a 'set' field in some lifecycle structures to decide when and when not to
show a tag. However, we forgot to disallow marshaling of Filter when
'set' is set to false.
This will fix ILM document replication in a site replication
configuration in some cases.
When the prefix field is not provided in the remote source of a yaml
replication job, the code fails to do listing and makes replication
successful. This commit fixes it.
This change adds a consistent nonce to ensure
that multipart uploads are deterministic on a
per-part basis.
Thanks to @klauspost for the work here minio/sio@3cd3734
locks handed by different pools would become non-compete for
multi-object delete request, this is wrong for obvious
reasons.
New locking implementation and revamp will rewrite multi-object
lock anyway, this is a workaround for now.
Currently, the bucket events and replication targets are only reloaded
with buckets that failed to load during the first cluster startup,
which is wrong because if one bucket change was done in one node but
that node was not able to notify other nodes; the other nodes will
reload the bucket metadata config but fails to set the events and bucket
targets in the memory.
when a hung drive is hot-unplugged, the server might go
into a loop where the previous `format.json` is somehow
still accessible to the process, we try to re-init() drives,
but that seems to cause a previous goroutine to hang around
since it is not canceled away when the drive is closed.
Bonus: add deadline for immediate purge routine, to unblock
it if the drive is blocking mutations.
if a user policy is found, avoid reading from the drives
for missing group mappings, group mappings are not mandatory
and conditional.
This PR restores the older behavior while making sure that
if a direct user policy is not found, we would still attempt
to load from the group from the drives.
This commit simplifies and optimizes the decryption of large (multipart)
objects. This PR does two things:
- Re-write the init logic for the decryption reader
- Reduce the number of OEK decryptions
Before, the init logic copied some SSE HTTP request headers to
parse them later. This is simplified to parsing them right away. This
removes some fields from the decryption reader struct.
Further, the decryption reader decrypted the OEK using the client-provided
key (SSE-C) or the KMS (SSE-S3 / SSE-KMS) for each part. This is redundant
since the OEK is the same for all parts. In particular, a KMS call might be a
network request. Now, the OEK is decrypted once for the entire multipart object.
This should improve latency when reading encrypted multipart objects
and reduce requests to the KMS.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
Use Walk(), which is a recursive listing with versioning, to check if
the bucket has some objects before being removed. This is beneficial
because the bucket can contain multiple dangling objects in multiple
drives.
Also, this will prevent a bug where a bucket is deleted in a deployment
that has many erasure sets but the bucket contains one or few objects
not spread to enough erasure sets.
Currently, retry healing of a new drive healing does not reset
HealedBuckets means that the next healing retry will skip those
buckets. The commit will fix this behavior.
Also, the skipped objects counter will include objects uploaded
that are uploaded after the healing is started.
sftp sends local requests to the S3 port while passing the session token
header when the account corresponds to a service account. However, this
is not permitted and will throw an error: "The security token included in the
request is invalid"
This commit will avoid passing the session token to the upper layer that
initializes MinIO client to avoid this error.
Sometimes, we need historical information in .healing.bin, such as the
number of expired objects that the healing avoids to heal and that can
create drive usage disparency in the same erasure set. For that reason,
this commit will not remove .healing.bin anymore and it will have a new
field called Finished so we know healing is finished in that drive.
Services are unfrozen before `initBackgroundReplication` is finished. This means that
the globalReplicationStats write is racy. Switch to an atomic pointer.
Provide the `ReplicationPool` with the stats, so it doesn't have to be grabbed
from the atomic pointer on every use.
All other loads and checks are nil, and calls return empty values when stats
still haven't been initialized.
* Allow a maximum of 10 seconds to start profiling operations.
* Download up to 16 profiles concurrently, but only allow 10 seconds for
each (does not include write time).
* Add cluster info as the first operation.
* Ignore remote download errors.
* Stop remote profiles if the request is terminated.
If the site replication is enabled and the code tries to extract jwt
claims while the site replication service account credentials are still
not loaded yet, the code will enter an infinite loop, causing in a
high CPU usage.
Another possibility of the infinite loop is having some service accounts
created by an old deployment version where the service account JWT was
signed by the root credentials, but not anymore.
This commit will remove the possibility of the infinite loop in the code
and add root credential fallback to extract claims from old service
accounts.
move away from map[string]interface{} to map[string]string
to simplify the audit, and also provide concise information.
avoids large allocations under load(), reduces the amount
of audit information generated, as the current implementation
was a bit free-form. instead all datastructures must be
flattened.
Previously, we checked if we had a quorum on the DataDir value.
We are removing this check, which allows reading objects with different
DataDir values in a few drives (due to a rebalance-stop race bug)
provided their eTags or ModTimes match.
Since a lot of operations load from storage, do remote calls, add a 10 second timeout to each operation.
This should make `mc admin info` return values even under extreme conditions.
- optimize writing part.N.meta by writing both part.N
and its meta in sequence without network component.
- remove part.N.meta, part.N which were partially success
ful, in quorum loss situations during renamePart()
- allow for strict read quorum check arbitrated via ETag
for the given part number, this makes it double safer
upon final commit.
- return an appropriate error when read quorum is missing,
instead of returning InvalidPart{}, which is non-retryable
error. This kind of situation can happen when many
nodes are going offline in rotation, an example of such
a restart() behavior is statefulset updates in k8s.
fixes#20091
during rebalance stop, it can possibly happen that
Put() would race by overwriting the same object again.
This may very well if done "successfully" it can
potentially proceed to delete the object from the pool,
causing data loss.
This PR enhances #20233 to handle more scenarios such
as these.
Rebalance-stop can race with ongoing rebalance operations. This change
prevents these operations from overwriting objects by checking the source
and destination pool indices are different.
This commit replaces the LDAP client TLS config and
adds a custom list of TLS cipher suites which support
RSA key exchange (RSA kex).
Some LDAP server connections experience a significant slowdown
when these cipher suites are not available. The Go TLS stack
disables them by default. (Can be enabled via GODEBUG=tlsrsakex=1).
fixes https://github.com/minio/minio/issues/20214
With a custom list of TLS ciphers, Go can pick the TLS RSA key-exchange
cipher. Ref:
```
if c.CipherSuites != nil {
return c.CipherSuites
}
if tlsrsakex.Value() == "1" {
return defaultCipherSuitesWithRSAKex
}
```
Ref: https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/crypto/tls/common.go;l=1017
Signed-off-by: Andreas Auernhammer <github@aead.dev>
this allows for de-duplicating the callers when called
concurrently, allowing for bucketmetadata reads to be
single call. All concurrent callers will get the same data
as the first one.
Fix a regression in #19733 where TTFB metrics for all APIs except
GetObject were removed in v2 and v3 metrics. This causes breakage for
existing v2 metrics users. Instead we continue to send TTFB for all APIs
in V2 but only send for GetObject in V3.
allow multipart uploads expiration to be dyamic
It would seem like the new values will take effect
only after a restart for changes in multipart_expiration.
This PR fixes this by making it dynamic as it should have
been.
deadlines per moveToTrash() allows for a more granular timeout
approach for syscalls, instead of an aggregate timeout.
This PR also enhances multipart state cleanup to be optimal by
removing 100's of multipart network rename() calls into single
network call.
context deadline was introduced to avoid a slow transfer from blocking
replication queue(s) shared by other buckets that may not be under throttling.
This PR removes this context deadline for larger objects since they are
anyway restricted to a limited set of workers. Otherwise, objects would
get dequeued when the throttle limit is exceeded and cannot proceed
within the deadline.
epoll contention on TCP causes latency build-up when
we have high volume ingress. This PR is an attempt to
relieve this pressure.
upstream issue https://github.com/golang/go/issues/65064
It seems to be a deeper problem; haven't yet tried the fix
provide in this issue, but however this change without
changing the compiler helps.
Of course, this is a workaround for now, hoping for a
more comprehensive fix from Go runtime.
- ReadVersion
- ReadFile
- ReadXL
Further changes include to
- Compact internode resource RPC paths
- Compact internode query params
To optimize on parsing by gorilla/mux as the
length of this string increases latency in
gorilla/mux - reduce to a meaningful string.
the main reason is to let Go net/http perform necessary
book keeping properly, and in essential from consistency
point of view its GETs all the way.
Deprecate sendFile() as its buggy inside Go runtime.
For a non-tiered object, MinIO requires that EcM (# of data blocks) of
xl.meta agree, corresponding to the number of data blocks needed to
read this object.
OTOH, tiered objects have metadata in the hot tier and data in the
warm tier. The data and its integrity are offloaded to the warm tier. This
allows us to reduce the read quorum from EcM (typically > N/2, where N -
erasure stripe width) to N/2 + 1. The simple majority of metadata
ensures consensus on what the object is and where it is
located.
When a drive is in a failed state when a single node multiple drives
deployment is started, a replacement of a fresh disk will not be
properly healed unless the user restarts the node.
Fix this by always adding the new fresh disk to globalLocalDrivesMap. Also
remove globalLocalDrives for simplification, a map to store local node
drives can still be used since the order of local drives of a node is
not defined.
kms: Expose API available when bucket federation is enabled
When bucket federation feature is enabled, KMS API will not work, such
as `mc admin kms key list`
The commit will fix the issue by disabling bucket forwarding when this
is a KMS request.
Tracing syscalls, opening and reading an `xl.meta` looks like this:
```
openat(AT_FDCWD, "/mnt/drive1/ss8-old/testbucket/ObjSize4MiBThreads72/(554O51H/peTb(0iztdbTKw59.csv/xl.meta", O_RDONLY|O_NOATIME|O_CLOEXEC) = 34 <0.000>
fcntl(34, F_GETFL) = 0x48000 (flags O_RDONLY|O_LARGEFILE|O_NOATIME) <0.000>
fcntl(34, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_NOATIME) = 0 <0.000>
epoll_ctl(4, EPOLL_CTL_ADD, 34, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3172471557, u64=8145488475984499461}}) = -1 EPERM (Operation not permitted) <0.000>
fcntl(34, F_GETFL) = 0x48800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_NOATIME) <0.000>
fcntl(34, F_SETFL, O_RDONLY|O_LARGEFILE|O_NOATIME) = 0 <0.000>
fstat(34, {st_mode=S_IFREG|0644, st_size=354, ...}) = 0 <0.000>
read(34, "XL2 \1\0\3\0\306\0\0\1P\2\2\1\304$\225\304\20\0\0\0\0\0\0\0\0\0\0\0"..., 354) = 354 <0.000>
close(34) = 0 <0.000>
```
Everything until `fstat` is the `os.Open` call.
Looking at the code: https://github.com/golang/go/blob/master/src/os/file_unix.go#L212-L243
It seems for every file it "tries" to see if it is pollable. This causes `syscall.SetNonblock(fd, true)` to be called. This is the first `F_SETFL`.
It then calls `f.pfd.Init("file", true)`. This will attempt to set it as pollable using `epoll_ctl`. This will always fail for files. It therefore calls `syscall.SetNonblock(fd, false)` resulting in the second `F_SETFL`.
If we set the `O_NONBLOCK` call on the initial open, we should avoid the 4 `fcntl` syscalls per file.
I don't see any way to avoid the `epoll_ctl` call, since kind is either `kindOpenFile` or `kindNonBlock`, so "pollable" will always be true. However avoiding 4 of 6 syscalls still seems worth it.
This should not have any effect, since files will end up with "nonblock" anyway.
allow non-inlined on disk to be inlined via
an unversioned ReadVersion() call, we only
need ReadXL() to resolve objects with multiple
versions only.
The choice of this block makes it to be dynamic
and chosen by the user via `mc admin config set`
Other bonus things
- Start measuring internode TTFB performance.
- Set TCP_NODELAY, TCP_CORK for low latency
Use `runtime.Gosched()` if we have less than maxMergeMessages and the
queue is empty. Up maxMergeMessages to 50 to merge more messages into
a single write.
Add length check for an early bailout on readAllInto when we know packet length.
This commit enforces FIPS-compliant TLS ciphers in FIPS mode
by importing the `fipsonly` module.
Otherwise, MinIO still accepts non-FIPS compliant TLS connections.
removes contentious usage of mutexes in LRU, which
were never really reused in any manner; we do not
need it.
To trust hosts, the correct way is TLS certs; this PR completely
removes this dependency, which has never been useful.
```
0 0% 100% 25.83s 26.76% github.com/hashicorp/golang-lru/v2/expirable.(*LRU[...])
0 0% 100% 28.03s 29.04% github.com/hashicorp/golang-lru/v2/expirable.(*LRU[...])
```
Bonus: use `x-minio-time` as a nanosecond to avoid unnecessary
parsing logic of time strings instead of using a more
straightforward mechanism.
- Also, fix failure reporting at the end.
- Also, avoid parsing report objects when listing or resuming jobs, this
does not cause any bugs, it is only printing, not useful errors.
Split the read and write sides of handleMessages into two separate functions
Cosmetic. The only non-copy-and-paste change is that `cancel(ErrDisconnected)` is moved
into the defer on `readStream`.
add unit test for v3 metrics for all its exposed endpoints
Bonus:
- support OpenMetrics encoding
- adds boot time for prometheus
- continueOnError is better to serve as
much metrics as possible.
```
curl http://localhost:9000/minio/metrics/v3/cluster/usage/buckets
```
Did not work as documented, due to the fact that there was a typo
in the bucket usage metrics registration group. This endpoint is
a cluster endpoint and does not require any `buckets` argument.
When creating the async listing, if the first request does not return within 3
minutes, it is stopped, since it isn't being kept alive.
Keep updating `lastHandout` while we are waiting for the initial request to be fulfilled.
When a fresh drive healing is finished, add more checks for the drive listing
errors. If any, re-list and heal again. Although this is an infrequent use
case to have listPathRaw() returning nil when minDisks is set to 1, we
still need to handle all possible use cases to avoid missing healing
any object.
Also, check for HealObject result to decide of an object is healed in the
fresh disk since HealObject returns nil if an object is healed in any
disk, and not in the new fresh drive.
In regular listing, this commit will avoid showing an object when its
latest version has a pending or failed deletion. In replicated setup.
It will also prevent showing older versions in the same case.
Also, sort the error map for multiple sites in ascending order
of deployment IDs, so that the error message generated is always
definitive order and same.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Add `ConnDialer` to abstract connection creation.
- `IncomingConn(ctx context.Context, conn net.Conn)` is provided as an entry point for
incoming custom connections.
- `ConnectWS` is provided to create web socket connections.
If `SkipReader` is called with a small initial buffer it may be doing a huge number if Reads to skip the requested number of bytes. If a small buffer is provided grab a 32K buffer and use that.
Fixes slow execution of `testAPIGetObjectWithMPHandler`.
Bonuses:
* Use `-short` with `-race` test.
* Do all suite test types with `-short`.
* Enable compressed+encrypted in `testAPIGetObjectWithMPHandler`.
* Disable big file tests in `testAPIGetObjectWithMPHandler` when using `-short`.
The code was advertenly passing max openfds to debug.SetMemoryLimit(),
fixing this accelerate go test in my machine.
This is only a testing bug, since the server context has always a valid
MaxMem, so the buggy code was never called in users environments.
Currently the status of a completed or failed batch is held in the
memory, a simple restart will lose the status and the user will not
have any visibility of the job that was long running.
In addition to the metrics, add a new API that reads the batch status
from the drives. A batch job will be cleaned up three days after
completion.
Also add the batch type in the batch id, the reason is that the batch
job request is removed immediately when the job is finished, then we
do not know the type of batch job anymore, hence a difficulty to locate
the job report
Hot load a policy document when during account authorization evaluation
to avoid returning 403 during server startup, when not all policies are
already loaded.
Add this support for group policies as well.
you can attempt a rebalance first i.e, start with 2 pools.
```
mc admin rebalance start alias/
```
and after that you can add a new pool, this would
potentially crash.
```
Jun 27 09:22:19 xxx minio[7828]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 27 09:22:19 xxx minio[7828]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x22cc225]
Jun 27 09:22:19 xxx minio[7828]: goroutine 1 [running]:
Jun 27 09:22:19 xxx minio[7828]: github.com/minio/minio/cmd.(*erasureServerPools).findIndex(...)
```
without atomic load() it is possible that for
a slow receiver we would get into a hot-loop, when
logCh is full and there are many incoming callers.
to avoid this as a workaround enable BATCH_SIZE
greater than 100 to ensure that your slow receiver
receives data in bulk to avoid being throttled in
some manner.
this PR however fixes the unprotected access to
the current workers value.
specifying customer HTTP client makes the gcs SDK
ignore the passed credentials, instead let the GCS
SDK manage the transport.
this PR fixes#19922 a regression from #19565
Currently, bucket metadata is being loaded serially inside ListBuckets
Objet API. Fix that by loading the bucket metadata as the number of
erasure sets * 10, which is a good approximation.
When a reconnection happens, `handleMessages` must be able to complete and exit.
This can be prevented in a full queue.
Deadlock chain (May 10th release)
```
1 @ 0x44110e 0x453125 0x109f88c 0x109f7d5 0x10a472c 0x10a3f72 0x10a34ed 0x4795e1
# 0x109f88b github.com/minio/minio/internal/grid.(*Connection).send+0x3eb github.com/minio/minio/internal/grid/connection.go:548
# 0x109f7d4 github.com/minio/minio/internal/grid.(*Connection).queueMsg+0x334 github.com/minio/minio/internal/grid/connection.go:586
# 0x10a472b github.com/minio/minio/internal/grid.(*Connection).handleAckMux+0xab github.com/minio/minio/internal/grid/connection.go:1284
# 0x10a3f71 github.com/minio/minio/internal/grid.(*Connection).handleMsg+0x231 github.com/minio/minio/internal/grid/connection.go:1211
# 0x10a34ec github.com/minio/minio/internal/grid.(*Connection).handleMessages.func1+0x6cc github.com/minio/minio/internal/grid/connection.go:1019
---> blocks ---> via (Connection).handleMsgWg
1 @ 0x44110e 0x454165 0x454134 0x475325 0x486b08 0x10a161a 0x10a1465 0x2470e67 0x7395a9 0x20e61af 0x20e5f1f 0x7395a9 0x22f781c 0x7395a9 0x22f89a5 0x7395a9 0x22f6e82 0x7395a9 0x22f49a2 0x7395a9 0x2206e45 0x7395a9 0x22f4d9c 0x7395a9 0x210ba06 0x7395a9 0x23089c2 0x7395a9 0x22f86e9 0x7395a9 0xd42582 0x2106c04
# 0x475324 sync.runtime_Semacquire+0x24 runtime/sema.go:62
# 0x486b07 sync.(*WaitGroup).Wait+0x47 sync/waitgroup.go:116
# 0x10a1619 github.com/minio/minio/internal/grid.(*Connection).reconnected+0xb9 github.com/minio/minio/internal/grid/connection.go:857
# 0x10a1464 github.com/minio/minio/internal/grid.(*Connection).handleIncoming+0x384 github.com/minio/minio/internal/grid/connection.go:825
```
Add a queue cleaner in reconnected that will pop old messages so `handleMessages` can
send messages without blocking and exit appropriately for the connection to be re-established.
Messages are likely dropped by the remote, but we may have some that can succeed,
so we only drop when running out of space.
Add the inlined data as base64 encoded field and try to add a string version if feasible.
Example:
```
λ xl-meta -data xl.meta
{
"8e03504e-1123-4957-b272-7bc53eda0d55": {
"bitrot_valid": true,
"bytes": 58,
"data_base64": "Z29sYW5nLm9yZy94L3N5cyB2MC4xNS4wIC8=",
"data_string": "golang.org/x/sys v0.15.0 /"
}
```
The string will have quotes, newlines escaped to produce valid JSON.
If content isn't valid utf8 or the encoding otherwise fails, only the base64 data will be added.
`-export` can still be used separately to extract the data as files (including bitrot).
* Prevents blocking when losing quorum (standard on cluster restarts).
* Time out to prevent endless buildup. Timed-out remote locks will be canceled because they miss the refresh anyway.
* Reduces latency for all calls since the wall time for the roundtrip to remotes no longer adds to the requests.
since the connection is active, the
response recorder body can grow endlessly
causing leak, as this bytes buffer is
never given back to GC due to an goroutine.
skip healing properly in scanner when drive is hotplugged
due to how the state is passed around the SkipHealing
might not be the true state() of the system always, causing
a situation where we might healing from the scanner on the
same drive which is being. Due to this competing heals get
triggered that slow each other down.
due to a historic bug in CopyObject() where
an inlined object loses its metadata, the
check causes an incorrect fallback verifying
data-dir.
CopyObject() bug was fixed in ffa91f9794 however
the occurrence of this problem is historic, so
the aforementioned check is stretching too much.
Bonus: simplify fileInfoRaw() to read xl.json as well,
also recreate buckets properly.
* Multipart SSEC checksums were not transferred.
* Remove key mismatch logging. This key is user-controlled with SSEC.
* If the source is SSEC and the destination reports ErrSSEEncryptedObject,
assume replication is good.
avoid concurrent callers for LoadUser() to even initiate
object read() requests, if an on-going operation is in progress.
this avoids many callers hitting the drives causing I/O
spikes, also allows for loading credentials faster.
This commit fixes an issue in the KES client configuration
that can cause the following error when connecting to KES:
```
ERROR Failed to connect to KMS: failed to generate data key with KMS key: tls: client certificate is required
```
The Go TLS stack seems to not send a client certificate if it
thinks the client certificate cannot be validated by the peer.
In case of an API key, we don't care about this since we use
public key pinning and the X.509 certificate is just a transport
encoding.
The `GetClientCertificate` seems to be honored always such that
this error does not occur.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
the reason for this is to avoid STS mappings to be
purged without a successful load of other policies,
and all the credentials only loaded successfully
are properly handled.
This also avoids unnecessary cache store which was
implemented earlier for optimization.
Directory objects are used by applications that simulate the folder
structure of an on-disk filesystem. These are zero-byte objects with names
ending with '/'. They are only used to check whether a 'folder' exists in
the namespace.
StartSize starts with the raw free space of all disks in the given pool,
however during the status, CurrentSize is not showing the current free
raw space, as expected at least by `mc admin decom status` since it was
written.
Go's net/http is notoriously difficult to have a streaming
deadlines per READ/WRITE on the net.Conn if we add them they
interfere with the Go's internal requirements for a HTTP
connection.
Remove this support for now
fixes#19853
Add partial shard reconstruction
* Add partial shard reconstruction
* Fix padding causing the last shard to be rejected
* Add md5 checks on single parts
* Move md5 verified to `verified/filename.ext`
* Move complete (without md5) to `complete/filename.ext.partno`
It's not pretty, but at least now the md5 gives some confidence it works correctly.
In the very rare case when all drives in a erasure set need to be healed,
remove .healing.bin from all drives, otherwise it will be stuck in a
loop
Also, fix a unit test that fails sometimes due to wrong test.
since #19688 there was a regression introduced during drive
lookups for single node multi-drive setups, drive replacement
would not work correctly without this PR.
This does not fix any current issue, but merging https://github.com/minio/madmin-go/pull/282
can lose the validation of the service account expiration time.
Add more defensive code for now. In the future, we should avoid doing
validation in another library.
precondition check was being honored before, validating
if anonymous access is allowed on the metadata of an
object, leading to metadata disclosure of the following
headers.
```
Last-Modified
Etag
x-amz-version-id
Expires:
Cache-Control:
```
although the information presented is minimal in nature,
and of opaque nature. It still simply discloses that an
object by a specific name exists or not without even having
enough permissions.
fix: authenticate LDAP via actual DN instead of normalized DN
Normalized DN is only for internal representation, not for
external communication, any communication to LDAP must be
based on actual user DN. LDAP servers do not understand
normalized DN.
fixes#19757
This change uses the updated ldap library in minio/pkg (bumped
up to v3). A new config parameter is added for LDAP configuration to
specify extra user attributes to load from the LDAP server and to store
them as additional claims for the user.
A test is added in sts_handlers.go that shows how to access the LDAP
attributes as a claim.
This is in preparation for adding SSH pubkey authentication to MinIO's SFTP
integration.
Add combination of multiple parts.
Parts will be reconstructed and saved separately and can manually be combined to the complete object.
Parts will be named `(version_id)-(filename).(partnum).(in)complete`.
Do not log errors on oneway streams when sending ping fails. Instead, cancel the stream.
This also makes sure pings are sent when blocked on sending responses.
This commit will fix one rare case of a multipart object that
can be read in theory but GetObject API returned an error.
It turned out that a six years old code was marking a drive offline
when the bitrot streaming fails to read a part in a disk with any error.
This can affect reading a subsequent part, though having enough shards,
but unable to construct because one drive was marked offline earlier.
This commit will remove the drive marking offline code. It will also
close the bitrotstreaming reader before marking it as nil.
Adds `-xver` which can be used with `-export` and `-combine` to attempt to combine files across versions if data is suspected to be the same. Overlapping data is compared.
Bonus: Make `inspect` accept wildcards.
Currently, on enabling callhome (or restarting the server), the callhome
job gets scheduled. This means that one has to wait for 24hrs (the
default frequency duration) to see it in action and to figure out if it
is working as expected.
It will be a better user experience to perform the first callhome
execution immediately after enabling it (or on server start if already
enabled).
Also, generate audit event on callhome execution, setting the error
field in case the execution has failed.
* Store ModTime in the upload ID; return it when listing instead of the current time.
* Use this ModTime to expire and skip reading the file info.
* Consistent upload sorting in listing (since it now has the ModTime).
* Exclude healing disks to avoid returning an empty list.
```
==================
WARNING: DATA RACE
Read at 0x0000082be990 by goroutine 205:
github.com/minio/minio/cmd.setCommonHeaders()
Previous write at 0x0000082be990 by main goroutine:
github.com/minio/minio/cmd.lookupConfigs()
```
Recent Veeam is very picky about storage class names. Add `_MINIO_VEEAM_FORCE_SC` env var.
It will override the storage class returned by the storage backend if it is non-standard
and we detect a Veeam client by checking the User Agent.
Applies to HeadObject/GetObject/ListObject*
add deadlines that can be dynamically changed via
the drive max timeout values.
Bonus: optimize "file not found" case and hung drives/network - circuit break the check and return right
away instead of waiting.
Do not log errors on oneway streams when sending ping fails. Instead cancel the stream.
This also makes sure pings are sent when blocked on sending responses.
I will do a separate PR that includes this and adds pings to two-way streams as well as tests for pings.
as that is the only API where the TTFB metric is beneficial, and
capturing this for all APIs exponentially increases the response size in
large clusters.
Replace the `io.Pipe` from streamingBitrotWriter -> CreateFile with a fixed size ring buffer.
This will add an output buffer for encoded shards to be written to disk - potentially via RPC.
This will remove blocking when `(*streamingBitrotWriter).Write` is called, and it writes hashes and data.
With current settings, the write looks like this:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ Parr. │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Pipe │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (unbuffered) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
We write a Hash (32 bytes). Since the pipe is unbuffered, it will block until the 32 bytes have
been delivered to the TCP buffer, and the next Read hits the Pipe.
Then we write the shard data. This will typically be bigger than 64KB, so it will block until two blocks
have been read from the pipe.
When we insert a ring buffer:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Ring Buffer │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (2MB) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
The hash+shard will fit within the ring buffer, so writes will not block - but will complete after a
memcopy. Reads can fill the 64KB buffer if there is data for it.
If the network is congested, the ring buffer will become filled, and all syscalls will be on full buffers.
Only when the ring buffer is filled will erasure coding start blocking.
Since there is always "space" to write output data, we remove the parallel writing since we are
always writing to memory now, and the goroutine synchronization overhead probably not worth taking.
If the output were blocked in the existing, we would still wait for it to unblock in parallel write, so it would
make no difference there - except now the ring buffer smoothes out the load.
There are some micro-optimizations we could look at later. The biggest is that, in most cases,
we could encode directly to the ring buffer - if we are not at a boundary. Also, "force filling" the
Read requests (i.e., blocking until a full read can be completed) could be investigated and maybe
allow concurrent memory on read and write.
Metrics being added:
- read_tolerance: No of drive failures that can be tolerated without
disrupting read operations
- write_tolerance: No of drive failures that can be tolerated without
disrupting write operations
- read_health: Health of the erasure set in a pool for read operations
(1=healthy, 0=unhealthy)
- write_health: Health of the erasure set in a pool for write operations
(1=healthy, 0=unhealthy)
Adds regression test for #19699
Failures are a bit luck based, since it requires objects to be placed on different sets.
However this generates a failure prior to #19699
* Revert "Revert "Fix incorrect merging of slash-suffixed objects (#19699)""
This reverts commit f30417d9a8.
* Don't override when suffix doesn't match. Instead rely on quorum for each.
Instead of having "online" and "healing" as two metrics, replace with a
single metric "health" which can have following values:
0 = offline
1 = healthy
2 = healing
If two objects share everything but one object has a slash prefix, those would be merged in listings,
with secondary properties used for a tiebreak.
Example: An object with the key `prefix/obj` would be merged with an object named `prefix/obj/`.
While this violates the [no object can be a prefix of another](https://min.io/docs/minio/linux/operations/concepts/thresholds.html#conflicting-objects), let's resolve these.
If we have an object with 'name' and a directory named 'name/' discard the directory only - but allow objects
of 'name' and 'name/' (xldir) to be uniquely returned.
Regression from #15772
canceled callers might linger around longer,
can potentially overwhelm the system. Instead
provider a caller context and canceled callers
don't hold on to them.
Bonus: we have no reason to cache errors, we should
never cache errors otherwise we can potentially have
quorum errors creeping in unexpectedly. We should
let the cache when invalidating hit the actual resources
instead.
LastPong is saved as nanoseconds after a connection or reconnection but
saved as seconds when receiving a pong message. The code deciding if
a pong is too old can be skewed since it assumes LastPong is only in
seconds.
Accept multipart uploads where the combined checksum provides the expected part count.
It seems this was added by AWS to make the API more consistent, even if the
data is entirely superfluous on multiple levels.
Improves AWS S3 compatibility.
This commit adds support for MinKMS. Now, there are three KMS
implementations in `internal/kms`: Builtin, MinIO KES and MinIO KMS.
Adding another KMS integration required some cleanup. In particular:
- Various KMS APIs that haven't been and are not used have been
removed. A lot of the code was broken anyway.
- Metrics are now monitored by the `kms.KMS` itself. For basic
metrics this is simpler than collecting metrics for external
servers. In particular, each KES server returns its own metrics
and no cluster-level view.
- The builtin KMS now uses the same en/decryption implemented by
MinKMS and KES. It still supports decryption of the previous
ciphertext format. It's backwards compatible.
- Data encryption keys now include a master key version since MinKMS
supports multiple versions (~4 billion in total and 10000 concurrent)
per key name.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
If used, 'opts.Marker` will cause many missed entries since results are returned
unsorted, and pools are serialized.
Switch to fully concurrent listing and merging across pools to return sorted entries.
It is expected that whoever is using the credentials which has
the proper set of permissions must be able to run.
`mc support perf object`
While the root login is disabled.
fixes#19648
AWS S3 returns the actual object size as part of XML
response for InvalidRange error, this is used apparently
by SDKs to retry the request without the range.
'opts.Marker` is causing many missed entries if used since results are returned unsorted. Also since pools are serialized.
Switch to do fully concurrent listing and merging across pools to return sorted entries.
Returning errors on listings is impossible with the current API, so document that.
Return an error at once if no drives are found instead of just returning an empty listing and no error.
This is to support deployments migrating from a multi-pooled
wider stripe to lower stripe. MINIO_STORAGE_CLASS_STANDARD
is still expected to be same for all pools. So you can satisfy
adding custom drive count based pools by adjusting the storage
class value.
```
version: v2
address: ':9000'
rootUser: 'minioadmin'
rootPassword: 'minioadmin'
console-address: ':9001'
pools: # Specify the nodes and drives with pools
-
args:
- 'node{11...14}.example.net/data{1...4}'
-
args:
- 'node{15...18}.example.net/data{1...4}'
-
args:
- 'node{19...22}.example.net/data{1...4}'
-
args:
- 'node{23...34}.example.net/data{1...10}'
set-drive-count: 6
```
ILM actions due to ExpiredObjectDeleteAllVersions and
DelMarkerExpiration are ignored when object locking is enabled on a
bucket.
Note: This applies to object versions which may not have retention
configured on them. This applies to all object versions in this bucket,
including those created before the retention config was applied.
Per-bucket metrics endpoints always start with /bucket and the bucket
name is appended to the path. e.g. if the collector path is /bucket/api,
the endpoint for the bucket "mybucket" would be
/minio/metrics/v3/bucket/api/mybucket
Change the existing bucket api endpoint accordingly from /api/bucket to
/bucket/api
The `Token` parameter is a sensitive value that should not be output in the Audit log for STS AssumeRoleWithCustomToken API.
Bonus: Add a simple tool that echoes audit logs to the console.
When listing, with drives returning `errFileNotFound,` `errVolumeNotFound`, or `errUnformattedDisk,`,
we could get below `minDisks` drives being left.
This would result in a quorum never being reachable for any object. Therefore, the listing
would continue, but no results would ever be produced.
Include `fnf` in the mindisk check since it is incremented on these errors. This will stop
listing when minDisks are left.
Allow `opts.minDisks` to not return errVolumeNotFound or errFileNotFound and return that.
That will allow for good results even if disks return something else.
We switch `errUnformattedDisk` to a regular error. If we have enough of those, we should just fail.
Typically not all drives are connected, so we delay 3 minutes before resuming.
This greatly reduces risk of starting to list unconnected drives, or drives we risk being disconnected soon.
This delay is not applied when starting with an admin call.
ConsoleUI like applications rely on combination of
ListServiceAccounts() and InfoServiceAccount() to populate
UI elements, however individually these calls can be slow
causing the entire UI to load sluggishly.
i.e., this rule element doesn't apply to DEL markers.
This is a breaking change to how ExpiredObejctDeleteAllVersions
functions today. This is necessary to avoid the following highly probable
footgun scenario in the future.
Scenario:
The user uses tags-based filtering to select an object's time to live(TTL).
The application sometimes deletes objects, too, making its latest
version a DEL marker. The previous implementation skipped tag-based filters
if the newest version was DEL marker, voiding the tag-based TTL. The user is
surprised to find objects that have expired sooner than expected.
* Add DelMarkerExpiration action
This ILM action removes all versions of an object if its
the latest version is a DEL marker.
```xml
<DelMarkerObjectExpiration>
<Days> 10 </Days>
</DelMarkerObjectExpiration>
```
1. Applies only to objects whose,
• The latest version is a DEL marker.
• satisfies the number of days criteria
2. Deletes all versions of this object
3. Associated rule can't have tag-based filtering
Includes,
- New bucket event type for deletion due to DelMarkerExpiration
calling a remote target remove with a perfectly
well constructed ARN can lead to a crash for a bucket
with no replication configured.
This PR fixes, and adds a crash check for ImportMetadata
as well.
Algorithms are comma separated.
Note that valid values does not in all cases represent default values.
`--sftp=pub-key-algos=...` specifies the supported client public key
authentication algorithms. Note that this doesn't include certificate types
since those use the underlying algorithm. This list is sent to the client if
it supports the server-sig-algs extension. Order is irrelevant.
Valid values
```
ssh-ed25519
sk-ssh-ed25519@openssh.comsk-ecdsa-sha2-nistp256@openssh.com
ecdsa-sha2-nistp256
ecdsa-sha2-nistp384
ecdsa-sha2-nistp521
rsa-sha2-256
rsa-sha2-512
ssh-rsa
ssh-dss
```
`--sftp=kex-algos=...` specifies the supported key-exchange algorithms in preference order.
Valid values:
```
curve25519-sha256
curve25519-sha256@libssh.org
ecdh-sha2-nistp256
ecdh-sha2-nistp384
ecdh-sha2-nistp521
diffie-hellman-group14-sha256
diffie-hellman-group16-sha512
diffie-hellman-group14-sha1
diffie-hellman-group1-sha1
```
`--sftp=cipher-algos=...` specifies the allowed cipher algorithms.
If unspecified then a sensible default is used.
Valid values:
```
aes128-ctr
aes192-ctr
aes256-ctr
aes128-gcm@openssh.comaes256-gcm@openssh.comchacha20-poly1305@openssh.com
arcfour256
arcfour128
arcfour
aes128-cbc
3des-cbc
```
`--sftp=mac-algos=...` specifies a default set of MAC algorithms in preference order.
This is based on RFC 4253, section 6.4, but with hmac-md5 variants removed because they have
reached the end of their useful life.
Valid values:
```
hmac-sha2-256-etm@openssh.comhmac-sha2-512-etm@openssh.com
hmac-sha2-256
hmac-sha2-512
hmac-sha1
hmac-sha1-96
```
This would reduce the size of data in response of metrics
listing. While graphing we can default these metrics with
a zero value if not found.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Unfreeze as soon as the incoming connection is terminated and don't wait for everything to complete.
We don't want to keep the services frozen if something becomes stuck.
- handle errFileCorrupt properly
- micro-optimization of sending done() response quicker
to close the goroutine.
- fix logger.Event() usage in a couple of places
- handle the rest of the client to return a different error other than
lastErr() when the client is closed.
listPathRaw() counts errDiskNotFound as a valid error to indicate a
listing stream end. However, storage.WalkDir() is allowed to return
errDiskNotFound anytime since grid.ErrDisconnected is converted to
errDiskNotFound.
This affects fresh disk healing and should affect S3 listing as well.
endpoint: /minio/metrics/v3/system/process
metrics:
- locks_read_total
- locks_write_total
- cpu_total_seconds
- go_routine_total
- io_rchar_bytes
- io_read_bytes
- io_wchar_bytes
- io_write_bytes
- start_time_seconds
- uptime_seconds
- file_descriptor_limit_total
- file_descriptor_open_total
- syscall_read_total
- syscall_write_total
- resident_memory_bytes
- virtual_memory_bytes
- virtual_memory_max_bytes
Since the standard process collector implements only a subset of these
metrics, remove it and implement our own custom process collector that
captures all the process metrics we need.
Since the object is being permanently deleted, the lack of read quorum should not
matter as long as sufficient disks are online to complete the deletion with parity
requirements.
If several pools have the same object with insufficient read quorum, attempt to
delete object from all the pools where it exists
At server startup, LDAP configuration is validated against the LDAP
server. If the LDAP server is down at that point, we need to cleanly
disable LDAP configuration. Previously, LDAP would remain configured but
error out in strange ways because initialization did not complete
without errors.
When importing access keys (i.e. service accounts) for LDAP accounts,
we are requiring groups to exist under one of the configured group base
DNs. This is not correct. This change fixes this by only checking for
existence and storing the normalized form of the group DN - we do not
return an error if the group is not under a base DN.
Test is updated to illustrate an import failure that would happen
without this change.
Existing IAM import logic for LDAP creates new mappings when the
normalized form of the mapping key differs from the existing mapping key
in storage. This change effectively replaces the existing mapping key by
first deleting it and then recreating with the normalized form of the
mapping key.
For e.g. if an older deployment had a policy mapped to a user DN -
`UID=alice1,OU=people,OU=hwengg,DC=min,DC=io`
instead of adding a mapping for the normalized form -
`uid=alice1,ou=people,ou=hwengg,dc=min,dc=io`
we should replace the existing mapping.
This ensures that duplicates mappings won't remain after the import.
Some additional cleanup cases are also covered. If there are multiple
mappings for the name normalized key such as:
`UID=alice1,OU=people,OU=hwengg,DC=min,DC=io`
`uid=alice1,ou=people,ou=hwengg,DC=min,DC=io`
`uid=alice1,ou=people,ou=hwengg,dc=min,dc=io`
we check if the list of policies mapped to all these keys are exactly
the same, and if so remove all of them and create a single mapping with
the normalized key. However, if the policies mapped to such keys differ,
the import operation returns an error as the server cannot automatically
pick the "right" list of policies to map.
When inspecting files like `.minio.sys/pool.bin` that may be present on multiple sets, use signature to separate them.
Also fixes null versions to actually be useful with `-export -combine`.
`minio_cluster_webhook_queue_length` was wrongly defined as `counter`
where-as it should be `gauge`
Following were wrongly defined as `gauge` when they should actually be
`counter`:
- minio_bucket_replication_sent_bytes
- minio_bucket_replication_received_bytes
- minio_bucket_replication_total_failed_bytes
- minio_bucket_replication_total_failed_count
When LDAP is enabled, previously we were:
- rejecting creation of users and groups via the IAM import functionality
- throwing a `not a valid DN` error when non-LDAP group mappings are present
This change allows for these cases as we need to support situations
where the MinIO server contains users, groups and policy mappings
created before LDAP was enabled.
instead upon any error in renameData(), we still
preserve the existing dataDir in some form for
recoverability in strange situations such as out
of disk space type errors.
Bonus: avoid running list and heal() instead allow
versions disparity to return the actual versions,
uuid to heal. Currently limit this to 100 versions
and lesser disparate objects.
an undo now reverts back the xl.meta from xl.meta.bkp
during overwrites on such flaky setups.
Bonus: Save N depth syscalls via skipping the parents
upon overwrites and versioned updates.
Flaky setup examples are stretch clusters with regular
packet drops etc, we need to add some defensive code
around to avoid dangling objects.
RenameData could start operating on inline data after timing out
and the call returned due to WithDeadline.
This could cause a buffer to write to the inline data being written.
Since no writes are in `RenameData` and the call is canceled,
this doesn't present a corruption issue. But a race is a race and
should be fixed.
Copy inline data to a fresh buffer.
This PR makes a feasible approach to handle all the scenarios
that we must face to avoid returning "panic."
Instead, we must return "errServerNotInitialized" when a
bucketMetadataSys.Get() is called, allowing the caller to
retry their operation and wait.
Bonus fix the way data-usage-cache stores the object.
Instead of storing usage-cache.bin with the bucket as
`.minio.sys/buckets`, the `buckets` must be relative
to the bucket `.minio.sys` as part of the object name.
Otherwise, there is no way to decommission entries at
`.minio.sys/buckets` and their final erasure set positions.
A bucket must never have a `/` in it. Adds code to read()
from existing data-usage.bin upon upgrade.
This PR fixes a few things
- FIPS support for missing for remote transports, causing
MinIO could end up using non-FIPS Ciphers in FIPS mode
- Avoids too many transports, they all do the same thing
to make connection pooling work properly re-use them.
- globalTCPOptions must be set before setting transport
to make sure the client conn deadlines are honored properly.
- GCS warm tier must re-use our transport
- Re-enable trailing headers support.
This reverts commit 928c0181bf.
This change was not correct, reverting.
We track 3 states with the ProxyRequest header - if replication process wants
to know if object is already replicated with a HEAD, it shouldn't proxy back
- Poorna
AWS S3 trailing header support was recently enabled on the warm tier
client connection to MinIO type remote tiers. With this enabled, we are
seeing the following error message at http transport layer.
> Unsolicited response received on idle HTTP channel starting with "HTTP/1.1 400 Bad Request\r\nContent-Type: text/plain; charset=utf-8\r\nConnection: close\r\n\r\n400 Bad Request"; err=<nil>
This is an interim fix until we identify the root cause for this behaviour in the
minio-go client package.
Keep the EC in header, so it can be retrieved easily for dynamic quorum calculations.
To not force a full metadata decode on every read the value will be 0/0 for data written in previous versions.
Size is expected to increase by 2 bytes per version, since all valid values can be represented with 1 byte each.
Example:
```
λ xl-meta xl.meta
{
"Versions": [
{
"Header": {
"EcM": 4,
"EcN": 8,
"Flags": 6,
"ModTime": "2024-04-17T11:46:25.325613+02:00",
"Signature": "0a409875",
"Type": 1,
"VersionID": "8e03504e11234957b2727bc53eda0d55"
},
...
```
Not used for operations yet.
Follow up for #19528
If there are multiple existing DN mappings for the same normalized DN,
if they all have the same policy mapping value, we pick one of them of
them instead of returning an import error.
This is a change to IAM export/import functionality. For LDAP enabled
setups, it performs additional validations:
- for policy mappings on LDAP users and groups, it ensures that the
corresponding user or group DN exists and if so uses a normalized form
of these DNs for storage
- for access keys (service accounts), it updates (i.e. validates
existence and normalizes) the internally stored parent user DN and group
DNs.
This allows for a migration path for setups in which LDAP mappings have
been stored in previous versions of the server, where the name of the
mapping file stored on drives is not in a normalized form.
An administrator needs to execute:
`mc admin iam export ALIAS`
followed by
`mc admin iam import ALIAS /path/to/export/file`
The validations are more strict and returns errors when multiple
mappings are found for the same user/group DN. This is to ensure the
mappings stored by the server are unambiguous and to reduce the
potential for confusion.
Bonus **bug fix**: IAM export of access keys (service accounts) did not
export key name, description and expiration. This is fixed in this
change too.
Reading the list metacache is not protected by a lock; the code retries when it fails
to read the metacache object, however, it forgot to re-read the metacache object
from the drives, which is necessary, especially if the metacache object is inlined.
This commit will ensure that we always re-read the metacache object from the drives
when it is retrying.
When resuming a versioned listing where `version-id-marker=null`, the `null` object would
always be returned, causing duplicate entries to be returned.
Add check against empty version
unlinking() at two different locations on a disk when there
are lots to purge, this can lead to huge IOwaits, instead
rely on rename() to .trash to avoid running multiple unlinks()
in parallel.
since mid 2018 we do not have any deployments
without deployment-id, it is time to put this
code to rest, this PR removes this old code as
its no longer valuable.
on setups with 1000's of drives these are all
quite expensive operations.
The rest of the peer clients were not consistent across nodes. So, meta cache requests
would not go to the same server if a continuation happens on a different node.
As node metrics should be scraped per node basis, use a sample
configuartion using all the nodes in targets.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
When no results match or another error occurs, add an error to the stream. Keep the "inspect-input.txt" as the only thing in the zip for reference.
Example:
```
λ mc support inspect --airgap myminio/testbucket/fjghfjh/**
mc: Using public key from C:\Users\klaus\mc\support_public.pem
File data successfully downloaded as inspect-data.enc
λ inspect inspect-data.enc
Using private key from support_private.pem
output written to inspect-data.zip
2024/04/11 14:10:51 next stream: GetRawData: No files matched the given pattern
λ unzip -l inspect-data.zip
Archive: inspect-data.zip
Length Date Time Name
--------- ---------- ----- ----
222 2024-04-11 14:10 inspect-input.txt
--------- -------
222 1 file
λ
```
Modifies inspect to read until end of stream to report the error.
Bonus: Add legacy commandline params
Add following metrics:
- used_inodes
- total_inodes
- healing
- online
- reads_per_sec
- reads_kb_per_sec
- reads_await
- writes_per_sec
- writes_kb_per_sec
- writes_await
- perc_util
To be able to calculate the `per_sec` values, we capture the IOStats-related
data in the beginning (along with the time at which they were captured),
and compare them against the current values subsequently. This is because
dividing by "time since server uptime." doesn't work in k8s environments.
the disk location never changes in the lifetime of a
MinIO cluster, even if it did validate this close to the
disk instead at the higher layer.
Return appropriate errors indicating an invalid drive, so
that the drive is not recognized as part of a valid
drive.
we have had numerous reports on some config
values not having default values, causing
features misbehaving and not having default
values set properly.
This PR tries to address all these concerns
once and for all.
Each new sub-system that gets added
- must check for invalid keys
- must have default values set
- must not "return err" when being saved into
a global state() instead collate as part of
other subsystem errors allow other sub-systems
to independently initialize.
* Allow specifying the local server, with env variable _MINIO_SERVER_LOCAL, in systems where the hostname cannot be resolved to local IP
* Limit scope of the _MINIO_SERVER_LOCAL solution to only containerized implementations
Return an error when the user specifies endpoints for both source
and target. This can generate many type of errors as the code considers
a deployment remote if its endpoint is specified.
HealObject() does not return an error in some cases, for example, when
an object is successfully reconstructed in one disk but fails with other
disks, another case is when a disk does not have the object is temporarily
disconnected
Add the After heal drives result in the audit output for better
analysis.
Set object's modTime when being restored
restored here refers to making a temporary local copy in the hot tier
for a tiered object using the RestoreObject API
we have been using an LRU caching for internode
auth tokens, migrate to using a typed implementation
and also do not cache auth tokens when its an error.
This fixes a regression from #19358 which prevents policy mappings
created in the latest release from being displayed in policy entity
listing APIs.
This is due to the possibility that the base DNs in the LDAP config are
not in a normalized form and #19358 introduced normalized of mapping
keys (user DNs and group DNs). When listing, we check if the policy
mappings are on entities that parse as valid DNs that are descendants of
the base DNs in the config.
Test added that demonstrates a failure without this fix.
Create new code paths for multiple subsystems in the code. This will
make maintaing this easier later.
Also introduce bugLogIf() for errors that should not happen in the first
place.
Using oidc.redirectUri in the values.yaml only works for the deployment.
When using the statefulset the environment variable
MINIO_IDENTITY_OPENID_REDIRECT_URI is not set. This leads to errors with
oicd providers. For example keycloak throws the error 'invalid
redirect_uri'.
This pull request fixes that.
This commit replaces the `KMS.Stat` API call with a
`KMS.GenerateKey` call. This approach is more reliable
since data key generation also works when the KMS backend
is unavailable (temp. offline), but KES has cached the
key. Ref: KES offline caching.
With this change, it is less likely that MinIO readiness
checks fail in cases where the KMS backend is offline.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
Make sure to pass a nil pointer as a Transport to minio-go when the API config
is not initialized, this will make sure that we do not pass an interface
with a known type but a nil value.
This will also fix the update of the API remote_transport_deadline
configuration without requiring the cluster restart.
Use `ODirectPoolSmall` buffers for inline data in PutObject.
Add a separate call for inline data that will fetch a buffer for the inline data before unmarshal.
This fixes a bug where STS Accounts map accumulates accounts in memory
and never removes expired accounts and the STS Policy mappings were not
being refreshed.
The STS purge routine now runs with every IAM credentials load instead
of every 4th time.
The listing of IAM files is now cached on every IAM load operation to
prevent re-listing for STS accounts purging/reload.
Additionally this change makes each server pick a time for IAM loading
that is randomly distributed from a 10 minute interval - this is to
prevent server from thundering while performing the IAM load.
On average, IAM loading will happen between every 5-15min after the
previous IAM load operation completes.
Fix issue [minio#19314], resolve the absence of the sed command in ubi-micro by replacing it with echo.
Signed-off-by: Andreas Bräu <ab@andi95.de>
Co-authored-by: jiuker <2818723467@qq.com>
If site replication enabled across sites, replicate the SSE-C
objects as well. These objects could be read from target sites
using the same client encryption keys.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
User doesn't need to remember and enter the server values,
rather they can select from the pre populated list.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Instead of relying on user input values, we use the DN value returned by
the LDAP server.
This handles cases like when a mapping is set on a DN value
`uid=svc.algorithm,OU=swengg,DC=min,DC=io` with a user input value (with
unicode variation) of `uid=svc﹒algorithm,OU=swengg,DC=min,DC=io`. The
LDAP server on lookup of this DN returns the normalized value where the
unicode dot character `SMALL FULL STOP` (in the user input), gets
replaced with regular full stop.
Bonus: remove persistent md5sum calculation, turn-off
sha256 as well. Instead we always enable crc32c which
is enough for payload verification also support for
trailing headers checksum.
As total drives count, online vs offline are per node basis, its
corect to select node for which graphs need to be rendered.
Set prometheus scrape jobs to fetch metrics from all nodes. A sample
scrape job for node metrics could be as below
```
- job_name: minio-job-node
bearer_token: <token>
metrics_path: /minio/v2/metrics/node
scheme: https
tls_config:
insecure_skip_verify: true
static_configs:
- targets: [tenant1-ss-0-0.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-1.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-2.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-3.tenant1-hl.tenant-ns.svc.cluster.local:9000]
```
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Fix races in IAM cache
Fixes#19344
On the top level we only grab a read lock, but we write to the cache if we manage to fetch it.
a03dac41eb/cmd/iam-store.go (L446) is also flipped to what it should be AFAICT.
Change the internal cache structure to a concurrency safe implementation.
Bonus: Also switch grid implementation.
we must attempt to convert all errors at storage-rest-client
into StorageErr() regardless of what functionality is being
called in, this PR fixes this for multiple callers including
some internally used functions.
- old version was unable to retain messages during config reload
- old version could not go from memory to disk during reload
- new version can batch disk queue entries to single for to reduce I/O load
- error logging has been improved, previous version would miss certain errors.
- logic for spawning/despawning additional workers has been adjusted to trigger when half capacity is reached, instead of when the log queue becomes full.
- old version would json marshall x2 and unmarshal 1x for every log item. Now we only do marshal x1 and then we GetRaw from the store and send it without having to re-marshal.
panic seen due to premature closing of slow channel while listing is still sending or
list has already closed on the sender's side:
```
panic: close of closed channel
goroutine 13666 [running]:
github.com/minio/minio/internal/ioutil.SafeClose[...](0x101ff51e4?)
/Users/kp/code/src/github.com/minio/minio/internal/ioutil/ioutil.go:425 +0x24
github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1()
/Users/kp/code/src/github.com/minio/minio/cmd/erasure-server-pool.go:2142 +0x170
created by github.com/minio/minio/cmd.(*erasureServerPools).Walk in goroutine 1189
/Users/kp/code/src/github.com/minio/minio/cmd/erasure-server-pool.go:1985 +0x228
```
Object names of directory objects qualified for ExpiredObjectAllVersions
must be encoded appropriately before calling on deletePrefix on their
erasure set.
e.g., a directory object and regular objects with overlapping prefixes
could lead to the expiration of regular objects, which is not the
intention of ILM.
```
bucket/dir/ ---> directory object
bucket/dir/obj-1
```
When `bucket/dir/` qualifies for expiration, the current implementation would
remove regular objects under the prefix `bucket/dir/`, in this case,
`bucket/dir/obj-1`.
In handlers related to health diagnostics e.g. CPU, Network, Partitions,
etc, globalMinioHost was being passed as the addr, resulting in empty
value for the same in the health report.
Using globalLocalNodeName instead fixes the issue.
IAM loading is a lazy operation, allow these
fallbacks to be in place when we cannot find
in-memory state().
this allows us to honor the request even if pay
a small price for lookup and populating the data.
When objects have more versions than their ILM policy expects to retain
via NewerNoncurrentVersions, but they don't qualify for expiry due to
NoncurrentDays are configured in that rule.
In this case, applyNewerNoncurrentVersionsLimit method was enqueuing empty
tasks, which lead to a panic (panic: runtime error: index out of range [0] with
length 0) in newerNoncurrentTask.OpHash method, which assumes the task
to contain at least one version to expire.
When returning the status of a decommissioned pool, a pool with zero
time StartedTime will be considered an active pool, which is unexpected.
This commit will always ensure that a pool's canceled/failed/completed
status is returned.
This commit changes how MinIO generates the object encryption key (OEK)
when encrypting an object using server-side encryption.
This change is fully backwards compatible. Now, MinIO generates
the OEK as following:
```
Nonce = RANDOM(32) // generate 256 bit random value
OEK = HMAC-SHA256(EK, Context || Nonce)
```
Before, the OEK was computed as following:
```
Nonce = RANDOM(32) // generate 256 bit random value
OEK = SHA256(EK || Nonce)
```
The new scheme does not technically fix a security issue but
uses a more familiar scheme. The only requirement for the
OEK generation function is that it produces a (pseudo)random value
for every pair (`EK`,`Nonce`) as long as no `EK`-`Nonce` combination
is repeated. This prevents a faulty PRNG from repeating or generating
a "bad" key.
The previous scheme guarantees that the `OEK` is a (pseudo)random
value given that no pair (`EK`,`Nonce`) repeats under the assumption
that SHA256 is indistinguable from a random oracle.
The new scheme guarantees that the `OEK` is a (pseudo)random value
given that no pair (`EK`, `Nonce`) repeats under the assumption that
SHA256's underlying compression function is a PRF/PRP.
While the later is a weaker assumption, and therefore, less likely
to be false, both are considered true. SHA256 is believed to be
indistinguable from a random oracle AND its compression function
is assumed to be a PRF/PRP.
As far as the OEK generating is concerned, the OS random number
generator is not required to be pseudo-random but just non-repeating.
Apart from being more compatible to standard definitions and
descriptions for how to generate crypto. keys, this change does not
have any impact of the actual security of the OEK key generation.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
avoids error during upgrades such as
```
API: SYSTEM()
Time: 19:19:22 UTC 03/18/2024
DeploymentID: 24e4b574-b28d-4e94-9bfa-03c363a600c2
Error: Invalid api configuration: found invalid keys (expiry_workers=100 ) for 'api' sub-system, use 'mc admin config reset myminio api' to fix invalid keys (*fmt.wrapError)
11: internal/logger/logger.go:260:logger.LogIf()
...
```
we were prematurely not writing 4k pages while we
could have due to the fact that most buffers would
be multiples of 4k upto some number and there shall
be some remainder.
We only need to write the remainder without O_DIRECT.
at scale customers might start with failed drives,
causing skew in the overall usage ratio per EC set.
make this configurable such that customers can turn
this off as needed depending on how comfortable they
are.
Cosmetic change, but breaks up a big code block and will make a goroutine
dumps of streams are more readable, so it is clearer what each goroutine is doing.
Currently, the code relies on object parity to decide whether it is a
delete marker or a regular object. In the case of a delete marker, the
return quorum is half of the disks in the erasure set. However, this
calculation must be corrected with objects with EC = 0, mainly
because EC is not a one-time fixed configuration.
Though all data are correct, the manifested symptom is a 503 with an
EC=0 object. This bug was manifested after we introduced the
fast Get Object feature that does not read all data from all disks in
case of inlined objects
Metrics v3 is mainly a reorganization of metrics into smaller groups of
metrics and the removal of internal aggregation of metrics received from
peer nodes in a MinIO cluster.
This change adds the endpoint `/minio/metrics/v3` as the top-level metrics
endpoint and under this, various sub-endpoints are implemented. These
are currently documented in `docs/metrics/v3.md`
The handler will serve metrics at any path
`/minio/metrics/v3/PATH`, as follows:
when PATH is a sub-endpoint listed above => serves the group of
metrics under that path; or when PATH is a (non-empty) parent
directory of the sub-endpoints listed above => serves metrics
from each child sub-endpoint of PATH. otherwise, returns a no
resource found error
All available metrics are listed in the `docs/metrics/v3.md`. More will
be added subsequently.
When an object qualifies for both tiering and expiration rules and is
past its expiration date, it should be expired without requiring to tier
it, even when tiering event occurs before expiration.
Merging same-object - multiple versions from different pools would not always result in correct ordering.
When merging keep inputs separate.
```
λ mc ls --versions local/testbucket
------ before ------
[2024-03-05 20:17:19 CET] 228B STANDARD 1f163718-9bc5-4b01-bff7-5d8cf09caf10 v3 PUT hosts
[2024-03-05 20:19:56 CET] 19KiB STANDARD null v2 PUT hosts
[2024-03-05 20:17:15 CET] 228B STANDARD 73c9f651-f023-4566-b012-cc537fdb7ce2 v1 PUT hosts
------ after ------
λ mc ls --versions local/testbucket
[2024-03-05 20:19:56 CET] 19KiB STANDARD null v3 PUT hosts
[2024-03-05 20:17:19 CET] 228B STANDARD 1f163718-9bc5-4b01-bff7-5d8cf09caf10 v2 PUT hosts
[2024-03-05 20:17:15 CET] 228B STANDARD 73c9f651-f023-4566-b012-cc537fdb7ce2 v1 PUT hosts
```
configure batch size to send audit/logger events
in batches instead of sending one event per connection.
this is mainly to optimize the number of requests
we make to webhook endpoint.
Currently, the progress of the batch job is saved in inside the job
request object, which is normally not supported by MinIO. Though there
is no apparent bug, it is better to fix this now.
Batch progress is saved in .minio.sys/batch-jobs/reports/
Co-authored-by: Anis Eleuch <anis@min.io>
our PoolNumber calculation was costly,
while we already had this information per
endpoint, we needed to deduce it appropriately.
This PR addresses this by assigning PoolNumbers
field that carries all the pool numbers that
belong to a server.
properties.PoolNumber still carries a valid value
only when len(properties.PoolNumbers) == 1, otherwise
properties.PoolNumber is set to math.MaxInt (indicating
that this value is undefined) and then one must rely
on properties.PoolNumbers for server participation
in multiple pools.
addresses the issue originating from #11327
simplify audit webhook worker model
fixes couple of bugs like
- ping(ctx) was creating a logger without updating
number of workers leading to incorrect nWorkers
scaling, causing an additional worker that is not
tracked properly.
- h.logCh <- entry could potentially hang for when
the queue is full on heavily loaded systems.
there can be a sudden spike in tiny allocations,
due to too much auditing being done, also don't hang
on the
```
h.logCh <- entry
```
after initializing workers if you do not have a way to
dequeue for some reason.
This commits adds support for using the `--endpoint` arg when creating a
tier of type `azure`. This is needed to connect to Azure's Gov Cloud
instance. For example,
```
mc ilm tier add azure TARGET TIER_NAME \
--account-name ACCOUNT \
--account-key KEY \
--bucket CONTAINER \
--endpoint https://ACCOUNT.blob.core.usgovcloudapi.net
--prefix PREFIX \
--storage-class STORAGE_CLASS
```
Prior to this, the endpoint was hardcoded to `https://ACCOUNT.blob.core.windows.net`.
The docs were even explicit about this, stating that `--endpoint` is:
"Required for `s3` or `minio` tier types. This option has no effect for any
other value of `TIER_TYPE`."
Now, if the endpoint arg is present it will be used. If not, it will
fall back to the same default behavior of `ACCOUNT.blob.core.windows.net`.
Remove api.expiration_workers config setting which was inadvertently left behind. Per review comment
https://github.com/minio/minio/pull/18926, expiration_workers can be configured via ilm.expiration_workers.
ext4, xfs support this behavior however
btrfs, nfs may not support it properly.
in-case when we see Nlink < 2 then we know
that we need to fallback on readdir()
fixes a regression from #19100fixes#19181
The middleware sets up tracing, throttling, gzipped responses and
collecting API stats.
Additionally, this change updates the names of handler functions in
metric labels to be the same as the name derived from Go lang reflection
on the handler name.
The metric api labels are now stored in memory the same as the handler
name - they will be camelcased, e.g. `GetObject` instead of `getobject`.
For compatibility, we lowercase the metric api label values when emitting the metrics.
- Use a shared worker pool for all ILM expiry tasks
- Free version cleanup executes in a separate goroutine
- Add a free version only if removing the remote object fails
- Add ILM expiry metrics to the node namespace
- Move tier journal tasks to expiryState
- Remove unused on-disk journal for tiered objects pending deletion
- Distribute expiry tasks across workers such that the expiry of versions of
the same object serialized
- Ability to resize worker pool without server restart
- Make scaling down of expiryState workers' concurrency safe; Thanks
@klauspost
- Add error logs when expiryState and transition state are not
initialized (yet)
* metrics: Add missed tier journal entry tasks
* Initialize the ILM worker pool after the object layer
With this commit, MinIO generates root credentials automatically
and deterministically if:
- No root credentials have been set.
- A KMS (KES) is configured.
- API access for the root credentials is disabled (lockdown mode).
Before, MinIO defaults to `minioadmin` for both the access and
secret keys. Now, MinIO generates unique root credentials
automatically on startup using the KMS.
Therefore, it uses the KMS HMAC function to generate pseudo-random
values. These values never change as long as the KMS key remains
the same, and the KMS key must continue to exist since all IAM data
is encrypted with it.
Backward compatibility:
This commit should not cause existing deployments to break. It only
changes the root credentials of deployments that have a KMS configured
(KES, not a static key) but have not set any admin credentials. Such
implementations should be rare or not exist at all.
Even if the worst case would be updating root credentials in mc
or other clients used to administer the cluster. Root credentials
are anyway not intended for regular S3 operations.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
just like client-conn-read-deadline, added a new flag that does
client-conn-write-deadline as well.
Both are not configured by default, since we do not yet know
what is the right value. Allow this to be configurable if needed.
we should do this to ensure that we focus on
data healing as primary focus, fixing metadata
as part of healing must be done but making
data available is the main focus.
the main reason is metadata inconsistencies can
cause data availability issues, which must be
avoided at all cost.
will be bringing in an additional healing mechanism
that involves "metadata-only" heal, for now we do
not expect to have these checks.
continuation of #19154
Bonus: add a pro-active healthcheck to perform a connection
This change makes the label names consistent with the handler names.
This is in preparation to use reflection based API handler function
names for the api labels so they will be the same as tracing, auditing
and logging names for these API calls.
Moved different dashboards to their specific directories. Also
mentioned that these dashbards are examples of how to create
graphs using MinIO provided and metrics and customers should
change / add graphs on their specific need basis.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Streams can return errors if the cancelation is picked up before the response
stream close is picked up. Under extreme load, this could lead to missing
responses.
Send server mux ack async so a blocked send cannot block newMuxStream
call. Stream will not progress until mux has been acked.
in k8s things really do come online very asynchronously,
we need to use implementation that allows this randomness.
To facilitate this move WriteAll() as part of the
websocket layer instead.
Bonus: avoid instances of dnscache usage on k8s
New disk healing code skips/expires objects that ILM supposed to expire.
Add more visibility to the user about this activity by calculating those
objects and print it at the end of healing activity.
This PR fixes a bug that perhaps has been long introduced,
with no visible workarounds. In any deployment, if an entire
erasure set is deleted, there is no way the cluster recovers.
Currently, if one object tag matches with one lifecycle tag filter, ILM
will select it, however, this is wrong. All the Tag filters in the
lifecycle document should be satisfied.
This change is to decouple need for root credentials to match between
site replication deployments.
Also ensuring site replication config initialization is re-tried until
it succeeds, this deoendency is critical to STS flow in site replication
scenario.
Currently, we read from `/proc/diskstats` which is found to be
un-reliable in k8s environments. We can read from `sysfs` instead.
Also, cache the latest drive io stats to find the diff and update
the metrics.
* Remove lock for cached operations.
* Rename "Relax" to `ReturnLastGood`.
* Add `CacheError` to allow caching values even on errors.
* Add NoWait that will return current value with async fetching if within 2xTTL.
* Make benchmark somewhat representative.
```
Before: BenchmarkCache-12 16408370 63.12 ns/op 0 B/op
After: BenchmarkCache-12 428282187 2.789 ns/op 0 B/op
```
* Remove `storageRESTClient.scanning`. Nonsensical - RPC clients will not have any idea about scanning.
* Always fetch remote diskinfo metrics and cache them. Seems most calls are requesting metrics.
* Do async fetching of usage caches.
It also fixes a long-standing bug in expiring transitioned objects.
The expiration action was deleting the current version in the case'
of tiered objects instead of adding a delete marker.
only enable md5sum if explicitly asked by the client, otherwise
its not necessary to compute md5sum when SSE-KMS/SSE-C is enabled.
this is continuation of #17958
If network conditions have filled the output queue before a reconnect happens blocked sends could stop reconnects from happening. In short `respMu` would be held for a mux client while sending - if the queue is full this will never get released and closing the mux client will hang.
A) Use the mux client context instead of connection context for sends, so sends are unblocked when the mux client is canceled.
B) Use a `TryLock` on "close" and cancel the request if we cannot get the lock at once. This will unblock any attempts to send.
data-dir not being present is okay, however we can still
rely on the `rename()` atomic call instead of relying on
write xl.meta write which may truncate the io.EOF.
Add a new function logger.Event() to send the log to Console and
http/kafka log webhooks. This will include some internal events such as
disk healing and rebalance/decommissioning
the PR in #16541 was incorrect and hand wrong assumptions
about the overall setup, revert this since this expectation
to have offline servers is wrong and we can end up with a
bigger chicken and egg problem.
This reverts commit 5996c8c4d5.
Bonus:
- preserve disk in globalLocalDrives properly upon connectDisks()
- do not return 'nil' from newXLStorage(), getting it ready for
the next set of changes for 'format.json' loading.
The previous logic of calculating per second values for disk io stats
divides the stats by the host uptime. This doesn't work in k8s
environment as the uptime is of the pod, but the stats (from
/proc/diskstats) are from the host.
Fix this by storing the initial values of uptime and the stats at the
timme of server startup, and using the difference between current and
initial values when calculating the per second values.
globalLocalDrives seem to be not updated during the
HealFormat() leads to a requirement where the server
needs to be restarted for the healing to continue.
a/prefix
a/prefix/1.txt
where `a/prefix` is an object which does not have `/` at the end,
we do not have to aggressively recursively delete all the sub-folders
as well. Instead convert the call into self contained to deleting
'xl.meta' and then subsequently attempting to Remove the parent.
Bonus: enable audit alerts for object versions
beyond the configured value, default is '100'
versions per object beyond which scanner will
alert for each such objects.
when we expand via pools, there is no reason to stick
with the same distributionAlgo as the rest. Since the
algo only makes sense with-in a pool not across pools.
This allows for newer pools to use newer codepaths to
avoid legacy file lookups when they have a pre-existing
deployment from 2019, they can expand their new pool
to be of a newer distribution format, allowing the
pool to be more performant.
Fix reported races that are actually synchronized by network calls.
But this should add some extra safety for untimely disconnects.
Race reported:
```
WARNING: DATA RACE
Read at 0x00c00171c9c0 by goroutine 214:
github.com/minio/minio/internal/grid.(*muxClient).addResponse()
e:/gopath/src/github.com/minio/minio/internal/grid/muxclient.go:519 +0x111
github.com/minio/minio/internal/grid.(*muxClient).error()
e:/gopath/src/github.com/minio/minio/internal/grid/muxclient.go:470 +0x21d
github.com/minio/minio/internal/grid.(*Connection).handleDisconnectClientMux()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:1391 +0x15b
github.com/minio/minio/internal/grid.(*Connection).handleMsg()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:1190 +0x1ab
github.com/minio/minio/internal/grid.(*Connection).handleMessages.func1()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:981 +0x610
Previous write at 0x00c00171c9c0 by goroutine 1081:
github.com/minio/minio/internal/grid.(*muxClient).roundtrip()
e:/gopath/src/github.com/minio/minio/internal/grid/muxclient.go:94 +0x324
github.com/minio/minio/internal/grid.(*muxClient).traceRoundtrip()
e:/gopath/src/github.com/minio/minio/internal/grid/trace.go:74 +0x10e4
github.com/minio/minio/internal/grid.(*Subroute).Request()
e:/gopath/src/github.com/minio/minio/internal/grid/connection.go:366 +0x230
github.com/minio/minio/internal/grid.(*SingleHandler[go.shape.*github.com/minio/minio/cmd.DiskInfoOptions,go.shape.*github.com/minio/minio/cmd.DiskInfo]).Call()
e:/gopath/src/github.com/minio/minio/internal/grid/handlers.go:554 +0x3fd
github.com/minio/minio/cmd.(*storageRESTClient).DiskInfo()
e:/gopath/src/github.com/minio/minio/cmd/storage-rest-client.go:314 +0x270
github.com/minio/minio/cmd.erasureObjects.getOnlineDisksWithHealingAndInfo.func1()
e:/gopath/src/github.com/minio/minio/cmd/erasure.go:293 +0x171
```
This read will always happen after the write, since there is a network call in between.
However a disconnect could come in while we are setting up the call, so we protect against that with extra checks.
- bucket metadata does not need to look for legacy things
anymore if b.Created is non-zero
- stagger bucket metadata loads across lots of nodes to
avoid the current thundering herd problem.
- Remove deadlines for RenameData, RenameFile - these
calls should not ever be timed out and should wait
until completion or wait for client timeout. Do not
choose timeouts for applications during the WRITE phase.
- increase R/W buffer size, increase maxMergeMessages to 30
We have observed cases where a blocked stream will block for cancellations.
This happens when response channel is blocked and we want to push an error.
This will have the response mutex locked, which will prevent all other operations until upstream is unblocked.
Make this behavior non-blocking and if blocked spawn a goroutine that will send the response and close the output.
Still a lot of "dancing". Added a test for this and reviewed.
Depending on when the context cancelation is picked up the handler may return and close the channel before `SubscribeJSON` returns, causing:
```
Feb 05 17:12:00 s3-us-node11 minio[3973657]: panic: send on closed channel
Feb 05 17:12:00 s3-us-node11 minio[3973657]: goroutine 378007076 [running]:
Feb 05 17:12:00 s3-us-node11 minio[3973657]: github.com/minio/minio/internal/pubsub.(*PubSub[...]).SubscribeJSON.func1()
Feb 05 17:12:00 s3-us-node11 minio[3973657]: github.com/minio/minio/internal/pubsub/pubsub.go:139 +0x12d
Feb 05 17:12:00 s3-us-node11 minio[3973657]: created by github.com/minio/minio/internal/pubsub.(*PubSub[...]).SubscribeJSON in goroutine 378010884
Feb 05 17:12:00 s3-us-node11 minio[3973657]: github.com/minio/minio/internal/pubsub/pubsub.go:124 +0x352
```
Wait explicitly for the goroutine to exit.
Bonus: Listen for doneCh when sending to not risk getting blocked there is channel isn't being emptied.
this fixes rare bugs we have seen but never really found a
reproducer for
- PutObjectRetention() returning 503s
- PutObjectTags() returning 503s
- PutObjectMetadata() updates during replication returning 503s
These calls return errors, and this perpetuates with
no apparent fix.
This PR fixes with correct quorum requirement.
To force limit the duration of STS accounts, the user can create a new
policy, like the following:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["sts:AssumeRoleWithWebIdentity"],
"Condition": {"NumericLessThanEquals": {"sts:DurationSeconds": "300"}}
}]
}
And force binding the policy to all OpenID users, whether using a claim name or role
ARN.
disk tokens usage is not necessary anymore with the implementation
of deadlines for storage calls and active monitoring of the drive
for I/O timeouts.
Functionality kicking off a bad drive is still supported, it's just that
we do not have to serialize I/O in the manner tokens would do.
Recycle would always be called on the dummy value `any(newRT())` instead of the actual value given to the recycle function.
Caught by race tests, but mostly harmless, except for reduced perf.
Other minor cleanups. Introduced in #18940 (unreleased)
Interpret `null` inline policy for access keys as inheriting parent
policy. Since MinIO Console currently sends this value, we need to honor it
for now. A larger fix in Console and in the server are required.
Fixes#18939.
Allow internal types to support a `Recycler` interface, which will allow for sharing of common types across handlers.
This means that all `grid.MSS` (and similar) objects are shared across in a common pool instead of a per-handler pool.
Add internal request reuse of internal types. Add for safe (pointerless) types explicitly.
Only log params for internal types. Doing Sprint(obj) is just a bit too messy.
With this change, only a user with `UpdateServiceAccountAdminAction`
permission is able to edit access keys.
We would like to let a user edit their own access keys, however the
feature needs to be re-designed for better security and integration with
external systems like AD/LDAP and OpenID.
This change prevents privilege escalation via service accounts.
for actionable, inspections we have `mc support inspect`
we do not need double logging, healing will report relevant
errors if any, in terms of quorum lost etc.
Each Put, List, Multipart operations heavily rely on making
GetBucketInfo() call to verify if bucket exists or not on
a regular basis. This has a large performance cost when there
are tons of servers involved.
We did optimize this part by vectorizing the bucket calls,
however its not enough, beyond 100 nodes and this becomes
fairly visible in terms of performance.
- healing must not set the write xattr
because that is the job of active healing
to update. what we need to preserve is
permanent deletes.
- remove older env for drive monitoring and
enable it accordingly, as a global value.
local disk metrics were polluting cluster metrics
Please remove them instead of adding relevant ones.
- batch job metrics were incorrectly kept at bucket
metrics endpoint, move it to cluster metrics.
- add tier metrics to cluster peer metrics from the node.
- fix missing set level cluster health metrics
Do not rely on `connChange` to do reconnects.
Instead, you can block while the connection is running and reconnect
when handleMessages returns.
Add fully async monitoring instead of monitoring on the main goroutine
and keep this to avoid full network lockup.
it is entirely possible that a rebalance process which was running
when it was asked to "stop" it failed to write its last statistics
to the disk.
After this a pool expansion can cause disruption and all S3 API
calls would fail at IsPoolRebalancing() function.
This PRs makes sure that we update rebalance.bin under such
conditions to avoid any runtime crashes.
add new update v2 that updates per node, allows idempotent behavior
new API ensures that
- binary is correct and can be downloaded checksummed verified
- committed to actual path
- restart returns back the relevant waiting drives
do not need to be defensive in our approach,
we should simply override anything everything
in import process, do not care about what
currently exists on the disk - backup is the
source of truth.
Right now the format.json is excluded if anything within `.minio.sys` is requested.
I assume the check was meant to exclude only if it was actually requesting it.
- Move RenameFile to websockets
- Move ReadAll that is primarily is used
for reading 'format.json' to to websockets
- Optimize DiskInfo calls, and provide a way
to make a NoOp DiskInfo call.
Add separate reconnection mutex
Give more safety around reconnects and make sure a state change isn't missed.
Tested with several runs of `λ go test -race -v -count=500`
Adds separate mutex and doesn't mix in the testing mutex.
AlmosAll uses of NewDeadlineWorker, which relied on secondary values, were used in a racy fashion,
which could lead to inconsistent errors/data being returned. It also propagates the deadline downstream.
Rewrite all these to use a generic WithDeadline caller that can return an error alongside a value.
Remove the stateful aspect of DeadlineWorker - it was racy if used - but it wasn't AFAICT.
Fixes races like:
```
WARNING: DATA RACE
Read at 0x00c130b29d10 by goroutine 470237:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:702 +0x611
github.com/minio/minio/cmd.readFileInfo()
github.com/minio/minio/cmd/erasure-metadata-utils.go:160 +0x122
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.1()
github.com/minio/minio/cmd/erasure-object.go:809 +0x27a
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.2()
github.com/minio/minio/cmd/erasure-object.go:828 +0x61
Previous write at 0x00c130b29d10 by goroutine 470298:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:698 +0x244
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
WARNING: DATA RACE
Write at 0x00c0ba6e6c00 by goroutine 94507:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:419 +0x104
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
Previous read at 0x00c0ba6e6c00 by goroutine 94463:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:422 +0x47e
github.com/minio/minio/cmd.getBucketInfoLocal.func1()
github.com/minio/minio/cmd/peer-s3-server.go:275 +0x122
github.com/minio/pkg/v2/sync/errgroup.(*Group).Go.func1()
```
Probably back from #17701
protection was in place. However, it covered only some
areas, so we re-arranged the code to ensure we could hold
locks properly.
Along with this, remove the DataShardFix code altogether,
in deployments with many drive replacements, this can affect
and lead to quorum loss.
Also limit the amount of concurrency when sending
binary updates to peers, avoid high network over
TX that can cause disconnection events for the
node sending updates.
Race checks would occasionally show race on handleMsgWg WaitGroup by debug messages (used in test only).
Use the `connMu` mutex to protect this against concurrent Wait/Add.
Fixes#18827
If site replication is enabled, we should still show the size and
version distribution histogram metrics at bucket level.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
New API now verifies any hung disks before restart/stop,
provides a 'per node' break down of the restart/stop results.
Provides also how many blocked syscalls are present on the
drives and what users must do about them.
Adds options to do pre-flight checks to provide information
to the user regarding any hung disks. Provides 'force' option
to forcibly attempt a restart() even with waiting syscalls
on the drives.
When rejecting incoming grid requests fill out the rejection reason and log it once.
This will give more context when startup is failing. Already logged after a retry on caller.
On a policy detach operation, if there are no policies remaining
attached to the user/group, remove the policy mapping file, instead of
leaving a file containing an empty list of policies.
Healing dangling buckets is conservative, and it is a typical use case to
fail to remove a dangling bucket because it contains some data because
healing danging bucket code is not allowed to remove data: only healing
the dangling object is allowed to do so.
reference format is constant for any lifetime of
a minio cluster, we do not have to ever replace
it during HealFormat() as it will never change.
additionally we should simply reject reference
formats that we do not understand early on.
GetActualSize() was heavily relying on o.Parts()
to be non-empty to figure out if the object is multipart or not,
However, we have many indicators of whether an object is multipart
or not.
Blindly assuming that o.Parts == nil is not a multipart, is an
incorrect expectation instead, multipart must be obtained via
- Stored metadata value indicating this is a multipart encrypted object.
- Rely on <meta>-actual-size metadata to get the object's actual size.
This value is preserved for additional reasons such as these.
- ETag != 32 length
support proxying of tagging requests in active-active replication
Note: even if proxying is successful, PutObjectTagging/DeleteObjectTagging
will continue to report a 404 since the object is not present locally.
New intervals:
[1024B, 64KiB)
[64KiB, 256KiB)
[256KiB, 512KiB)
[512KiB, 1MiB)
The new intervals helps us see object size distribution with higher
resolution for the interval [1024B, 1MiB).
- HealFormat() was leaking healthcheck goroutines for
disks, we are only interested in enabling healthcheck
for the newly formatted disk, not for existing disks.
- When disk is a root-disk a random disk monitor was
leaking while we ignored the drive.
- When loading the disk for each erasure set, we were
leaking goroutines for the prepare-storage.go disks
which were replaced via the globalLocalDrives slice
- avoid disk monitoring utilizing health tokens that
would cause exhaustion in the tokens, prematurely
which were meant for incoming I/O. This is ensured
by avoiding writing O_DIRECT aligned buffer instead
write 2048 worth of content only as O_DSYNC, which is
sufficient.
Add a hidden configuration under the scanner sub section to configure if
the scanner should sleep between two objects scan. The configuration has
only effect when there is no drive activity related to s3 requests or
healing.
By default, the code will keep the current behavior which is doing
sleep between objects.
To forcefully enable the full scan speed in idle mode, you can do this:
`mc admin config set myminio scanner idle_speed=full`
fixes#18724
A regression was introduced in #18547, that attempted
to file adding a missing `null` marker however we
should not skip returning based on versionID instead
it must be based on if we are being asked to create
a DEL marker or not.
The PR also has a side-affect for replicating `null`
marker permanent delete, as it may end up adding a
`null` marker while removing one.
This PR should address both scenarios.
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups.
To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap.
Once this has been enabled, you need to also set `list_quorum`.
```
~ mc admin config set alias/ api list_quorum=auto`
```
A new debugging tool is available to check for any missing counters.
Following policies if present
```
"Condition": {
"IpAddress": {
"aws:SourceIp": [
"54.240.143.0/24",
"2001:DB8:1234:5678::/64"
]
}
}
```
And client is making a request to MinIO via IPv6 can
potentially crash the server.
Workarounds are turn-off IPv6 and use only IPv4
This PR also increases per node bpool memory from 1024 entries
to 2048 entries; along with that, it also moves the byte pool
centrally instead of being per pool.
minio_node_tier_ttlb_seconds - Distribution of time to last byte for streaming objects from warm tier
minio_node_tier_requests_success - Number of requests to download object from warm tier that were successful
minio_node_tier_requests_failure - Number of requests to download object from warm tier that failed
SUBNET now has a v2 of license that is returned in the new key
`license_v2`. mc will start reading and storing the same. (The old key
`license` is deprecated but is still available in SUBNET response to
ensure that the current released version of minio doesn't break)
`(*xlStorageDiskIDCheck).CreateFile` wraps the incoming reader in `xioutil.NewDeadlineReader`.
The wrapped reader is handed to `(*xlStorage).CreateFile`. This performs a Read call via `writeAllDirect`,
which reads into an `ODirectPool` buffer.
`(*DeadlineReader).Read` spawns an async read into the buffer. If a timeout is hit while reading,
the read operation returns to `writeAllDirect`. The operation returns an error and the buffer is reused.
However, if the async `Read` call unblocks, it will write to the now recycled buffer.
Fix: Remove the `DeadlineReader` - it is inherently unsafe. Instead, rely on the network timeouts.
This is not a disk timeout, anyway.
Regression in https://github.com/minio/minio/pull/17745
This patch adds the targetID to the existing notification target metrics
and deprecates the current target metrics which points to the overall
event notification subsystem
historically, we have always kept storage-rest-server
and a local storage API separate without much trouble,
since they both can independently operate due to no
special state() between them.
however, over some time, we have added state()
such as
- drive monitoring threads now there will be "2" of
them per drive instead of just 1.
- concurrent tokens available per drive are now twice
instead of just single shared, allowing unexpectedly
high amount of I/O to go through.
- applying serialization by using walkMutexes can now
be adequately honored for both remote callers and local
callers.
Regression from #18285. CopyObject options were inheriting source MTime
for metadata timestamps if unspecified, removing this prevented metadata
updates from being applied on target.
The metrics `minio_bucket_replication_received_bytes` and
`minio_bucket_replication_sent_bytes` are additive in nature
and rendering the value as is looks fine.
Also added sort order for few graphs for better reading of tool
tips as keeping ones with highest value at top helps.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
By default the cpu load is the cumulative of all cores. Capture the
percentage load (load * 100 / cpu-count)
Also capture the percentage memory used (used * 100 / total)
use memory for async events when necessary and dequeue them as
needed, for all synchronous events customers must enable
```
MINIO_API_SYNC_EVENTS=on
```
Async events can be lost but is upto to the admin to
decide what they want, we will not create run-away number
of goroutines per event instead we will queue them properly.
Currently the max async workers is set to runtime.GOMAXPROCS(0)
which is more than sufficient in general, but it can be made
configurable in future but may not be needed.
there is potential for danglingWrites when quorum failed, where
only some drives took a successful write, generally this is left
to the healing routine to pick it up. However it is better that
we delete it right away to avoid potential for quorum issues on
version signature when there are many versions of an object.
it is okay if the warm-tier cannot keep up, we should continue
to take I/O at hot-tier, only fail hot-tier or block it when
we are disk full.
Bonus: add metrics counter for these missed tasks, we will
know for sure if one of the node is lagging behind or is
losing too many tasks during transitioning.
A disk that is not able to initialize when an instance is started
will never have a handler registered, which means a user will
need to restart the node after fixing the disk;
This will also prevent showing the wrong 'upgrade is needed.'
error message in that case.
When the disk is still failing, print an error every 30 minutes;
Disk reconnection will be retried every 30 seconds.
Co-authored-by: Anis Elleuch <anis@min.io>
`OpMuxConnectError` was not handled correctly.
Remove local checks for single request handlers so they can
run before being registered locally.
Bonus: Only log IAM bootstrap on startup.
```
using deb packager...
created package: minio-release/linux-amd64/minio_20231120224007.0.0.hotfix.e96ac7272_amd64.deb
using rpm packager...
created package: minio-release/linux-amd64/minio-20231120224007.0.0.hotfix.e96ac7272-1.x86_64.rpm
```
While healing the latest changes of expiry rules across sites
if target had pre existing transition rules, they were getting
overwritten as cloned latest expiry rules from remote site were
getting written as is. Fixed the same and added test cases as
well.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
moveToTrash() function moves a folder to .trash, for example, when
doing some object deletions: a data dir that has many parts will be
renamed to the trash folder; However, ENOSPC is a valid error from
rename(), and it can cripple a user trying to free some space in an
entire disk situation.
Therefore, this commit will try to do a recursive delete in that case.
This allows batch replication to basically do not
attempt to copy objects that do not have read quorum.
This PR also allows walk() to provide custom
values for quorum under batch replication, and
key rotation.
this PR allows following policy
```
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Deny a presigned URL request if the signature is more than 10 min old",
"Effect": "Deny",
"Action": "s3:*",
"Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET1/*",
"Condition": {
"NumericGreaterThan": {
"s3:signatureAge": 600000
}
}
}
]
}
```
This is to basically disable all pre-signed URLs that are older than 10 minutes.
AWS S3 closes keep-alive connections frequently
leading to frivolous logs filling up the MinIO
logs when the transition tier is an AWS S3 bucket.
Ignore such transient errors, let MinIO retry
it when it can.
When minio runs with MINIO_CI_CD=on, it is expected to communicate
with the locally running SUBNET. This is happening in the case of MinIO
via call home functionality. However, the subnet-related functionality inside the
console continues to talk to the SUBNET production URL. Because of this,
the console cannot be tested with a locally running SUBNET.
Set the env variable CONSOLE_SUBNET_URL correctly in such cases.
(The console already has code to use the value of this variable
as the subnet URL)
Optionally allows customers to enable
- Enable an external cache to catch GET/HEAD responses
- Enable skipping disks that are slow to respond in GET/HEAD
when we have already achieved a quorum
Bonus: allow replication to attempt Deletes/Puts when
the remote returns quorum errors of some kind, this is
to ensure that MinIO can rewrite the namespace with the
latest version that exists on the source.
This PR adds a WebSocket grid feature that allows servers to communicate via
a single two-way connection.
There are two request types:
* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
roundtrips with small payloads.
* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
which allows for different combinations of full two-way streams with an initial payload.
Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically on the server names.
Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.
If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte
slices or use a higher-level generics abstraction.
There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.
The request path can be changed to a new one for any protocol changes.
First, all servers create a "Manager." The manager must know its address
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it given
the remote address using.
```
func (m *Manager) Connection(host string) *Connection
```
All serverside handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight
requests and responses must also be given for streaming requests.
The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.
* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
performs a single request and returns the result. Any deadline provided on the request is
forwarded to the server, and canceling the context will make the function return at once.
* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
will initiate a remote call and send the initial payload.
```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
//The appropriate error will be returned.
type Stream struct {
// Responses from the remote server.
// Channel will be closed after an error or when the remote closes.
// All responses *must* be read by the caller until either an error is returned or the channel is closed.
// Canceling the context will cause the context cancellation error to be returned.
Responses <-chan Response
// Requests sent to the server.
// If the handler is defined with 0 incoming capacity this will be nil.
// Channel *must* be closed to signal the end of the stream.
// If the request context is canceled, the stream will no longer process requests.
Requests chan<- []byte
}
type Response struct {
Msg []byte
Err error
}
```
There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
With an odd number of drives per erasure set setup, the write/quorum is
the half + 1; however the decommissioning listing will still list those
objects and does not consider those as stale.
Fix it by using (N+1)/2 formula.
Co-authored-by: Anis Elleuch <anis@min.io>
Immediate transition use case and is mostly used to fill warm
backend with a lot of data when a new deployment is created
Currently, if the transition queue is complete, the transition will be
deferred to the scanner; change this behavior by blocking the PUT request
until the transition queue has a new place for a transition task.
Currently, once the audit becomes offline, there is no code that tries
to reconnect to the audit, at the same time Send() quickly returns with
an error without really trying to send a message the audit endpoint; so
the audit endpoint will never be online again.
Fixing this behavior; the current downside is that we miss printing some
logs when the audit becomes offline; however this information is
available in prometheus
Later, we can refactor internal/logger so the http endpoint can send errors to
console target.
Currently if the object does not exist in quorum disks of an erasure
set, the dangling code is never called because the returned error will
be errFileNotFound or errFileVersionNotFound;
With this commit, when errFileNotFound or errFileVersionNotFound is
returning when trying to calculate the quorum of a given object, the
code checks if a disk returned nil, which means a stale object exists in
that disk, that will trigger deleteIfDangling() function
This commit splits the liveness and readiness
handler into two separate handlers. In K8S, a
liveness probe is used to determine whether the
pod is in "live" state and functioning at all.
In contrast, the readiness probe is used to
determine whether the pod is ready to serve
requests.
A failing liveness probe causes pod restarts while
a failing readiness probe causes k8s to stop routing
traffic to the pod. Hence, a liveness probe should
be as robust as possible while a readiness probe
should be used to load balancing.
Ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Signed-off-by: Andreas Auernhammer <github@aead.dev>
This patch takes care of loading the bucket configs of failed buckets
during the periodic refresh. This makes sure the event notifiers and
remote bucket targets are properly initialized.
users might use MinIO on NFS, GPFS that provide dynamic
inodes and may not even have a concept of free inodes.
to allow users to use MinIO on top of GPFS relax the
free inode check.
* creating a byte buffer for SFTP file segments
* Adding an error condition for when there are
remaining segments in the queue
* Simplification of the queue using a map
it is possible that ILM or Deletes got triggered on batch
of objects that we are attempting to batch replicate, ignore
this scenario as valid behavior.
sendfile implementation to perform DMA on all platforms
Go stdlib already supports sendfile/splice implementations
for
- Linux
- Windows
- *BSD
- Solaris
Along with this change however O_DIRECT for reads() must be
removed as well since we need to use sendfile() implementation
The main reason to add O_DIRECT for reads was to reduce the
chances of page-cache causing OOMs for MinIO, however it would
seem that avoiding buffer copies from user-space to kernel space
this issue is not a problem anymore.
There is no Go based memory allocation required, and neither
the page-cache is referenced back to MinIO. This page-
cache reference is fully owned by kernel at this point, this
essentially should solve the problem of page-cache build up.
With this now we also support SG - when NIC supports Scatter/Gather
https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)
`monitorAndConnectEndpoints` will continue to attempt to reconnect offline disks.
Since disks were never closed, a `MarkOffline` would continue to try to check these disks forever.
Close previous disks.
replace io.Discard usage to fix NUMA copy() latencies
On NUMA systems copying from 8K buffer allocated via
io.Discard leads to large latency build-up for every
```
copy(new8kbuf, largebuf)
```
can in-cur upto 1ms worth of latencies on NUMA systems
due to memory sharding across NUMA nodes.
Fix various regressions from #18029
* If context is canceled the token is never returned. This will lead to scanner being unable to save and deadlocking.
* Fix backup not being able to get any data (hr empty)
* Reduce backup timeout.
Tiering statistics have been broken for some time now, a regression
was introduced in 6f2406b0b6
Bonus fixes an issue where the objects are not assumed to be
of the 'STANDARD' storage-class for the objects that have
not yet tiered, this should be conditional based on the object's
metadata not a default assumption.
This PR also does some cleanup in terms of implementation,
fixes#18070
https://github.com/minio/minio/pull/18307 partially removed the duplicate upload id check.
While I can't really see how ListDir can return duplicate entries, let's re-add it, since it is a cheap sanity check.
This commit changes the container base image
from ubi-minimal to ubi-micro.
The docker build process happens now in two stages.
The build stage:
- downloads the latest CA certificate bundle
- downloads MinIO binary (for requested version/os/arch)
- downloads MinIO binary signature and verifies it
using minisign
Then it creates an image based on ubi-micro with just
the minio binary was downloaded and verified during the
build stage.
The build stage is simplified to just verifying the
minisign signature.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
There can be rare situations where errors seen in bucket metadata
load on startup or subsequent metadata updates can result in missing
replication remotes.
Attempt a refresh of remote targets backed by a good replication config
lazily in 5 minute intervals if there ever occurs a situation where
remote targets go AWOL.
resync status may not be upto-date by
the time the resync is over due to how
the timer is triggered.
diff is sufficient to know if replication
happened or not.
`GetParityForSC` has a value receiver, so Config is copied before the lock is obtained.
Make it pointer receiver.
Fixes:
```
WARNING: DATA RACE
Read at 0x0000079cdd10 by goroutine 190:
github.com/minio/minio/cmd.(*erasureServerPools).BackendInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:579 +0x6f
github.com/minio/minio/cmd.(*erasureServerPools).LocalStorageInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:614 +0x3c6
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler()
github.com/minio/minio/cmd/peer-rest-server.go:347 +0x4ea
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler-fm()
...
WARNING: DATA RACE
Read at 0x0000079cdd10 by goroutine 190:
github.com/minio/minio/cmd.(*erasureServerPools).BackendInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:579 +0x6f
github.com/minio/minio/cmd.(*erasureServerPools).LocalStorageInfo()
github.com/minio/minio/cmd/erasure-server-pool.go:614 +0x3c6
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler()
github.com/minio/minio/cmd/peer-rest-server.go:347 +0x4ea
github.com/minio/minio/cmd.(*peerRESTServer).LocalStorageInfoHandler-fm()
```
Since relaxing quorum the error across pools
for ListBuckets(), GetBucketInfo() we hit a
situation where loading IAM could potentially
return an error for second pool that server
is not initialized.
We need to handle this, let the pool come online
and retry transparently - this PR fixes that.
x-amz-signed-headers is meant for HTTP headers only
not for query params, using that to verify things
further can lead to failure.
The generated presigned URL with custom metadata
is already kosher (tamper proof).
fixes#18281
`resourceMetricsMap` has no protection against concurrent reads and writes.
Add a mutex and don't use maps from the last iteration.
Bug introduced in #18057Fixes#18271
globalDeploymentID was being read while it was being set.
Fixes race:
```
WARNING: DATA RACE
Write at 0x0000079605a0 by main goroutine:
github.com/minio/minio/cmd.connectLoadInitFormats()
github.com/minio/minio/cmd/prepare-storage.go:269 +0x14f0
github.com/minio/minio/cmd.waitForFormatErasure()
github.com/minio/minio/cmd/prepare-storage.go:294 +0x21d
...
Previous read at 0x0000079605a0 by goroutine 105:
github.com/minio/minio/cmd.newContext()
github.com/minio/minio/cmd/utils.go:817 +0x31e
github.com/minio/minio/cmd.adminMiddleware.func1()
github.com/minio/minio/cmd/admin-router.go:110 +0x96
net/http.HandlerFunc.ServeHTTP()
net/http/server.go:2136 +0x47
github.com/minio/minio/cmd.setBucketForwardingMiddleware.func1()
github.com/minio/minio/cmd/generic-handlers.go:460 +0xb1a
net/http.HandlerFunc.ServeHTTP()
net/http/server.go:2136 +0x47
...
```
currently the default for all drives is 512, which is a lot
for HDDs the recent testing has revealed moving this to 32
for HDDs seems like a fair value.
Introducing a new version of healthinfo struct for adding this info is
not correct. It needs to be implemented differently without adding a new
version.
This reverts commit 8737025d940f80360ed4b3686b332db5156f6659.
There is a fundamental race condition in `newErasureServerPools`, where setObjectLayer is
called before the poolMeta has been loaded/populated.
We add a placeholder value to this field but disable all saving of the value, so we don't risk
overwriting the value on disk. Once the value has been loaded or created, it is replaced with
the proper value, which will also be saved.
Also fixes various accesses of `poolMeta` that were done without locks.
We make the `poolMeta.IsSuspended` return false, even if we shouldn't risk out-of-bounds
reads anymore.
If target went offline while MinIO was down, error once
while trying to send message. If target goes offline during
MinIO server running, it already comes through ping() call
and errors out if target offline.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
if erasure upgrade is needed rely on the in-memory
values, instead of performing a "DiskInfo()" call.
https://brendangregg.com/blog/2016-09-03/sudden-disk-busy.html
for HDDs these are problematic, lets avoid this because
there is no value in "being" absolutely strict here
in terms of parity. We are okay to increase parity
as we see based on the in-memory online/offline ratio.
Several callers to putObjectTar may be fighting to set sc. Move the write out of the loop.
Use static resp, and request elements.
Fixes tests with -race:
```
WARNING: DATA RACE
Read at 0x00c01cd680e0 by goroutine 691354:
github.com/minio/minio/cmd.objectAPIHandlers.PutObjectExtractHandler.func1()
e:/gopath/src/github.com/minio/minio/cmd/object-handlers.go:2130 +0x149
github.com/minio/minio/cmd.untar.func1()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:250 +0x2b6
github.com/minio/minio/cmd.untar.func8()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:261 +0xa4
Previous write at 0x00c01cd680e0 by goroutine 691352:
github.com/minio/minio/cmd.objectAPIHandlers.PutObjectExtractHandler.func1()
e:/gopath/src/github.com/minio/minio/cmd/object-handlers.go:2131 +0x15d
github.com/minio/minio/cmd.untar.func1()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:250 +0x2b6
github.com/minio/minio/cmd.untar.func8()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:261 +0xa4
```
Calling unfreezeServices twice results in panic:
```
panic: "POST /minio/peer/v32/signalservice?signal=4&sub-sys=": close of nil channel
goroutine 14703 [running]:
runtime/debug.Stack()
runtime/debug/stack.go:24 +0x65
github.com/minio/minio/cmd.setCriticalErrorHandler.func1.1()
github.com/minio/minio/cmd/generic-handlers.go:549 +0x8e
panic({0x27c3020, 0x4c9b370})
runtime/panic.go:884 +0x212
github.com/minio/minio/cmd.unfreezeServices()
github.com/minio/minio/cmd/service.go:112 +0xc7
github.com/minio/minio/cmd.(*peerRESTServer).SignalServiceHandler(0x0?, {0x4cb6af0, 0xc010b96420}, 0xc01affab00)
github.com/minio/minio/cmd/peer-rest-server.go:837 +0x13a
net/http.HandlerFunc.ServeHTTP(...)
```
If the function was called a second time `val` would not be nil, but the returned channel `ch` would be, causing the panic.
Check the channel isn't nil and also use Swap for an atomic swap instead of 2 separate operations (though we are in a mutex).
Disk level O_DIRECT support checking at xl storage initialization was
conditional on a config setting being enabled. (This never took effect
because config initialization happens after ObjectLayer is ready.) This
is not necessary as the config setting is dynamic - O_DIRECT should be
enabled via runtime config. So we need to do the disk level support
check regardless of the config setting.
- Trace needs higher buffered channels than 4000 to ensure
when we run `mc admin trace -a` it captures all information
sufficiently.
- Listen event notification needs the event channel to be
`apiRequestsMaxPerNode` * number of nodes
Currently, the retry is not fully used when there is no backup copy of
the data usage; use 5 retry attempts when we don't have any valid data,
new or backup, unless we have seen an un-recognized error.
comment in the code provides more detailed explanation
on what this PR entails and its assumptions.
this PR reduces the amount of listing() by an order
of magnitude, however there are other such calls that
still needs further optimization that shall be done
in subsequent PRs.
Add a new endpoint for "resource" metrics `/v2/metrics/resource`
This should return system metrics related to drives, network, CPU and
memory. Except for drives, other metrics should have corresponding "avg"
and "max" values also.
Reuse the real-time feature to capture the required data,
introducing CPU and memory metrics in it.
Collect the data every minute and keep updating the average and max values
accordingly, returning the latest values when the API is called.
without this the rename2() can rename the previous dataDir
causing issues for different versions of the object, only
latest version is preserved due to this bug.
Added healing code to ensure recovery of such content.
not checking w.Close() can prematurely make us
think that the w.Write() actually succeeded, apparently
Write() may or may not return an error but sometimes
only during a Close() call to the fd we may see the
error from Write() propagate.
Fdatasync(w) on the FD would return an error requiring
Close() error handling is less of a concern, however it may
happen such that fdatasync() did not return an error, where
as Close() would.
Currently, setting a new tiering target returns an error when a bucket
is versioned and the tiering credentials does not have authorization to
specify a version-id when reading or removing a specific version;
Since tiering does not require versioning anymore; avoid doing versioned
operations when performing checklist ops while adding a new tiering
configuration.
Do not error out when a provided marker is before or after the prefix, but instead just ignore it if before and return an empty list when after.
Fixes#18093
Include object and versions heal scan times when checking non-empty abandoned folders.
Furthermore don't add delay between healing versions, instead do one per object wait.
This PR changes the StatObject() to be must have for non-minio source
to being a conditional API call.
- Calls StatObject() when needed
- Calls GetObjectTagging() when needed
These calls if we do without these conditionals can cause a lot of
delays, so we avoid them if not needed in more common scenario.
all retries must not be counted as failed messages,
a failed message is a single counter not for all
retries, this PR fixes this.
Also we do not need to retry 10-times, instead we should
retry at max 3 times with some jitter to deliver the
messages.
If MinIO started with KMS enabled, MINIO_KMS_KES_KEY_NAME should
be set for server to start.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
In a perf test, one node will run speed test with all nodes. If there is
an error with a peer node, the peer node name is not included in the
error hence confusing the user.
This commit will add the peer endpoint string to the netperf error.
To ensure that policy mappings are current for service accounts
belonging to (non-derived) STS accounts (like an LDAP user's service
account) we periodically reload such mappings.
This is primarily to handle a case where a policy mapping update
notification is missed by a minio node. Such a node would continue to
have the stale mapping in memory because STS creds/mappings were never
periodically scanned from storage.
- we already have MRF for most recent failures
- we trigger healing during HEAD/GET operation
These are enough, also change the default max wait
from 5sec to 1sec for default scanner speed.
AccountInfo is quite frequently called by the Console UI
login attempts, when many users are logging in it is important
that we provide them with better responsiveness.
- ListBuckets information is cached every second
- Bucket usage info is cached for up to 10 seconds
- Prefix usage (optional) info is cached for up to 10 secs
Failure to update after cache expiration, would still
allow login which would end up providing information
previously cached.
This allows for seamless responsiveness for the Console UI
logins, and overall responsiveness on a heavily loaded
system.
From the Go specification:
"3. If the map is nil, the number of iterations is 0." [1]
Therefore, an additional nil check for before the loop is unnecessary.
[1]: https://go.dev/ref/spec#For_range
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
- remove targetClient for passing around via replicationObjectInfo{}
- remove cloing to object info unnecessarily
- remove objectInfo from replicationObjectInfo{} (only require necessary fields)
When using a chain provider all providers do not return a valid
access and secret key, an anonymous request is sent, which makes it hard
for users to figure out what is going on
In the case of S3 tiering, when AWS IAM temporary account generation returns
an error, an anonymous login will be used because of the chain provider.
Avoid this and use the AWS IAM provider directly to get a good error
message.
This helps reduce disk operations as these periodic routines would not
run concurrently any more.
Also add expired STS purging periodic operation: Since we do not scan
the on-disk STS credentials (and instead only load them on-demand) a
separate routine is needed to purge expired credentials from storage.
Currently this runs about a quarter as often as IAM refresh.
Also fix a bug where with etcd, STS accounts could get loaded into the
iamUsersMap instead of the iamSTSAccountsMap.
This allows scanner to avoid lengthy scans, skip
things appropriately and also not lose metrics in
any manner.
reduce longer deadlines for usage-cache loads/saves
to match the disk timeout which is 2minutes now per
IOP.
In situations with large number of STS credentials on disk, IAM load
time is high. To mitigate this, STS accounts will now be loaded into
memory only on demand - i.e. when the credential is used.
In each IAM cache (re)load we skip loading STS credentials and STS
policy mappings into memory. Since STS accounts only expire and cannot
be deleted, there is no risk of invalid credentials being reused,
because credential validity is checked when it is used.
Currently we have IOPs of these patterns
```
[OS] os.Mkdir play.min.io:9000 /disk1 2.718µs
[OS] os.Mkdir play.min.io:9000 /disk1/data 2.406µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys 4.068µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp 2.843µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp/d89c8ceb-f8d1-4cc6-b483-280f87c4719f 20.152µs
```
It can be seen that we can save quite Nx levels such as
if your drive is mounted at `/disk1/minio` you can simply
skip sending an `Mkdir /disk1/` and `Mkdir /disk1/minio`.
Since they are expected to exist already, this PR adds a way
for us to ignore all paths upto the mount or a directory which
ever has been provided to MinIO setup.
Previously existing objects were queued to single worker and MRF re-queues
are also handled by same worker - this does not fully use the available
bandwidth in case there is no incoming workload.
Errors such as
```
returned an error (context deadline exceeded) (*fmt.wrapError)
```
```
(msgp: too few bytes left to read object) (*fmt.wrapError)
```
configs from 2020 server throws an
error due to deprecation of the keys
however an attempt is made to parse
them, we should have chosen existing
defaults - this PR fixes that.
Fix drive rotational calculation status
If a MinIO drive path is mounted to a partition and not a real disk,
getting the rotational status would fail because Linux does not expose
that status to partition; In other words,
/sys/block/drive-partition-name/queue/rotational does not exist;
To fix the issue, the code will search for the rotational status of the
disk that hosts the partition, and this can be calculated from the
real path of /sys/class/block/<drive-partition-name>
This change enables embedding files in ZIP with custom permissions.
Also uses default creds for starting MinIO based on inspect data.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
objects with 10,000 parts and many of them can
cause a large memory spike which can potentially
lead to OOM due to lack of GC.
with previous PR reducing the memory usage significantly
in #17963, this PR reduces this further by 80% under
repeated calls.
Scanner sub-system has no use for the slice of Parts(),
it is better left empty.
```
benchmark old ns/op new ns/op delta
BenchmarkToFileInfo/ToFileInfo-8 295658 188143 -36.36%
benchmark old allocs new allocs delta
BenchmarkToFileInfo/ToFileInfo-8 61 60 -1.64%
benchmark old bytes new bytes delta
BenchmarkToFileInfo/ToFileInfo-8 1097210 227255 -79.29%
```
- this PR avoids sending a large ChecksumInfo slice
when its not needed
- also for a file with XLV2 format there is no reason
to allocate Checksum slice while reading
Keys are helpful to ensure the strict ordering of messages, however currently the
code uses a random request id for every log, hence using the request-id
as a Kafka key is not serve any purpose;
This commit removes the usage of the key, to also fix the audit issue from
internal subsystem that does not have a request ID.
This PR adds new bucket replication graphs for better and granular
monitoring of bucket replication. Also arranged all replication graphs
together.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.
This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.
Add prometheus metric to track credential errors since uptime
There is some consistent confusion between the Community Helm Chart in this repo and the MinIO Kubernetes Operator Helm Chart.
This change seeks to clarify the differences between the two charts and which ones are community maintained vs MinIO maintained.
replicationTimestamp might differ if there were retries
in replication and the retried attempt overwrote in
quorum but enough shards with newer timestamp causing
the existing timestamps on xl.meta to be invalid, we
do not rely on this value for anything external.
this is purely a hint for debugging purposes, but there
is no real value in it considering the object itself
is in-tact we do not have to spend time healing this
situation.
we may consider healing this situation in future but
that needs to be decoupled to make sure that we do not
over calculate how much we have to heal.
.metacache objects are transient in nature, and are better left to
use page-cache effectively to avoid using more IOPs on the disks.
this allows for incoming calls to be not taxed heavily due to
multiple large batch listings.
given a versionId the mtime is always the same, it
can never be different than its original value.
versionIds also do not conflict, since they are uuid's
and unique practically forever.
we expect a certain level of IOPs and latency so this is okay.
fixes other miscellaneous bugs
- such as hanging on mrfCh <- when the context is canceled
- queuing MRF heal when the context is canceled
- remove unused saveStateCh channel
In distributed setup with a load balancer, randmoly any server
would report the metrics `minio_cluster_bucket_total` and
`minio_cluster_usage_object_total` and while graphing it, we should
take max of reported values.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
This commit updates the minio/kes-go dependency
to v0.2.0 and updates the existing code to work
with the new KES APIs.
The `SetPolicy` handler got removed since it
may not get implemented by KES at all and could
not have been used in the past since stateless KES
is read-only w.r.t. policies and identities.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Bonus fixes include
- do not have to write final xl.meta (renameData) does this
already, saves some IOPs.
- make sure to purge the multipart directory properly using
a recursive delete, otherwise this can easily pile up and
rely on the stale uploads cleanup.
fixes#17863
This reverts commit bf3901342c.
This is to fix a regression caused when there are inconsistent
versions, but one version is in quorum. SuccessorModTime issue
must be fixed differently.
batch status can perpetually wait after completion
due to a race between the MetricsHandler() returning
the active metrics in intervals of 1sec and delete
of metrics after job completion.
this PR ensures that we keep the 'status' around
for a while, i.e upto 24hrs for all the batch jobs.
Two fields in lifecycles made GOB encoding consistently fail with `gob: type lifecycle.Prefix has no exported fields`.
This meant that in distributed systems listings would never be able to continue and would restart on every call.
Fix issues and be sure to log these errors at least once per bucket. We may see some connectivity errors here, but we shouldn't hide them.
When listing getObjectFileInfo can return `io.EOF` if file is being written.
When we wrap the error it will *not* retry upstream, since `io.EOF` is a valid return value.
Allow one retry before returning errors and canceling the listing.
* optimize deletePrefix, use direct set location via object name
instead of fanning out the calls for an object force delete
we can assume the set location and not do fan-out calls
* Apply suggestions from code review
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
---------
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
Bonus:
- avoid calling DiskInfo() calls when missing blocks
instead heal the object using MRF operation.
- change the max_sleep to 250ms beyond that we will
not stop healing.
As all replication metrics are moved at bucket level, all replication
graphs as well are added under minio-bucket.json. Removing the independent
replication dashboard.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
ignoring valid objects with valid replication metadata
after the Prefix was disabled must still honor the older
metadata.
this can lead to unexpected results, allow it during
READ phase always.
// UnmarshalStrict is like Unmarshal except that any fields that are found
// in the data that do not have corresponding struct members, or mapping
// keys that are duplicates, will result in
// an error.
batch replication pull must preserve versionID regardless
of destination bucket versioning configuration.
This is similar to the issue with decommissioning and rebalancing
health checks were missing for drives replaced since
- HealFormat() would replace the drives without a health check
- disconnected drives when they reconnect via connectEndpoint()
the loop also loses health checks for local disks and merges
these into a single code.
- other than this separate cleanUp, health check variables to avoid
overloading them with similar requirements.
- also ensure that we compete via context selector for disk monitoring
such that the canceled disks don't linger around longer waiting for
the ticker to trigger.
- allow disabling active monitoring.
```
minio[1032735]: panic: label value "\xc0.\xc0." is not valid UTF-8
minio[1032735]: goroutine 1781101 [running]:
minio[1032735]: github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
```
log such errors for investigation
Send() is synchronous and can affect the latency of S3 requests when the
logger buffer is full.
Avoid checking if the HTTP target is online or not and increase the
workers anyway since the buffer is already full.
Also, avoid logs flooding when the audit target is down.
Limit large uploads (> 128MiB) to a max of 10 workers, intent is to avoid
larger uploads from using all replication bandwidth, giving room for smaller
uploads to sync faster.
slower drives get knocked off because they are too slow via
active monitoring, we do not need to block calls arbitrarily.
Serializing adds latencies for already slow calls, remove
it for SSDs/NVMEs
Also, add a selection with context when writing to `out <-`
channel, to avoid any potential blocks.
Revert "don't error when asked for 0-based range on empty objects (#17708)"
This reverts commit 7e76d66184.
There is no valid way to specify offsets in a 0-byte file. Blame it on the [RFC](https://datatracker.ietf.org/doc/html/rfc7233#section-4.4)
> The 416 (Range Not Satisfiable) status code indicates that none of the ranges in the
> request's Range header field (Section 3.1) overlap the current extent of the selected resource...
A request for "bytes=0-" is a request for the first byte of a resource. If the resource is 0-length,
the range [0,0] does not overlap the resource content and the server responds with an error.
In a reverse proxying setup, a proxy in front of MinIO may attempt to
request objects in slices for enhanced cache efficiency. Since such a
a proxy cannot have prior knowledge of how large a requested resource is,
it usually sends a header of the form:
Range: 0-$slice_size
... and, depending on the size of the resource, expects either:
- an empty response, if $resource_size == 0
- a full response, if $resource_size <= $slice_size
- a partial response, if $resource_size > $slice_size
Prior to this change, MinIO would respond 416 Range Not Satisfiable if a
client tried to request a range on an empty resource. This behavior is
technically consistent with RFC9110[1] – However, it renders sliced
reverse proxying, such as implemented in Nginx, broken in the case of
empty files. Nginx itself seems to break this convention to enable
"useful" responses in these cases, and MinIO should probably do that
too.
[1]: https://www.rfc-editor.org/rfc/rfc9110#byte.ranges
sending whitespace character with CompleteMultipartUpload()
with 200 OK was an AWS S3 compatible implementation detail,
and it was expected that the client SDK must look for both
successful XML as well as error XML for 200 OK.
But this is not useful anymore on MinIO, since we do not
have any large delayed coalescing of parts anymore.
users/customers do not have a reasonable number of buckets anymore,
this is why we must avoid overpopulating cluster endpoints, instead
move the bucket monitoring to a separate endpoint.
some of it's a breaking change here for a couple of metrics, but
it is imperative that we do it to improve the responsiveness of
our Prometheus cluster endpoint.
Bonus: Added new cluster metrics for usage, objects and histograms
Using this script, post decrypt we should be able to bring up the
MinIO instance with same configuration.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Sometimes IAM fails to load certain items, which could be a user,
a service account or a policy but with not enough information for
us to debug.
This commit will create a more descriptive error to make it easier to
debug in such situations.
mc admin trace -a will be able to quickly show
401 Unauthorized header to pinpoint trivial issues
between nodes, such as wrong root
credentials and skewed time.
objects/versions that are not expired via NewerNoncurrentVersions
must be properly returned to be applied under further ILM actions.
this would cause legitimately expired objects to be missed
from expiration.
this randomness is needed to avoid scanning
the same buckets across different erasure sets,
in the same order.
allow random buckets to be scanned instead
allowing a wider spread of ILM, replication
checks.
Additionally do not loop over twice to fill
the channel, fill the channel regardless of
having bucket new or old.
A new middleware function is added for admin handlers, including options
for modifying certain behaviors. This admin middleware:
- sets the handler context via reflection in the request and sends AuditLog
- checks for object API availability (skipping it if a flag is passed)
- enables gzip compression (skipping it if a flag is passed)
- enables header tracing (adding body tracing if a flag is passed)
While the new function is a middleware, due to the flags used for
conditional behavior modification, which is used in each route registration
call.
To try to ensure that no regressions are introduced, the following
changes were done mechanically mostly with `sed` and regexp:
- Remove defer logger.AuditLog in admin handlers
- Replace newContext() calls with r.Context()
- Update admin routes registration calls
Bonus: remove unused NetSpeedtestHandler
Since the new adminMiddleware function checks for object layer presence
by default, we need to pass the `noObjLayerFlag` explicitly to admin
handlers that should work even when it is not available. The following
admin handlers do not require it:
- ServerInfoHandler
- StartProfilingHandler
- DownloadProfilingHandler
- ProfileHandler
- SiteReplicationDevNull
- SiteReplicationNetPerf
- TraceHandler
For these handlers adminMiddleware does not check for the object layer
presence (disabled by passing the `noObjLayerFlag`), and for all other
handlers, the pre-check ensures that the handler is not called when the
object layer is not available - the client would get a
ErrServerNotInitialized and can retry later.
This `noObjLayerFlag` is added based on existing behavior for these
handlers only.
Add check every 2 minutes to see if a write+read operation can complete.
If disk is unresponsive for 2 minutes or returns errFaultyDisk, take it offline.
Simplify MRF queueing and add backlog handler
- Limit re-tries to 3 to avoid repeated re-queueing. Fall offs
to be re-tried when the scanner revisits this object or upon access.
- Change MRF to have each node process only its MRF entries.
- Collect MRF backlog by the node to allow for current backlog visibility
Now it would list details of all KMS instances with additional
attributes `endpoint` and `version`. In the case of k8s-based
deployment the list would consist of a single entry.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
This would better to record the correct API name so that
any verification around audit logs to figure out if required
APIs are called required no of times, would be correct.
Here in this case of policy attached, API `AttachDetachPolicyBuiltin`
would be called with `requestPath` as `/minio/admin/v3/idp/builtin/policy/attach`
and in case of detach policy the value would be `/minio/admin/v3/idp/builtin/policy/detach`
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Also shutdown poll add jitter, to verify if the shutdown
sequence can finish before 500ms, this reduces the overall
time taken during "restart" of the service.
Provides speedup for `mc admin service restart` during
active I/O, also ensures that systemd doesn't treat the
returned 'error' as a failure, certain configurations in
systemd can cause it to 'auto-restart' the process by-itself
which can interfere with `mc admin service restart`.
It can be observed how now restarting the service is
much snappier.
on unversioned buckets its possible that 0-byte objects
might lose quorum on flaky systems, allow them to be same
as DELETE markers. Since practically speak they have no
content.
Optimize DeleteObject API to avoid extra
GetObjectInfo call on the replicating side.
For receiving side, it is just a regular
DeleteObject call.
Bonus: Fix a corner case where version purged is
absent on target (either due to replication not yet
complete or target version already deleted in a
one-way replication or when replication was disabled).
In such cases, mark version purge complete.
Since `addCustomerHeaders` middleware was after the `httpTracer`
middleware, the request ID was not set in the http tracing context. By
reordering these middleware functions, the request ID header becomes
available. We also avoid setting the tracing context key again in
`newContext`.
Bonus: All middleware functions are renamed with a "Middleware" suffix
to avoid confusion with http Handler functions.
* Reduce allocations
* Add stringsHasPrefixFold which can compare string prefixes, while ignoring case and not allocating.
* Reuse all msgp.Readers
* Reuse metadata buffers when not reading data.
* Make type safe. Make buffer 4K instead of 8.
* Unslice
DNS refresh() in-case of MinIO can safely re-use
the previous values on bare-metal setups, since
bare-metal arrangements do not change DNS in any
manner commonly.
This PR simplifies that, we only ever need DNS caching
on bare-metal setups.
- On containerized setups do not enable DNS
caching at all, as it may have adverse effects on
the overall effectiveness of k8s DNS systems.
k8s DNS systems are dynamic and expect applications
to avoid managing DNS caching themselves, instead
provide a cleaner container native caching
implementations that must be used.
- update IsDocker() detection, including podman runtime
- move to minio/dnscache fork for a simpler package
Following extension allows users to specify immediate purge of
all versions as soon as the latest version of this object has
expired.
```
<LifecycleConfiguration>
<Rule>
<ID>ClassADocRule</ID>
<Filter>
<Prefix>classA/</Prefix>
</Filter>
<Status>Enabled</Status>
<Expiration>
<Days>3650</Days>
<ExpiredObjectAllVersions>true</ExpiredObjectAllVersions>
</Expiration>
</Rule>
...
```
- look for requested encryption while compressing
not just via HTTP Headers, but also via multipart
metadata
- look for SSE-S3 etag decryption not just via HTTP
Headers, but also via multipart metadata
fixes#17519
current decommission traces were missing for
- Skipped ILM expired versions
- Skipped single DELETE marked version
- A success or failure in decommissioning DELETE marker
- allow additional info to be shared in DecomStatus() API
there is a possibility that slow drives can actually add latency
to the overall call, leading to a large spike in latency.
this can happen if there are other parallel listObjects()
calls to the same drive, in-turn causing each other to sort
of serialize.
this potentially improves performance and makes PutObject()
also non-blocking.
This change adds a `Secret` property to `HelpKV` to identify secrets
like passwords and auth tokens that should not be revealed by the server
in its configuration fetching APIs. Configuration reporting APIs now do
not return secrets.
Will combine or write partial data of each version found in the inspect data.
Example:
```
> xl-meta -export -combine inspect-data.1228fb52.zip
(... metadata json...)
}
Attempting to combine version "994f1113-da94-4be1-8551-9dbc54b204bc".
Read shard 1 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-01-of-13.data)
Read shard 2 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-02-of-13.data)
Read shard 3 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-03-of-13.data)
Read shard 4 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-04-of-13.data)
Read shard 6 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-06-of-13.data)
Read shard 7 Data shards 9 Parity 4 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-07-of-13.data)
Read shard 8 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-08-of-13.data)
Read shard 9 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-09-of-13.data)
Read shard 10 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-10-of-13.data)
Read shard 11 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-11-of-13.data)
Read shard 13 Data shards 8 Parity 5 (994f1113-da94-4be1-8551-9dbc54b204bc/shard-13-of-13.data)
Attempting to reconstruct using parity sets:
* Setup: Data shards: 9 - Parity blocks: 6
Have 6 complete remapped data shards and 6 complete parity shards. Could NOT reconstruct: too few shards given
* Setup: Data shards: 8 - Parity blocks: 5
Have 5 complete remapped data shards and 5 complete parity shards. Could reconstruct completely
0 bytes missing. Truncating 0 from the end.
Wrote output to 994f1113-da94-4be1-8551-9dbc54b204bc.complete
```
So far only inline data, but no real reason that external data can't also be included with some handling of blocks.
Supports only unencrypted data.
For policy attach/detach API to work correctly the server should hold a
lock before reading existing policy mapping and until after writing the
updated policy mapping. This is fixed in this change.
A site replication bug, where LDAP policy attach/detach were not
correctly propagated is also fixed in this change.
Bonus: Additionally, the server responds with the actual (or net)
changes performed in the attach/detach API call. For e.g. if a user
already has policy A applied, and a call to attach policies A and B is
performed, the server will respond that B was attached successfully.
A continuation of PR #17479 for rebalance behavior must
also match the decommission behavior.
Fixes bug where rebalance would ignore rebalancing object
versions after one of the version returned "ObjectNotFound"
while decommissioning it can so happen that the non-current
versions are all expired but there is a DEL marker as the
latest version.
For such objects, we should not decommission them instead
calculate the remaining versions and if the remaining versions
is one and that version is a DEL marker consider such
an object not to be scheduled for decommissioning.
With the current asynchronous behaviour in sending notification events
to the targets, we can't provide guaranteed delivery as the systems
might go for restarts.
For such event-driven use-cases, we can provide an option to enable
synchronous events where the APIs wait until the event is successfully
sent or persisted.
This commit adds 'MINIO_API_SYNC_EVENTS' env which when set to 'on'
will enable sending/persisting events to targets synchronously.
A state is updated with a delete marker, which does not have parity or
data blocks defined, which can cause the integer divide by zero panics.
This commit fixes to avoid panics.
on "unversioned" buckets there are situations
when successive concurrent I/O can lead to
an inconsistent state() with mtime while the
etag might be the same for the object on disk.
in such a scenario it is possible for us to
allow reading of the object since etag matches
and if etag matches we are guaranteed that we
have enough copies the object will be readable
and same.
This PR allows fallback in such scenarios.
This PR also returns the replication status in
proxy calls and defers replication attempt if
HEAD on object version returned a error different
from NoSuchKey
A specific node should do the decommissioning task, however routing the
start decommissioning to that node was not working properly.
Co-authored-by: Anis Elleuch <anis@min.io>
fixes an issue under bucket replication could cause
ETags for replicated SSE-S3 single part PUT objects,
to fail as we would attempt a decryption while listing,
or stat() operation.
- lifecycle must return InvalidArgument for rule errors
- do not return `null` versionId in HTTP header
- reject mixed SSE uploads with correct error message
- getObjectTagging to be allowed for anonymous policies
- return correct errors for invalid retention period
- return sorted list of tags for an object
- putObjectTagging must return 200 OK not 204 OK
- return 409 ErrObjectLockConfigurationNotAllowed for existing buckets
PUT calls cannot afford to have large latency build-ups due
to contentious usage.json, or worse letting them fail with
some unexpected error, this can happen when this file is
concurrently being updated via scanner or it is being
healed during a disk replacement heal.
However, these are fairly quick in theory, stressed clusters
can quickly show visible latency this can add up leading to
invalid errors returned during PUT.
It is perhaps okay for us to relax this error return requirement
instead, make sure that we log that we are proceeding to take in
the requests while the quota is using an older value for the quota
enforcement. These things will reconcile themselves eventually,
via scanner making sure to overwrite the usage.json.
Bonus: make sure that storage-rest-client sets ExpectTimeouts to
be 'true', such that DiskInfo() call with contextTimeout does
not prematurely disconnect the servers leading to a longer
healthCheck, back-off routine. This can easily pile up while also
causing active callers to disconnect, leading to quorum loss.
DiskInfo is actively used in the PUT, Multipart call path for
upgrading parity when disks are down, it in-turn shouldn't cause
more disks to go down.
Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner.
Also removes a tiny bit of CPU by each write operation.
Global leader lock was first designated to only acquired once
until the node is killed. However, currently, the code acquires
it repeatedly during the lifetime of the server, now there is a
goroutine leak.
if the certs are the same in an environment where the
cert files are symlinks (e.g Kubernetes), then we resort
to reloading certs every 15mins - we can avoid reload
of the kes client instance. Ensure that the price to pay
for contending with the lock must happen when necessary.
remote error is not required to be passed back to the
client - this is mostly because we have healing that should
eventually, catch up on this and heal the bucket.
Ensure delete marker replication success, especially since the
recent optimizations to heal on HEAD, LIST and GET can force
replication attempts on delete marker before underlying object
version could have synced.
Move to using `xl.meta` data structure to keep temporary partInfo,
this allows for a future change where we move to different parts to
different drives.
PUT shall only proceed if pre-conditions are met, the new
code uses
- x-minio-source-mtime
- x-minio-source-etag
to verify if the object indeed needs to be replicated
or not, allowing us to avoid StatObject() call.
When limiting listing do not count delete, since they may be discarded.
Extend limit, since we may be discarding the forward-to marker.
Fix directories always being sent to resolve, since they didn't return as match.
On occasion this test fails:
```
2022-09-12T17:22:44.6562737Z === RUN TestGetObjectWithOutdatedDisks
2022-09-12T17:22:44.6563751Z erasure-object_test.go:1214: Test 2: Expected data to have md5sum = `c946b71bb69c07daf25470742c967e7c`, found `7d16d23f07072af1a809707ba101ae07`
2
```
Theory: Both objects are written with the same timestamp due to lower timer resolution on Windows. This results in secondary resolution, which is deterministic, but random.
Solution: Instead of hacking in a wait we request the specific version we want. Should still keep the test relevant.
Bonus: Remote action dependency for vulncheck
If replication config could not be read from bucket metadata for some
reason, issue a panic so that unexpected replication outcomes can
be avoided for replicated buckets.
For similar reasons, adding a panic while fetching object-lock config
if it failed for reason other than non-existence of config.
minio_inter_node_traffic_errors_total currently does not track
requests body write/read errors of internode REST communications.
This commit fixes this by wrapping resp.Body.
to avoid relying on scanner-calculated replication metrics.
This will improve the accuracy of the replication stats reported.
This PR also adds on to #15556 by handing replication
traffic that could not be queued by available workers to the
MRF queue so that entries in `PENDING` status are healed faster.
500k is a reasonable limit for any single MinIO
cluster deployment, in future we may increase this
value.
However for now we are going to keep this limit.
When healing is parallelized by setting the ` _MINIO_HEAL_WORKERS`
environment variable, multiple goroutines may race while updating the disk's
healing tracker. This change serializes only these concurrent updates using a
channel. Note, the healing tracker is still not concurrency safe in other contexts.
This PR is a continuation of the previous change instead
of returning an error, instead trigger a spot heal on the
'xl.meta' and return only after the healing is complete.
This allows for future GETs on the same resource to be
consistent for any version of the object.
xl.meta gets written and never rolled back, however
we definitely need to validate the state that is
persisted on the disk, if there are inconsistencies
- more than write quorum we should return an error
to the client
- if write quorum was achieved however there are
inconsistent xl.meta's we should simply trigger
an MRF on them
The `clusterInfo` struct in admin-handlers is same as
madmin.ClusterRegistrationInfo, except for small differences in field
names.
Removing this and using madmin.ClusterRegistrationInfo in its place will
help in following ways:
- The JSON payload generated by mc in case of cluster registration will
be consistent (same keys) with cluster.info generated by minio as part
of the profile and inspect zip
- health-analyzer can parse the cluster.info using the same struct and
won't have to define it's own
Currently, there is a short time window where the code is allowed
to save the status of a replication resync. Currently, the window is
`now.Sub(st.EndTime) <= resyncTimeInterval`. Also, any failure to
write in the backend disks is not retried.
Refactor the code a little bit to rely on the last timestamp of a
successful write of the resync status of any given bucket in the
backend disks.
When replication is enabled in a particular bucket, the listing will send
objects to bucket replication, but it is also sending prefixes for non
recursive listing which is useless and shows a lot of error logs.
This commit will ignore prefixes.
under some sequence of events following code would
reach an infinite loop.
```
idx1, idx2 := 0, 1
for ; idx2 != idx1; idx2++ {
fmt.Println(idx2)
}
```
fixes#15639
A lot of warning messages are printed in CI/CD failures generated by go
test. Avoid that by requiring at least Error level for logging when
doing go test.
inlined data often is bigger than the allowed
O_DIRECT alignment, so potentially we can write
'xl.meta' without O_DSYNC instead we can rely on
O_DIRECT + fdatasync() instead.
This PR allows O_DIRECT on inlined data that
would gain the benefits of performing O_DIRECT,
eventually performing an fdatasync() at the end.
Performance boost can be observed here for small
objects < 128KiB. The performance boost is mainly
seen on HDD, and marginal on NVMe setups.
* add a new line to the end of the credentials file when creating a user
* add extra volumes and mounts option into helm chart
* add extra volumes and extra volume mounts option for job resources
When a node finds a change in the other replication cluster and applies
to itself will already notify other peers. No need for all nodes in a
given cluster to do site replication healing, only one node is
sufficient.
This PR improves the replication failure healing by persisting
most recent failures to disk and re-queuing them until the replication
is successful.
While this does not eliminate the need for healing during a full scan,
queuing MRF vastly improves the ETA to keeping replicated buckets
in sync as it does not wait for the scanner visit to detect unreplicated
object versions.
competing calls on the same object on versioned bucket
mutating calls on the same object may unexpected have
higher delays.
This can be reproduced with a replicated bucket
overwriting the same object writes, deletes repeatedly.
For longer locks like scanner keep the 1sec interval
This PR fixes possible leaks that may emanate from not
listening on context cancelation or timeouts.
```
goroutine 60957610 [chan send, 16 minutes]:
github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1.1.1(...)
github.com/minio/minio/cmd/erasure-server-pool.go:1724 +0x368
github.com/minio/minio/cmd.listPathRaw({0x4a9a740, 0xc0666dffc0},...
github.com/minio/minio/cmd/metacache-set.go:1022 +0xfc4
github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1.1()
github.com/minio/minio/cmd/erasure-server-pool.go:1764 +0x528
created by github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1
github.com/minio/minio/cmd/erasure-server-pool.go:1697 +0x1b7
```
The bottom line is delete markers are a nuisance,
most applications are not version aware and this
has simply complicated the version management.
AWS S3 gave an unnecessary complication overhead
for customers, they need to now manage these
markers by applying ILM settings and clean
them up on a regular basis.
To make matters worse all these delete markers
get replicated as well in a replicated setup,
requiring two ILM settings on each site.
This PR is an attempt to address this inferior
implementation by deviating MinIO towards an
idempotent delete marker implementation i.e
MinIO will never create any more than single
consecutive delete markers.
This significantly reduces operational overhead
by making versioning more useful for real data.
This is an S3 spec deviation for pragmatic reasons.
Queue failed/pending replication for healing during listing and GET/HEAD
API calls. This includes healing of existing objects that were never
replicated or those in the middle of a resync operation.
This PR also fixes a bug in ListObjectVersions where lifecycle filtering
should be done.
when object speedtest is running keep writing
previous speedtest result back to client until
we have a new result - this avoids sending back
blank entries in between the speedtest when it
is running in 'autotune' mode.
```
commit 7bdaf9bc50
Author: Aditya Manthramurthy <donatello@users.noreply.github.com>
Date: Wed Jul 24 17:34:23 2019 -0700
Update on-disk storage format for users system (#7949)
```
Bonus: fixes a bug when etcd keys were being re-encrypted.
Currently, the code doesn't check if the user creating a bucket with
locking feature has bucket locking and versioning permissions enabled,
adding it in accordance with S3 spec.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateBucket.html
Object Lock - If ObjectLockEnabledForBucket is set to true in your CreateBucket request,
s3:PutBucketObjectLockConfiguration and s3:PutBucketVersioning permissions are required.
Capture average, p50, p99, p999 response times
and ttfb values. These are needed for latency
measurements and overall understanding of our
speedtest results.
listConfigItems creates a goroutine but sometimes callers will
exit without properly asking listAllIAMConfigItems() to stop sending
results, hence a goroutine leak.
Create a new context and cancel it for each listAllIAMConfigItems
call.
This commit adds support for automatically reloading
the MinIO client certificate for authentication to KES.
The client certificate will now be reloaded:
- when the private key / certificate file changes
- when a SIGHUP signal is received
- every 15 minutes
Fixes#14869
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
The path is marked dirty automatically when healObject() is called, which is
wrong. HealObject() is called during self-healing and this will lead to
an increase in the false positive result of the bloom filter.
Also move NSUpdated() from renameData() and call it directly in
CompleteMultipart and PutObject, this is not a functional change but
it will make it less prone to errors in the future.
mc admin config reset <alias> notify_webhook:something was not working
properly.
The reason is that GetSubSys() was not calculating the target
name properly because it is quitting early when the number of config
inputs ('notify_webhook:something' in this case) is equal to 1.
This commit will make the code calculates always calculate the target
name if found.
There is a known rare issue in the current version 1.30.0 described here
https://github.com/Shopify/sarama/issues/2241.
Update the library to 1.35.0
Bonus: update shirou/gopsutil v3.22.5 to v3.22.6 to fix a compilation
error for OpenBSD
smaller setups may have less drives per server choosing
the concurrency based on number of local drives, and let
the MinIO server change the overall concurrency as
necessary.
It is possible for anyone with admin access to relatively
to get any content of any random OS location by simply
providing the file with 'mc admin update alias/ /etc/passwd`.
Workaround is to disable 'admin:ServiceUpdate' action. Everyone
is advised to upgrade to this patch.
Thanks to @alevsk for finding this bug.
this has been observed in multiple environments
where the setups are small `speedtest` naturally
fails with default '10s' and the concurrency
of '32' is big for such clusters.
choose a smaller value i.e equal to number of
drives in such clusters and let 'autotune'
increase the concurrency instead.
fixes#15334
- re-use net/url parsed value for http.Request{}
- remove gosimple, structcheck and unusued due to https://github.com/golangci/golangci-lint/issues/2649
- unwrapErrs upto leafErr to ensure that we store exactly the correct errors
"consoleAdmin" was used as the policy for root derived accounts, but this
lead to unexpected bugs when an administrator modified the consoleAdmin
policy
This change avoids evaluating a policy for root derived accounts as by
default no policy is mapped to the root user. If a session policy is
attached to a root derived account, it will be evaluated as expected.
This PR changes the handling of bucket deletes for site
replicated setups to hold on to deleted bucket state until
it syncs to all the clusters participating in site replication.
Currently, if one server in a distributed setup fails to upgrade
due to any reasons, it is not possible to upgrade again unless
nodes are restarted.
To fix this, split the upgrade process into two steps :
- download the new binary on all servers
- If successful, overwrite the old binary with the new one
This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Add cluster info to inspect and profiling archive.
In addition to the existing data generation for both inspect and profiling,
cluster.info file is added. This latter contains some info of the cluster.
The generation of cluster.info is is done as the last step and it can fail
if it exceed 10 seconds.
a/b/c/d/ where `a/b/c/` exists results in additional syscalls
such as an Lstat() call to verify if the `a/b/c/` exists
and its a directory.
We do not need to do this on MinIO since the parent prefixes
if exist, we can simply return success without spending
additional syscalls.
Also this implementation attempts to simply use Access() calls
to avoid os.Stat() calls since the latter does memory allocation
for things we do not need to use.
Access() is simpler since we have a predictable structure on
the backend and we know exactly how our path structures are.
A huge number of goroutines would build up from various monitors
When creating test filesystems provide a context so they can shut down when no longer needed.
Do completely independent multipart uploads.
In distributed mode, a lock was held to merge each multipart
upload as it was added. This lock was highly contested and
retries are expensive (timewise) in distributed mode.
Instead, each part adds its metadata information uniquely.
This eliminates the per object lock required for each to merge.
The metadata is read back and merged by "CompleteMultipartUpload"
without locks when constructing final object.
Co-authored-by: Harshavardhana <harsha@minio.io>
This commit adds a `context.Context` to the
the KMS `{Stat, CreateKey, GenerateKey}` API
calls.
The context will be used to terminate external calls
as soon as the client requests gets canceled.
A follow-up PR will add a `context.Context` to
the remaining `DecryptKey` API call.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Uploading a part object can leave an inconsistent state inside
.minio.sys/multipart where data are uploaded but xl.meta is not
committed yet.
Do not list upload-ids that have this state in the multipart listing.
The default replica value is 16 (right now) which can lead to massive
resource consumption on one node in smaller clusters. The idea for this
addition is to allow users to specify how the pods (replicas) are being
spread across the cluster. It gives more control over this Helm Release
in smaller clusters where most worker nodes have taints.
As this Kubernetes feature exists since Kubernetes 1.19 and is only
useful for a replica count > 1, this was taken into account.
Since this is a MinIO specific extension in the replication config,
default this to Disabled to allow other sdks to be used to configure
replication rules.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Use losetup to create fake disks, start a MinIO cluster, umount
one disk, and fails if the mount point directory will have format.json
recreated. It should fail because the mount point directory will belong
to the root disk after unmount.
Add up to 256 bytes of padding for compressed+encrypted files.
This will obscure the obvious cases of extremely compressible content
and leave a similar output size for a very wide variety of inputs.
This does *not* mean the compression ratio doesn't leak information
about the content, but the outcome space is much smaller,
so often *less* information is leaked.
Make bucket requests sent after decommissioning is started are not
created in a suspended pool. Therefore listing buckets should avoid
suspended pools as well.
Rename Trigger -> Event to be a more appropriate
name for the audit event.
Bonus: fixes a bug in AddMRFWorker() it did not
cancel the waitgroup, leading to waitgroup leaks.
There is no point in compressing very small files.
Typically the effective size on disk will be the same due to disk blocks.
So don't waste resources on extremely small files.
We don't check on multipart. 1) because we don't know and 2) this is very likely a big object anyway.
This commit adds a minimal set of KMS-related metrics:
```
# HELP minio_cluster_kms_online Reports whether the KMS is online (1) or offline (0)
# TYPE minio_cluster_kms_online gauge
minio_cluster_kms_online{server="127.0.0.1:9000"} 1
# HELP minio_cluster_kms_request_error Number of KMS requests that failed with a well-defined error
# TYPE minio_cluster_kms_request_error counter
minio_cluster_kms_request_error{server="127.0.0.1:9000"} 16790
# HELP minio_cluster_kms_request_success Number of KMS requests that succeeded
# TYPE minio_cluster_kms_request_success counter
minio_cluster_kms_request_success{server="127.0.0.1:9000"} 348031
```
Currently, we report whether the KMS is available and how many requests
succeeded/failed. However, KES exposes much more metrics that can be
exposed if necessary. See: https://pkg.go.dev/github.com/minio/kes#Metric
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
If more than 1M folders (objects or prefixes) are found at the top level in a bucket allow it to be compacted.
While very suboptimal structure we should limit memory usage at some point.
GetDiskInfo() uses timedValue to cache the disk info for one second.
timedValue behavior was recently changed to return an old cached value
when calculating a new value returns an error.
When a mount point is empty, GetDiskInfo() will return errUnformattedDisk,
timedValue will return cached disk info with unexpected IsRootDisk value,
e.g. false if the mount point belongs to a root disk. Therefore, the mount
point will be considered a valid disk and will be formatted as well.
This commit will also add more defensive code when marking root disks:
always mark a disk offline for any GetDiskInfo() error except
errUnformattedDisk. The server will try anyway to reconnect to those
disks every 10 seconds.
it is not safe to pass around sync.Map
through pointers, as it may be concurrently
updated by different callers.
this PR simplifies by avoiding sync.Map
altogether, we do not need sync.Map
to keep object->erasureMap association.
This PR fixes a crash when concurrently
using this value when audit logs are
configured.
```
fatal error: concurrent map iteration and map write
goroutine 247651580 [running]:
runtime.throw({0x277a6c1?, 0xc002381400?})
runtime/panic.go:992 +0x71 fp=0xc004d29b20 sp=0xc004d29af0 pc=0x438671
runtime.mapiternext(0xc0d6e87f18?)
runtime/map.go:871 +0x4eb fp=0xc004d29b90 sp=0xc004d29b20 pc=0x41002b
```
The current code uses approximation using a ratio. The approximation
can skew if we have multiple pools with different disk capacities.
Replace the algorithm with a simpler one which counts data
disks and ignore parity disks.
fix: allow certain mutation on objects during decommission
currently by mistake deletion of objects was skipped,
if the object resided on the pool being decommissioned.
delete's are okay to be allowed since decommission is
designed to run on a cluster with active I/O.
Small uploads spend a significant amount of time (~5%) fetching disk info metrics. Also maps are allocated for each call.
Add a 100ms cache to disk metrics.
versioned buckets were not creating the delete markers
present in the versioned stack of an object, this essentially
would stop decommission to succeed.
This PR fixes creating such delete markers properly during
a decommissioning process, adds tests as well.
Current code incorrectly passed the
config asset object name while decommissioning,
make sure that we pass the right object name
to be hashed on the newer set of pools.
This PR fixes situations after a successful
decommission, the users and policies might go
missing due to wrong hashed set.
also use designated names for internal
calls
- storageREST calls are storageR
- lockREST calls are lockR
- peerREST calls are just peer
Named in this fashion to facilitate wildcard matches
by having prefixes of the same name.
Additionally, also enable funcNames for generic handlers
that return errors, currently we disable '<unknown>'
In a replicated setup, when an object is updated in one cluster but
still waiting to be replicated to the other cluster, GET requests with
if-match, and range headers will likely fail. It is better to proxy
requests instead.
Also, this commit avoids printing verbose logs about precondition &
range errors.
reedsolomon/cpuid would take a long time to start up on Xen VMs with
AMD processors due to a bug in the VM CPUID implementation.
Compression upgraded for better speed/compression.
fix: change timedvalue to return previous cached value
caller can interpret the underlying error and decide
accordingly, places where we do not interpret the
errors upon timedValue.Get() - we should simply use
the previously cached value instead of returning "empty".
Bonus: remove some unused code
Add a generic handler that adds a new tracing context to the request if
tracing is enabled. Other handlers are free to modify the tracing
context to update information on the fly, such as, func name, enable
body logging etc..
With this commit, requests like this
```
curl -H "Host: ::1:3000" http://localhost:9000/
```
will be traced as well.
Directories markers are not healed when healing a new fresh disk. A
a proper fix would be moving object names encoding/decoding to erasure
object level but it is too late now since the object to set distribution is
calculated at a higher level.
It is observed in a local 8 drive system the CPU seems to be
bottlenecked at
```
(pprof) top
Showing nodes accounting for 1385.31s, 88.47% of 1565.88s total
Dropped 1304 nodes (cum <= 7.83s)
Showing top 10 nodes out of 159
flat flat% sum% cum cum%
724s 46.24% 46.24% 724s 46.24% crypto/sha256.block
219.04s 13.99% 60.22% 226.63s 14.47% syscall.Syscall
158.04s 10.09% 70.32% 158.04s 10.09% runtime.memmove
127.58s 8.15% 78.46% 127.58s 8.15% crypto/md5.block
58.67s 3.75% 82.21% 58.67s 3.75% github.com/minio/highwayhash.updateAVX2
40.07s 2.56% 84.77% 40.07s 2.56% runtime.epollwait
33.76s 2.16% 86.93% 33.76s 2.16% github.com/klauspost/reedsolomon._galMulAVX512Parallel84
8.88s 0.57% 87.49% 11.56s 0.74% runtime.step
7.84s 0.5% 87.99% 7.84s 0.5% runtime.memclrNoHeapPointers
7.43s 0.47% 88.47% 22.18s 1.42% runtime.pcvalue
```
Bonus changes:
- re-use transport for bucket replication clients, also site replication clients.
- use 32KiB buffer for all read and writes at transport layer seems to help
TLS read connections.
- Do not have 'MaxConnsPerHost' this is problematic to be used with net/http
connection pooling 'MaxIdleConnsPerHost' is enough.
This commit fixes the order of elliptic curves.
As documented by https://pkg.go.dev/crypto/tls#Config
```
// CurvePreferences contains the elliptic curves that will be used in
// an ECDHE handshake, in preference order. If empty, the default will
// be used. The client will use the first preference as the type for
// its key share in TLS 1.3. This may change in the future.
```
In general, we should prefer `X25519` over the NIST curves.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
- Always reformat all disks when a new disk is detected, this will
ensure new uploads to be written in new fresh disks
- Always heal all buckets first when an erasure set started to be healed
- Use a lock to prevent two disks belonging to different nodes but in
the same erasure set to be healed in parallel
- Heal different sets in parallel
Bonus:
- Avoid logging errUnformattedDisk when a new fresh disk is inserted but
not detected by healing mechanism yet (10 seconds lag)
site replication errors were printed at
various random locations, repeatedly - this
PR attempts to remove double logging and
capture all of them at a common place.
This PR also enhances the code to show
partial success and errors as well.
`mc admin heal -r <alias>` in a multi setup pools returns incorrectly
grey objects. The reason is that erasure-server-pools.HealObject() runs
HealObject in all pools and returns the result of the first nil
error. However, in the lower erasureObject level, HealObject() returns
nil if an object does not exist + missing error in each disk of the object
in that pool, therefore confusing mc.
Make erasureObject.HealObject() to return not found error in the lower
level, so at least erasureServerPools will know what pools to ignore.
`config.ResolveConfigParam` returns the value of a configuration for any
subsystem based on checking env, config store, and default value. Also returns info
about which config source returned the value.
This is useful to return info about config params overridden via env in the user
APIs. Currently implemented only for OpenID subsystem, but will be extended for
others subsequently.
If sending a white space during a long S3 handler call fails,
the whitespace goroutine forgets to return a result to the caller.
Therefore, the complete multipart handler will be blocked.
Remember to send the header written result to the caller
or/and close the channel.
- currently subnet health check was freezing and calling
locks at multiple locations, avoid them.
- throw errors if first attempt itself fails with no results
Erasure SD DeleteObjects() is only inheriting bucket versioning status
from the handler layer.
Add the missing versioning prefix evaluation for each object that will
deleted.
PR #15052 caused a regression, add the missing metrics back.
Bonus:
- internode information should be only for distributed setups
- update the dashboard to include 4xx and 5xx error panels.
this allows for customers to use `mc admin service restart`
directly even when performing RPM, DEB upgrades. Upon such 'restart'
after upgrade MinIO will re-read the /etc/default/minio for any
newer environment variables.
As long as `MINIO_CONFIG_ENV_FILE=/etc/default/minio` is set, this
is honored.
Currently minio_s3_requests_errors_total covers 4xx and
5xx S3 responses which can be confusing when s3 applications
sent a lot of HEAD requests with obvious 404 responses or
when the replication is enabled.
Add
- minio_s3_requests_4xx_errors_total
- minio_s3_requests_5xx_errors_total
to help users monitor 4xx and 5xx HTTP status codes separately.
peerOnlineCounter was making NxN calls to many peers, this
can be really long and tedious if there are random servers
that are going down.
Instead we should calculate online peers from the point of
view of "self" and return those online and offline appropriately
by performing a healthcheck.
The 'go mod vendor' command generates a directory called
'vendor' in the main module's root directory, which includes
the required packages to support builds. Therefore, we can
include the 'vendor' directory in .gitignore completely,
regardless of any file extension.
* Add periodic callhome functionality
Periodically (every 24hrs by default), fetch callhome information and
upload it to SUBNET.
New config keys under the `callhome` subsystem:
enable - Set to `on` for enabling callhome. Default `off`
frequency - Interval between callhome cycles. Default `24h`
* Improvements based on review comments
- Update `enableCallhome` safely
- Rename pctx to ctx
- Block during execution of callhome
- Store parsed proxy URL in global subnet config
- Store callhome URL(s) in constants
- Use existing global transport
- Pass auth token to subnetPostReq
- Use `config.EnableOn` instead of `"on"`
* Use atomic package instead of lock
* Use uber atomic package
* Use `Cancel` instead of `cancel`
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
PR #15041 fixed replicating 'null' version however
due to a regression from #14994 caused the target
versions for these 'null' versioned objects to have
different 'versions', this may cause confusion with
bi-directional replication and cause double replication.
This PR fixes this properly by making sure we replicate
the correct versions on the objects.
mergeEntryChannels has the potential to perpetually
wait on the results channel, context might be closed
and we did not honor the caller context canceling.
The S3 service can be frozen indefinitely if a client or mc asks for object
perf API but quits early or has some networking issues. The reason is
that partialWrite() can block indefinitely.
This commit makes partialWrite() listens to context cancellation as well. It
also renames deadlinedCtx to healthCtx since it covers handler context
cancellation and not only not only the speedtest deadline.
In a streaming response, the client knows the size of a streamed
message but never checks the message size. Add the check to error
out if the response message is truncated.
Indexed streams would be decoded by the legacy loader if there
was an error loading it. Return an error when the stream is indexed
and it cannot be loaded.
Fixes "unknown minor metadata version" on corrupted xl.meta files and
returns an actual error.
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.
In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.
Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
readAllXL would return inlined data for outdated disks
causing "read" to return incorrect content to the client,
this PR fixes this behavior by making sure we skip such
outdated disks appropriately based on the latest ModTime
on the disk.
Main motivation is move towards a common backend format
for all different types of modes in MinIO, allowing for
a simpler code and predictable behavior across all features.
This PR also brings features such as versioning, replication,
transitioning to single drive setups.
Following code can reproduce an unending go-routine buildup,
while keeping connections established due to lack of client
not closing the connections.
https://gist.github.com/harshavardhana/2d00e6f909054d2d2524c71485ad02e1
Without this PR all MinIO deployments can be put into
denial of service attacks, causing entire service to be
unavailable.
We bring in two timeouts at this stage to control such
go-routine build ups, new change
- IdleTimeout (to kill off idle connections)
- ReadHeaderTimeout (to kill off connections that are too slow)
This new change also brings two hidden options to make any
additional relevant changes if desired in some setups.
It would seem like the PR #11623 had chewed more
than it wanted to, non-fips build shouldn't really
be forced to use slower crypto/sha256 even for
presumed "non-performance" codepaths. In MinIO
there are really no "non-performance" codepaths.
This assumption seems to have had an adverse
effect in certain areas of CPU usage.
This PR ensures that we stick to sha256-simd
on all non-FIPS builds, our most common build
to ensure we get the best out of the CPU at
any given point in time.
- Adds an STS API `AssumeRoleWithCustomToken` that can be used to
authenticate via the Id. Mgmt. Plugin.
- Adds a sample identity manager plugin implementation
- Add doc for plugin and STS API
- Add an example program using go SDK for AssumeRoleWithCustomToken
this PR also fixes a situation where incorrect
partsMetadata slice was used where fi.Data was
re-used from a single drive causing duplication
of the shards across all drives.
This happens for situations where shouldHeal()
returns true for all drives > parityBlocks.
To avoid this we should never attempt to heal on all
drives > parityBlocks, unless we are doing metadata
migration from xl.json -> xl.meta
If one or more pools reach 85% usage in a set, we will only
use pools that have more free space.
In case all pools are above 85% we allow all of them to be used
with the regular distribution.
When a server pool with a different number of sets is added they are
not compensated when choosing a destination pool for new objects.
This leads to the unbalanced placement of objects with smaller pools
getting a bigger number of objects since we only compare the destination
sets directly.
This change will compensate for differences in set sizes when choosing
the destination pool.
Different set sizes are already compensated by fewer disks.
updating metadata with CopyObject on a versioned bucket
causes the latest version to be not readable, this PR fixes
this properly by handling the inline data bug fix introduced
in PR #14780.
This bug affects only inlined data.
* Do not use inline data size in xl.meta quorum calculation
Data shards of one object can different inline/not-inline decision
in multiple disks. This happens with outdated disks when inline
decision changes. For example, enabling bucket versioning configuration
will change the small file threshold.
When the parity of an object becomes low, GET object can return 503
because it is not unable to calculate the xl.meta quorum, just because
some xl.meta has inline data and other are not.
So this commit will be disable taking the size of the inline data into
consideration when calculating the xl.meta quorum.
* Add tests for simulatenous inline/notinline object
Co-authored-by: Anis Elleuch <anis@min.io>
current implementation relied on recursively calling one bucket
at a time across all peers, this would be very slow and chatty
when there are 100's of buckets which would mean 100*peerCount
amount of network operations.
This PR attempts to reduce this entire call into `peerCount`
amount of network calls only. This functionality addresses also a
concern where the Prometheus metrics would significantly slow
down when one of the peers is offline.
Fix fallback hot loop
fd was never refreshed, leading to an infinite hot loop if a disk failed and the fallback disk fails as well.
Fix & simplify retry loop.
Fixes#14960
One usee reported having mc admin heal status output ETA increasing
by time. It turned out it is MRF that is not clearing its data due to a
bug in the code.
pendingItems is increased when an object is queued to be healed but
never decreasd when there is a healing error. This commit will decrease
pendingItems and pendingBytes even when there is an error to give
accurate reporting.
If LDAP is enabled, STS security token policy is evaluated using a
different code path and expects ldapUser claim to exist in the security
token. This means other STS temporary accounts generated by any Assume
Role function, such as AssumeRoleWithCertificate, won't be allowed to do any
operation as these accounts do not have LDAP user claim.
Since IsAllowedLDAPSTS() is similar to IsAllowedSTS(), this commit will
merge both.
Non harmful changes:
- IsAllowed for LDAP will start supporting RoleARN claim
- IsAllowed for LDAP will not check for parent claim anymore. This check doesn't
seem to be useful since all STS login compare access/secret/security-token
with the one saved in the disk.
- LDAP will support $username condition in policy documents.
Co-authored-by: Anis Elleuch <anis@min.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
.Reset() documentation states:
For a Timer created with NewTimer, Reset should be invoked only on stopped
or expired timers with drained channels.
This change is just to comply with this requirement as there might be some
runtime dependent situation that might lead to unexpected behavior.
it seems in some places we have been wrongly using the
timer.Reset() function, nicely exposed by an example
shared by @donatello https://go.dev/play/p/qoF71_D1oXD
this PR fixes all the usage comprehensively
anything that is stuck on the disk today can cause latency
spikes for all incoming S3 I/O, we need to have this
de-coupled so that we can make sure that latency in loading
credentials are not reflected back to the S3 API calls.
The approach this PR takes is by checking if the calls were
updated just in case when the IAM load was in progress,
so that we can use merge instead of "replacement" to avoid
missing state.
When configuring a new target, such as an audit target, the server waits
until all audit events are sent to the audit target before doing the
swap from the old to the new audit target. Therefore current S3 operations
can suffer from this since the audit swap lock will be held.
This behavior is unnecessary as the new audit target can enter in a
functional mode immediately and the old audit will just cancel itself
at its own pace.
Object tags can have special characters such as whitespace. However
the current code doesn't properly consider those characters while
evaluating the lifecycle document.
ObjectInfo.UserTags contains an url encoded form of object tags
(e.g. key+1=val)
This commit fixes the issue by using the tags package to parse object tags.
The test expects from DeleteFile to return errDiskNotFound when the disk
is not available. It calls os.RemoveAll() to remove one disk after XL storage
initialization. However, this latter contains some goroutines which can
race with os.RemoveAll() and then the test fails sporadically with
returning random errors.
The commit will tweak the initialization routine of the XL storage to
only run deletion of temporary and metacache data in the background,
so TestXLStorageDeleteFile won't fail anymore.
currently, we allowed buckets to be listed from the
API call if and when the user has ListObject()
permission at the global level, this is okay to be
extended to GetBucketLocation() as well since
GetBucketLocation() is a "read" call and allowing "reads"
on a bucket has an implicit assumption that ListBuckets()
should be allowed.
This makes discoverability of access for read-only users
becomes easier or users with specific restrictions on their
policies.
This PR simplifies few things by splitting
the locks between audit, logger targets to
avoid potential contention between them.
any failures inside audit/logger HTTP
targets must only log to console instead
of other targets to avoid cyclical dependency.
avoids unneeded atomic variables instead
uses RWLock to differentiate a more common
read phase v/s lock phase.
- This change renames the OPA integration as Access Management Plugin - there is
nothing specific to OPA in the integration, it is just a webhook.
- OPA configuration is automatically migrated to Access Management Plugin and
OPA specific configuration is marked as deprecated.
- OPA doc is updated and moved.
In case of multi-pools setup, GetObjectNInfo returns a GetObjectReader
but it unlocks the read lock when quitting GetObjectNInfo. This should
not happen, unlock should only happen when GetObjectReader is closed.
- do not need to restrict prefix exclusions that do not
have `/` as suffix, relax this requirement as spark may
have staging folders with other autogenerated characters
, so we are better off doing full prefix March and skip.
- multiple delete objects was incorrectly creating a
null delete marker on a versioned bucket instead of
creating a proper versioned delete marker.
- do not suspend paths on the excluded prefixes during
delete operations to avoid creating `null` delete markers,
honor suspension of versioning only at bucket level for
delete markers.
PR #14828 introduced prefix-level exclusion of versioning
and replication - however our site replication implementation
since it defaults versioning on all buckets did not allow
changing versioning configuration once the bucket was created.
This PR changes this and ensures that such changes are honored
and also propagated/healed across sites appropriately.
Spark/Hadoop workloads which use Hadoop MR
Committer v1/v2 algorithm upload objects to a
temporary prefix in a bucket. These objects are
'renamed' to a different prefix on Job commit.
Object storage admins are forced to configure
separate ILM policies to expire these objects
and their versions to reclaim space.
Our solution:
This can be avoided by simply marking objects
under these prefixes to be excluded from versioning,
as shown below. Consequently, these objects are
excluded from replication, and don't require ILM
policies to prune unnecessary versions.
- MinIO Extension to Bucket Version Configuration
```xml
<VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Status>Enabled</Status>
<ExcludeFolders>true</ExcludeFolders>
<ExcludedPrefixes>
<Prefix>app1-jobs/*/_temporary/</Prefix>
</ExcludedPrefixes>
<ExcludedPrefixes>
<Prefix>app2-jobs/*/__magic/</Prefix>
</ExcludedPrefixes>
<!-- .. up to 10 prefixes in all -->
</VersioningConfiguration>
```
Note: `ExcludeFolders` excludes all folders in a bucket
from versioning. This is required to prevent the parent
folders from accumulating delete markers, especially
those which are shared across spark workloads
spanning projects/teams.
- To enable version exclusion on a list of prefixes
```
mc version enable --excluded-prefixes "app1-jobs/*/_temporary/,app2-jobs/*/_magic," --exclude-prefix-marker myminio/test
```
when the site is being removed is missing replication config. This can
happen when a new deployment is brought in place of a site that
is lost/destroyed and needs to delink old deployment from site
replication.
console logging peer API was broken as it would
timeout after 15minutes, this never really worked
beyond this value and basically failed to provide
the streaming "log" functionality that was expected
from this implementation.
also fix convoluted channel handling by keeping things
simple, this is rewritten.
do not modify opts.UserDefined after object-handler
has set all the necessary values, any mutation needed
should be done on a copy of this value not directly.
As there are other pieces of code that access opts.UserDefined
concurrently this becomes challenging.
fixes#14856
When a decommission task is successfully completed, failed, or canceled,
this commit allows restarting the decommission again. Restarting is not
allowed when there is an ongoing decommission task.
this PR introduces a few changes such as
- sessionPolicyName is not reused in an extracted manner
to apply policies for incoming authenticated calls,
instead uses a different key to designate this
information for the callers.
- this differentiation is needed to ensure that service
account updates do not accidentally store JSON representation
instead of base64 equivalent on the disk.
- relax requirements for Deleting a service account, allow
deleting a service account that might be unreadable, i.e
a situation where the user might have removed session policy
which now carries a JSON representation, making it unparsable.
- introduce some constants to reuse instead of strings.
fixes#14784
If an invalid status code is generated from an error we risk panicking. Even if there
are no potential problems at the moment we should prevent this in the future.
Add safeguards against this.
Sample trace:
```
May 02 06:41:39 minio[52806]: panic: "GET /20180401230655.PDF": invalid WriteHeader code 0
May 02 06:41:39 minio[52806]: goroutine 16040430822 [running]:
May 02 06:41:39 minio[52806]: runtime/debug.Stack(0xc01fff7c20, 0x25c4b00, 0xc0490e4080)
May 02 06:41:39 minio[52806]: runtime/debug/stack.go:24 +0x9f
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.setCriticalErrorHandler.func1.1(0xc022048800, 0x4f38ab0, 0xc0406e0fc0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/generic-handlers.go:469 +0x85
May 02 06:41:39 minio[52806]: panic(0x25c4b00, 0xc0490e4080)
May 02 06:41:39 minio[52806]: runtime/panic.go:965 +0x1b9
May 02 06:41:39 minio[52806]: net/http.checkWriteHeaderCode(...)
May 02 06:41:39 minio[52806]: net/http/server.go:1092
May 02 06:41:39 minio[52806]: net/http.(*response).WriteHeader(0xc0406e0fc0, 0x0)
May 02 06:41:39 minio[52806]: net/http/server.go:1126 +0x718
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger.(*ResponseWriter).WriteHeader(0xc032fa3ea0, 0x0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger/audit.go:116 +0xb1
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger.(*ResponseWriter).WriteHeader(0xc032fa3f40, 0x0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger/audit.go:116 +0xb1
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger.(*ResponseWriter).WriteHeader(0xc002ce8000, 0x0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger/audit.go:116 +0xb1
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.writeResponse(0x4f364a0, 0xc002ce8000, 0x0, 0xc0443b86c0, 0x1cb, 0x224, 0x2a9651e, 0xf)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/api-response.go:736 +0x18d
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.writeErrorResponse(0x4f44218, 0xc069086ae0, 0x4f364a0, 0xc002ce8000, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00656afc0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/api-response.go:798 +0x306
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.objectAPIHandlers.getObjectHandler(0x4b73768, 0x4b73730, 0x4f44218, 0xc069086ae0, 0x4f82090, 0xc002d80620, 0xc040e03885, 0xe, 0xc040e03894, 0x61, ...)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/object-handlers.go:456 +0x252c
```
space characters at the beginning or at the end can lead to
confusion under various UI elements in differentiating the
actual name of "policy, user or group" - to avoid this behavior
this PR onwards we shall reject such inputs for newer entries.
existing saved entries will behave as is and are going to be
operable until they are removed/renamed to something more
meaningful.
- When using multiple providers, claim-based providers are not allowed. All
providers must use role policies.
- Update markdown config to allow `details` HTML element
this is allowed as long as order is preserved as is
on an existing setup, the new command line is updated
in `pool.bin` to facilitate future decommission's on
these pools.
introduce x-minio-force-create environment variable
to force create a bucket and its metadata as required,
it is useful in some situations when bucket metadata
needs recovery.
improvements in this PR include
- decommission objects that have __XLDIR__ suffix
- decommission objects that have `null` version on
a versioned bucket.
- make sure to look for any "decom" failures to ensure
that we do not wrong conclude decom as complete without
all files getting copied over.
- break out eagerly upon first error for objects with
multiple versions, leave the object as is for support
debugging and analysis.
heal bucket metadata and IAM entries for
sites participating in site replication from
the site with the most updated entry.
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <aditya@minio.io>
The site replication status call was using a loop iteration variable sent
directly into go-routines instead of being passed as an argument. As the
variable is being updated in the loop, previously launched go routines do not
necessarily use the value at the time they were launched.
This PR fixes two issues
- The first fix is a regression from #14555, the fix itself in #14555
is correct but the interpretation of that information by the
object layer code for "replication" was not correct. This PR
tries to fix this situation by making sure the "Delete" replication
works as expected when "VersionPurgeStatus" is already set.
Without this fix, there is a DELETE marker created incorrectly on
the source where the "DELETE" was triggered.
- The second fix is perhaps an older problem started since we inlined-data
on the disk for small objects, CopyObject() incorrectly inline's
a non-inlined data. This is due to the fact that we have code where
we read the `part.1` under certain conditions where the size of the
`part.1` is less than the specific "threshold".
This eventually causes problems when we are "deleting" the data that
is only inlined, which means dataDir is ignored leaving such
dataDir on the disk, that looks like an inconsistent content on
the namespace.
fixes#14767
It is wasteful to allow parallel upgrades of MinIO server. This also generates
weird error invoked by selfupdate module when it happens such as:
'rename /opt/bin/.minio.old /opt/bin/..minio.old.old'
currently filterPefix was never used and set
that would filter out entries when needed
when `prefix` doesn't end with `/` - this
often leads to objects getting Walked(), Healed()
that were never requested by the caller.
without this wait there is a potential for some objects
that are in actively being decommissioned would cancel,
however the decommission status might wrongly conclude
this as "Complete".
To avoid this make sure to add waitgroups on the parallel
workers, allowing parallel copies to complete fully before
we return.
In previous releases, mc admin user list would return the list of users
that have policies mapped in IAM database. However, this was removed but
this commit will bring it back until we revamp this.
- This change switches to a new parquet library
- SelectObjectContent now takes a single lock at the beginning and holds it
during the operation. Previously the operation took a lock every time the
parquet library performed a Seek on the underlying object stream.
- Add basic support for LogicalType annotations for timestamps.
Execute the object, drive and net speedtests as part of the healthinfo
(if requested by the client), and include their result in the response.
The options for the speedtests have been picked from the default values
used by `mc support perf` command.
This commit improves the listing of encrypted objects:
- Use `etag.Format` and `etag.Decrypt`
- Detect SSE-S3 single-part objects in a single iteration
- Fix batch size to `250`
- Pass request context to `DecryptAll` to not waste resources
when a client cancels the operation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds two new functions to the
internal `etag` package:
- `ETag.Format`
- `Decrypt`
The `Decrypt` function decrypts an encrypted
ETag using a decryption key. It returns not
encrypted / multipart ETags unmodified.
The `Decrypt` function is mainly used when
handling SSE-S3 encrypted single-part objects.
In particular, the ETag of an SSE-S3 encrypted
single-part object needs to be decrypted since
S3 clients expect that this ETag is equal to the
content MD5.
The `ETag.Format` method also covers SSE ETag handling.
MinIO encrypts all ETags of SSE single part objects.
However, only the ETag of SSE-S3 encrypted single part
objects needs to be decrypted.
The ETag of an SSE-C or SSE-KMS single part object
does not correspond to its content MD5 and can be
a random value.
The `ETag.Format` function formats an ETag such that
it is an AWS S3 compliant ETag. In particular, it
returns non-encrypted ETags (single / multipart)
unmodified. However, for encrypted ETags it returns
the trailing 16 bytes as ETag. For encrypted ETags
the last 16 bytes will be a random value.
The main purpose of `Format` is to format ETags
such that clients accept them as well-formed AWS S3
ETags.
It differs from the `String` method since `String`
will return string representations for encrypted
ETags that are not AWS S3 compliant.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
The deployment id was being written to the health report towards the end
of the handler. Because of this, if there was a timeout in any of the
data fetching, the deployment id was not getting written at all. Upload
of such reports fails on SUBNET as deployment id is the unique
identifier for a cluster in subnet.
Fixed by writing the deployment id at the beginning of the processing.
This commit simplifies the ETag decryption and size adjustment
when listing object parts.
When listing object parts, MinIO has to decrypt the ETag of all
parts if and only if the object resp. the parts is encrypted using
SSE-S3.
In case of SSE-KMS and SSE-C, MinIO returns a pseudo-random ETag.
This is inline with AWS S3 behavior.
Further, MinIO has to adjust the size of all encrypted parts due to
the encryption overhead.
The ListObjectParts does specifically not use the KMS bulk decryption
API (4d2fc530d0) since the ETags of all
parts are encrypted using the same object encryption key. Therefore,
MinIO only has to connect to the KMS once, even if there are multiple
parts resp. ETags. It can simply reuse the same object encryption key.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for encrypted KES
client private keys.
Now, it is possible to encrypt the KES client
private key (`MINIO_KMS_KES_KEY_FILE`) with
a password.
For example, KES CLI already supports the
creation of encrypted private keys:
```
kes identity new --encrypt --key client.key --cert client.crt MinIO
```
To decrypt an encrypted private key, the password
needs to be provided:
```
MINIO_KMS_KES_KEY_PASSWORD=<password>
```
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit optimises the ETag decryption when
listing objects.
When MinIO lists objects, it has to decrypt the
ETags of single-part SSE-S3 objects.
It does not need to decrypt ETags of
- plaintext objects => Their ETag is not encrypted
- SSE-C objects => Their ETag is not the content MD5
- SSE-KMS objects => Their ETag is not the content MD5
- multipart objects => Their ETag is not encrypted
Hence, MinIO only needs to make a call to the KMS
when it needs to decrypt a single-part SSE-S3 object.
It can resolve the ETags off all other object types
locally.
This commit implements the above semantics by
processing an object listing in batches.
If the batch contains no single-part SSE-S3 object,
then no KMS calls will be made.
If the batch contains at least one single-part
SSE-S3 object we have to make at least one KMS call.
No we first filter all single-part SSE-S3 objects
such that we only request the decryption keys for
these objects.
Once we know which objects resp. ETags require a
decryption key, MinIO either uses the KES bulk
decryption API (if supported) or decrypts each
ETag serially.
This commit is a significant improvement compared
to the previous listing code. Before, a single
non-SSE-S3 object caused MinIO to fall-back to
a serial ETag decryption.
For example, if a batch consisted of 249 SSE-S3
objects and one single SSE-KMS object, MinIO would
send 249 requests to the KMS.
Now, MinIO will send a single request for exactly
those 249 objects and skip the one SSE-KMS object
since it can handle its ETag locally.
Further, MinIO would request decryption keys
for SSE-S3 multipart objects in the past - even
though multipart ETags are not encrypted.
So, if a bucket contained only multipart SSE-S3
objects, MinIO would make totally unnecessary
requests to the KMS.
Now, MinIO simply skips these multipart objects
since it can handle the ETags locally.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
In bulk ETag decryption, do not rely on the etag to check if it is
encrypted or not to decide if we should set the actual object size in
ObjectInfo. The reason is that multipart objects ETags are not
encrypted.
Always get the actual object size in that case.
This commit fixes a subtle bug in the ETag
`IsEncrypted` implementation.
An encrypted ETag may contain random bytes,
i.e. some randomness used for encryption.
This random value can contain a '-' byte
simple due to being randomly generated.
Before, the `IsEncrypted` implementation
incorrectly assumed that an encrypted ETag
cannot contain a '-' since it would be a
multipart ETag. Multipart ETags have a
16 byte value followed by a '-' and the part number.
For example:
```
059ba80b807c3c776fb3bcf3f33e11ae-2
```
However, the following encrypted ETag
```
20000f00db2d90a7b40782d4cff2b41a7799fc1e7ead25972db65150118dfbe2ba76a3c002da28f85c840cd2001a28a9
```
also contains a '-' byte but is not a multipart ETag.
This commit fixes the `IsEncrypted` implementation
simply by checking whether the ETag is at least 32
bytes long. A valid multipart ETag is never 32 bytes
long since a part number must be <= 10000.
However, an encrypted ETag must be at least 32 bytes
long. It contains the encrypted ETag bytes (16 bytes)
and the authentication tag added by the AEAD cipher (again
16 bytes).
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for bulk ETag
decryption for SSE-S3 encrypted objects.
If KES supports a bulk decryption API, then
MinIO will check whether its policy grants
access to this API. If so, MinIO will use
a bulk API call instead of sending encrypted
ETags serially to KES.
Note that MinIO will not use the KES bulk API
if its client certificate is an admin identity.
MinIO will process object listings in batches.
A batch has a configurable size that can be set
via `MINIO_KMS_KES_BULK_API_BATCH_SIZE=N`.
It defaults to `500`.
This env. variable is experimental and may be
renamed / removed in the future.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
ListObjects, ListObjectsV2 calls are being heavily taxed when
there are many versions on objects left over from a previous
release or ILM was never setup to clean them up. Instead
of being absolutely correct at resolving the exact latest
version of an object, we simply rely on the top most 1
version and resolve the rest.
Once we have obtained the top most "1" version for
ListObject, ListObjectsV2 call we break out.
In riscv64, the `syscall.Uname` function will return a uint8 slice.
func main() {
var buf syscall.Utsname
fmt.Printf("Buffer Type: %T\n", buf.Release)
}
output:
Buffer Type: [65]uint8
This is tested in the Arch Linux RISC-V 64 QEMU environment.
Signed-off-by: Avimitin <avimitin@gmail.com>
For ListObjects and ListObjectsV2 perform lifecycle checks on
all objects before returning. This will filter out objects that are
pending lifecycle expiration.
Bonus: Cheaper server pool conflict resolution by not converting to FileInfo.
When reloading a dynamic config allow the request pool to scale both ways.
Existing requests hold on to the previous pool, so they will pop the elements from that.
currently an on-going decommission, during a server
restart might block the startup sequence for relatively
longer periods, instead start the decommission in
background lazily.
This commit fixes two bugs in the `PutObjectPartHandler`.
First, `PutObjectPart` should return SSE-KMS headers
when the object is encrypted using SSE-KMS.
Before, this was not the case.
Second, the ETag should always be a 16 byte hex string,
perhaps followed by a `-X` (where `X` is the number of parts).
However, `PutObjectPart` used to return the encrypted ETag
in case of SSE-KMS. This leaks MinIO internal etag details
through the S3 API.
The combination of both bugs causes clients that use SSE-KMS
to fail when trying to validate the ETag. Since `PutObjectPart`
did not send the SSE-KMS response headers, the response looked
like a plaintext `PutObjectPart` response. Hence, the client
tries to verify that the ETag is the content-md5 of the part.
This could never be the case, since MinIO used to return the
encrypted ETag.
Therefore, clients behaving as specified by the S3 protocol
tried to verify the ETag in a situation they should not.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Fix `panic: "POST /minio/peer/v21/signalservice?signal=2": sync: WaitGroup is reused before previous Wait has returned`
Log entries already on the channel would cause `logEntry` to increment the
waitgroup when sending messages, after Cancel has been called.
Instead of tracking every single message, just check the send goroutine. Faster
and safe, since it will not decrement until the channel is closed.
Regression from #14289
When more than 2 disks are unavailable for listing, the same disk will be used for fallback.
This makes quorum calculations incorrect since the same disk will have multiple entries.
This PR keeps track of which fallback disks have been handed out and only every returns a disk once.
avoids creating new transport for each `isServerResolvable`
request, instead re-use the available global transport and do
not try to forcibly close connections to avoid TIME_WAIT
build upon large clusters.
Never use httpClient.CloseIdleConnections() since that can have
a drastic effect on existing connections on the transport pool.
Remove it everywhere.
- GetObject() with vid should return 405
- GetObject() without vid should return 404
- ListObjects() should ignore this object if this is the "latest" version of the object
- ListObjectVersions() should list this object as "DELETE marker"
- Remove data parts before sync'ing the version pending purge
PR introduced in #13819 was incorrect and was not
handling the situation where a buffer is full can
cause incessant amount of logs that would keep the
logger webhook overrun by the requests.
To avoid this only log failures to console logger
instead of all targets as it can cause self reference,
leading to an infinite loop.
changing root credentials makes service accounts
in-operable, this PR changes the way sessionToken
is generated for service accounts.
It changes service account behavior to generate
sessionToken claims from its own secret instead
of using global root credential.
Existing credentials will be supported by
falling back to verify using root credential.
fixes#14530
```
tmp = buf[want:]
```
Would potentially crash when `buf` is truncated for some reason
and does not have the expected bytes, this is of course considered
not normal and is an odd situation. But we do not need to crash
here instead allow for errors to be returned and let callers handle
the errors.
This PR simply adds a warning message when it detects older kernel
versions and warn's them about potential performance issues on this
kernel.
The issue can be seen only with parallel I/O across all drives
on denser setups such as 90 drives or 45 drives per server configurations.
This type of code is not necessary, read's of all
metadata content at `.minio.sys/config` automatically
triggers healing when necessary in the GetObjectNInfo()
call-path.
Having this code is not useful and this also adds to
the overall startup time of MinIO when there are lots
of users and policies.
The main goal of this PR is to solve the situation where disks stop
responding to operations. This generally causes an FD build-up and
eventually will crash the server.
This adds detection of hung disks, where calls on disk get stuck.
We add functionality to `xlStorageDiskIDCheck` where it keeps
track of the number of concurrent requests on a given disk.
A total number of 100 operations are allowed. If this limit is reached
we will block (but not reject) new requests, but we will monitor the
state of the disk.
If no requests have been completed or updated within a 15-second
window, we mark the disk as offline. Requests that are blocked will be
unblocked and return an error as "faulty disk".
New requests will be rejected until the disk is marked OK again.
Once a disk has been marked faulty, a check will run every 5 seconds that
will attempt to write and read back a file. As long as this fails the disk will
remain faulty.
To prevent lots of long-running requests to mark the disk faulty we
implement a callback feature that allows updating the status as parts
of these operations are running.
We add a reader and writer wrapper that will update the status of each
successful read/write operation. This should allow fine enough granularity
that a slow, but still operational disk will not reach 15 seconds where
50 operations have not progressed.
Note that errors themselves are not enough to mark a disk faulty.
A nil (or io.EOF) error will mark a disk as "good".
* Make concurrent disk setting configurable via `_MINIO_DISK_MAX_CONCURRENT`.
* de-couple IsOnline() from disk health tracker
The purpose of IsOnline() is to ensure that we
reconnect the drive only when the "drive" was
- disconnected from network we need to validate
if the drive is "correct" and is the same drive
which belongs to this server.
- drive was replaced we have to format it - we
support hot swapping of the drives.
IsOnline() is not meant for taking the drive offline
when it is hung, it is not useful we can let the
drive be online instead "return" errors for relevant
calls.
* return errFaultyDisk for DiskInfo() call
Co-authored-by: Harshavardhana <harsha@minio.io>
Possible future Improvements:
* Unify the REST server and local xlStorageDiskIDCheck. This would also improve stats significantly.
* Allow reads/writes to be aborted by the context.
* Add usage stats, concurrent count, blocked operations, etc.
Even if we specify the target namespace by `helm install --namespace`,
the StatefulSet is created on the default namespace. Since this resource
references the ServiceAccount created on the target namespace, pods are
hindered to be created. To avoid this, we deploy the StatefulSet to the
target namespace of helm.
This commit replaces the KMS / KES environment
variables with `MINIO_KMS_SECRET_KEY` when testing
healing on CI.
This change is necessary since KES `0.18.0` introduced
some API breaking changes and the healing tests run
a test (`verify-3604`) that requires an older MinIO
version (e.g. `2021-11-24T23-19-33Z`) which is not
able to parse a KES error as expected.
This commit allows the KES instance at `https://play.min.io:7373`
to get updated to newer versions.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Data usage does not always contain tiering info even if the data usage
information is valid. Avoid a crash in that case.
(e.g. the scanner scanned the namespace, the user enables tiering,
prometheus scrapes the server before the scanner gets a chance to
update the data usage with new tiering information)
Healing decisions would align with skipped folder counters. This can lead to files
never being selected for heal checks on "clean" paths.
Use different hashing methods and take objectHealProbDiv into account when
calculating the cycle.
Found by @vadmeste
This is a side-affect of the optimization done in PR #13544 which
causes a certain type of delete operations on given object versions
can cause lastVersion indication to be skipped, which leads to
an `xl.meta` where Versions[] slice is empty while the entire
file is intact by itself.
This PR tries to ensure that such files are visible and deletable
by regular means of listing as null 'delete-marker' and also
avoid the situation where this potential issue might arise.
When scanning using normal mode, HealObject() can report an
error saying that it found a corrupted part. This doesn't have
when HealObject() is called with bitrot scan flag. However, when
this happens, we can still restart HealObject() with the bitrot scan.
This is also important because this means the scanner and the
new disks healer will not be able to heal an object that doesn't
exist in a specific disk and has corruption in another disk.
Also without this PR, mc admin heal command without bitrot will report
an error.
This commit removes some duplicate code that
converts KES API errors.
This code was added since KES `0.18.0` changed
some exported API errors. However, the KES SDK
handles this error conversion itself.
Therefore, it is not necessary to duplicate this
behavior in MinIO.
See: 21555fa624/error.go (L94)
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
- Updating KES dependency to v.0.18.0
- Fixing incompatibility issue when checking for errors during KES key creation
Signed-off-by: Lenin Alevski <alevsk.8772@gmail.com>
fixes a regression introduced in #14269 that refactored
the notification registration logic, all the amqp targets
however online will not be available for use anymore.
fixes#14451
In a distributed setup, a DiskInfo REST call to an unformatted disk
returns an error with no disk information, such as the disk endpoint
URL, which is unexpected.
metadata headers can have headers without values
as per AWS S3 spec however, we need to skip some
headers that do not have values that potentially
can have empty values set.
small setups do not return appropriate errors when speedtest
cannot run on small tiny setups, allow the tests to fail
appropriately more pro-actively.
many users bring toy setups, this PR simply returns an error
in such situations.
Up until now `InitializeProvider` method of `Config` struct was
implemented on a value receiver which is why changes on `provider`
field where never reflected to method callers. In order to fix this
issue, the method was implemented on a pointer receiver.
This is a regression from #14037, distributed setups
with MQTT was not working anymore. According to MQTT
spec it is expected this is unique per server.
We shall proceed to use unix nano timestamp hex
value instead here.
healing disks take active I/O it is possible
that deleted objects might stay in .trash
folder for a really long time until the drive
is fully healed.
this PR changes it such that we are making sure
we purge the active content written to these
disks as well.
- speedtest logs calls that were canceled
spuriously, in situations where it should
be ignored.
- all errors of interest are always sent back
to the client there is no need to log them
on the server console.
- PUT failures should negate the increments
such that GET is not attempted on unsuccessful
calls.
- do not attempt MRF on speedtest objects.
In the testing mode, reformatting disks will fail because the healing
code will complain if one disk is in root mode. This commit will
automatically set all disks as non-root if MINIO_CI_CD is set.
Currently, when applying any dynamic config, the system reloads and
re-applies the config of all the dynamic sub-systems.
This PR refactors the code in such a way that changing config of a given
dynamic sub-system will work on only that sub-system.
An onlineDisk means its a valid disk but it may be a
re-connected disk, this PR verifies that based on LastConn()
to only trigger MRF. Current code would again re-load the
disk 'format.json' which is not necessary and perhaps an
unnecessary call.
A potential side affect of this is closing perfectly online
disks and getting re-replaced by reloading 'format.json'.
This PR tries to avoid this situation by making sure MRF
is triggered but not reloading 'format.json' because of MRF.
The `LookupConfig` code was not using `GetWithDefault`, because of which
some of the config values were being returned as empty string, and calls
like `strconv.Atoi` and `time.ParseDuration` on these were failing.
If the policy fails MinIO's minimum threshold for a valid policy,
they'll still (correctly) fail, but policies with a : (and probably a
/) should be allowed since they work with standard MC/MinIO
Console interactions.
This creates the files as policy_IDX.json instead of <name>.json
to avoid any issues with the name + Kubernetes ConfigMaps since
ConfigMap keys must be: [-._a-zA-Z0-9]+
Only the first `listAndHeal` would ever be able to write on errCh, blocking all others infinitely.
Instead read all errors but return the first non-nil, if any.
The intention appears to be that this should cancel on any error,
so that part is kept.
Regression from #13990
When more than one gateway reads and writes from the same mount point
and there is a load balancer pointing to those gateways. Each gateway
will try to create its own temporary append file but fails to clear it later
when not needed.
This commit creates a routine that checks all upload IDs saved in
multipart directory and remove any stale entry with the same upload id
in the memory and in the temporary background append folder as well.
Enabled with `mc admin config set alias/ api gzip_objects=on`
Standard filtering applies (1K response minimum, not compressed content
type, not range request, gzip accepted by client).
The current code considers a pool with all root disks to be as part
of a testing environment even if there are other pools with mounted
disks. This will result to illegitimate writing in root disks.
Fix this by simplifing the logic: require MINIO_CI_CD in order to skip
root disk check.
MinIO configuration is loaded after the initialization of the server
handlers, which will miss the initialization of the bucket forwarder
handler.
Though the federation is deprecated, let's fix this for the time being.
S3 spec returns x-amz-restore header in HEAD/GET object with the
following format:
```
x-amz-restore: ongoing-request="false", expiry-date="Fri, 21 Dec 2012
00:00:00 GMT"
```
This commit adds quotes as the current code does not support it. It will
also supports the old format saved in the disk (in xl.meta) for backward
compatibility.
A regression removed support of federation in the gateway mode.
Enable it again.
Federation is deprecated for a while but let's fix this for the time being.
Deleting bulk objects had an issue since the relevant versionID
is not passed through the layers to ensure that the dangling
object purge actually works cleanly.
This is a continuation of quorum related error returned by
multi-object delete API from #14248
This PR ensures that we pass down correct information as
well as extend the scope of dangling object detection.
When setting a config of a particular sub-system, validate the existing
config and notification targets of only that sub-system, so that
existing errors related to one sub-system (e.g. notification target
offline) do not result in errors for other sub-systems.
Some users running MinIO claim that their system became slow. One
way to investigate is to look at this Prometheus history of the number of
the requests reaching the server. The existing current S3 requests metric
is not enough because it can increase of the system really becomes slow,
due to disk issues for example.
startup speed-up, currently getFormatErasureInQuorum()
would spend up to 2-3secs when there are 3000+ drives
for example in a setup, simplify this implementation
to use drive counts.
DeleteMarkers do not have a default quorum, i.e it is possible that
DeleteMarkers were created with n/2+1 quorum as well to make sure
that we satisfy situations such as those we need to make sure delete
markers only expect n/2 read quorum.
Additionally we should also look at additional metadata on the
actual objects that might have been "erasure" upgraded with new
parity when disks are down.
In such a scenario do not default to the standard storage class
parity, instead use the parityBlocks present on the FileInfo to
ensure that we are dealing with the correct quorum for READs and
DELETEs.
Retry listings, when no next marker is returned and the result isn't truncated.
This can happen when an object is queued, but no info can be fetched.
Fixes#14190
The healing code repeatedly tries to heal a root disk when it is empty
the reason is that connectEndpoint() returns errUnformattedDisk even
if the disk is a root disk. Changing that to returning another error
will avoid queueing the disk to the healing code in each connect disks
iteration.
some upgraded objects might not get listed due
to different quorum ratios across objects.
make sure to list all objects that satisfy the
maximum possible quorum.
This change allows the MinIO server to lookup users in different directory
sub-trees by allowing specification of multiple search bases separated by
semicolons.
This PR removes an unnecessary state that gets
passed around for DiskIDs, which is not necessary
since each disk exactly knows which pool and which
set it belongs to on a running system.
Currently cached DiskId's won't work properly
because it always ends up skipping offline disks
and never runs healing when disks are offline, as
it expects all the cached diskIDs to be present
always. This also sort of made things in-flexible
in terms perhaps a new diskID for `format.json`.
(however this is not a big issue)
This is an unnecessary requirement that healing
via scanner needs all drives to be online, instead
healing should trigger even when partial nodes
and drives are available this ensures that we
keep the SLA in-tact on the objects when disks
are offline for a prolonged period of time.
Wrong resource is being fetched, since idx is incremented, but mapID is reused.
Regression caused by #13454 - that part didn't optimize anything anyway.
Publish storage functions latency to help compare the performance
of different disks in a single deployment.
e.g.:
```
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/1",server="localhost:9001"} 226
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/2",server="localhost:9002"} 1180
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/3",server="localhost:9003"} 1183
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/4",server="localhost:9004"} 1625
```
- create internal erasure volumes only if the disk is unformatted
- return a copy of format data in xlStorage.ReadAll
- parse env vars only once, to be re-used by xl-storage
This speed-up is intended for faster startup times
for almost all MinIO operations. Changes here are
- Drives are not re-read for 'format.json' on a regular
basis once read during init is remembered and refreshed
at 5 second intervals.
- Do not do O_DIRECT tests on drives with existing 'format.json'
only fresh setups need this check.
- Parallelize initializing erasureSets for multiple sets.
- Avoid re-reading format.json when migrating 'format.json'
from really old V1->V2->V3
- Keep a copy of local drives for any given server in memory
for a quick lookup.
this helps in caching the resolved values early on, avoids
causing further resolution for individual nodes when
object layer comes online.
this can speed up our startup time during, upgrades etc by
an order of magnitude.
additional changes in connectLoadInitFormats() and parallelize
all calls that might be potentially blocking.
- Site replication was missing replicating users,
groups when an empty site was added.
- Add site replication for groups and users when they
are disabled and enabled.
- Add support for replicating bucket quota config.
When calculating signatures empty part ETags were not discarded, leading
to a different signature compared to freshly created ones.
This would mean that after a heal signature of the healed metadata would be
different. Fixing the calculation of signature will make these consistent.
Furthermore when inconsistent entries, with zero version ID, with the same
mod times but different signatures, the one with the lowest signature would
be picked for quorum check. Since this is 50/50, we fall back to a simple
quorum count on all signatures.
Each of these fixes by themselves will lead to quorum. Tests were added
for regressions and expected outcomes.
When the replication rule is based on tag matches, the replication process
should pick up targets matching the tags specified in the replication
rule.
Fixing regression due to #12880
repeated reads on single large objects in HPC like
workloads, need the following option to disable
O_DIRECT for a more effective usage of the kernel
page-cache.
However this optional should be used in very specific
situations only, and shouldn't be enabled on all
servers.
NVMe servers benefit always from keeping O_DIRECT on.
map labels might have been referenced else, this
can lead to concurrent access at lower layers.
avoid this by copying the information while
concurrently serving the metrics.
The code was not properly deciding if a lock needs to be removed
when it doesn't have quorum anymore. After this commit, a lock will be
forcefully unlocked if nodes reporting they are not able to find a lock
internally breaks the quorum.
Simplify the code as well.
do not allow mutation to pool command line when there are
unfinished decommissions in place, disallow such scenarios
to avoid user mistakes.
also add testcases to cover all relevant scenarios.
When reading input for PutObject or PutObjectPart add a readahead buffer for big inputs.
This will make network reads+hashing separate run async with erasure coding and writes. This will reduce overall latency in distributed setups where the input is from upstream and writes go to other servers.
We will read at 2 buffers ahead, meaning one will always be ready/waiting and one is currently being read from.
This improves PutObject and PutObjectParts for these cases.
When deleting multiple versions it "gives" up with an errFileVersionNotFound if
a version cannot be found. This effectively skips deleting other versions
sent in the same request.
This can happen on inconsistent objects. We should ignore errFileVersionNotFound
and continue with others.
We already ignore these at the caller level, this PR is continuation of 54a9877
This PR simplifies few things
- Multipart parts are renamed, upon failure are unrenamed() keep this
multipart specific behavior it is needed and works fine.
- AbortMultipart should blindly delete once lock is acquired instead
of re-reading metadata and calculating quorum, abort is a delete()
operation and client has no business looking for errors on this.
- Skip Access() calls to folders that are operating on
`.minio.sys/multipart` folder as well.
Large clusters with multiple sets, or multi-pool setups at times might
fail and report unexpected "file not found" errors. This can become
a problem during startup sequence when some files need to be created
at multiple locations.
- This PR ensures that we nil the erasure writers such that they
are skipped in RenameData() call.
- RenameData() doesn't need to "Access()" calls for `.minio.sys`
folders they always exist.
- Make sure PutObject() never returns ObjectNotFound{} for any
errors, make sure it always returns "WriteQuorum" when renameData()
fails with ObjectNotFound{}. Return appropriate errors for all
other cases.
Currently tag removal leaves replication state as `PENDING`
because the `HEAD` api returns just a tag count but not the
actual tags, and this is treated as a no-op
```
λ mc admin decommission start alias/ http://minio{1...2}/data{1...4}
```
```
λ mc admin decommission status alias/
┌─────┬─────────────────────────────────┬──────────────────────────────────┬────────┐
│ ID │ Pools │ Capacity │ Status │
│ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Active │
│ 2nd │ http://minio{3...4}/data{1...4} │ 329 GiB (used) / 421 GiB (total) │ Active │
└─────┴─────────────────────────────────┴──────────────────────────────────┴────────┘
```
```
λ mc admin decommission status alias/ http://minio{1...2}/data{1...4}
Progress: ===================> [1GiB/sec] [15%] [4TiB/50TiB]
Time Remaining: 4 hours (started 3 hours ago)
```
```
λ mc admin decommission status alias/ http://minio{1...2}/data{1...4}
ERROR: This pool is not scheduled for decommissioning currently.
```
```
λ mc admin decommission cancel alias/
┌─────┬─────────────────────────────────┬──────────────────────────────────┬──────────┐
│ ID │ Pools │ Capacity │ Status │
│ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Draining │
└─────┴─────────────────────────────────┴──────────────────────────────────┴──────────┘
```
> NOTE: Canceled decommission will not make the pool active again, since we might have
> Potentially partial duplicate content on the other pools, to avoid this scenario be
> very sure to start decommissioning as a planned activity.
```
λ mc admin decommission cancel alias/ http://minio{1...2}/data{1...4}
┌─────┬─────────────────────────────────┬──────────────────────────────────┬────────────────────┐
│ ID │ Pools │ Capacity │ Status │
│ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Draining(Canceled) │
└─────┴─────────────────────────────────┴──────────────────────────────────┴────────────────────┘
```
In a multi-pool setup when disks are coming up, or in a single pool
setup let's say with 100's of erasure sets with a slow network.
It's possible when healing is attempted on `.minio.sys/config`
folder, it can lead to healing unexpectedly deleting some policy
files as dangling due to a mistake in understanding when `isObjectDangling`
is considered to be 'true'.
This issue happened in commit 30135eed86
when we assumed the validMeta with empty ErasureInfo is considered
to be fully dangling. This implementation issue gets exposed when
the server is starting up.
This is most easily seen with multiple-pool setups because of the
disconnected fashion pools that come up. The decision to purge the
object as dangling is taken incorrectly prior to the correct state
being achieved on each pool, when the corresponding drive let's say
returns 'errDiskNotFound', a 'delete' is triggered. At this point,
the 'drive' comes online because this is part of the startup sequence
as drives can come online lazily.
This kind of situation exists because we allow (totalDisks/2) number
of drives to be online when the server is being restarted.
Implementation made an incorrect assumption here leading to policies
getting deleted.
Added tests to capture the implementation requirements.
To avoid error message like:
```
go: warning: github.com/gomodule/redigo@v2.0.0+incompatible: retracted by module author: Old development version not maintained or published.
go: to switch to the latest unretracted version, run:
go get github.com/gomodule/redigo@latest
```
- This allows site-replication to be configured when using OpenID or the
internal IDentity Provider.
- Internal IDP IAM users and groups will now be replicated to all members of the
set of replicated sites.
- When using OpenID as the external identity provider, STS and service accounts
are replicated.
- Currently this change dis-allows root service accounts from being
replicated (TODO: discuss security implications).
- This speeds up running the linters during local development. With a fully
cached run, linter completes in 8 seconds.
- Any caching issues if present would be local and would not impact CI anyway
which always starts with a clean state.
It is possible that GetLock() call remembers a previously
failed releaseAll() when there are networking issues, now
this state can have potential side effects.
This PR tries to avoid this side affect by making sure
to initialize NewNSLock() for each GetLock() attempts
made to avoid any prior state in the memory that can
interfere with the new lock grants.
The current usage of assuming `default` parity of `4` is not correct
for all objects stored on MinIO, objects in .minio.sys have maximum
parity, healing won't trigger on these objects due to incorrect
verification of quorum.
time.Format() is not necessary prematurely for JSON
marshalling, since JSON marshalling indeed defaults
to RFC3339Nano.
This also ensures the 'time' is remembered until its
logged and it is the same time when the 'caller'
invoked 'log' functions.
The AddUser() API endpoint was accepting a policy field.
This API is used to update a user's secret key and account
status, and allows a regular user to update their own secret key.
The policy update is also applied though does not appear to
be used by any existing client-side functionality.
This fix changes the accepted request body type and removes
the ability to apply policy changes as that is possible via the
policy set API.
NOTE: Changing passwords can be disabled as a workaround
for this issue by adding an explicit "Deny" rule to disable the API
for users.
This PR is an attempt to make this configurable
as not all situations have same level of tolerable
delta, i.e disks are replaced days apart or even
hours.
There is also a possibility that nodes have drifted
in time, when NTP is not configured on the system.
data shards were wrong due to a healing bug
reported in #13803 mainly with unaligned object
sizes.
This PR is an attempt to automatically avoid
these shards, with available information about
the `xl.meta` and actually disk mtime.
- When using MinIO's internal IDP, STS credential permissions did not check the
groups of a user.
- Also fix bug in policy checking in AccountInfo call
Also log all the missed events and logs instead of silently
swallowing the events.
Bonus: Extend the logger webhook to support mTLS
similar to audit webhook target.
- r.ulock was not locked when r.UsageCache was being modified
Bonus:
- simplify code by removing some unnecessary clone methods - we can
do this because go arrays are values (not pointers/references) that are
automatically copied on assignment.
- remove some unnecessary map allocation calls
data-structures were repeatedly initialized
this causes GC pressure, instead re-use the
collectors.
Initialize collectors in `init()`, also make
sure to honor the cache semantics for performance
requirements.
Avoid a global map and a global lock for metrics
lookup instead let them all be lock-free unless
the cache is being invalidated.
When STS credentials are created for a user, a unique (hopefully stable) parent
user value exists for the credential, which corresponds to the user for whom the
credentials are created. The access policy is mapped to this parent-user and is
persisted. This helps ensure that all STS credentials of a user have the same
policy assignment at all times.
Before this change, for an OIDC STS credential, when the policy claim changes in
the provider (when not using RoleARNs), the change would not take effect on
existing credentials, but only on new ones.
To support existing STS credentials without parent-user policy mappings, we
lookup the policy in the policy claim value. This behavior should be deprecated
when such support is no longer required, as it can still lead to stale
policy mappings.
Additionally this change also simplifies the implementation for all non-RoleARN
STS credentials. Specifically, for AssumeRole (internal IDP) STS credentials,
policies are picked up from the parent user's policies; for
AssumeRoleWithCertificate STS credentials, policies are picked up from the
parent user mapping created when the STS credential is generated.
AssumeRoleWithLDAP already picks up policies mapped to the virtual parent user.
A corner case can occur where the delete-marker was propagated
but the metadata could not be updated on the primary. Sending
a RemoveObject call with the Delete marker version would end
up permanently deleting the version on target. Instead, perform
a Stat on the delete-marker version on target and redo replication
only if the delete-marker is missing on target.
After the introduction of Refresh logic in locks, the data scanner can
quit when the data scanner lock is not able to get refreshed. In that
case, the context of the data scanner will get canceled and
runDataScanner() will quit. Another server would pick the scanning
routine but after some time, all nodes can just have all scanning
routine aborted, as described above.
This fix will just run the data scanner in a loop.
Return as soon as an AND fails and whenever an OR succeeds. Faster and more flexible.
For example makes `select * from S3object where _2 != '' AND _2 > 1` able to operate on empty fields.
Followup to #13900
minisign v0.10.0 tool broke compatibility, that leads
to our library failing to parse the newer signatures.
This PR
fixes - https://github.com/minio/operator/issues/913
fixes - https://github.com/minio/minio/issues/13824
A workaround for users facing this problem is to unset
```
MINIO_UPDATE_MINISIGN_PUBKEY
```
or set it to `empty` string then signature verification
is skipped automatically.
These changes have been migrated from the previous chart: https://github.com/helm/charts/tree/master/stable/minio
Added `GCS` support for gateway mode in the helm chart.
Added a new GCS block under the gateway key to the house
the GCS-specific variables.
The gateway-deployment template now sets the env var: GOOGLE_APPLICATION_CREDENTIALS as a path to the
service-account-file.json
The service-account-file.json can be added to the MinIO
the secret if an existingSecret is not specified.
- Allow proper SRError to be propagated to
handlers and converted appropriately.
- Make sure to enable object locking on buckets
when requested in MakeBucketHook.
- When DNSConfig is enabled attempt to delete it
first before deleting buckets locally.
- Rename MaxNoncurrentVersions tag to NewerNoncurrentVersions
Note: We apply overlapping NewerNoncurrentVersions rules such that
we honor the highest among applicable limits. e.g if 2 overlapping rules
are configured with 2 and 3 noncurrent versions to be retained, we
will retain 3.
- Expire newer noncurrent versions after noncurrent days
- MinIO extension: allow noncurrent days to be zero, allowing expiry
of noncurrent version as soon as more than configured
NewerNoncurrentVersions are present.
- Allow NewerNoncurrentVersions rules on object-locked buckets
- No x-amz-expiration when NewerNoncurrentVersions configured
- ComputeAction should skip rules with NewerNoncurrentVersions > 0
- Add unit tests for lifecycle.ComputeAction
- Support lifecycle rules with MaxNoncurrentVersions
- Extend ExpectedExpiryTime to work with zero days
- Fix all-time comparisons to be relative to UTC
- This introduces a new admin API with a query parameter (v=2) to return a
response with the timestamps
- Older API still works for compatibility/smooth transition in console
- allow any regular user to change their own password
- allow STS credentials to create users if permissions allow
Bonus: do not allow changes to sts/service account credentials (via add user API)
ListObjects() should never list a delete-marked folder
if latest is delete marker and delimiter is not provided.
ListObjectVersions() should list a delete-marked folder
even if latest is delete marker and delimiter is not
provided.
Enhance further versioning listing on the buckets
request.Form uses less memory allocation and avoids gorilla mux matching
with weird characters in parameters such as '\n'
- Remove Queries() to avoid matching
- Ensure r.ParseForm is called to populate fields
- Add a unit test for object names with '\n'
delete marked objects should not be considered
for listing when listing is delimited, this issue
as introduced in PR #13804 which was mainly to
address listing of directories in listing when
delimited.
This PR fixes this properly and adds tests to
ensure that we behave in accordance with how
an S3 API behaves for ListObjects() without
versions.
Save part.1 for writebacks in a separate folder
and move it to cache dir atomically while saving
the cache metadata. This is to avoid GC mistaking
part.1 as orphaned cache entries and purging them.
This PR also fixes object size being overwritten during
retries for write-back mode.
- deleting policies was deleting all LDAP
user mapping, this was a regression introduced
in #13567
- deleting of policies is properly sent across
all sites.
- remove unexpected errors instead embed the real
errors as part of the 500 error response.
- deleteBucket() should be called for cleanup
if client abruptly disconnects
- out of disk errors should be sent to client
properly and also cancel the calls
- limit concurrency to available MAXPROCS not
32 for auto-tuned setup, if procs are beyond
32 then continue normally. this is to handle
smaller setups.
fixes#13834
Return errors when untar fails at once.
Current error handling was quite a mess. Errors are written
to the stream, but processing continues.
Instead, return errors when they occur and transform
internal errors to bad request errors, since it is likely a
problem with the input.
Fixes#13832
Sometimes, we see an error message like "Server expects 'storage' API
version 'v41', instead found 'v41'" shows a more generic error message
with the path of the REST call.
The earlier approach of using a license token for
communicating with SUBNET is being replaced
with a simpler mechanism of API keys. Unlike the
license which is a JWT token, these API keys will
be simple UUID tokens and don't have any embedded
information in them. SUBNET would generate the
API key on cluster registration, and then it would
be saved in this config, to be used for subsequent
communication with SUBNET.
Following scenario such as objects that exist inside a
prefix say `folder/` must be included in the listObjects()
response.
```
2aa16073-387e-492c-9d59-b4b0b7b6997a v2 DEL folder/
a5b9ce68-7239-4921-90ab-20aed402c7a2 v1 PUT folder/
f2211798-0eeb-4d9e-9184-fcfeae27d069 v1 PUT folder/1.txt
```
Current master does not handle this scenario, because it
ignores the top level delete-marker on folders. This is
however unexpected. It is expected that list-objects returns
the top level prefix in this situation.
```
aws s3api list-objects --bucket harshavardhana --prefix unique/ \
--delimiter / --profile minio --endpoint-url http://localhost:9000
{
"CommonPrefixes": [
{
"Prefix": "unique/folder/"
}
]
}
```
There are applications in the wild such as Hadoop s3a connector
that exploit this behavior and expect the folder to be present
in the response.
This also makes the behavior consistent with AWS S3.
single object delete was not working properly
on a bucket when versioning was suspended,
current version 'null' object was never removed.
added unit tests to cover the behavior
fixes#13783
totalDrives reported in speedTest result were wrong
for multiple pools, this PR fixes this.
Bonus: add support for configurable storage-class, this
allows us to test REDUCED_REDUNDANCY to see further
maximum throughputs across the cluster.
- Allows setting a role policy parameter when configuring OIDC provider
- When role policy is set, the server prints a role ARN usable in STS API requests
- The given role policy is applied to STS API requests when the roleARN parameter is provided.
- Service accounts for role policy are also possible and work as expected.
- New sub-system has "region" and "name" fields.
- `region` subsystem is marked as deprecated, however still works, unless the
new region parameter under `site` is set - in this case, the region subsystem is
ignored. `region` subsystem is hidden from top-level help (i.e. from `mc admin
config set myminio`), but appears when specifically requested (i.e. with `mc
admin config set myminio region`).
- MINIO_REGION, MINIO_REGION_NAME are supported as legacy environment variables for server region.
- Adds MINIO_SITE_REGION as the current environment variable to configure the
server region and MINIO_SITE_NAME for the site name.
The index was converted directly from bytes to binary. This would fail a roundtrip through json.
This would result in `Error: invalid input: magic number mismatch` when reading back.
On non-erasure backends store index as base64.
The httpStreamResponse should not return until CloseWithError has been called.
Instead keep track of write state and skip writing/flushing if an error has occurred.
Fixes#13743
Regression from #13597 (not released)
Go's atomic.Value does not support `nil` type,
concrete type is necessary to avoid any panics with
the current implementation.
Also remove boolean to turn-off tracking of freezeCount.
an active running speedTest will reject all
new S3 requests to the server, until speedTest
is complete.
this is to ensure that speedTest results are
accurate and trusted.
Co-authored-by: Klaus Post <klauspost@gmail.com>
Since JWT tokens remain valid for up to 15 minutes, we
don't have to regenerate tokens for every call.
Cache tokens for matching access+secret+audience
for up to 15 seconds.
```
BenchmarkAuthenticateNode/uncached-32 270567 4179 ns/op 2961 B/op 33 allocs/op
BenchmarkAuthenticateNode/cached-32 7684824 157.5 ns/op 48 B/op 1 allocs/op
```
Reduces internode call allocations a great deal.
FileInfo quorum shouldn't be passed down, instead
inferred after obtaining a maximally occurring FileInfo.
This PR also changes other functions that rely on
wrong quorum calculation.
Update tests as well to handle the proper requirement. All
these changes are needed when migrating from older deployments
where we used to set N/2 quorum for reads to EC:4 parity in
newer releases.
dataDir loosely based on maxima is incorrect and does not
work in all situations such as disks in the following order
- xl.json migration to xl.meta there may be partial xl.json's
leftover if some disks are not yet connected when the disk
is yet to come up, since xl.json mtime and xl.meta is
same the dataDir maxima doesn't work properly leading to
quorum issues.
- its also possible that XLV1 might be true among the disks
available, make sure to keep FileInfo based on common quorum
and skip unexpected disks with the older data format.
Also, this PR tests upgrade from older to a newer release if the
data is readable and matches the checksum.
NOTE: this is just initial work we can build on top of this to do further tests.
there is a corner case where the new check
doesn't work where dataDir has changed, especially
when xl.json -> xl.meta healing happens, if some
healing is partial this can make certain backend
files unreadable.
This PR fixes and updates unit-tests
This unit allows users to limit the maximum number of noncurrent
versions of an object.
To enable this rule you need the following *ilm.json*
```
cat >> ilm.json <<EOF
{
"Rules": [
{
"ID": "test-max-noncurrent",
"Status": "Enabled",
"Filter": {
"Prefix": "user-uploads/"
},
"NoncurrentVersionExpiration": {
"MaxNoncurrentVersions": 5
}
}
]
}
EOF
mc ilm import myminio/mybucket < ilm.json
```
currently getReplicationConfig() failure incorrectly
returns error on unexpected buckets upon upgrade, we
should always calculate usage as much as possible.
listing can fail and it is allowed to be retried,
instead of returning right away return an error at
the end - heal the rest of the buckets and objects,
and when we are retrying skip the buckets that
are already marked done by using the tracked buckets.
fixes#12972
- Go might reset the internal http.ResponseWriter() to `nil`
after Write() failure if the go-routine has returned, do not
flush() such scenarios and avoid spurious flushes() as
returning handlers always flush.
- fix some racy tests with the console
- avoid ticker leaks in certain situations
Existing:
```go
type xlMetaV2 struct {
Versions []xlMetaV2Version `json:"Versions" msg:"Versions"`
}
```
Serialized as regular MessagePack.
```go
//msgp:tuple xlMetaV2VersionHeader
type xlMetaV2VersionHeader struct {
VersionID [16]byte
ModTime int64
Type VersionType
Flags xlFlags
}
```
Serialize as streaming MessagePack, format:
```
int(headerVersion)
int(xlmetaVersion)
int(nVersions)
for each version {
binary blob, xlMetaV2VersionHeader, serialized
binary blob, xlMetaV2Version, serialized.
}
```
xlMetaV2VersionHeader is <= 30 bytes serialized. Deserialized struct
can easily be reused and does not contain pointers, so efficient as a
slice (single allocation)
This allows quickly parsing everything as slices of bytes (no copy).
Versions are always *saved* sorted by modTime, newest *first*.
No more need to sort on load.
* Allows checking if a version exists.
* Allows reading single version without unmarshal all.
* Allows reading latest version of type without unmarshal all.
* Allows reading latest version without unmarshal of all.
* Allows checking if the latest is deleteMarker by reading first entry.
* Allows adding/updating/deleting a version with only header deserialization.
* Reduces allocations on conversion to FileInfo(s).
This will help other projects like `health-analyzer` to verify that the
struct was indeed populated by the minio server, and is not
default-populated during unmarshalling of the JSON.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
legacy objects in 'xl.json' after upgrade, should have
following sequence of events - bucket should have versioning
enabled and the object should have been overwritten with
another version of an object.
this situation was not handled, which would lead to older
objects to stay perpetually with "legacy" dataDir, however
these objects were readable by all means - there weren't
converted to newer format.
This PR fixes this situation properly.
Add a new Prometheus metric for bucket replication latency
e.g.:
minio_bucket_replication_latency_ns{
bucket="testbucket",
operation="upload",
range="LESS_THAN_1_MiB",
server="127.0.0.1:9001",
targetArn="arn:minio:replication::45da043c-14f5-4da4-9316-aba5f77bf730:testbucket"} 2.2015663e+07
Co-authored-by: Klaus Post <klauspost@gmail.com>
container limits would not be properly honored in
our current implementation, mem.VirtualMemory()
function only reads /proc/meminfo which points to
the host system information inside the container.
This feature is useful in situations when console is exposed
over multiple intranent or internet entities when users are
connecting over local IP v/s going through load balancer.
Related console work was merged here
373bfbfe3f
- remove some duplicated code
- reported a bug, separately fixed in #13664
- using strings.ReplaceAll() when needed
- using filepath.ToSlash() use when needed
- remove all non-Go style comments from the codebase
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
- add checks such that swapped disks are detected
and ignored - never used for normal operations.
- implement `unrecognizedDisk` to be ignored with
all operations returning `errDiskNotFound`.
- also add checks such that we do not load unexpected
disks while connecting automatically.
- additionally humanize the values when printing the errors.
Bonus: fixes handling of non-quorum situations in
getLatestFileInfo(), that does not work when 2 drives
are down, currently this function would return errors
incorrectly.
creating service accounts is implicitly enabled
for all users, this PR however adds support to
reject creating service accounts, with an explicit
"Deny" policy.
On first list resume or when specifying a custom markers entries could be missed in rare cases.
Do conservative truncation of entries when forwarding.
Replaces #13619
If a given MinIO config is dynamic (can be changed without restart),
ensure that it can be reset also without restart.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
when `MINIO_CACHE_COMMIT` is set.
- `writeback` caching applies only to single
uploads. When cache commit mode is
`writeback`, default multipart caching to be
synchronous.
- Add writethrough caching for single uploads
This commit makes the MinIO server behavior more consistent
w.r.t. key usage verification.
When MinIO verifies the client certificates it also checks
that the client certificate is valid of client authentication
(or any (i.e. wildcard) usage).
However, the MinIO server used to not verify the client key usage
when client certificate verification was disabled.
Now, the MinIO server verifies the client key usage even when
client certificate verification has been disabled. This makes
the MinIO behavior more consistent from a client's perspective.
Now, a client certificate has to be valid for client authentication
in all cases.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Bonus: if runs have PUT higher then capture it anyways
to display an unexpected result, which provides a way
to understand what might be slowing things down on the
system.
For example on a Data24 WDC setup it is clearly visible
there is a bug in the hardware.
```
./mc admin speedtest wdc/
⠧ Running speedtest (With 64 MiB object size, 32 concurrency) PUT: 31 GiB/s GET: 24 GiB/s
⠹ Running speedtest (With 64 MiB object size, 48 concurrency) PUT: 38 GiB/s GET: 24 GiB/s
MinIO 2021-11-04T06:08:33Z, 6 servers, 48 drives
PUT: 38 GiB/s, 605 objs/s
GET: 24 GiB/s, 383 objs/s
```
Reads are almost 14GiB/sec slower than Writes which
is practically not possible.
This reverts commit 091a7ae359.
- Ensure all actions accessing storage lock properly.
- Behavior change: policies can be deleted only when they
are not associated with any active credentials.
Also adds fix for accidental canned policy removal that was present in the
reverted version of the change.
Windows users often click on the binary without
knowing MinIO is a command-line tool and should be
run from a terminal. Throw a message to guide them
on what to do.
Co-authored-by: Klaus Post <klauspost@gmail.com>
Borrowed idea from Go's usage of this
optimization for ReadFrom() on client
side, we should re-use the 32k buffers
io.Copy() allocates for generic copy
from a reader to writer.
the performance increase for reads for
really tiny objects is at this range
after this change.
> * Fastest: +7.89% (+1.3 MiB/s) throughput, +7.89% (+1308.1) obj/s
- Ensure all actions accessing storage lock properly.
- Behavior change: policies can be deleted only when they
are not associated with any active credentials.
- The race happens with a goroutine that refreshes IAM cache data from storage.
- It could lead to deleted users re-appearing as valid live credentials.
- This change also causes CI to run tests without a race flag (in addition to
running it with).
deleting collection of versions belonging
to same object, we can avoid re-reading
the xl.meta from the disk instead purge
all the requested versions in-memory,
the tradeoff is to allocate a map to de-dup
the versions, allow disks to be read only
once per object.
additionally reduce the data transfer between
nodes by shortening msgp data values.
- combine similar looking functionalities into single
handlers, and remove unnecessary proxying of the
requests at handler layer.
- remove bucket forwarding handler as part of default setup
add it only if bucket federation is enabled.
Improvements observed for 1kiB object reads.
```
-------------------
Operation: GET
Operations: 4538555 -> 4595804
* Average: +1.26% (+0.2 MiB/s) throughput, +1.26% (+190.2) obj/s
* Fastest: +4.67% (+0.7 MiB/s) throughput, +4.67% (+739.8) obj/s
* 50% Median: +1.15% (+0.2 MiB/s) throughput, +1.15% (+173.9) obj/s
```
Preemptively disable AVX512 until https://github.com/golang/go/issues/49233 has been resolved.
This potentially affects reedsolomon, simdjson, sha256-simd, md5-simd packages.
Init order requires a separate package since main itself is initialized last, but imports are initialized in the order they are imported from main (confirmed).
Removes RLock/RUnlock for updating metadata,
since we already take a write lock to update
metadata, this change removes reading of xl.meta
as well as an additional lock, the performance gain
should increase 3x theoretically for
- PutObjectRetention
- PutObjectLegalHold
This optimization is mainly for Veeam like
workloads that require a certain level of iops
from these API calls, we were losing iops.
read/writers are not concurrent in handlers
and self contained - no need to use atomics on
them.
avoids unnecessary contentions where it's not
required.
Logger targets were not race protected against concurrent updates from for example `HTTPConsoleLoggerSys`.
Restrict direct access to targets and make slices immutable so a returned slice can be processed safely without locks.
various situations where the client is retrying the request
server going through shutdown might incorrectly send 403
which is a non-retriable error, this PR allows for clients
when they retry an attempt to go to another healthy pod
or server in a distributed cluster - assuming it is a properly
load-balanced setup.
As we use etcd's watch interface, we do not need the
network notifications as they are no-ops anyway.
Bonus: Remove globalEtcdClient global usage in IAM
IAMSys is a higher-level object, that should not be called by the lower-level
storage API interface for IAM. This is to prepare for further improvements in
IAM code.
etcd operations, get/put/delete, should be logged when failed
with errors other than not found error. It will make it easier to
see connections issues from MinIO to etcd.
also simplify readerLocks to be just like
writeLocks, DRWMutex() is never shared
and there are order guarantees that need
for such a thing to work for RLock's
3DES is enabled by default in Golang, this commit will use
tls.CipherSuites() which returns all ciphers excluding those with
security issues, such as 3DES.
Refresh was doing a linear scan of all locked resources. This was adding
up to significant delays in locking on high load systems with long
running requests.
Add a secondary index for O(log(n)) UID -> resource lookups.
Multiple resources are stored in consecutive strings.
Bonus fixes:
* On multiple Unlock entries unlock the write locks we can.
* Fix `expireOldLocks` skipping checks on entry after expiring one.
* Return fast on canTakeUnlock/canTakeLock.
* Prealloc some places.
We do not reliably know the length of compressed data, including headers.
Request until the end-of-stream. Results will still be properly truncated.
Fixes#13441
Testing with `mc sql --compression BZIP2 --csv-input "rd=\n,fh=USE,fd=;" --query="select COUNT(*) from S3Object" local2/testbucket/nyc-taxi-data-10M.csv.bz2`
Before 96.98s, after 10.79s. Uses about 70% CPU while running.
offset+length should match the Size() of the individual parts
return 'errFileCorrupt' otherwise, to trigger healing of the individual
parts do not error out prematurely when healing such bitrot's upon
successful parts being written to the client.
another issue this PR fixes is to not return and error to
the client if we have just triggered a heal on a specific
part of the object, instead continue to read all the content
and let the heal happen asynchronously later.
Looks like policy restriction was not working properly
for normal users when they are not svc or STS accounts.
- svc accounts are now properly fixed to get
right permissions when its inherited, so
we do not have to set 'owner = true'
- sts accounts have always been using right
permissions, do not need an explicit lookup
- regular users always have proper policy mapping
- avoids relying in listQuorum from the underlying listObjects()
and potentially missing entries if any.
- avoid the entire merging logic etc, listing raw set by set
and loading whatever is found is cleaner when dealing with
a large cluster for IAM metadata.
* fix: disallow invalid x-amz-security-token for root credentials
fixes#13335
This was a regression added in #12947 when this part of the
code was refactored to avoid privilege issues with service
accounts with session policy.
Bonus:
- fix: AssumeRoleWithCertificate policy mapping and reload
AssumeRoleWithCertificate was not mapping to correct
policies even after successfully generating keys, since
the claims associated with this API were never looked up
properly. Ensure that policies are set appropriately.
- GetUser() API was not loading policies correctly based
on AccessKey based mapping which is true with OpenID
and AssumeRoleWithCertificate API.
with some broken clients allow non-strict validation
of sha256 when ContentLength > 0, it has been found in
the wild some applications that need this behavior. This
shall be only allowed if `--no-compat` is used.
change credentials handling such that
prefer MINIO_* envs first if they work,
if not fallback to AWS credentials. If
they fail we fail to start anyways.
This change allows a set of MinIO sites (clusters) to be configured
for mutual replication of all buckets (including bucket policies, tags,
object-lock configuration and bucket encryption), IAM policies,
LDAP service accounts and LDAP STS accounts.
LDAP TLS dialer shouldn't be strict with ServerName, there
maybe many certs talking to common DNS endpoint it is
better to allow Dialer to choose appropriate public cert.
In (erasureServerPools).MakeBucketWithLocation deletes the created
buckets if any set returns an error.
Add `NoRecreate` option, which will not recreate the bucket
in `DeleteBucket`, if the operation fails.
Additionally use context.Background() for operations we always want to be performed.
bucket deletes should purge entire bucket metadata
appropriately, use rename() to move the metadata files
to trash folder,
for dangling buckets instead of doing recursive deletes,
rename such buckets to trash folder as well.
Bonus: reduce retry duration for listing to 200ms
additionally optimize for IP only setups, avoid doing
unnecessary lookups if the Dial addr is an IP.
allow support for multiple listeners on same socket,
this is mainly meant for future purposes.
DeleteMarkers were unreadable if they had quorum based
guarantees, this PR tries to fix this behavior appropriately.
DeleteMarkers with sufficient should be allowed and the
return error should be accordingly with or without version-id.
This also allows for overwrites which may not be possible
in a multi-pool setup.
fixes#12787
it would seem like using `bufio.Scan()` is very
slow for heavy concurrent I/O, ie. when r.Body
is slow , instead use a proper
binary exchange format, to marshal and unmarshal
the LockArgs datastructure in a cleaner way.
this PR increases performance of the locking
sub-system for tiny repeated read lock requests
on same object.
```
BenchmarkLockArgs
BenchmarkLockArgs-4 6417609 185.7 ns/op 56 B/op 2 allocs/op
BenchmarkLockArgsOld
BenchmarkLockArgsOld-4 1187368 1015 ns/op 4096 B/op 1 allocs/op
```
This PR brings two optimizations mainly
for page-cache build-up and how to avoid
getting OOM killed in the process. Although
these memories are reclaimable Linux is not
fast enough to reclaim them as needed on a
very busy system. fadvise is a system call
implemented in Linux to advise page-cache to
avoid overload as we get significant amount
of requests on the server.
- FADV_SEQUENTIAL tells that all I/O from now
is going to be sequential, allowing for more
resposive throughput.
- FADV_NOREUSE tells kernel to start removing
things for this 'fd' from page-cache.
An endpoint can be empty when a disk is offline or something
wrong with it. Avoid it by filling erasureSets.endpointStrings
with values from arguments.
This was a regression introduced in '14bb969782'
this has the potential to cause corruption when
there are concurrent overwrites attempting to update
the content on the namespace.
This PR adds a situation where PutObject(), CopyObject()
compete properly for the same locks with NewMultipartUpload()
however it ends up turning off competing locks for the actual
object with GetObject() and DeleteObject() - since they do not
compete due to concurrent I/O on a versioned bucket it can lead
to loss of versions.
This PR fixes this bug with multi-pool setup with replication
that causes corruption of inlined data due to lack of competing
locks in a multi-pool setup.
Instead CompleteMultipartUpload holds the necessary
locks when finishing the transaction, knowing the exact
location of an object to schedule the multipart upload
doesn't need to compete in this manner, a pool id location
for existing object.
This reverts commit 91567ba916.
Revert because the error was incorrectly converted, there are
callers that rely on errConfigNotFound and it also took away
the migration code.
Instead the correct fix is PutBucketTaggingHandler() which
is already added.
also remove HealObjects() code from dataScanner running another
listing from the data-scanner is super in-efficient and in-fact
this code is redundant since we already attempt to heal all
dangling objects anyways.
- Supports object locked buckets that require
PutObject() to set content-md5 always.
- Use SSE-S3 when S3 gateway is being used instead
of SSE-KMS for auto-encryption.
This PR however also proceeds to simplify the loading
of various subsystems such as
- globalNotificationSys
- globalTargetSys
converge them directly into single bucket metadata sys
loader, once that is loaded automatically every other
target should be loaded and configured properly.
fixes#13252
DeleteObject() on existing objects before `xl.json` to
`xl.meta` change were not working, not sure when this
regression was added. This PR fixes this properly.
Also this PR ensures that we perform rename of xl.json
to xl.meta only during "write" phase of the call i.e
either during Healing or PutObject() overwrites.
Also handles few other scenarios during migration where
`backendEncryptedFile` was missing deleteConfig() will
fail with `configNotFound` this case was not ignored,
which can lead to failure during upgrades.
once we have competed for locks, verify if the
context is still valid - this is to ensure that
we do not start readdir() or read() calls on the
drives on canceled connections.
This commit brings two locks instead of single lock for
WalkDir() calls on top of c25816eabc.
The main reason is to avoid contention between readMetadata()
and ListDir() calls, ListDir() can take time on prefixes that
are huge for readdir() but this shouldn't end up blocking
all readMetadata() operations, this allows for more room for
I/O while not overly penalizing all listing operations.
This commit fixes an issue in the `AssumeRoleWithCertificate`
handler.
Before clients received an error when they send
a chain of X.509 certificates (their client certificate as
well as intermediate / root CAs).
Now, client can send a certificate chain and the server
will only consider non-CA / leaf certificates as possible
client certificate candidates. However, the client still
can only send one certificate.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
json.Unmarshal expects a pointer receiver, otherwise
kms.Context unmarshal fails with lack of pointer receiver,
this becomes complicated due to type aliasing over
map[string]string - fix it properly.
When unable to load existing metadata new versions
would not be written. This would leave objects in a
permanently unrecoverable state
Instead, start with clean metadata and write the incoming data.
Some identity providers like GitLab do not provide
information about group membership as part of the
identity token claims. They only expose it via OIDC compatible
'/oauth/userinfo' endpoint, as described in the OpenID
Connect 1.0 sepcification.
But this of course requires application to make sure to add
additional accessToken, since idToken cannot be re-used to
perform the same 'userinfo' call. This is why this is specialized
requirement. Gitlab seems to be the only OpenID vendor that requires
this support for the time being.
fixes#12367
Don't perform an independent evaluation of inlining, but mirror the decision made when uploading the object.
Leads to some objects being inlined or not based on new metrics. Instead respect previous decision.
Replication was not working properly for encrypted
objects in single PUT object for preserving etag,
We need to make sure to preserve etag such that replication
works properly and not gets into infinite loops of copying
due to ETag mismatches.
This will allow objects to relinquish read lock held during
replication earlier if the target is known to be down
without waiting for connection timeout when replication
is attempted.
Stop async listing if we have not heard back from the client for 3 minutes.
This will stop spending resources on async listings when they are unlikely to get used.
If the client returns a new listing will be started on the second request.
Stop saving cache metadata to disk. It is cleared on restarts anyway. Removes all
load/save functionality
This commit adds a new STS API for X.509 certificate
authentication.
A client can make an HTTP POST request over a TLS connection
and MinIO will verify the provided client certificate, map it to an
S3 policy and return temp. S3 credentials to the client.
So, this STS API allows clients to authenticate with X.509
certificates over TLS and obtain temp. S3 credentials.
For more details and examples refer to the docs/sts/tls.md
documentation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds the TLS 1.3 ciphers to the list of
supported ciphers. Now, clients can connect to MinIO
using TLS 1.3
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Azure storage SDK uses http.Request feature which panics when the
request contains r.Form popuplated.
Azure gateway code creates a new request, however it modifies the
transport to add our metrics code which sets Request.Form during
shouldMeterRequest() call.
This commit simplifies shouldMeterRequest() to avoid setting
request.Form and avoid the crash.
console service should be shutdown last once all shutdown
sequences are complete, this is to ensure that we do not
prematurely kill the server before it cleans up the
`.minio.sys/tmp/uuid` folder.
NOTE: this only applies to NAS gateway setup.
Use a single allocation for reading the file, not the growing buffer of `io.ReadAll`.
Reuse the write buffer if we can when writing metadata in RenameData.
* reduce extra getObjectInfo() calls during ILM transition
This PR also changes expiration logic to be non-blocking,
scanner is now free from additional costs incurred due
to slower object layer calls and hitting the drives.
* move verifying expiration inside locks
A multi resources lock is a single lock UID with multiple associated
resources. This is created for example by multi objects delete
operation. This commit changes the behavior of Refresh() to iterate over
all locks having the same UID and refresh them.
Bonus: Fix showing top locks for multi delete objects
#11878 added "keepHTTPResponseAlive" to CreateFile requests.
The problem is that it will begin writing to the response before the
body is read after 10 seconds. This will abort the writes on the
client-side, since it assumes the server has received what it wants.
The proposed solution here is to monitor the completion of the body
before beginning to send keepalive pings.
Fixes observed high number of goroutines stuck in `io.Copy` in
`github.com/minio/minio/cmd.(*xlStorage).CreateFile` and
`(*storageRESTClient).CreateFile` stuck in `http.DrainBody`.
In the event when a lock is not refreshed in the cluster, this latter
will be automatically removed in the subsequent cleanup of non
refreshed locks routine, but it forgot to clean the local server,
hence having the same weird stale locks present.
This commit will remove the lock locally also in remote nodes, if
removing a lock from a remote node will fail, it will be anyway
removed later in the locks cleanup routine.
Currently in master this can cause existing
parent users to stop working and lead to
credentials getting overwritten.
```
~ mc admin user add alias/ minio123 minio123456
```
```
~ mc admin user svcacct add alias/ minio123 \
--access-key minio123 --secret-key minio123456
```
This PR rejects all such scenarios.
Faster healing as well as making healing more
responsive for faster scanner times.
also fixes a bug introduced in #13079, newly replaced
disks were not healing automatically.
- remove sourceCh usage from healing
we already have tasks and resp channel
- use read locks to lookup globalHealConfig
- fix healing resolver to pick candidates quickly
that need healing, without this resolver was
unexpectedly skipping.
healObject() should be non-blocking to ensure
that scanner is not blocked for a long time,
this adversely affects performance of the scanner
and also affects the way usage is updated
subsequently.
This PR allows for a non-blocking behavior for
healing, dropping operations that cannot be queued
anymore.
Synchronize bucket cycles so it is much more
likely that the same prefixes will be picked up
for scanning.
Use the global bloom filter cycle for that.
Bump bloom filter versions to clear those.
The intention is to list values of sys config that can potentially
impact the performance of minio.
At present, it will return max value configured for rlimit
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
proceed to heal the cluster when all the
drives in a set have failed, this is extremely
rare occurrence but even if it happens we allow
the cluster to be functional.
A recent regression caused new disks not being re-formatted. In the old
code, a disk needed be 'online' to be chosen to be formatted but the
disk has to be already formatted for XL storage IsOnline() function to
return true.
It is enough to check if XL storage is nil or not if we want to avoid
formatting root disks.
Co-authored-by: Anis Elleuch <anis@min.io>
markRootDisksAsDown() relies on disk info even if the
disk is unformatted. Therefore, we should always return
DiskInfo data even when DiskInfo storage API returns
errUnformattedDisk
When reading `TrafficMeter` values, there was a value receiver.
This means that receivers are copied unsafely when invoked.
Fixes race seen with `-race` build.
`mc admin heal` command will show servers/disks tolerance, for that
purpose, you need to know the number of parity disks for each storage
class.
Parity is always the same in all pools.
prefixes at top level create such as
```
~ mc mb alias/bucket/prefix
```
The prefix/ incorrect appears as prefix__XL_DIR__/
in the accountInfo output, make sure to trim '__XL_DIR__'
Objects uploaded in this format for example
```
mc cp /etc/hosts alias/bucket/foo/bar/xl.meta
mc ls -r alias/bucket/foo/bar
```
Won't list the object, handle this scenario.
Ensure that one call will succeed and others will serialize
Example failure without code in place:
```
bucket-policy-handlers_test.go:120: unexpected error: cmd.InsufficientWriteQuorum: Storage resources are insufficient for the write operation doz2wjqaovp5kvlrv11fyacowgcvoziszmkmzzz9nk9au946qwhci4zkane5-1/
bucket-policy-handlers_test.go:120: unexpected error: cmd.InsufficientWriteQuorum: Storage resources are insufficient for the write operation doz2wjqaovp5kvlrv11fyacowgcvoziszmkmzzz9nk9au946qwhci4zkane5-1/
bucket-policy-handlers_test.go:135: want 1 ok, got 0
```
We are observing heavy system loads, potentially
locking the system up for periods when concurrent
listing operations are performed.
We place a per-disk lock on walk IO operations.
This will minimize the impact of concurrent listing
operations on the entire system and de-prioritize
them compared to other operations.
Single list operations should remain largely unaffected.
Some applications albeit poorly written rather than using headObject
rely on listObjects to check for existence of object, this unusual
request always has prefix=(to actual object) and max-keys=1
handle this situation specially such that we can avoid readdir()
on the top level parent to avoid sorting and skipping, ensuring
that such type of listObjects() always behaves similar to a
headObject() call.
this addresses a regression from #12984
which only addresses flat key from single
level deep at bucket level.
added extra tests as well to cover all
these scenarios.
- deletes should always Sweep() for tiering at the
end and does not need an extra getObjectInfo() call
- puts, copy and multipart writes should conditionally
do getObjectInfo() when tiering targets are configured
- introduce 'TransitionedObject' struct for ease of usage
and understanding.
- multiple-pools optimization deletes don't need to hold
read locks verifying objects across namespace and pools.
baseDir is empty if the top level prefix does not
end with `/` this causes large recursive listings
without any filtering, to fix this filtering make
sure to set the filter prefix appropriately.
also do not navigate folders at top level that do
not match the filter prefix, entries don't need
to match prefix since they are never prefixed
with the prefix anyways.
This ensures that the deprecation warning is shown when the setting is actually
used in a configuration - instead of showing up whenever LDAP is enabled.
The previous code removes SVC/STS accounts for ldap users that do not
exist anymore in LDAP server. This commit will actually re-evaluate
filter as well if it is changed and remove all local SVC/STS accounts
beloning to the ldap user if the latter is not eligible for the
search filter anymore.
For example: the filter selects enabled users among other criteras in
the LDAP database, if one ldap user changes his status to disabled
later, then associated SVC/STS accounts will be removed because that user
does not meet the filter search anymore.
Traffic metering was not protected against concurrent updates.
```
WARNING: DATA RACE
Read at 0x00c02b0dace8 by goroutine 235:
github.com/minio/minio/cmd.setHTTPStatsHandler.func1()
d:/minio/minio/cmd/generic-handlers.go:360 +0x27d
net/http.HandlerFunc.ServeHTTP()
...
Previous write at 0x00c02b0dace8 by goroutine 994:
github.com/minio/minio/internal/http/stats.(*IncomingTrafficMeter).Read()
d:/minio/minio/internal/http/stats/http-traffic-recorder.go:34 +0xd2
```
The intention is to provide status of any sys services that can
potentially impact the performance of minio.
At present, it will return information about the `selinux` service
(not-installed/disabled/permissive/enforcing)
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
This happens because of a change added where any sub-credential
with parentUser == rootCredential i.e (MINIO_ROOT_USER) will
always be an owner, you cannot generate credentials with lower
session policy to restrict their access.
This doesn't affect user service accounts created with regular
users, LDAP or OpenID
Use `readMetadata` when reading version
information without data requested.
Reduces IO on inlined data.
Bonus: Inline compressed data as well when
compression is enabled.
- avoid extra lookup for 'xl.meta' since we are
definitely sure that it doesn't exist.
- use this in newMultipartUpload() as well
- also additionally do not write with O_DSYNC
to avoid loading the drives, instead create
'xl.meta' for listing operations without
O_DSYNC since these are ephemeral objects.
- do the same with newMultipartUpload() since
it gets synced when the PutObjectPart() is
attempted, we do not need to tax newMultipartUpload()
instead.
removes unexpected features from regular putObject() such as
- increasing parity when disks are down, avoids
a lot of DiskInfo() calls.
- triggering MRF for metacache objects
if disks are offline
- avoiding renames from temporary location
to actual namespace, not needed since
metacache files are unique.
Disregard interfaces that are down when selecting bind addresses
Windows often has a number of disabled NICs used for VPN and other services.
This often causes minio to select an address for contacting the console that is on a disabled (virtual) NIC.
This checks if the interface is up before adding it to the pool on Windows.
Before, the gateway will complain that it found KMS configured in the
environment but the gateway mode does not support encryption. This
commit will allow starting of the gateway but ensure that S3 operations
with encryption headers will fail when the gateway doesn't support
encryption. That way, the user can use etcd + KMS and have IAM data
encrypted in the etcd store.
Co-authored-by: Anis Elleuch <anis@min.io>
improvements include
- skip IPv6 correctly
- do not set default value for
MINIO_SERVER_URL, let it be
configured if not use local IPs
Bonus:
- In healing return error from listPathRaw()
- update console to v0.8.3
Remote caches were not returned correctly, so they would not get updated on save.
Furthermore make some tweaks for more reliable updates.
Invalidate bloom filter to ensure rescan.
we will allow situations such as
```
a/b/1.txt
a/b
```
and
```
a/b
a/b/1.txt
```
we are going to document that this usecase is
not supported and we will never support it, if
any application does this users have to delete
the top level parent to make sure namespace is
accessible at lower level.
rest of the situations where the prefixes get
created across sets are supported as is.
its possible that some multipart uploads would have
uploaded only single parts so relying on `len(o.Parts)`
alone is not sufficient, we need to look for ETag
pattern to be absolutely sure.
Some incorrect setups might have multiple audiences
where they are trying to use a single authentication
endpoint for multiple services.
Nevertheless OpenID spec allows it to make it
even more confusin for no good reason.
> It MUST contain the OAuth 2.0 client_id of the
> Relying Party as an audience value. It MAY also
> contain identifiers for other audiences. In the
> general case, the aud value is an array of case
> sensitive strings. In the common special case
> when there is one audience, the aud value MAY
> be a single case sensitive string.
fixes#12809
this healing optimization caused multiple
regressions in healing
- delete-markers incorrectly missing
heal and returning incorrect healing
results to client.
- missing individual 'parts' such
as for restored object or simply
for all objects just missing few parts.
This optimization is not necessary, we
should proceed to verify all cases possible
not just when metadata is inconsistent.
destination path and old path will be similar
when healing occurs, this can lead to healed
parts being again purged leading to always an
inconsistent state on an object which might
further cause reduction in quorum eventually.
delete-markers missing on drives were
not healed due to few things
disksWithAllParts() does not know-how
to deal with delete markers, add support
for that.
fixes#12787
- delete-markers are incorrectly reported
as corrupt with wrong data sent to client
'mc admin heal -r' on objects with delete
marker will report as 'grey' incorrectly.
- do not heal delete-markers during HeadObject()
this can lead to inconsistent order of heals
on the object, although this is not an issue
in terms of order of versions it is rather
simpler to keep the same order on all drives.
- defaultHealResult() should handle 'err == nil'
case such that valid cases should be handled
as 'drive' status OK.
- remove use of getOnlineDisks() instead rely on fallbackDisks()
when disk return errors like diskNotFound, unformattedDisk
use other fallback disks to list from, instead of paying the
price for checking getOnlineDisks()
- optimize getDiskID() further to avoid large write locks when
looking formatLastCheck time window
This new change allows for a more relaxed fallback for listing
allowing for more tolerance and also eventually gain more
consistency in results even if using '3' disks by default.
When configured in Lookup Bind mode, the server now periodically queries the
LDAP IDP service to find changes to a user's group memberships, and saves this
info to update the access policies for all temporary and service account
credentials belonging to LDAP users.
Add a new goroutine file which has another printing format. We need it
to see how much time each goroutine was blocked. Easier to detect stops.
Co-authored-by: Anis Elleuch <anis@min.io>
- Show notice when `MINIO_IDENTITY_LDAP_STS_EXPIRY` or the
corresponding to the configuration option is used at server startup.
- Once support is removed, the default will be fixed at 1 hour.
- Users may specify expiry directly in the STS API.
- Update docs and help message
- Adds example in ldap.go to configure expiry in STS API.
when TLS is configured using IPs directly
might interfere and not work properly when
the server is configured with TLS certs but
the certs only have domain certs.
Also additionally allow users to specify
a public accessible URL for console to talk
to MinIO i.e `MINIO_SERVER_URL` this would
allow them to use an external ingress domain
to talk to MinIO. This internally fixes few
problems such as presigned URL generation on
the console UI etc.
This needs to be done additionally for any
MinIO deployments that might have a much more
stricter requirement when running in standalone
mode such as FS or standalone erasure code.
This method is used to add expected expiration and transition time
for an object in GET/HEAD Object response headers.
Also fixed bugs in lifecycle.PredictTransitionTime and
getLifecycleTransitionTier in handling current and
non-current versions.
This allows remote bucket admin to identify the origin of transitioned
objects by simply inspecting the object prefixes.
e.g let's take a remote tier TIER-1 pointing to a remote bucket (prefix)
testbucket/testprefix-1. The remote bucket admin can list all transitioned objects
from a MinIO deployment identified by '2e78e906-1c5d-4f94-8689-9df44cafde39' and
source bucket 'mybucket' like so,
```
$ ./mc ls -r minio-tier-target/testbucket/testprefix-1/2e78e906-1c5d-4f94-8689-9df44cafde39/mybucket/
[2021-07-12 17:15:50 PDT] 160B 48/fb/48fbc0e6-3a73-458b-9337-8e722c619ca4
[2021-07-12 16:58:46 PDT] 160B 7d/1c/7d1c96bd-031a-48d4-99ea-b1304e870830
```
In case of non-distributed setup, if the server start command contains a
`--console-address` flag and its value contains a hostname, it is not
getting anonymized.
Fixed by replacing the console host also with `server1`
This commit gathers MRF metrics from
all nodes in a cluster and return it to the caller. This will show information about the
number of objects in the MRF queues
waiting to be healed.
docker-entrypoint.sh will load configuration values from
'config.env' file, this is useful when MinIO is deployed in Kubernetes
environments and want to avoid reading secrets from environment
variables
Signed-off-by: Lenin Alevski <alevsk.8772@gmail.com>
In case of replication healing, we always store completed status in the
object metadata, which is wrong because replication could fail in the
further retries.
Ensure that hostnames / ip addresses are not printed in the subnet
health report. Anonymize them by replacing them with `servern` where `n`
represents the position of the server in the pool.
This is done by building a `host anonymizer` map that maps every
possible value containing the host e.g. host, host:port,
http://host:port, etc to the corresponding anonymized name and using
this map to replace the values at the time of health report generation.
A different logic is used to anonymize host names in the `procinfo`
data, as the host names are part of an ellipses pattern in the process
start command. Here we just replace the prefix/suffix of the ellipses
pattern with their hashes.
Gzip responses if appropriate, except GetObject requests.
List reponses has an almost 10:1 compression ratio with no
measurable slowdown (in fact it seems a bit faster).
- ParentUser for OIDC auth changed to `openid:`
instead of `jwt:` to avoid clashes with variable
substitution
- Do not pass in random parents into IsAllowed()
policy evaluation as it can change the behavior
of looking for correct policies underneath.
fixes#12676fixes#12680
with console addition users cannot login with
root credentials without etcd persistent layer,
allow a dummy store such that such functionalities
can be supported when running as non-persistent
manner, this enables all calls and operations.
MinIO might be running inside proxies, and
console while being on another port might not be
reachable on a specific port behind such proxies.
For such scenarios customize the redirect URL
such that console can be redirected to correct
proxy endpoint instead.
fixes#12661
Download files from *any* bucket/path as an encrypted zip file.
The key is included in the response but can be separated so zip
and the key doesn't have to be sent on the same channel.
Requires https://github.com/minio/pkg/pull/6
Additional support for vendor-specific admin API
integrations for OpenID, to ensure validity of
credentials on MinIO.
Every 5minutes check for validity of credentials
on MinIO with vendor specific IDP.
DiskInfo() calls can stagger and wait if run
serially timing out 10secs per drive, to avoid
this lets check DiskInfo in parallel to avoid
delays when nodes get disconnected.
Fixes brought forward from https://github.com/minio/minio/pull/12605
Fixes resolution when an object is in prefix of another and one zone returns the directory and another the object.
Fixes resolution on single entries that arrive first, so resolution doesn't depend on order.
auditLog should be attempted right before the
return of the function and not multiple times
per function, this ensures that we only trigger
it once per function call.
if object was uploaded with multipart. This is to ensure that
GetObject calls with partNumber in URI request parameters
have same behavior on source and replication target.
also do not incorrectly double count
objExists unless its selected and it
matches with previous entry.
Bonus: change listQuorum to match with
AskDisks to ensure that we atleast by
default choose all the "drives" that
we asked is consistent.
Bonus: remove kms_kes as sub-system, since its ENV only.
- also fixes a crash with etcd cluster without KMS
configured and also if KMS decryption is missing.
backend-encrypted doesn't need to be explicitly healed anymore
since this file is deleted upon upgrade and migration to the
KMS based encrypted config/IAM credentials.
goroutine 1 [running]:
runtime/internal/atomic.panicUnaligned()
/usr/local/go/src/runtime/internal/atomic/unaligned.go:8 +0x24
golang doc:
// BUG(rsc): On x86-32, the 64-bit functions use instructions unavailable before the Pentium MMX.
//
// On non-Linux ARM, the 64-bit functions use instructions unavailable before the ARMv6k core.
//
// On ARM, x86-32, and 32-bit MIPS,
// it is the caller's responsibility to arrange for 64-bit
// alignment of 64-bit words accessed atomically. The first word in a
// variable or in an allocated struct, array, or slice can be relied upon to be
// 64-bit aligned.
Create write lock on PutObject and CopyObject when on multi-pool setup.
Use the same lock as NewMultipartUpload so all creation calls share the same lock.
This feature also changes the default port where
the browser is running, now the port has moved
to 9001 and it can be configured with
```
--console-address ":9001"
```
When no results are sent `result.end` is never sent, so the list becomes hot until the list is full.
Break immediately when channel is closed.
Fixes#12518
previous PR incorrectly changed rename() from
tmp to -> tmp/.trash/uuid, since it is self
referential - to clear this up make sure its
renamed to a separate folder and deleted
in background - just like before.
Each multipart upload is holding a read lock for the entire upload
duration of each part.
This makes it impossible for other parts to complete until all currently
uploading parts have released their locks.
It will also make it impossible for new parts to start as long as the
write lock is still being requested, essentially deadlocking uploads
until all that may have been granted a read lock has been completed.
Refactor to only hold the upload id lock while reading and writing
the metadata, but hold a part id lock while the part is being uploaded.
This commit adds an admin API for fetching
the KMS status information (default key ID, endpoints, ...).
With this commit the server exposes REST endpoint:
```
GET <admin-api>/kms/status
```
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Web Handlers can generate STS tokens but forgot to create a parent user
and save it along with the temporary access account. This commit fixes
this.
fixes#12381
its possible that, version might exist on second pool such that
upon deleteBucket() might have deleted the bucket on pool1 successfully
since it doesn't have any objects, undo such operations properly in
all any error scenario.
Also delete bucket metadata from pool layer rather than sets layer.
objectErasureMap in the audit holds information about the objects
involved in the current S3 operation such as pool index, set an index,
and disk endpoints. One user saw a crash due to a concurrent update of
objectErasureMap information. Use sync.Map to prevent a crash.
Always use `GetActualSize` to get the part size, not just when encrypted.
Fixes mint test io.minio.MinioClient.uploadPartCopy,
error "Range specified is not valid for source object".
healing code was using incorrect buffers to heal older
objects with 10MiB erasure blockSize, incorrect calculation
of such buffers can lead to incorrect premature closure of
io.Pipe() during healing.
fixes#12410
- it is possible that during I/O failures we might
leave partially written directories, make sure
we purge them after.
- rename current data-dir (null) versionId only after
the newer xl.meta has been written fully.
- attempt removal once for minioMetaTmpBucket/uuid/
as this folder is empty if all previous operations
were successful, this allows avoiding recursive os.Remove()
- for single pool setups usage is not checked.
- for pools, only check the "set" in which it would be placed.
- keep a minimum number of inodes (when we know it).
- ignore for `.minio.sys`.
It makes sense that a node that has multiple disks starts when one
disk fails, returning an i/o error for example. This commit will make this
faulty tolerance available in this specific use case.
Due to incorrect KMS context constructed, we need to add
additional fallbacks and also fix the original root cause
to fix already migrated deployments.
Bonus remove double migration is avoided in gateway mode
for etcd, instead do it once in iam.Init(), also simplify
the migration by not migrating STS users instead let the
clients regenerate them.
- Adds versioning support for S3 based remote tiers that have versioning
enabled. This ensures that when reading or deleting we specify the specific
version ID of the object. In case of deletion, this is important to ensure that
the object version is actually deleted instead of simply being marked for
deletion.
- Stores the remote object's version id in the tier-journal. Tier-journal file
version is not bumped up as serializing the new struct version is
compatible with old journals without the remote object version id.
- `storageRESTVersion` is bumped up as FileInfo struct now includes a
`TransitionRemoteVersionID` member.
- Azure and GCS support for this feature will be added subsequently.
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
Also adding an API to allow resyncing replication when
existing object replication is enabled and the remote target
is entirely lost. With the `mc replicate reset` command, the
objects that are eligible for replication as per the replication
config will be resynced to target if existing object replication
is enabled on the rule.
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
IAM not initialized doesn't mean we can't still
read the content from the disk, we should just
allow the request to go-through if object layer
is initialized.
Real-time metrics calculated in-memory rely on the initial
replication metrics saved with data usage. However, this can
lag behind the actual state of the cluster at the time of server
restart leading to inaccurate Pending size/counts reported to
Prometheus. Dropping the Pending metrics as this can be more
reliably monitored by applications with replication notifications.
Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io>
LDAPusername is the simpler form of LDAPUser (userDN),
using a simpler version is convenient from policy
conditions point of view, since these are unique id's
used for LDAP login.
In cases where a cluster is degraded, we do not uphold our consistency
guarantee and we will write fewer erasure codes and rely on healing
to recreate the missing shards.
In some cases replacing known bad disks in practice take days.
We want to change the behavior of a known degraded system to keep
the erasure code promise of the storage class for each object.
This will create the objects with the same confidence as a fully
functional cluster. The tradeoff will be that objects created
during a partial outage will take up slightly more space.
This means that when the storage class is EC:4, there should
always be written 4 parity shards, even if some disks are unavailable.
When an object is created on a set, the disks are immediately
checked. If any disks are unavailable additional parity shards
will be made for each offline disk, up to 50% of the number of disks.
We add an internal metadata field with the actual and intended
erasure code level, this can optionally be picked up later by
the scanner if we decide that data like this should be re-sharded.
Bonus change LDAP settings such as user, group mappings
are now listed as part of `mc admin user list` and
`mc admin group list`
Additionally this PR also deprecates the `/v2` API
that is no longer in use.
A configured audit logger or HTTP logger is validated during MinIO
server startup. Relax the timeout to 10 seconds in that case, otherwise,
both loggers won't be used.
1 second could be too low for a busy HTTP endpoint.
This commit fixes a bug causing the MinIO server to compute
the ETag of a single-part object as MD5 of the compressed
content - not as MD5 of the actual content.
This usually does not affect clients since the MinIO appended
a `-1` to indicate that the ETag belongs to a multipart object.
However, this behavior was problematic since:
- A S3 client being very strict should reject such an ETag since
the client uploaded the object via single-part API but got
a multipart ETag that is not the content MD5.
- The MinIO server leaks (via the ETag) that it compressed the
object.
This commit addresses both cases. Now, the MinIO server returns
an ETag equal to the content MD5 for single-part objects that got
compressed.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
A lot of healing is likely to be on non-existing objects and
locks are very expensive and will slow down scanning
significantly.
In cases where all are valid or, all are broken allow
rejection without locking.
Keep the existing behavior, but move the check for
dangling objects to after the lock has been acquired.
```
_, err = getLatestFileInfo(ctx, partsMetadata, errs)
if err != nil {
return er.purgeObjectDangling(ctx, bucket, object, versionID, partsMetadata, errs, []error{}, opts)
}
```
Revert "heal: Hold lock when reading xl.meta from disks (#12362)"
This reverts commit abd32065aa
This PR fixes two bugs
- Remove fi.Data upon overwrite of objects from inlined-data to non-inlined-data
- Workaround for an existing bug on disk with latest releases to ignore fi.Data
and instead read from the disk for non-inlined-data
- Addtionally add a reserved metadata header to indicate data is inlined for
a given version.
Lock is hold in healObject() after reading xl.meta from disks the first
time. This commit will held the lock since the beginning of HealObject()
Co-authored-by: Anis Elleuch <anis@min.io>
Fixes `testSSES3EncryptedGetObjectReadSeekFunctional` mint test.
```
{
"args": {
"bucketName": "minio-go-test-w53hbpat649nhvws",
"objectName": "6mdswladz4vfpp2oit1pkn3qd11te5"
},
"duration": 7537,
"error": "We encountered an internal error, please try again.: cause(The requested range \"bytes 251717932 -> -116384170 of 135333762\" is not satisfiable.)",
"function": "GetObject(bucketName, objectName)",
"message": "CopyN failed",
"name": "minio-go: testSSES3EncryptedGetObjectReadSeekFunctional",
"status": "FAIL"
}
```
Compressed files always start at the beginning of a part so no additional offset should be added.
Previous PR #12351 added functions to read from the reader
stream to reduce memory usage, use the same technique in
few other places where we are not interested in reading the
data part.
in setups with lots of drives the server
startup is slow, initialize all local drives
in parallel before registering with muxer.
this speeds up when there are multiple pools
and large collection of drives.
multi-disk clusters initialize buffer pools
per disk, this is perhaps expensive and perhaps
not useful, for a running server instance. As this
may disallow re-use of buffers across sets,
this change ensures that buffers across sets
can be re-used at drive level, this can reduce
quite a lot of memory on large drive setups.
In lieu of new changes coming for server command line, this
change is to deprecate strict requirement for distributed setups
to provide root credentials.
Bonus: remove MINIO_WORM warning from April 2020, it is time to
remove this warning.
However, this slice is also used for closing the writers, so close is never called on these.
Furthermore when an error is returned from a write it is now reported to the reader.
bonus: remove unused heal param from `newBitrotWriter`.
* Remove copy, now that we don't mutate.
At some places bloom filter tracker was getting
updated for `.minio.sys/tmp` bucket, there is no
reason to update bloom filters for those.
And add a missing bloom filter update for MakeBucket()
Bonus: purge unused function deleteEmptyDir()
gracefully start the server, if there are other drives
available - print enough information for administrator
to notice the errors in console.
Bonus: for really large streams use larger buffer for
writes.
- GetObject() should always use a common dataDir to
read from when it starts reading, this allows the
code in erasure decoding to have sane expectations.
- Healing should always heal on the common dataDir, this
allows the code in dangling object detection to purge
dangling content.
These both situations can happen under certain types of
retries during PUT when server is restarting etc, some
namespace entries might be left over.
attempt a delete on remote DNS store first before
attempting locally, because removing at DNS store
is cheaper than deleting locally, in case of
errors locally we can cheaply recreate the
bucket on dnsStore instead of.
This commit adds support for SSE-KMS bucket configurations.
Before, the MinIO server did not support SSE-KMS, and therefore,
it was not possible to specify an SSE-KMS bucket config.
Now, this is possible. For example:
```
mc encrypt set sse-kms some-key <alias>/my-bucket
```
Further, this commit fixes an issue caused by not supporting
SSE-KMS bucket configuration and switching to SSE-KMS as default
SSE method.
Before, the server just checked whether an SSE bucket config was
present (not which type of SSE config) and applied the default
SSE method (which was switched from SSE-S3 to SSE-KMS).
This caused objects to get encrypted with SSE-KMS even though a
SSE-S3 bucket config was present.
This issue is fixed as a side-effect of this commit.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
when bidirectional replication is set up.
If ReplicaModifications is enabled in the replication
configuration, sync metadata updates to source if
replication rules are met. By default, if this
configuration is unset, MinIO automatically sync's
metadata updates on replica back to the source.
This commit adds a check to the MinIO server setup that verifies
that MinIO can reach KES, if configured, and that the default key
exists. If the default key does not exist it will create it
automatically.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
This commit fixes a bug in the KMS KES client integration.
The client should return a non-nil error when the key generation
fails.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
A cache structure will be kept with a tree of usages.
The cache is a tree structure where each keeps track
of its children.
An uncompacted branch contains a count of the files
only directly at the branch level, and contains link to
children branches or leaves.
The leaves are "compacted" based on a number of properties.
A compacted leaf contains the totals of all files beneath it.
A leaf is only scanned once every dataUsageUpdateDirCycles,
rarer if the bloom filter for the path is clean and no lifecycles
are applied. Skipped leaves have their totals transferred from
the previous cycle.
A clean leaf will be included once every healFolderIncludeProb
for partial heal scans. When selected there is a one in
healObjectSelectProb that any object will be chosen for heal scan.
Compaction happens when either:
- The folder (and subfolders) contains less than dataScannerCompactLeastObject objects.
- The folder itself contains more than dataScannerCompactAtFolders folders.
- The folder only contains objects and no subfolders.
- A bucket root will never be compacted.
Furthermore, if a has more than dataScannerCompactAtChildren recursive
children (uncompacted folders) the tree will be recursively scanned and the
branches with the least number of objects will be compacted until the limit
is reached.
This ensures that any branch will never contain an unreasonable amount
of other branches, and also that small branches with few objects don't
take up unreasonable amounts of space.
Whenever a branch is scanned, it is assumed that it will be un-compacted
before it hits any of the above limits. This will make the branch rebalance
itself when scanned if the distribution of objects has changed.
TLDR; With current values: No bucket will ever have more than 10000
child nodes recursively. No single folder will have more than 2500 child
nodes by itself. All subfolders are compacted if they have less than 500
objects in them recursively.
We accumulate the (non-deletemarker) version count for paths as well,
since we are changing the structure anyway.
MRF does not detect when a node is disconnected and reconnected quickly
this change will ensure that MRF is alerted by comparing the last disk
reconnection timestamp with the last MRF check time.
Signed-off-by: Anis Elleuch <anis@min.io>
Co-authored-by: Klaus Post <klauspost@gmail.com>
wait groups are necessary with io.Pipes() to avoid
races when a blocking function may not be expected
and a Write() -> Close() before Read() races on each
other. We should avoid such situations..
Co-authored-by: Klaus Post <klauspost@gmail.com>
This commit replaces the custom KES client implementation
with the KES SDK from https://github.com/minio/kes
The SDK supports multi-server client load-balancing and
requests retry out of the box. Therefore, this change reduces
the overall complexity within the MinIO server and there
is no need to maintain two separate client implementations.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
- Check ES server version by querying its API
- Minimum required version of ES is 5.x
- Add deprecation warnings for ES versions < 7.x
- Still works with 5.x and 6.x, but support to be removed at a later date.
Signed-off-by: Aditya Manthramurthy <aditya@minio.io>
This commit enforces the usage of AES-256
for config and IAM data en/decryption in FIPS
mode.
Further, it improves the implementation of
`fips.Enabled` by making it a compile time
constant. Now, the compiler is able to evaluate
the any `if fips.Enabled { ... }` at compile time
and eliminate unused code.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
p.writers is a verbatim value of bitrotWriter
backed by a pipe() that should never be nil'ed,
instead use the captured errors to skip the writes.
additionally detect also short writes, and reject
them as errors.
currently GetUser() returns 403 when IAM is not initialized
this can lead to applications crashing, instead return 503
so that the applications can retry and backoff.
fixes#12078
as there is no automatic way to detect if there
is a root disk mounted on / or /var for the container
environments due to how the root disk information
is masked inside overlay root inside container.
this PR brings an environment variable to set
root disk size threshold manually to detect the
root disks in such situations.
This commit fixes a bug in the single-part object decryption
that is triggered in case of SSE-KMS. Before, it was assumed
that the encryption is either SSE-C or SSE-S3. In case of SSE-KMS
the SSE-C branch was executed. This lead to an invalid SSE-C
algorithm error.
This commit fixes this by inverting the `if-else` logic.
Now, the SSE-C branch only gets executed when SSE-C headers
are present.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
This commit fixes a bug introduced by af0c65b.
When there is no / an empty client-provided SSE-KMS
context the `ParseMetadata` may return a nil map
(`kms.Context`).
When unsealing the object key we must check that
the context is nil before assigning a key-value pair.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
UpdateServiceAccount ignores updating fields when not passed from upper
layer, such as empty policy, empty account status, and empty secret key.
This PR will check for a secret key only if it is empty and add more
check on the value of the account status.
Signed-off-by: Anis Elleuch <anis@min.io>
This commit adds basic SSE-KMS support.
Now, a client can specify the SSE-KMS headers
(algorithm, optional key-id, optional context)
such that the object gets encrypted using the
SSE-KMS method. Further, auto-encryption now
defaults to SSE-KMS.
This commit does not try to do any refactoring
and instead tries to implement SSE-KMS as a minimal
change to the code base. However, refactoring the entire
crypto-related code is planned - but needs a separate
effort.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
It is possible in some scenarios that in multiple pools,
two concurrent calls for the same object as a multipart operation
can lead to duplicate entries on two different pools.
This PR fixes this
- hold locks to serialize multiple callers so that we don't race.
- make sure to look for existing objects on the namespace as well
not just for existing uploadIDs
When running MinIO server without LDAP/OpenID, we should error out when
the code tries to create a service account for a non existant regular
user.
Bonus: refactor the check code to be show all cases more clearly
Signed-off-by: Anis Elleuch <anis@min.io>
Co-authored-by: Anis Elleuch <anis@min.io>
To avoid returning 5xx error from MinIO server and show a better error
message, we need to return ErrInvalidAccessKeyLength and ErrInvalidSecretKeyLength
when attempting to create a new credentials with invalid access or
secret keys.
Signed-off-by: Anis Elleuch <anis@min.io>
Co-authored-by: Anis Elleuch <anis@min.io>
This commit adds basic SSE-KMS support.
Now, a client can specify the SSE-KMS headers
(algorithm, optional key-id, optional context)
such that the object gets encrypted using the
SSE-KMS method. Further, auto-encryption now
defaults to SSE-KMS.
This commit does not try to do any refactoring
and instead tries to implement SSE-KMS as a minimal
change to the code base. However, refactoring the entire
crypto-related code is planned - but needs a separate
effort.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
Co-authored-by: Klaus Post <klauspost@gmail.com>
avoid time_wait build up with getObject requests if there are
pending callers and they timeout, can lead to time_wait states
Bonus share the same buffer pool with erasure healing logic,
additionally also fixes a race where parallel readers were
never cleanup during Encode() phase, because pipe.Reader end
was never closed().
Added closer right away upon an error during Encode to make
sure to avoid racy Close() while stream was still being
Read().
This commit fixes a bug when parsing the env. variable
`MINIO_KMS_SECRET_KEY`. Before, the env. variable
name - instead of its value - was parsed. This (obviously)
did not work properly.
This commit fixes this.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
cleanup functions should never be cleaned before the reader is
instantiated, this type of design leads to situations where order
of lockers and places for them to use becomes confusing.
Allow WithCleanupFuncs() if the caller wishes to add cleanupFns
to be run upon close() or an error during initialization of the
reader.
Also make sure streams are closed before we unlock the resources,
this allows for ordered cleanup of resources.
upon errors to acquire lock context would still leak,
since the cancel would never be called. since the lock
is never acquired - proactively clear it before returning.
failed queue should be used for retried requests to
avoid cascading the failures into incoming queue, this
would allow for a more fair retry for failed replicas.
Additionally also avoid taking context in queue task
to avoid confusion, simplifies its usage.
There can be situations where replication completed but the
`X-Amz-Replication-Status` metadata update failed such as
when the server returns 503 under high load. This object version will
continue to be picked up by the scanner and replicateObject would perform
no action since the versions match between source and target.
The metadata would never reflect that replication was successful
without this fix, leading to repeated re-queuing.
This commit enhances the docs about IAM encryption.
It adds a quick-start section that explains how to
get started quickly with `MINIO_KMS_SECRET_KEY`
instead of setting up KES.
It also removes the startup message that gets printed
when the server migrates IAM data to plaintext.
We will point this out in the release notes.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
OpenID connect generated service accounts do not work
properly after console logout, since the parentUser state
is lost - instead use sub+iss claims for parentUser, according
to OIDC spec both the claims provide the necessary stability
across logins etc.
Currently, only credentials could be updated with
`mc admin bucket remote edit`.
Allow updating synchronous replication flag, path,
bandwidth and healthcheck duration on buckets, and
a flag to disable proxying in active-active replication.
* lock: Always cancel the returned Get(R)Lock context
There is a leak with cancel created inside the locking mechanism. The
cancel purpose was to cancel operations such erasure get/put that are
holding non-refreshable locks.
This PR will ensure the created context.Cancel is passed to the unlock
API so it will cleanup and avoid leaks.
* locks: Avoid returning nil cancel in local lockers
Since there is no Refresh mechanism in the local locking mechanism, we
do not generate a new context or cancel. Currently, a nil cancel
function is returned but this can cause a crash. Return a dummy function
instead.
LDAP DN should be used when allowing setting service accounts
for LDAP users instead of just simple user,
Bonus root owner should be allowed full access
to all service account APIs.
Signed-off-by: Harshavardhana <harsha@minio.io>
Part ETags are not available after multipart finalizes, removing this
check as not useful.
Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
This commit reverts a change that added support for
parsing base64-encoded keys set via `MINIO_KMS_MASTER_KEY`.
The env. variable `MINIO_KMS_MASTER_KEY` is deprecated and
should ONLY support parsing existing keys - not the new format.
Any new deployment should use `MINIO_KMS_SECRET_KEY`. The legacy
env. variable `MINIO_KMS_MASTER_KEY` will be removed at some point
in time.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
avoid re-read of xl.meta instead just use
the success criteria from PutObjectPart()
and check the ETag matches per Part, if
they match then the parts have been
successfully restored as is.
Signed-off-by: Harshavardhana <harsha@minio.io>
Bonus fix fallback to decrypt previously
encrypted content as well using older master
key ciphertext format.
Signed-off-by: Harshavardhana <harsha@minio.io>
just like replication workers, allow failed replication
workers to be configurable in situations like DR failures
etc to catch up on replication sooner when DR is back
online.
Signed-off-by: Harshavardhana <harsha@minio.io>
peer nodes would not update if policy is unset on
a user, until policies reload every 5minutes. Make
sure to reload the policies properly, if no policy
is found make sure to delete such users and groups
fixes#12074
Signed-off-by: Harshavardhana <harsha@minio.io>
With this change, MinIO's ILM supports transitioning objects to a remote tier.
This change includes support for Azure Blob Storage, AWS S3 compatible object
storage incl. MinIO and Google Cloud Storage as remote tier storage backends.
Some new additions include:
- Admin APIs remote tier configuration management
- Simple journal to track remote objects to be 'collected'
This is used by object API handlers which 'mutate' object versions by
overwriting/replacing content (Put/CopyObject) or removing the version
itself (e.g DeleteObjectVersion).
- Rework of previous ILM transition to fit the new model
In the new model, a storage class (a.k.a remote tier) is defined by the
'remote' object storage type (one of s3, azure, GCS), bucket name and a
prefix.
* Fixed bugs, review comments, and more unit-tests
- Leverage inline small object feature
- Migrate legacy objects to the latest object format before transitioning
- Fix restore to particular version if specified
- Extend SharedDataDirCount to handle transitioned and restored objects
- Restore-object should accept version-id for version-suspended bucket (#12091)
- Check if remote tier creds have sufficient permissions
- Bonus minor fixes to existing error messages
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
instead use expect continue timeout, and have
higher response header timeout, the new higher
timeout satisfies worse case scenarios for total
response time on a CreateFile operation.
Also set the "expect" continue header to satisfy
expect continue timeout behavior.
Some clients seem to cause CreateFile body to be
truncated, leading to no errors which instead
fails with ObjectNotFound on a PUT operation,
this change avoids such failures appropriately.
Signed-off-by: Harshavardhana <harsha@minio.io>
allow restrictions on who can access Prometheus
endpoint, additionally add prometheus as part of
diagnostics canned policy.
Signed-off-by: Harshavardhana <harsha@minio.io>
Having the default port in there makes the test using a presigned
request fail.
Co-authored-by: Robert Lützner <robert.luetzner@iternity.com>
Signed-off-by: Harshavardhana <harsha@minio.io>
This commit changes the config/IAM encryption
process. Instead of encrypting config data
(users, policies etc.) with the root credentials
MinIO now encrypts this data with a KMS - if configured.
Therefore, this PR moves the MinIO-KMS configuration (via
env. variables) to a "top-level" configuration.
The KMS configuration cannot be stored in the config file
since it is used to decrypt the config file in the first
place.
As a consequence, this commit also removes support for
Hashicorp Vault - which has been deprecated anyway.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
* fix: pick valid FileInfo additionally based on dataDir
historically we have always relied on modTime
to be consistent and same, we can now add additional
reference to look for the same dataDir value.
A dataDir is the same for an object at a given point in
time for a given version, let's say a `null` version
is overwritten in quorum we do not by mistake pick
up the fileInfo's incorrectly.
* make sure to not preserve fi.Data
Signed-off-by: Harshavardhana <harsha@minio.io>
InfoServiceAccount admin API does not correctly calculate the policy for
a given service account in case if the policy is implied. Fix it.
Signed-off-by: Anis Elleuch <anis@min.io>
avoid potential for duplicates under multi-pool
setup, additionally also make sure CompleteMultipart
is using a more optimal API for uploadID lookup
and never delete the object there is a potential
to create a delete marker during complete multipart.
Signed-off-by: Harshavardhana <harsha@minio.io>
it seems to be legitimate to have `mountinfo` lines
to have keywords with spaces such as
```
rootfs overlay / overlay rw,relatime,lowerdir...
```
This was not expected, but for our requirement
we can just ignore this and move forward.
fixes#12047
Signed-off-by: Harshavardhana <harsha@minio.io>
This is an optimization by reducing one extra system call,
and many network operations. This reduction should increase
the performance for small file workloads.
When an error is reported it is ignored and zipping continues with the next object.
However, if there is an error it will write a response to `writeWebErrorResponse(w, err)`, but responses are still being built.
Fixes#12082
Bonus: Exclude common compressed image types.
Thanks to @Alevsk for noticing this nuanced behavior
change between releases from 03-04 to 03-20, make sure
that we handle the legacy path removal as well.
also make sure to close the channel on the producer
side, not in a separate go-routine, this can lead
to races between a writer and a closer.
fixes#12073
This is a minor change to call out the new documentation and warn
users to change their bookmarks. Once we are ready to set up
some redirects, we can remove this page from Gluegun TOC.
This is an optimization to save IOPS. The replication
failures will be re-queued once more to re-attempt
replication. If it still does not succeed, the replication
status is set as `FAILED` and will be caught up on
scanner cycle.
For InfoServiceAccount API, calculating the policy before showing it to
the user was not correctly done (only UX issue, not a security issue)
This commit fixes it.
policy might have an associated mapping with an expired
user key, do not return an error during DeletePolicy
for such situations - proceed normally as its an
expected situation.
This commit introduces a new package `pkg/kms`.
It contains basic types and functions to interact
with various KMS implementations.
This commit also moves KMS-related code from `cmd/crypto`
to `pkg/kms`. Now, it is possible to implement a KMS-based
config data encryption in the `pkg/config` package.
This commit introduces a new package `pkg/fips`
that bundles functionality to handle and configure
cryptographic protocols in case of FIPS 140.
If it is compiled with `--tags=fips` it assumes
that a FIPS 140-2 cryptographic module is used
to implement all FIPS compliant cryptographic
primitives - like AES, SHA-256, ...
In "FIPS mode" it excludes all non-FIPS compliant
cryptographic primitives from the protocol parameters.
Follow S3 to accept an empty filter tag inside an XML document.
<Filter> needs to be specified but it doesn't have to contain any other
XML tags inside it.
- Add 32-bit checksum (32 LSB part of xxhash64) of the serialized metadata.
This will ensure that we always reject corrupted metadata.
- Add automatic repair of inline data, so the data structure can be used.
If data was corrupted, we remove all unreadable entries to ensure that operations
can succeed on the object. Since higher layers add bitrot checks this is not a big problem.
Cannot downgrade to v1.1 metadata, but since that isn't released, no need for a major bump.
only in case of S3 gateway we have a case where we
need to allow for SSE-S3 headers as passthrough,
If SSE-C headers are passed then they are rejected
if KMS is not configured.
This code is necessary for `mc admin update` command
to work with fips compiled binaries, with fips tags
the releaseInfo will automatically point to fips
specific binaries.
This commit fixes a bug in the put-part
implementation. The SSE headers should be
set as specified by AWS - See:
https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPart.html
Now, the MinIO server should set SSE-C headers,
like `x-amz-server-side-encryption-customer-algorithm`.
Fixes#11991
locks can get relinquished when Read() sees io.EOF
leading to prematurely closing of the readers
concurrent writes on the same object can have
undesired consequences here when these locks
are relinquished.
EOF may be sent along with data so queue it up and
return it when the buffer is empty.
Also, when reading data without direct io don't add a buffer
that only results in extra memcopy.
mc admin trace does not work with older MinIO versions because if an
incompability with older trace admin API. This commit changes madmin for
better backward compatibility with server admin API.
Multiple disks from the same set would be writing concurrently.
```
WARNING: DATA RACE
Write at 0x00c002100ce0 by goroutine 166:
github.com/minio/minio/cmd.(*erasureSets).connectDisks.func1()
d:/minio/minio/cmd/erasure-sets.go:254 +0x82f
Previous write at 0x00c002100ce0 by goroutine 129:
github.com/minio/minio/cmd.(*erasureSets).connectDisks.func1()
d:/minio/minio/cmd/erasure-sets.go:254 +0x82f
Goroutine 166 (running) created at:
github.com/minio/minio/cmd.(*erasureSets).connectDisks()
d:/minio/minio/cmd/erasure-sets.go:210 +0x324
github.com/minio/minio/cmd.(*erasureSets).monitorAndConnectEndpoints()
d:/minio/minio/cmd/erasure-sets.go:288 +0x244
Goroutine 129 (finished) created at:
github.com/minio/minio/cmd.(*erasureSets).connectDisks()
d:/minio/minio/cmd/erasure-sets.go:210 +0x324
github.com/minio/minio/cmd.(*erasureSets).monitorAndConnectEndpoints()
d:/minio/minio/cmd/erasure-sets.go:288 +0x244
```
This change fixes handling of these types of queries:
- Double quoted column names with special characters:
SELECT "column.name" FROM s3object
- Double quoted column names with reserved keywords:
SELECT "CAST" FROM s3object
- Table name as prefix for column names:
SELECT S3Object."CAST" FROM s3object
This commit adds a self-test for all bitrot algorithms:
- SHA-256
- BLAKE2b
- HighwayHash
The self-test computes an incremental checksum of pseudo-random
messages. If a bitrot algorithm implementation stops working on
some CPU architecture or with a certain Go version this self-test
will prevent the server from starting and silently corrupting data.
For additional context see: minio/highwayhash#19
Metrics calculation was accumulating inital usage across all nodes
rather than using initial usage only once.
Also fixing:
- bug where all peer traffic was going to the same node.
- reset counters when replication status changes from
PENDING -> FAILED
This PR fixes
- close leaking bandwidth report channel leakage
- remove the closer requirement for bandwidth monitor
instead if Read() fails remember the error and return
error for all subsequent reads.
- use locking for usage-cache.bin updates, with inline
data we cannot afford to have concurrent writes to
usage-cache.bin corrupting xl.meta
implementation in #11949 only catered from single
node, but we need cluster metrics by capturing
from all peers. introduce bucket stats API that
will be used for capturing in-line bucket usage
as well eventually
Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.
This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.
This helps reduce the chattiness for following calls..
- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.
- add API to get replication metrics
- add MRF worker to handle spill-over replication operations
- multiple issues found with replication
- fixes an issue when client sends a bucket
name with `/` at the end from SetRemoteTarget
API call make sure to trim the bucket name to
avoid any extra `/`.
- hold write locks in GetObjectNInfo during replication
to ensure that object version stack is not overwritten
while reading the content.
- add additional protection during WriteMetadata() to
ensure that we always write a valid FileInfo{} and avoid
ever writing empty FileInfo{} to the lowest layers.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
current master breaks this important requirement
we need to preserve legacyXLv1 format, this is simply
ignored and overwritten causing a myriad of issues
by leaving stale files on the namespace etc.
for now lets still use the two-phase approach of
writing to `tmp` and then renaming the content to
the actual namespace.
versionID is the one that needs to be preserved and as
well as overwritten in case of replication, transition
etc - dataDir is an ephemeral entity that changes
during overwrites - make sure that versionID is used
to save the object content.
this would break things if you are already running
the latest master, please wipe your current content
and re-do your setup after this change.
upgrading from 2yr old releases is expected to work,
the issue was we were missing checksum info to be
passed down to newBitrotReader() for whole bitrot
calculation
Ensure that we don't use potentially broken algorithms for critical functions, whether it be a runtime problem or implementation problem for a specific platform.
It is inefficient to decide to heal an object before checking its
lifecycle for expiration or transition. This commit will just reverse
the order of action: evaluate lifecycle and heal only if asked and
lifecycle resulted a NoneAction.
replication didn't work as expected when deletion of
delete markers was requested in DeleteMultipleObjects
API, this is due to incorrect lookup elements being
used to look for delete markers.
This allows us to speed up or slow down sleeps
between multiple scanner cycles, helps in testing
as well as some deployments might want to run
scanner more frequently.
This change is also dynamic can be applied on
a running cluster, subsequent cycles pickup
the newly set value.
using Lstat() is causing tiny memory allocations,
that are usually wasted and never used, instead
we can simply uses Access() call that does 0
memory allocations.
This feature brings in support for auto extraction
of objects onto MinIO's namespace from an incoming
tar gzipped stream, the only expected metadata sent
by the client is to set `snowball-auto-extract`.
All the contents from the tar stream are saved as
folders and objects on the namespace.
fixes#8715
service accounts were not inheriting parent policies
anymore due to refactors in the PolicyDBGet() from
the latest release, fix this behavior properly.
The local node name is heavily used in tracing, create a new global
variable to store it. Multiple goroutines can access it since it won't be
changed later.
In #11888 we observe a lot of running, WalkDir calls.
There doesn't appear to be any listerners for these calls, so they should be aborted.
Ensure that WalkDir aborts when upstream cancels the request.
Fixes#11888
The background healing can return NoSuchUpload error, the reason is that
healing code can return errFileNotFound with three parameters. Simplify
the code by returning exact errUploadNotFound error in multipart code.
Also ensure that a typed error is always returned whatever the number of
parameters because it is better than showing internal error.
For large objects taking more than '3 minutes' response
times in a single PUT operation can timeout prematurely
as 'ResponseHeader' timeout hits for 3 minutes. Avoid
this by keeping the connection active during CreateFile
phase.
baseDirFromPrefix(prefix) for object names without
parent directory incorrectly uses empty path, leading
to long listing at various paths that are not useful
for healing - avoid this listing completely if "baseDir"
returns empty simple use the "prefix" as is.
this improves startup performance significantly
For large objects taking more than '3 minutes' response
times in a single PUT operation can timeout prematurely
as 'ResponseHeader' timeout hits for 3 minutes. Avoid
this by keeping the connection active during CreateFile
phase.
some SDKs might incorrectly send duplicate
entries for keys such as "conditions", Go
stdlib unmarshal for JSON does not support
duplicate keys - instead skips the first
duplicate and only preserves the last entry.
This can lead to issues where a policy JSON
while being valid might not properly apply
the required conditions, allowing situations
where POST policy JSON would end up allowing
uploads to unauthorized buckets and paths.
This PR fixes this properly.
This commit adds a `MarshalText` implementation
to the `crypto.Context` type.
The `MarshalText` implementation replaces the
`WriteTo` and `AppendTo` implementation.
It is slightly slower than the `AppendTo` implementation
```
goos: darwin
goarch: arm64
pkg: github.com/minio/minio/cmd/crypto
BenchmarkContext_AppendTo/0-elems-8 381475698 2.892 ns/op 0 B/op 0 allocs/op
BenchmarkContext_AppendTo/1-elems-8 17945088 67.54 ns/op 0 B/op 0 allocs/op
BenchmarkContext_AppendTo/3-elems-8 5431770 221.2 ns/op 72 B/op 2 allocs/op
BenchmarkContext_AppendTo/4-elems-8 3430684 346.7 ns/op 88 B/op 2 allocs/op
```
vs.
```
BenchmarkContext/0-elems-8 135819834 8.658 ns/op 2 B/op 1 allocs/op
BenchmarkContext/1-elems-8 13326243 89.20 ns/op 128 B/op 1 allocs/op
BenchmarkContext/3-elems-8 4935301 243.1 ns/op 200 B/op 3 allocs/op
BenchmarkContext/4-elems-8 2792142 428.2 ns/op 504 B/op 4 allocs/op
goos: darwin
```
However, the `AppendTo` benchmark used a pre-allocated buffer. While
this improves its performance it does not match the actual usage of
`crypto.Context` which is passed to a `KMS` and always encoded into
a newly allocated buffer.
Therefore, this change seems acceptable since it should not impact the
actual performance but reduces the overall code for Context marshaling.
When an object is removed, its parent directory is inspected to check if
it is empty to remove if that is the case.
However, we can use os.Remove() directly since it is only able to remove
a file or an empty directory.
RenameData renames xl.meta and data dir and removes the parent directory
if empty, however, there is a duplicate check for empty dir, since the
parent dir of xl.meta is always the same as the data-dir.
Some old AWS SDKs send BucketLifecycleConfiguration as the root tag in
the bucket lifecycle document. This PR will support both
LifecycleConfiguration and BucketLifecycleConfiguration.
on freshReads if drive returns errInvalidArgument, we
should simply turn-off DirectIO and read normally, there
are situations in k8s like environments where the drives
behave sporadically in a single deployment and may not
have been implemented properly to handle O_DIRECT for
reads.
This PR adds deadlines per Write() calls, such
that slow drives are timed-out appropriately and
the overall responsiveness for Writes() is always
up to a predefined threshold providing applications
sustained latency even if one of the drives is slow
to respond.
- Build and RUN test executable:
```
$ go test -tags testrunmain -covermode count
-coverpkg="./..." -c -tags testrunmain
$ APP_ARGS="server /tmp/test" ./minio.test
-test.run "^TestRunMain$"
-test.coverprofile coverage.cov
```
- Or run the system under test just by calling go test
```
$ APP_ARGS="server /tmp/test" go test
-cover
-tags testrunmain
-coverpkg="./..."
-covermode count
-coverprofile=coverage.cov
```
- Run System-Tests (when using GitBash prefix this
line with MSYS_NO_PATHCONV=1) Note the
SERVER_ENDPOINT must be reachable from
inside the docker container (so don't use localhost!)
```
$ docker run
-e MINT_MODE=full
-e SERVER_ENDPOINT=192.168.47.11:9000
-e ACCESS_KEY=minioadmin
-e SECRET_KEY=minioadmin
-v /tmp/mint/log:/mint/log
minio/mint
``
- Stop system under test by sending SIGTERM
```
$ ctrl+c
```
- Transform coverage file to HTML
```
$ go tool cover -html=./coverage.cov -o coverage.html
```
MRF was starting to heal when it receives a disk connection event, which
is not good when a node having multiple disks reconnects to the cluster.
Besides, MRF needs Remove healing option to remove stale files.
- write in o_dsync instead of o_direct for smaller
objects to avoid unaligned double Write() situations
that may arise for smaller objects < 128KiB
- avoid fallocate() as its not useful since we do not
use Append() semantics anymore, fallocate is not useful
for streaming I/O we can save on a syscall
- createFile() doesn't need to validate `bucket` name
with a Lstat() call since createFile() is only used
to write at `minioTmpBucket`
- use io.Copy() when writing unAligned writes to allow
usage of ReadFrom() from *os.File providing zero
buffer writes().
```
mc admin info --json
```
provides these details, for now, we shall eventually
expose this at Prometheus level eventually.
Co-authored-by: Harshavardhana <harsha@minio.io>
This commit fixes a security issue in the signature v4 chunked
reader. Before, the reader returned unverified data to the caller
and would only verify the chunk signature once it has encountered
the end of the chunk payload.
Now, the chunk reader reads the entire chunk into an in-memory buffer,
verifies the signature and then returns data to the caller.
In general, this is a common security problem. We verifying data
streams, the verifier MUST NOT return data to the upper layers / its
callers as long as it has not verified the current data chunk / data
segment:
```
func (r *Reader) Read(buffer []byte) {
if err := r.readNext(r.internalBuffer); err != nil {
return err
}
if err := r.verify(r.internalBuffer); err != nil {
return err
}
copy(buffer, r.internalBuffer)
}
```
For operations that require the object to exist make it possible to
detect if the file isn't found in *any* pool.
This will allow these to return the error early without having to re-check.
Cases where we have applications making request
for `//` in object names make sure that all
are normalized to `/` and all such requests that
are prefixed '/' are removed. To ensure a
consistent view from all operations.
Some deployments have low parity (EC:2), but we really do not need to
save our config data with the same parity configuration.
N/2 would be better to keep MinIO configurations intact when unexpected
a number of drives fail.
This commit disables the Hashicorp Vault
support but provides a way to temp. enable
it via the `MINIO_KMS_VAULT_DEPRECATION=off`
Vault support has been deprecated long ago
and this commit just requires users to take
action if they maintain a Vault integration.
This commit adds FIPS-specifc build tags to the madmin
package. When madmin is compiled with `--tags "fips"`
it will always use AES-GCM for encryption - not just
when an optimized AES implementation is available.
major performance improvements in range GETs to avoid large
read amplification when ranges are tiny and random
```
-------------------
Operation: GET
Operations: 142014 -> 339421
Duration: 4m50s -> 4m56s
* Average: +139.41% (+1177.3 MiB/s) throughput, +139.11% (+658.4) obj/s
* Fastest: +125.24% (+1207.4 MiB/s) throughput, +132.32% (+612.9) obj/s
* 50% Median: +139.06% (+1175.7 MiB/s) throughput, +133.46% (+660.9) obj/s
* Slowest: +203.40% (+1267.9 MiB/s) throughput, +198.59% (+753.5) obj/s
```
TTFB from 10MiB BlockSize
```
* First Access TTFB: Avg: 81ms, Median: 61ms, Best: 20ms, Worst: 2.056s
```
TTFB from 1MiB BlockSize
```
* First Access TTFB: Avg: 22ms, Median: 21ms, Best: 8ms, Worst: 91ms
```
Full object reads however do see a slight change which won't be
noticeable in real world, so not doing any comparisons
TTFB still had improvements with full object reads with 1MiB
```
* First Access TTFB: Avg: 68ms, Median: 35ms, Best: 11ms, Worst: 1.16s
```
v/s
TTFB with 10MiB
```
* First Access TTFB: Avg: 388ms, Median: 98ms, Best: 20ms, Worst: 4.156s
```
This change should affect all new uploads, previous uploads should
continue to work with business as usual. But dramatic improvements can
be seen with these changes.
A group can have multiple policies, a user subscribed to readwrite &
diagnostics can perform S3 operations & admin operations as well.
However, the current code only returns one policy for one group.
This commit disables SHA-3 for OpenID when building a
FIPS-140 2 compatible binary. While SHA-3 is a
crypto. hash function accepted by NIST there is no
FIPS-140 2 compliant implementation available when
using the boringcrypto Go branch.
Therefore, SHA-3 must not be used when building
a FIPS-140 2 binary.
* Provide information on *actively* healing, buckets healed/queued, objects healed/failed.
* Add concurrent healing of multiple sets (typically on startup).
* Add bucket level resume, so restarts will only heal non-healed buckets.
* Print summary after healing a disk is done.
currently when one of the peer is down, the
drives from that peer are reported as '0/0'
offline instead we should capture/filter the
drives from the peer and populate it appropriately
such that `mc admin info` displays correct info.
This commit adds the `FromContentMD5` function to
parse a client-provided content-md5 as ETag.
Further, it also adds multipart ETag computation
for future needs.
prometheus metrics was using total disks instead
of online disk count, when disks were down, this
PR fixes this and also adds a new metric for
total_disk_count
Creating notification events for replica creation
is not particularly useful to send as the notification
event generated at source already includes replication
completion events.
For applications using replica cluster as failover, avoiding
duplicate notifications for replica event will allow seamless
failover.
also re-use storage disks for all `mc admin server info`
calls as well, implement a new LocalStorageInfo() API
call at ObjectLayer to lookup local disks storageInfo
also fixes bugs where there were double calls to StorageInfo()
While starting up a request that needs all IAM data will start another load operation if the first on startup hasn't finished. This slows down both operations.
Block these requests until initial load has completed.
Blocking calls will be ListPolicies, ListUsers, ListServiceAccounts, ListGroups - and the calls that eventually trigger these. These will wait for the initial load to complete.
Fixes issue seen in #11305
Implicit permissions for any user is to be allowed to
change their own password, we need to restrict this
further even if there is an implicit allow for this
scenario - we have to honor Deny statements if they
are specified.
ListObjectVersions would skip past the object in the marker when version id is specified.
Make `listPath` return the object with the marker and truncate it if not needed.
Avoid having to parse unintended objects to find a version marker.
The previous code was iterating over replies from peers and assigning
pool numbers to them, thus missing to add it for the local server.
Fixed by iterating over the server properties of all the servers
including the local one.
There was an io.LimitReader was missing for the 'length'
parameter for ranged requests, that would cause client to
get truncated responses and errors.
fixes#11651
The base profiles contains no valuable data, don't record them.
Reduce block rate by 2 orders of magnitude, should still capture just as valuable data with less CPU strain.
most of the delete calls today spend time in
a blocking operation where multiple calls need
to be recursively sent to delete the objects,
instead we can use rename operation to atomically
move the objects from the namespace to `tmp/.trash`
we can schedule deletion of objects at this
location once in 15, 30mins and we can also add
wait times between each delete operation.
this allows us to make delete's faster as well
less chattier on the drives, each server runs locally
a groutine which would clean this up regularly.
This commit removes the `GetObject` method
from the `ObjectLayer` interface.
The `GetObject` method is not longer used by
the HTTP handlers implementing the high-level
S3 semantics. Instead, they use the `GetObjectNInfo`
method which returns both, an object handle as well
as the object metadata.
Therefore, it is no longer necessary that a concrete
`ObjectLayer` implements `GetObject`.
store the cache in-memory instead of disks to avoid large
write amplifications for list heavy workloads, store in
memory instead and let it auto expire.
This commit replaces the usage of
github.com/minio/sha256-simd with crypto/sha256
of the standard library in all non-performance
critical paths.
This is necessary for FIPS 140-2 compliance which
requires that all crypto. primitives are implemented
by a FIPS-validated module.
Go can use the Google FIPS module. The boringcrypto
branch of the Go standard library uses the BoringSSL
FIPS module to implement crypto. primitives like AES
or SHA256.
We only keep github.com/minio/sha256-simd when computing
the content-SHA256 of an object. Therefore, this commit
relies on a build tag `fips`.
When MinIO is compiled without the `fips` flag it will
use github.com/minio/sha256-simd. When MinIO is compiled
with the fips flag (go build --tags "fips") then MinIO
uses crypto/sha256 to compute the content-SHA256.
Instead of using O_SYNC, we are better off using O_DSYNC
instead since we are only ever interested in data to be
persisted to disk not the associated filesystem metadata.
For reads we ask customers to turn off noatime, but instead
we can proactively use O_NOATIME flag to avoid atime updates
upon reads.
This removes the Content-MD5 response header on Range requests in Azure
Gateway mode. The partial content MD5 doesn't match the full object MD5
in metadata.
This commit adds a new package `etag` for dealing
with S3 ETags.
Even though ETag is often viewed as MD5 checksum of
an object, handling S3 ETags correctly is a surprisingly
complex task. While it is true that the ETag corresponds
to the MD5 for the most basic S3 API operations, there are
many exceptions in case of multipart uploads or encryption.
In worse, some S3 clients expect very specific behavior when
it comes to ETags. For example, some clients expect that the
ETag is a double-quoted string and fail otherwise.
Non-AWS compliant ETag handling has been a source of many bugs
in the past.
Therefore, this commit adds a dedicated `etag` package that provides
functionality for parsing, generating and converting S3 ETags.
Further, this commit removes the ETag computation from the `hash`
package. Instead, the `hash` package (i.e. `hash.Reader`) should
focus only on computing and verifying the content-sha256.
One core feature of this commit is to provide a mechanism to
communicate a computed ETag from a low-level `io.Reader` to
a high-level `io.Reader`.
This problem occurs when an S3 server receives a request and
has to compute the ETag of the content. However, the server
may also wrap the initial body with several other `io.Reader`,
e.g. when encrypting or compressing the content:
```
reader := Encrypt(Compress(ETag(content)))
```
In such a case, the ETag should be accessible by the high-level
`io.Reader`.
The `etag` provides a mechanism to wrap `io.Reader` implementations
such that the `ETag` can be accessed by a type-check.
This technique is applied to the PUT, COPY and Upload handlers.
server startup code expects the object layer to properly
convert error into a proper type, so that in situations when
servers are coming up and quorum is not available servers
wait on each other.
using isServerResolvable for expiration can lead to chicken
and egg problems, a lock might expire knowingly when server
is booting up causing perpetual locks getting expired.
- In username search filter and username format variables we support %s for
replacing with the username.
- In group search filter we support %s for username and %d for the full DN of
the username.
root-disk implemented currently had issues where root
disk partitions getting modified might race and provide
incorrect results, to avoid this lets rely again back on
DeviceID and match it instead.
In-case of containers `/data` is one such extra entity that
needs to be verified for root disk, due to how 'overlay'
filesystem works and the 'overlay' presents a completely
different 'device' id - using `/data` as another entity
for fallback helps because our containers describe 'VOLUME'
parameter that allows containers to automatically have a
virtual `/data` that points to the container root path this
can either be at `/` or `/var/lib/` (on different partition)
also additionally make sure errors during deserializer closes
the reader with right error type such that Write() end
actually see the final error, this avoids a waitGroup usage
and waiting.
reduce the page-cache pressure completely by moving
the entire read-phase of our operations to O_DIRECT,
primarily this is going to be very useful for chatty
metadata operations such as listing, scanner, ilm, healing
like operations to avoid filling up the page-cache upon
repeated runs.
since we have changed our default envs to MINIO_ROOT_USER,
MINIO_ROOT_PASSWORD this was not supported by minio-go
credentials package, update minio-go to v7.0.10 for this
support. This also addresses few bugs related to users
had to specify AWS_ACCESS_KEY_ID as well to authenticate
with their S3 backend if they only used MINIO_ROOT_USER.
We can use this metric to check if there are too many S3 clients in the
queue and could explain why some of those S3 clients are timing out.
```
minio_s3_requests_waiting_total{server="127.0.0.1:9000"} 9981
```
If max_requests is 10000 then there is a strong possibility that clients
are timing out because of the queue deadline.
If the periodic `case <-t.C:` save gets held up for a long time it will end up
synchronize all disk writes for saving the caches.
We add jitter to per set writes so they don't sync up and don't hold a
lock for the write, since it isn't needed anyway.
If an outage prevents writes for a long while we also add individual
waits for each disk in case there was a queue.
Furthermore limit the number of buffers kept to 2GiB, since this could get
huge in large clusters. This will not act as a hard limit but should be enough
for normal operation.
in the case of active-active replication.
This PR also has the following changes:
- add docs on replication design
- fix corner case of completing versioned delete on a delete marker
when the target is down and `mc rm --vid` is performed repeatedly. Instead
the version should still be retained in the `PENDING|FAILED` state until
replication sync completes.
- remove `s3:Replication:OperationCompletedReplication` and
`s3:Replication:OperationFailedReplication` from ObjectCreated
events type
currently crawler waits for an entire readdir call to
return until it processes usage, lifecycle, replication
and healing - instead we should pass the applicator all
the way down to avoid building any special stack for all
the contents in a single directory.
This allows for
- no need to remember the entire list of entries per directory
before applying the required functions
- no need to wait for entire readdir() call to finish before
applying the required functions
This PR fixes
- allow 's3:versionid` as a valid conditional for
Get,Put,Tags,Object locking APIs
- allow additional headers missing for object APIs
- allow wildcard based action matching
additionally simply timedValue to have RWMutex
to avoid concurrent calls to DiskInfo() getting
serialized, this has an effect on all calls that
use GetDiskInfo() on the same disks.
Such as getOnlineDisks, getOnlineDisksWithoutHealing
The function used for getting host information
(host.SensorsTemperaturesWithContext) returns warnings in some cases.
Returning with error in such cases means we miss out on the other useful
information already fetched (os info).
If the OS info has been succesfully fetched, it should always be
included in the output irrespective of whether the other data (CPU
sensors, users) could be fetched or not.
To avoid large delays in metacache cleanup, use rename
instead of recursive delete calls, renames are cheaper
move the content to minioMetaTmpBucket and then cleanup
this folder once in 24hrs instead.
If the new cache can replace an existing one, we should
let it replace since that is currently being saved anyways,
this avoids pile up of 1000's of metacache entires for
same listing calls that are not necessary to be stored
on disk.
Skip notifications on objects that might have had
an error during deletion, this also avoids unnecessary
replication attempt on such objects.
Refactor some places to make sure that we have notified
the client before we
- notify
- schedule for replication
- lifecycle etc.
continuation of PR#11491 for multiple server pools and
bi-directional replication.
Moving proxying for GET/HEAD to handler level rather than
server pool layer as this was also causing incorrect proxying
of HEAD.
Also fixing metadata update on CopyObject - minio-go was not passing
source version ID in X-Amz-Copy-Source header
filter out relevant objects for each pool to
avoid calling, further delete operations on
subsequent pools where some of these objects
might not exist.
This is mainly useful to avoid situations
during bi-directional bucket replication.
top-level options shouldn't be passed down for
GetObjectInfo() while verifying the objects in
different pools, this is to make sure that
we always get the value from the pool where
the object exists.
This change moves away from a unified constructor for plaintext and encrypted
usage. NewPutObjReader is simplified for the plain-text reader use. For
encrypted reader use, WithEncryption should be called on an initialized PutObjReader.
Plaintext:
func NewPutObjReader(rawReader *hash.Reader) *PutObjReader
The hash.Reader is used to provide payload size and md5sum to the downstream
consumers. This is different from the previous version in that there is no need
to pass nil values for unused parameters.
Encrypted:
func WithEncryption(encReader *hash.Reader,
key *crypto.ObjectKey) (*PutObjReader, error)
This method sets up encrypted reader along with the key to seal the md5sum
produced by the plain-text reader (already setup when NewPutObjReader was
called).
Usage:
```
pReader := NewPutObjReader(rawReader)
// ... other object handler code goes here
// Prepare the encrypted hashed reader
pReader, err = pReader.WithEncryption(encReader, objEncKey)
```
NoncurrentVersionExpiry can remove single delete markers according to
S3 spec:
```
The NoncurrentVersionExpiration action in the same Lifecycle
configuration removes noncurrent objects 30 days after they become
noncurrent. Thus, in this example, all object versions are permanently
removed 90 days after object creation. You will have expired object
delete markers, but Amazon S3 detects and removes the expired object
delete markers for you.
```
Collection of SMART information doesn't work in certain scenarios e.g.
in a container based setup. In such cases, instead of returning an error
(without any data), we should only set the error on the smartinfo
struct, so that other important drive hw info like device, mountpoint,
etc is retained in the output.
The code inside the `hotfix` taget is overriding the values set at the
beginning of the Makefile affecting other make targets as well.
For example, running `TAG=mytag make docker` also ends up tagging
the docker image as a hotfix instead of `mytag`.
Using the `eval` function inside the `hotfix` target fixes this.
during rolling upgrade, provide a more descriptive error
message and discourage rolling upgrade in such situations,
allowing users to take action.
additionally also rename `slashpath -> pathutil` to avoid
a slighly mis-pronounced usage of `path` package.
When lifecycle decides to Delete an object and not a version in a
versioned bucket, the code should create a delete marker and not
removing the scanned version.
This commit fixes the issue.
The connections info of the processes takes up a huge amount of space,
and is not important for adding any useful health checks. Removing it
will significantly reduce the size of the subnet health report.
- lock maintenance loop was incorrectly sleeping
as well as using ticker badly, leading to
extra expiration routines getting triggered
that could flood the network.
- multipart upload cleanup should be based on
timer instead of ticker, to ensure that long
running jobs don't get triggered twice.
- make sure to get right lockers for object name
Replaces #11449
Does concurrent healing but limits concurrency to 50 buckets.
Aborts on first error.
`errgroup.Group` is extended to facilitate this in a generic way.
We use multiple libraries in health info, but the returned error does
not indicate exactly what library call is failing, hence adding named
tags to returned errors whenever applicable.
After recent refactor where lifecycle started to rely on ObjectInfo to
make decisions, it turned out there are some issues calculating
Successor Modtime and NumVersions, hence the lifecycle is not working as
expected in a versioning bucket in some cases.
This commit fixes the behavior.
When a directory object is presented as a `prefix`
param our implementation tend to only list objects
present common to the `prefix` than the `prefix` itself,
to mimic AWS S3 like flat key behavior this PR ensures
that if `prefix` is directory object, it should be
automatically considered to be part of the eventual
listing result.
fixes#11370
few places were still using legacy call GetObject()
which was mainly designed for client response writer,
use GetObjectNInfo() for internal calls instead.
for some flaky networks this may be too fast of a value
choose a defensive value, and let this be addressed
properly in a new refactor of dsync with renewal logic.
Also enable faster fallback delay to cater for misconfigured
IPv6 servers
refer
- https://golang.org/pkg/net/#Dialer
- https://tools.ietf.org/html/rfc6555
- using miniogo.ObjectInfo.UserMetadata is not correct
- using UserTags from Map->String() can change order
- ContentType comparison needs to be removed.
- Compare both lowercase and uppercase key names.
- do not silently error out constructing PutObjectOptions
if tag parsing fails
- avoid notification for empty object info, failed operations
should rely on valid objInfo for notification in all
situations
- optimize copyObject implementation, also introduce a new
replication event
- clone ObjectInfo() before scheduling for replication
- add additional headers for comparison
- remove strings.EqualFold comparison avoid unexpected bugs
- fix pool based proxying with multiple pools
- compare only specific metadata
Co-authored-by: Poorna Krishnamoorthy <poornas@users.noreply.github.com>
This commit refactors the SSE implementation and add
S3-compatible SSE-KMS context handling.
SSE-KMS differs from SSE-S3 in two main aspects:
1. The client can request a particular key and
specify a KMS context as part of the request.
2. The ETag of an SSE-KMS encrypted object is not
the MD5 sum of the object content.
This commit only focuses on the 1st aspect.
A client can send an optional SSE context when using
SSE-KMS. This context is remembered by the S3 server
such that the client does not have to specify the
context again (during multipart PUT / GET / HEAD ...).
The crypto. context also includes the bucket/object
name to prevent renaming objects at the backend.
Now, AWS S3 behaves as following:
- If the user does not provide a SSE-KMS context
it does not store one - resp. does not include
the SSE-KMS context header in the response (e.g. HEAD).
- If the user specifies a SSE-KMS context without
the bucket/object name then AWS stores the exact
context the client provided but adds the bucket/object
name internally. The response contains the KMS context
without the bucket/object name.
- If the user specifies a SSE-KMS context with
the bucket/object name then AWS again stores the exact
context provided by the client. The response contains
the KMS context with the bucket/object name.
This commit implements this behavior w.r.t. SSE-KMS.
However, as of now, no such object can be created since
the server rejects SSE-KMS encryption requests.
This commit is one stepping stone for SSE-KMS support.
Co-authored-by: Harshavardhana <harsha@minio.io>
```
minio server /tmp/disk{1...4}
mc mb myminio/testbucket/
mkdir -p /tmp/disk{1..4}/testbucket/test-prefix/
```
This would end up being listed in the current
master, this PR fixes this situation.
If a directory is a leaf dir we should it
being listed, since it cannot be deleted anymore
with DeleteObject, DeleteObjects() API calls
because we natively support directories now.
Avoid listing it and let healing purge this folder
eventually in the background.
MINIO_API_REPLICATION_WORKERS env.var and
`mc admin config set api` allow number of replication
workers to be configurable. Defaults to half the number
of cpus available.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
This commit fixes a bug in the S3 gateway that causes
GET requests to fail when the object is encrypted by the
gateway itself.
The gateway was not able to GET the object since it always
specified a `If-Match` pre-condition checking that the object
ETag matches an expected ETag - even for encrypted ETags.
The problem is that an encrypted ETag will never match the ETag
computed by the backend causing the `If-Match` pre-condition
to fail.
This commit fixes this by not sending an `If-Match` header when
the ETag is encrypted. This is acceptable because:
1. A gateway-encrypted object consists of two objects at the backend
and there is no way to provide a concurrency-safe implementation
of two consecutive S3 GETs in the deployment model of the S3
gateway.
Ref: S3 gateways are self-contained and isolated - and there may
be multiple instances at the same time (no lock across
instances).
2. Even if the data object changes (concurrent PUT) while gateway
A has download the metadata object (but not issued the GET to
the data object => data race) then we don't return invalid data
to the client since the decryption (of the currently uploaded data)
will fail - given the metadata of the previous object.
Currently, it is not possible to create a delete-marker when xl.meta
does not exist (no version is created for that object yet). This makes a
problem for replication and mc mirroring with versioning enabled.
This also follows S3 specification.
github.com/ncw/directio doesn't support OpenBSD, but OpenBSD has
syscall.Fsync. (It also has fdatasync:
https://man.openbsd.org/fdatasync
but apparently Golang can't call it).
This commit deprecates the native Hashicorp Vault
support and removes the legacy Vault documentation.
The native Hashicorp Vault documentation is marked as
outdated and deprecated for over a year now. We give
another 6 months before we start removing Hashicorp Vault
support and show a deprecation warning when a MinIO server
starts with a native Vault configuration.
currently we had a restriction where older setups would
need to follow previous style of "stripe" count being same
expansion, we can relax that instead newer pools can be
expanded for older setups with newer constraints of
common parity ratio.
In PR #11165 due to incorrect proxying for 2
way replication even when the object was not
yet replicated
Additionally, fix metadata comparisons when
deciding to do full replication vs metadata copy.
fixes#11340
Previously we added heal trigger when bit-rot checks
failed, now extend that to support heal when parts
are not found either. This healing gets only triggered
if we can successfully decode the object i.e read
quorum is still satisfied for the object.
under large deployments loading credentials might be
time consuming, while this is okay and we will not
respond quickly for `mc admin user list` like queries
but it is possible to support `mc admin user info`
just like how we handle authentication by fetching
the user directly from persistent store.
additionally support service accounts properly,
reloaded from etcd during watch() - this was missing
This PR is also half way remedy for #11305
This change allows the MinIO server to be configured with a special (read-only)
LDAP account to perform user DN lookups.
The following configuration parameters are added (along with corresponding
environment variables) to LDAP identity configuration (under `identity_ldap`):
- lookup_bind_dn / MINIO_IDENTITY_LDAP_LOOKUP_BIND_DN
- lookup_bind_password / MINIO_IDENTITY_LDAP_LOOKUP_BIND_PASSWORD
- user_dn_search_base_dn / MINIO_IDENTITY_LDAP_USER_DN_SEARCH_BASE_DN
- user_dn_search_filter / MINIO_IDENTITY_LDAP_USER_DN_SEARCH_FILTER
This lookup-bind account is a service account that is used to lookup the user's
DN from their username provided in the STS API. When configured, searching for
the user DN is enabled and configuration of the base DN and filter for search is
required. In this "lookup-bind" mode, the username format is not checked and must
not be specified. This feature is to support Active Directory setups where the
DN cannot be simply derived from the username.
When the lookup-bind is not configured, the old behavior is enabled: the minio
server performs LDAP lookups as the LDAP user making the STS API request and the
username format is checked and configuring it is required.
STS tokens can be obtained by using local APIs
once the remote JWT token is presented, current
code was not validating the incoming token in the
first place and was incorrectly making a network
operation using that token.
For the most part this always works without issues,
but under adversarial scenarios it exposes client
to hand-craft a request that can reach internal
services without authentication.
This kind of proxying should be avoided before
validating the incoming token.
The user can see __XLDIR__ prefix in mc admin heal when the command
heals an empty object with a trailing slash. This commit decodes the
name of the object before sending it back to the upper level.
In-case user enables O_DIRECT for reads and backend does
not support it we shall proceed to turn it off instead
and print a warning. This validation avoids any unexpected
downtimes that users may incur.
DNSCache dialer is a global value initialized in
init(), whereas `go` keeps `var =` before `init()`
, also we don't need to keep proxy routers as
global entities - register the forwarder as
necessary to avoid crashes.
```
mc admin config set alias/ storage_class standard=EC:3
```
should only succeed if parity ratio is valid for all
server pools, if not we should fail proactively.
This PR also needs to bring other changes now that
we need to cater for variadic drive counts per pool.
Bonus fixes also various bugs reproduced with
- GetObjectWithPartNumber()
- CopyObjectPartWithOffsets()
- CopyObjectWithMetadata()
- PutObjectPart,PutObject with truncated streams
Trace data can be rather large and compresses fine.
Compress profile data in zip files:
```
277.895.314 before.profiles.zip
152.800.318 after.profiles.zip
```
Delete marker can have `metaSys` set to nil, that
can lead to crashes after the delete marker has
been healed.
Additionally also fix isObjectDangling check
for transitioned objects, that do not have parts
should be treated similar to Delete marker.
During expansion we need to validate if
- new deployment is expanded with newer constraints
- existing deployment is expanded with older constraints
- multiple server pools rejected if they have different
deploymentID and distribution algo
to verify moving content and preserving legacy content,
we have way to detect the objects through readdir()
this path is not necessary for most common cases on
newer setups, avoid readdir() to save multiple system
calls.
also fix the CheckFile behavior for most common
use case i.e without legacy format.
If an erasure set had a drive replacement recently, we don't
need to attempt healing on another drive with in the same erasure
set - this would ensure we do not double heal the same content
and also prioritizes usage for such an erasure set to be calculated
sooner.
For objects with `N` prefix depth, this PR reduces `N` such network
operations by converting `CheckFile` into a single bulk operation.
Reduction in chattiness here would allow disks to be utilized more
cleanly, while maintaining the same functionality along with one
extra volume check stat() call is removed.
Update tests to test multiple sets scenario
Fixes support for using multiple base DNs for user search in the LDAP directory
allowing users from different subtrees in the LDAP hierarchy to request
credentials.
- The username in the produced credentials is now the full DN of the LDAP user
to disambiguate users in different base DNs.
parentDirIsObject is not using set level understanding
to check for parent objects, without this it can lead to
objects that can actually reside on a separate set as
objects and would conflict.
Current implementation requires server pools to have
same erasure stripe sizes, to facilitate same SLA
and expectations.
This PR allows server pools to be variadic, i.e they
do not have to be same erasure stripe sizes - instead
they should have SLA for parity ratio.
If the parity ratio cannot be guaranteed by the new
server pool, the deployment is rejected i.e server
pool expansion is not allowed.
This ensures that all the prometheus monitoring and usage
trackers to avoid alerts configured, although we cannot
support v1 to v2 here - we can v2 to v3.
When the file name or the prefix contains `#`,
whatever comes after is ingored.
This is fixed by encoding the object path using
encodeURIComponent.
Fixes#11112
A lot of memory is consumed when uploading small files in parallel, use
the default upload parameters and add MINIO_AZURE_UPLOAD_CONCURRENCY for
users to tweak.
Synchronous replication can be enabled by setting the --sync
flag while adding a remote replication target.
This PR also adds proxying on GET/HEAD to another node in a
active-active replication setup in the event of a 404 on the current node.
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.
After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.
Benefits are seen from almost all objects from
1KiB - 32KiB
Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.
All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
30 seconds white spaces is long for some setups which time out when no
read activity in short time, reduce the subnet health white space ticker
to 5 seconds, since it has no cost at all.
Use separate sync.Pool for writes/reads
Avoid passing buffers for io.CopyBuffer()
if the writer or reader implement io.WriteTo or io.ReadFrom
respectively then its useless for sync.Pool to allocate
buffers on its own since that will be completely ignored
by the io.CopyBuffer Go implementation.
Improve this wherever we see this to be optimal.
This allows us to be more efficient on memory usage.
```
385 // copyBuffer is the actual implementation of Copy and CopyBuffer.
386 // if buf is nil, one is allocated.
387 func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
388 // If the reader has a WriteTo method, use it to do the copy.
389 // Avoids an allocation and a copy.
390 if wt, ok := src.(WriterTo); ok {
391 return wt.WriteTo(dst)
392 }
393 // Similarly, if the writer has a ReadFrom method, use it to do the copy.
394 if rt, ok := dst.(ReaderFrom); ok {
395 return rt.ReadFrom(src)
396 }
```
From readahead package
```
// WriteTo writes data to w until there's no more data to write or when an error occurs.
// The return value n is the number of bytes written.
// Any error encountered during the write is also returned.
func (a *reader) WriteTo(w io.Writer) (n int64, err error) {
if a.err != nil {
return 0, a.err
}
n = 0
for {
err = a.fill()
if err != nil {
return n, err
}
n2, err := w.Write(a.cur.buffer())
a.cur.inc(n2)
n += int64(n2)
if err != nil {
return n, err
}
```
with changes present to automatically throttle crawler
at runtime, there is no need to have an environment
value to disable crawling. crawling is a fundamental
piece for healing, lifecycle and many other features
there is no good reason anyone would need to disable
this on a production system.
* Apply suggestions from code review
globalSubscribers.NumSubscribers() is heavily used in S3 requests and it
uses mutex, use atomic.Load instead since it is faster
Co-authored-by: Anis Elleuch <anis@min.io>
Rewrite parentIsObject() function. Currently if a client uploads
a/b/c/d, we always check if c, b, a are actual objects or not.
The new code will check with the reverse order and quickly quit if
the segment doesn't exist.
So if a, b, c in 'a/b/c' does not exist in the first place, then returns
false quickly.
The only purpose of check-dir flag in
ReadVersion is to return 404 when
an object has xl.meta but without data.
This is causing an extract call to the disk
which can be penalizing in case of busy system
where disks receive many concurrent access.
mc admin trace does not show the correct handler name in the output: it
is printing `maxClients' for all handlers. The reason is that the wrong
order of handler wrapping.
Fixes two problems
- Double healing when bitrot is enabled, instead heal attempt
once in applyActions() before lifecycle is applied.
- If applyActions() is successful and getSize() returns proper
value, then object is accounted for and should be removed
from the oldCache namespace map to avoid double heal attempts.
main reason is that HealObjects starts a recursive listing
for each object, this can be a really really long time on
large namespaces instead avoid recursive listing just
perform HealObject() instead at the prefix.
delete's already handle purging dangling content, we
don't need to achieve this by doing recursive listing,
this in-turn can delay crawling significantly.
Optimizations include
- do not write the metacache block if the size of the
block is '0' and it is the first block - where listing
is attempted for a transient prefix, this helps to
avoid creating lots of empty metacache entries for
`minioMetaBucket`
- avoid the entire initialization sequence of cacheCh
, metacacheBlockWriter if we are simply going to skip
them when discardResults is set to true.
- No need to hold write locks while writing metacache
blocks - each block is unique, per bucket, per prefix
and also is written by a single node.
current implementation was incorrect, it in-fact
assumed only read quorum number of disks. in-fact
that value is only meant for read quorum good entries
from all online disks.
This PR fixes this behavior properly.
This commit refactors the code in `cmd/crypto`
and separates SSE-S3, SSE-C and SSE-KMS.
This commit should not cause any behavior change
except for:
- `IsRequested(http.Header)`
which now returns the requested type {SSE-C, SSE-S3,
SSE-KMS} and does not consider SSE-C copy headers.
However, SSE-C copy headers alone are anyway not valid.
Additional cases handled
- fix address situations where healing is not
triggered on failed writes and deletes.
- consider object exists during listing when
metadata can be successfully decoded.
always find the right set of online peers for remote listing,
this may have an effect on listing if the server is down - we
should do this to avoid always performing transient operations
on bucket->peerClient that is permanently or down for a long
period.
additionally also configure http2 healthcheck
values to quickly detect unstable connections
and let them timeout.
also use single transport for proxying requests
with missing nextMarker with delimiter based listing,
top level prefixes beyond 4500 or max-keys value
wouldn't be sent back for client to ask for the next
batch.
reproduced at a customer deployment, create prefixes
as shown below
```
for year in $(seq 2017 2020)
do
for month in {01..12}
do for day in {01..31}
do
mc -q cp file myminio/testbucket/dir/day_id=$year-$month-$day/;
done
done
done
```
Then perform
```
aws s3api --profile minio --endpoint-url http://localhost:9000 list-objects \
--bucket testbucket --prefix dir/ --delimiter / --max-keys 1000
```
You shall see missing NextMarker, this would disallow listing beyond max-keys
requested and also disallow beyond 4500 (maxKeyObjectList) prefixes being listed
because client wouldn't know the NextMarker available.
This PR addresses this situation properly by making the implementation
more spec compatible. i.e NextMarker in-fact can be either an object, a prefix
with delimiter depending on the input operation.
This issue was introduced after the list caching changes and has been present
for a while.
regression was introduced by cce5d7152a
which incorrectly implemented the tests and fixed a wrong requirement,
CompleteMultipart should never succeed in the tests.
issue was introduced in #11106 the following
pattern
<-t.C // timer fired
if !t.Stop() {
<-t.C // timer hangs
}
Seems to hang at the last `t.C` line, this
issue happens because a fired timer cannot be
Stopped() anymore and t.Stop() returns `false`
leading to confusing state of usage.
Refactor the code such that use timers appropriately
with exact requirements in place.
Both Azure & S3 gateways call for object information before returning
the stream of the object, however, the object content/length could be
modified meanwhile, which means it can return a corrupted object.
Use ETag to ensure that the object was not modified during the GET call
Currently, the test doesn't complete nor abort the multipart upload
after tesing the list operation. It caused the `cleanup()` to fail since
the object can't not be deleted which causes the bucket deletion to
fail.
This PR adds a `CompleteMultipartUpload` call to commit the object so
the cleanup can successfully delete the object and bucket.
crawler should only ListBuckets once not for each serverPool,
buckets are same across all pools, across sets and ListBuckets
always returns an unified view, once list buckets returns
sort it by create time to scan the latest buckets earlier
with the assumption that latest buckets would have lesser
content than older buckets allowing them to be scanned faster
and also to be able to provide more closer to latest view.
When searching the caches don't copy the ids, instead inline the loop.
```
Benchmark_bucketMetacache_findCache-32 19200 63490 ns/op 8303 B/op 5 allocs/op
Benchmark_bucketMetacache_findCache-32 20338 58609 ns/op 111 B/op 4 allocs/op
```
Add a reasonable, but still the simplistic benchmark.
Bonus - make nicer zero alloc logging
With new refactor of bucket healing, healing bucket happens
automatically including its metadata, there is no need to
redundant heal buckets also in ListBucketsHeal remove
it.
optimization mainly to avoid listing the entire
`.minio.sys/buckets/.minio.sys` directory, this
can get really huge and comes in the way of startup
routines, contents inside `.minio.sys/buckets/.minio.sys`
are rather transient and not necessary to be healed.
Tests environments (go test or manual testing) should always consider
the passed disks are root disks and should not rely on disk.IsRootDisk()
function. The reason is that this latter can return a false negative
when called in a busy system. However, returning a false negative will
only occur in a testing environment and not in a production, so we can
accept this trade-off for now.
Perform cleanup operations on copied data. Avoids read locking
data while determining which caches to keep.
Also, reduce the log(N*N) operation to log(N*M) where M caches
with the same root or below when checking potential replacements.
This refactor is done for few reasons below
- to avoid deadlocks in scenarios when number
of nodes are smaller < actual erasure stripe
count where in N participating local lockers
can lead to deadlocks across systems.
- avoids expiry routines to run 1000 of separate
network operations and routes per disk where
as each of them are still accessing one single
local entity.
- it is ideal to have since globalLockServer
per instance.
- In a 32node deployment however, each server
group is still concentrated towards the
same set of lockers that partipicate during
the write/read phase, unlike previous minio/dsync
implementation - this potentially avoids send
32 requests instead we will still send at max
requests of unique nodes participating in a
write/read phase.
- reduces overall chattiness on smaller setups.
The bucket forwarder handler considers MakeBucket to be always local but
it mistakenly thinks that PUT bucket lifecycle to be a MakeBucket call.
Fix the check of the MakeBucket call by ensuring that the query is empty
in the PUT url.
till now we used to match the inode number of the root
drive and the drive path minio would use, if they match
we knew that its a root disk.
this may not be true in all situations such as running
inside a container environment where the container might
be mounted from a different partition altogether, root
disk detection might fail.
partNumber was miscalculting the start and end of parts when partNumber
query is specified in the GET request. This commit fixes it and also
fixes the ContentRange header in that case.
PR 038bcd9079 introduced
version '3', we need to make sure that we do not
print an unexpected error instead log a message to
indicate we will auto update the version.
hotfix target will fetch the release tag prior to the latest commit and create a binary
with the same release tag plus '.hotfix' suffix
e.g. RELEASE.2020-12-03T05-49-24Z.hotfix
Due to botched upstream renames of project repositories
and incomplete migration to go.mod support, our current
dependency version of `go.mod` had bugs i.e it was
using commits from master branch which didn't have
the required fixes present in release-3.4 branches
which leads to some rare bugs
https://github.com/etcd-io/etcd/pull/11477 provides
a workaround for now and we should migrate to this.
release-3.5 eventually claims to fix all of this
properly until then we cannot use /v3 import right now
supports `mc admin config set <alias> heal sleep=100ms` to
enable more aggressive healing under certain times.
also optimize some areas that were doing extra checks than
necessary when bitrotscan was enabled, avoid double sleeps
make healing more predictable.
fixes#10497
- accountInfo API that returns information about
user, access to buckets and the size per bucket
- addUser - user is allowed to change their secretKey
- getUserInfo - returns user info if the incoming
is the same user requesting their information
In some cases a writer could be left behind unclosed, leaking compression blocks.
Always close and set compression concurrency to 2 which should be fine to keep up.
Due to https://github.com/philhofer/fwd/issues/20 when skipping a metadata entry that is >2048 bytes and the buffer is full (2048 bytes) the skip will fail with `io.ErrNoProgress`.
Enlarge the buffer so we temporarily make this much more unlikely.
If it still happens we will have to rewrite the skips to reads.
Fixes#10959
dangling object when deleted means object doesn't exist
anymore, so we should return appropriate errors, this
allows crawler heal to ensure that it removes the tracker
for dangling objects.
AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY are used in
azure CLI to specify the azure blob storage access & secret keys. With this commit,
it is possible to set them if you want the gateway's own credentials to be
different from the Azure blob credentials.
Co-authored-by: Harshavardhana <harsha@minio.io>
dangling objects when removed `mc admin heal -r` or crawler
auto heal would incorrectly return error - this can interfere
with usage calculation as the entry size for this would be
returned as `0`, instead upon success use the resultant
object size to calculate the final size for the object
and avoid reporting this in the log messages
Also do not set ObjectSize in healResultItem to be '-1'
this has an effect on crawler metrics calculating 1 byte
less for objects which seem to be missing their `xl.meta`
X-Minio-Replication-Delete-Status header shows the
status of the replication of a permanent delete of a version.
All GETs are disallowed and return 405 on this object version.
In the case of replicating delete markers.
X-Minio-Replication-DeleteMarker-Status shows the status
of replication, and would similarly return 405.
Additionally, this PR adds reporting of delete marker event completion
and updates documentation
Instead of having less/more fields inside a structure depending on the
platform (non-linux/linux), it would be better to have the same standard
definition in all platforms, and certain fields of the structure to be
populated or left unpopulated depending on the platform.
Alternative to #10927
Instead of having an upstream fix, do unwrap when checking network errors.
'As' will also work when destination is an interface as checked by the tests.
This PR adds transition support for ILM
to transition data to another MinIO target
represented by a storage class ARN. Subsequent
GET or HEAD for that object will be streamed from
the transition tier. If PostRestoreObject API is
invoked, the transitioned object can be restored for
duration specified to the source cluster.
allow directories to be replicated as well, along with
their delete markers in replication.
Bonus fix to fix bloom filter updates for directories
to be preserved.
fixes a regression introduced in #10859, due
to the error returned by rest.Client being typed
i.e *rest.NetworkError - IsNetworkHostDown function
didn't work as expected to detect network issues.
This in-turn aggravated the situations when nodes
are disconnected leading to performance loss.
Do listings with prefix filter when bloom filter is dirty.
This will forward the prefix filter to the lister which will make it
only scan the folders/objects with the specified prefix.
If we have a clean bloom filter we try to build a more generally
useful cache so in that case, we will list all objects/folders.
"hw.physmem" only reports reasonable results on 32 bit systems and is
kept to not break binary compatibility. On other systems, it reports
"-1". "hw.phymem64" reports reasonable results on all systems.
MinIO getHwPhysmem method currently uses the deprecated
"hw.physmem" key which results in a parse error (due to the -1).
This pull request asks for changing this. The change was tested
successfully on NetBSD/amd64 9.1.
Add shortcut for `APN/1.0 Veeam/1.0 Backup/10.0`
It requests unique blocks with a specific prefix. We skip
scanning the parent directory for more objects matching the prefix.
Allow each crawler operation to sleep up to 10 seconds on very heavily loaded systems.
This will of course make minimum crawler speed less, but should be more effective at stopping.
Delete marker replication is implemented for V2
configuration specified in AWS spec (though AWS
allows it only in the V1 configuration).
This PR also brings in a MinIO only extension of
replicating permanent deletes, i.e. deletes specifying
version id are replicated to target cluster.
This will make the health check clients 'silent'.
Use `IsNetworkOrHostDown` determine if network is ok so it mimics the functionality in the actual client.
this is needed such that we make sure to heal the
users, policies and bucket metadata right away as
we do listing based on list cache which only lists
'3' sufficiently good drives, to avoid possibly
losing access to these users upon upgrade make
sure to heal them.
If a scanning server shuts down unexpectedly we may have "successful" caches that are incomplete on a set.
In this case mark the cache with an error so it will no longer be handed out.
Add `MINIO_API_EXTEND_LIST_CACHE_LIFE` that will extend
the life of generated caches for a while.
This changes caches to remain valid until no updates have been
received for the specified time plus a fixed margin.
This also changes the caches from being invalidated when the *first*
set finishes until the *last* set has finished plus the specified time
has passed.
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.
Additionally, remove all unused walkers from the storage layer
A new field called AccessKey is added to the ReqInfo struct and populated.
Because ReqInfo is added to the context, this allows the AccessKey to be
accessed from 3rd-party code, such as a custom ObjectLayer.
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Kaloyan Raev <kaloyan@storj.io>
On extremely long running listings keep the transient list 15 minutes after last update instead of using start time.
Also don't do overlap checks on transient lists.
Add trashcan that keeps recently updated lists after bucket deletion.
All caches were deleted once a bucket was deleted, so caches still running would report errors. Now they are canceled.
Fix `.minio.sys` not being transient.
Bonus fixes, remove package retry it is harder to get it
right, also manage context remove it such that we don't have
to rely on it anymore instead use a simple Jitter retry.
WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers
used by `io.CopyBuffer` even if they are pooled.
Since all writers appear to write byte buffers, just send those
instead and write directly. The files are opened through the `os`
package so they have no special properties anyway.
This removes the alloc and copy for each operation.
REST sends content length so a precise alloc can be made.
this reduces allocations in order of magnitude
Also, revert "erasure: delete dangling objects automatically (#10765)"
affects list caching should be investigated.
Add store and a forward option for a single part
uploads when an async mode is enabled with env
MINIO_CACHE_COMMIT=writeback
It defaults to `writethrough` if unspecified.
Bonus fixes, we do not need reload format anymore
as the replaced drive is healed locally we only need
to ensure that drive heal reloads the drive properly.
We preserve the UUID of the original order, this means
that the replacement in `format.json` doesn't mean that
the drive needs to be reloaded into memory anymore.
fixes#10791
when server is booting up there is a possibility
that users might see '503' because object layer
when not initialized, then the request is proxied
to neighboring peers first one which is online.
* Fix caches having EOF marked as a failure.
* Simplify cache updates.
* Provide context for checkMetacacheState failures.
* Log 499 when the client disconnects.
`decryptObjectInfo` is a significant bottleneck when listing objects.
Reduce the allocations for a significant speedup.
https://github.com/minio/sio/pull/40
```
λ benchcmp before.txt after.txt
benchmark old ns/op new ns/op delta
Benchmark_decryptObjectInfo-32 24260928 808656 -96.67%
benchmark old MB/s new MB/s speedup
Benchmark_decryptObjectInfo-32 0.04 1.24 31.00x
benchmark old allocs new allocs delta
Benchmark_decryptObjectInfo-32 75112 48996 -34.77%
benchmark old bytes new bytes delta
Benchmark_decryptObjectInfo-32 287694772 4228076 -98.53%
```
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c
Gist of improvements:
* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no
longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
only newly replaced drives get the new `format.json`,
this avoids disks reloading their in-memory reference
format, ensures that drives are online without
reloading the in-memory reference format.
keeping reference format in-tact means UUIDs
never change once they are formatted.
lockers currently might leave stale lockers,
in unknown ways waiting for downed lockers.
locker check interval is high enough to safely
cleanup stale locks.
reference format should be source of truth
for inconsistent drives which reconnect,
add them back to their original position
remove automatic fix for existing offline
disk uuids
Bonus fixes
- logging improvements to ensure that we don't use
`go logger.LogIf` to avoid runtime.Caller missing
the function name. log where necessary.
- remove unused code at erasure sets
Test TestDialContextWithDNSCacheRand was failing sometimes because it depends
on a random selection of addresses when testing random DNS resolution from cache.
Lower addr selection exception to 10%
Allow requests to come in for users as soon as object
layer and config are initialized, this allows users
to be authenticated sooner and would succeed automatically
on servers which are yet to fully initialize.
Go stdlib resolver doesn't support caching DNS
resolutions, since we compile with CGO disabled
we are more probe to DNS flooding for all network
calls to resolve for DNS from the DNS server.
Under various containerized environments such as
VMWare this becomes a problem because there are
no DNS caches available and we may end up overloading
the kube-dns resolver under concurrent I/O.
To circumvent this issue implement a DNSCache resolver
which resolves DNS and caches them for around 10secs
with every 3sec invalidation attempted.
connect disks pre-emptively upon startup, to ensure we have
enough disks are connected at startup rather than wait
for them.
we need to do this to avoid long wait times for server to
be online when we have servers come up in rolling upgrade
fashion
Only use dynamic delays for the crawler. Even though the max wait was 1 second the number
of waits could severely impact crawler speed.
Instead of relying on a global metric, we use the stateless local delays to keep the crawler
running at a speed more adjusted to current conditions.
The only case we keep it is before bitrot checks when enabled.
This PR fixes a hang which occurs quite commonly at higher concurrency
by allowing following changes
- allowing lower connections in time_wait allows faster socket open's
- lower idle connection timeout to ensure that we let kernel
reclaim the time_wait connections quickly
- increase somaxconn to 4096 instead of 2048 to allow larger tcp
syn backlogs.
fixes#10413
This change tracks bandwidth for a bucket and object
- [x] Add Admin API
- [x] Add Peer API
- [x] Add BW throttling
- [x] Admin APIs to set replication limit
- [x] Admin APIs for fetch bandwidth
In almost all scenarios MinIO now is
mostly ready for all sub-systems
independently, safe-mode is not useful
anymore and do not serve its original
intended purpose.
allow server to be fully functional
even with config partially configured,
this is to cater for availability of actual
I/O v/s manually fixing the server.
In k8s like environments it will never make
sense to take pod into safe-mode state,
because there is no real access to perform
any remote operation on them.
- select lockers which are non-local and online to have
affinity towards remote servers for lock contention
- optimize lock retry interval to avoid sending too many
messages during lock contention, reduces average CPU
usage as well
- if bucket is not set, when deleteObject fails make sure
setPutObjHeaders() honors lifecycle only if bucket name
is set.
- fix top locks to list out always the oldest lockers always,
avoid getting bogged down into map's unordered nature.
This is to allow remote targets to be generalized
for replication/ILM transition
Also adding a field in BucketTarget to identify
a remote target with a label.
This commit fixes a misuse of the `http.ResponseWriter.WriteHeader`.
A caller should **either** call `WriteHeader` exactly once **or**
write to the response writer and causing an implicit 200 OK.
Writing the response headers more than once causes a `http: superfluous
response.WriteHeader call` log message. This commit fixes this
by preventing a 2nd `WriteHeader` call being forwarded to the underlying
`ResponseWriter`.
Updates #10587
* add NVMe drive info [model num, serial num, drive temp. etc.]
* Ignore fuse partitions
* Add the nvme logic only for linux
* Move smart/nvme structs to a separate file
Co-authored-by: wlan0 <sidharthamn@gmail.com>
throw proper error when port is not accessible
for the regular user, this is possibly a regression.
```
ERROR Unable to start the server: Insufficient permissions to use specified port
> Please ensure MinIO binary has 'cap_net_bind_service=+ep' permissions
HINT:
Use 'sudo setcap cap_net_bind_service=+ep /path/to/minio' to provide sufficient permissions
```
After #10594 let's invalidate the bloom filters to force the next cycles to go through all data.
There is a small chance that the linked PR could have caused missing bloom filter data.
This will invalidate the current bloom filters and make the crawler go through everything.
Routing using on source IP if found. This should distribute
the listing load for V1 and versioning on multiple nodes
evenly between different clients.
If source IP is not found from the http request header, then falls back
to bucket name instead.
Disallow versioning suspension on a bucket with
pre-existing replication configuration
If versioning is suspended on the target,replication
should fail.
`mc admin info` on busy setups will not move HDD
heads unnecessarily for repeated calls, provides
a better responsiveness for the call overall.
Bonus change allow listTolerancePerSet be N-1
for good entries, to avoid skipping entries
for some reason one of the disk went offline.
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.
There are more pending changes where we should move
all the disk formatting logic to backend drives, this
PR doesn't deal with this refactor instead makes it
easier to track healing in the future.
Always check if the auto-generated code is still compatible with the
existing written code to avoid a possible forgetting or sometimes a non
intentional change.
- Add owner information for expiry, locking, unlocking a resource
- TopLocks returns now locks in quorum by default, provides
a way to capture stale locks as well with `?stale=true`
- Simplify the quorum handling for locks to avoid from storage
class, because there were challenges to make it consistent
across all situations.
- And other tiny simplifications to reset locks.
context canceled errors bubbling up from the network
layer has the potential to be misconstrued as network
errors, taking prematurely a server offline and triggering
a health check routine avoid this potential occurrence.
isEnded() was incorrectly calculating if the current healing sequence is
ended or not. h.currentStatus.Items could be empty if healing is very
slow and mc admin heal consumed all items.
As the bulk/recursive delete will require multiple connections to open at an instance,
The default open connections limit will be reached which results in the following error
```FATAL: sorry, too many clients already```
By setting the open connections to a reasonable value - `2`, We ensure that the max open connections
will not be exhausted and lie under bounds.
The queries are simple inserts/updates/deletes which is operational and sufficient with the
the maximum open connection limit is 2.
Fixes#10553
Allow user configuration for MaxOpenConnections
It is possible the heal drives are not reported from
the maintenance check because the background heal
state simply relied on the `format.json` for capturing
unformatted drives. It is possible that drives might
be still healing - make sure that applications which
rely on cluster health check respond back this detail.
Also, revamp the way ListBuckets work make few portions
of the healing logic parallel
- walk objects for healing disks in parallel
- collect the list of buckets in parallel across drives
- provide consistent view for listBuckets()
Healing was not working correctly in the distributed mode because
errFileVersionNotFound was not properly converted in storage rest
client.
Besides, fixing the healing delete marker is not working as expected.
performance improves by around 100x or more
```
go test -v -run NONE -bench BenchmarkGetPartFile
goos: linux
goarch: amd64
pkg: github.com/minio/minio/cmd
BenchmarkGetPartFileWithTrie
BenchmarkGetPartFileWithTrie-4 1000000000 0.140 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/minio/minio/cmd 1.737s
```
fixes#10520
* Fix cases where minimum timeout > default timeout.
* Add defensive code for too small/negative timeouts.
* Never set timeout below the maximum value of a request.
* Protect against (unlikely) int64 wraps.
* Decrease timeout slower.
* Don't re-lock before copying.
This bug was introduced in 14f0047295
almost 3yrs ago, as a side affect of removing stale `fs.json`
but we in-fact end up removing existing good `fs.json` for an
existing object, leading to some form of a data loss.
fixes#10496
from 20s for 10000 parts to less than 1sec
Without the patch
```
~ time aws --endpoint-url=http://localhost:9000 --profile minio s3api \
list-parts --bucket testbucket --key test \
--upload-id c1cd1f50-ea9a-4824-881c-63b5de95315a
real 0m20.394s
user 0m0.589s
sys 0m0.174s
```
With the patch
```
~ time aws --endpoint-url=http://localhost:9000 --profile minio s3api \
list-parts --bucket testbucket --key test \
--upload-id c1cd1f50-ea9a-4824-881c-63b5de95315a
real 0m0.891s
user 0m0.624s
sys 0m0.182s
```
fixes#10503
It was observed in VMware vsphere environment during a
pod replacement, `mc admin info` might report incorrect
offline nodes for the replaced drive. This issue eventually
goes away but requires quite a lot of time for all servers
to be in sync.
This PR fixes this behavior properly.
If the ILM document requires removing noncurrent versions, the
the server should be able to remove 'null' versions as well.
'null' versions are created when versioning is not enabled
or suspended.
The entire encryption layer is dependent on the fact that
KMS should be configured for S3 encryption to work properly
and we only support passing the headers as is to the backend
for encryption only if KMS is configured.
Make sure that this predictability is maintained, currently
the code was allowing encryption to go through and fail
at later to indicate that KMS was not configured. We should
simply reply "NotImplemented" if KMS is not configured, this
allows clients to simply proceed with their tests.
This is to ensure that Go contexts work properly, after some
interesting experiments I found that Go net/http doesn't
cancel the context when Body is non-zero and hasn't been
read till EOF.
The following gist explains this, this can lead to pile up
of go-routines on the server which will never be canceled
and will die at a really later point in time, which can
simply overwhelm the server.
https://gist.github.com/harshavardhana/c51dcfd055780eaeb71db54f9c589150
To avoid this refactor the locking such that we take locks after we
have started reading from the body and only take locks when needed.
Also, remove contextReader as it's not useful, doesn't work as expected
context is not canceled until the body reaches EOF so there is no point
in wrapping it with context and putting a `select {` on it which
can unnecessarily increase the CPU overhead.
We will still use the context to cancel the lockers etc.
Additional simplification in the locker code to avoid timers
as re-using them is a complicated ordeal avoid them in
the hot path, since locking is very common this may avoid
lots of allocations.
This commit fixes the nats TLS tests by generating new certificates
(root CA, server and client) - each valid for 10y. The new certificates
don't have a common name (deprecated by X.509) but SANs instead.
Since Go 1.15 the Go `crypto/x509` package rejects certificates that
only have a common name and no SAN. See: https://golang.org/doc/go1.15#commonname
configurable remote transport timeouts for some special cases
where this value needs to be bumped to a higher value when
transferring large data between federated instances.
In `(*cacheObjects).GetObjectNInfo` copy the metadata before spawning a goroutine.
Clean up a few map[string]string copies as well, reducing allocs and simplifying the code.
Fixes#10426
From https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html#intro-lifecycle-rules-actions
```
When specifying the number of days in the NoncurrentVersionTransition
and NoncurrentVersionExpiration actions in a Lifecycle configuration,
note the following:
It is the number of days from when the version of the object becomes
noncurrent (that is, when the object is overwritten or deleted), that
Amazon S3 will perform the action on the specified object or objects.
Amazon S3 calculates the time by adding the number of days specified in
the rule to the time when the new successor version of the object is
created and rounding the resulting time to the next day midnight UTC.
For example, in your bucket, suppose that you have a current version of
an object that was created at 1/1/2014 10:30 AM UTC. If the new version
of the object that replaces the current version is created at 1/15/2014
10:30 AM UTC, and you specify 3 days in a transition rule, the
transition date of the object is calculated as 1/19/2014 00:00 UTC.
```
This PR adds a DNS target that ensures to update an entry
into Kubernetes operator when a bucket is created or deleted.
See minio/operator#264 for details.
Co-authored-by: Harshavardhana <harsha@minio.io>
MaxConnsPerHost can potentially hang a call without any
way to timeout, we do not need this setting for our proxy
and gateway implementations instead IdleConn settings are
good enough.
Also ensure to use NewRequestWithContext and make sure to
take the disks offline only for network errors.
Fixes#10304
inconsistent drive healing when one of the drive is offline
while a new drive was replaced, this change is to ensure
that we can add the offline drive back into the mix by
healing it again.
Add context to all (non-trivial) calls to the storage layer.
Contexts are propagated through the REST client.
- `context.TODO()` is left in place for the places where it needs to be added to the caller.
- `endWalkCh` could probably be removed from the walkers, but no changes so far.
The "dangerous" part is that now a caller disconnecting *will* propagate down, so a
"delete" operation will now be interrupted. In some cases we might want to disconnect
this functionality so the operation completes if it has started, leaving the system in a cleaner state.
This commit refactors the certificate management implementation
in the `certs` package such that multiple certificates can be
specified at the same time. Therefore, the following layout of
the `certs/` directory is expected:
```
certs/
│
├─ public.crt
├─ private.key
├─ CAs/ // CAs directory is ignored
│ │
│ ...
│
├─ example.com/
│ │
│ ├─ public.crt
│ └─ private.key
└─ foobar.org/
│
├─ public.crt
└─ private.key
...
```
However, directory names like `example.com` are just for human
readability/organization and don't have any meaning w.r.t whether
a particular certificate is served or not. This decision is made based
on the SNI sent by the client and the SAN of the certificate.
***
The `Manager` will pick a certificate based on the client trying
to establish a TLS connection. In particular, it looks at the client
hello (i.e. SNI) to determine which host the client tries to access.
If the manager can find a certificate that matches the SNI it
returns this certificate to the client.
However, the client may choose to not send an SNI or tries to access
a server directly via IP (`https://<ip>:<port>`). In this case, we
cannot use the SNI to determine which certificate to serve. However,
we also should not pick "the first" certificate that would be accepted
by the client (based on crypto. parameters - like a signature algorithm)
because it may be an internal certificate that contains internal hostnames.
We would disclose internal infrastructure details doing so.
Therefore, the `Manager` returns the "default" certificate when the
client does not specify an SNI. The default certificate the top-level
`public.crt` - i.e. `certs/public.crt`.
This approach has some consequences:
- It's the operator's responsibility to ensure that the top-level
`public.crt` does not disclose any information (i.e. hostnames)
that are not publicly visible. However, this was the case in the
past already.
- Any other `public.crt` - except for the top-level one - must not
contain any IP SAN. The reason for this restriction is that the
Manager cannot match a SNI to an IP b/c the SNI is the server host
name. The entire purpose of SNI is to indicate which host the client
tries to connect to when multiple hosts run on the same IP. So, a
client will not set the SNI to an IP.
If we would allow IP SANs in a lower-level `public.crt` a user would
expect that it is possible to connect to MinIO directly via IP address
and that the MinIO server would pick "the right" certificate. However,
the MinIO server cannot determine which certificate to serve, and
therefore always picks the "default" one. This may lead to all sorts
of confusing errors like:
"It works if I use `https:instance.minio.local` but not when I use
`https://10.0.2.1`.
These consequences/limitations should be pointed out / explained in our
docs in an appropriate way. However, the support for multiple
certificates should not have any impact on how deployment with a single
certificate function today.
Co-authored-by: Harshavardhana <harsha@minio.io>
- do not fail the healthcheck if heal status
was not obtained from one of the nodes,
if many nodes fail then report this as a
catastrophic error.
- add "x-minio-write-quorum" value to match
the write tolerance supported by server.
- admin info now states if a drive is healing
where madmin.Disk.Healing is set to true
and madmin.Disk.State is "ok"
Currently, cache purges are triggered as soon as the low watermark is exceeded.
To reduce IO this should only be done when reaching the high watermark.
This simplifies checks and reduces all calls for a GC to go through
`dcache.diskSpaceAvailable(size)`. While a comment claims that
`dcache.triggerGC <- struct{}{}` was non-blocking I don't see how
that was possible. Instead, we add a 1 size to the queue channel
and use channel semantics to avoid blocking when a GC has
already been requested.
`bytesToClear` now takes the high watermark into account to it will
not request any bytes to be cleared until that is reached.
This commit reduces the retry delay when retrying a request
to a KES server by:
- reducing the max. jitter delay from 3s to 1.5s
- skipping the random delay when there are more KES endpoints
available.
If there are more KES endpoints we can directly retry to the request
by sending it to the next endpoint - as pointed out by @krishnasrinivas
- delete-marker should be created on a suspended bucket as `null`
- delete-marker should delete any pre-existing `null` versioned
object and create an entry `null`
When checking parts we already do a stat for each part.
Since we have the on disk size check if it is at least what we expect.
When checking metadata check if metadata is 0 bytes.
The `getNonLoopBackIP` may grab an IP from an interface that
doesn't allow binding (on Windows), so this test consistently fails.
We exclude that specific error.
* readDirN: Check if file is directory
`syscall.FindNextFile` crashes if the handle is a file.
`errFileNotFound` matches 'unix' functionality: d19b434ffc/cmd/os-readdir_unix.go (L106)Fixes#10384
This commit addresses a maintenance / automation problem when MinIO-KES
is deployed on bare-metal. In orchestrated env. the orchestrator (K8S)
will make sure that `n` KES servers (IPs) are available via the same DNS
name. There it is sufficient to provide just one endpoint.
ListObjectsV1 requests are actually redirected to a specific node,
depending on the bucket name. The purpose of this behavior was
to optimize listing.
However, the current code sends a Bad Gateway error if the
target node is offline, which is a bad behavior because it means
that the list request will fail, although this is unnecessary since
we can still use the current node to list as well (the default behavior
without using proxying optimization)
Currently, you can see mint fails when there is one offline node, after
this PR, mint will always succeed.
We can reduce this further in the future, but this is a good
value to keep around. With the advent of continuous healing,
we can be assured that namespace will eventually be
consistent so we are okay to avoid the necessity to
a list across all drives on all sets.
Bonus Pop()'s in parallel seem to have the potential to
wait too on large drive setups and cause more slowness
instead of gaining any performance remove it for now.
Also, implement load balanced reply for local disks,
ensuring that local disks have an affinity for
- cleanupStaleMultipartUploads()
If DiskInfo calls failed the information returned was used anyway
resulting in no endpoint being set.
This would make the drive be attributed to the local system since
`disk.Endpoint == disk.DrivePath` in that case.
Instead, if the call fails record the endpoint and the error only.
bonus make sure to ignore objectNotFound, and versionNotFound
errors properly at all layers, since HealObjects() returns
objectNotFound error if the bucket or prefix is empty.
When crawling never use a disk we know is healing.
Most of the change involves keeping track of the original endpoint on xlStorage
and this also fixes DiskInfo.Endpoint never being populated.
Heal master will print `data-crawl: Disk "http://localhost:9001/data/mindev/data2/xl1" is
Healing, skipping` once on a cycle (no more often than every 5m).
We should only enforce quotas if no error has been returned.
firstErr is safe to access since all goroutines have exited at this point.
If `firstErr` hasn't been set by something else return the context error if cancelled.
Keep dataUpdateTracker while goroutine is starting.
This will ensure the object is updated one `start` returns
Tested with
```
λ go test -cpu=1,2,4,8 -test.run TestDataUpdateTracker -count=1000
PASS
ok github.com/minio/minio/cmd 8.913s
```
Fixes#10295
fresh drive setups when one of the drive is
a root drive, we should ignore such a root
drive and not proceed to format.
This PR handles this properly by marking
the disks which are root disk and they are
taken offline.
I have built a fuzz test and it crashes heavily in seconds and will OOM shortly after.
It seems like supporting Parquet is basically a completely open way to crash the
server if you can upload a file and run s3 select on it.
Until Parquet is more hardened it is DISABLED by default since hostile
crafted input can easily crash the server.
If you are in a controlled environment where it is safe to assume no hostile
content can be uploaded to your cluster you can safely enable Parquet.
To enable Parquet set the environment variable `MINIO_API_SELECT_PARQUET=on`
while starting the MinIO server.
Furthermore, we guard parquet by recover functions.
newDynamicTimeout should be allocated once, in-case
of temporary locks in config and IAM we should
have allocated timeout once before the `for loop`
This PR doesn't fix any issue as such, but provides
enough dynamism for the timeout as per expectation.
Based on our previous conversations I assume we should send the version
id when healing an object.
Maybe we should even list object versions and heal all?
In a non recursive mode, issuing a list request where prefix
is an existing object with a slash and delimiter is a slash will
return entries in the object directory (data dir IDs)
```
$ aws s3api --profile minioadmin --endpoint-url http://localhost:9000 \
list-objects-v2 --bucket testbucket --prefix code_of_conduct.md/ --delimiter '/'
{
"CommonPrefixes": [
{
"Prefix":
"code_of_conduct.md/ec750fe0-ea7e-4b87-bbec-1e32407e5e47/"
}
]
}
```
This commit adds a fast exit track in Walk() in this specific case.
use `/etc/hosts` instead of `/` to check for common
device id, if the device is same for `/etc/hosts`
and the --bind mount to detect root disks.
Bonus enhance healthcheck logging by adding maintenance
tags, for all messages.
adds a feature where we can fetch the MinIO
command-line remotely, this
is primarily meant to add some stateless
nature to the MinIO deployment in k8s
environments, MinIO operator would run a
webhook service endpoint
which can be used to fetch any environment
value in a generalized approach.
It is possible in situations when server was deployed
in asymmetric configuration in the past such as
```
minio server ~/fs{1...4}/disk{1...5}
```
Results in setDriveCount of 10 in older releases
but with fairly recent releases we have moved to
having server affinity which means that a set drive
count ascertained from above config will be now '4'
While the object layer make sure that we honor
`format.json` the storageClass configuration however
was by mistake was using the global value obtained
by heuristics. Which leads to prematurely using
lower parity without being requested by the an
administrator.
This PR fixes this behavior.
Bonus also fix a bug where we did not purge relevant
service accounts generated by rotating credentials
appropriately, service accounts should become invalid
as soon as its corresponding parent user becomes invalid.
Since service account themselves carry parent claim always
we would never reach this problem, as the access get
rejected at IAM policy layer.
- Add methods to set/remove replication rules (poornas)
- fix: only SSE-C headers should be applied to destination (Harshavardhana)
- fix: avoid data race by copying the buffer (Harshavardhana)
- remove deprecated build badges (Harshavardhana)
- fix: handle readFull bug with certain readers (Harshavardhana)
- fix a typo in README.md (Julien K)
- lifecycle: Fix marshaling expiration date/days (Anis Elleuch)
- add replication-status, expiration headers (Harshavardhana)
- Return object's version id in StatObject(Anis Elleuch)
- display appropriate funcName with nested callers (Harshavardhana)
- allow KMS tests to be run in the CI/CD (Harshavardhana)
- fix: removing lifecycle properly (Harshavardhana)
- feat: Add ListenNotification API to listen for all events (Harshavardhana)
when source and destination are same and versioning is enabled
on the destination bucket - we do not need to re-create the entire
object once again to optimize on space utilization.
Cases this PR is not supporting
- any pre-existing legacy object will not
be preserved in this manner, meaning a new
dataDir will be created.
- key-rotation and storage class changes
of course will never re-use the dataDir
conflicting files can exist on FS at
`.minio.sys/buckets/testbucket/policy.json/`, this is an
expected valid scenario for FS mode allow it to work,
i.e ignore and move forward
With reduced parity our write quorum should be same
as read quorum, but code was still assuming
```
readQuorum+1
```
In all situations which is not necessary.
Generalize replication target management so
that remote targets for a bucket can be
managed with ARNs. `mc admin bucket remote`
command will be used to manage targets.
Context timeout might race on each other when timeouts are lower
i.e when two lock attempts happened very quickly on the same resource
and the servers were yet trying to establish quorum.
This situation can lead to locks held which wouldn't be unlocked
and subsequent lock attempts would fail.
This would require a complete server restart. A potential of this
issue happening is when server is booting up and we are trying
to hold a 'transaction.lock' in quick bursts of timeout.
replace dummy buffer with nullReader{} instead,
to avoid large memory allocations in memory
constrainted environments. allows running
obd tests in such environments.
Currently, listing directories on HDFS incurs a per-entry remote Stat() call
penalty, the cost of which can really blow up on directories with many
entries (+1,000) especially when considered in addition to peripheral
calls (such as validation) and the fact that minio is an intermediary to the
client (whereas other clients listed below can query HDFS directly).
Because listing directories this way is expensive, the Golang HDFS library
provides the [`Client.Open()`] function which creates a [`FileReader`] that is
able to batch multiple calls together through the [`Readdir()`] function.
This is substantially more efficient for very large directories.
In one case we were witnessing about +20 seconds to list a directory with 1,500
entries, admittedly large, but the Java hdfs ls utility as well as the HDFS
library sample ls utility were much faster.
Hadoop HDFS DFS (4.02s):
λ ~/code/minio → use-readdir
» time hdfs dfs -ls /directory/with/1500/entries/
…
hdfs dfs -ls 5.81s user 0.49s system 156% cpu 4.020 total
Golang HDFS library (0.47s):
λ ~/code/hdfs → master
» time ./hdfs ls -lh /directory/with/1500/entries/
…
./hdfs ls -lh 0.13s user 0.14s system 56% cpu 0.478 total
mc and minio **without** optimization (16.96s):
λ ~/code/minio → master
» time mc ls myhdfs/directory/with/1500/entries/
…
./mc ls 0.22s user 0.29s system 3% cpu 16.968 total
mc and minio **with** optimization (0.40s):
λ ~/code/minio → use-readdir
» time mc ls myhdfs/directory/with/1500/entries/
…
./mc ls 0.13s user 0.28s system 102% cpu 0.403 total
[`Client.Open()`]: https://godoc.org/github.com/colinmarc/hdfs#Client.Open
[`FileReader`]: https://godoc.org/github.com/colinmarc/hdfs#FileReader
[`Readdir()`]: https://godoc.org/github.com/colinmarc/hdfs#FileReader.Readdir
If there are many listeners to bucket notifications or to the trace
subsystem, healing fails to work properly since it suspends itself when
the number of concurrent connections is above a certain threshold.
These connections are also continuous and not costly (*no disk access*),
it is okay to just ignore them in waitForLowHTTPReq().
this is to detect situations of corruption disk
format etc errors quickly and keep the disk online
in such scenarios for requests to fail appropriately.
String x might contain trimming spaces. And it needs to be trimmed. For
example, in csv files, there might be trimming spaces in a field that
ought to meet a query condition that contains the value without
trimming spaces. This applies to both intCast and floatCast functions.
Fixes two different types of problems
- continuation of the problem seen in FS #9992
as not fixed for erasure coded deployments,
reproduced this issue with spark and its fixed now
- another issue was leaking walk go-routines which
would lead to high memory usage and crash the system
this is simply because all the walks which were purged
at the top limit had leaking end walkers which would
consume memory endlessly.
closes#9966closes#10088
not all claims need to be present for
the JWT claim, let the policies not
exist and only apply which are present
when generating the credentials
once credentials are generated then
those policies should exist, otherwise
the request will fail.
- copyObject in-place decryption failed
due to incorrect verification of headers
- do not decode ETag when object is encrypted
with SSE-C, so that pre-conditions don't fail
prematurely.
This PR adds support for healing older
content i.e from 2yrs, 1yr. Also handles
other situations where our config was
not encrypted yet.
This PR also ensures that our Listing
is consistent and quorum friendly,
such that we don't list partial objects
In federated NAS gateway setups, multiple hosts in srvRecords
was picked at random which could mean that if one of the
host was down the request can indeed fail and if client
retries it would succeed. Instead allow server to figure
out the current online host quickly such that we can
exclude the host which is down.
At the max the attempt to look for a downed node is to
300 millisecond, if the node is taking longer to respond
than this value we simply ignore and move to the node,
total attempts are equal to number of srvRecords if no
server is online we simply fallback to last dialed host.
healing was not working properly when drives were
replaced, due to the error check in root disk
calculation this PR fixes this behavior
This PR also adds additional fix for missing
metadata entries from .minio.sys as part of
disk healing as well.
Added code to ignore and print more context
sensitive errors for better debugging.
This PR is continuation of fix in 7b14e9b660
Enforce bucket quotas when crawling has finished.
This ensures that we will not do quota enforcement on old data.
Additionally, delete less if we are closer to quota than we thought.
for users who don't have access to HDFS rootPath '/'
can optionally specify `minio gateway hdfs hdfs://namenode:8200/path`
for which they have access to, allowing all writes to be
performed at `/path`.
NOTE: once configured in this manner you need to make
sure command line is correctly specified, otherwise
your data might not be visible
closes#10011
this PR to allow legacy support for big-data
applications which run older Java versions
which do not support the secure ciphers
currently defaulted in MinIO. This option
allows optionally to turn them off such
that client and server can negotiate the
best ciphers themselves.
This env is purposefully not documented,
meant as a last resort when client
application cannot be changed easily.
- admin info node offline check is now quicker
- admin info now doesn't duplicate the code
across doing the same checks for disks
- rely on StorageInfo to return appropriate errors
instead of calling locally.
- diskID checks now return proper errors when
disk not found v/s format.json missing.
- add more disk states for more clarity on the
underlying disk errors.
while we handle all situations for writes and reads
on older format, what we didn't cater for properly
yet was delete where we only ended up deleting
just `xl.meta` - instead we should allow all the
deletes to go through for older format without
versioning enabled buckets.
This commit fixes a dependency resolution problem w.r.t. minio => etcd.
Building any project depending on minio (e.g. mc) currently fails with
latest master since the module replace directive is not honored for
transitive / indirect dependencies.
This commit fixes this by adding the etcd module directly instead of
using a module replace instruction.
CORS is notorious requires specific headers to be
handled appropriately in request and response,
using cors package as part of handlerFunc() for
options method lacks the necessary control this
package needs to add headers.
Without instantiating a new rest client we can
have a recursive error which can lead to
healthcheck returning always offline, this can
prematurely take the servers offline.
gorilla/mux broke their recent release 1.7.4 which we
upgraded to, we need the current workaround to ensure
that our regex matches appropriately.
An upstream PR is sent, we should remove the
workaround once we have a new release.
- reduce locker timeout for early transaction lock
for more eagerness to timeout
- reduce leader lock timeout to range from 30sec to 1minute
- add additional log message during bootstrap phase
Main issue is that `t.pool[params]` should be `t.pool[oldest]`.
We add a bit more safety features for the code.
* Make writes to the endTimerCh non-blocking in all cases
so multiple releases cannot lock up.
* Double check expectations.
* Shift down deletes with copy instead of truncating slice.
* Actually delete the oldest if we are above total limit.
* Actually delete the oldest found and not the current.
* Unexport the mutex so nobody from the outside can meddle with it.
This commit adds a new admin API for creating master keys.
An admin client can send a POST request to:
```
/minio/admin/v3/kms/key/create?key-id=<keyID>
```
The name / ID of the new key is specified as request
query parameter `key-id=<ID>`.
Creating new master keys requires KES - it does not work with
the native Vault KMS (deprecated) nor with a static master key
(deprecated).
Further, this commit removes the `UpdateKey` method from the `KMS`
interface. This method is not needed and not used anymore.
with the merge of https://github.com/etcd-io/etcd/pull/11823
etcd v3.5.0 will now have a properly imported versioned path
this fixes our pending migration to newer repo
Use a separate client for these calls that can take a long time.
Add request context to these so they are canceled when the client
disconnects instead except for ListObject which doesn't have any equivalent.
Healing an object which has multiple versions was not working because
the healing code forgot to consider errFileVersionNotFound error as a
use case that needs healing
Currently, lifecycle expiry is deleting all object versions which is not
correct, unless noncurrent versions field is specified.
Also, only delete the delete marker if it is the only version of the
given object.
- additionally upgrade to msgp@v1.1.2
- change StatModTime,StatSize fields as
simple Size/ModTime
- reduce 50000 entries per List batch to 10000
as client needs to wait too long to see the
first batch some times which is not desired
and it is worth we write the data as soon
as we have it.
object KMS is configured with auto-encryption,
there were issues when using docker registry -
this has been left unnoticed for a while.
This PR fixes an issue with compatibility.
Additionally also fix the continuation-token implementation
infinite loop issue which was missed as part of #9939
Also fix the heal token to be generated as a client
facing value instead of what is remembered by the
server, this allows for the server to be stateless
regarding the token's behavior.
When manual healing is triggered, one node in a cluster will
become the authority to heal. mc regularly sends new requests
to fetch the status of the ongoing healing process, but a load
balancer could land the healing request to a node that is not
doing the healing request.
This PR will redirect a request to the node based on the node
index found described as part of the client token. A similar
technique is also used to proxy ListObjectsV2 requests
by encoding this information in continuation-token
The S3 specification says that versions are ordered in the response of
list object versions.
mc snapshot needs this to know which version comes first especially when
two versions have the same exact last-modified field.
Readiness as no reasoning to be cluster scope
because that is not how the k8s networking works
for pods, all the pods to a deployment are not
sharing the network in a singleton. Instead they
are run as local scopes to themselves, with
readiness failures the pod is potentially taken
out of the network to be resolvable - this
affects the distributed setup in myriad of
different ways.
Instead readiness should behave like liveness
with local scope alone, and should be a dummy
implementation.
This PR all the startup times and overal k8s
startup time dramatically improves.
Added another handler called as `/minio/health/cluster`
to understand the cluster scope health.
Walk() functionality was missing on gateway
implementations leading to missing functionality
for the browser UI such as remove multiple objects,
download as zip file etc.
This PR brings a generic implementation across
all gateway's, it is not required to repeat the
same code in all gateway's
The default behavior is to cache each range requested
to cache drive. Add an environment variable
`MINIO_RANGE_CACHE` - when set to off, it disables
range caching and instead downloads entire object
in the background.
Fixes#9870
Bonus fix during versioning merge one of the PR was missing
the offline/online disk count fix from #9801 port it correctly
over to the master branch from release.
Additionally, add versionID support for MRF
Fixes#9910Fixes#9931
Users having endpoints with this format http://url:80 or http://url:443
will face signature mismatch error.
The reason is that S3 spec ignores :80 or :443 port in the endpoint
when calculating the signature, so this PR will just strip them.
This PR has the following changes
- Removing duplicate lookupConfigs() calls.
- Deprecate admin config APIs for NAS gateways. This will avoid repeated reloads of the config from the disk.
- WatchConfigNASDisk will be removed
- Migration guide for NAS gateways users to migrate to ENV settings.
NOTE: THIS PR HAS A BREAKING CHANGE
Fixes#9875
Co-authored-by: Harshavardhana <harsha@minio.io>
* Just read files from args (more than 1 now supported)
* Pretty print by default. `-ndjson` will disable.
* Check header.
* Support stdin as '-'
* Don't just ignore errors.
Looking into full disk errors on zoned setup. We don't take the
5% space requirement into account when selecting a zone.
The interesting part is that even considering this we don't
know the size of the object the user wants to upload when
they do multipart uploads.
It seems quite defensive to always upload multiparts to
the zone where there is the most space since all load will
be directed to a part of the cluster.
In these cases we make sure it can at least hold a 1GiB file
and we disadvantage fuller zones more by subtracting the
expected size before weighing.
- x-amz-storage-class specified CopyObject
should proceed regardless, its not a precondition
- sourceVersionID is specified CopyObject should
proceed regardless, its not a precondition
This PR fixes all the below scenarios
and handles them correctly.
- existing data/bucket is replaced with
new content, no versioning enabled old
structure vanishes.
- existing data/bucket - enable versioning
before uploading any data, once versioning
enabled upload new content, old content
is preserved.
- suspend versioning on the bucket again, now
upload content again the old content is purged
since that is the default "null" version.
Additionally sync data after xl.json -> xl.meta
rename(), to avoid any surprises if there is a
crash during this rename operation.
- Add changes to ensure remote disks are not
incorrectly taken online if their order has
changed or are incorrect disks.
- Bring changes to peer to detect disconnection
with separate Health handler, to avoid a
rather expensive call GetLocakDiskIDs()
- Follow up on the same changes for Lockers
as well
Just like GET/DELETE APIs it is possible to preserve
client supplied versionId's, of course the versionIds
have to be uuid, if an existing versionId is found
it is overwritten if no object locking policies
are found.
- PUT /bucketname/objectname?versionId=<id>
- POST /bucketname/objectname?uploads=&versionId=<id>
- PUT /bucketname/objectname?verisonId=<id> (with x-amz-copy-source)
Fixes potentially infinite allocations, especially in FS mode,
since lookups live up to 30 minutes. Limit walk pool sizes to 50
max parameter entries and 4 concurrent operations with the same
parameters.
Fixes#9835
PutObject on multiple-zone with versioning would not
overwrite the correct location of the object if the
object has delete marker, leading to duplicate objects
on two zones.
This PR fixes by adding affinity towards delete marker
when GetObjectInfo() returns error, use the zone index
which has the delete marker.
Bonus change to use channel to serialize triggers,
instead of using atomic variables. More efficient
mechanism for synchronization.
Co-authored-by: Nitish Tiwari <nitish@minio.io>
In the Current bug we were re-using the context
from previously granted lockers, this would
lead to lock timeouts for existing valid
read or write locks, leading to premature
timeout of locks.
This bug affects only local lockers in FS
or standalone erasure coded mode. This issue
is rather historical as well and was present
in lsync for some time but we were lucky to
not see it.
Similar changes are done in dsync as well
to keep the code more familiar
Fixes#9827
When updating all servers following the constructions of mc update,
only the endpoint server will be updated successfully.
All the other peer servers' updating failed due to the error below:
--------------------------------------------------------------------------
parsing time "2006-01-02T15:04:05Z07:00" as "<release version>": cannot parse "-01-02T15:04:05Z07:00" as "0-"
--------------------------------------------------------------------------
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
Historically due to lack of support for middlewares
we ended up writing wrapped handlers for all
middlewares on top of the gorilla/mux, this causes
multiple issues when we want to let's say
- Overload r.Body with some custom implementation
to track the incoming Reads()
- Add other sort of top level checks to avoid
DDOSing the server with large incoming HTTP
bodies.
Since 1.7.x release gorilla/mux provides proper
use of middlewares, which are honored by the muxer
directly. This makes sure that Go can honor its
own internal ServeHTTP(w, r) implementation where
Go net/http can wrap into its own customer readers.
This PR as a side-affect fixes rare issues of client
hangs which were reported in the wild but never really
understood or fixed in our codebase.
Fixes#9759Fixes#7266Fixes#6540Fixes#5455Fixes#5150
Refer https://github.com/boto/botocore/pull/1328 for
one variation of the same issue in #9759
PR #9801 while it is correct, the loop isEndpointConnected()
was changed to rely on endpoint.String() which has the host
information as well, which is not correct value as input to
detect if the disk is down or up, if endpoint is local use
its local path value instead.
This commit changes the data key generation such that
if a MinIO server/nodes tries to generate a new DEK
but the particular master key does not exist - then
MinIO asks KES to create a new master key and then
requests the DEK again.
From now on, a SSE-S3 master key must not be created
explicitly via: `kes key create <key-name>`.
Instead, it is sufficient to just set the env. var.
```
export MINIO_KMS_KES_KEY_NAME=<key-name>
```
However, the MinIO identity (mTLS client certificate)
must have the permission to access the `/v1/key/create/`
API. Therefore, KES policy for MinIO must look similar to:
```
[
/v1/key/create/<key-name-pattern>
/v1/key/generate/<key-name-pattern>
/v1/key/decrypt/<key-name-pattern>
]
```
However, in our guides we already suggest that.
See e.g.: https://github.com/minio/kes/wiki/MinIO-Object-Storage#kes-server-setup
***
The ability to create master keys on request may also be
necessary / useful in case of SSE-KMS.
Current code was relying on globalEndpoints as
the source of secondary truth to obtain
the missing endpoints list when the disk
is offline, this is problematic
- there is no way to know if the getDisks()
returned endpoints total is same as the
ones list of globalEndpoints and it
belongs to a particular set.
- there is no order guarantee as getDisks()
is ordered as per format.json, globalEndpoints
may not be, so potentially end up including
incorrect endpoints.
To fix this bring getEndpoints() just like getDisks()
to ensure that consistently ordered endpoints are
always available for us to ensure that returned values
are consistent with what each erasure set would observe.
Uploading files with names that could not be written to disk
would result in "reduce your request" errors returned.
Instead check explicitly for disallowed characters and reject
files with `Object name contains unsupported characters.`
At a customer setup with lots of concurrent calls
it can be observed that in newRetryTimer there
were lots of tiny alloations which are not
relinquished upon retries, in this codepath
we were only interested in re-using the timer
and use it wisely for each locker.
```
(pprof) top
Showing nodes accounting for 8.68TB, 97.02% of 8.95TB total
Dropped 1198 nodes (cum <= 0.04TB)
Showing top 10 nodes out of 79
flat flat% sum% cum cum%
5.95TB 66.50% 66.50% 5.95TB 66.50% time.NewTimer
1.16TB 13.02% 79.51% 1.16TB 13.02% github.com/ncw/directio.AlignedBlock
0.67TB 7.53% 87.04% 0.70TB 7.78% github.com/minio/minio/cmd.xlObjects.putObject
0.21TB 2.36% 89.40% 0.21TB 2.36% github.com/minio/minio/cmd.(*posix).Walk
0.19TB 2.08% 91.49% 0.27TB 2.99% os.statNolog
0.14TB 1.59% 93.08% 0.14TB 1.60% os.(*File).readdirnames
0.10TB 1.09% 94.17% 0.11TB 1.25% github.com/minio/minio/cmd.readDirN
0.10TB 1.07% 95.23% 0.10TB 1.07% syscall.ByteSliceFromString
0.09TB 1.03% 96.27% 0.09TB 1.03% strings.(*Builder).grow
0.07TB 0.75% 97.02% 0.07TB 0.75% path.(*lazybuf).append
```
For example `{1...17}/{1...52}` symmetrical
distribution of drives cannot be obtained
- Because 17 is a prime number
- Is not divisible by any pre-defined setCounts i.e
from 1 to 16
Manual healing (as background healing) creates a heal task with a
possiblity to override healing options, such as deep or normal mode.
Use a pointer type in heal opts so nil would mean use the default
healing options.
aws cli fails to set a bucket encryption configuration to MinIO server.
The reason is that aws cli does not send MD5-Content header. It seems
that MD5-Content is not required anymore.
This commit also returns Not Implemented header early to help mint tests
to ignore testing this API in gateway modes.
CopyObject was not correctly figuring out the correct
destination object location and would end up creating
duplicate objects on two different zones, reproduced
by doing encryption based key rotation.
Advantages avoids 100's of stats which are needed for each
upload operation in FS/NAS gateway mode when uploading a large
multipart object, dramatically increases performance for
multipart uploads by avoiding recursive calls.
For other gateway's simplifies the approach since
azure, gcs, hdfs gateway's don't capture any specific
metadata during upload which needs handler validation
for encryption/compression.
Erasure coding was already optimized, additionally
just avoids small allocations of large data structure.
Fixes#7206
GetDiskID() in storage rest client does not really issue a REST request
to the remote disk, but returns an in-memory value instead.
However, GetDiskID() should return an error when format.json is not
found or for other similar issues (unmounted disks, etc..)
GetDiskID() is only called when formatting disks and getting storage
informatio, hence this commit should not have a performance degradation.
Additionally also fix STS logs to filter out LDAP
password to be sent out in audit logs.
Bonus fix handle the reload of users properly by
making sure to preserve the newer users during the
reload to be not invalidated.
Fixes#9707Fixes#9644Fixes#9651
Bonus fixes in quota enforcement to use the
new datastructure and use timedValue to cache
a value/reload automatically avoids one less
global variable.
If the requested server is part of the set this will always read
from the local disk, even if the disk contains a parity shard.
In default setup there is a 50% chance that at least
one shard that otherwise would have been fetched remotely
will be read locally instead.
It basically trades RPC call overhead for reed-solomon.
On distributed localhost this seems to be fairly break-even,
with a very small gain in throughput and latency.
However on networked servers this should be a bigger
1MB objects, before:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 76257:
* Avg: 25ms 50%: 24ms 90%: 32ms 99%: 42ms Fastest: 7ms Slowest: 67ms
* First Byte: Average: 23ms, Median: 22ms, Best: 5ms, Worst: 65ms
Throughput:
* Average: 1213.68 MiB/s, 1272.63 obj/s (59.948s, starting 14:45:44 CEST)
```
After:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 78845:
* Avg: 24ms 50%: 24ms 90%: 31ms 99%: 39ms Fastest: 8ms Slowest: 62ms
* First Byte: Average: 22ms, Median: 21ms, Best: 6ms, Worst: 57ms
Throughput:
* Average: 1255.11 MiB/s, 1316.08 obj/s (59.938s, starting 14:43:58 CEST)
```
Bonus fix: Only ask for heal once on an object.
This value is requested on every upload when there are multiple zones.
Since this will result in an RPC call to every remote disk this scales
quite badly in a distributed setup. Load every 1second interval.
2 servers, localhost only. In large distributed setups much bigger
gains can be expected.
```
Operations: 21743 -> 22454
* Average: +3.28% (+0.0 MiB/s) throughput, +3.28% (+11.9) obj/s
* Fastest: +3.37% (+0.0 MiB/s) throughput, +3.37% (+13.0) obj/s
* 50% Median: +3.03% (+0.0 MiB/s) throughput, +3.03% (+11.2) obj/s
* Slowest: +8.03% (+0.0 MiB/s) throughput, +8.03% (+22.8) obj/s
```
For easy management of this a generic helper has been added.
The documentation states that `nVolumeNameSize` and `nFileSystemNameSize` are:
> The length of a volume name buffer, in TCHARs. The maximum buffer size is MAX_PATH+1.
It seems like we allocated too little for them before, so expand it to 260 wchars.
some clients such as veeam expect the x-amz-meta to
be sent in lower cased form, while this does indeed
defeats the HTTP protocol contract it is harder to
change these applications, while these applications
get fixed appropriately in future.
x-amz-meta is usually sent in lowercased form
by AWS S3 and some applications like veeam
incorrectly end up relying on the case sensitivity
of the HTTP headers.
Bonus fixes
- Fix the iso8601 time format to keep it same as
AWS S3 response
- Increase maxObjectList to 50,000 and use
maxDeleteList as 10,000 whenever multi-object
deletes are needed.
No one really uses FS for large scale accounting
usage, neither we crawl in NAS gateway mode. It is
worthwhile to simply disable this feature as its
not useful for anyone.
Bonus disable bucket quota ops as well in, FS
and gateway mode
size calculation in crawler was using the real size
of the object instead of its actual size i.e either
a decrypted or uncompressed size.
this is needed to make sure all other accounting
such as bucket quota and mcs UI to display the
correct values.
This PR adds a new configuration parameter which allows readiness
check to respond within 10secs, this can be reduced to a lower value
if necessary using
```
mc admin config set api ready_deadline=5s
```
or
```
export MINIO_API_READY_DEADLINE=5s
```
net/http exposes ErrorLog but it is log.Logger
instance not an interface which can be overridden,
because of this reason the logging is interleaved
sometimes with TLS with messages like this on the
server
```
http: TLS handshake error from 139.178.70.188:63760: EOF
```
This is bit problematic for us as we need to have
consistent logging view for allow --json or --quiet
flags.
With this PR we ensure that this format is adhered to.
When running a zoned setup simply uploading will run
the system out of memory very fast.
Root cause: nFileSystemNameSize is a DWORD and
not a pointer. No idea how this didn't crash hard.
Furthermore replace poor mans utf16 -> string
conversion to support arbitrary output.
Fixes#9630
Pass '--target core' to docker build in order to only test core.
Pass '--target frontend' to docker build in order to only test frontend.
Pass '--target gateway' to docker build in order to only test gateway.
Numerical references might be supported in the future: https://github.com/moby/buildkit/issues/1503
This commit fixes a layout issue w.r.t. the KMS
Quickstart guide. The problem seems to be caused
by docs server not converting the markdown into html
as expected.
This commit fixes this by converting the ordered list
into subsections.
Groups information shall be now stored as part of the
credential data structure, this is a more idiomatic
way to support large LDAP groups.
Avoids the complication of setups where LDAP groups
can be in the range of 150+ which may lead to excess
HTTP header size > 8KiB, to reduce such an occurrence
we shall save the group information on the server as
part of the credential data structure.
Bonus change support multiple mapped policies, across
all types of users.
This PR is a continuation from #9586, now the
entire parsing logic is fully merged into
bucket metadata sub-system, simplify the
quota API further by reducing the remove
quota handler implementation.
This commit simplifies the KMS configuration guide by
adding a get started section that uses our KES play instance
at `https://play.min.io:7373`.
Further, it removes sections that we don't recommend for production
anyways (MASTER_KEY).
Shuffling arguments that we pass to MinIO server are supported. However,
when that happens, Prometheus returns wrong information about disks usage
and online/offline status.
The commit fixes the issue by avoiding relying on xl.endpoints since
it is not ordered.
this is a major overhaul by migrating off all
bucket metadata related configs into a single
object '.metadata.bin' this allows us for faster
bootups across 1000's of buckets and as well
as keeps the code simple enough for future
work and additions.
Additionally also fixes#9396, #9394
To avoid this issue with refCounter refactor the code
such that
- locker() always increases refCount upon success
- unlocker() always decrements refCount upon success
(as a special case removes the resource if the
refCount is zero)
By these two assumptions we are able to see that we
are never granted two write lockers in any situation.
Thanks to @vcabbage for writing a nice reproducer.
enable linter using golangci-lint across
codebase to run a bunch of linters together,
we shall enable new linters as we fix more
things the codebase.
This PR fixes the first stage of this
cleanup.
There is a disparency of behavior under Linux & Windows about
the returned error when trying to rename a non existant path.
err := os.Rename("/path/does/not/exist", "/tmp/copy")
Linux:
isSysErrNotDir(err) = false
os.IsNotExist(err) = true
Windows:
isSysErrNotDir(err) = true
os.IsNotExist(err) = true
ENOTDIR in Linux is returned when the destination path
of the rename call contains a file in one of the middle
segments of the path (e.g. /tmp/file/dst, where /tmp/file
is an actual file not a directory)
However, as shown above, Windows has more scenarios when
it returns ENOTDIR. For example, when the source path contains
an inexistant directory in its path.
In that case, we want errFileNotFound returned and not
errFileAccessDenied, so this commit will add a further check to close
the disparency between Windows & Linux.
This commit updates the two client env. variables:
```
KES_CLIENT_TLS_KEY_FILE
KES_CLIENT_TLS_CERT_FILE
```
The KES CLI client expects the client key and certificate
as `KES_CLIENT_KEY` resp. `KES_CLIENT_CERT`.
The `ioutil.NopCloser(reader)` was hiding nested hash readers.
We make it an `io.Closer` so it can be attached without wrapping
and allows for nesting, by merging the requests.
test is failing in Azure gateway as a mapping from
x-amz-storage-class to Azure storage tier is failing
and there is no way to not send storage class in s3cmd.
The `keepHTTPResponseAlive` would cause errors to be
returned with status OK.
- Add '32' as a filler byte until a response is ready
- '0' to indicate the response is ready to be consumed
- '1' to indicate response has an error which needs
to be returned to the caller
Clear out 'file not found' errors from dir walker, since it may be
in a folder that has been deleted since it was scanned.
s3:HardwareInfo was removed recently. Users having that admin action
stored in the backend will have an issue starting the server.
To fix this, we need to avoid returning an error in Marshal/Unmarshal
when they encounter an invalid action and validate only in specific
location.
Currently the validation is done and in ParseConfig().
This PR is to ensure that we call the relevant object
layer APIs for necessary S3 API level functionalities
allowing gateway implementations to return proper
errors as NotImplemented{}
This allows for all our tests in mint to behave
appropriately and can be handled appropriately as
well.
S3 is now natively supported by B2 cloud storage provider
there is no reason to use specialized gateway for B2 anymore,
our current S3 gateway with caching would work with B2.
Resolves#8584
requests in federated setups for STS type calls which are
performed at '/' resource should be routed by the muxer,
the assumption is simply such that requests without a bucket
in a federated setup cannot be proxied, so serve them at
current server.
This commit makes the KES client use HTTP/2
when establishing a connection to the KES server.
This is necessary since the next KES server release
will require HTTP/2.
We should allow quorum errors to be send upwards
such that caller can retry while reading bucket
encryption/policy configs when server is starting
up, this allows distributed setups to load the
configuration properly.
Current code didn't facilitate this and would have
never loaded the actual configs during rolling,
server restarts.
This commit updates the KMS guide to reflect the
latest changes in KES. Based on internal design
meetings we made some adjustments to the overall
KES configuration.
This commit ensures that the KMS guide contains
a working KES demo-setup with Vault.
In large setups this avoids unnecessary data transfer
across nodes and potential locks.
This PR also optimizes heal result channel, which should
be avoided for each queueHealTask as its expensive
to create/close channels for large number of objects.
This PR allows setting a "hard" or "fifo" quota
restriction at the bucket level. Buckets that
have reached the FIFO quota configured, will
automatically be cleaned up in FIFO manner until
bucket usage drops to configured quota.
If a bucket is configured with a "hard" quota
ceiling, all further writes are disallowed.
ResponseWriter & RecordAPIStats has similar role, merge them.
This commit will also fix wrong auditing for STS and Web and others
since they are using ResponseWriter instead of the RecordAPIStats.
A user can incorrectly mounts a newly fresh disk. MinIO will detect
that it is writing with a rootfs disk and will mark it down. However,
it is hard for the user to understand what's going on.
This commit will just print a notice so it will be easy to spot
such use case.
- elasticsearch client should rely on the SDK helpers
instead of pure HTTP calls.
- webhook shouldn't need to check for IsActive() for
all notifications, failure should be delayed.
- Remove DialHTTP as its never used properly
Fixes#9460
allow generating service accounts for temporary credentials
which have a designated parent, currently OpenID is not yet
supported.
added checks to ensure that service account cannot generate
further service accounts for itself, service accounts can
never be a parent to any credential.
Audit was not working properly when enabled from the environment
caused by a typo in the code.
This commit fixes that but also consider the following variables:
`MINIO_LOGGER_WEBHOOK_ENABLE_*` and
`MINIO_AUDIT_WEBHOOK_ENABLE_*` so the user can use
this latter to temporarily disable a logger or audit configuration.
data usage tracker and crawler seem to be logging
non-actionable information on console, which is not
useful and is fixed on its own in almost all deployments,
lets keep this logging to minimal.
it is possible in many screnarios that even
if the divisible value is optimal, we may
end up with uneven distribution due to number
of nodes present in the configuration.
added code allow for affinity towards various
ellipses to figure out optimal value across
ellipses such that we can always reach a
symmetric value automatically.
Fixes#9416
By monitoring PUT/DELETE and heal operations it is possible
to track changed paths and keep a bloom filter for this data.
This can help prioritize paths to scan. The bloom filter can identify
paths that have not changed, and the few collisions will only result
in a marginal extra workload. This can be implemented on either a
bucket+(1 prefix level) with reasonable performance.
The bloom filter is set to have a false positive rate at 1% at 1M
entries. A bloom table of this size is about ~2500 bytes when serialized.
To not force a full scan of all paths that have changed cycle bloom
filters would need to be kept, so we guarantee that dirty paths have
been scanned within cycle runs. Until cycle bloom filters have been
collected all paths are considered dirty.
this commit avoids lots of tiny allocations, repeated
channel creates which are performed when filtering
the incoming events, unescaping a key just for matching.
also remove deprecated code which is not needed
anymore, avoids unexpected data structure transformations
from the map to slice.
we have policy available for sub-admin users to set/get/delete
config, but we incorrectly decrypt the content using admin secret
key which in-fact should be the credential authenticating the
request.
global WORM mode is a complex piece for which
the time has passed, with the advent of S3 compatible
object locking and retention implementation global
WORM is sort of deprecated, this has been mentioned
in our documentation for some time, now the time
has come for this to go.
re-implement the cache purging routine to
avoid using ioutil.ReadDir which can lead
to high allocations when there are cache
directories with lots of content, or
when cache is installed in memory constrainted
environments.
Instead rely on a callback function where we
are not using memory no-more than 8KiB per
cycle.
Precursor for this change refer #9425, original
issue pointed by Caleb Case <caleb@storj.io>
OSS go sdk lacks licensing terms in their
repository, and there has been no activity
On the issue here https://github.com/aliyun/aliyun-oss-go-sdk/issues/245
This PR is to ensure we remove any dependency code which
lacks explicit license file in their repo.
- keep long running obd network tests alive
- fix error - wrong number of parents in process OBD info
- ensure that osinfo does not error out when inside containers
- remove limit on max number of connections per client transport
The generic client transport uses a default limit of 64 conns per transport.
This could end up limiting and throttling usage, and artificially slowing
down the performance of MinIO even on hardware capable of doing better.
New value defaults to 100K events by default,
but users can tune this value upto any value
they seem necessary.
* increase the limit to maxint64 while validating
This PR also fixes issues when
deletePolicy, deleteUser is idempotent so can lead to
issues when client can prematurely timeout, so a retry
call error response should be ignored when call returns
http.StatusNotFound
Fixes#9347
allow `mc admin config set mygateway/ audit_webhook --env`
to fetch the documentation as needed, this is just to
ensure that our users can still access the relevant
ENV docs while running in gateway mode.
Some tests take a long time on CI:
* `--- PASS: TestRWMutex (226.49s)`
* ` --- PASS: TestRWMutex (7.13s)`
Reduce the number of runs.
Before/after locally:
```
--- PASS: TestRWMutex (20.95s)
--- PASS: TestRWMutex (7.13s)
--- PASS: TestMutex (3.01s)
--- PASS: TestMutex (1.65s)
```
Instead of GlobalContext use a local context for tests.
Most notably this allows stuff created to be shut down
when tests using it is done. After PR #9345 9331 CI is
often running out of memory/time.
Add two new configuration entries, api.requests-max and
api.requests-deadline which have the same role of
MINIO_API_REQUESTS_MAX and MINIO_API_REQUESTS_DEADLINE.
This PR fixes couple of behaviors with service accounts
- not need to have session token for service accounts
- service accounts can be generated by any user for themselves
implicitly, with a valid signature.
- policy input for AddNewServiceAccount API is not fully typed
allowing for validation before it is sent to the server.
- also bring in additional context for admin API errors if any
when replying back to client.
- deprecate GetServiceAccount API as we do not need to reply
back session tokens
- Introduced a function `FetchRegisteredTargets` which will return
a complete set of registered targets irrespective to their states,
if the `returnOnTargetError` flag is set to `False`
- Refactor NewTarget functions to return non-nil targets
- Refactor GetARNList() to return a complete list of configured targets
Picking up all support files from the builder image has the advantage
that the Dockerfile is now fully selfcontained and can be also
run just standalone.
This allows also cross-compilation and pushing with the proper manifests
with Docker Buildkit:
```
docker buildx create --name xbuilder
docker buildx use xbuilder
docker buildx build -f Dockerfile.minio --platform linux/arm/v7,linux/amd64 --progress plain --push -t minio/minio .
```
which also has the advantage that the Dockerfile is the same
for all platforms.
Co-authored-by: Harshavardhana <harsha@minio.io>
- Removes PerfInfo admin API as its not OBDInfo
- Keep the drive path without the metaBucket in OBD
global latency map.
- Remove all the unused code related to PerfInfo API
- Do not redefined global mib,gib constants use
humanize.MiByte and humanize.GiByte instead always
if needed use --no-compat to disable md5sum while
verifying any performance numbers.
bring back --compat behavior as default to avoid
additional documentation and confusing behavior,
as we are working towards improving md5sum to
be faster on AVX instructions, enabling this
should be hardly a problem in future versions
of MinIO.
fixes#8012fixes#7859fixes#7642
Continuing from previous PR #9304, comment
is a special key is not present in the
default KV list. Add it explicitly when
tokenizing fields as it may be possible that
some clients might try to set comments.
This PR adds context-based `k=v` splits based
on the sub-system which was obtained, if the
keys are not provided an error will be thrown
during parsing, if keys are provided with wrong
values an error will be thrown. Keys can now
have values which are of a much more complex
form such as `k="v=v"` or `k=" v = v"`
and other variations.
additionally, deprecate unnecessary postgres/mysql
configuration styles, support only
- connection_string for Postgres
- dsn_string for MySQL
All other parameters are removed.
This commit fixes a performance issue caused
by too many calls to the external KMS - i.e.
for single-part PUT requests.
In general, the issue is caused by a sub-optimal
code structure. In particular, when the server
encrypts an object it requests a new data encryption
key from the KMS. With this key it does some key
derivation and encrypts the object content and
ETag.
However, to behave S3-compatible the MinIO server
has to return the plaintext ETag to the client
in case SSE-S3.
Therefore, the server code used to decrypt the
(previously encrypted) ETag again by requesting
the data encryption key (KMS decrypt API) from
the KMS.
This leads to 2 KMS API calls (1 generate key and
1 decrypt key) per PUT operation - while only
one KMS call is necessary.
This commit fixes this by fetching a data key only
once from the KMS and keeping the derived object
encryption key around (for the lifetime of the request).
This leads to a significant performance improvement
w.r.t. to PUT workloads:
```
Operation: PUT
Operations: 161 -> 239
Duration: 28s -> 29s
* Average: +47.56% (+25.8 MiB/s) throughput, +47.56% (+2.6) obj/s
* Fastest: +55.49% (+34.5 MiB/s) throughput, +55.49% (+3.5) obj/s
* 50% Median: +58.24% (+32.8 MiB/s) throughput, +58.24% (+3.3) obj/s
* Slowest: +1.83% (+0.6 MiB/s) throughput, +1.83% (+0.1) obj/s
```
Fixes#8667
In addition to the above, if the user is mapped to a policy or
belongs in a group, the user-info API returns this information,
but otherwise, the API will now return a non-existent user error.
make rest of the Walk() function more predictable,
it was observed that in nominal deployments even
without much workload the drives are generally
slow for respond for readdir operations, for the
sleepDuration factor of 10 this can cause
unexpected slowness in the Listing calls, while
it is good for all other I/O, it may simply slow
down Listing immensely which is not useful.
fixes#9261
In FS mode under Windows, removing an object will not automatically.
remove parent empty prefixes.
The reason is that path.Dir() was used, however filepath.Dir() is
more appropriate since filepath is physical (meaning it operates
on OS filesystem paths)
This is not caught because failure for Windows CI is not caught.
fs-v1 in server mode only checks to see if the path exist, so that it
returns ready before it is indeed ready.
This change adds a check to ensure that the global object api is
available too before reporting ready.
Fixes#9283
It is some times common and convenient to use
just local IPs for testing purposes, 127.0.0.x
are special IPs regardless of being available on
an interface they can be bound to on all operating
systems.
Allow this behavior to work for minio server
fixes#9274
also, bring in an additional policy to ensure that
force delete bucket is only allowed with the right
policy for the user, just DeleteBucketAction
policy action is not enough.
This PR also tries to simplify the approach taken in
object-locking implementation by preferential treatment
given towards full validation.
This in-turn has fixed couple of bugs related to
how policy should have been honored when ByPassGovernance
is provided.
Simplifies code a bit, but also duplicates code intentionally
for clarity due to complex nature of object locking
implementation.
This commit modifies csv parser, a fork of golang csv
parser to support a custom quote escape character.
The quote escape character is used to escape the quote
character when a csv field contains a quote character
as part of data.
Use the *credentials.Credentials implementation method *Get*
```
func (c *Credentials) Get() (Value, error) {
```
which also handles auto-refresh, this allows for chaining
of various implementations together if necessary or simply
initialize with credentials.NewStaticV4(access, secret, token)
Co-authored-by: Klaus Post <klauspost@gmail.com>
Too many deployments come up with an odd number
of hosts or drives, to facilitate even distribution
among those setups allow for odd and prime numbers
based packs.
- B2 does actually return an MD5 hash for newly uploaded objects
so we can use it to provide better compatibility with S3 client
libraries that assume the ETag is the MD5 hash such as boto.
- depends on change in blazer library.
- new behaviour is only enabled if MinIO's --compat mode is active.
- behaviour for multipart uploads is unchanged (works fine as is).
- Implement a graph algorithm to test network bandwidth from every
node to every other node
- Saturate any network bandwidth adaptively, accounting for slow
and fast network capacity
- Implement parallel drive OBD tests
- Implement a paging mechanism for OBD test to provide periodic updates to client
- Implement Sys, Process, Host, Mem OBD Infos
- total number of S3 API calls per server
- maximum wait duration for any S3 API call
This implementation is primarily meant for situations
where HDDs are not capable enough to handle the incoming
workload and there is no way to throttle the client.
This feature allows MinIO server to throttle itself
such that we do not overwhelm the HDDs.
- acquire since leader lock for all background operations
- healing, crawling and applying lifecycle policies.
- simplify lifecyle to avoid network calls, which was a
bug in implementation - we should hold a leader and
do everything from there, we have access to entire
name space.
- make listing, walking not interfere by slowing itself
down like the crawler.
- effectively use global context everywhere to ensure
proper shutdown, in cache, lifecycle, healing
- don't read `format.json` for prometheus metrics in
StorageInfo() call.
- Add conservative timeouts upto 3 minutes
for internode communication
- Add aggressive timeouts of 30 seconds
for gateway communication
Fixes#9105Fixes#8732Fixes#8881Fixes#8376Fixes#9028
- extract userTags from Get/Head request (#1249)
- fix: Context cancellation not handled (#1250)
- Check for correct http status in remove object tagging (#1248)
- simplify extracting metadata in Head/Get object (#1245)
- fix: close and remove .minio.part file on errors (#1243)
NAS gateway creates non-multipart-uploads with mode 0666.
But multipart-uploads are created with a differing mode of 0644.
Both modes should be equal! Else it leads to files with different
permissions based on its file-size. This patch solves that by
using 0666 for both cases.
This is to improve responsiveness for all
admin API operations and allowing callers
to cancel any on-going admin operations,
if they happen to be waiting too long.
canonicalize the ENVs such that we can bring these ENVs
as part of the config values, as a subsequent change.
- fix location of per bucket usage to `.minio.sys/buckets/<bucket_name>/usage-cache.bin`
- fix location of the overall usage in `json` at `.minio.sys/buckets/.usage.json`
(avoid conflicts with a bucket named `usage.json` )
- fix location of the overall usage in `msgp` at `.minio.sys/buckets/.usage.bin`
(avoid conflicts with a bucket named `usage.bin`
As an optimization of the healing, HealObjects() avoid sending an
object to the background healing subsystem when the object is
present in all disks.
However, HealObjects() should have checked the scan type, if this
deep, always pass the object to the healing subsystem.
Currently, a tree walking, needed to a list objects in a specific
set quits listing as long as it finds no entries in a disk, which
is wrong.
This affected background healing, because the latter is using
tree walk directly. If one object does not exist in the first
disk for example, it will be seemed like the object does not
exist at all and no healing work is needed.
This commit fixes the behavior.
The staleness of a lock should be determined by
the quorum number of entries returning stale,
this allows for situations when locks are held
when nodes are down - we don't accidentally
clear locks unintentionally when they are valid
and correct.
Also lock maintenance should be run by all servers,
not one server, stale locks need to be run outside
the requirement for holding distributed locks.
Thanks @klauspost for reproducing this issue
Some AWS SDKs latently rely on this value some times
to calculate the right number of parts during a parallel
GetObject request, this is feature used along with
content-range - we should support this as well.
This commit fixes the env. variable in the
KMS guide used to specify the CA certificates
for the KES server.
Before the env. variable `MINIO_KMS_KES_CAPATH` has
been used - which works in non-containerized environments
due to how MinIO merges the config file and environment
variables. In containerized environments (e.g. docker)
this does not work and trying to specify `MINIO_KMS_KES_CAPATH`
instead of `MINIO_KMS_KES_CA_PATH` eventually leads to MinIO not
trusting the certificate presented by the kes server.
See: cfd12914e1/cmd/crypto/config.go (L186)
- avoid setting last heal activity when starting self-healing
This can be confusing to users thinking that the self healing
cycle was already performed.
- add info about the next background healing round
OperationTimedout error occurs when locking
timesout, trying to acquire a lock. This
error should be returned appropriately to
the client with http status "408" (request timedout)
This translation was broken, fix it.
Bulk delete API was using cleanupObjectsBulk() which calls posix
listing and delete API to remove objects internal files in the
backend (xl.json and parts) one by one.
Add DeletePrefixes in the storage API to remove the content
of a directory in a single call.
Also use a remove goroutine for each disk to accelerate removal.
Currently the code assumed some orthogonal requirements
which led situations where when we have a setup where
we have let's say for example 168 drives, the final
set_drive_count chosen was 14. Indeed 168 drives are
divisible by 12 but this wasn't allowed due to an
unexpected requirement to have 12 to be a perfect modulo
of 14 which is not possible. This assumption was incorrect.
This PR fixes this old assumption properly, also adds
few tests and some negative tests as well. Improvements
are seen in error messages as well.
- Remove the requirement to honor storage class for deletes
- Improve `posix.DeleteFileBulk` code to Stat the volumeDir
only once per call, rather than for all object paths.
Recent modification in the code led to incorrect calculation
of offline disks.
This commit saves the endpoint list in a xlObjects then we know
the name of each disk.
lock ownership is limited to endpoints on first zone,
as we do not hold locks on other zones in an expanded
setup. current code unintentionally expired active locks
when it couldn't see ownership from the secondary zone
which leads to unexpected bugs as locking fails to work
as expected.
this PR enforces md5sum verification for following
API's to be compatible with AWS S3 spec
- PutObjectRetention
- PutObjectLegalHold
Co-authored-by: Harshavardhana <harsha@minio.io>
Allow downloading goroutine dump to help detect leaks
or overuse of goroutines.
Extensions are now type dependent.
Change `profiling` -> `profile` prefix, since that is what they are
not the abstract concept.
This is a precursor change before versioning,
removes/deprecates the requirement of remembering
partName and partETag which are not useful after
a multipart transaction has finished.
This PR reduces the overall size of the backend
JSON for large file uploads.
For a non-existent user server would return STS not initialized
```
aws --profile harsha --endpoint-url http://localhost:9000 \
sts assume-role \
--role-arn arn:xxx:xxx:xxx:xxxx \
--role-session-name anything
```
instead return an appropriate error as expected by STS API
Additionally also format the `trace` output for STS APIs
Upgrades between releases are failing due to strict
rule to avoid rolling upgrades, it is enough to
bump up APIs between versions to allow for quorum
failure and wait times. Authentication failures are
catastrophic in nature which leads to server not
be able to upgrade properly.
Fixes#9021Fixes#8968
To allow better control the cache eviction process.
Introduce MINIO_CACHE_WATERMARK_LOW and
MINIO_CACHE_WATERMARK_HIGH env. variables to specify
when to stop/start cache eviction process.
Deprecate MINIO_CACHE_EXPIRY environment variable. Cache
gc sweeps at 30 minute intervals whenever high watermark is
reached to clear least recently accessed entries in the cache
until sufficient space is cleared to reach the low watermark.
Garbage collection uses an adaptive file scoring approach based
on last access time, with greater weights assigned to larger
objects and those with more hits to find the candidates for eviction.
Thanks to @klauspost for this file scoring algorithm
Co-authored-by: Klaus Post <klauspost@minio.io>
Change distributed locking to allow taking bulk locks
across objects, reduces usually 1000 calls to 1.
Also allows for situations where multiple clients sends
delete requests to objects with following names
```
{1,2,3,4,5}
```
```
{5,4,3,2,1}
```
will block and ensure that we do not fail the request
on each other.
Metrics used to have its own code to calculate offline disks.
StorageInfo() was avoided because it is an expensive operation
by sending calls to all nodes.
To make metrics & server info share the same code, a new
argument `local` is added to StorageInfo() so it will only
query local disks when needed.
Metrics now calls StorageInfo() as server info handler does
but with the local flag set to false.
Co-authored-by: Praveen raj Mani <praveen@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Add dummy calls which respond success when ACL's
are set to be private and fails, if user tries
to change them from their default 'private'
Some applications such as nuxeo may have an
unnecessary requirement for this operation,
we support this anyways such that don't have
to fully implement the functionality just that
we can respond with success for default ACLs
Avoid GetObjectNInfo call from cache in CopyObjectHandler
- in the case of server side copy with metadata replacement,
the reader returned from cache is never consumed, but the net
effect of GetObjectNInfo from cache layer, is cache holding a
write lock to fill the cache. Subsequent stat operation on cache in
CopyObject is not able to acquire a read lock, thus causing the hang.
Fixes#8991
we don't need to validateFormats again once we have obtained
reference format, because it is possible that at this stage
another server is doing a disk heal during startup, once
in a while due to delays we get false positives and our
server doesn't start.
Format in quorum as reference format can be assumed as valid
and we proceed further, until and unless HealFormat re-inits
the disks after a successful heal.
Also use separate port for healing tests to avoid any
conflicts with regular build testing.
Fixes#8884
AWS S3 doesn't enforce the URL in XMLNS, accordingly, removing the
URL in XMLNS for ObjectLegalHold.
This was found while testing https://github.com/minio/minio-go/pull/1226
RegisterNotificationTargets() cleans up all connections
that it makes to notification targets when an error occurs
during its execution.
However there is a typo in the code that makes the function to always
try to access to a nil pointer in the defer code since the function
in question will always return nil in the case of any error.
This commit fixes the typo in the code.
http.Request.ContentLength can be negative, which affects
the gateway_s3_bytes_received value in Prometheus output.
The commit only increases the value of the total received bytes
in gateway mode when r.ContentLength is greater than zero.
First step is to ensure that Path component is not decoded
by gorilla/mux to avoid routing issues while handling
certain characters while uploading through PutObject()
Delay the decoding and use PathUnescape() to escape
the `object` path component.
Thanks to @buengese and @ncw for neat test cases for us
to test with.
Fixes#8950Fixes#8647
We added support for caching and S3 related metrics in #8591. As
a continuation, it would be helpful to add support for Azure & GCS
gateway related metrics as well.
The logging subsystem was initialized under init() method in
both gateway-main.go and server-main.go which are part of
same package. This created two logging targets and hence
errors were logged twice. This PR moves the init() method
to common-main.go
Streams are returning a readcloser and returning would
decrement io count instantly, fix it.
change maxActiveIOCount to 3, meaning it will pause
crawling if 3 operations are running.
Remove the random sleep. This is running in 4 goroutines,
so mostly doing nothing.
We use the getSize latency to estimate system load,
meaning when there is little load on the system and
we get the result fast we sleep a little.
If it took a long time we have high load and release
ourselves longer.
We are sleeping inside the mutex so this affects all
goroutines doing IO.
Due to a typo in the code, a cluster was not correctly creating
`background-ops` in all disks and nodes print the following error:
minio3_1 | API: SYSTEM()
minio3_1 | Time: 19:32:45 UTC 02/06/2020
minio3_1 | DeploymentID: d67c20fa-4a1e-41f5-b319-7e3e90f425d8
minio3_1 | Error: Bucket not found: .minio.sys/background-ops
minio3_1 | 2: cmd/data-usage.go:109:cmd.runDataUsageInfo()
minio3_1 | 1: cmd/data-usage.go:56:cmd.runDataUsageInfoUpdateRoutine()
This commit fixes the typo.
Adding mutex slows down the crawler to avoid large
spikes in CPU, also add millisecond interval jitter
in calculation of disk usage to slow down the spikes
further.
This is to fix a situation where an object name incorrectly
is sent with '//' in its path heirarchy, we should reject
such object names because they may be hashed to a set where
the object might not originally belong because, this can
cause situations where once object is uploaded we cannot
delete it anymore.
Fixes#8873
This commit fixes typos in the displayed server info
w.r.t. the KMS and removes the update status.
For more information about why the update status
is removed see: PR #8943
This commit removes the `Update` functionality
from the admin API. While this is technically
a breaking change I think this will not cause
any harm because:
- The KMS admin API is not complete, yet.
At the moment only the status can be fetched.
- The `mc` integration hasn't been merged yet.
So no `mc` client could have used this API
in the past.
The `Update`/`Rewrap` status is not useful anymore.
It provided a way to migrate from one master key version
to another. However, KES does not support the concept of
key versions. Instead, key migration should be implemented
as migration from one master key to another.
Basically, the `Update` functionality has been implemented just
for Vault.
- pkg/bucket/encryption provides support for handling bucket
encryption configuration
- changes under cmd/ provide support for AES256 algorithm only
Co-Authored-By: Poorna <poornas@users.noreply.github.com>
Co-authored-by: Harshavardhana <harsha@minio.io>
This commit updates the KMS getting started guide
and replaces the legacy MinIO<-->Vault setup with a
MinIO<-->KES<-->Vault setup.
Therefore, add some architecture ASCII diagrams and
provide a step-by-step guide to setup Vault, KES and
MinIO such that MinIO can encrypt objects with KES +
Vault.
The legacy Vault guide has been moved to `./vault-legacy.md`.
Co-authored-by: Harshavardhana <harsha@minio.io>
XL crawling wrongly returns a zero buckets count when
there are no objects uploaded in the server yet. The reason is
data of the crawler of posix returns invalid result when all
disks has zero objects.
A simple fix is to always pick the crawling result of the first
disk but choose over the result of the disk which has the most
objects in it.
looks like 1024 buffer size is not enough in
all situations, use 8192 instead which
can satisfy all the rare situations that
may arise in base64 decoding.
multi-delete API failed with write quorum errors
under following situations
- list of files requested for delete doesn't exist
anymore can lead to quorum errors and failure
- due to usage of query param for paths, for really
long paths MinIO server rejects these requests as
malformed as unexpected.
This was reproduced with warp
The server info handler makes a http connection to other
nodes to check if they are up but does not load the custom
CAs in ~/.minio/certs/CAs.
This commit fix it.
Co-authored-by: Harshavardhana <harsha@minio.io>
JWT parsing is simplified by using a custom claim
data structure such as MapClaims{}, also writes
a custom Unmarshaller for faster unmarshalling.
- Avoid as much reflections as possible
- Provide the right types for functions as much
as possible
- Avoid strings.Join, strings.Split to reduce
allocations, rely on indexes directly.
For 'snapshot' type profiles, record a 'before' profile that can be used
as `go tool pprof -base=before ...` to compare before and after.
"Before" profiles are included in the zipped package.
[`runtime.MemProfileRate`](https://golang.org/pkg/runtime/#pkg-variables)
should not be updated while the application is running, so we set it at startup.
Co-authored-by: Harshavardhana <harsha@minio.io>
On every restart of the server, usage was being
calculated which is not useful instead wait for
sufficient time to start the crawling routine.
This PR also avoids lots of double allocations
through strings, optimizes usage of string builders
and also avoids crawling through symbolic links.
Fixes#8844
Choosing maxAllowedIOError is arbitrary and
prone to errors, when drives might be perfectly
capable of taking I/O with only few locations
return I/O error. This is a hindrance of sort
where backend filesystems like ZFS can automatically
fix and handle these scenarios.
The added problem with current approach that we
take the drive offline, making it virtually impossible
to bring it online without restart the server which
is not desirable on a busy cluster. Remove this state
such that let the backend return error appropriately
to caller and let the caller decide what to do with
the error.
return http.ErrServerClosed with proper body when
server is shutting down, allowing more context instead
of just returning '503' which doesn't mean the same
thing.
code of this form is always racy, when the
map itself is being written to as well
```
func (r Map) retMap() map[string]string {
.. lock ..
return r.internalMap
}
func (r Map) addMap(k, v string) {
.. lock ..
r.internalMap[k] = v
}
```
Anyone reading from `retMap()` is not protected
because of locking and we need to make sure
to avoid code in this manner. Always safe to
copy the map and return.
Zone abstraction of object layer was returning `nil`
incorrectly under situations where disk healing is
not required. Returning `nil` is considered as healing
successful, which leads to unexpected ReloadFormat()
peer notification calls during startup.
This PR fixes this behavior properly for zones.
Maven repository requires HTTPS now. This lead to issues
building mint image in aws-sdk-java & minio-java.
The PR fixes the issue and also bump aws sdk version in
aws-sdk-java to the latest.
Meta volumes directories, tmp/, background-ops/, etc..
undr .minio.sys are created when disks are formatted
but also when the cluster is started.
However using MakeVolBulk() is not appropriate in the
case of a user migrating from a version which does not
have .minio.sys/background-ops/. The reason is that
MakeVolBulk() exits early when an error is occured:
errVolumeExists in this case, which is expected since
some directories such as tmp/ already exist.
This commit will avoid use MakeVolBulk and use MakeVol
instead.
Also the PR will make each node creates meta volumes
in its local disks and stop relying on the first disk
since the first node could be offline.
instead perform a liveness check call to
verify if server is online and print relevant
errors.
Also introduce a StorageErr string error type
instead of errors.New() deprecate usage of
VerifyFileError, DeleteFileError for gob,
change in datastructure also requires bump in
storage REST version to v13.
Fixes#8811
Enabling the memory profiling has a significant impact on performance.
Reduce the profiling rate by 2 orders of magnitude. It is still 128x smaller than default so it should be plenty.
object lock config is enabled for a bucket.
Creating a bucket with object lock configuration
enabled does not automatically cause WORM protection
to be applied. PUT operation needs to specifically
request object locking or bucket has to have default
retention settings configured.
Fixes regression introduced in #8657
When formatting a set validate if a host failure will likely lead to data loss.
While we don't know what config will be set in the future
evaluate to our best knowledge, assuming default settings.
X-Cache sets cache status of HIT if object is
served from the disk cache, or MISS otherwise.
X-Cache-Lookup is set to HIT if object was found
in the cache even if not served (for e.g. if cache
entry was invalidated by ETag verification)
A new key was added in identity_openid recently
required explicitly for client to set the optional
value without that it would be empty, handle this
appropriately.
Fixes#8787
Use reference format to initialize lockers
during startup, also handle `nil` for NetLocker
in dsync and remove *errorLocker* implementation
Add further tuning parameters such as
- DialTimeout is now 15 seconds from 30 seconds
- KeepAliveTimeout is not 20 seconds, 5 seconds
more than default 15 seconds
- ResponseHeaderTimeout to 10 seconds
- ExpectContinueTimeout is reduced to 3 seconds
- DualStack is enabled by default remove setting
it to `true`
- Reduce IdleConnTimeout to 30 seconds from
1 minute to avoid idleConn build up
Fixes#8773
- Stop spawning store replay routines when testing the notification targets
- Properly honor the target.Close() to clean the resources used
Fixes#8707
Co-authored-by: Harshavardhana <harsha@minio.io>
In certain organizations policy claim names
can be not just 'policy' but also things like
'roles', the value of this field might also
be *string* or *[]string* support this as well
In this PR we are still not supporting multiple
policies per STS account which will require a
more comprehensive change.
Exponential backoff does not seem like a good fit for
this function since we can expect a few roundtrips on
initial startup.
This retry loop get slow pretty quickly with initial
wait being 1 second and each try being double the
wait until 30 seconds is reached.
Instead simply try 2 times per second.
This PR adds jsoniter package to replace encoding/json
in places where faster json unmarshal is necessary
whenever input JSON is large enough.
Some benchmarking comparison between jsoniter and enconding/json
benchmark old MB/s new MB/s speedup
BenchmarkParseUnmarshal/N10-4 110.02 331.17 3.01x
BenchmarkParseUnmarshal/N100-4 125.74 524.09 4.17x
BenchmarkParseUnmarshal/N500-4 131.68 542.60 4.12x
BenchmarkParseUnmarshal/N1000-4 133.93 514.88 3.84x
BenchmarkParseUnmarshal/N5000-4 122.10 415.36 3.40x
BenchmarkParseUnmarshal/N10000-4 132.13 403.90 3.06x
Fixes scenario where zones are appropriately
handled, along with supporting overriding set
count. The new fix also ensures that we handle
the various setup types properly.
Update documentation to properly indicate the
behavior.
Fixes#8750
Co-authored-by: Nitish Tiwari <nitish@minio.io>
Currently when connections to vault fail, client
perpetually retries this leads to assumptions that
the server has issues and masks the problem.
Re-purpose *crypto.Error* type to send appropriate
errors back to the client.
In existing functionality we simply return a generic
error such as "MalformedPolicy" which indicates just
a generic string "invalid resource" which is not very
meaningful when there might be multiple types of errors
during policy parsing. This PR ensures that we send
these errors back to client to indicate the actual
error, brings in two concrete types such as
- iampolicy.Error
- policy.Error
Refer #8202
minio server /data{1..4} shows an error about inability to bind a port, though
the real problem is /data{1..4} cannot be created because of the lack of
permissions.
This commit fix the behavior.
- We should declare a cluster ready even if read quorum is achieved (atleast n/2 disks are online).
- Such that, all the zones should have enough read quorum. Thus making the cluster ready for reads.
We had messy cyclical dependency problem with `mc`
due to dependencies in pkg/console, moved the pkg/console
to minio for more control and also to avoid any further
cyclical dependencies of `mc` clobbering up the
dependencies on server.
Fixes#8659
If two distinct clusters are started with different domains
along with single common domain, this situation was leading
to conflicting buckets getting created on different clusters
To avoid this do not prematurely error out if the key has no
entries, let the caller decide on which entry matches and
which entry is valid. This allows support for MINIO_DOMAIN
with one common domain, but each cluster may have their own
domains.
Fixes#8705
This is to ensure that when we have multiple tenants
deployed all sharing the same etcd for global bucket
should avoid listing each others buckets, this leads
to information leak which should be avoided unless
etcd is not namespaced for IAM assets in which case
it can be assumed that its a federated setup.
Federated setup and namespaced IAM assets on etcd
is not supported since namespacing is only useful
when you wish to separate the tenants as isolated
instances of MinIO.
This PR allows a new type of behavior, primarily
driven by the usecase of m3(mkube) multi-tenant
deployments with global bucket support.
Currently all bucket events are sent to all watchers
with matching prefix and event names, this becomes
problematic and prone to performance issues, fix this
situation by filtering based on buckets as well.
This PR fixes the issue where we might allow policy changes
for temporary credentials out of band, this situation allows
privilege escalation for those temporary credentials. We
should disallow any external actions on temporary creds
as a practice and we should clearly differentiate which
are static and which are temporary credentials.
Refer #8667
Changes in IP underneath are dynamic in replica sets
with multiple tenants, so deploying in that fashion
will not work until we wait for atleast one participatory
server to be local.
This PR also ensures that multi-tenant zone expansion also
works in replica set k8s deployments.
Introduces a new ENV `KUBERNETES_REPLICA_SET` check to call
appropriate code paths.
Continuation from #8629 which basically broke
zone deployments on k8s statefulset environment
due to incorrect assumptions which made it work
on replicated set.
Fix this properly such that this container works
for both replicated set and stateful set deployment
This commit removes github.com/minio/kes as
a dependency and implements the necessary
client-side functionality without relying
on the KES project.
This resolves the licensing issue since
KES is licensed under AGPL while MinIO
is licensed under Apache.
With this PR,cache eviction will continue until
no LRU entries older than an hour can be cache
evicted or sufficient percentage of disk space
has been reclaimed.
This ensures that we can update the
- .minio.sys is updated for accounting/data usage purposes
- .minio.sys is updated to indicate if backend is encrypted
or not.
The approach is that now safe mode is only invoked when
we cannot read the config or under some catastrophic
situations, but not under situations when config entries
are invalid or unreachable. This allows for maximum
availability for MinIO and not fail on our users unlike
most of our historical releases.
This commit adds support for the minio/kes KMS.
See: https://github.com/minio/kes
In particular you can configure it as KMS by:
- `export MINIO_KMS_KES_ENDPOINT=` // Server URL
- `export MINIO_KMS_KES_KEY_FILE=` // TLS client private key
- `export MINIO_KMS_KES_CERT_FILE=` // TLS client certificate
- `export MINIO_KMS_KES_CA_PATH=` // Root CAs issuing server cert
- `export MINIO_KMS_KES_KEY_NAME=` // The name of the (default)
master key
Admin data usage info API returns the following
(Only FS & XL, for now)
- Number of buckets
- Number of objects
- The total size of objects
- Objects histogram
- Bucket sizes
In replica sets, hosts resolve to localhost
IP automatically until the deployment fully
comes up. To avoid this issue we need to
wait for such resolution.
Final update to all messages across sub-systems
after final review, the only change here is that
NATS now has TLS and TLSSkipVerify to be consistent
for all other notification targets.
This PR adds support below metrics
- Cache Hit Count
- Cache Miss Count
- Data served from Cache (in Bytes)
- Bytes received from AWS S3
- Bytes sent to AWS S3
- Number of requests sent to AWS S3
Fixes#8549
Fixes an issue reported by @klauspost and @vadmeste
This PR also allows users to expand their clusters
from single node XL deployment to distributed mode.
Currently, we use the top-level prefix "config/"
for all our IAM assets, instead of to provide
tenant-level separation bring 'path_prefix'
to namespace the access properly.
Fixes#8567
StorageInfo() call is supposed to give each
server/disk information independently, rely
on this appropriately so that `mc admin info server`
gets correct information all the time.
Regression from #8509 which changes objectsToDelete entry
from a list to map. This will cause index out of range
panic if object is not selected for delete.
level - this PR builds on #8120 which
added PutBucketObjectLockConfiguration and
GetBucketObjectLockConfiguration APIS
This PR implements PutObjectRetention,
GetObjectRetention API and enhances
PUT and GET API operations to display
governance metadata if permissions allow.
- added ability to specify CA for self-signed certificates
- added option to authenticate using client certificates
- added unit tests for nats connections
We should only read ahead if we are reading big files. We enable it for files >= 16MB.
Benchmark on 64KB objects.
Before:
```
Operation: GET
Errors: 0
Average: 59.976s, 87.13 MB/s, 1394.07 ops ended/s.
Fastest: 1s, 90.99 MB/s, 1455.00 ops ended/s.
50% Median: 1s, 87.53 MB/s, 1401.00 ops ended/s.
Slowest: 1s, 81.39 MB/s, 1301.00 ops ended/s.
```
After:
```
Operation: GET
Errors: 0
Average: 59.992s, 207.99 MB/s, 3327.85 ops ended/s.
Fastest: 1s, 219.20 MB/s, 3507.00 ops ended/s.
50% Median: 1s, 210.54 MB/s, 3368.00 ops ended/s.
Slowest: 1s, 179.14 MB/s, 2865.00 ops ended/s.
```
The 64KB buffer is actually a small disadvantage for this case, but I believe it will be better in general than no buffer.
- Migrate and save only settings which are enabled
- Rename logger_http to logger_webhook and
logger_http_audit to audit_webhook
- No more pretty printing comments, comment
is a key=value pair now.
- Avoid quotes on values which do not have space in them
- `state="on"` is implicit for all SetConfigKV unless
specified explicitly as `state="off"`
- Disabled IAM users should be disabled always
The code does not properly set a new deployemnt ID when not present
in format.json: it loops twice without releasing write lock on format.json
causing an infinite locking error on the same file.
This commit fixes and simplifies a little the code.
This PR implements locking from a global entity into
a more localized set level entity, allowing for locks
to be held only on the resources which are writing
to a collection of disks rather than a global level.
In this process this PR also removes the top-level
limit of 32 nodes to an unlimited number of nodes. This
is a precursor change before bring in bucket expansion.
This PR also fixes issues related to
- Add proper newline for `mc admin config get` output
for more than one targets
- Fixes issue of temporary user credentials to have
consistent output
- Fixes a crash when setting a key with empty values
- Fixes a parsing issue with `mc admin config history`
- Fixes gateway ENV handling for etcd server and gateway
This PR fixes issues found in config migration
- StorageClass migration error when rrs is empty
- Plain-text migration of older config
- Do not run in safe mode with incorrect credentials
- Update logger_http documentation for _STATE env
Refer more reported issues at #8434
This PR refactors object layer handling such
that upon failure in sub-system initialization
server reaches a stage of safe-mode operation
wherein only certain API operations are enabled
and available.
This allows for fixing many scenarios such as
- incorrect configuration in vault, etcd,
notification targets
- missing files, incomplete config migrations
unable to read encrypted content etc
- any other issues related to notification,
policies, lifecycle etc
The JSON stream library has no safe way of aborting while
Since we cannot expect the called to safely handle "Read" and "Close" calls we must handle this.
Also any Read error returned from upstream will crash the server. We preserve the errors and instead always return io.EOF upstream, but send the error on Close.
`readahead v1.3.1` handles Read after Close better.
Updates to `progressReader` is mostly to ensure safety.
Fixes#8481
This PR brings support for `history` list to
list in the following agreed format
```
~ mc admin config history list -n 2 myminio
RestoreId: df0ebb1e-69b0-4043-b9dd-ab54508f2897
Date: Mon, 04 Nov 2019 17:27:27 GMT
region name="us-east-1" state="on"
region name="us-east-1" state="on"
region name="us-east-1" state="on"
region name="us-east-1" state="on"
RestoreId: ecc6873a-0ed3-41f9-b03e-a2a1bab48b5f
Date: Mon, 04 Nov 2019 17:28:23 GMT
region name=us-east-1 state=off
```
This PR also moves the help templating and coloring to
fully `mc` side instead than `madmin` API.
This PR adds code to appropriately handle versioning issues
that come up quite constantly across our API changes. Currently
we were also routing our requests wrong which sort of made it
harder to write a consistent error handling code to appropriately
reject or honor requests.
This PR potentially fixes issues
- old mc is used against new minio release which is incompatible
returns an appropriate for client action.
- any older servers talking to each other, report appropriate error
- incompatible peer servers should report error and reject the calls
with appropriate error
This is to avoid making calls to backend and requiring
gateways to allow permissions for ListBuckets() operation
just for Liveness checks, we can avoid this and make
our liveness checks to be more performant.
- Supports migrating only when the credential ENVs are set,
so any FS mode deployments which do not have ENVs set will
continue to remain as is.
- Credential ENVs can be rotated using MINIO_ACCESS_KEY_OLD
and MINIO_SECRET_KEY_OLD envs, in such scenarios it allowed
to rotate the encrypted content to a new admin key.
This commit replaces the returned error message by
the hardware info handler from `Method-Not-Allowed`
to `Bad-Request` since the current HTTP error is not
correct according to the HTTP spec.
In particular:
```
The origin server MUST generate an Allow header field
in a 405 response containing a list of the target
resource's currently supported methods.
```
From: https://tools.ietf.org/html/rfc7231#section-6.5.5
- This PR allows config KVS to be validated properly
without being affected by ENV overrides, rejects
invalid values during set operation
- Expands unit tests and refactors the error handling
for notification targets, returns error instead of
ignoring targets for invalid KVS
- Does all the prep-work for implementing safe-mode
style operation for MinIO server, introduces a new
global variable to toggle safe mode based operations
NOTE: this PR itself doesn't provide safe mode operations
This commit bumps the version of the `sio` library
from v0.2.0 => v0.3.0. Now, `madmin` can use the
`Algorithm` type constants that make the encrypt/decrypt
code simpler.
The new auto healing model selects one node always responsible
for auto-healing the whole cluster, erasure set by erasure set.
If that node dies, another node will be elected as a leading
operator to perform healing.
This code also adds a goroutine which checks each 10 minutes
if there are any new unformatted disks and performs its healing
in that case, only the erasure set which has the new disk will
be healed.
While Tracing requests on server, type assertion on logger.ResponseWriter
caused nil pointer exception because of recordAPIStats{} being
used as ResponseWriter. This PR avoids the type assertion and
initializes a new logger.ResponseWriter.
Fixes regression introduced in #8003
- adding oauth support to MinIO browser (#8400) by @kanagaraj
- supports multi-line get/set/del for all config fields
- add support for comments, allow toggle
- add extensive validation of config before saving
- support MinIO browser to support proper claims, using STS tokens
- env support for all config parameters, legacy envs are also
supported with all documentation now pointing to latest ENVs
- preserve accessKey/secretKey from FS mode setups
- add history support implements three APIs
- ClearHistory
- RestoreHistory
- ListHistory
- add help command support for each config parameters
- all the bug fixes after migration to KV, and other bug
fixes encountered during testing.
The measures are consolidated to the following metrics
- `disk_storage_used` : Disk space used by the disk.
- `disk_storage_available`: Available disk space left on the disk.
- `disk_storage_total`: Total disk space on the disk.
- `disks_offline`: Total number of offline disks in current MinIO instance.
- `disks_total`: Total number of disks in current MinIO instance.
- `s3_requests_total`: Total number of s3 requests in current MinIO instance.
- `s3_errors_total`: Total number of errors in s3 requests in current MinIO instance.
- `s3_requests_current`: Total number of active s3 requests in current MinIO instance.
- `internode_rx_bytes_total`: Total number of internode bytes received by current MinIO server instance.
- `internode_tx_bytes_total`: Total number of bytes sent to the other nodes by current MinIO server instance.
- `s3_rx_bytes_total`: Total number of s3 bytes received by current MinIO server instance.
- `s3_tx_bytes_total`: Total number of s3 bytes sent by current MinIO server instance.
- `minio_version_info`: Current MinIO version with commit-id.
- `s3_ttfb_seconds_bucket`: Histogram that holds the latency information of the requests.
And this PR also modifies the current StorageInfo queries
- Decouples StorageInfo from ServerInfo .
- StorageInfo is enhanced to give endpoint information.
NOTE: ADMIN API VERSION IS BUMPED UP IN THIS PR
Fixes#7873
xl.isObject() returns 'nil' for not found disks when
calculating the existance of xl.json for a given object,
which what StatFile() is also doing (setting nil) if
xl.json exists.
This commit avoids this confusion by setting errDiskNotFound
error when the storage disk is not found.
When `mc admin user add` is attempted in gateway mode without
etcd setup, NoSuchBucket error is returned instead of MethodNotAllowed.
Regression from commit - e48005ddc7
Background append creates a temporary file which appends
uploaded parts as long as they are available, but when a
client stops the upload, the temporary file is not removed
by any way.
This commit removes the temporary file when the server does
its regular removing stale multipart uploads.
specific errors, `application` errors or `all` by default.
console logging on server by default lists all logs -
enhance admin console API to accept `type` as query parameter to
subscribe to application/minio logs.
- This PR fixes situation to avoid underflow, this is possible
because of disconnected operations in replay/sendEvents
- Hold right locks if Del() operation is performed in Get()
- Remove panic in the code and use loggerOnce
- Remove Timer and instead use Ticker instead for proper ticks
On Kubernetes/Docker setups DNS resolves inappropriately
sometimes where there are situations same endpoints with
multiple disks come online indicating either one of them
is local and some of them are not local. This situation
can never happen and its only a possibility in orchestrated
deployments with dynamic DNS. Following code ensures that we
treat if one of the endpoint says its local for a given host
it is true for all endpoints for the same host. Following code
ensures that this assumption is true and it works in all
scenarios and it is safe to assume for a given host.
This PR also adds validation such that we do not crash the
server if there are bugs in the endpoints list in dsync
initialization.
Thanks to Daniel Valdivia <hola@danielvaldivia.com> for
reproducing this, this fix is needed as part of the
https://github.com/minio/m3 project.
This change is related to larger config migration PR
change, this is a first stage change to move our
configs to `cmd/config/` - divided into its subsystems
Current master repeatedly calls ListBuckets() during
initialization of multiple sub-systems
Use single ListBuckets() call for each sub-system as
follows
- LifeCycle
- Policy
- Notification
- Heal if the part.1 is truncated from its original size
- Heal if the part.1 fails while being verified in between
- Heal if the part.1 fails while being at a certain offset
Other cleanups include make sure to flush the HTTP responses
properly from storage-rest-server, avoid using 'defer' to
improve call latency. 'defer' incurs latency avoid them
in our hot-paths such as storage-rest handlers.
Fixes#8319
Minio V2 listing uses object names/prefixes as continuation tokens. This
is problematic when object names contain some characters that are forbidden
in XML documents. This PR will use base64 encoded form of continuation
and next continuation tokens to address that corner case.
errors.errorString() cannot be marshalled by gob
encoder, so using a slice of []error would fail
to be encoded. This leads to no errors being
generated instead gob.Decoder on the storage-client
would see an io.EOF
To avoid such bugs introduce a typed error for
handling such translations and register this type
for gob encoding support.
On objects bigger than 100MiB can have a corrupted object
stored due to partial blockListing attempted right after
each blocks uploaded. Simplify this code to ensure that
all the blocks successfully uploaded are committed right
away.
This PR also updates the azure-sdk-go to latest release.
This commit updates the reedsolomon dependency
since it contains an fix for an unexpected property
of the `Split` function.
See: klauspost/reedsolomon#109
Force sum/average to be calculated as a float.
As noted in #8221
> run SELECT AVG(CAST (Score as int)) FROM S3Object on
```
Name,Score
alice,80
bob,81
```
> AWS S3 gives 80.5 and MinIO gives 80.
This also makes overflows much more unlikely.
Updates #7475
The Java implementation has a 128KB buffer and a message must be emitted before that is used. #7475 therefore limits the message size to 128KB. But up to 256 bytes are written to the buffer in each call. This means we must emit a message before shorter than 128KB.
Therefore we change the limit to 128KB minus 256 bytes.
- Choose a unique uuid such that under situations of duplicate
mounts we do not append to an existing json entry.
- Avoid AppendFile instead use WriteAll() to write the entire
byte array atomically.
When url encoding is passed in v2 listing handler, continuationToken
and nextContinuationToken needs to be encoded. The reason is that
both represents an object name/prefix in Minio server and it could
contain a character unsupported by XML specification.
This PR additionally also adds support for missing
- Session policy support for AD/LDAP
- Add API request/response parameters detail
- Update example to take ldap username,
password input from the command line
- Fixes session policy handling for
ClientGrants and WebIdentity
This commit removes the SSE-S3 key rotation functionality
from CopyObject since there will be a dedicated Admin-API
for this purpose.
Also update the security documentation to link to mc and
the admin documentation.
This commit allows the MinIO server to parse the metadata if:
- either the `X-Minio-Internal-Server-Side-Encryption-S3-Key-Id`
and the `X-Minio-Internal-Server-Side-Encryption-S3-Kms-Sealed-Key`
entries are present.
- or *both* headers are not present.
This is in service to support a K/V data key store.
Queue output items and reuse them.
Remove the unneeded type system in sql and just use the Go type system.
In best case this is more than an order of magnitude speedup:
```
BenchmarkSelectAll_1M-12 1 1841049400 ns/op 274299728 B/op 4198522 allocs/op
BenchmarkSelectAll_1M-12 14 84833400 ns/op 169228346 B/op 3146541 allocs/op
```
In commit a8296445ad we changed the code to handle
some corner cases on ARM and other platforms, this
PR just avoids the return for unknown filetypes
prematurely and let the name be populated appropriately.
This fixes bug for older XFS implementations such as
in Ubuntu 14.04
This commit fixes a subtle bug that (probably)
caused an issue affecting encrypted multipart objects.
When a cluster has no quorum this bug causes `ListObjectParts`
to return nil as error instead of a quorum error.
Thanks to @harshavardhana for detecting this.
This PR changes cache on PUT behavior to background fill the cache
after PutObject completes. This will avoid concurrency issues as in #8219.
Added cleanup of partially filled cache to prevent cache corruption
- Fixes#8208
VerifyFile in the distributed setup does not work with
the non streaming highway hash. The reason is that the
internode mux router did not expect `storageRESTBitrotHash`
parameter.
endpoint.IsLocal will not have .Host entries so
using them to skip double entries will never work.
change the code such that we look for endpoint.Host
outside of endpoint.IsLocal logic to skip double
hosts appropriately.
Move these functions to their appropriate file.
- ListObjectsHeal should list only objects
which need healing, not the entire namespace.
- DeleteObjects() to be used to delete 1000s of
objects in bulk instead of serially.
Add LDAP based users-groups system
This change adds support to integrate an LDAP server for user
authentication. This works via a custom STS API for LDAP. Each user
accessing the MinIO who can be authenticated via LDAP receives
temporary credentials to access the MinIO server.
LDAP is enabled only over TLS.
User groups are also supported via LDAP. The administrator may
configure an LDAP search query to find the group attribute of a user -
this may correspond to any attribute in the LDAP tree (that the user
has access to view). One or more groups may be returned by such a
query.
A group is mapped to an IAM policy in the usual way, and the server
enforces a policy corresponding to all the groups and the user's own
mapped policy.
When LDAP is configured, the internal MinIO users system is disabled.
It looks like from implementation point of view fastjson
parser pool doesn't behave the same way as expected
when dealing many `xl.json` from multiple disks.
The fastjson parser pool usage ends up returning incorrect
xl.json entries for checksums, with references pointing
to older entries. This led to the subtle bug where checksum
info is duplicated from a previous xl.json read of a different
file from different disk.
With this PR, liveness check responds with 200 OK with "server-not-
initialized" header while objectLayer gets initialized. The header
is removed as objectLayer is initialized. This is to allow
MinIO distributed cluster to get started when running on an
orchestration platforms like Docker Swarm.
This PR also updates sample Swarm yaml files to use correct values
for healthcheck fields.
Fixes#8140
This commit adds an admin API route and handler for
requesting status information about a KMS key.
Therefore, the client specifies the KMS key ID (when
empty / not set the server takes the currently configured
default key-ID) and the server tries to perform a dummy encryption,
re-wrap and decryption operation. If all three succeed we know that
the server can access the KMS and has permissions to generate, re-wrap
and decrypt data keys (policy is set correctly).
This is a defensive change to avoid any future issues,
from this part of the code. New change also ensures
to populate UUID if present for the right disk.
This commit fixes the web ZIP download handler for
encrypted objects. The decryption logic has moved into
`getObjectNInfo`. So trying to decrypt the (already decrypted)
content again in the ZIP handler obviously causes an error.
This commit fixes this by removing the decryption logic from the
the handler.
Fixes#7965
The underlying errors are important, for IAM
requirements and should wait appropriately at
the caller level, this allows for distributed
setups to run properly and not fail prematurely
during startup.
Also additionally fix the onlineDisk counting
The change now is to ensure that we take custom URL as
well for updating the deployment, this is required for
hotfix deliveries for certain deployments - other than
the community release.
This commit changes the previous work d65a2c6725
with newer set of requirements.
Also deprecates PeerUptime()
net.Error is very unreliable in providing better error
handling, we need to ensure that we always have a fallback
option in case of network failures.
This fixes an important issue in our distributed server
setups when one of the servers is down, all deployments
out there are recommended to upgrade after this fix is
merged to ensure that availability is not lost.
Fixes#8127Fixes#8016Fixes#7964
This API implementation simply behaves like listObjects()
but returns back single version for each object, this
implementation should be considered dummy it is only
meant for some applications which rely on this.
This is to avoid using unsafe.Pointer type
code dependency for MinIO, this causes
crashes on ARM64 platforms
Refer #8005 collection of runtime crashes due
to unsafe.Pointer usage incorrectly. We have
seen issues like this before when using
jsoniter library in the past.
This PR hopes to fix this using fastjson
Add better dynamic timeouts for locks, also
add jitters before launching daily sweep to ensure
that not all the servers in distributed setup
are not trying to hold locks to begin the sweep
round.
Also, add enough delay for incoming requests based
on totalSetCount*totalDriveCount.
A possible fix for #8071
There are multiple possibilities for running MinIO within
a container e.g. configurable address, non-root user etc.
This makes it difficult to identify actual IP / Port to
use to check healthcheck status from within a container.
It is simpler to use external healthcheck mechanisms
like healthcheck command in docker-compose to check
for MinIO health status. This is similar to how checks
work in Kubernetes as well.
This PR removes the healthcheck script used inside
Docker container and ad documentation on how to
use docker-compose based healthcheck mechanism.
It is observed that when `mc admin trace` is being
used due to ResponseWriter wrapper, we loose information
about statusCode,statusText for audit logging.
This PR fixes this behavior
This avoids a network call, also fixes an issue
when empty paths are passed the underlying call
fails with "405 Method Not Allowed".
This is reproducible when you are deleting a
non-existent object.
Fixes#8083
Add API to set policy mapping for a user or group
Contains a breaking Admin APIs change.
- Also enforce all applicable policies
- Removes the previous /set-user-policy API
Bump up peerRESTVersion
Add get user info API to show groups of a user
This change will allow users to navigate to their desired locations,
including buckets and directories that haven't been "created" yet
Fixes#7883
Add tests
Change tooltip wording
Migrate to Font Awesome 5 to use path icon
Fix sidebar not closing on mobile
* Cleanup ui-errors and print proper error messages
Change HELP to HINT instead, handle more error
cases when starting up MinIO. One such is related
to #8048
* Apply suggestions from code review
When a peer client which higher version sends a request to a peer
server with lower version, the returned status code is 200 OK instead
of 405 code. The reason is that the peer client request reaches the
browser handler, which registers itself by '/minio' route but without
any other constraints. Adding filtering by user agent header to the
browser route so internal requests to old endpoints versions return
405 error code.
This is a behavior change from AWS S3, but it is done with
better judgment on our end to allow the listing of buckets only
which user has access to.
The advantage is this declutters the UI for users and only
lists bucket which they have access to.
Precursor for this feature to be applicable is a policy
must have the following actions
```
s3:ListAllMyBuckets
```
and
```
s3:ListBucket
```
enabled in the policy.
Fixes#7458Fixes#7573Fixes#7938Fixes#6934Fixes#6265Fixes#6630
This will allow the cache to consistently work for
server and gateways. Range GET requests will
be cached in the background after the request
is served from the backend.
- All cached content is automatically bitrot protected.
- Avoid ETag verification if a cache-control header
is set and the cached content is still valid.
- This PR changes the cache backend format, and all existing
content will be migrated to the new format. Until the data is
migrated completely, all content will be served from the backend.
Refactor the Dirent parsing code such that when we
calculate offsets are correct based on the platform
This PR fixes a silent potential crash on ARM
architecture.
Without explicit conversion to UTC() from Unix
time the zone information is lost, this leads
to XML marshallers marshaling the time into
a wrong format.
This PR fixes the compatibility issue with AWS STS
API by keeping Expiration format close to ISO8601
or RFC3339
Fixes#8041
This commit fixes a DoS issue that is caused by an incorrect
SHA-256 content verification during STS requests.
Before that fix clients could write arbitrary many bytes
to the server memory. This commit fixes this by limiting the
request body size.
This change adds admin APIs and IAM subsystem APIs to:
- add or remove members to a group (group addition and deletion is
implicit on add and remove)
- enable/disable a group
- list and fetch group info
When checking if federation is necessary, the code compares
the SRV record stored in etcd against the list of endpoints
that the MinIO server is exposing. If there is an intersection
in this list the request is forwarded.
The SRV record includes both the host and the port, but the
intersection check previously only looked at the IP address. This
would prevent federation from working in situations where the endpoint
IP is the same for multiple MinIO servers. Some examples of where this
can occur are:
- running mulitiple copies of MinIO on the same host
- using multiple MinIO servers behind a NAT with port-forwarding
Since MinIO by default is not fully S3 compatible, this fact should be
specified in a prominent place in the quick start guide so people
new to MinIO don't have to spend hours figuring it out the hard way.
This commit adds a new method `UpdateKey` to the KMS
interface.
The purpose of `UpdateKey` is to re-wrap an encrypted
data key (the key generated & encrypted with a master key by e.g.
Vault).
For example, consider Vault with a master key ID: `master-key-1`
and an encrypted data key `E(dk)` for a particular object. The
data key `dk` has been generated randomly when the object was created.
Now, the KMS operator may "rotate" the master key `master-key-1`.
However, the KMS cannot forget the "old" value of that master key
since there is still an object that requires `dk`, and therefore,
the `D(E(dk))`.
With the `UpdateKey` method call MinIO can ask the KMS to decrypt
`E(dk)` with the old key (internally) and re-encrypted `dk` with
the new master key value: `E'(dk)`.
However, this operation only works for the same master key ID.
When rotating the data key (replacing it with a new one) then
we perform a `UnsealKey` operation with the 1st master key ID
and then a `GenerateKey` operation with the 2nd master key ID.
This commit also updates the KMS documentation and removes
the `encrypt` policy entry (we don't use `encrypt`) and
add a policy entry for `rewarp`.
After some extensive refactors, it turned out empty directories
are not healed and heal status is also not reported correctly.
This commit fixes it and adds the appropriate unit tests
This PR fixes relying on r.Context().Done()
by setting
```
Connection: "close"
```
HTTP Header, this has detrimental issues for
client side connection pooling. Since this
header explicitly tells clients to turn-off
connection pooling. This causing pro-active
connections to be closed leaving many conn's
in TIME_WAIT state. This can be observed with
`mc admin trace -a` when running distributed
setup.
This PR also fixes tracing filtering issue
when bucket names have `minio` as prefixes,
trace was erroneously ignoring them.
Golang proactively prints this error
`http: proxy error: context canceled`
when a request arrived to the current deployment and
redirected to another deployment in a federated setup.
Since this error can confuse users, this commit will
just hide it.
Allow renaming/editing a notification config. By replying with
a successful GetBucketNotification response, without checking
for any missing config ARN in targetList.
Fixes#7650
Related to #7982, this PR refactors the code
such that we validate the OPA or JWKS in a
common place.
This is also a refactor which is already done
in the new config migration change. Attempt
to avoid any network I/O during Unmarshal of
JSON from disk, instead do it later when
updating the in-memory data structure.
In situations such as when client uploading data,
prematurely disconnects from server such as pressing
ctrl-c before uploading all the data. Under this
situation in distributed setup we prematurely
disconnect disks causing a reconnect loop. This has
an adverse affect we end up leaving a lot of files
in temporary location which ideally should have been
cleaned up when Put() prematurely fails.
This is also a regression which got introduced in #7610
- Policy mapping is now at `config/iam/policydb/users/myuser1.json`
and includes version.
- User identity file is now versioned.
- Migrate old data to the new format.
Calling ListMultipartUploads fails if an upload is aborted while a
part is being uploaded because the directory for the upload exists
(since fsRenameFile ends up calling os.MkdirAll) but the meta JSON file
doesn't. To fix this we make sure an upload hasn't been aborted during
PutObjectPart by checking the existence of the directory for the upload
while moving the temporary part file into it.
Problem: MinIO incorrectly appends DNS SRV records of buckets that have a prefix match with a given bucket. E.g bucket1 would incorrectly get bucket's DNS records too.
Solution: This fix ensures that we only add SRV records that match the key exactly
This PR is based off @sinhaashish's PR for object lifecycle
management, which includes support only for,
- Expiration of object
- Filter using object prefix (_not_ object tags)
N B the code for actual expiration of objects will be included in a
subsequent PR.
There is no reliable way to handle fallbacks for
MinIO deployments, due to various command line
options and multiple locations which require
access inside container.
Parsing command line options is tricky to figure
out which is the backend disk etc, we did try
to fix this in implementations of check-user.go
but it wasn't complete and introduced more bugs.
This PR simplifies the entire approach to rather
than running Docker container as non-root by default
always, it allows users to opt-in. Such that they
are aware that that is what they are planning to do.
In-fact there are other ways docker containers can
be run as regular users, without modifying our
internal behavior and adding more complexities.
Current code already allows users to GetPolicy/SetPolicy
there was a missing code in ListAllBucketPolicies to allow
access, this fixes this behavior.
Fixes#7913
The fix in #7646 introduced a regression which
was left unnoticed, the fix didn't work for
sub-commands unfortunately. This fixes it
by moving v1.21.0 version of the minio/cli
package.
Fixes#7924
posix.VerifyFile() doesn't know how to check if a file
is corrupted if that file is empty. We do have the part
size in xl.json so we pass it to VerifyFile to return
an error so healing empty parts can work properly.
When MinIO is behind a proxy, proxies end up killing
clients when no data is seen on the connection, adding
the right content-type ensures that proxies do not come
in the way.
Update prompt shows some weird characters under Windows, the reason
is that go-prompt is used to show a yes/no prompt, since go-prompt
does not seem to have a way to support color/fatih, this PR will
implements its own yes/no prompt with the correct text coloration.
The SQL parser as it stands right now ignores alias for aggregate
result, e.g. `SELECT COUNT(*) AS thing FROM s3object` doesn't actually
return record like `{"thing": 42}`, it returns a record like `{"_1": 42}`.
Column alias for aggregate result is supported in AWS's S3 Select, so
this commit fixes that by respecting the `expr.As` in the expression.
Also improve test for S3 select
On top of testing a simple `SELECT` query, we want to test a few more
"advanced" queries (e.g. aggregation).
Convert existing tests into table driven tests[1], and add the new test
cases with "advanced" queries into them.
[1] - https://github.com/golang/go/wiki/TableDrivenTests
This allows for canonicalization of the strings
throughout our code and provides a common space
for all these constants to reside.
This list is rather non-exhaustive but captures
all the headers used in AWS S3 API operations
r.URL.Path is empty when HEAD bucket with virtual
DNS requests come in since bucket is now part of
r.Host, we should use our domain names and fetch
the right bucket/object names.
This fixes an really old issue in our federation
setups.
This API returns the information related to the self healing routine.
For the moment, it returns:
- The total number of objects that are scanned
- The last time when an item was scanned
- return all objects in web-handlers listObjects response
- added local pagination to object list ui
- also fixed infinite loader and removed unused fields
While checking for port availability, ip address should be included.
When a machine has multiple ip addresses, multiple minio instances
or some other applications can be run on same port but different
ip address.
Fixes#7685
- Snappy is not and RLE compressor, it is LZ77 based.
- Add `xz` as a common file type.
- Add most common media container types.
- Never heard of `application/x-spoon`. Google turns up a blank as well.
- Change link to minio blog post on compression & encryption.
This commit removes the encryption key section from
the certool.exe docs because:
- MinIO does not support any TLS cipher that encrypts
something with the private key. We only support PFS
ciphers.
- The doc comment is not really accurate anyway.
This PR adds support for adding session policies
for further restrictions on STS credentials, useful
in situations when applications want to generate
creds for multiple interested parties with different
set of policy restrictions.
This session policy is not mandatory, but optional.
Fixes#7732
This commit relaxes the restriction that the MinIO gateway
does not accept SSE-KMS headers. Now, the S3 gateway allows
SSE-KMS headers for PUT and MULTIPART PUT requests and forwards them
to the S3 gateway backend (AWS). This is considered SSE pass-through
mode.
Fixes#7753
This PR fixes a security issue where an IAM user based
on his policy is granted more privileges than restricted
by the users IAM policy.
This is due to an issue of prefix based Matcher() function
which was incorrectly matching prefix based on resource
prefixes instead of exact match.
Previously the read/write lock applied both for gateway use cases as
well the object store use case. Nothing from sys is touched or looked
at in the gateway usecase though, so we don't need to lock. Don't lock
to make the gateway policy getting a little more efficient, particularly
as where this is called from (checkRequestAuthType) is quite common.
etcd when used in federated setups, currently
mandates that all clusters should have same
config.json, which is too restrictive and makes
federation a restrictive environment.
This change makes it apparent that each cluster
needs to be independently managed if necessary
from `mc admin info` command line.
Each cluster with in federation can have their
own root credentials and as well as separate
regions. This way buckets get further restrictions
and allows for root creds to be not common
across clusters/data centers.
Existing data in etcd gets migrated to backend
on each clusters, upon start. Once done
users can change their config entries
independently.
This allows MinIO containers to run properly without
expecting higher privileges in situations where following
restrictions on containers are used
- docker run --user uid:gid
- docker-compose up (with docker-compose.yml with user)
```yml
...
user: "1001:1001"
command: minio server /data
...
```
- All openshift containers
Fixes#7773
- Background Heal routine receives heal requests from a channel, either to
heal format, buckets or objects
- Daily sweeper lists all objects in all buckets, these objects
don't necessarly have read quorum so they can be removed if
these objects are unhealable
- Heal daily ops receives objects from the daily sweeper
and send them to the heal routine.
Inconsistencies can arise after applying bucket policies in
gateway mode, since all gateway instances do not share a
common shared state. This is by design to keep gateway as
shared nothing architecture.
This PR fixes such inconsistencies by reloading policy
if any from the backend.
Fixes#7723
Consider errors returned by httpClient.Do() as network errors. This is because
the http clients returns different types of errors and it is hard to catch
all the error types.
With these changes we are now able to peak performances
for all Write() operations across disks HDD and NVMe.
Also adds readahead for disk reads, which also increases
performance for reads by 3x.
IsTruncated should not be set to true if there is no further
possible entries beyond maxKeys.
This commit will also move wide testing on object API from xl
to xl sets.
This patch includes the following changes in event store interface
- Removes memory store. We will not persist events in memory anymore, if `queueDir` is not set.
- Orders the events before replaying to the broker.
The problem in current code was we were removing
an entry from a lock lockerMap without considering
the fact that different entry for same resource is
a possibility due the nature of locks that can be
acquired in parallel before we decide if the lock
is considered stale
A sequence of events is as follows
- Lock("resource")
- lockMaintenance(finds a long lived lock in this "resource")
- Owner node rebooted which now retruns Expired() as true for
this "resource"
- Unlock("resource") which succeeded in quorum
- Now by this time application retried and acquired a new
Lock() on the same "resource"
- Now that we have Expired() true from the previous call,
we proceed to purge the entry from the local lockMap()
local lockMap reports a different entry for the expired
UID which results in a spurious log entry.
This PR removes this logging as this situation is an
expected scenario.
This will allow cache to consistently work for
server and gateways. Range GET requests will
be cached in the background after the request
is served from the backend.
Fixes: #7458, #7573, #6265, #6630
This is necessary to avoid connection build up between servers
unexpectedly for example in a situation where 16 servers are
talking to each other and one server now allows a maximum of
15*4096 = 61440 idle connections
Will be kept in pool. Such a large pool is perhaps inefficient for
many reasons and also affects overall system resources.
This PR also reduces idleConnection timeout from 120 secs to 60 secs.
errs was passed to many goroutines but they are all allowed
to update errs if any error happens during deletion, which
can cause a data race.
This commit will avoid issuing bulk delete operations in parallel
to avoid the warning race.
We broke into parts previously as we had checksum for the entire file
to tax less on memory and to have better TTFB. We dont need to now,
after the streaming-bitrot change.
Bulk delete at storage level in Multiple Delete Objects API
In order to accelerate bulk delete in Multiple Delete objects API,
a new bulk delete is introduced in storage layer, which will accept
a list of objects to delete rather than only one. Consequently,
a new API is also need to be added to Object API.
PR #7595 fixed part of the regression, but did not handle the
scenario, where in docker, the internal port is different from
the port on the host.
This PR modifies the regular expression such that all the
scenarios are handled.
Fixes#7619
After recent listing refactor, recursive list doesn't return empty
directories, this commit will fix the behavior and add unit tests
so it won't happen again.
When size is unknown and auto encryption is enabled,
and compression is set to true, putobject API is failing.
Moving adding the SSE-S3 header as part of the request to before
checking if compression can be done, otherwise the size is set to -1
and that seems to cause problems.
Currently we used to reload users every five minutes,
regardless of etcd is configured or not. But with etcd
configured we can do this more asynchronously to trigger
a refresh by using the watch API
Fixes#7515
One user has seen this following error log:
API: CompleteMultipartUpload(bucket=vertica, object=perf-dss-v03/cc2/02596813aecd4e476d810148586c2a3300d00000013557ef_0.gt)
Time: 15:44:07 UTC 04/11/2019
RequestID: 159475EFF4DEDFFB
RemoteHost: 172.26.87.184
UserAgent: vertica-v9.1.1-5
Error: open /data/.minio.sys/tmp/100bb3ec-6c0d-4a37-8b36-65241050eb02/xl.json: file exists
1: cmd/xl-v1-metadata.go:448:cmd.writeXLMetadata()
2: cmd/xl-v1-metadata.go:501:cmd.writeUniqueXLMetadata.func1()
This can happen when CompleteMultipartUpload fails with write quorum,
the S3 client will retry (since write quorum is 500 http response),
however the second call of CompleteMultipartUpload will fail because
this latter doesn't truly use a random uuid under .minio.sys/tmp/
directory but pick the upload id.
This commit fixes the behavior to choose a random uuid for generating
xl.json
Since AssumeRole API was introduced we have a wrong route
match which results in certain clients failing to upload objects
using multipart because, multipart POST conflicts with STS POST
AssumeRole API.
Write a proper matcher function which verifies the route more
appropriately such that both can co-exist.
Other listing optimizations include
- remove double sorting while filtering object entries
- improve error message when upload-id is not in quorum
- use jsoniter for full unmarshal json, instead of gjson
- remove unused code
Allow server to start if one of the local nodes in docker/kubernetes setup is successfully resolved
- The rule is that we need atleast one local node to work. We dont need to resolve the
rest at that point.
- In a non-orchestrational setup, we fail if we do not have atleast one local node up
and running.
- In an orchestrational setup (docker-swarm and kubernetes), We retry with a sleep of 5
seconds until any one local node shows up.
Fixes#6995
When the first login attempt is failed(due to incorrect secretkey)
and the next attempt is successful. Error message shown for
the previous attempts should go away.
Fixes#7514
In distributed mode, use REST API to acquire and manage locks instead
of RPC.
RPC has been completely removed from MinIO source.
Since we are moving from RPC to REST, we cannot use rolling upgrades as the
nodes that have not yet been upgraded cannot talk to the ones that have
been upgraded.
We expect all minio processes on all nodes to be stopped and then the
upgrade process to be completed.
Also force http1.1 for inter-node communication
common prefixes in bucket name if already created
are disallowed when etcd is configured due to the
prefix matching issue. Make sure that when we look
for bucket we are only interested in exact bucket
name not the prefix.
- [x] Support bucket and regular object operations
- [x] Supports Select API on HDFS
- [x] Implement multipart API support
- [x] Completion of ListObjects support
As a part of #7302, MinIO server's (configured with https) response when it
encounters http request has changed from 403 to 400 and the custom message
"SSL Required" is removed.
Accordingly healthcheck script is updated to check for status 400 before
trying https request.
Fixes#7517
Currently, the backend minio server uses the same data directory
as the mint test itself, causing `s3 sync` to fail often.
Now `minio` backend will use a different data directory `/data`
instead of `/mint/data`
There is no written specification about how to encode key names
when url encoding type is passed.
However, this change will encode URLs as url.QueryEscape() does
while considering AWS S3 exceptions.
This commit adds a unit test for the vault
config verification (which covers also `IsEmpty()`).
Vault-related code is hard to test with unit tests
since a Vault service would be necessary. Therefore
this commit only adds tests for a fraction of the code.
Fixes#7409
Most hadoop distributions hortonworks, cloudera all
depend on aws-sdk-java 1.7.x to 1.10.x - the releases
which have bugs related case sensitive check for
ETag header. Go changes the case of the headers set
to be canonical but only preserves them when set
through a direct map.
This fixes most compatibility issues we have had
in the past supporting older hadoop distributions.
This commit fixes a privilege escalation issue against
the S3 and web handlers. An authenticated IAM user
can:
- Read from or write to the internal '.minio.sys'
bucket by simply sending a properly signed
S3 GET or PUT request. Further, the user can
- Read from or write to the internal '.minio.sys'
bucket using the 'Upload'/'Download'/'DownloadZIP'
API by sending a "browser" request authenticated
with its JWT token.
This commit fixes another privilege escalation issue
abusing the inter-node communication of distributed
servers to obtain/modify the server configuration.
The inter-node communication is authenticated using
JWT-Tokens. Further, IAM users accessing the cluster
via the web UI also get a JWT token and the browser
will add this "user" JWT token to each the request.
Now, a user can extract that JWT token an can craft
HTTP POST requests for the inter-node communication
API endpoint. Since the server accepts ANY valid
JWT token it also accepts inter-node commands from
an authenticated user such that the user can execute
arbitrary commands bypassing the IAM policy engine
and impersonate other users, change its own IAM policy
or extract the admin access/secret key.
This is fixed by only accepting "admin" JWT tokens
(tokens containing the admin access key - and therefore
were generated with the admin secret key). Consequently,
only the admin user can execute such inter-node commands.
Simplify the cmd/http package overall by removing
custom plain text v/s tls connection detection, by
migrating to go1.12 and choose minimum version
to be go1.12
Also remove all the vendored deps, since they
are not useful anymore.
A race is detected between a bytes.Buffer generated with cmd/rpc.Pool
and http2 module. An issue is raised in golang (https://github.com/golang/go/issues/31192).
Meanwhile, this commit disables Pool in RPC code and it generates a
new 1kb of bytes.Buffer for each RPC call.
Before this commit, nodes wait indefinitely without showing any
indicate error message when a node is started with different access
and secret keys.
This PR will show '401 Unauthorized' in this case.
Currently message is set to error type value.
Message field is not used in error logs. it is used only in the case of info logs.
This PR sets error message field to store error type correctly.
Copying an encrypted SSEC object when this latter is uploaded using
multipart mechanism was failing because ETag in case of encrypted
multipart upload is not encrypted.
This PR fixes the behavior.
This fixes varying pids for server-respawns. And avoids duplicate process
creating multiple pids when the server restart signal is triggered with
service restart enabled.
Fixes#7350
It is required to set the environment variable in the case of distributed
minio. LoadCredentials is used to notify peers of the change and will not work if
environment variable is set. so, this function will never be called.
In scenario 1
```
- bucket/object-prefix
- bucket/object-prefix/object
```
Server responds with `XMinioParentIsObject`
In scenario 2
```
- bucket/object-prefix/object
- bucket/object-prefix
```
Server responds with `XMinioObjectExistsAsDirectory`
Fixes#6566
Healing scan used to read all objects parts to check for bitrot
checksum. This commit will add a quicker way of healing scan
by only checking if parts are actually present in disks or not.
We should internally handle when http2 input stream has smaller
content than its content-length header
Upstream issue reported https://github.com/golang/go/issues/30648
This a change which we need to handle internally until Go fixes it
correctly, till now our code doesn't expect a custom error to be returned.
- Also, switch to jstream to generate internal record representation
from CSV/JSON readers
- This fixes a bug in which JSON output objects have their keys
reversed from the order they are specified in the Select columns.
- Also includes a fix for tests.
CopyObject precondition checks into GetObjectReader
in order to perform SSE-C pre-condition checks using the
last 32 bytes of encrypted ETag rather than the decrypted
ETag
This also necessitates moving precondition checks for
gateways to gateway layer rather than object handler check
if a bucket with `Captialized letters` is created, `InvalidBucketName` error
will be returned.
In the case of pre-existing buckets, it will be listed.
Fixes#6938
Prevents deferred close functions from being called while still
attempting to copy reader to snappyWriter.
Reduces code duplication when compressing objects.
This change allows indefinitely running go-routines to cleanup
gracefully.
This channel is now closed at the beginning of each test so that
long-running go-routines quit and a new one is assigned.
The side affect of this change memory
increase, but this is a trade-off between
performance and actual memory usage.
For all practical scenarios this should be
an adequate change.
Currently windows support was relying on Symlink as
a way to detect a drive, this doesn't work in latest
Windows 2016, fix this to use a proper mechanism by
using win32 APIs.
Additionally also add support for detecting bind mounts
on Linux.
- The events will be persisted in queueStore if `queueDir` is set.
- Else, if queueDir is not set events persist in memory.
The events are replayed back when the mqtt broker is back online.
Clients like AWS SDK Java and AWS cli XML parsers are
unable to handle on `\r\n` characters to avoid these
errors send XML header first and write white space characters
instead.
Also handle cases to avoid double WriteHeader calls
- Current implementation was spawning renewer goroutines
without waiting for the lease duration to end. Remove vault renewer
and call vault.RenewToken directly and manage reauthentication if
lease expired.
Go script makes it easy to read/maintain. Also updated the timeout
in Dockerfiles from 5s to default 30s and test interval to 1m
Higher timeout makes sense as server may sometimes respond slowly
if under high load as reported in #6974Fixes#6974
We should change the logic for both isObject()
and isObjectDir() leaf detection to be done
with quorum, due to how our directory navigation
works - this allows for properly deleting all
the dangling directories or objects if any.
This commit fixes a nil pointer dereference issue
that can occur when the Vault KMS returns e.g. a 404
with an empty HTTP response. The Vault client SDK
does not treat that as error and returns nil for
the error and the secret.
Further it simplifies the token renewal and
re-authentication mechanism by using a single
background go-routine.
The control-flow of Vault authentications looks
like this:
1. `authenticate()`: Initial login and start of background job
2. Background job starts a `vault.Renewer` to renew the token
3. a) If this succeeds the token gets updated
b) If this fails the background job tries to login again
4. If the login in 3b. succeeded goto 2. If it fails
goto 3b.
Currently, we were sending errors in Select binary format,
which is incompatible with AWS S3 behavior, errors in binary
are sent after HTTP status code is already 200 OK - i.e it
happens during the evaluation of the record reader.
This commit increases storage REST requests to 5 minutes, this includes
the opening TCP connection, and sending/receiving data. This will reduce
clients receiving errors when the server is under high load.
Different gateway implementations due to different backend
API errors, might return different unsupported errors at
our handler layer. Current code posed a problem for us because
this information was lost and we would convert it to InternalError
in this situation all S3 clients end up retrying the request.
To avoid this unexpected situation implement a way to support
this cleanly such that the underlying information is not lost
which is returned by gateway.
Bucket metadata healing in the current code was executed multiple
times each time for a given set. Bucket metadata just like
objects are hashed in accordance with its name on any given set,
to allow hashing to play a role we should let the top level
code decide where to navigate.
Current code also had 3 bucket metadata files hardcoded, whereas
we should make it generic by listing and navigating the .minio.sys
to heal such objects.
We also had another bug where due to isObjectDangling changes
without pre-existing bucket metadata files, we were erroneously
reporting it as grey/corrupted objects.
This PR fixes all of the above items.
This PR also adds some comments and simplifies
the code. Primary handling is done to ensure
that we make sure to honor cached buffer.
Added unit tests as well
Fixes#7141
- Staging buffer is flushed every 500ms. In cases where the result
records are slowly generated (e.g. when a where condition
matches very few records), this change causes the server to send
results even though the staging buffer is not full.
- Refactor messageWriter code to use simpler channel based
co-ordination instead of atomic variables.
foo.CORRUPTED should never be created because when
multiple sets are involved we would hash the file
to wrong a location, this PR removes the code.
But allows DeleteBucket() to work properly to delete
dangling buckets/objects. Also adds another option
to Healing where a user needs to specify `--remove`
such that all dangling objects will be deleted with
user confirmation.
This change adds support for casting strings to Timestamp via CAST:
`CAST('2010T' AS TIMESTAMP)`
It also implements the following date-time functions:
- UTCNOW()
- DATE_ADD()
- DATE_DIFF()
- EXTRACT()
For values passed to these functions, date-types are automatically
inferred.
ListObjectParts is using xl.readXLMetaParts which picks the first
xl meta found in any disk, which is an inconsistent information.
E.g.: In a middle of a multipart upload, one node can go offline
and get back later with an outdated multipart information.
This commit fixes the computation of Before/After healing state
for empty directories.
Issues before the commit:
- Before state doesn't reflect the real status (no StatVol() called)
- For any MakeVol() error, healObjectDir is exited directly, which is
wrong.
Currently during a heal of a bucket, if one disk is offline an empty endpoint entry is added.
Then another entry with the missing endpoint is also added.
This results in more entries than disks being added.
Code that adds empty endpoint has been removed.
Collect historic cpu and mem stats. Also, use actual values
instead of formatted strings while returning to the client. The string
formatting prevents values from being processed by the server or
by the client without parsing it.
This change will allow the values to be processed (eg.
compute rolling-average over the lifetime of the minio server)
and offloads the formatting to the client.
We made a change previously in #7111 which moved support
for AWS envs only for AWS S3 endpoint. Some users requested
that this be added back to Non-AWS endpoints as well as
they require separate credentials for backend authentication
from security point of view.
More than one client can't use the same clientID for MQTT connection.
This causes problem in distributed deployments where config is shared
across nodes, as each Minio instance tries to connect to MQTT using the
same clientID.
This commit removes the clientID field in config, and allows
MQTT client to create random clientID for each node.
Batching records into a single SQL Select message in the response
leads to significant speed up as the message header overhead is made
negligible.
This change leads to a speed up of 3-5x for queries that select many
small records.
- New parser written from scratch, allows easier and complete parsing
of the full S3 Select SQL syntax. Parser definition is directly
provided by the AST defined for the SQL grammar.
- Bring support to parse and interpret SQL involving JSON path
expressions; evaluation of JSON path expressions will be
subsequently added.
- Bring automatic type inference and conversion for untyped
values (e.g. CSV data).
This situation happens only in gateway nas which supports
etcd based `config.json` to support all FS mode features.
The issue was we would try to migrate something which doesn't
exist when etcd is configured which leads to inconsistent
server configs in memory.
This PR fixes this situation by properly loading config after
initialization, avoiding backend disk config migration to be
done only if etcd is not configured.
If it does happen that we have a lot files in '.minio.sys/tmp',
minio startup might block deleting this folder. Rename and
delete in background instead to allow Minio to start serving
requests.
To avoid a large number of concurrent connections between minio
servers and to reduce CPU pressure, it is better to limit the number
of objects healed in parallel to number_of_CPUs.
Requirements like being able to run minio gateway in ec2
pointing to a Minio deployment wouldn't work properly
because IAM creds take precendence on ec2.
Add checks such that we only enable AWS specific features
if our backend URL points to actual AWS S3 not S3 compatible
endpoints.
Returning unexpected errors can cause problems for config handling,
which is what led gateway deployments with etcd to misbehave and
had stopped working properly
- Default support for S3 dualstack endpoints (IPv6 support)
- Support granular policy conditionals in List operations
- Support proxy cookies for stickiness
Deployment ID is not copied into new formats after healing format. Although,
this is not critical since a new deployment ID will be generated and set in the
next cluster restart, it is still much better if we don't change the deployment
id of a cluster for a better tracking.
Fix regexp matcher for special assets for the browser to clash with
less of the object namespace.
Assets should now be loaded with the /minio/ prefix. Previously,
favicon.ico (and others) could be loaded at any path matching
/minio/*/favicon.ico. This clashes with a large part of the object
namespace. With this change, /minio/favicon.ico will serve the favicon
but not /minio/mybucket/favicon.ico
Fixes#7077
This changes causes `getRootCAs` to always load system-wide CAs.
Any additional custom CAs (at `certs/CA/`) are added to the certificate pool
of system CAs.
The previous behavior was incorrect since all no system-wide CAs were
loaded if either there were CAs under `certs/CA` or the `certs/CA`
directory didn't exist at all.
Also add a cross compile script to test always cross
compilation for some well known platforms and architectures
, we support out of box compilation of these platforms even
if we don't make an official release build.
This script is to avoid regressions in this area when we
add platform dependent code.
This PR supports iam and bucket policies to have
policy variable replacements in resource and
condition key values.
For example
- ${aws:username}
- ${aws:userid}
Health checking programs very frequently use /minio/health/live
to check health, hence we can avoid doing StorageInfo() and
ListBuckets() for FS/Erasure backend.
* Use 0-byte file for bitrot verification of whole-file-bitrot files
Also pass the right checksum information for bitrot verification
* Copy xlMeta info from latest meta except []checksums and []Parts while healing
Simplify parallelReader.Read() which also fixes previous
implementation where it was returning before all the parallel
reading go-routines had terminated which caused race conditions.
When auto-encryption is turned on, we pro-actively add SSEHeader
for all PUT, POST operations. This is unusual for V2 signature
calculation because V2 signature doesn't have a pre-defined set
of signed headers in the request like V4 signature. According to
V2 we should canonicalize all incoming supported HTTP headers.
Make sure to validate signatures before we mutate http headers
This commit adds some documentation about the design of the
SSE-C and SSE-S3 implementation. It describes how the Minio server
encrypt objects and manages keys.
Deprecate the use of Admin Peers concept and migrate all peer
communication to Notification subsystem. This finally allows
for a common subsystem for all peer notification in case of
distributed server deployments.
Before this change the CopyObjectHandler and the CopyObjectPartHandler
both looked for a `versionId` parameter on the `X-Amz-Copy-Source` URL
for the version of the object to be copied on the URL unescaped version
of the header. This meant that files that had question marks in were
truncated after the question mark so that files with `?` in their
names could not be server side copied.
After this change the URL unescaping is done during the parsing of the
`versionId` parameter which fixes the problem.
This change also introduces the same logic for the
`X-Amz-Copy-Source-Version-Id` header field which was previously
ignored, namely returning an error if it is present and not `null`
since minio does not currently support versions.
S3 Docs:
- https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html
- https://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html
When source is encrypted multipart object and the parts are not
evenly divisible by DARE package block size, target encrypted size
will not necessarily be the same as encrypted source object.
This commit removes old code preventing PUT requests with '/' as a path,
because this is not needed anymore after the introduction of the virtual
host style in Minio server code.
'PUT /' when global domain is not configured already returns 405 Method
Not Allowed http error.
This PR adds pass-through, single encryption at gateway and double
encryption support (gateway encryption with pass through of SSE
headers to backend).
If KMS is set up (either with Vault as KMS or using
MINIO_SSE_MASTER_KEY),gateway will automatically perform
single encryption. If MINIO_GATEWAY_SSE is set up in addition to
Vault KMS, double encryption is performed.When neither KMS nor
MINIO_GATEWAY_SSE is set, do a pass through to backend.
When double encryption is specified, MINIO_GATEWAY_SSE can be set to
"C" for SSE-C encryption at gateway and backend, "S3" for SSE-S3
encryption at gateway/backend or both to support more than one option.
Fixes#6323, #6696
This is part of implementation for mc admin health command. The
ServerDrivesPerfInfo() admin API returns read and write speed
information for all the drives (local and remote) in a given Minio
server deployment.
Part of minio/mc#2606
It can happen with erroneous clients which do not send `Host:`
header until 4k worth of header bytes have been read. This can lead
to Peek() method of bufio to fail with ErrBufferFull.
To avoid this we should make sure that Peek buffer is as large as
our maxHeaderBytes count.
All GitHub issues are addressed on a best-effort basis at MinIO's sole discretion. There are no Service Level Agreements (SLA) or Objectives (SLO). Remember our [Code of Conduct](https://github.com/minio/minio/blob/master/code_of_conduct.md) when engaging with MinIO Engineers and the larger community.
For urgent issues (e.g. production down, etc.), subscribe to [SUBNET](https://min.io/pricing?jmp=github) for direct to engineering support.
<!--- Provide a general summary of the issue in the Title above -->
## Expected Behavior
@@ -15,6 +29,8 @@
## Steps to Reproduce (for bugs)
<!--- Provide a link to a live example, or an unambiguous set of steps to -->
<!--- reproduce this bug. Include code to reproduce, if relevant -->
<!--- and make sure you have followed https://github.com/minio/minio/tree/release/docs/debugging to capture relevant logs -->
1.
2.
3.
@@ -30,8 +46,6 @@
## Your Environment
<!--- Include as many relevant details about the environment you experienced the bug in -->
We have designed MinIO as an Open Source software for the Open Source software community. This requires applications to consider whether their usage of MinIO is in compliance with the GNU AGPLv3 [license](https://github.com/minio/minio/blob/master/LICENSE).
MinIO cannot make the determination as to whether your application's usage of MinIO is in compliance with the AGPLv3 license requirements. You should instead rely on your own legal counsel or licensing specialists to audit and ensure your application is in compliance with the licenses of MinIO and all other open-source projects with which your application integrates or interacts. We understand that AGPLv3 licensing is complex and nuanced. It is for that reason we strongly encourage using experts in licensing to make any such determinations around compliance instead of relying on apocryphal or anecdotal advice.
[MinIO Commercial Licensing](https://min.io/pricing) is the best option for applications that trigger AGPLv3 obligations (e.g. open sourcing your application). Applications using MinIO - or any other OSS-licensed code - without validating their usage do so at their own risk.
``Minio`` community welcomes your contribution. To make the process as seamless as possible, we recommend you read this contribution guide.
``MinIO`` community welcomes your contribution. To make the process as seamless as possible, we recommend you read this contribution guide.
## Development Workflow
Start by forking the Minio GitHub repository, make changes in a branch and then send a pull request. We encourage pull requests to discuss code changes. Here are the steps in details:
Start by forking the MinIO GitHub repository, make changes in a branch and then send a pull request. We encourage pull requests to discuss code changes. Here are the steps in details:
### Setup your Minio GitHub Repository
Fork [Minio upstream](https://github.com/minio/minio/fork) source repository to your own personal repository. Copy the URL of your Minio fork (you will need it for the `git clone` command below).
### Setup your MinIO GitHub Repository
Fork [MinIO upstream](https://github.com/minio/minio/fork) source repository to your own personal repository. Copy the URL of your MinIO fork (you will need it for the `git clone` command below).
```sh
$ mkdir -p $GOPATH/src/github.com/minio
$ cd $GOPATH/src/github.com/minio
$ git clone <paste saved URL for personal forked minio repo>
Before making code changes, make sure you create a separate branch for these changes
```
$ git checkout -b my-new-feature
git checkout -b my-new-feature
```
### Test Minio server changes
### Test MinIO server changes
After your code changes, make sure
- To add test cases for the new code. If you have questions about how to do it, please ask on our [Slack](slack.minio.io) channel.
- To add test cases for the new code. If you have questions about how to do it, please ask on our [Slack](https://slack.min.io) channel.
- To run `make verifiers`
- To squash your commits into a single commit. `git rebase -i`. It's okay to force update your pull request.
- To run `go test -race ./...` and `go build` completes.
- To run `make test` and `make build` completes.
### Commit changes
After verification, commit your changes. This is a [great post](https://chris.beams.io/posts/git-commit/) on how to write useful commit messages
```
$ git commit -am 'Add some feature'
git commit -am 'Add some feature'
```
### Push to the branch
Push your locally committed changes to the remote origin (your fork)
```
$ git push origin my-new-feature
git push origin my-new-feature
```
### Create a Pull Request
Pull requests can be created via GitHub. Refer to [this document](https://help.github.com/articles/creating-a-pull-request/) for detailed steps on how to create a pull request. After a Pull Request gets peer reviewed and approved, it will be merged.
## FAQs
### How does ``Minio`` manages dependencies?
``Minio`` manages its dependencies using [govendor](https://github.com/kardianos/govendor). To add a dependency
- Run `go get foo/bar`
- Edit your code to import foo/bar
- Run `make pkg-add PKG=foo/bar` from top-level directory
### How does ``MinIO`` manage dependencies?
``MinIO`` uses `go mod` to manage its dependencies.
- Run `go get foo/bar` in the source folder to add the dependency to `go.mod` file.
To remove a dependency
- Edit your code to not import foo/bar
- Run `make pkg-remove PKG=foo/bar` from top-level directory
### What are the coding guidelines for Minio?
``Minio`` is fully conformant with Golang style. Refer: [Effective Go](https://github.com/golang/go/wiki/CodeReviewComments) article from Golang project. If you observe offending code, please feel free to send a pull request or ping us on [Slack](slack.minio.io).
- Edit your code and remove the import reference.
- Run `go mod tidy` in the source folder to remove dependency from `go.mod` file.
### What are the coding guidelines for MinIO?
``MinIO`` is fully conformant with Golang style. Refer: [Effective Go](https://github.com/golang/go/wiki/CodeReviewComments) article from Golang project. If you observe offending code, please feel free to send a pull request or ping us on [Slack](https://slack.min.io).
summary="MinIO is a High Performance Object Storage, API compatible with Amazon S3 cloud storage service."\
description="MinIO object storage is fundamentally different. Designed for performance and the S3 API, it is 100% open-source. MinIO is ideal for large, private cloud environments with stringent security requirements and delivers mission-critical availability across a diverse range of workloads."
summary="MinIO is a High Performance Object Storage, API compatible with Amazon S3 cloud storage service."\
description="MinIO object storage is fundamentally different. Designed for performance and the S3 API, it is 100% open-source. MinIO is ideal for large, private cloud environments with stringent security requirements and delivers mission-critical availability across a diverse range of workloads."
summary="MinIO is a High Performance Object Storage, API compatible with Amazon S3 cloud storage service."\
description="MinIO object storage is fundamentally different. Designed for performance and the S3 API, it is 100% open-source. MinIO is ideal for large, private cloud environments with stringent security requirements and delivers mission-critical availability across a diverse range of workloads."
test-replication:install-racetest-replication-2sitetest-replication-3sitetest-delete-replicationtest-sio-errortest-delete-marker-proxying## verify multi site replication
@echo "Running tests for replicating three sites"
test-site-replication-ldap:install-race## verify automatic site replication
@echo "Running tests for automatic site replication of IAM (with LDAP)"
MinIO creates FIPS builds using a patched version of the Go compiler (that uses BoringCrypto, from BoringSSL, which is [FIPS 140-2 validated](https://csrc.nist.gov/csrc/media/projects/cryptographic-module-validation-program/documents/security-policies/140sp2964.pdf)) published by the Golang Team [here](https://github.com/golang/go/tree/dev.boringcrypto/misc/boring).
MinIO FIPS executables are available at <http://dl.min.io> - they are only published for `linux-amd64` architecture as binary files with the suffix `.fips`. We also publish corresponding container images to our official image repositories.
We are not making any statements or representations about the suitability of this code or build in relation to the FIPS 140-2 standard. Interested users will have to evaluate for themselves whether this is useful for their own purposes.
Minio is an object storage server released under Apache License v2.0. It is compatible with Amazon S3 cloud storage service. It is best suited for storing unstructured data such as photos, videos, log files, backups and container / VM images. Size of an object can range from a few KBs to a maximum of 5TB.
MinIO is a High Performance Object Storage released under GNU Affero General Public License v3.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads. To learn more about what MinIO is doing for AI storage, go to [AI storage documentation](https://min.io/solutions/object-storage-for-ai).
This README provides quickstart instructions on running MinIO on bare metal hardware, including container-based installations. For Kubernetes environments, use the [MinIO Kubernetes Operator](https://github.com/minio/operator/blob/master/README.md).
## Container Installation
Use the following commands to run a standalone MinIO server as a container.
Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication
require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically,
with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html)
for more complete documentation.
## Docker Container
### Stable
```
docker pull minio/minio
docker run -p 9000:9000 minio/minio server /data
Run the following command to run the latest stable image of MinIO as a container using an ephemeral data volume:
```sh
podman run -p 9000:9000 -p 9001:9001 \
quay.io/minio/minio server /data --console-address ":9001"
```
### Edge
```
docker pull minio/minio:edge
docker run -p 9000:9000 minio/minio:edge server /data
```
Note: Docker will not display the autogenerated keys unless you start the container with the `-it`(interactive TTY) argument. Generally, it is not recommended to use autogenerated keys with containers. Please visit Minio Docker quickstart guide for more information [here](https://docs.minio.io/docs/minio-docker-quickstart-guide)
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded
object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the
root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See
[Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers,
see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
> NOTE: To deploy MinIO on with persistent storage, you must map local persistent directories from the host OS to the container using the `podman -v` option. For example, `-v /mnt/data:/data` maps the host OS drive at `/mnt/data` to `/data` on the container.
## macOS
### Homebrew
Install minio packages using [Homebrew](http://brew.sh/)
Use the following commands to run a standalone MinIO server on macOS.
Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html) for more complete documentation.
### Homebrew (recommended)
Run the following command to install the latest stable MinIO package using [Homebrew](https://brew.sh/). Replace ``/data`` with the path to the drive or directory in which you want MinIO to store data.
```sh
brew install minio/stable/minio
minio server /data
```
> NOTE: If you previously installed minio using `brew install minio` then it is recommended that you reinstall minio from `minio/stable/minio` official repo instead.
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html/> to view MinIO SDKs for supported languages.
Use the following command to download and run a standalone MinIO server on macOS. Replace ``/data`` with the path to the drive or directory in which you want MinIO to store data.
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
## GNU/Linux
Use the following command to run a standalone MinIO server on Linux hosts running 64-bit Intel/AMD architectures. Replace ``/data`` with the path to the drive or directory in which you want MinIO to store data.
| 64-bit ARM | <https://dl.min.io/server/minio/release/linux-arm64/minio> |
| 64-bit PowerPC LE (ppc64le) | <https://dl.min.io/server/minio/release/linux-ppc64le/minio> |
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
> NOTE: Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html#) for more complete documentation.
Use the following command to run a standalone MinIO server on the Windows host. Replace ``D:\`` with the path to the drive or directory in which you want MinIO to store data. You must change the terminal or powershell directory to the location of the ``minio.exe`` executable, *or* add the path to that directory to the system ``$PATH``:
```sh
minio.exe server D:\
```
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
> NOTE: Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html#) for more complete documentation.
## Install from Source
Source installation is only intended for developers and advanced users. If you do not have a working Golang environment, please follow [How to install Golang](https://docs.minio.io/docs/how-to-install-golang).
Use the following commands to compile and run a standalone MinIO server from source. Source installation is only intended for developers and advanced users. If you do not have a working Golang environment, please follow [How to install Golang](https://golang.org/doc/install). Minimum version required is [go1.21](https://golang.org/dl/#stable)
```sh
go get -u github.com/minio/minio
go install github.com/minio/minio@latest
```
## Allow port access for Firewalls
The MinIO deployment starts using default root credentials `minioadmin:minioadmin`. You can test the deployment using the MinIO Console, an embedded web-based object browser built into MinIO Server. Point a web browser running on the host machine to <http://127.0.0.1:9000> and log in with the root credentials. You can use the Browser to create buckets, upload objects, and browse the contents of the MinIO server.
By default Minio uses the port 9000 to listen for incoming connections. If your platform blocks the port by default, you may need to enable access to the port.
You can also connect using any S3-compatible tool, such as the MinIO Client `mc` commandline tool. See [Test using MinIO Client `mc`](#test-using-minio-client-mc) for more information on using the `mc` commandline tool. For application developers, see <https://min.io/docs/minio/linux/developers/minio-drivers.html> to view MinIO SDKs for supported languages.
### iptables
> NOTE: Standalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication require distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically, with a *minimum* of 4 drives per MinIO server. See [MinIO Erasure Code Overview](https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html) for more complete documentation.
For hosts with iptables enabled (RHEL, CentOS, etc), you can use `iptables`command to enable all traffic coming to specific ports. Use below command to allow
access to port 9000
MinIO strongly recommends *against* using compiled-from-source MinIO servers for production environments.
```sh
iptables -A INPUT -p tcp --dport 9000 -j ACCEPT
service iptables restart
```
## Deployment Recommendations
Below command enables all incoming traffic to ports ranging from 9000 to 9010.
### Allow port access for Firewalls
```sh
iptables -A INPUT -p tcp --dport 9000:9010 -j ACCEPT
service iptables restart
```
By default MinIO uses the port 9000 to listen for incoming connections. If your platform blocks the port by default, you may need to enable access to the port.
### ufw
@@ -135,30 +176,84 @@ Note that `permanent` makes sure the rules are persistent across firewall start,
firewall-cmd --reload
```
## Test using Minio Browser
Minio Server comes with an embedded web based object browser. Point your web browser to http://127.0.0.1:9000 ensure your server has started successfully.
For hosts with iptables enabled (RHEL, CentOS, etc), you can use `iptables` command to enable all traffic coming to specific ports. Use below command to allow
access to port 9000
## Test using Minio Client `mc`
`mc` provides a modern alternative to UNIX commands like ls, cat, cp, mirror, diff etc. It supports filesystems and Amazon S3 compatible cloud storage services. Follow the Minio Client [Quickstart Guide](https://docs.minio.io/docs/minio-client-quickstart-guide) for further instructions.
```sh
iptables -A INPUT -p tcp --dport 9000 -j ACCEPT
service iptables restart
```
## Pre-existing data
When deployed on a single drive, Minio server lets clients access any pre-existing data in the data directory. For example, if Minio is started with the command `minio server /mnt/data`, any pre-existing data in the `/mnt/data` directory would be accessible to the clients.
Below command enables all incoming traffic to ports ranging from 9000 to 9010.
The above statement is also valid for all gateway backends.
```sh
iptables -A INPUT -p tcp --dport 9000:9010 -j ACCEPT
service iptables restart
```
## Test MinIO Connectivity
### Test using MinIO Console
MinIO Server comes with an embedded web based object browser. Point your web browser to <http://127.0.0.1:9000> to ensure your server has started successfully.
> NOTE: MinIO runs console on random port by default, if you wish to choose a specific port use `--console-address` to pick a specific interface and port.
### Things to consider
MinIO redirects browser access requests to the configured server port (i.e. `127.0.0.1:9000`) to the configured Console port. MinIO uses the hostname or IP address specified in the request when building the redirect URL. The URL and port *must* be accessible by the client for the redirection to work.
For deployments behind a load balancer, proxy, or ingress rule where the MinIO host IP address or port is not public, use the `MINIO_BROWSER_REDIRECT_URL` environment variable to specify the external hostname for the redirect. The LB/Proxy must have rules for directing traffic to the Console port specifically.
For example, consider a MinIO deployment behind a proxy `https://minio.example.net`, `https://console.minio.example.net` with rules for forwarding traffic on port :9000 and :9001 to MinIO and the MinIO Console respectively on the internal network. Set `MINIO_BROWSER_REDIRECT_URL` to `https://console.minio.example.net` to ensure the browser receives a valid reachable URL.
`mc` provides a modern alternative to UNIX commands like ls, cat, cp, mirror, diff etc. It supports filesystems and Amazon S3 compatible cloud storage services. Follow the MinIO Client [Quickstart Guide](https://min.io/docs/minio/linux/reference/minio-mc.html#quickstart) for further instructions.
## Upgrading MinIO
Upgrades require zero downtime in MinIO, all upgrades are non-disruptive, all transactions on MinIO are atomic. So upgrading all the servers simultaneously is the recommended way to upgrade MinIO.
> NOTE: requires internet access to update directly from <https://dl.min.io>, optionally you can host any mirrors at <https://my-artifactory.example.com/minio/>
- For deployments that installed the MinIO server binary by hand, use [`mc admin update`](https://min.io/docs/minio/linux/reference/minio-mc-admin/mc-admin-update.html)
```sh
mc admin update <minio alias, e.g., myminio>
```
- For deployments without external internet access (e.g. airgapped environments), download the binary from <https://dl.min.io> and replace the existing MinIO binary let's say for example `/opt/bin/minio`, apply executable permissions `chmod +x /opt/bin/minio` and proceed to perform `mc admin service restart alias/`.
- For installations using Systemd MinIO service, upgrade via RPM/DEB packages **parallelly** on all servers or replace the binary lets say `/opt/bin/minio` on all nodes, apply executable permissions `chmod +x /opt/bin/minio` and process to perform `mc admin service restart alias/`.
### Upgrade Checklist
- Test all upgrades in a lower environment (DEV, QA, UAT) before applying to production. Performing blind upgrades in production environments carries significant risk.
- Read the release notes for MinIO *before* performing any upgrade, there is no forced requirement to upgrade to latest release upon every release. Some release may not be relevant to your setup, avoid upgrading production environments unnecessarily.
- If you plan to use `mc admin update`, MinIO process must have write access to the parent directory where the binary is present on the host system.
- `mc admin update` is not supported and should be avoided in kubernetes/container environments, please upgrade containers by upgrading relevant container images.
- **We do not recommend upgrading one MinIO server at a time, the product is designed to support parallel upgrades please follow our recommended guidelines.**
``Minio Browser`` provides minimal set of UI to manage buckets and objects on ``minio`` server. ``Minio Browser`` is written in javascript and released under [Apache 2.0 License](./LICENSE).
## Installation
### Install yarn
```sh
curl -o- -L https://yarnpkg.com/install.sh | bash
yarn
```
### Install `go-bindata` and `go-bindata-assetfs`
If you do not have a working Golang environment, please follow [Install Golang](https://docs.minio.io/docs/how-to-install-golang)
```sh
go get github.com/jteeuwen/go-bindata/...
go get github.com/elazarl/go-bindata-assetfs/...
```
## Generating Assets
### Generate ui-assets.go
```sh
yarn release
```
This generates ui-assets.go in the current directory. Now do `make` in the parent directory to build the minio binary with the newly generated ``ui-assets.go``
### Run Minio Browser with live reload
```sh
yarn dev
```
Open [http://localhost:8080/minio/](http://localhost:8080/minio/) in your browser to play with the application
### Run Minio Browser with live reload on custom port
You are using Internet Explorer version 12.0 or lower. Due to security issues and lack of support for Web Standards it is highly recommended that you upgrade to a modern browser
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
exportconstSET="alert/SET"
exportconstCLEAR="alert/CLEAR"
exportletalertId=0
exportconstset=alert=>{
constid=alertId++
return(dispatch,getState)=>{
if(alert.type!=="danger"||alert.autoClear){
setTimeout(()=>{
dispatch({
type:CLEAR,
alert:{
id
}
})
},5000)
}
dispatch({
type:SET,
alert:Object.assign({},alert,{
id
})
})
}
}
exportconstclear=()=>{
return{type:CLEAR}
}
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.