summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorLiu Bo <bo.li.liu@oracle.com>2018-01-05 12:51:09 -0700
committerDavid Sterba <dsterba@suse.com>2018-01-22 16:08:21 +0100
commit18e83ac75bfe67009c4ddcdd581bba8eb16f4030 (patch)
tree84a08f0e3aa47fb2e1b9e1d4f08393a42ee1bd33
parenta520a7e0b548690ea8eb77bb45800459351a7cf5 (diff)
Btrfs: fix unexpected EEXIST from btrfs_get_extent
This fixes a corner case that is caused by a race of dio write vs dio read/write. Here is how the race could happen. Suppose that no extent map has been loaded into memory yet. There is a file extent [0, 32K), two jobs are running concurrently against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio read from [0, 4K) or [4K, 8K). t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K). ------------------------------------------------------ t1 t2 btrfs_get_blocks_direct() btrfs_get_blocks_direct() -> btrfs_get_extent() -> btrfs_get_extent() -> lookup_extent_mapping() -> add_extent_mapping() -> lookup_extent_mapping() # load [0, 32K) -> btrfs_new_extent_direct() -> btrfs_drop_extent_cache() # split [0, 32K) and # drop [8K, 32K) -> add_extent_mapping() # add [8K, 32K) -> add_extent_mapping() # handle -EEXIST when adding # [0, 32K) ------------------------------------------------------ About how t2(dio read/write) runs into -EEXIST: a) add_extent_mapping() gets -EEXIST for adding em [0, 32k), b) search_extent_mapping() then returns [0, 8k) as the existing em, even though start == existing->start, em is [0, 32k) so that extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k, c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k) (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree, d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write, which is confusing applications. Here I conclude all the possible situations, 1) start < existing->start +-----------+em+-----------+ +--prev---+ | +-------------+ | | | | | | | +---------+ + +---+existing++ ++ + | + start 2) start == existing->start +------------em------------+ | +-------------+ | | | | | + +----existing-+ + | | + start 3) start > existing->start && start < (existing->start + existing->len) +------------em------------+ | +-------------+ | | | | | + +----existing-+ + | | + start 4) start >= (existing->start + existing->len) +-----------+em+-----------+ | +-------------+ | +--next---+ | | | | | | + +---+existing++ + +---------+ + | + start As we can see, it turns out that if start is within existing em (front inclusive), then the existing em should be returned as is, otherwise, we try our best to merge candidate em with sibling ems to form a larger em (in order to reduce the total number of em). Reported-by: David Vallender <david.vallender@landmark.co.uk> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-rw-r--r--fs/btrfs/inode.c17
1 files changed, 3 insertions, 14 deletions
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f9fd8b14ef08..4ab713bd4139 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7153,19 +7153,12 @@ insert:
* existing will always be non-NULL, since there must be
* extent causing the -EEXIST.
*/
- if (existing->start == em->start &&
- extent_map_end(existing) >= extent_map_end(em) &&
- em->block_start == existing->block_start) {
- /*
- * The existing extent map already encompasses the
- * entire extent map we tried to add.
- */
+ if (start >= existing->start &&
+ start < extent_map_end(existing)) {
free_extent_map(em);
em = existing;
err = 0;
-
- } else if (start >= extent_map_end(existing) ||
- start <= existing->start) {
+ } else {
/*
* The existing extent map is the one nearest to
* the [start, start + len) range which overlaps
@@ -7177,10 +7170,6 @@ insert:
free_extent_map(em);
em = NULL;
}
- } else {
- free_extent_map(em);
- em = existing;
- err = 0;
}
}
write_unlock(&em_tree->lock);