From: Jon Chesterfield
Date: Tue, 6 Dec 2022 16:10:42 +0000 (+0000)
Subject: [amdgpu] Reimplement LDS lowering
X-Git-Tag: upstream/17.0.6~25096
X-Git-Url: http://review.tizen.org/git/?a=commitdiff_plain;h=982017240d7f25a8a6969b8b73dc51f9ac5b93ed;p=platform%2Fupstream%2Fllvm.git

[amdgpu] Reimplement LDS lowering

Renames the current lowering scheme to "module" and introduces two new ones,
"kernel" and "table", plus a "hybrid" that chooses between those three on a
per-variable basis.

Unit tests are set up to pass with either "module" or "hybrid" as the default
lowering. This patch defaults to "module", which is the less dramatic codegen
change relative to the current scheme; that choice reflects the sparsity of
test coverage for the table lowering method. Hybrid is better than module in
every respect and will be made the default in a subsequent patch.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D139433
---

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index bff43cb..57f2ece 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -1299,6 +1299,24 @@ SDValue AMDGPUTargetLowering::LowerGlobalAddress(AMDGPUMachineFunction* MFI,
   if (G->getAddressSpace() == AMDGPUAS::LOCAL_ADDRESS ||
       G->getAddressSpace() == AMDGPUAS::REGION_ADDRESS) {
+
+    if (G->getAddressSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+      // Special case handling for a kernel block variable. It is allocated
+      // in the kernel at a predictable address, so uses of it from functions
+      // and globals can be resolved here. This only works if the current
+      // function is called from the kernel with the corresponding global.
+      if (const GlobalVariable *GV2 = dyn_cast<GlobalVariable>(GV)) {
+        if (MFI->isKnownAddressLDSGlobal(*GV2)) {
+          unsigned offset = MFI->calculateKnownAddressOfLDSGlobal(*GV2);
+
+          return DAG.getConstant(offset + G->getOffset(), SDLoc(Op),
+                                 Op.getValueType());
+        }
+      }
+    }
+
     if (!MFI->isModuleEntryFunction() &&
         !GV->getName().equals("llvm.amdgcn.module.lds")) {
       SDLoc DL(Op);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
index be5ac67..3b84f30 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
@@ -6,23 +6,114 @@
 //
 //===----------------------------------------------------------------------===//
 //
-// This pass eliminates LDS uses from non-kernel functions.
+// This pass eliminates local data store (LDS) uses from non-kernel functions.
+// LDS is contiguous memory allocated per kernel execution.
 //
-// The strategy is to create a new struct with a field for each LDS variable
-// and allocate that struct at the same address for every kernel. Uses of the
-// original LDS variables are then replaced with compile time offsets from that
-// known address. AMDGPUMachineFunction allocates the LDS global.
+// Background.
 //
-// Local variables with constant annotation or non-undef initializer are passed
+// The programming model is global variables, or equivalently function-local
+// static variables, accessible from kernels or other functions. For uses from
+// kernels this is straightforward - assign an integer to the kernel for the
+// memory required by all the variables combined, and allocate them within
+// that. For uses from functions there are performance tradeoffs to choose
+// between.
+//
+// This model means the GPU runtime can specify the amount of memory allocated.
+// If this is more than the kernel assumed, the excess can be made available
+// using a language-specific feature, which IR represents as a variable with
+// no initializer. This feature is not yet implemented for non-kernel
+// functions. This lowering could be extended to handle that use case, but
+// would probably require closer integration with promoteAllocaToLDS.
+//
+// Consequences of this GPU feature:
+// - memory is limited and exceeding it halts compilation
+// - a global accessed by one kernel exists independent of other kernels
+// - a global exists independent of simultaneous execution of the same kernel
+// - the address of the global may be different from different kernels as they
+//   do not alias, which permits only allocating variables they use
+// - if the address is allowed to differ, functions need help to find it
+//
+// Uses from kernels are implemented here by grouping them in a per-kernel
+// struct instance. This duplicates the variables, accurately modelling their
+// aliasing properties relative to a single global representation. It also
+// permits control over alignment via padding.
+//
+// Uses from functions are more complicated and are the primary purpose of
+// this IR pass. Several different lowerings are chosen between to meet two
+// requirements: avoid allocating any LDS where it is not necessary, as that
+// impacts occupancy and may fail the compilation, while not imposing overhead
+// on a feature whose primary advantage over global memory is performance.
+// The basic design goal is to avoid one kernel imposing overhead on another.
+//
+// Implementation.
+//
+// LDS variables with constant annotation or non-undef initializer are passed
 // through unchanged for simplification or error diagnostics in later passes.
+// Non-undef initializers are not yet implemented for LDS.
+//
+// LDS variables that are always allocated at the same address can be found
+// by lookup at that address. Otherwise runtime information/cost is required.
 //
-// To reduce the memory overhead variables that are only used by kernels are
-// excluded from this transform. The analysis to determine whether a variable
-// is only used by a kernel is cheap and conservative so this may allocate
-// a variable in every kernel when it was not strictly necessary to do so.
+// The simplest strategy possible is to group all LDS variables in a single
+// struct and allocate that struct in every kernel such that the original
+// variables are always at the same address. LDS is however a limited resource
+// so this strategy is unusable in practice. It is not implemented here.
 //
-// A possible future refinement is to specialise the structure per-kernel, so
-// that fields can be elided based on more expensive analysis.
+// Strategy | Precise allocation | Zero runtime cost | General purpose |
+// ---------+--------------------+-------------------+-----------------+
+//   Module |        No          |        Yes        |       Yes       |
+//   Table  |        Yes         |        No         |       Yes       |
+//   Kernel |        Yes         |        Yes        |       No        |
+//   Hybrid |        Yes         |      Partial      |       Yes       |
+//
+// Module spends LDS memory to save cycles. Table spends cycles and global
+// memory to save LDS. Kernel is as fast as kernel allocation but only works
+// for variables that are known to be reachable from a single kernel. Hybrid
+// picks between all three. When forced to choose between LDS and cycles it
+// minimises LDS use.
+//
+// The "module" lowering implemented here finds LDS variables which are used
+// by non-kernel functions and creates a new struct with a field for each of
+// those LDS variables. Variables that are only used from kernels are
+// excluded. Kernels that do not use this struct are annotated with the
+// attribute amdgpu-elide-module-lds, which allows the back end to elide the
+// allocation.
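To make the module strategy concrete, here is a minimal standalone C++ analogue
(an illustration only, not part of the patch; both field names and the layout
are invented - in the real pass the order and padding come from
OptimizedStructLayout):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Stand-in for the struct the pass builds: one field per LDS variable
    // that is reachable from a non-kernel function.
    struct ModuleLDS {
      uint32_t counter;  // stands in for a hypothetical @lds_counter
      uint8_t flags[4];  // stands in for a hypothetical @lds_flags
    };

    int main() {
      // Kernels that (transitively) call LDS-using functions allocate this
      // struct at LDS address zero, so every use becomes a fixed offset.
      printf("counter at offset %zu, flags at offset %zu\n",
             offsetof(ModuleLDS, counter), offsetof(ModuleLDS, flags));
    }

The cost noted in the table above follows directly: every kernel that reaches
any of these functions must allocate the whole struct, even for fields it
never touches.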
+//
+// The "table" lowering implemented here has three components.
+// First, each kernel is assigned a unique integer identifier, which is
+// available in functions it calls through the intrinsic
+// amdgcn_lds_kernel_id. The integer is passed through a specific SGPR, and
+// thus works with indirect calls.
+// Second, each kernel allocates LDS variables independent of other kernels
+// and writes the addresses it chose for each variable into an array in a
+// consistent order. If the kernel does not allocate a given variable, it
+// writes undef to the corresponding array location. These arrays are written
+// to a constant table in the order matching the kernel unique integer
+// identifier.
+// Third, uses from non-kernel functions are replaced with a table lookup
+// using the intrinsic function to find the address of the variable.
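A minimal host-side C++ model of those three components (illustration only;
the table shape matches the description above but all names, sizes and offsets
are invented, and 0xFFFFFFFF stands in for the poison entries):

    #include <cstdint>
    #include <cstdio>

    // One row per kernel (indexed by the amdgcn_lds_kernel_id value), one
    // column per LDS variable, in a fixed order.
    static const uint32_t LDSOffsetTable[2][3] = {
        {0, 16, 0xFFFFFFFF}, // kernel 0 allocated var0 at 0, var1 at 16
        {8, 0xFFFFFFFF, 0},  // kernel 1 allocated var0 at 8, var2 at 0
    };

    // What a non-kernel function effectively does after lowering: read the
    // caller's kernel id (passed explicitly here; on the GPU it arrives in
    // an SGPR via the intrinsic) and index the constant table.
    uint32_t addressOfVariable(uint32_t KernelId, uint32_t VariableIndex) {
      return LDSOffsetTable[KernelId][VariableIndex];
    }

    int main() {
      printf("var1 as seen from kernel 0: %u\n",
             (unsigned)addressOfVariable(0, 1)); // prints 16
    }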
+//
+// "Kernel" lowering is only applicable for variables that are unambiguously
+// reachable from exactly one kernel. For those cases, accesses to the
+// variable can be lowered to a ConstantExpr address of a struct instance
+// specific to that one kernel. This is zero cost in space and in compute. It
+// will raise a fatal error on any variable that might be reachable from
+// multiple kernels and is thus most easily used as part of the hybrid
+// lowering strategy.
+//
+// Hybrid lowering is a mixture of the above. It uses the zero cost kernel
+// lowering where it can. It lowers the variable accessed by the greatest
+// number of kernels using the module strategy, as that is free for the first
+// variable. Any further variables that can be lowered with the module
+// strategy without incurring LDS memory overhead are lowered that way. The
+// remaining ones are lowered via table.
+//
+// Consequences
+// - No heuristics or user-controlled magic numbers; hybrid is the right
+//   choice
+// - Kernels that don't use functions (or have had them all inlined) are not
+//   affected by any lowering for kernels that do.
+// - Kernels that don't make indirect function calls are not affected by
+//   those that do.
+// - Variables which are used by lots of kernels, e.g. those injected by a
+//   language runtime in most kernels, are expected to have no overhead
+// - Implementations that instantiate templates per-kernel where those
+//   templates use LDS are expected to hit the "Kernel" lowering strategy
+// - The runtime properties impose a cost in compiler implementation
+//   complexity
 //
 //===----------------------------------------------------------------------===//
@@ -31,34 +122,68 @@
 #include "Utils/AMDGPUMemoryUtils.h"
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SetOperations.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/StringSwitch.h"
 #include "llvm/Analysis/CallGraph.h"
 #include "llvm/IR/Constants.h"
 #include "llvm/IR/DerivedTypes.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/InlineAsm.h"
 #include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/MDBuilder.h"
 #include "llvm/InitializePasses.h"
 #include "llvm/Pass.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/OptimizedStructLayout.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
 #include "llvm/Transforms/Utils/ModuleUtils.h"
+
 #include <tuple>
 #include <vector>
+#include
+
 #define DEBUG_TYPE "amdgpu-lower-module-lds"
 
 using namespace llvm;
 
-static cl::opt<bool> SuperAlignLDSGlobals(
+namespace {
+
+cl::opt<bool> SuperAlignLDSGlobals(
     "amdgpu-super-align-lds-globals",
     cl::desc("Increase alignment of LDS if it is not on align boundary"),
     cl::init(true), cl::Hidden);
 
-namespace {
+enum class LoweringKind { module, table, kernel, hybrid };
+cl::opt<LoweringKind> LoweringKindLoc(
+    "amdgpu-lower-module-lds-strategy",
+    cl::desc("Specify lowering strategy for function LDS access:"), cl::Hidden,
+    cl::init(LoweringKind::module),
+    cl::values(
+        clEnumValN(LoweringKind::table, "table", "Lower via table lookup"),
+        clEnumValN(LoweringKind::module, "module", "Lower via module struct"),
+        clEnumValN(
+            LoweringKind::kernel, "kernel",
+            "Lower variables reachable from one kernel, otherwise abort"),
+        clEnumValN(LoweringKind::hybrid, "hybrid",
+                   "Lower via mixture of above strategies")));
+
+bool isKernelLDS(const Function *F) {
+  // Some weirdness here. AMDGPU::isKernelCC does not call into
+  // AMDGPU::isKernel with the calling conv, it instead calls into
+  // isModuleEntryFunction which returns true for more calling conventions
+  // than AMDGPU::isKernel does. There's a FIXME on AMDGPU::isKernel.
+  // There's also a test that checks that the LDS lowering does not hit on
+  // a graphics shader, denoted amdgpu_ps, so stay with the limited case.
+  // Putting LDS in the name of the function to draw attention to this.
+  return AMDGPU::isKernel(F->getCallingConv());
+}
+
 class AMDGPULowerModuleLDS : public ModulePass {
 
   static void removeFromUsedList(Module &M, StringRef Name,
@@ -92,15 +217,14 @@ class AMDGPULowerModuleLDS : public ModulePass {
       ArrayType *ATy =
           ArrayType::get(Type::getInt8PtrTy(M.getContext()), Init.size());
       GV =
-          new llvm::GlobalVariable(M, ATy, false, GlobalValue::AppendingLinkage,
+          new GlobalVariable(M, ATy, false, GlobalValue::AppendingLinkage,
                                    ConstantArray::get(ATy, Init), Name);
       GV->setSection("llvm.metadata");
     }
   }
 
-  static void
-  removeFromUsedLists(Module &M,
-                      const std::vector<GlobalVariable *> &LocalVars) {
+  static void removeFromUsedLists(Module &M,
+                                  const DenseSet<GlobalVariable *> &LocalVars) {
 
     // The verifier rejects used lists containing an inttoptr of a constant
     // so remove the variables from these lists before replaceAllUsesWith
@@ -225,6 +349,350 @@ public:
     initializeAMDGPULowerModuleLDSPass(*PassRegistry::getPassRegistry());
   }
 
+  using FunctionVariableMap = DenseMap<Function *, DenseSet<GlobalVariable *>>;
+
+  using VariableFunctionMap = DenseMap<GlobalVariable *, DenseSet<Function *>>;
+
+  static void getUsesOfLDSByFunction(CallGraph const &CG, Module &M,
+                                     FunctionVariableMap &kernels,
+                                     FunctionVariableMap &functions) {
+
+    // Get uses from the current function, excluding uses by called functions.
+    // Two output variables to avoid walking the globals list twice
+    for (auto &GV : M.globals()) {
+      if (!AMDGPU::isLDSVariableToLower(GV)) {
+        continue;
+      }
+
+      SmallVector<User *> Stack(GV.users());
+      for (User *V : GV.users()) {
+        if (auto *I = dyn_cast<Instruction>(V)) {
+          Function *F = I->getFunction();
+          if (isKernelLDS(F)) {
+            kernels[F].insert(&GV);
+          } else {
+            functions[F].insert(&GV);
+          }
+        }
+      }
+    }
+  }
+
+  struct LDSUsesInfoTy {
+    FunctionVariableMap direct_access;
+    FunctionVariableMap indirect_access;
+  };
+
+  static LDSUsesInfoTy getTransitiveUsesOfLDS(CallGraph const &CG, Module &M) {
+
+    FunctionVariableMap direct_map_kernel;
+    FunctionVariableMap direct_map_function;
+    getUsesOfLDSByFunction(CG, M, direct_map_kernel, direct_map_function);
+
+    // Collect variables that are used by functions whose address has escaped
+    DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
+    for (Function &F : M.functions()) {
+      if (!isKernelLDS(&F))
+        if (F.hasAddressTaken(nullptr,
+                              /* IgnoreCallbackUses */ false,
+                              /* IgnoreAssumeLikeCalls */ false,
+                              /* IgnoreLLVMUsed */ true,
+                              /* IgnoreArcAttachedCall */ false)) {
+          set_union(VariablesReachableThroughFunctionPointer,
+                    direct_map_function[&F]);
+        }
+    }
+
+    auto functionMakesUnknownCall = [&](const Function *F) -> bool {
+      assert(!F->isDeclaration());
+      for (CallGraphNode::CallRecord R : *CG[F]) {
+        if (!R.second->getFunction()) {
+          return true;
+        }
+      }
+      return false;
+    };
+
+    // Work out which variables are reachable through function calls
+    FunctionVariableMap transitive_map_function = direct_map_function;
+
+    // If the function makes any unknown call, assume the worst case that it
+    // can access all variables accessed by functions whose address escaped
+    for (Function &F : M.functions()) {
+      if (!F.isDeclaration() && functionMakesUnknownCall(&F)) {
+        if (!isKernelLDS(&F)) {
+          set_union(transitive_map_function[&F],
+                    VariablesReachableThroughFunctionPointer);
+        }
+      }
+    }
+
+    // Direct implementation of collecting all variables reachable from each
+    // function
+    for (Function &Func : M.functions()) {
+      if (Func.isDeclaration() || isKernelLDS(&Func))
+        continue;
+
+      DenseSet<Function *> seen; // catches cycles
+      SmallVector<Function *> wip{&Func};
+
+      while (!wip.empty()) {
+        Function *F = wip.pop_back_val();
+
+        // Can accelerate this by referring to the transitive map for
+        // functions that have already been computed, with more care than this
+        set_union(transitive_map_function[&Func], direct_map_function[F]);
+
+        for (CallGraphNode::CallRecord R : *CG[F]) {
+          Function *ith = R.second->getFunction();
+          if (ith) {
+            if (!seen.contains(ith)) {
+              seen.insert(ith);
+              wip.push_back(ith);
+            }
+          }
+        }
+      }
+    }
+
+    // direct_map_kernel lists which variables are used by the kernel;
+    // find the variables which are used through a function call
+    FunctionVariableMap indirect_map_kernel;
+
+    for (Function &Func : M.functions()) {
+      if (Func.isDeclaration() || !isKernelLDS(&Func))
+        continue;
+
+      for (CallGraphNode::CallRecord R : *CG[&Func]) {
+        Function *ith = R.second->getFunction();
+        if (ith) {
+          set_union(indirect_map_kernel[&Func], transitive_map_function[ith]);
+        } else {
+          set_union(indirect_map_kernel[&Func],
+                    VariablesReachableThroughFunctionPointer);
+        }
+      }
+    }
+
+    return {std::move(direct_map_kernel), std::move(indirect_map_kernel)};
+  }
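The worklist walk above is a plain transitive closure over the call graph. A
standalone sketch of the same shape, using standard containers in place of
LLVM's and an invented toy module (a kernel "kern" calling "helper" calling
"leaf"):

    #include <cstdio>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using Name = std::string;

    // Direct LDS uses and call edges; all names here are hypothetical.
    std::map<Name, std::set<Name>> DirectUses = {
        {"helper", {"@lds.a"}}, {"leaf", {"@lds.b"}}};
    std::map<Name, std::vector<Name>> Calls = {
        {"kern", {"helper"}}, {"helper", {"leaf"}}};

    // Union the direct uses of every function reachable from Root, with a
    // seen-set to terminate on call graph cycles.
    std::set<Name> transitiveUses(const Name &Root) {
      std::set<Name> Result, Seen{Root};
      std::vector<Name> Wip{Root};
      while (!Wip.empty()) {
        Name F = Wip.back();
        Wip.pop_back();
        for (const Name &V : DirectUses[F])
          Result.insert(V);
        for (const Name &Callee : Calls[F])
          if (Seen.insert(Callee).second)
            Wip.push_back(Callee);
      }
      return Result;
    }

    int main() {
      for (const Name &V : transitiveUses("kern"))
        printf("kern reaches %s\n", V.c_str()); // @lds.a and @lds.b
    }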
+
+  struct LDSVariableReplacement {
+    GlobalVariable *SGV = nullptr;
+    DenseMap<GlobalVariable *, Constant *> LDSVarsToConstantGEP;
+  };
+
+  // Remap from an LDS global to the constantexpr GEP it has been moved to,
+  // for each kernel: an array with an element for each kernel containing
+  // where the corresponding variable was remapped to.
+
+  static Constant *getAddressesOfVariablesInKernel(
+      LLVMContext &Ctx, ArrayRef<GlobalVariable *> Variables,
+      DenseMap<GlobalVariable *, Constant *> &LDSVarsToConstantGEP) {
+    // Create a ConstantArray containing the address of each Variable within
+    // the kernel corresponding to LDSVarsToConstantGEP, or poison if that
+    // kernel does not allocate it
+    // TODO: Drop the ptrtoint conversion
+
+    Type *I32 = Type::getInt32Ty(Ctx);
+
+    ArrayType *KernelOffsetsType = ArrayType::get(I32, Variables.size());
+
+    SmallVector<Constant *> Elements;
+    for (size_t i = 0; i < Variables.size(); i++) {
+      GlobalVariable *GV = Variables[i];
+      if (LDSVarsToConstantGEP.count(GV) != 0) {
+        auto elt = ConstantExpr::getPtrToInt(LDSVarsToConstantGEP[GV], I32);
+        Elements.push_back(elt);
+      } else {
+        Elements.push_back(PoisonValue::get(I32));
+      }
+    }
+    return ConstantArray::get(KernelOffsetsType, Elements);
+  }
+
+  static GlobalVariable *buildLookupTable(
+      Module &M, ArrayRef<GlobalVariable *> Variables,
+      ArrayRef<Function *> kernels,
+      DenseMap<Function *, LDSVariableReplacement> &KernelToReplacement) {
+    if (Variables.empty()) {
+      return nullptr;
+    }
+    LLVMContext &Ctx = M.getContext();
+
+    const size_t NumberVariables = Variables.size();
+    const size_t NumberKernels = kernels.size();
+
+    ArrayType *KernelOffsetsType =
+        ArrayType::get(Type::getInt32Ty(Ctx), NumberVariables);
+
+    ArrayType *AllKernelsOffsetsType =
+        ArrayType::get(KernelOffsetsType, NumberKernels);
+
+    std::vector<Constant *> overallConstantExprElts(NumberKernels);
+    for (size_t i = 0; i < NumberKernels; i++) {
+      LDSVariableReplacement Replacement = KernelToReplacement[kernels[i]];
+      overallConstantExprElts[i] = getAddressesOfVariablesInKernel(
+          Ctx, Variables, Replacement.LDSVarsToConstantGEP);
+    }
+
+    Constant *init =
+        ConstantArray::get(AllKernelsOffsetsType, overallConstantExprElts);
+
+    return new GlobalVariable(
+        M, AllKernelsOffsetsType, true, GlobalValue::InternalLinkage, init,
+        "llvm.amdgcn.lds.offset.table", nullptr, GlobalValue::NotThreadLocal,
+        AMDGPUAS::CONSTANT_ADDRESS);
+  }
+
+  void replaceUsesInInstructionsWithTableLookup(
+      Module &M, ArrayRef<GlobalVariable *> ModuleScopeVariables,
+      GlobalVariable *LookupTable) {
+
+    LLVMContext &Ctx = M.getContext();
+    IRBuilder<> Builder(Ctx);
+    Type *I32 = Type::getInt32Ty(Ctx);
+
+    // Accesses from a function use the amdgcn_lds_kernel_id intrinsic, which
+    // lowers to a read from a live-in register. Emit it once in the entry
+    // block to spare deduplicating it later.
+
+    DenseMap<Function *, Value *> tableKernelIndexCache;
+    auto getTableKernelIndex = [&](Function *F) -> Value * {
+      if (tableKernelIndexCache.count(F) == 0) {
+        LLVMContext &Ctx = M.getContext();
+        FunctionType *FTy = FunctionType::get(Type::getInt32Ty(Ctx), {});
+        Function *Decl =
+            Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_lds_kernel_id, {});
+
+        BasicBlock::iterator it =
+            F->getEntryBlock().getFirstNonPHIOrDbgOrAlloca();
+        Instruction &i = *it;
+        Builder.SetInsertPoint(&i);
+
+        tableKernelIndexCache[F] = Builder.CreateCall(FTy, Decl, {});
+      }
+
+      return tableKernelIndexCache[F];
+    };
+
+    for (size_t Index = 0; Index < ModuleScopeVariables.size(); Index++) {
+      auto *GV = ModuleScopeVariables[Index];
+
+      for (Use &U : make_early_inc_range(GV->uses())) {
+        auto *I = dyn_cast<Instruction>(U.getUser());
+        if (!I)
+          continue;
+
+        Value *tableKernelIndex = getTableKernelIndex(I->getFunction());
+
+        // So if the phi uses this value multiple times, what does this look
+        // like?
+        if (auto *Phi = dyn_cast<PHINode>(I)) {
+          BasicBlock *BB = Phi->getIncomingBlock(U);
+          Builder.SetInsertPoint(&(*(BB->getFirstInsertionPt())));
+        } else {
+          Builder.SetInsertPoint(I);
+        }
+
+        Value *GEPIdx[3] = {
+            ConstantInt::get(I32, 0),
+            tableKernelIndex,
+            ConstantInt::get(I32, Index),
+        };
+
+        Value *Address = Builder.CreateInBoundsGEP(
+            LookupTable->getValueType(), LookupTable, GEPIdx, GV->getName());
+
+        Value *loaded = Builder.CreateLoad(I32, Address);
+
+        Value *replacement =
+            Builder.CreateIntToPtr(loaded, GV->getType(), GV->getName());
+
+        U.set(replacement);
+      }
+    }
+  }
+
+  static DenseSet<Function *> kernelsThatIndirectlyAccessAnyOfPassedVariables(
+      Module &M, LDSUsesInfoTy &LDSUsesInfo,
+      DenseSet<GlobalVariable *> const &VariableSet) {
+
+    DenseSet<Function *> KernelSet;
+
+    if (VariableSet.empty())
+      return KernelSet;
+
+    for (Function &Func : M.functions()) {
+      if (Func.isDeclaration() || !isKernelLDS(&Func))
+        continue;
+      for (GlobalVariable *GV : LDSUsesInfo.indirect_access[&Func]) {
+        if (VariableSet.contains(GV)) {
+          KernelSet.insert(&Func);
+          break;
+        }
+      }
+    }
+
+    return KernelSet;
+  }
+
+  static GlobalVariable *
+  chooseBestVariableForModuleStrategy(const DataLayout &DL,
+                                      VariableFunctionMap &LDSVars) {
+    // Find the global variable with the most indirect uses from kernels
+
+    struct CandidateTy {
+      GlobalVariable *GV = nullptr;
+      size_t UserCount = 0;
+      size_t Size = 0;
+
+      CandidateTy() = default;
+
+      CandidateTy(GlobalVariable *GV, uint64_t UserCount, uint64_t AllocSize)
+          : GV(GV), UserCount(UserCount), Size(AllocSize) {}
+
+      bool operator<(const CandidateTy &Other) const {
+        // Fewer users makes module scope variable less attractive
+        if (UserCount < Other.UserCount) {
+          return true;
+        }
+        if (UserCount > Other.UserCount) {
+          return false;
+        }
+
+        // Bigger makes module scope variable less attractive
+        if (Size < Other.Size) {
+          return false;
+        }
+
+        if (Size > Other.Size) {
+          return true;
+        }
+
+        // Arbitrary but consistent
+        return GV->getName() < Other.GV->getName();
+      }
+    };
+
+    CandidateTy MostUsed;
+
+    for (auto &K : LDSVars) {
+      GlobalVariable *GV = K.first;
+      if (K.second.size() <= 1) {
+        // A variable reachable by only one kernel is best lowered with kernel
+        // strategy
+        continue;
+      }
+      CandidateTy Candidate(
+          GV, K.second.size(),
+          DL.getTypeAllocSize(GV->getValueType()).getFixedValue());
+      if (MostUsed < Candidate)
+        MostUsed = Candidate;
+    }
+
+    return MostUsed.GV;
+  }
+
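The comparator above defines preference indirectly ("less attractive" compares
less-than). A self-contained sketch of the same ordering on invented data, to
show which candidate the loop keeps - more kernel users first, then a smaller
allocation, then a deterministic name tie-break:

    #include <cstdio>
    #include <string>
    #include <vector>

    struct Candidate {
      std::string Name;
      size_t UserCount = 0, Size = 0;
      bool operator<(const Candidate &O) const {
        if (UserCount != O.UserCount)
          return UserCount < O.UserCount; // fewer users => less attractive
        if (Size != O.Size)
          return Size > O.Size; // bigger => less attractive
        return Name < O.Name;   // arbitrary but consistent
      }
    };

    int main() {
      // Hypothetical variables: @a and @b tie on users, @b is smaller.
      std::vector<Candidate> Vars = {
          {"@a", 3, 64}, {"@b", 3, 16}, {"@c", 2, 4}};
      Candidate Best; // zero-initialised sentinel, like the pass's MostUsed
      for (const Candidate &C : Vars)
        if (Best < C)
          Best = C;
      printf("chosen for module strategy: %s\n", Best.Name.c_str()); // @b
    }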
   bool runOnModule(Module &M) override {
     LLVMContext &Ctx = M.getContext();
     CallGraph CG = CallGraph(M);
@@ -232,96 +700,286 @@ public:
     Changed |= eliminateConstantExprUsesOfLDSFromAllInstructions(M);
 
-    // Move variables used by functions into amdgcn.module.lds
-    std::vector<GlobalVariable *> ModuleScopeVariables =
-        AMDGPU::findLDSVariablesToLower(M, nullptr);
-    if (!ModuleScopeVariables.empty()) {
-      std::string VarName = "llvm.amdgcn.module.lds";
-
-      GlobalVariable *SGV;
-      DenseMap<GlobalVariable *, Constant *> LDSVarsToConstantGEP;
-      std::tie(SGV, LDSVarsToConstantGEP) =
-          createLDSVariableReplacement(M, VarName, ModuleScopeVariables);
+    Changed = true; // todo: narrow this down
 
-      appendToCompilerUsed(
-          M, {static_cast<GlobalValue *>(
-                 ConstantExpr::getPointerBitCastOrAddrSpaceCast(
-                     cast<Constant>(SGV), Type::getInt8PtrTy(Ctx)))});
+    // For each kernel, what variables does it access directly or through
+    // callees
+    LDSUsesInfoTy LDSUsesInfo = getTransitiveUsesOfLDS(CG, M);
 
-      removeFromUsedLists(M, ModuleScopeVariables);
-      replaceLDSVariablesWithStruct(M, ModuleScopeVariables, SGV,
-                                    LDSVarsToConstantGEP,
-                                    [](Use &) { return true; });
+    // For each variable accessed through callees, which kernels access it
+    VariableFunctionMap LDSToKernelsThatNeedToAccessItIndirectly;
+    for (auto &K : LDSUsesInfo.indirect_access) {
+      Function *F = K.first;
+      assert(isKernelLDS(F));
+      for (GlobalVariable *GV : K.second) {
+        LDSToKernelsThatNeedToAccessItIndirectly[GV].insert(F);
+      }
+    }
 
-      // This ensures the variable is allocated when called functions access it.
-      // It also lets other passes, specifically PromoteAlloca, accurately
-      // calculate how much LDS will be used by the kernel after lowering.
+    // Partition variables into the different strategies
+    DenseSet<GlobalVariable *> ModuleScopeVariables;
+    DenseSet<GlobalVariable *> TableLookupVariables;
+    DenseSet<GlobalVariable *> KernelAccessVariables;
 
-      IRBuilder<> Builder(Ctx);
-      for (Function &Func : M.functions()) {
-        if (!Func.isDeclaration() && AMDGPU::isKernelCC(&Func)) {
-          const CallGraphNode *N = CG[&Func];
-          const bool CalleesRequireModuleLDS = N->size() > 0;
-
-          if (CalleesRequireModuleLDS) {
-            // If a function this kernel might call requires module LDS,
-            // annotate the kernel to let later passes know it will allocate
-            // this structure, even if not apparent from the IR.
-            markUsedByKernel(Builder, &Func, SGV);
+    {
+      GlobalVariable *HybridModuleRoot =
+          LoweringKindLoc != LoweringKind::hybrid
+              ? nullptr
+              : chooseBestVariableForModuleStrategy(
+                    M.getDataLayout(),
+                    LDSToKernelsThatNeedToAccessItIndirectly);
+
+      DenseSet<Function *> const EmptySet;
+      DenseSet<Function *> const &HybridModuleRootKernels =
+          HybridModuleRoot
+              ? LDSToKernelsThatNeedToAccessItIndirectly[HybridModuleRoot]
+              : EmptySet;
+
+      for (auto &K : LDSToKernelsThatNeedToAccessItIndirectly) {
+        // Each iteration of this loop assigns exactly one global variable to
+        // exactly one of the implementation strategies.
+
+        GlobalVariable *GV = K.first;
+        assert(AMDGPU::isLDSVariableToLower(*GV));
+        assert(K.second.size() != 0);
+
+        switch (LoweringKindLoc) {
+        case LoweringKind::module:
+          ModuleScopeVariables.insert(GV);
+          break;
+
+        case LoweringKind::table:
+          TableLookupVariables.insert(GV);
+          break;
+
+        case LoweringKind::kernel:
+          if (K.second.size() == 1) {
+            KernelAccessVariables.insert(GV);
           } else {
-            // However if we are certain this kernel cannot call a function that
-            // requires module LDS, annotate the kernel so the backend can elide
-            // the allocation without repeating callgraph walks.
-            Func.addFnAttr("amdgpu-elide-module-lds");
+            report_fatal_error("Cannot lower LDS to kernel access as it is "
+                               "reachable from multiple kernels");
           }
+          break;
+
+        case LoweringKind::hybrid: {
+          if (GV == HybridModuleRoot) {
+            assert(K.second.size() != 1);
+            ModuleScopeVariables.insert(GV);
+          } else if (K.second.size() == 1) {
+            KernelAccessVariables.insert(GV);
+          } else if (set_is_subset(K.second, HybridModuleRootKernels)) {
+            ModuleScopeVariables.insert(GV);
+          } else {
+            TableLookupVariables.insert(GV);
+          }
+          break;
+        }
         }
       }
 
-      Changed = true;
+      assert(ModuleScopeVariables.size() + TableLookupVariables.size() +
+                 KernelAccessVariables.size() ==
+             LDSToKernelsThatNeedToAccessItIndirectly.size());
+    } // Variables have now been partitioned into the three lowering
+      // strategies.
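A standalone sketch of the hybrid case of that switch, on invented
reachability data (std::set and std::map stand in for LLVM's containers; the
variable and kernel names are hypothetical):

    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <set>
    #include <string>

    // Which kernels need each variable through a call.
    std::map<std::string, std::set<std::string>> Reach = {
        {"@root", {"k0", "k1", "k2"}}, // most users: the module-strategy root
        {"@only", {"k1"}},             // single kernel: kernel strategy
        {"@sub", {"k0", "k2"}},        // subset of root's kernels: module
        {"@other", {"k1", "k3"}},      // k3 not in root's set: table
    };

    int main() {
      const std::set<std::string> &RootKernels = Reach["@root"];
      for (const auto &[Var, Kernels] : Reach) {
        const char *Strategy;
        if (Var == "@root")
          Strategy = "module";
        else if (Kernels.size() == 1)
          Strategy = "kernel";
        else if (std::includes(RootKernels.begin(), RootKernels.end(),
                               Kernels.begin(), Kernels.end()))
          Strategy = "module"; // no extra LDS: those kernels allocate it anyway
        else
          Strategy = "table";
        printf("%s -> %s\n", Var.c_str(), Strategy);
      }
    }

The subset test is the key point: putting "@sub" in the module struct costs no
additional LDS because every kernel that needs it already allocates the struct
for "@root".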
+
+    // If the kernel accesses a variable that is going to be stored in the
+    // module instance through a call then that kernel needs to allocate the
+    // module instance
+    DenseSet<Function *> KernelsThatAllocateModuleLDS =
+        kernelsThatIndirectlyAccessAnyOfPassedVariables(M, LDSUsesInfo,
+                                                        ModuleScopeVariables);
+    DenseSet<Function *> KernelsThatAllocateTableLDS =
+        kernelsThatIndirectlyAccessAnyOfPassedVariables(M, LDSUsesInfo,
+                                                        TableLookupVariables);
+
+    if (!ModuleScopeVariables.empty()) {
+      LDSVariableReplacement ModuleScopeReplacement =
+          createLDSVariableReplacement(M, "llvm.amdgcn.module.lds",
+                                       ModuleScopeVariables);
+
+      appendToCompilerUsed(M,
+                           {static_cast<GlobalValue *>(
+                               ConstantExpr::getPointerBitCastOrAddrSpaceCast(
+                                   cast<Constant>(ModuleScopeReplacement.SGV),
+                                   Type::getInt8PtrTy(Ctx)))});
+
+      // historic
+      removeFromUsedLists(M, ModuleScopeVariables);
+
+      // Replace all uses of module scope variable from non-kernel functions
+      replaceLDSVariablesWithStruct(
+          M, ModuleScopeVariables, ModuleScopeReplacement, [&](Use &U) {
+            Instruction *I = dyn_cast<Instruction>(U.getUser());
+            if (!I) {
+              return false;
+            }
+            Function *F = I->getFunction();
+            return !isKernelLDS(F);
+          });
+
+      // Replace uses of module scope variable from kernel functions that
+      // allocate the module scope variable, otherwise leave them unchanged.
+      // Record on each kernel whether the module scope global is used by it
+
+      LLVMContext &Ctx = M.getContext();
+      IRBuilder<> Builder(Ctx);
+
+      for (Function &Func : M.functions()) {
+        if (Func.isDeclaration() || !isKernelLDS(&Func))
+          continue;
+
+        if (KernelsThatAllocateModuleLDS.contains(&Func)) {
+          replaceLDSVariablesWithStruct(
+              M, ModuleScopeVariables, ModuleScopeReplacement, [&](Use &U) {
+                Instruction *I = dyn_cast<Instruction>(U.getUser());
+                if (!I) {
+                  return false;
+                }
+                Function *F = I->getFunction();
+                return F == &Func;
+              });
+
+          markUsedByKernel(Builder, &Func, ModuleScopeReplacement.SGV);
+
+        } else {
+          Func.addFnAttr("amdgpu-elide-module-lds");
+        }
+      }
+    }
 
-    // Move variables used by kernels into per-kernel instances
-    for (Function &F : M.functions()) {
-      if (F.isDeclaration())
+    // Create a struct for each kernel for the non-module-scope variables
+    DenseMap<Function *, LDSVariableReplacement> KernelToReplacement;
+    for (Function &Func : M.functions()) {
+      if (Func.isDeclaration() || !isKernelLDS(&Func))
         continue;
 
-      // Only lower compute kernels' LDS.
-      if (!AMDGPU::isKernel(F.getCallingConv()))
+      DenseSet<GlobalVariable *> KernelUsedVariables;
+      for (auto &v : LDSUsesInfo.direct_access[&Func]) {
+        KernelUsedVariables.insert(v);
+      }
+      for (auto &v : LDSUsesInfo.indirect_access[&Func]) {
+        KernelUsedVariables.insert(v);
+      }
+
+      // Variables allocated in module lds must all resolve to that struct,
+      // not to the per-kernel instance.
+      if (KernelsThatAllocateModuleLDS.contains(&Func)) {
+        for (GlobalVariable *v : ModuleScopeVariables) {
+          KernelUsedVariables.erase(v);
+        }
+      }
+
+      if (KernelUsedVariables.empty()) {
+        // Either used no LDS, or all the LDS it used was also in module
         continue;
+      }
+
+      // The association between kernel function and LDS struct is done by
+      // symbol name, which only works if the function in question has a
+      // name. This is not expected to be a problem in practice as kernels
+      // are called by name, making anonymous ones (which are named by the
+      // backend) difficult to use. This does mean that llvm test cases need
+      // to name the kernels.
+      if (!Func.hasName()) {
+        report_fatal_error("Anonymous kernels cannot use LDS variables");
+      }
+
+      std::string VarName =
+          (Twine("llvm.amdgcn.kernel.") + Func.getName() + ".lds").str();
 
-      std::vector<GlobalVariable *> KernelUsedVariables =
-          AMDGPU::findLDSVariablesToLower(M, &F);
-
-      if (!KernelUsedVariables.empty()) {
-        // The association between kernel function and LDS struct is done by
-        // symbol name, which only works if the function in question has a name
-        // This is not expected to be a problem in practice as kernels are
-        // called by name making anonymous ones (which are named by the backend)
-        // difficult to use. This does mean that llvm test cases need
-        // to name the kernels.
-        if (!F.hasName()) {
-          report_fatal_error("Anonymous kernels cannot use LDS variables");
+      auto Replacement =
+          createLDSVariableReplacement(M, VarName, KernelUsedVariables);
+
+      // remove preserves existing codegen
+      removeFromUsedLists(M, KernelUsedVariables);
+      KernelToReplacement[&Func] = Replacement;
+
+      // Rewrite uses within kernel to the new struct
+      replaceLDSVariablesWithStruct(
+          M, KernelUsedVariables, Replacement, [&Func](Use &U) {
+            Instruction *I = dyn_cast<Instruction>(U.getUser());
+            return I && I->getFunction() == &Func;
+          });
+    }
+
+    // Lower zero cost accesses to the kernel instances just created
+    for (auto &GV : KernelAccessVariables) {
+      auto &funcs = LDSToKernelsThatNeedToAccessItIndirectly[GV];
+      assert(funcs.size() == 1); // Only one kernel can access it
+      LDSVariableReplacement Replacement =
+          KernelToReplacement[*(funcs.begin())];
+
+      DenseSet<GlobalVariable *> Vec;
+      Vec.insert(GV);
+
+      replaceLDSVariablesWithStruct(M, Vec, Replacement, [](Use &U) {
+        return isa<Instruction>(U.getUser());
+      });
+    }
+
+    if (!KernelsThatAllocateTableLDS.empty()) {
+      // Collect the kernels that allocate table lookup LDS
+      std::vector<Function *> OrderedKernels;
+      {
+        for (Function &Func : M.functions()) {
+          if (Func.isDeclaration())
+            continue;
+          if (!isKernelLDS(&Func))
+            continue;
+
+          if (KernelsThatAllocateTableLDS.contains(&Func)) {
+            assert(Func.hasName()); // else fatal error earlier
+            OrderedKernels.push_back(&Func);
+          }
+        }
 
-        std::string VarName =
-            (Twine("llvm.amdgcn.kernel.") + F.getName() + ".lds").str();
-        GlobalVariable *SGV;
-        DenseMap<GlobalVariable *, Constant *> LDSVarsToConstantGEP;
-        std::tie(SGV, LDSVarsToConstantGEP) =
-            createLDSVariableReplacement(M, VarName, KernelUsedVariables);
-
-        removeFromUsedLists(M, KernelUsedVariables);
-        replaceLDSVariablesWithStruct(
-            M, KernelUsedVariables, SGV, LDSVarsToConstantGEP, [&F](Use &U) {
-              Instruction *I = dyn_cast<Instruction>(U.getUser());
-              return I && I->getFunction() == &F;
-            });
-        Changed = true;
+        // Put them in an arbitrary but reproducible order
+        llvm::sort(OrderedKernels.begin(), OrderedKernels.end(),
+                   [](const Function *lhs, const Function *rhs) -> bool {
+                     return lhs->getName() < rhs->getName();
+                   });
+
+        // Annotate the kernels with their order in this vector
M.getContext(); + IRBuilder<> Builder(Ctx); + + if (OrderedKernels.size() > UINT32_MAX) { + // 32 bit keeps it in one SGPR. > 2**32 kernels won't fit on the GPU + report_fatal_error("Unimplemented LDS lowering for > 2**32 kernels"); + } + + for (size_t i = 0; i < OrderedKernels.size(); i++) { + Metadata *AttrMDArgs[1] = { + ConstantAsMetadata::get(Builder.getInt32(i)), + }; + OrderedKernels[i]->setMetadata("llvm.amdgcn.lds.kernel.id", + MDNode::get(Ctx, AttrMDArgs)); + + markUsedByKernel(Builder, OrderedKernels[i], + KernelToReplacement[OrderedKernels[i]].SGV); + } } + + // The order must be consistent between lookup table and accesses to + // lookup table + std::vector TableLookupVariablesOrdered( + TableLookupVariables.begin(), TableLookupVariables.end()); + llvm::sort(TableLookupVariablesOrdered.begin(), + TableLookupVariablesOrdered.end(), + [](const GlobalVariable *lhs, const GlobalVariable *rhs) { + return lhs->getName() < rhs->getName(); + }); + + GlobalVariable *LookupTable = buildLookupTable( + M, TableLookupVariablesOrdered, OrderedKernels, KernelToReplacement); + replaceUsesInInstructionsWithTableLookup(M, TableLookupVariablesOrdered, + LookupTable); } for (auto &GV : make_early_inc_range(M.globals())) if (AMDGPU::isLDSVariableToLower(GV)) { + + // probably want to remove from used lists GV.removeDeadConstantUsers(); if (GV.use_empty()) GV.eraseFromParent(); @@ -375,10 +1033,9 @@ private: return Changed; } - std::tuple> - createLDSVariableReplacement( + static LDSVariableReplacement createLDSVariableReplacement( Module &M, std::string VarName, - std::vector const &LDSVarsToTransform) { + DenseSet const &LDSVarsToTransform) { // Create a struct instance containing LDSVarsToTransform and map from those // variables to ConstantExprGEP // Variables may be introduced to meet alignment requirements. No aliasing @@ -474,18 +1131,26 @@ private: } } assert(Map.size() == LDSVarsToTransform.size()); - return std::make_tuple(SGV, std::move(Map)); + return {SGV, std::move(Map)}; } template void replaceLDSVariablesWithStruct( - Module &M, std::vector const &LDSVarsToTransform, - GlobalVariable *SGV, - DenseMap &LDSVarsToConstantGEP, - PredicateTy Predicate) { + Module &M, DenseSet const &LDSVarsToTransformArg, + LDSVariableReplacement Replacement, PredicateTy Predicate) { LLVMContext &Ctx = M.getContext(); const DataLayout &DL = M.getDataLayout(); + // A hack... we need to insert the aliasing info in a predictable order for + // lit tests. Would like to have them in a stable order already, ideally the + // same order they get allocated, which might mean an ordered set container + std::vector LDSVarsToTransform( + LDSVarsToTransformArg.begin(), LDSVarsToTransformArg.end()); + llvm::sort(LDSVarsToTransform.begin(), LDSVarsToTransform.end(), + [](const GlobalVariable *lhs, const GlobalVariable *rhs) { + return lhs->getName() < rhs->getName(); + }); + // Create alias.scope and their lists. Each field in the new structure // does not alias with all other fields. 
     SmallVector<MDNode *> AliasScopes;
@@ -506,18 +1171,16 @@
     // field of the instance that will be allocated by AMDGPUMachineFunction
     for (size_t I = 0; I < NumberVars; I++) {
       GlobalVariable *GV = LDSVarsToTransform[I];
-      Constant *GEP = LDSVarsToConstantGEP[GV];
+      Constant *GEP = Replacement.LDSVarsToConstantGEP[GV];
 
       GV->replaceUsesWithIf(GEP, Predicate);
-      if (GV->use_empty()) {
-        GV->eraseFromParent();
-      }
 
       APInt APOff(DL.getIndexTypeSizeInBits(GEP->getType()), 0);
       GEP->stripAndAccumulateInBoundsConstantOffsets(DL, APOff);
       uint64_t Offset = APOff.getZExtValue();
 
-      Align A = commonAlignment(SGV->getAlign().valueOrOne(), Offset);
+      Align A =
+          commonAlignment(Replacement.SGV->getAlign().valueOrOne(), Offset);
 
       if (I)
         NoAliasList[I - 1] = AliasScopes[I - 1];
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp b/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
index 45fbc84..d88a2cd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
@@ -14,6 +14,7 @@
 
 #include "AMDGPUMCInstLower.h"
 #include "AMDGPUAsmPrinter.h"
+#include "AMDGPUMachineFunction.h"
 #include "AMDGPUTargetMachine.h"
 #include "MCTargetDesc/AMDGPUInstPrinter.h"
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
@@ -165,6 +166,17 @@ bool AMDGPUAsmPrinter::lowerOperand(const MachineOperand &MO,
 }
 
 const MCExpr *AMDGPUAsmPrinter::lowerConstant(const Constant *CV) {
+
+  // Intercept LDS variables with known addresses
+  if (const GlobalVariable *GV = dyn_cast<GlobalVariable>(CV)) {
+    if (AMDGPUMachineFunction::isKnownAddressLDSGlobal(*GV)) {
+      unsigned offset =
+          AMDGPUMachineFunction::calculateKnownAddressOfLDSGlobal(*GV);
+      Constant *C = ConstantInt::get(CV->getContext(), APInt(32, offset));
+      return AsmPrinter::lowerConstant(C);
+    }
+  }
+
   if (const MCExpr *E = lowerAddrSpaceCast(TM, CV, OutContext))
     return E;
   return AsmPrinter::lowerConstant(CV);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp b/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
index 488b3be..d8133a9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
@@ -84,6 +84,24 @@ unsigned AMDGPUMachineFunction::allocateLDSGlobal(const DataLayout &DL,
   return Offset;
 }
 
+static constexpr StringLiteral ModuleLDSName = "llvm.amdgcn.module.lds";
+
+bool AMDGPUMachineFunction::isKnownAddressLDSGlobal(const GlobalVariable &GV) {
+  auto name = GV.getName();
+  return (name == ModuleLDSName) ||
+         (name.startswith("llvm.amdgcn.kernel.") && name.endswith(".lds"));
+}
+
+const Function *AMDGPUMachineFunction::getKernelLDSFunctionFromGlobal(
+    const GlobalVariable &GV) {
+  const Module &M = *GV.getParent();
+  StringRef N(GV.getName());
+  if (N.consume_front("llvm.amdgcn.kernel.") && N.consume_back(".lds")) {
+    return M.getFunction(N);
+  }
+  return nullptr;
+}
+
 const GlobalVariable *
 AMDGPUMachineFunction::getKernelLDSGlobalFromFunction(const Function &F) {
   const Module *M = F.getParent();
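The name round-trip these two functions rely on can be sketched standalone
(std::string_view stands in for StringRef; the kernel name is invented):

    #include <cstdio>
    #include <optional>
    #include <string_view>

    // "llvm.amdgcn.kernel.<kernelname>.lds" maps back to <kernelname>.
    std::optional<std::string_view> kernelNameFromLDSGlobal(std::string_view N) {
      const std::string_view Prefix = "llvm.amdgcn.kernel.", Suffix = ".lds";
      if (N.size() > Prefix.size() + Suffix.size() &&
          N.substr(0, Prefix.size()) == Prefix &&
          N.substr(N.size() - Suffix.size()) == Suffix)
        return N.substr(Prefix.size(), N.size() - Prefix.size() - Suffix.size());
      return std::nullopt;
    }

    int main() {
      if (auto K = kernelNameFromLDSGlobal("llvm.amdgcn.kernel.k0.lds"))
        printf("kernel: %.*s\n", (int)K->size(), K->data()); // kernel: k0
    }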
@@ -98,6 +116,37 @@ static bool canElideModuleLDS(const Function &F) {
   return F.hasFnAttribute("amdgpu-elide-module-lds");
 }
 
+unsigned AMDGPUMachineFunction::calculateKnownAddressOfLDSGlobal(
+    const GlobalVariable &GV) {
+  // module.lds, then alignment padding, then kernel.lds, then other
+  // variables if any
+
+  assert(isKnownAddressLDSGlobal(GV));
+  unsigned Offset = 0;
+
+  if (GV.getName() == ModuleLDSName) {
+    return 0;
+  }
+
+  const Module *M = GV.getParent();
+  const DataLayout &DL = M->getDataLayout();
+
+  const GlobalVariable *GVM = M->getNamedGlobal(ModuleLDSName);
+  const Function *f = getKernelLDSFunctionFromGlobal(GV);
+
+  // Account for module.lds if allocated for this function
+  if (GVM && f && !canElideModuleLDS(*f)) {
+    // allocator aligns this to var align, but it's zero to begin with
+    Offset += DL.getTypeAllocSize(GVM->getValueType());
+  }
+
+  // No dynamic LDS alignment done by allocateModuleLDSGlobal
+  Offset = alignTo(
+      Offset, DL.getValueOrABITypeAlignment(GV.getAlign(), GV.getValueType()));
+
+  return Offset;
+}
+
 void AMDGPUMachineFunction::allocateKnownAddressLDSGlobal(const Function &F) {
   const Module *M = F.getParent();
 
@@ -124,21 +173,25 @@ void AMDGPUMachineFunction::allocateKnownAddressLDSGlobal(const Function &F) {
     //  }
     //  other variables, e.g. dynamic lds, allocated after this call
 
-    const GlobalVariable *GV = M->getNamedGlobal("llvm.amdgcn.module.lds");
+    const GlobalVariable *GV = M->getNamedGlobal(ModuleLDSName);
     const GlobalVariable *KV = getKernelLDSGlobalFromFunction(F);
 
     if (GV && !canElideModuleLDS(F)) {
+      assert(isKnownAddressLDSGlobal(*GV));
       unsigned Offset = allocateLDSGlobal(M->getDataLayout(), *GV, Align());
       (void)Offset;
-      assert(Offset == 0 &&
+      assert(Offset == calculateKnownAddressOfLDSGlobal(*GV) &&
             "Module LDS expected to be allocated before other LDS");
     }
 
     if (KV) {
      // The per-kernel offset is deterministic because it is allocated
      // before any other non-module LDS variables.
+      assert(isKnownAddressLDSGlobal(*KV));
      unsigned Offset = allocateLDSGlobal(M->getDataLayout(), *KV, Align());
      (void)Offset;
+      assert(Offset == calculateKnownAddressOfLDSGlobal(*KV) &&
+             "Kernel LDS expected to be immediately after module LDS");
    }
  }
}
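A worked example of that offset computation, under assumed numbers (a module
struct of 12 bytes and a kernel struct requiring alignment 8 - both invented
for illustration):

    #include <cstdint>
    #include <cstdio>

    // alignTo as used by the allocator: round Offset up to a multiple of Align.
    uint64_t alignTo(uint64_t Offset, uint64_t Align) {
      return (Offset + Align - 1) / Align * Align;
    }

    int main() {
      uint64_t ModuleSize = 12, KernelAlign = 8;
      // module.lds is always at address 0; kernel.lds follows after padding.
      printf("llvm.amdgcn.module.lds at 0\n");
      printf("llvm.amdgcn.kernel.k.lds at %llu\n",
             (unsigned long long)alignTo(ModuleSize, KernelAlign)); // 16
    }

Because both offsets depend only on the module, ISel and the asm printer can
fold accesses to these globals into constants, which is what the surrounding
hunks implement.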
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h b/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
index 098f4a8..4d97e5a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
@@ -97,6 +97,7 @@ public:
   unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalVariable &GV) {
     return allocateLDSGlobal(DL, GV, DynLDSAlign);
   }
+
   unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalVariable &GV,
                              Align Trailing);
 
@@ -104,9 +105,17 @@ public:
   // A kernel function may have an associated LDS allocation, and a kernel-scope
   // LDS allocation must have an associated kernel function
+
+  // LDS allocation should have an associated kernel function
+  static const Function *
+  getKernelLDSFunctionFromGlobal(const GlobalVariable &GV);
   static const GlobalVariable *
   getKernelLDSGlobalFromFunction(const Function &F);
 
+  // Module or kernel scope LDS variable
+  static bool isKnownAddressLDSGlobal(const GlobalVariable &GV);
+  static unsigned calculateKnownAddressOfLDSGlobal(const GlobalVariable &GV);
+
   static Optional<uint32_t> getLDSKernelIdMetadata(const Function &F);
 
   Align getDynLDSAlign() const { return DynLDSAlign; }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 0fe6ffc..3050f50 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6042,10 +6042,23 @@ SDValue SITargetLowering::lowerBUILD_VECTOR(SDValue Op,
 bool
 SITargetLowering::isOffsetFoldingLegal(const GlobalAddressSDNode *GA) const {
   // We can fold offsets for anything that doesn't require a GOT relocation.
-  return (GA->getAddressSpace() == AMDGPUAS::GLOBAL_ADDRESS ||
-          GA->getAddressSpace() == AMDGPUAS::CONSTANT_ADDRESS ||
-          GA->getAddressSpace() == AMDGPUAS::CONSTANT_ADDRESS_32BIT) &&
-         !shouldEmitGOTReloc(GA->getGlobal());
+  auto const AS = GA->getAddressSpace();
+  if (AS == AMDGPUAS::GLOBAL_ADDRESS) return true;
+  if (AS == AMDGPUAS::CONSTANT_ADDRESS) return true;
+  if ((AS == AMDGPUAS::CONSTANT_ADDRESS_32BIT) &&
+      !shouldEmitGOTReloc(GA->getGlobal())) return true;
+
+  // Some LDS variables have compile time known addresses
+  if (AS == AMDGPUAS::LOCAL_ADDRESS) {
+    if (const GlobalVariable *GV =
+            dyn_cast<GlobalVariable>(GA->getGlobal())) {
+      if (AMDGPUMachineFunction::isKnownAddressLDSGlobal(*GV)) {
+        return true;
+      }
+    }
+  }
+
+  return false;
 }
 
 static SDValue
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/dropped_debug_info_assert.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/dropped_debug_info_assert.ll
index bb6f809..8ee5d9c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/dropped_debug_info_assert.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/dropped_debug_info_assert.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -stop-after=instruction-select -o - %s | FileCheck %s
+; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -stop-after=instruction-select --amdgpu-lower-module-lds-strategy=module -o - %s | FileCheck %s
 
 ; Make sure there are no assertions on dropped debug info
 declare void @callee()
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
index ac12b4f..467d755 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
-; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -O0 -global-isel -stop-after=irtranslator -verify-machineinstrs -o - %s | FileCheck %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -O0 -global-isel -stop-after=irtranslator -verify-machineinstrs --amdgpu-lower-module-lds-strategy=module -o - %s | FileCheck %s
 
 define amdgpu_kernel void @asm_convergent() convergent{
 ; CHECK-LABEL: name: asm_convergent
diff --git a/llvm/test/CodeGen/AMDGPU/addrspacecast-known-non-null.ll b/llvm/test/CodeGen/AMDGPU/addrspacecast-known-non-null.ll
index e8c5c38..d2eee69 100644
--- a/llvm/test/CodeGen/AMDGPU/addrspacecast-known-non-null.ll
+++ b/llvm/test/CodeGen/AMDGPU/addrspacecast-known-non-null.ll
@@ -1,5 +1,5 @@
-; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -o - %s | FileCheck %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -o - %s | FileCheck %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 --amdgpu-lower-module-lds-strategy=module -o - %s | FileCheck %s
+; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 --amdgpu-lower-module-lds-strategy=module -o - %s | FileCheck %s
 
 ; Test that a null check is not emitted for lowered addrspacecast
diff --git a/llvm/test/CodeGen/AMDGPU/ds_write2.ll b/llvm/test/CodeGen/AMDGPU/ds_write2.ll
index 32ee858..b95f648 100644
--- a/llvm/test/CodeGen/AMDGPU/ds_write2.ll
+++ b/llvm/test/CodeGen/AMDGPU/ds_write2.ll
@@ -1,7 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --force-update
-; RUN: llc -mtriple=amdgcn--amdpal -mcpu=bonaire -verify-machineinstrs -mattr=+load-store-opt < %s | FileCheck -enable-var-scope --check-prefix=CI %s
-; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx900 -verify-machineinstrs -mattr=+load-store-opt,-unaligned-access-mode < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9-ALIGNED %s
-; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx900 -verify-machineinstrs -mattr=+load-store-opt,+unaligned-access-mode < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9-UNALIGNED %s
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=bonaire -verify-machineinstrs -mattr=+load-store-opt --amdgpu-lower-module-lds-strategy=module < %s | FileCheck -enable-var-scope --check-prefix=CI %s
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx900 -verify-machineinstrs -mattr=+load-store-opt,-unaligned-access-mode --amdgpu-lower-module-lds-strategy=module < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9-ALIGNED %s
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx900 -verify-machineinstrs -mattr=+load-store-opt,+unaligned-access-mode --amdgpu-lower-module-lds-strategy=module < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9-UNALIGNED %s
 
 @lds = addrspace(3) global [512 x float] undef, align 4
 @lds.f64 = addrspace(3) global [512 x double] undef, align 8
diff --git a/llvm/test/CodeGen/AMDGPU/hsa.ll b/llvm/test/CodeGen/AMDGPU/hsa.ll
index 61672ef..45cb7b8 100644
--- a/llvm/test/CodeGen/AMDGPU/hsa.ll
+++ b/llvm/test/CodeGen/AMDGPU/hsa.ll
@@ -1,13 +1,13 @@
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=kaveri --amdhsa-code-object-version=2 | FileCheck --check-prefix=HSA %s
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=kaveri --amdhsa-code-object-version=2 -mattr=-flat-for-global | FileCheck --check-prefix=HSA-CI %s
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=carrizo --amdhsa-code-object-version=2 | FileCheck --check-prefix=HSA %s
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=carrizo --amdhsa-code-object-version=2 -mattr=-flat-for-global | FileCheck --check-prefix=HSA-VI %s
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=kaveri -filetype=obj --amdhsa-code-object-version=2 | llvm-readobj -S --sd --syms - | FileCheck --check-prefix=ELF %s
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=kaveri --amdhsa-code-object-version=2 | llvm-mc -filetype=obj -triple amdgcn--amdhsa -mcpu=kaveri --amdhsa-code-object-version=2 | llvm-readobj -S --sd --syms - | FileCheck %s --check-prefix=ELF
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=gfx1010 -mattr=+wavefrontsize32,-wavefrontsize64 | FileCheck --check-prefix=GFX10 --check-prefix=GFX10-W32 %s
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=gfx1010 -mattr=-wavefrontsize32,+wavefrontsize64 | FileCheck --check-prefix=GFX10 --check-prefix=GFX10-W64 %s
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=gfx1100 -mattr=+wavefrontsize32,-wavefrontsize64 | FileCheck --check-prefix=GFX10 --check-prefix=GFX10-W32 %s
-; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=gfx1100 -mattr=-wavefrontsize32,+wavefrontsize64 | FileCheck --check-prefix=GFX10 --check-prefix=GFX10-W64 %s
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=kaveri --amdhsa-code-object-version=2 --amdgpu-lower-module-lds-strategy=module | FileCheck --check-prefix=HSA %s
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=kaveri --amdhsa-code-object-version=2 -mattr=-flat-for-global --amdgpu-lower-module-lds-strategy=module | FileCheck --check-prefix=HSA-CI %s
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=carrizo --amdhsa-code-object-version=2 --amdgpu-lower-module-lds-strategy=module | FileCheck --check-prefix=HSA %s
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=carrizo --amdhsa-code-object-version=2 -mattr=-flat-for-global --amdgpu-lower-module-lds-strategy=module | FileCheck --check-prefix=HSA-VI %s
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=kaveri -filetype=obj --amdhsa-code-object-version=2 --amdgpu-lower-module-lds-strategy=module | llvm-readobj -S --sd --syms - | FileCheck --check-prefix=ELF %s
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=kaveri --amdhsa-code-object-version=2 --amdgpu-lower-module-lds-strategy=module | llvm-mc -filetype=obj -triple amdgcn--amdhsa -mcpu=kaveri --amdhsa-code-object-version=2 | llvm-readobj -S --sd --syms - | FileCheck %s --check-prefix=ELF
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=gfx1010 -mattr=+wavefrontsize32,-wavefrontsize64 --amdgpu-lower-module-lds-strategy=module | FileCheck --check-prefix=GFX10 --check-prefix=GFX10-W32 %s
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=gfx1010 -mattr=-wavefrontsize32,+wavefrontsize64 --amdgpu-lower-module-lds-strategy=module | FileCheck --check-prefix=GFX10 --check-prefix=GFX10-W64 %s
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=gfx1100 -mattr=+wavefrontsize32,-wavefrontsize64 --amdgpu-lower-module-lds-strategy=module | FileCheck --check-prefix=GFX10 --check-prefix=GFX10-W32 %s
+; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=gfx1100 -mattr=-wavefrontsize32,+wavefrontsize64 --amdgpu-lower-module-lds-strategy=module | FileCheck --check-prefix=GFX10 --check-prefix=GFX10-W64 %s
 
 ; The SHT_NOTE section contains the output from the .hsa_code_object_*
 ; directives.
diff --git a/llvm/test/CodeGen/AMDGPU/lds-size.ll b/llvm/test/CodeGen/AMDGPU/lds-size.ll
index 313e4d0..c71daf9 100644
--- a/llvm/test/CodeGen/AMDGPU/lds-size.ll
+++ b/llvm/test/CodeGen/AMDGPU/lds-size.ll
@@ -9,7 +9,7 @@
 ; GCN-NEXT: .long 32900
 
 ; EG: .long 166120
-; EG-NEXT: .long 1
+; EG-NEXT: .long 0
 
 ; ALL: {{^}}test:
 ; HSA: granulated_lds_size = 0
diff --git a/llvm/test/CodeGen/AMDGPU/local-memory.amdgcn.ll b/llvm/test/CodeGen/AMDGPU/local-memory.amdgcn.ll
index 267cea2..27207ea 100644
--- a/llvm/test/CodeGen/AMDGPU/local-memory.amdgcn.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-memory.amdgcn.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GCN,SI
-; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GCN,CI
+; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s -check-prefixes=GCN,SI
+; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s -check-prefixes=GCN,CI
 
 @local_memory.local_mem = internal unnamed_addr addrspace(3) global [128 x i32] undef, align 4
 
diff --git a/llvm/test/CodeGen/AMDGPU/local-memory.r600.ll b/llvm/test/CodeGen/AMDGPU/local-memory.r600.ll
index b0c8cc2..7ba805f 100644
--- a/llvm/test/CodeGen/AMDGPU/local-memory.r600.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-memory.r600.ll
@@ -4,7 +4,7 @@
 ; Check that the LDS size is emitted correctly
 
 ; EG: .long 166120
-; EG-NEXT: .long 128
+; EG-NEXT: .long 0
 
 ; FUNC-LABEL: {{^}}local_memory:
 
@@ -36,7 +36,7 @@ entry:
 ; Check that the LDS size is emitted correctly
 
 ; EG: .long 166120
-; EG-NEXT: .long 8
+; EG-NEXT: .long 0
 
 ; GCN: .long 47180
 ; GCN-NEXT: .long 32900
diff --git a/llvm/test/CodeGen/AMDGPU/lower-kernel-and-module-lds.ll b/llvm/test/CodeGen/AMDGPU/lower-kernel-and-module-lds.ll
index efd5701..a6604bd 100644
--- a/llvm/test/CodeGen/AMDGPU/lower-kernel-and-module-lds.ll
+++ b/llvm/test/CodeGen/AMDGPU/lower-kernel-and-module-lds.ll
@@ -1,5 +1,5 @@
-; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s
-; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s
+; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s
+; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s
 
 @lds.size.1.align.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1
 @lds.size.2.align.2 = internal unnamed_addr addrspace(3) global [2 x i8] undef, align 2
@@ -8,7 +8,7 @@
 @lds.size.16.align.16 = internal unnamed_addr addrspace(3) global [16 x i8] undef, align 16
 
 ; CHECK: %llvm.amdgcn.module.lds.t = type { [8 x i8], [1 x i8] }
-; CHECK: %llvm.amdgcn.kernel.k0.lds.t = type { [16 x i8], [4 x i8], [2 x i8] }
+; CHECK: %llvm.amdgcn.kernel.k0.lds.t = type { [16 x i8], [4 x i8], [2 x i8], [1 x i8] }
 ; CHECK: %llvm.amdgcn.kernel.k1.lds.t = type { [16 x i8], [4 x i8], [2 x i8] }
 ; CHECK: %llvm.amdgcn.kernel.k2.lds.t = type { [2 x i8] }
 ; CHECK: %llvm.amdgcn.kernel.k3.lds.t = type { [4 x i8] }
@@ -23,8 +23,8 @@
 ;.
 define amdgpu_kernel void @k0() #0 {
 ; CHECK-LABEL: @k0(
-; CHECK-NEXT: %lds.size.1.align.1.bc = bitcast [1 x i8] addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1) to i8 addrspace(3)*
-; CHECK-NEXT: store i8 1, i8 addrspace(3)* %lds.size.1.align.1.bc, align 8
+; CHECK-NEXT: %lds.size.1.align.1.bc = bitcast [1 x i8] addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k0.lds.t, %llvm.amdgcn.kernel.k0.lds.t addrspace(3)* @llvm.amdgcn.kernel.k0.lds, i32 0, i32 3) to i8 addrspace(3)*
+; CHECK-NEXT: store i8 1, i8 addrspace(3)* %lds.size.1.align.1.bc, align 2, !alias.scope !0, !noalias !3
 ; CHECK-NEXT: %lds.size.2.align.2.bc = bitcast [2 x i8] addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k0.lds.t, %llvm.amdgcn.kernel.k0.lds.t addrspace(3)* @llvm.amdgcn.kernel.k0.lds, i32 0, i32 2) to i8 addrspace(3)*
 ; CHECK-NEXT: store i8 2, i8 addrspace(3)* %lds.size.2.align.2.bc, align 4
 ; CHECK-NEXT: %lds.size.4.align.4.bc = bitcast [4 x i8] addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k0.lds.t, %llvm.amdgcn.kernel.k0.lds.t addrspace(3)* @llvm.amdgcn.kernel.k0.lds, i32 0, i32 1) to i8 addrspace(3)*
@@ -94,9 +94,15 @@ define amdgpu_kernel void @k3() #0 {
   ret void
 }
 
+
+define amdgpu_kernel void @calls_f0() {
+  call void @f0()
+  ret void
+}
+
 define void @f0() {
 ; CHECK-LABEL: @f0(
-; CHECK-NEXT: %lds.size.1.align.1.bc = bitcast [1 x i8] addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1) to i8 addrspace(3)*
+; CHECK: %lds.size.1.align.1.bc = bitcast [1 x i8] addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1) to i8 addrspace(3)*
 ; CHECK-NEXT: store i8 1, i8 addrspace(3)* %lds.size.1.align.1.bc, align 8
 ; CHECK-NEXT: %lds.size.8.align.8.bc = bitcast [8 x i8] addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0) to i8 addrspace(3)*
 ; CHECK-NEXT: store i8 8, i8 addrspace(3)* %lds.size.8.align.8.bc, align 8
a/llvm/test/CodeGen/AMDGPU/lower-kernel-lds-constexpr.ll b/llvm/test/CodeGen/AMDGPU/lower-kernel-lds-constexpr.ll index 5a134f7..13b6d3c 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-kernel-lds-constexpr.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-kernel-lds-constexpr.ll @@ -1,6 +1,6 @@ ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: -p --check-globals -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s @lds.1 = internal unnamed_addr addrspace(3) global [2 x i8] undef, align 1 diff --git a/llvm/test/CodeGen/AMDGPU/lower-kernel-lds-super-align.ll b/llvm/test/CodeGen/AMDGPU/lower-kernel-lds-super-align.ll index dfad39c..2bade9f 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-kernel-lds-super-align.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-kernel-lds-super-align.ll @@ -1,7 +1,7 @@ -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-super-align-lds-globals=true < %s | FileCheck --check-prefixes=CHECK,SUPER-ALIGN_ON %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-super-align-lds-globals=true < %s | FileCheck --check-prefixes=CHECK,SUPER-ALIGN_ON %s -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-super-align-lds-globals=false < %s | FileCheck --check-prefixes=CHECK,SUPER-ALIGN_OFF %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-super-align-lds-globals=false < %s | FileCheck --check-prefixes=CHECK,SUPER-ALIGN_OFF %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-super-align-lds-globals=true --amdgpu-lower-module-lds-strategy=module < %s | FileCheck --check-prefixes=CHECK,SUPER-ALIGN_ON %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-super-align-lds-globals=true --amdgpu-lower-module-lds-strategy=module < %s | FileCheck --check-prefixes=CHECK,SUPER-ALIGN_ON %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-super-align-lds-globals=false --amdgpu-lower-module-lds-strategy=module < %s | FileCheck --check-prefixes=CHECK,SUPER-ALIGN_OFF %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-super-align-lds-globals=false --amdgpu-lower-module-lds-strategy=module < %s | FileCheck --check-prefixes=CHECK,SUPER-ALIGN_OFF %s ; CHECK: %llvm.amdgcn.kernel.k1.lds.t = type { [32 x i8] } ; CHECK: %llvm.amdgcn.kernel.k2.lds.t = type { i16, [2 x i8], i16 } diff --git a/llvm/test/CodeGen/AMDGPU/lower-kernel-lds.ll b/llvm/test/CodeGen/AMDGPU/lower-kernel-lds.ll index 29159f1..252c7be 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-kernel-lds.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-kernel-lds.ll @@ -1,5 +1,5 @@ -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s @lds.size.1.align.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1 @lds.size.2.align.2 = internal unnamed_addr addrspace(3) global [2 x 
i8] undef, align 2 diff --git a/llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa-memcpy.ll b/llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa-memcpy.ll index ee1bc43..09035a8 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa-memcpy.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa-memcpy.ll @@ -1,6 +1,6 @@ -; RUN: llc -march=amdgcn -mcpu=gfx900 -O3 < %s | FileCheck -check-prefix=GCN %s -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s +; RUN: llc -march=amdgcn -mcpu=gfx900 -O3 --amdgpu-lower-module-lds-strategy=module < %s | FileCheck -check-prefix=GCN %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s %vec_type = type { %vec_base } %vec_base = type { %union.anon } diff --git a/llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa-merge.ll b/llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa-merge.ll index 8543088..7984c25 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa-merge.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa-merge.ll @@ -1,5 +1,5 @@ -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s @a = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 4 @b = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 4 diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr-phi.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr-phi.ll index 097641b..c81ea7f 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr-phi.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr-phi.ll @@ -1,6 +1,6 @@ ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s @var = addrspace(3) global i32 undef, align 4 diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll index ebe9cca..38ada0a 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll @@ -1,12 +1,11 @@ -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s ; CHECK: %llvm.amdgcn.module.lds.t = type { float, float } +; CHECK: %llvm.amdgcn.kernel.timestwo.lds.t = type { float, float } @a_func = addrspace(3) 
global float undef, align 4 -; CHECK: %llvm.amdgcn.kernel.timestwo.lds.t = type { float } - @kern = addrspace(3) global float undef, align 4 ; @a_func is only used from a non-kernel function so is rewritten @@ -56,21 +55,20 @@ entry: ; CHECK-LABEL: @timestwo() #0 ; CHECK-NOT: call void @llvm.donothing() - -; CHECK: %1 = bitcast float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1) to i32 addrspace(3)* +; CHECK: %1 = bitcast float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.timestwo.lds.t, %llvm.amdgcn.kernel.timestwo.lds.t addrspace(3)* @llvm.amdgcn.kernel.timestwo.lds, i32 0, i32 0) to i32 addrspace(3)* ; CHECK: %2 = addrspacecast i32 addrspace(3)* %1 to i32* ; CHECK: %3 = ptrtoint i32* %2 to i64 -; CHECK: %4 = bitcast float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.timestwo.lds.t, %llvm.amdgcn.kernel.timestwo.lds.t addrspace(3)* @llvm.amdgcn.kernel.timestwo.lds, i32 0, i32 0) to i32 addrspace(3)* +; CHECK: %4 = bitcast float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.timestwo.lds.t, %llvm.amdgcn.kernel.timestwo.lds.t addrspace(3)* @llvm.amdgcn.kernel.timestwo.lds, i32 0, i32 1) to i32 addrspace(3)* ; CHECK: %5 = addrspacecast i32 addrspace(3)* %4 to i32* ; CHECK: %6 = ptrtoint i32* %5 to i64 ; CHECK: %7 = add i64 %3, %6 ; CHECK: %8 = inttoptr i64 %7 to i32* ; CHECK: %ld = load i32, i32* %8, align 4 ; CHECK: %mul = mul i32 %ld, 2 -; CHECK: %9 = bitcast float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.timestwo.lds.t, %llvm.amdgcn.kernel.timestwo.lds.t addrspace(3)* @llvm.amdgcn.kernel.timestwo.lds, i32 0, i32 0) to i32 addrspace(3)* +; CHECK: %9 = bitcast float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.timestwo.lds.t, %llvm.amdgcn.kernel.timestwo.lds.t addrspace(3)* @llvm.amdgcn.kernel.timestwo.lds, i32 0, i32 1) to i32 addrspace(3)* ; CHECK: %10 = addrspacecast i32 addrspace(3)* %9 to i32* ; CHECK: %11 = ptrtoint i32* %10 to i64 -; CHECK: %12 = bitcast float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1) to i32 addrspace(3)* +; CHECK: %12 = bitcast float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.timestwo.lds.t, %llvm.amdgcn.kernel.timestwo.lds.t addrspace(3)* @llvm.amdgcn.kernel.timestwo.lds, i32 0, i32 0) to i32 addrspace(3)* ; CHECK: %13 = addrspacecast i32 addrspace(3)* %12 to i32* ; CHECK: %14 = ptrtoint i32* %13 to i64 ; CHECK: %15 = add i64 %11, %14 @@ -84,5 +82,13 @@ define amdgpu_kernel void @timestwo() { ret void } +; CHECK-LABEL: @through_functions() +define amdgpu_kernel void @through_functions() { + %ld = call i32 @get_func() + %mul = mul i32 %ld, 4 + call void @set_func(i32 %mul) + ret void +} + attributes #0 = { "amdgpu-elide-module-lds" } ; CHECK: attributes #0 = { "amdgpu-elide-module-lds" } diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll index 1cb7309..303cc82 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll @@ -6,8 +6,8 @@ ; CHECK-NOT: llvm.amdgcn.module.lds ; CHECK-NOT: llvm.amdgcn.module.lds.t -; var1, var2 would be transformed were they used from a non-kernel function -; CHECK-NOT: @var1 = +; var1 is removed, var2 stays because it's in compiler.used +; CHECK-NOT: @var1 ; CHECK: @var2 = addrspace(3) global float undef @var1 = addrspace(3) 
global i32 undef @var2 = addrspace(3) global float undef diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-offsets.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-offsets.ll index a1effa6..1ddc365 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-module-lds-offsets.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-offsets.ll @@ -1,7 +1,7 @@ ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck -check-prefix=OPT %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck -check-prefix=OPT %s -; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=module | FileCheck -check-prefix=OPT %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=module | FileCheck -check-prefix=OPT %s +; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s --amdgpu-lower-module-lds-strategy=module | FileCheck -check-prefix=GCN %s ; Check that module LDS is allocated at address 0 and kernel starts its ; allocation past module LDS when a call is present. diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-single-var-ambiguous.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-single-var-ambiguous.ll new file mode 100644 index 0000000..7a33754 --- /dev/null +++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-single-var-ambiguous.ll @@ -0,0 +1,97 @@ +; NOTE: Assertions have been autogenerated by utils/update_test_checks.py +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=module | FileCheck -check-prefixes=CHECK,M_OR_HY %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=table | FileCheck -check-prefixes=CHECK,TABLE %s +; RUN: not --crash opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=kernel 2>&1 | FileCheck -check-prefixes=KERNEL %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=hybrid | FileCheck -check-prefixes=CHECK,M_OR_HY %s + +;; Two kernels access the same variable, specialisation gives them each their own copy of it + +@kernel.lds = addrspace(3) global i8 undef +define amdgpu_kernel void @k0() { +; CHECK-LABEL: @k0( +; CHECK-NEXT: [[LD:%.*]] = load i8, i8 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_LDS_T:%.*]], [[LLVM_AMDGCN_KERNEL_K0_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0.lds, i32 0, i32 0), align 1 +; CHECK-NEXT: [[MUL:%.*]] = mul i8 [[LD]], 2 +; CHECK-NEXT: store i8 [[MUL]], i8 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_LDS_T]], [[LLVM_AMDGCN_KERNEL_K0_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0.lds, i32 0, i32 0), align 1 +; CHECK-NEXT: ret void +; + %ld = load i8, i8 addrspace(3)* @kernel.lds + %mul = mul i8 %ld, 2 + store i8 %mul, i8 addrspace(3)* @kernel.lds + ret void +} + +define amdgpu_kernel void @k1() { +; CHECK-LABEL: @k1( +; CHECK-NEXT: [[LD:%.*]] = load i8, i8 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K1_LDS_T:%.*]], [[LLVM_AMDGCN_KERNEL_K1_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k1.lds, i32 0, i32 0), align 1 +; CHECK-NEXT: [[MUL:%.*]] = mul i8 [[LD]], 3 +; CHECK-NEXT: store i8 [[MUL]], i8 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K1_LDS_T]], [[LLVM_AMDGCN_KERNEL_K1_LDS_T]] addrspace(3)* 
@llvm.amdgcn.kernel.k1.lds, i32 0, i32 0), align 1 +; CHECK-NEXT: ret void +; + %ld = load i8, i8 addrspace(3)* @kernel.lds + %mul = mul i8 %ld, 3 + store i8 %mul, i8 addrspace(3)* @kernel.lds + ret void +} + +;; Function accesses variable, reachable from two kernels, can't use kernel lowering for either +;; Hybrid can put it in module lds without cost as the first variable is free + +; KERNEL: LLVM ERROR: Cannot lower LDS to kernel access as it is reachable from multiple kernels + +@function.lds = addrspace(3) global i16 undef +define void @f0() { +; M_OR_HY-LABEL: @f0( +; M_OR_HY-NEXT: [[LD:%.*]] = load i16, i16 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_MODULE_LDS_T:%.*]], [[LLVM_AMDGCN_MODULE_LDS_T]] addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2 +; M_OR_HY-NEXT: [[MUL:%.*]] = mul i16 [[LD]], 4 +; M_OR_HY-NEXT: store i16 [[MUL]], i16 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_MODULE_LDS_T]], [[LLVM_AMDGCN_MODULE_LDS_T]] addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2 +; M_OR_HY-NEXT: ret void +; +; TABLE-LABEL: @f0( +; TABLE-NEXT: [[TMP1:%.*]] = call i32 @llvm.amdgcn.lds.kernel.id() +; TABLE-NEXT: [[FUNCTION_LDS2:%.*]] = getelementptr inbounds [2 x [1 x i32]], [2 x [1 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 0 +; TABLE-NEXT: [[TMP2:%.*]] = load i32, i32 addrspace(4)* [[FUNCTION_LDS2]], align 4 +; TABLE-NEXT: [[FUNCTION_LDS3:%.*]] = inttoptr i32 [[TMP2]] to i16 addrspace(3)* +; TABLE-NEXT: [[LD:%.*]] = load i16, i16 addrspace(3)* [[FUNCTION_LDS3]], align 2 +; TABLE-NEXT: [[MUL:%.*]] = mul i16 [[LD]], 4 +; TABLE-NEXT: [[FUNCTION_LDS:%.*]] = getelementptr inbounds [2 x [1 x i32]], [2 x [1 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 0 +; TABLE-NEXT: [[TMP3:%.*]] = load i32, i32 addrspace(4)* [[FUNCTION_LDS]], align 4 +; TABLE-NEXT: [[FUNCTION_LDS1:%.*]] = inttoptr i32 [[TMP3]] to i16 addrspace(3)* +; TABLE-NEXT: store i16 [[MUL]], i16 addrspace(3)* [[FUNCTION_LDS1]], align 2 +; TABLE-NEXT: ret void +; + %ld = load i16, i16 addrspace(3)* @function.lds + %mul = mul i16 %ld, 4 + store i16 %mul, i16 addrspace(3)* @function.lds + ret void +} + + +define amdgpu_kernel void @k0_f0() { +; M_OR_HY-LABEL: @k0_f0( +; M_OR_HY-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_MODULE_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.module.lds) ] +; M_OR_HY-NEXT: call void @f0() +; M_OR_HY-NEXT: ret void +; +; TABLE-LABEL: @k0_f0( +; TABLE-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_KERNEL_K0_F0_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.kernel.k0_f0.lds) ] +; TABLE-NEXT: call void @f0() +; TABLE-NEXT: ret void +; + call void @f0() + ret void +} + +define amdgpu_kernel void @k1_f0() { +; M_OR_HY-LABEL: @k1_f0( +; M_OR_HY-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_MODULE_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.module.lds) ] +; M_OR_HY-NEXT: call void @f0() +; M_OR_HY-NEXT: ret void +; +; TABLE-LABEL: @k1_f0( +; TABLE-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_KERNEL_K1_F0_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.kernel.k1_f0.lds) ] +; TABLE-NEXT: call void @f0() +; TABLE-NEXT: ret void +; + call void @f0() + ret void +} diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-single-var-unambiguous.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-single-var-unambiguous.ll new file mode 100644 index 0000000..0fe89a3 --- /dev/null +++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-single-var-unambiguous.ll @@ -0,0 +1,144 @@ +; NOTE: 
Assertions have been autogenerated by utils/update_test_checks.py +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=module | FileCheck -check-prefixes=CHECK,MODULE %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=table | FileCheck -check-prefixes=CHECK,TABLE %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=kernel | FileCheck -check-prefixes=CHECK,K_OR_HY %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=hybrid | FileCheck -check-prefixes=CHECK,K_OR_HY %s + +;; Same checks for kernel and for hybrid as an unambiguous reference to a variable - one where exactly one kernel +;; can reach it - is the case where hybrid lowering can always prefer the direct access. + +;; Single kernel is sole user of single variable, all options codegen as direct access to kernel struct + +@k0.lds = addrspace(3) global i8 undef +define amdgpu_kernel void @k0() { +; CHECK-LABEL: @k0( +; CHECK-NEXT: [[LD:%.*]] = load i8, i8 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_LDS_T:%.*]], [[LLVM_AMDGCN_KERNEL_K0_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0.lds, i32 0, i32 0), align 1 +; CHECK-NEXT: [[MUL:%.*]] = mul i8 [[LD]], 2 +; CHECK-NEXT: store i8 [[MUL]], i8 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_LDS_T]], [[LLVM_AMDGCN_KERNEL_K0_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0.lds, i32 0, i32 0), align 1 +; CHECK-NEXT: ret void +; + %ld = load i8, i8 addrspace(3)* @k0.lds + %mul = mul i8 %ld, 2 + store i8 %mul, i8 addrspace(3)* @k0.lds + ret void +} + +;; Function is reachable from one kernel. Variable goes in module lds or the kernel struct, but never both. 
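A condensed sketch of the two placements the checks below verify, using the struct names the pass generates; either way, a load of @f0.lds becomes a direct access at a compile-time offset:
;   module strategy: field 1 of the shared module struct
;     %ld = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1)
;   kernel or hybrid strategy: field 0 of the one reachable kernel's own struct
;     %ld = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k_f0.lds.t, %llvm.amdgcn.kernel.k_f0.lds.t addrspace(3)* @llvm.amdgcn.kernel.k_f0.lds, i32 0, i32 0)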
+ +@f0.lds = addrspace(3) global i16 undef +define void @f0() { +; MODULE-LABEL: @f0( +; MODULE-NEXT: [[LD:%.*]] = load i16, i16 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_MODULE_LDS_T:%.*]], [[LLVM_AMDGCN_MODULE_LDS_T]] addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 4, !alias.scope !0, !noalias !3 +; MODULE-NEXT: [[MUL:%.*]] = mul i16 [[LD]], 3 +; MODULE-NEXT: store i16 [[MUL]], i16 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_MODULE_LDS_T]], [[LLVM_AMDGCN_MODULE_LDS_T]] addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 4, !alias.scope !0, !noalias !3 +; MODULE-NEXT: ret void +; +; TABLE-LABEL: @f0( +; TABLE-NEXT: [[TMP1:%.*]] = call i32 @llvm.amdgcn.lds.kernel.id() +; TABLE-NEXT: [[F0_LDS2:%.*]] = getelementptr inbounds [2 x [2 x i32]], [2 x [2 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 1 +; TABLE-NEXT: [[TMP2:%.*]] = load i32, i32 addrspace(4)* [[F0_LDS2]], align 4 +; TABLE-NEXT: [[F0_LDS3:%.*]] = inttoptr i32 [[TMP2]] to i16 addrspace(3)* +; TABLE-NEXT: [[LD:%.*]] = load i16, i16 addrspace(3)* [[F0_LDS3]], align 2 +; TABLE-NEXT: [[MUL:%.*]] = mul i16 [[LD]], 3 +; TABLE-NEXT: [[F0_LDS:%.*]] = getelementptr inbounds [2 x [2 x i32]], [2 x [2 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 1 +; TABLE-NEXT: [[TMP3:%.*]] = load i32, i32 addrspace(4)* [[F0_LDS]], align 4 +; TABLE-NEXT: [[F0_LDS1:%.*]] = inttoptr i32 [[TMP3]] to i16 addrspace(3)* +; TABLE-NEXT: store i16 [[MUL]], i16 addrspace(3)* [[F0_LDS1]], align 2 +; TABLE-NEXT: ret void +; +; K_OR_HY-LABEL: @f0( +; K_OR_HY-NEXT: [[LD:%.*]] = load i16, i16 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K_F0_LDS_T:%.*]], [[LLVM_AMDGCN_KERNEL_K_F0_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k_f0.lds, i32 0, i32 0), align 2 +; K_OR_HY-NEXT: [[MUL:%.*]] = mul i16 [[LD]], 3 +; K_OR_HY-NEXT: store i16 [[MUL]], i16 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K_F0_LDS_T]], [[LLVM_AMDGCN_KERNEL_K_F0_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k_f0.lds, i32 0, i32 0), align 2 +; K_OR_HY-NEXT: ret void +; + %ld = load i16, i16 addrspace(3)* @f0.lds + %mul = mul i16 %ld, 3 + store i16 %mul, i16 addrspace(3)* @f0.lds + ret void +} + +define amdgpu_kernel void @k_f0() { +; MODULE-LABEL: @k_f0( +; MODULE-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_MODULE_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.module.lds) ] +; MODULE-NEXT: call void @f0() +; MODULE-NEXT: ret void +; +; TABLE-LABEL: @k_f0( +; TABLE-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_KERNEL_K_F0_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.kernel.k_f0.lds) ] +; TABLE-NEXT: call void @f0() +; TABLE-NEXT: ret void +; +; K_OR_HY-LABEL: @k_f0( +; K_OR_HY-NEXT: call void @f0() +; K_OR_HY-NEXT: ret void +; + call void @f0() + ret void +} + +;; As above, but with the kernel also using the variable.
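One detail worth noting before the next case: in the MODULE and TABLE checks above, @k_f0 touches no LDS directly yet still gets a no-op use. A sketch of that pattern, shown for the module struct and inferred from the checks rather than from the pass source:
  ; pins the struct in a kernel whose only LDS accesses happen inside callees,
  ; so the backend still allocates the memory for this kernel
  call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]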
+ +@both.lds = addrspace(3) global i32 undef +define void @f_both() { +; MODULE-LABEL: @f_both( +; MODULE-NEXT: [[LD:%.*]] = load i32, i32 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_MODULE_LDS_T:%.*]], [[LLVM_AMDGCN_MODULE_LDS_T]] addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 4, !alias.scope !4, !noalias !3 +; MODULE-NEXT: [[MUL:%.*]] = mul i32 [[LD]], 4 +; MODULE-NEXT: store i32 [[MUL]], i32 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_MODULE_LDS_T]], [[LLVM_AMDGCN_MODULE_LDS_T]] addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 4, !alias.scope !4, !noalias !3 +; MODULE-NEXT: ret void +; +; TABLE-LABEL: @f_both( +; TABLE-NEXT: [[TMP1:%.*]] = call i32 @llvm.amdgcn.lds.kernel.id() +; TABLE-NEXT: [[BOTH_LDS2:%.*]] = getelementptr inbounds [2 x [2 x i32]], [2 x [2 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 0 +; TABLE-NEXT: [[TMP2:%.*]] = load i32, i32 addrspace(4)* [[BOTH_LDS2]], align 4 +; TABLE-NEXT: [[BOTH_LDS3:%.*]] = inttoptr i32 [[TMP2]] to i32 addrspace(3)* +; TABLE-NEXT: [[LD:%.*]] = load i32, i32 addrspace(3)* [[BOTH_LDS3]], align 4 +; TABLE-NEXT: [[MUL:%.*]] = mul i32 [[LD]], 4 +; TABLE-NEXT: [[BOTH_LDS:%.*]] = getelementptr inbounds [2 x [2 x i32]], [2 x [2 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 0 +; TABLE-NEXT: [[TMP3:%.*]] = load i32, i32 addrspace(4)* [[BOTH_LDS]], align 4 +; TABLE-NEXT: [[BOTH_LDS1:%.*]] = inttoptr i32 [[TMP3]] to i32 addrspace(3)* +; TABLE-NEXT: store i32 [[MUL]], i32 addrspace(3)* [[BOTH_LDS1]], align 4 +; TABLE-NEXT: ret void +; +; K_OR_HY-LABEL: @f_both( +; K_OR_HY-NEXT: [[LD:%.*]] = load i32, i32 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T:%.*]], [[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0_both.lds, i32 0, i32 0), align 4 +; K_OR_HY-NEXT: [[MUL:%.*]] = mul i32 [[LD]], 4 +; K_OR_HY-NEXT: store i32 [[MUL]], i32 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]], [[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0_both.lds, i32 0, i32 0), align 4 +; K_OR_HY-NEXT: ret void +; + %ld = load i32, i32 addrspace(3)* @both.lds + %mul = mul i32 %ld, 4 + store i32 %mul, i32 addrspace(3)* @both.lds + ret void +} + +define amdgpu_kernel void @k0_both() { +; MODULE-LABEL: @k0_both( +; MODULE-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_MODULE_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.module.lds) ] +; MODULE-NEXT: [[LD:%.*]] = load i32, i32 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_MODULE_LDS_T]], [[LLVM_AMDGCN_MODULE_LDS_T]] addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 4, !alias.scope !4, !noalias !0 +; MODULE-NEXT: [[MUL:%.*]] = mul i32 [[LD]], 5 +; MODULE-NEXT: store i32 [[MUL]], i32 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_MODULE_LDS_T]], [[LLVM_AMDGCN_MODULE_LDS_T]] addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 4, !alias.scope !4, !noalias !0 +; MODULE-NEXT: call void @f_both() +; MODULE-NEXT: ret void +; +; TABLE-LABEL: @k0_both( +; TABLE-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.kernel.k0_both.lds) ] +; TABLE-NEXT: [[LD:%.*]] = load i32, i32 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]], [[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0_both.lds, i32 0, i32 0), align 4 +; TABLE-NEXT: [[MUL:%.*]] = mul i32 [[LD]], 5 +; TABLE-NEXT: store i32 [[MUL]], i32 
addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]], [[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0_both.lds, i32 0, i32 0), align 4 +; TABLE-NEXT: call void @f_both() +; TABLE-NEXT: ret void +; +; K_OR_HY-LABEL: @k0_both( +; K_OR_HY-NEXT: [[LD:%.*]] = load i32, i32 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T:%.*]], [[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0_both.lds, i32 0, i32 0), align 4 +; K_OR_HY-NEXT: [[MUL:%.*]] = mul i32 [[LD]], 5 +; K_OR_HY-NEXT: store i32 [[MUL]], i32 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]], [[LLVM_AMDGCN_KERNEL_K0_BOTH_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k0_both.lds, i32 0, i32 0), align 4 +; K_OR_HY-NEXT: call void @f_both() +; K_OR_HY-NEXT: ret void +; + %ld = load i32, i32 addrspace(3)* @both.lds + %mul = mul i32 %ld, 5 + store i32 %mul, i32 addrspace(3)* @both.lds + call void @f_both() + ret void +} diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-used-list.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-used-list.ll index 20c8442..0b233f8 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-module-lds-used-list.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-used-list.ll @@ -1,5 +1,5 @@ -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s ; Check new struct is added to compiler.used and that the replaced variable is removed @@ -28,6 +28,13 @@ @llvm.compiler.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @tolower to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64 addrspace(1)* @ignored to i8 addrspace(1)*) to i8*)], section "llvm.metadata" + +; Functions that are not called are ignored by the lowering +define amdgpu_kernel void @call_func() { + call void @func() + ret void +} + ; CHECK-LABEL: @func() ; CHECK: %dec = atomicrmw fsub float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), float 1.000000e+00 monotonic, align 8 define void @func() { diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-via-table.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-via-table.ll new file mode 100644 index 0000000..6f05d5c --- /dev/null +++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-via-table.ll @@ -0,0 +1,373 @@ +; RUN: opt -S -mtriple=amdgcn--amdhsa -passes=amdgpu-lower-module-lds < %s --amdgpu-lower-module-lds-strategy=table | FileCheck -check-prefix=OPT %s +; RUN: llc -mtriple=amdgcn--amdhsa -verify-machineinstrs < %s --amdgpu-lower-module-lds-strategy=table | FileCheck -check-prefix=GCN %s + +; Opt checks from utils/update_test_checks.py, llc checks from utils/update_llc_test_checks.py, both modified. 
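The indirection under test, condensed from the OPT checks on @f0 below: each kernel that reaches an LDS-using function is tagged with an integer ID, and a function recovers a variable's address at runtime by indexing a constant table of 32-bit offsets with that ID, then converting the offset back to an addrspace(3) pointer. Value names here are illustrative:
  %id = call i32 @llvm.amdgcn.lds.kernel.id()
  %slot = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 %id, i32 0
  %off = load i32, i32 addrspace(4)* %slot, align 4
  %v0.ptr = inttoptr i32 %off to float addrspace(3)*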
+ +; Define four variables and four non-kernel functions which access exactly one variable each +@v0 = addrspace(3) global float undef +@v1 = addrspace(3) global i16 undef, align 16 +@v2 = addrspace(3) global i64 undef +@v3 = addrspace(3) global i8 undef +@unused = addrspace(3) global i16 undef + +; OPT: %llvm.amdgcn.kernel.kernel_no_table.lds.t = type { i64 } +; OPT: %llvm.amdgcn.kernel.k01.lds.t = type { i16, [2 x i8], float } +; OPT: %llvm.amdgcn.kernel.k23.lds.t = type { i64, i8 } +; OPT: %llvm.amdgcn.kernel.k123.lds.t = type { i16, i8, [5 x i8], i64 } + +; OPT: @llvm.amdgcn.kernel.kernel_no_table.lds = internal addrspace(3) global %llvm.amdgcn.kernel.kernel_no_table.lds.t undef, align 8 +; OPT: @llvm.amdgcn.kernel.k01.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k01.lds.t undef, align 16 +; OPT: @llvm.amdgcn.kernel.k23.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k23.lds.t undef, align 8 +; OPT: @llvm.amdgcn.kernel.k123.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k123.lds.t undef, align 16 + +; Salient parts of the IR lookup table check: +; It has (top level) size 3 as there are 3 kernels that call functions which use lds +; The next level down has type [4 x i32] as there are 4 variables accessed by functions which use lds +; The kernel naming pattern and the structs being named after the functions help verify placement of poison +; The remainder are constant expressions into the variable instances checked above + +; OPT{LITERAL}: @llvm.amdgcn.lds.offset.table = internal addrspace(4) constant [3 x [4 x i32]] [[4 x i32] [i32 ptrtoint (float addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k01.lds.t, %llvm.amdgcn.kernel.k01.lds.t addrspace(3)* @llvm.amdgcn.kernel.k01.lds, i32 0, i32 2) to i32), i32 ptrtoint (%llvm.amdgcn.kernel.k01.lds.t addrspace(3)* @llvm.amdgcn.kernel.k01.lds to i32), i32 poison, i32 poison], [4 x i32] [i32 poison, i32 ptrtoint (%llvm.amdgcn.kernel.k123.lds.t addrspace(3)* @llvm.amdgcn.kernel.k123.lds to i32), i32 ptrtoint (i64 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k123.lds.t, %llvm.amdgcn.kernel.k123.lds.t addrspace(3)* @llvm.amdgcn.kernel.k123.lds, i32 0, i32 3) to i32), i32 ptrtoint (i8 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k123.lds.t, %llvm.amdgcn.kernel.k123.lds.t addrspace(3)* @llvm.amdgcn.kernel.k123.lds, i32 0, i32 1) to i32)], [4 x i32] [i32 poison, i32 poison, i32 ptrtoint (%llvm.amdgcn.kernel.k23.lds.t addrspace(3)* @llvm.amdgcn.kernel.k23.lds to i32), i32 ptrtoint (i8 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k23.lds.t, %llvm.amdgcn.kernel.k23.lds.t addrspace(3)* @llvm.amdgcn.kernel.k23.lds, i32 0, i32 1) to i32)]] + +define void @f0() { +; OPT-LABEL: @f0( +; OPT-NEXT: [[TMP1:%.*]] = call i32 @llvm.amdgcn.lds.kernel.id() +; OPT-NEXT: [[V02:%.*]] = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 0 +; OPT-NEXT: [[TMP2:%.*]] = load i32, i32 addrspace(4)* [[V02]], align 4 +; OPT-NEXT: [[V03:%.*]] = inttoptr i32 [[TMP2]] to float addrspace(3)* +; OPT-NEXT: [[LD:%.*]] = load float, float addrspace(3)* [[V03]], align 4 +; OPT-NEXT: [[MUL:%.*]] = fmul float [[LD]], 2.000000e+00 +; OPT-NEXT: [[V0:%.*]] = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 0 +; OPT-NEXT: [[TMP3:%.*]] = load i32, i32 addrspace(4)* [[V0]], align 4 +; OPT-NEXT: [[V01:%.*]] = inttoptr i32 [[TMP3]] to float addrspace(3)* +; OPT-NEXT: store
float [[MUL]], float addrspace(3)* [[V01]], align 4 +; OPT-NEXT: ret void +; +; GCN-LABEL: f0: +; GCN: ; %bb.0: +; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GCN-NEXT: s_mov_b32 s4, s15 +; GCN-NEXT: s_ashr_i32 s5, s15, 31 +; GCN-NEXT: s_getpc_b64 s[6:7] +; GCN-NEXT: s_add_u32 s6, s6, llvm.amdgcn.lds.offset.table@rel32@lo+4 +; GCN-NEXT: s_addc_u32 s7, s7, llvm.amdgcn.lds.offset.table@rel32@hi+12 +; GCN-NEXT: s_lshl_b64 s[4:5], s[4:5], 4 +; GCN-NEXT: s_add_u32 s4, s4, s6 +; GCN-NEXT: s_addc_u32 s5, s5, s7 +; GCN-NEXT: s_load_dword s4, s[4:5], 0x0 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_mov_b32_e32 v0, s4 +; GCN-NEXT: s_mov_b32 m0, -1 +; GCN-NEXT: ds_read_b32 v1, v0 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_add_f32_e32 v1, v1, v1 +; GCN-NEXT: ds_write_b32 v0, v1 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: s_setpc_b64 s[30:31] + %ld = load float, float addrspace(3)* @v0 + %mul = fmul float %ld, 2. + store float %mul, float addrspace(3)* @v0 + ret void +} + +define void @f1() { +; OPT-LABEL: @f1( +; OPT-NEXT: [[TMP1:%.*]] = call i32 @llvm.amdgcn.lds.kernel.id() +; OPT-NEXT: [[V12:%.*]] = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 1 +; OPT-NEXT: [[TMP2:%.*]] = load i32, i32 addrspace(4)* [[V12]], align 4 +; OPT-NEXT: [[V13:%.*]] = inttoptr i32 [[TMP2]] to i16 addrspace(3)* +; OPT-NEXT: [[LD:%.*]] = load i16, i16 addrspace(3)* [[V13]], align 2 +; OPT-NEXT: [[MUL:%.*]] = mul i16 [[LD]], 3 +; OPT-NEXT: [[V1:%.*]] = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 1 +; OPT-NEXT: [[TMP3:%.*]] = load i32, i32 addrspace(4)* [[V1]], align 4 +; OPT-NEXT: [[V11:%.*]] = inttoptr i32 [[TMP3]] to i16 addrspace(3)* +; OPT-NEXT: store i16 [[MUL]], i16 addrspace(3)* [[V11]], align 2 +; OPT-NEXT: ret void +; +; GCN-LABEL: f1: +; GCN: ; %bb.0: +; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GCN-NEXT: s_mov_b32 s4, s15 +; GCN-NEXT: s_ashr_i32 s5, s15, 31 +; GCN-NEXT: s_getpc_b64 s[6:7] +; GCN-NEXT: s_add_u32 s6, s6, llvm.amdgcn.lds.offset.table@rel32@lo+8 +; GCN-NEXT: s_addc_u32 s7, s7, llvm.amdgcn.lds.offset.table@rel32@hi+16 +; GCN-NEXT: s_lshl_b64 s[4:5], s[4:5], 4 +; GCN-NEXT: s_add_u32 s4, s4, s6 +; GCN-NEXT: s_addc_u32 s5, s5, s7 +; GCN-NEXT: s_load_dword s4, s[4:5], 0x0 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_mov_b32_e32 v0, s4 +; GCN-NEXT: s_mov_b32 m0, -1 +; GCN-NEXT: ds_read_u16 v1, v0 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_mul_lo_u32 v1, v1, 3 +; GCN-NEXT: ds_write_b16 v0, v1 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: s_setpc_b64 s[30:31] + %ld = load i16, i16 addrspace(3)* @v1 + %mul = mul i16 %ld, 3 + store i16 %mul, i16 addrspace(3)* @v1 + ret void +} + +define void @f2() { +; OPT-LABEL: @f2( +; OPT-NEXT: [[TMP1:%.*]] = call i32 @llvm.amdgcn.lds.kernel.id() +; OPT-NEXT: [[V22:%.*]] = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 2 +; OPT-NEXT: [[TMP2:%.*]] = load i32, i32 addrspace(4)* [[V22]], align 4 +; OPT-NEXT: [[V23:%.*]] = inttoptr i32 [[TMP2]] to i64 addrspace(3)* +; OPT-NEXT: [[LD:%.*]] = load i64, i64 addrspace(3)* [[V23]], align 4 +; OPT-NEXT: [[MUL:%.*]] = mul i64 [[LD]], 4 +; OPT-NEXT: [[V2:%.*]] = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 2 +; OPT-NEXT: [[TMP3:%.*]] = load i32, i32 addrspace(4)* [[V2]], align 4 
+; OPT-NEXT: [[V21:%.*]] = inttoptr i32 [[TMP3]] to i64 addrspace(3)* +; OPT-NEXT: store i64 [[MUL]], i64 addrspace(3)* [[V21]], align 4 +; OPT-NEXT: ret void +; +; GCN-LABEL: f2: +; GCN: ; %bb.0: +; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GCN-NEXT: s_mov_b32 s4, s15 +; GCN-NEXT: s_ashr_i32 s5, s15, 31 +; GCN-NEXT: s_getpc_b64 s[6:7] +; GCN-NEXT: s_add_u32 s6, s6, llvm.amdgcn.lds.offset.table@rel32@lo+12 +; GCN-NEXT: s_addc_u32 s7, s7, llvm.amdgcn.lds.offset.table@rel32@hi+20 +; GCN-NEXT: s_lshl_b64 s[4:5], s[4:5], 4 +; GCN-NEXT: s_add_u32 s4, s4, s6 +; GCN-NEXT: s_addc_u32 s5, s5, s7 +; GCN-NEXT: s_load_dword s4, s[4:5], 0x0 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_mov_b32_e32 v2, s4 +; GCN-NEXT: s_mov_b32 m0, -1 +; GCN-NEXT: ds_read_b64 v[0:1], v2 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_lshl_b64 v[0:1], v[0:1], 2 +; GCN-NEXT: ds_write_b64 v2, v[0:1] +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: s_setpc_b64 s[30:31] + %ld = load i64, i64 addrspace(3)* @v2 + %mul = mul i64 %ld, 4 + store i64 %mul, i64 addrspace(3)* @v2 + ret void +} + +define void @f3() { +; OPT-LABEL: @f3( +; OPT-NEXT: [[TMP1:%.*]] = call i32 @llvm.amdgcn.lds.kernel.id() +; OPT-NEXT: [[V32:%.*]] = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 3 +; OPT-NEXT: [[TMP2:%.*]] = load i32, i32 addrspace(4)* [[V32]], align 4 +; OPT-NEXT: [[V33:%.*]] = inttoptr i32 [[TMP2]] to i8 addrspace(3)* +; OPT-NEXT: [[LD:%.*]] = load i8, i8 addrspace(3)* [[V33]], align 1 +; OPT-NEXT: [[MUL:%.*]] = mul i8 [[LD]], 5 +; OPT-NEXT: [[V3:%.*]] = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]] addrspace(4)* @llvm.amdgcn.lds.offset.table, i32 0, i32 [[TMP1]], i32 3 +; OPT-NEXT: [[TMP3:%.*]] = load i32, i32 addrspace(4)* [[V3]], align 4 +; OPT-NEXT: [[V31:%.*]] = inttoptr i32 [[TMP3]] to i8 addrspace(3)* +; OPT-NEXT: store i8 [[MUL]], i8 addrspace(3)* [[V31]], align 1 +; OPT-NEXT: ret void +; +; GCN-LABEL: f3: +; GCN: ; %bb.0: +; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GCN-NEXT: s_mov_b32 s4, s15 +; GCN-NEXT: s_ashr_i32 s5, s15, 31 +; GCN-NEXT: s_getpc_b64 s[6:7] +; GCN-NEXT: s_add_u32 s6, s6, llvm.amdgcn.lds.offset.table@rel32@lo+16 +; GCN-NEXT: s_addc_u32 s7, s7, llvm.amdgcn.lds.offset.table@rel32@hi+24 +; GCN-NEXT: s_lshl_b64 s[4:5], s[4:5], 4 +; GCN-NEXT: s_add_u32 s4, s4, s6 +; GCN-NEXT: s_addc_u32 s5, s5, s7 +; GCN-NEXT: s_load_dword s4, s[4:5], 0x0 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_mov_b32_e32 v0, s4 +; GCN-NEXT: s_mov_b32 m0, -1 +; GCN-NEXT: ds_read_u8 v1, v0 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_mul_lo_u32 v1, v1, 5 +; GCN-NEXT: ds_write_b8 v0, v1 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: s_setpc_b64 s[30:31] + %ld = load i8, i8 addrspace(3)* @v3 + %mul = mul i8 %ld, 5 + store i8 %mul, i8 addrspace(3)* @v3 + ret void +} + +; Doesn't access any via a function, won't be in the lookup table +define amdgpu_kernel void @kernel_no_table() { +; OPT-LABEL: @kernel_no_table() { +; OPT-NEXT: [[LD:%.*]] = load i64, i64 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_KERNEL_NO_TABLE_LDS_T:%.*]], [[LLVM_AMDGCN_KERNEL_KERNEL_NO_TABLE_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.kernel_no_table.lds, i32 0, i32 0), align 8 +; OPT-NEXT: [[MUL:%.*]] = mul i64 [[LD]], 8 +; OPT-NEXT: store i64 [[MUL]], i64 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_KERNEL_NO_TABLE_LDS_T]], [[LLVM_AMDGCN_KERNEL_KERNEL_NO_TABLE_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.kernel_no_table.lds, i32 0, 
i32 0), align 8 +; OPT-NEXT: ret void +; +; GCN-LABEL: kernel_no_table: +; GCN: ; %bb.0: +; GCN-NEXT: v_mov_b32_e32 v2, 0 +; GCN-NEXT: s_mov_b32 m0, -1 +; GCN-NEXT: ds_read_b64 v[0:1], v2 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_lshl_b64 v[0:1], v[0:1], 3 +; GCN-NEXT: ds_write_b64 v2, v[0:1] +; GCN-NEXT: s_endpgm + %ld = load i64, i64 addrspace(3)* @v2 + %mul = mul i64 %ld, 8 + store i64 %mul, i64 addrspace(3)* @v2 + ret void +} + +; Access two variables, will allocate those two +define amdgpu_kernel void @k01() { +; OPT-LABEL: @k01() !llvm.amdgcn.lds.kernel.id !0 { +; OPT-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_KERNEL_K01_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.kernel.k01.lds) ] +; OPT-NEXT: call void @f0() +; OPT-NEXT: call void @f1() +; OPT-NEXT: ret void +; +; GCN-LABEL: k01: +; GCN: ; %bb.0: +; GCN-NEXT: s_mov_b32 s32, 0 +; GCN-NEXT: s_mov_b32 flat_scratch_lo, s7 +; GCN-NEXT: s_add_i32 s6, s6, s9 +; GCN-NEXT: s_lshr_b32 flat_scratch_hi, s6, 8 +; GCN-NEXT: s_add_u32 s0, s0, s9 +; GCN-NEXT: s_addc_u32 s1, s1, 0 +; GCN-NEXT: s_mov_b64 s[8:9], s[4:5] +; GCN-NEXT: s_getpc_b64 s[4:5] +; GCN-NEXT: s_add_u32 s4, s4, f0@gotpcrel32@lo+4 +; GCN-NEXT: s_addc_u32 s5, s5, f0@gotpcrel32@hi+12 +; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0 +; GCN-NEXT: s_mov_b32 s15, 0 +; GCN-NEXT: s_mov_b64 s[6:7], s[8:9] +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5] +; GCN-NEXT: s_getpc_b64 s[4:5] +; GCN-NEXT: s_add_u32 s4, s4, f1@gotpcrel32@lo+4 +; GCN-NEXT: s_addc_u32 s5, s5, f1@gotpcrel32@hi+12 +; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0 +; GCN-NEXT: s_mov_b32 s15, 0 +; GCN-NEXT: s_mov_b64 s[6:7], s[8:9] +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5] +; GCN-NEXT: s_endpgm +; GCN: .amdhsa_group_segment_fixed_size 8 + call void @f0() + call void @f1() + ret void +} + +define amdgpu_kernel void @k23() { +; OPT-LABEL: @k23() !llvm.amdgcn.lds.kernel.id !1 { +; OPT-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_KERNEL_K23_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.kernel.k23.lds) ] +; OPT-NEXT: call void @f2() +; OPT-NEXT: call void @f3() +; OPT-NEXT: ret void +; +; GCN-LABEL: k23: +; GCN: ; %bb.0: +; GCN-NEXT: s_mov_b32 s32, 0 +; GCN-NEXT: s_mov_b32 flat_scratch_lo, s7 +; GCN-NEXT: s_add_i32 s6, s6, s9 +; GCN-NEXT: s_lshr_b32 flat_scratch_hi, s6, 8 +; GCN-NEXT: s_add_u32 s0, s0, s9 +; GCN-NEXT: s_addc_u32 s1, s1, 0 +; GCN-NEXT: s_mov_b64 s[8:9], s[4:5] +; GCN-NEXT: s_getpc_b64 s[4:5] +; GCN-NEXT: s_add_u32 s4, s4, f2@gotpcrel32@lo+4 +; GCN-NEXT: s_addc_u32 s5, s5, f2@gotpcrel32@hi+12 +; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0 +; GCN-NEXT: s_mov_b32 s15, 2 +; GCN-NEXT: s_mov_b64 s[6:7], s[8:9] +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5] +; GCN-NEXT: s_getpc_b64 s[4:5] +; GCN-NEXT: s_add_u32 s4, s4, f3@gotpcrel32@lo+4 +; GCN-NEXT: s_addc_u32 s5, s5, f3@gotpcrel32@hi+12 +; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0 +; GCN-NEXT: s_mov_b32 s15, 2 +; GCN-NEXT: s_mov_b64 s[6:7], s[8:9] +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5] +; GCN-NEXT: s_endpgm +; GCN: .amdhsa_group_segment_fixed_size 16 + call void @f2() + call void @f3() + ret void +} + +; Access and allocate three variables +define amdgpu_kernel void @k123() { +; OPT-LABEL: @k123() !llvm.amdgcn.lds.kernel.id !2 { +; OPT-NEXT: call void @llvm.donothing() [ "ExplicitUse"([[LLVM_AMDGCN_KERNEL_K123_LDS_T:%.*]] addrspace(3)* @llvm.amdgcn.kernel.k123.lds) ] +; OPT-NEXT: call void @f1() +; 
OPT-NEXT: [[LD:%.*]] = load i8, i8 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K123_LDS_T]], [[LLVM_AMDGCN_KERNEL_K123_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k123.lds, i32 0, i32 1), align 2, !alias.scope !3, !noalias !6 +; OPT-NEXT: [[MUL:%.*]] = mul i8 [[LD]], 8 +; OPT-NEXT: store i8 [[MUL]], i8 addrspace(3)* getelementptr inbounds ([[LLVM_AMDGCN_KERNEL_K123_LDS_T]], [[LLVM_AMDGCN_KERNEL_K123_LDS_T]] addrspace(3)* @llvm.amdgcn.kernel.k123.lds, i32 0, i32 1), align 2, !alias.scope !3, !noalias !6 +; OPT-NEXT: call void @f2() +; OPT-NEXT: ret void +; +; GCN-LABEL: k123: +; GCN: ; %bb.0: +; GCN-NEXT: s_mov_b32 s32, 0 +; GCN-NEXT: s_mov_b32 flat_scratch_lo, s7 +; GCN-NEXT: s_add_i32 s6, s6, s9 +; GCN-NEXT: s_lshr_b32 flat_scratch_hi, s6, 8 +; GCN-NEXT: s_add_u32 s0, s0, s9 +; GCN-NEXT: s_addc_u32 s1, s1, 0 +; GCN-NEXT: s_mov_b64 s[8:9], s[4:5] +; GCN-NEXT: s_getpc_b64 s[4:5] +; GCN-NEXT: s_add_u32 s4, s4, f1@gotpcrel32@lo+4 +; GCN-NEXT: s_addc_u32 s5, s5, f1@gotpcrel32@hi+12 +; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0 +; GCN-NEXT: s_mov_b32 s15, 1 +; GCN-NEXT: s_mov_b64 s[6:7], s[8:9] +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5] +; GCN-NEXT: v_mov_b32_e32 v0, 0 +; GCN-NEXT: s_mov_b32 m0, -1 +; GCN-NEXT: ds_read_u8 v1, v0 offset:2 +; GCN-NEXT: s_getpc_b64 s[4:5] +; GCN-NEXT: s_add_u32 s4, s4, f2@gotpcrel32@lo+4 +; GCN-NEXT: s_addc_u32 s5, s5, f2@gotpcrel32@hi+12 +; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0 +; GCN-NEXT: s_waitcnt lgkmcnt(0) +; GCN-NEXT: v_lshlrev_b32_e32 v1, 3, v1 +; GCN-NEXT: ds_write_b8 v0, v1 offset:2 +; GCN-NEXT: s_mov_b32 s15, 1 +; GCN-NEXT: s_mov_b64 s[6:7], s[8:9] +; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5] +; GCN-NEXT: s_endpgm +; GCN: .amdhsa_group_segment_fixed_size 16 + call void @f1() + %ld = load i8, i8 addrspace(3)* @v3 + %mul = mul i8 %ld, 8 + store i8 %mul, i8 addrspace(3)* @v3 + call void @f2() + ret void +} + + +; OPT: declare i32 @llvm.amdgcn.lds.kernel.id() + +!0 = !{i32 0} +!1 = !{i32 2} +!2 = !{i32 1} + + +; Table size is number-kernels * number-variables * sizeof(uint32_t) +; GCN: .type llvm.amdgcn.lds.offset.table,@object +; GCN-NEXT: .section .data.rel.ro,#alloc,#write +; GCN-NEXT: .p2align 4, 0x0 +; GCN-NEXT: llvm.amdgcn.lds.offset.table: +; GCN-NEXT: .long 0+4 +; GCN-NEXT: .long 0 +; GCN-NEXT: .zero 4 +; GCN-NEXT: .zero 4 +; GCN-NEXT: .zero 4 +; GCN-NEXT: .long 0 +; GCN-NEXT: .long 0+8 +; GCN-NEXT: .long 0+2 +; GCN-NEXT: .zero 4 +; GCN-NEXT: .zero 4 +; GCN-NEXT: .long 0 +; GCN-NEXT: .long 0+8 +; GCN-NEXT: .size llvm.amdgcn.lds.offset.table, 48 diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds.ll index 8609ec3..750545a 100644 --- a/llvm/test/CodeGen/AMDGPU/lower-module-lds.ll +++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds.ll @@ -1,18 +1,20 @@ -; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s -; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s +; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s ; Padding to meet alignment, so references to @var1 are replaced with gep ptr, 0, 2 ; No i64 field, as addrspace(3) types with initializers are ignored. Likewise no addrspace(4).
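For instance, matching the struct layout checked below, a reference to @var1 lands past the float and the 4 padding bytes (a sketch; %p is an illustrative name):
  %p = getelementptr inbounds %llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 2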
; CHECK: %llvm.amdgcn.module.lds.t = type { float, [4 x i8], i32 } -; Variables removed by pass +; Variable removed by pass ; CHECK-NOT: @var0 -; CHECK-NOT: @var1 @var0 = addrspace(3) global float undef, align 8 @var1 = addrspace(3) global i32 undef, align 8 -@ptr = addrspace(1) global i32 addrspace(3)* @var1, align 4 +; The invalid use by the global is left unchanged +; CHECK: @var1 = addrspace(3) global i32 undef, align 8 +; CHECK: @ptr = addrspace(1) global i32 addrspace(3)* @var1, align 4 +@ptr = addrspace(1) global i32 addrspace(3)* @var1, align 4 ; A variable that is unchanged by pass ; CHECK: @with_init = addrspace(3) global i64 0 diff --git a/llvm/test/CodeGen/AMDGPU/module-lds-false-sharing.ll b/llvm/test/CodeGen/AMDGPU/module-lds-false-sharing.ll index 28facb3..da7b926 100644 --- a/llvm/test/CodeGen/AMDGPU/module-lds-false-sharing.ll +++ b/llvm/test/CodeGen/AMDGPU/module-lds-false-sharing.ll @@ -1,8 +1,8 @@ ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py -; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=CHECK,GFX9 %s -; RUN: llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=CHECK,GFX10 %s -; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=CHECK,G_GFX9 %s -; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=CHECK,G_GFX10 %s +; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s --amdgpu-lower-module-lds-strategy=module | FileCheck -enable-var-scope -check-prefixes=CHECK,GFX9 %s +; RUN: llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s --amdgpu-lower-module-lds-strategy=module | FileCheck -enable-var-scope -check-prefixes=CHECK,GFX10 %s +; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s --amdgpu-lower-module-lds-strategy=module | FileCheck -enable-var-scope -check-prefixes=CHECK,G_GFX9 %s +; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s --amdgpu-lower-module-lds-strategy=module | FileCheck -enable-var-scope -check-prefixes=CHECK,G_GFX10 %s ; Test case looks at the allocated offset of @used_by_both. It's at zero when ; allocated by itself, but at 8 when allocated in combination with the double. @@ -121,41 +121,19 @@ define amdgpu_kernel void @withcall() { } ; CHECK: ; LDSByteSize: 16 bytes -; Kernel only needs to allocate the i32 it uses, but because that i32 was -; also used by a non-kernel function it was block allocated along with -; the double used by the non-kernel function, this kernel allocates 16 bytes -; and the accesses to the integer are at offset 8 +; Previous lowering was less efficient here than necessary as the i32 used +; by the kernel is also used by an unrelated non-kernel function. Codegen +; is now the same as nocall_ideal. 
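In IR terms, the kernel now gets a struct holding only the i32 it uses, so the store lands at offset 0 and LDSByteSize drops to 4. A sketch, assuming the per-kernel struct naming pattern used by the other tests in this patch:
  %llvm.amdgcn.kernel.nocall_false_sharing.lds.t = type { i32 }
  @llvm.amdgcn.kernel.nocall_false_sharing.lds = internal addrspace(3) global %llvm.amdgcn.kernel.nocall_false_sharing.lds.t undef, align 4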
define amdgpu_kernel void @nocall_false_sharing() { -; GFX9-LABEL: nocall_false_sharing: -; GFX9: ; %bb.0: -; GFX9-NEXT: v_mov_b32_e32 v0, 0 -; GFX9-NEXT: ds_write_b32 v0, v0 offset:8 -; GFX9-NEXT: s_endpgm -; -; GFX10-LABEL: nocall_false_sharing: -; GFX10: ; %bb.0: -; GFX10-NEXT: v_mov_b32_e32 v0, 0 -; GFX10-NEXT: ds_write_b32 v0, v0 offset:8 -; GFX10-NEXT: s_endpgm -; -; G_GFX9-LABEL: nocall_false_sharing: -; G_GFX9: ; %bb.0: -; G_GFX9-NEXT: v_mov_b32_e32 v0, 0 -; G_GFX9-NEXT: v_mov_b32_e32 v1, 8 -; G_GFX9-NEXT: ds_write_b32 v1, v0 -; G_GFX9-NEXT: s_endpgm -; -; G_GFX10-LABEL: nocall_false_sharing: -; G_GFX10: ; %bb.0: -; G_GFX10-NEXT: v_mov_b32_e32 v0, 0 -; G_GFX10-NEXT: v_mov_b32_e32 v1, 8 -; G_GFX10-NEXT: ds_write_b32 v1, v0 -; G_GFX10-NEXT: s_endpgm +; CHECK-LABEL: nocall_false_sharing: +; CHECK: ; %bb.0: +; CHECK-NEXT: v_mov_b32_e32 v0, 0 +; CHECK-NEXT: ds_write_b32 v0, v0 +; CHECK-NEXT: s_endpgm store i32 0, i32 addrspace(3)* @used_by_both ret void } -; CHECK: ; LDSByteSize: 16 bytes - +; CHECK: ; LDSByteSize: 4 bytes define void @nonkernel() { diff --git a/llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll b/llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll index 1edd520..24c27df 100644 --- a/llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll +++ b/llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll @@ -1,6 +1,6 @@ ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py -; RUN: opt -march=amdgcn -mcpu=gfx900 -amdgpu-aa -amdgpu-aa-wrapper -amdgpu-annotate-uniform -S < %s | FileCheck %s -; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s +; RUN: opt -march=amdgcn -mcpu=gfx900 -amdgpu-aa -amdgpu-aa-wrapper -amdgpu-annotate-uniform -S --amdgpu-lower-module-lds-strategy=module < %s | FileCheck %s +; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs --amdgpu-lower-module-lds-strategy=module < %s | FileCheck -check-prefix=GCN %s ; Check that barrier or fence in between of loads is not considered a clobber ; for the purpose of converting vector loads into scalar. diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll index 1214268..31caaba 100644 --- a/llvm/test/CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll +++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll @@ -1,5 +1,5 @@ -; RUN: llc -march=amdgcn -mattr=+promote-alloca,+max-private-element-size-4 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s -; RUN: llc -march=amdgcn -mattr=-promote-alloca,+max-private-element-size-4 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s +; RUN: llc -march=amdgcn -mattr=+promote-alloca,+max-private-element-size-4 -verify-machineinstrs --amdgpu-lower-module-lds-strategy=module < %s | FileCheck -check-prefix=GCN %s +; RUN: llc -march=amdgcn -mattr=-promote-alloca,+max-private-element-size-4 -verify-machineinstrs --amdgpu-lower-module-lds-strategy=module < %s | FileCheck -check-prefix=GCN %s ; Pointer value is stored in a candidate for LDS usage.